The Role of Librarians in Wikidata and WikiCite

 

The other week I participated in WikiCite 2017, a conference, summit, and hackathon organized for members of the Wikimedia community to discuss ideas and projects around adding structured bibliographic metadata to Wikidata to improve the quality of references across the Wikimedia universe. As a Wikidata editor and a librarian, I was pumped to be included in the functional and organizational conversations for WikiCite and to learn more about how librarians and GLAMs can contribute.

The Basics (briefly and criminally simplified)

Galleries, Libraries, Archives, and Museums are institutions that collect, preserve, and make available information artifacts and cultural heritage items for use by the public. Before databases, librarians managed card catalogs to facilitate access; these were translated into digital records in the MAchine-Readable Cataloging (MARC) format to create online catalogs (ca. 1970s-2000s). As items in collections are digitized, librarians et al. add descriptive, administrative, and technical/structural metadata to records and provide access to digital surrogates via a digital library or repository, depending on copyright. Metadata, however, is generally not subject to copyright and is often published by GLAMs for analysis and use via direct download, APIs, and, in more and more cases, as Linked Open Data. As a field, we’re still at the beginning of this transformation to Linked Open Data and have significant questions still to answer and thorny issues still to resolve.

Diagram of a Wikidata item (https://www.wikidata.org/wiki/Wikidata:Introduction)

Wikidata is a source of machine-readable, multilingual, structured data collected to support Wikimedia projects, licensed CC0 under the theory that simple statements of fact are not subject to copyright. Wikidata items are composed of statements that pair properties with values. In the Linked Open Data world these items are graphs, with statements expressed as triples. As Wikimedians and Wikidata editors added more of this supporting structured data to Wikipedia, the idea of adding bibliographic metadata to Wikidata started coming up. Essentially – “Here are some great structured data that are incredibly important to the functionality of Wikipedia; how can we add them to this repository that we’re creating in a usable way?” As many librarians (and really anyone who’s written a substantial research paper) are aware, citations are complicated.
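
To make items, statements, and triples concrete, here’s a minimal Python sketch (assuming the requests library and network access) that pulls an item’s statements from the Wikidata API; Q42, Douglas Adams, is just a convenient test item:

import requests

# Fetch the claims (statements) for a single Wikidata item.
API = "https://www.wikidata.org/w/api.php"
params = {
    "action": "wbgetentities",
    "ids": "Q42",        # Douglas Adams, a well-known test item
    "props": "claims",
    "format": "json",
}
claims = requests.get(API, params=params).json()["entities"]["Q42"]["claims"]

# Each statement pairs a property with a value; P31 is "instance of".
for statement in claims.get("P31", []):
    value = statement["mainsnak"]["datavalue"]["value"]
    print("Q42", "P31", value["id"])  # a (subject, predicate, object) triple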

Theory

The success of Wikipedia is built on the community’s requirement of verifiability. For Wikipedia this means that added material must be clearly supported by a reliable, previously published source. Currently, references are generated using a system of templates that are powerful but unsophisticated, and that bury citation data in the bodies of articles.

Wikipedia built the largest curated bibliography in the world and it is largely not discoverable, very difficult to maintain, very hard to analyze, and unusable. The first WikiCite meeting in 2016 set goals and laid foundations for figuring out how to make citations meaningful in Wikimedia. There have been several open source projects created in the last year within this community looking to understand what can be done with citation data. Wikidata is driving the machine-readable, queryable aspect of data within the Wikimedia community and as more people understand what it is and what it’s capable of, more ideas pop up.

Sources are the foundation of Wikipedia’s claim to authority, which depends on the ability of Wikipedia to deliver source information, a.k.a. metadata, a.k.a. citations. Heather Ford’s talk “The Social Life of (Wikipedia) Sources” on Day 1 discussed the role of citations in Wikipedia and the paradigm of digital references for knowledge representation, in order to familiarize WikiCite-ers with the theoretical foundations of information evaluation. Ford described her attempts to answer the question “What is a reliable source?” in a meaningful way by investigating Wikimedia policy and community recommendations and analyzing citations in Wikipedia to identify patterns. Because information is a mediated representation of knowledge (not knowledge itself), the selection and reference of sources is inherently biased. Therefore, it is the responsibility of designers and data modelers within the WikiCite and Wikidata community to “map the patterns of inevitable systemic bias and constantly calibrate the system.”

Librarians can offer some good guidelines on source selection and citation in the Wikimedia community. ACRL has a wonderful information literacy framework that identifies core concepts for understanding disseminated knowledge. FRBR (Functional Requirements for Bibliographic Records) was designed by librarians to create an entity-relationship model for databases (catalogs) that reflects the conceptual structure of information resources. Another Day 1 talk, Andrea Zanni’s “Wikidata and Booooks”, opened with the slide “books are complex”. Indeed! Last year’s WikiCite 2016 included an attempt to build a data model for books on the presumption that a book is a knowable, simple item for a library or repository. Zanni described the group’s reliance on FRBR as a model through which to understand the structure and biases of a book, and how or whether to represent that in Wikidata. It is far more complex than simply determining a set of properties to describe a book, because the metadata in a flat data model is insufficient to accurately represent the relationship between the item and the work. One way to manage this is to create multiple entities (Wikidata items) and place them within a hierarchy or graph of the work, as sketched in the code below. But there are hundreds of millions of works – is it reasonable to expect WikiCite to double or quadruple the number of items per work? Alternatively, a 1:1 ratio inevitably omits a lot of significant information.

“Wikidata and Booooks” (http://babele.io/slides/wikidatabooks/#/)
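
To make that hierarchy concrete, here’s a hedged Python sketch of the multiple-items approach. The Q-ids for the work and the edition are placeholders, and P629 (“edition or translation of”) is the property conventionally used for the link – check current Wikidata modeling conventions before relying on the details:

# One item for the abstract work, one per edition, linked by a statement.
work = {
    "qid": "Q_WORK",                 # placeholder item for the work
    "label": "On the Origin of Species",
    "claims": {"P31": "Q47461344"},  # instance of: written work
}

edition_1859 = {
    "qid": "Q_EDITION",              # placeholder item for one edition
    "label": "On the Origin of Species (1st ed., 1859)",
    "claims": {
        "P31": "Q3331189",           # instance of: version, edition or translation
        "P629": work["qid"],         # edition or translation of: the work above
        "P577": "1859",              # publication date
    },
}

# Citations can point at the edition, while queries aggregate across
# editions via the shared work item.
print(edition_1859["claims"]["P629"] == work["qid"])  # True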

As a librarian I could spend several more pages writing about the theory of references, citations, bibliographies, and the representation of knowledge, but I thought I would dive right in with some metadata from the Biodiversity Heritage Library (BHL) to see what is currently possible for your average Wikidata editor.

Practice

Creating a central repository for structured, separable, and open bibliographic metadata is essential for connecting the sum of all human knowledge. In this, librarians can be useful allies. WikiCite draws from the academic community in which citation data (generally from peer-reviewed articles) is crucial for creating and linking knowledge. This graph from Joe Wass’s slide deck for his talk “Crossref Event Data: Transparency First” shows Wikipedia as a significant entry point to scholarly literature.

 


Wikipedia is the light blue bar that runs through the middle third of the graph. https://commons.wikimedia.org/w/index.php?title=File:Crossref_Event_Data_Transparency_First.pdf&page=12

Indeed, the focus thus far on adding journal articles to Wikidata has been driven by the scientific community and has involved publishers more than libraries. GLAMs, however, have *tons* of bibliographic and collection metadata for heritage literature, historical texts, special collections, and institutional archives. These data can fill gaps in citation graphs, create context for other items, and add valuable structured historical content to WikiCite and Wikidata.

In order to see exactly where we are with WikiCite, I pulled some bib records from BHL and added them to Wikidata. I created several items manually and then tried the Source MetaData tool to import via DOI. I also downloaded the BHL database tables as .tsv files and hoped to add those to Wikidata, but unfortunately I couldn’t find a tool or method to support this. I’m also not so sure how well the tables’ structure will map onto Wikidata. The big positive takeaway is that there is a “BHL Page ID” property in Wikidata that I can add a statement for. So in addition to adding an item, I can link that item directly to BHL’s book viewer! This seems like a great way to open up collections!
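
For anyone who wants to poke at those tables themselves, here’s a minimal Python sketch for reading one of the .tsv exports; the file name and column names are illustrative, so check the header row of the real export before relying on them:

import csv

# Read a BHL database-export table (tab-separated) into dicts.
# "title.txt", "TitleID", and "FullTitle" are illustrative names.
with open("title.txt", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row.get("TitleID"), row.get("FullTitle"))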


On the other hand, adding bib records to Wikidata is not without work. All of the author name resolution has to be done manually. When creating a new item by hand, I had to research Wikipedia and Wikidata to locate the correct forms of author names to list in the P50 – Author (Wikidata item) property. Source MetaData adds authors as values for P2093 – Author String, not P50. Consider how “Alexander Agassiz” is the name of the Wikipedia article about the scientist, “Alexander Emanuel Agassiz” is the label for the Wikidata item, and BHL reconciles the author name as “Agassiz, Alexander, 1835-1910” from the Library of Congress Name Authority File. Wikidata is also insufficient to accurately represent many information resources. For example, it’s difficult to connect books that are part of monographic series (not serials). Bibliographically it’s important to maintain the relationship between series and book, but I can’t figure out how to do so. There are also no data models for adding and subsequently citing primary source material. One can add a field notebook as a book, but context for expeditions, other scientists, inclusive date ranges instead of a single publication date, the museum or special collections department that holds the item, and other organizational relationships are lost.
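
As a sketch of the first step of that manual research, here’s how one might query Wikidata’s wbsearchentities endpoint from Python (assuming the requests library; the endpoint and parameters are real, but note that the inverted, date-qualified library form has to be normalized before it will match anything):

import requests

API = "https://www.wikidata.org/w/api.php"

def search_author(name):
    """Return candidate Wikidata items for an author name string."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "type": "item",
        "format": "json",
    }
    return requests.get(API, params=params).json().get("search", [])

# "Agassiz, Alexander, 1835-1910" won't match directly; normalize first.
for hit in search_author("Alexander Agassiz"):
    print(hit["id"], hit["label"], hit.get("description", ""))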

To conclude then, after learning more about the theory behind WikiCite and where the community stands on adding citation data to Wikidata, I think there are two important functions that Wikidata needs to develop in order to take advantage of GLAM metadata.

  1. Batch importing – a way to generate Wikidata items (and from those, WikiCite citations in other Wikimedia projects) from CSV, TSV, XML, or MARC files at scale.
  2. Also, the opposite – as users manipulate and enrich Wikidata items, GLAMs need a way to harvest that new data/metadata and add it to our repositories and catalogs.

As WikiCite continues to grow, it will be interesting to follow how (hopefully not whether!) the community integrates GLAM metadata. Generating Linked Open Data citations has the potential to connect objects and concepts with information resources to create context for more accurate and interesting digital representations of knowledge and cultural heritage. And more complete and complex graphs are better at supporting deeper investigations, queries, and visualizations of these data across repositories, collections, and knowledge bases.

22 thoughts on “The Role of Librarians in Wikidata and WikiCite”

  1. Batch importing can be done with QuickStatements [1], which uses tab-separated commands to update, or create and populate, Wikidata items. Note that an even-more-powerful v2 of QuickStatements is in development [2].

    A number of tools allow you to harvest statements from Wikidata, for example Beacon [3] and ‘Wikidata to CSV’ [4].

    More tools are listed on Wikidata [5], but clearly we – the Wikidata community – need to do more to make people aware of them!

    [1] https://tools.wmflabs.org/wikidata-todo/quick_statements.php

    [2] https://www.wikidata.org/wiki/User:Magnus_Manske/quick_statements2

    [3] https://tools.wmflabs.org/wikidata-todo/beacon.php

    [4] https://tools.wmflabs.org/ash-django/wd2csv/

    [5] https://www.wikidata.org/wiki/Wikidata:Tools


    • Thanks for all the links – I have discovered QuickStatements! Great tool, and I’m looking forward to the new version. I’m working now on parsing/formatting/munging BHL .tsvs into the most viable structure for QuickStatements to add everything to the best property. Still a bit stuck on resolving authors (although the resolve authors tool is helpful [https://tools.wmflabs.org/sourcemd/new_resolve_authors.php]). Do you know if there are tools for BibTeX or RIS files, or even MARC records (less for GLAM workflows and more for my own curiosity)?
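
      For instance, the kind of QuickStatements (v1) output I’m aiming for looks roughly like this Python sketch – the row values and the Q-id are placeholders, not a finished mapping:

      # Emit QuickStatements v1 commands from parsed BHL rows.
      # QS v1 syntax: "CREATE" starts a new item; "LAST" targets it;
      # "Len" sets the English label; string values are double-quoted.
      rows = [
          {"title": "Revision of the Echini", "bhl_page_id": "12345"},
      ]

      for row in rows:
          print("CREATE")
          print("LAST\tLen\t\"%s\"" % row["title"])
          print("LAST\tP31\tQ3331189")  # instance of: edition (placeholder class)
          print("LAST\tP687\t\"%s\"" % row["bhl_page_id"])  # P687: BHL page ID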


      • I think resolving authors is going to be difficult, whatever tools we use. Deciding which “Jane Smith” out of 200 or more possibilities is the one we want requires judgement, not code – although ORCID iDs are a big help in that regard [1], for current authors.

        There’s a tool in development [2] that will have Zotero export its metadata as QuickStatements commands, so you could import your BibTeX (etc.) into Zotero and then use that, once it’s a stable release. I’m not aware of anything else, but then I don’t work with BibTeX, RIS, or MARC.

        [1] https://en.wikipedia.org/wiki/Wikipedia:ORCID/Institutions

        [2] https://github.com/UB-Mannheim/zotkat/blob/master/Wikidata QuickStatements.js


      • Firstly, WordPress’s commenting system doesn’t seem to let me reply to Andy below (sigh). Actually, Andy, I think we can make lots of use of code to help disambiguate authors. There’s a whole literature on using things like patterns of co-authorship to help out, and obviously with biodiversity literature we have lots of scope to use Wikispecies to help out (many BHL authors will be authors of articles in Wikispecies). “Judgement” can almost always be automated 😉


  2. Hi @KMika11, I think we have a similar impression of getting data into WikiCite (see my post http://iphylo.blogspot.co.uk/2017/05/wikidata-wikicite-and-of-life.html ). Daniel Mietchen has assembled a frighteningly long list of terms that Wikidata has for bibliographic items https://www.wikidata.org/w/index.php?title=Template:Bibliographical_properties. Andy has given some links to tools, but none of this looks particularly straightforward, especially if things are going to be added in bulk.

    Regarding “Alexander Agassiz”, what we really want is to state that the author is the entity with the id https://www.wikidata.org/wiki/Q122968, not have the author as a “dumb string”. Then it doesn’t matter what the string is; we have the concrete link.

    As an academic and a programmer rather than a librarian, I confess that I often regard library data models as more complicated than they need to be, and I think we can make a lot of progress by avoiding some (most?) of this complexity.


    • Hello! Yes, I agree with what you’ve written, and using BHL data to support a Bibliography of Life could be a wonderful collaboration. I’m working on parsing the data tables (well, a selection thereof) to add via QuickStatements, but it seems inevitable that a not insignificant amount of data is going to be lost. This is going to be largely unacceptable for much of the GLAM community, which expects public records to be as accurate and complete as possible.

      That’s exactly the problem to which I was referring re: “Alexander Agassiz” (I should have referenced the Q#). Since BHL lists the author as “Agassiz, Alexander, 1835-1910”, QuickStatements and Source MetaData can’t find the Q item and instead add it as a dumb string. Something complex like resolving authors doesn’t need to be completely automated, but it’s especially hard with BHL data that doesn’t include VIAF IDs and relies on the authority work of contributing institutions. The problem is, this is typical of GLAM metadata.


      • I think there are several ways forward here, depending on what you want to achieve (and I think it would be helpful to thrash out what the goals would be).

        One approach is to use Wikidata simply as an external “authority file”, so that entities such as publishers, museums, libraries, journals, people, places, etc. get linked to their Wikidata identifiers, and these identifiers can then be used internally by BHL instead of just dumb strings. This ultimately would enable some quite sophisticated queries across domains (such as museums, libraries, funders, publishers, etc.).

        Another approach is to treat Wikidata not only as an authority file but also as a data store, in which case we’d want to add potentially all the BHL bibliographic data into Wikidata, or at least that part of its content that is relevant to Wikimedia projects.

        Regarding disambiguating authors, I think there are two things to think about. Simply trying to disambiguate based on author names is going to be a challenge: many of the author names in BHL are wrong or incomplete, and many (most?) are variations of other names (e.g., with or without given names spelled out, with or without punctuation). I think a better approach is to map links rather than names. In other words, given that we have a link between Wikidata and Wikispecies for many authors, and we can make a link between a Wikispecies author and a work in BHL (once we parse Wikispecies), then we can assert that this person is the author of that work. In other words, we need to think about graphs or networks of connections in addition to just matching strings.
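
        As a toy illustration of that link-mapping idea in Python (all the data below is made up):

        # If Wikispecies links an author page both to a Wikidata item and
        # to works in BHL, we can assert authorship without ever
        # comparing name strings.
        wikispecies_to_wikidata = {"Alexander Agassiz (WS)": "Q122968"}
        wikispecies_to_bhl_works = {"Alexander Agassiz (WS)": ["BHL:Title:7258"]}

        for ws_page, qid in wikispecies_to_wikidata.items():
            for work in wikispecies_to_bhl_works.get(ws_page, []):
                # candidate statement: <work> P50 (author) <qid>
                print(work, "P50", qid)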

        The other consideration is that sometimes we have more information, such as “Agassiz, Alexander, 1835-1910”. This sort of string drives me nuts as it’s a classic case of overloading a data field, which happens a lot in bibliographic metadata (“hmm, I have more info, let me squeeze that into this field since no other field exists”). But in this case you can use this information. If we parse the string we have the family name, the given name, and the birth and death years. We can use this information to query Wikidata with a bit more precision, e.g. here’s a SPARQL query that uses that information to find Alexander Agassiz:

        SELECT ?_human ?name WHERE {
          VALUES ?family { "Agassiz"@en }
          VALUES ?given { "Alexander"@en }
          VALUES ?birth { 1835 }
          VALUES ?death { 1910 }

          ?_human wdt:P31 wd:Q5 .
          ?_human rdfs:label ?name .

          ?_human wdt:P734 ?familyName .
          ?familyName rdfs:label ?family .

          ?_human wdt:P735 ?givenName .
          ?givenName rdfs:label ?given .

          ?_human wdt:P569 ?birth_date .
          ?_human wdt:P570 ?death_date .

          FILTER(year(?birth_date) = ?birth)
          FILTER(year(?death_date) = ?death)
          FILTER(lang(?name) = "en")
        }

        You can try this query live at http://tinyurl.com/ycbyu7v3 – so, for entities for which BHL knows a bit more than just a string, you could do quite a lot of matching.
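
        And if you first need to pull those pieces out of the library string, here’s a rough Python sketch (the pattern is illustrative; real BHL strings also use en dashes and open-ended ranges like “1828 – ”):

        import re

        # Split "Family, Given, YYYY-YYYY" into the parts used by the
        # VALUES clauses of the SPARQL query above.
        pattern = re.compile(
            r"^(?P<family>[^,]+),\s*(?P<given>[^,]+),\s*"
            r"(?P<birth>\d{4})-(?P<death>\d{4})?$"
        )

        m = pattern.match("Agassiz, Alexander, 1835-1910")
        if m:
            print(m.group("family"), m.group("given"),
                  m.group("birth"), m.group("death"))
            # -> Agassiz Alexander 1835 1910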

        Hoping this helps. Wikidata is very powerful, if a bit clunky. Maybe it’s time to think about some custom tools to help tackle some BHL-specific questions.


      • Couple of other comments. I guess I had to smile at the “GLAM community, which expects public records to be as accurate and complete as possible”. I’ve seen a lot of public museum and herbarium records that are anything but accurate and complete, and a lot of bibliographic metadata from commercial publishers that is similarly inaccurate (occasionally even misleading). So while accuracy and completeness may be the aspiration, I don’t think we should kid ourselves as to the quality of what we have; it’s often pretty poor, even if sourced from “reputable” sources.

        The other comment is that there is clearly massive duplication of author names in BHL (some of these I’ve contributed via BioStor), and there’s a lot of scope for clustering those names into sets that probably represent the same name. Once again, this can be automated; see http://iphylo.blogspot.co.uk/2009/01/equivalent-author-names.html
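
        Even something as crude as grouping on family name plus first initial surfaces likely duplicates for review – an illustrative Python sketch:

        from collections import defaultdict

        # Group name variants under a rough key: lowercase family name
        # plus the first initial of whatever follows the first comma.
        names = [
            "Agassiz, Alexander, 1835-1910",
            "Agassiz, A.",
            "Agassiz, Alexander",
            "Lamont, James, Sir",
        ]

        clusters = defaultdict(list)
        for name in names:
            parts = [p.strip() for p in name.split(",")]
            family = parts[0].lower()
            initial = parts[1][0].lower() if len(parts) > 1 and parts[1] else ""
            clusters[(family, initial)].append(name)

        for key, members in clusters.items():
            if len(members) > 1:
                print(key, "->", members)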


      • Rod says “there is clearly massive duplication of author names in BHL”.

        One thing Wikidata is good at is spotting such duplicates, especially when they have the same external identifier (e.g. a VIAF or ORCID iD). Several organisations whose IDs are in Wikidata, alongside data from several other sources, have found Wikidata’s reports of such duplicates (e.g. [1]), or the ability to query for records with, say, the same birth and death dates, useful in cleaning up their own data through grouping, merging, or splitting records.

        [1] https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P214


    • Yes – author names in BHL are very duplicative. There are going to be a lot of problems that stem from these heritage records that were created in the 19th and early 20th centuries (and earlier) on paper and then migrated to newer technologies. And with BHL you have these old records from several different institutions that used varying conventions and upgraded to different (often proprietary) systems at different times. Authority files and authorized vocabularies and ontologies have also been used to varying degrees across different institutions and over time. So you have records with the author “Lamont, James, 1828 – ” (BHL creator ID 36152) from NHM London, “Lamont, James, 1828 – 1913” (BHL creator ID 16345) from UCLA and Smithsonian Libraries, and “Lamont, James, Sir” (creator ID 16344) from U Toronto Libraries. The birth and death dates in the first cases aren’t just there to shove more info into places where it doesn’t fit; they were used for many years as a disambiguation method.

      I like the idea of linking between Wikidata and Wikispecies for authors and adding works from BHL to Wikispecies. This seems like the most useful integration between the existing content in WD, WS, and BHL. As for the authority file vs. data storage goal – I’m not sure how realistic it is for BHL to shift workflows and ingestion processes to have records point to WD as an authority file. Catalog records are pulled by BHL and Internet Archive (which is used as the image delivery service) from individual institutions via a Z39.50 fetching process, and adjusting this would implicate other dependent processes. There is also reluctance to include “unstable” things in BHL – by which I just mean that Wikidata is subject to change through community edits. This is sort of what I was (perhaps unsuccessfully) trying to get at with the “accurate and complete” comment. Obviously catalogs and metadata repositories are riddled with errors and are not particularly well structured for current technologies; this is why we still have jobs! But it is likely going to be difficult to get most institutions on board to donate data to a repository that isn’t yet structured to accurately represent an organization’s resources. For BHL it’s more of an issue of us having out-of-date and ambiguous data, but for special collections or museums it’s going to be important to model specimens and unpublished material. Using Wikidata as a data storage repository, however, seems like a goal that many GLAMs would be interested in – for the links across collections and opportunities to improve metadata.


    • To clarify my own comment: While we would indeed “store the ‘1368’ part of the URI”, we’d also store the rest of the URI, as a “formatter URL” in the property’s definition, in this case as “http://www.biodiversitylibrary.org/creator/$1” or “http://www.biodiversitylibrary.org/creator/$1#/titles”, and Wikidata’s software knows to substitute the “$1” with the ID from the item representing the individual, thus building the full URI on the fly. This allows us to have multiple formatter URLs for a single identifier-property.
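
      In code terms the substitution is trivial – a one-line Python illustration:

      # A formatter URL turns a stored identifier into a full link.
      formatter = "http://www.biodiversitylibrary.org/creator/$1"
      print(formatter.replace("$1", "1368"))
      # -> http://www.biodiversitylibrary.org/creator/1368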

      Also, we have tools, like Mix’n’Match [1], which allow us to do bulk imports of identifiers from a site like BHL. It will heuristically match them to Wikidata items, then present them to humans in a simple (some would say “gamified”) interface for confirmation or rejection.

      [1] https://tools.wmflabs.org/mix-n-match


      • This is very cool! I can see how the creator IDs would work similarly to the BHL page ID property. Could a combination of the mix-n-match tool with Creator IDs be used to effectively merge several authors with distinct Creator IDs in BHL into one Wikidata item? To add two or three BHL Creator IDs to one author Q item?


  3. Katie asked: “Could a combination of the mix-n-match tool with Creator IDs be used to effectively merge several authors with distinct Creator IDs in BHL into one Wikidata item? To add two or three BHL Creator IDs to one author Q item?”

    Yes; or rather: to collect several BHL identities for a *single* author into one Wikidata item. Once the property I proposed is created, I’d be happy to chat with you in more detail about how we (the Wikidata community) can do this, and how BHL can participate – but perhaps by email or Skype/hangout rather than here?


  4. While not the point of your post, I did want to point out something of an inaccuracy.

    > Wikipedia built the largest curated bibliography in the world and it is largely not discoverable, very difficult to maintain, very hard to analyze, and unusable.

    Every citation on Wikipedia using the CS1 system [1] (that is, {{citation}}, {{cite journal}}, and others) outputs metadata using COinS [2].

    So here is some discovery, in case you did not know about it. 🙂

    [1] https://en.wikipedia.org/wiki/Help:Citation_Style_1
    [2] https://en.wikipedia.org/wiki/COinS
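
    For example, here’s a rough Python sketch of pulling those COinS spans out of page HTML (the markup below is simplified, and attribute order varies in practice, so a real HTML parser is safer than this regex):

    import re
    from html import unescape
    from urllib.parse import parse_qs

    # COinS embeds an OpenURL ContextObject in the title attribute
    # of a <span class="Z3988">.
    page = '<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft.genre=article&amp;rft.atitle=Example"></span>'

    for title in re.findall(r'class="Z3988"\s+title="([^"]*)"', page):
        fields = parse_qs(unescape(title))
        print(fields.get("rft.atitle"), fields.get("rft.genre"))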


  5. I expect that Andy Mabbett’s proposal of the property for the BHL creator ID already helps a lot. If there’s still information where you think Wikidata currently lacks properties to store it, I invite you to raise the topic on the Wikidata Project Chat. If we currently lack properties to model data that you have, we can create a new property for that data.


  6. Pingback: BHL and Social Media | Herbarium World

  7. All the BHL authors are now in Mix’n’Match. I have matched over 1000 BHL authors and merged them into one Wikidata item when I thought they were the same.
    When a book is known in the Library of Congress, it often implies that the LoC knows the author as well – one reason why we should import links to all the books: it allows people to read the books that have been digitised.

