Wikidata and BHL Update: Part 1

This is an incomplete post about the ongoing work to add BHL bibliographic metadata to Wikidata. I hope to publish several more of these posts before the end of the year!

Following some productive conversations about donating BHL bibliographic metadata to Wikidata, it became clear almost immediately that BHL’s data is not terribly useful without some serious munging. One of the biggest problems with BHL bibliographic metadata is that it comes from many different libraries and museums, legacy cataloging systems, and various types of authority work. For example, BHL attaches Creator IDs to Author names, which is useful for identifying Authors and connecting titles and items to them, but the IDs are assigned automatically according to the character strings imported from specific fields in a library catalog’s MARC record. Despite (and perhaps because of) the use of varying authority files to control Author name strings in institutional catalog records, different libraries have contributed items by the same author whose name is spelled, punctuated, and identified differently. BHL does not conduct authority control on its metadata, choosing instead to focus on improving access to items based on content rather than metadata. Fortunately, there are several ways to go about reconciling and disambiguating data, and one of them is crowdsourcing.

BHL can use Wikidata to tell its users that “Packard, Alpheus S” (Creator ID: 82636), “Packard, A” (Creator ID: 59850), “Packard, A S” (Creator ID: 48286), “Packard, A. S. (Alpheus Spring), 1839-1905” (Creator ID: 1592), and “Packard, Alpheus Spring” (Creator ID: 56087) are all the same person without editing the spelling or legacy metadata from the catalog record.


Dr. Packard’s Wikidata Item viewed in Reasonator
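To make the idea concrete, here is a minimal sketch of that kind of link table, using the five Packard records quoted above. The QID is a placeholder rather than the real Wikidata item ID, and the names `creator_to_qid`, `same_person`, and `name_variants` are made up for illustration; this is not how BHL stores anything.

```python
from typing import List

# Creator IDs and name strings are the five Packard records quoted in the post.
# PACKARD_QID is a placeholder, NOT the real Wikidata item ID.
PACKARD_QID = "Q_PACKARD_PLACEHOLDER"

bhl_creators = {
    82636: "Packard, Alpheus S",
    59850: "Packard, A",
    48286: "Packard, A S",
    1592: "Packard, A. S. (Alpheus Spring), 1839-1905",
    56087: "Packard, Alpheus Spring",
}

# Hypothetical link table: one entry per BHL Creator ID, pointing at a Wikidata item.
creator_to_qid = {creator_id: PACKARD_QID for creator_id in bhl_creators}


def same_person(id_a: int, id_b: int) -> bool:
    """Two Creator IDs denote the same person if both link to the same item."""
    qid_a = creator_to_qid.get(id_a)
    qid_b = creator_to_qid.get(id_b)
    return qid_a is not None and qid_a == qid_b


def name_variants(qid: str) -> List[str]:
    """All catalog strings attached to one Wikidata item, left unedited."""
    return [bhl_creators[cid] for cid, q in creator_to_qid.items() if q == qid]
```

The catalog strings are never touched; only the link from Creator ID to Wikidata item carries the claim that these are the same person.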

One way to do this is to use Wikidata as a source of identifiers: add a property for the BHL Creator ID in Wikidata (P4081), and add a table in BHL for Wikidata identifiers that can be associated with those same Creator IDs. Adding identifiers makes Wikidata a more robust knowledge base, which improves the discoverability of BHL’s content by enriching its metadata externally and solves some metadata problems internally. While some of the reconciling can be done computationally using (still more) authority files, automated matching often misidentifies strings and isn’t very helpful when an author is not in that particular database. These errors are best caught by humans, whom Wikidata invites to correct mistakes and add identifiers directly. By adding Creator IDs to Wikidata and in turn adding Wikidata IDs to BHL, BHL can leverage the wisdom of the crowd to reconcile its author metadata.
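As a rough illustration of why purely computational matching needs a human check, the sketch below groups name strings by a deliberately naive key (surname plus first initial). The key function is my own invention for this example, not what any reconciliation service actually does, and “Packard, Arthur” is a made-up name added to show a false match.

```python
import re
from collections import defaultdict


def naive_key(name: str) -> str:
    """Reduce 'Surname, Forenames ...' to 'surname|first initial'."""
    surname, _, rest = name.partition(",")
    # Keep only letters from the forename part, then take the first one.
    initial = re.sub(r"[^a-z]", "", rest.lower())[:1]
    return f"{surname.strip().lower()}|{initial}"


names = [
    "Packard, Alpheus S",
    "Packard, A",
    "Packard, A S",
    "Packard, A. S. (Alpheus Spring), 1839-1905",
    "Packard, Alpheus Spring",
    "Packard, Arthur",  # hypothetical different person, for illustration only
]

groups = defaultdict(list)
for name in names:
    groups[naive_key(name)].append(name)
```

All six strings land under the same key, so the five genuine variants are merged correctly but the invented sixth name is swept in with them. A stricter key would avoid that false match at the cost of splitting “Packard, A” from the rest, which is the trade-off that makes human review worthwhile.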

To test this idea and start down a path that will hopefully lead to more BHL data in Wikidata, I worked with Andy Mabbett (User:PigsOnTheWing) to add a representative set of 1000 BHL CreatorIDs to Wikidata. The first step was to disambiguate these authors. To procure the sample, I used the rbhl R package to interface with the BHL API and pull a random sample of authors with associated DOIs.[1] The rbhl package is an rOpenSci tool and can be found on their GitHub. The R script I used can also be found on GitHub at: https://github.com/kmika11/BHL_Wikidata/blob/master/CreatorID/rbhl_CreatorIDScript.R . Once I had generated a table of Author strings, CreatorIDs, an associated Title, and its DOI, I headed over to OpenRefine to start reconciling BHL CreatorIDs. As you’ll remember from a few paragraphs ago, BHL doesn’t conduct authority control and relies instead on the work of partner institutions. This means that there are no external identifiers for authors in BHL. We chose to reconcile against VIAF IDs because VIAF has the most identifiers in Wikidata (for library resources at least). Once an author had a VIAF ID, the matching Wikidata item could be found and the CreatorID added to it as a P4081 statement. The tool Mix’n’match makes the part of this process that requires some human thought pretty simple and somewhat fun![2]
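The last step of that workflow, going from reconciled VIAF IDs to P4081 statements, can be sketched roughly as follows. This is an illustration under assumptions, not the process we actually ran: Creator ID 1592 comes from the post, but every VIAF ID and QID here is a placeholder, the second row is invented, and `build_statements` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

BHL_CREATOR_ID = "P4081"  # Wikidata property for the BHL Creator ID

# One row per author after reconciliation. Creator ID 1592 is quoted in the post;
# the VIAF ID, the QID, and the whole second row are placeholders for illustration.
reconciled_rows = [
    {"creator_id": "1592",
     "name": "Packard, A. S. (Alpheus Spring), 1839-1905",
     "viaf": "VIAF_PLACEHOLDER_A"},
    {"creator_id": "CREATOR_PLACEHOLDER_B",
     "name": "Example, Unmatched",
     "viaf": None},
]
viaf_to_qid = {"VIAF_PLACEHOLDER_A": "Q_PLACEHOLDER_A"}

Statement = Tuple[str, str, str]  # (item QID, property, value)


def build_statements(
    rows: List[dict], viaf_lookup: Dict[str, str]
) -> Tuple[List[Statement], List[dict]]:
    """Split rows into ready-to-add statements and rows needing human review."""
    statements: List[Statement] = []
    needs_review: List[dict] = []
    for row in rows:
        qid = viaf_lookup.get(row["viaf"]) if row["viaf"] else None
        if qid is None:
            needs_review.append(row)  # no VIAF match, or VIAF ID not in Wikidata
        else:
            statements.append((qid, BHL_CREATOR_ID, row["creator_id"]))
    return statements, needs_review


statements, needs_review = build_statements(reconciled_rows, viaf_to_qid)
```

Rows that end up in `needs_review` are the ones a person has to look at by hand.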

Now, my next step is figuring out what the next steps are. There is some interest in adding the New York Botanical Garden’s herbarium type specimens to Wikidata along with protologue literature from BHL, and perhaps field notebooks and other items relevant to collecting events. BHL also has quite a long list of taxon names (3,732,986 names) with metadata for the pages they come from. I don’t think it’s appropriate to push all of this data to Wikidata, but it is a significant dataset that could be useful in various ways. Another issue is that resolving author strings to VIAF IDs is a significant amount of work. Gerard Meijssen has suggested using Open Library IDs, which are already resolved to VIAF and often to Wikidata, and which may be a solution. BHL hosts its content on the Internet Archive, which is the creator of the Open Library. One would imagine that it is a simple hop, skip, and a jump from BHL CreatorIDs to Open Library IDs, but I’m still investigating whether that is, in fact, the case.

Please jump in with any thoughts about Wikidata + BHL or what I’ve described above. I know that WordPress is not terribly conducive to discussion, but that’s how we’re set up for now. I do not claim to have an expert-level grasp of Wikidata yet (or of BHL, for that matter), but this collaboration seems to be a constructive Open Data pursuit!

1. During this step I incorrectly assumed that BHL minted DOIs for all its content including individual articles. BHL does mint DOIs for monographs, and worked with BioStor to add 12,000 DOIs for articles.

2. The manual for using Mix’n’match can be found at: https://meta.wikimedia.org/wiki/Mix%27n%27match/Manual


2 thoughts on “Wikidata and BHL Update: Part 1”

  1. Hi Katie, A couple of quick thoughts. The first is that I really do think we need someplace better to hold this discussion. One way forward might be to use Github. For example, the next Catalogue of Life is being sketched out there in a repository https://github.com/sp2000/colplus and people are weighing in with various ideas, see https://github.com/sp2000/colplus/issues. Creating a repository within, say the BHL organisation https://github.com/gbhl would mean these conversations can be both open and persistent. Quite a few BHL folk are on Github already, so it should be reasonably familiar. You can create some pretty rich documents as part of the discussion, e.g. https://github.com/rdmpage/biostor/issues/63

    You mentioned that you added data to Wikidata. It would be handy to have some links to some examples so readers can see what this looks like. It would also be fun to try some Wikidata queries related to BHL. Lastly, there doesn’t seem to be a BHL TitleID property in Wikidata. Having one would make it easier to do some additional linking and querying.


  2. Hi Rod, I think GitHub is a good idea – I’ve sort of got a repository started (https://github.com/kmika11/BHL_Wikidata) that we can use, or I’m happy to contribute to another one. An example of an item with a new BHL CreatorID is: Alfred Russel Wallace (https://www.wikidata.org/wiki/Q160627). Another example is George Marx (https://www.wikidata.org/wiki/Q3101745), who has a page in English and German Wikipedias, several other Big Wikipedias (Spanish, Italian, and French), and a Wikispecies page, but no external identifier. Now, he’s got a BHL CreatorID!

    There are some fun queries (https://query.wikidata.org/) that I’ve been playing around with, but I haven’t quite got the hang of SPARQL syntax yet. This query:

    # Humans that have both a BHL Creator ID and a VIAF ID
    SELECT ?person ?personLabel ?BHLCreatorID ?VIAFid WHERE {
      ?person wdt:P31 wd:Q5.            # instance of: human
      ?person wdt:P4081 ?BHLCreatorID.  # BHL Creator ID
      ?person wdt:P214 ?VIAFid.         # VIAF ID
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }

    returns items (humans) that have VIAF and BHL Creator IDs. I get 504 items, which is about half of the pilot set. I was also unable to reconcile a percentage of the authors pulled from BHL with VIAF identifiers, which obviously affected adding IDs. The table is on GitHub. I can remove the VIAF ID line to find out how many total items have a BHL Creator ID (548), and I can filter to find items with a BHL CreatorID but no VIAF ID with this query:

    # Humans that have a BHL Creator ID but no VIAF ID
    SELECT ?person ?personLabel ?BHLCreatorID WHERE {
      ?person wdt:P31 wd:Q5.            # instance of: human
      ?person wdt:P4081 ?BHLCreatorID.  # BHL Creator ID
      FILTER NOT EXISTS { ?person wdt:P214 ?VIAFid }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }

    This returns 44 items, and they’re mostly un-disambiguated names that I am going through and manually editing.

    I also agree that a TitleID property would be a good idea for adding articles and monographs.
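The queries in the reply above differ only in how they treat the VIAF ID (P214), so the three counts mentioned there could be produced by one small helper. This is a minimal sketch: `build_query` and `count_items` are hypothetical names, the User-Agent string is made up, and the counts returned today may well differ from the ones quoted in the reply.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

# How each mode constrains the VIAF ID (P214) of the matched person.
VIAF_CLAUSES = {
    "required": "?person wdt:P214 ?viafId .",
    "absent": "FILTER NOT EXISTS { ?person wdt:P214 ?viafId }",
    "any": "",
}


def build_query(viaf: str) -> str:
    """Count humans with a BHL Creator ID; viaf is 'required', 'absent' or 'any'."""
    return (
        "SELECT (COUNT(DISTINCT ?person) AS ?n) WHERE {\n"
        "  ?person wdt:P31 wd:Q5 ;\n"
        "          wdt:P4081 ?bhlCreatorId .\n"
        f"  {VIAF_CLAUSES[viaf]}\n"
        "}"
    )


def count_items(viaf: str) -> int:
    """Run the count query against the public endpoint (needs network access)."""
    params = urllib.parse.urlencode({"query": build_query(viaf), "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "bhl-wikidata-example/0.1"},  # made-up identifier
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["n"]["value"])
```

Calling `count_items("required")`, `count_items("any")`, and `count_items("absent")` corresponds to the three figures discussed in the reply.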

