Wikidata and BHL Update: Part 1

This is a fairly incomplete post about the work that’s going on regarding adding BHL bibliography metadata to Wikidata. I hope to have several more of these posts before the end of the year! 

Following some productive conversations on donating BHL bibliographic metadata to Wikidata, it became clear almost immediately that BHL’s data is not terribly useful without some serious munging. One of the biggest problems with BHL bibliographic metadata is that it comes from lots of different libraries and museums, legacy cataloging systems, and varying approaches to authority work. For example, BHL attaches Creator IDs to author names, which is useful for identification and for connecting titles and items to their authors, but those IDs are assigned automatically based on the character strings imported from specific fields of a library catalog’s MARC records. Despite (and perhaps because of) the use of varying authority files to control author name strings in institutional catalog records, different libraries have contributed items by the same author whose names are spelled, punctuated, and identified differently. BHL does not conduct authority control on its metadata, choosing instead to focus on improving access to items based on content rather than metadata. Fortunately, there are several different ways to go about reconciling and disambiguating the data, and one of them is crowdsourcing.

BHL can use Wikidata to tell its users that “Packard, Alpheus S” (Creator ID: 82636), “Packard, A” (Creator ID: 59850), “Packard, A S” (Creator ID: 48286), “Packard, A. S. (Alpheus Spring), 1839-1905” (Creator ID: 1592), and “Packard, Alpheus Spring” (Creator ID: 56087) are all the same person without editing the spelling or legacy metadata from the catalog record.
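To make the reconciliation problem concrete, here is a rough sketch in Python of how those catalog strings might be collapsed. The normalization heuristic is entirely invented for illustration (it is not anything BHL runs), and it shows both the promise and the limits of automation:

```python
import re

def normalize_author(name):
    """Collapse punctuation and spacing variants of a catalog name string.

    A rough, invented heuristic: it cannot tell two different people with
    the same initials apart, which is exactly why human review is needed.
    """
    name = name.lower()
    name = re.sub(r"\(.*?\)", " ", name)           # drop "(alpheus spring)"
    name = re.sub(r"\b\d{4}-\d{4}\b", " ", name)   # drop life dates "1839-1905"
    name = re.sub(r"[^\w\s]", " ", name)           # strip punctuation
    tokens = name.split()
    # keep the surname plus the first initial of each remaining token
    return " ".join([tokens[0]] + [t[0] for t in tokens[1:]])

variants = [
    "Packard, Alpheus S",
    "Packard, A",
    "Packard, A S",
    "Packard, A. S. (Alpheus Spring), 1839-1905",
    "Packard, Alpheus Spring",
]
# Four of the five collapse to "packard a s"; "Packard, A" stays ambiguous.
print({normalize_author(v) for v in variants})
```

Note that even this toy example leaves one string unresolved, which is the kind of error best caught by people rather than scripts.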


Dr. Packard’s Wikidata Item viewed in Reasonator

One way to do this is to treat Wikidata as an identifier hub: a property for BHL Creator IDs now exists in Wikidata (P4081), and BHL can add a table of Wikidata identifiers associated with those same Creator IDs. Adding identifiers makes Wikidata a more robust knowledge base, improves the discoverability of BHL’s content by enriching its metadata externally, and solves some metadata problems internally. While some of the reconciling can be done computationally using (still more) authority files, that approach often misidentifies strings and isn’t very helpful when an author is missing from the particular database. These errors are best caught by humans, whom Wikidata invites to directly edit mistakes and add identifiers. By adding Creator IDs to Wikidata and in turn adding Wikidata IDs to BHL, BHL can leverage the wisdom of the crowd to reconcile its author metadata.
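For batch edits like this, Wikidata’s QuickStatements tool accepts one pipe-delimited statement per line, with external-identifier values as quoted strings. A minimal sketch (the QID below is a placeholder, not Dr. Packard’s actual item):

```python
def quickstatement(qid, creator_id):
    """Format one QuickStatements (v1) line adding a BHL Creator ID (P4081).

    External-identifier values are quoted strings in QuickStatements syntax.
    """
    return f'{qid}|P4081|"{creator_id}"'

# Q00000000 is a placeholder QID used for illustration only.
for cid in (82636, 59850, 48286, 1592, 56087):
    print(quickstatement("Q00000000", cid))
```

In practice we used Mix n’ Match rather than raw QuickStatements, but the underlying statement being added is the same.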

In order to test this idea and start down a path that will hopefully lead to more BHL data in Wikidata, I worked with Andy Mabbett (User:PigsOnTheWing) to add a representative set of 1000 BHL Creator IDs to Wikidata; the first step was to disambiguate these authors. To procure a sample of 1000 representative authors, I used the rbhl R package to interface with the BHL API and pull a random sample of authors with associated DOIs.1 The rbhl package is an rOpenSci tool and can be found on their GitHub. The R script I used can also be found on GitHub. Once I was able to generate a table of author strings, Creator IDs, an associated title, and its DOI, I headed over to OpenRefine to start reconciling BHL Creator IDs. As you’ll remember from a few paragraphs ago, BHL doesn’t conduct authority control and relies instead on the work of partner institutions, which means there are no external identifiers for authors in BHL. We chose to reconcile against VIAF IDs because VIAF has the most identifiers in Wikidata (for library resources, at least). Once there were VIAF IDs, the Creator IDs could be added as P4081 statements to author QIDs. The tool Mix n’ Match makes the part of this process that requires some human thought pretty simple and somewhat fun!2
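The sampling-and-table step above can be sketched in a few lines. My actual workflow used R and rbhl; this Python version is illustrative only, and the record fields and DOI are invented stand-ins rather than the BHL API’s real schema:

```python
import csv
import io
import random

# Toy stand-ins for author records pulled from the BHL API; the field
# names and DOI here are illustrative, not the API's actual schema.
records = [
    {"AuthorString": "Packard, A. S.", "CreatorID": 1592,
     "Title": "Guide to the study of insects",
     "DOI": "10.5962/bhl.title.example"},
]

random.seed(0)  # reproducible sample
sample = random.sample(records, k=min(1000, len(records)))

# Write the table that OpenRefine will ingest for reconciliation.
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["AuthorString", "CreatorID", "Title", "DOI"])
writer.writeheader()
writer.writerows(sample)
print(buf.getvalue())
```

The resulting CSV is what gets loaded into OpenRefine for reconciliation against VIAF.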

Now, my next step is figuring out what the next steps are. There is some interest in adding New York Botanical Garden’s herbarium type specimens to Wikidata, along with protologue literature from BHL and perhaps field notebooks and other relevant collecting event items. BHL also has quite a long list of taxon names (3,732,986 names) with metadata for the pages they’ve come from. I don’t think it’s appropriate to push all of this data to Wikidata, but it is a significant dataset that could be useful in varying ways. Another issue is that resolving author strings to VIAF IDs is not an insignificant amount of work. Gerard Meijssen has brought up the idea of using Open Library IDs, which are already resolved to VIAF and often to Wikidata; this may be a solution. BHL hosts its content on the Internet Archive, which created the Open Library. One would imagine it is a simple hop, skip, and a jump from BHL Creator IDs to Open Library IDs, but I’m still investigating whether that is, in fact, the case.

Please jump in with any thoughts about Wikidata + BHL or what I’ve described above. I know that WordPress is not terribly conducive to discussion, but that’s how we’re set up for now. I do not claim to have an expert level grasp of Wikidata yet (or BHL for that matter), but this collaboration seems to be a constructive Open Data pursuit!

1. During this step I incorrectly assumed that BHL minted DOIs for all of its content, including individual articles. BHL does mint DOIs for monographs, and worked with BioStor to add 12,000 DOIs for articles.

2. The manual for using Mix n’ Match can be found at:

The Role of Librarians in Wikidata and WikiCite


The other week I participated in WikiCite 2017, a conference, summit, and hackathon event organized for members of the Wikimedia community to discuss ideas and projects surrounding the concept of adding structured bibliographic metadata to Wikidata to improve the quality of references in the Wikimedia universe. As a Wikidata editor and a librarian, I was pumped to be included in the functional and organizational conversations for WikiCite and learn more about how librarians and GLAMs can contribute.

The Basics (briefly and criminally simplified)

Galleries, Libraries, Archives, and Museums are institutions that collect, preserve, and make available information artefacts and cultural heritage items for use by the public. Before databases, librarians managed card catalogs to facilitate access; these were translated into MAchine Readable Cataloging (MARC) format digital records to create online catalogs (ca. 1970s-2000s). As items in collections are digitized, librarians et al. add descriptive, administrative, and technical/structural metadata to records and provide access to digital surrogates via a digital library or repository, depending on copyright. Metadata, however, is generally not subject to copyright and is often published by GLAMs for analysis and use via direct download, APIs, and, in more and more cases, as Linked Open Data. As a field, we’re still at the beginning of this transformation to Linked Open Data and have significant questions still to answer and thorny issues still to resolve.


Diagram of a Wikidata item

Wikidata is a source of machine-readable, multilingual, structured data collected to support Wikimedia projects, licensed CC0 under the theory that simple statements of fact are not subject to copyright. Wikidata items are composed of statements that have properties and values. In the Linked Open Data world these items are graphs, with statements expressed as triples. As Wikimedians and Wikidata editors add more of this supporting structured data to Wikipedia, the idea of adding bibliographic metadata to Wikidata started coming up. Essentially: “Here are some great structured data that are incredibly important to the functionality of Wikipedia; how can we add them to this repository that we’re creating in a useable way?” As many librarians (and really anyone who’s written a substantial research paper) are aware, citations are complicated.

Why Transcribe?

Digitization is not a new activity for libraries and cultural heritage institutions; indeed, it has become a critical tool for preserving and providing access to archival collections including rare books, manuscripts, and photographs. The potential research value of digitized collections is also not a new phenomenon. However, translating images of content into machine-readable data that can be searched, sorted, and otherwise manipulated had not received much attention until crowdsourcing, citizen science, and other community collaboration models and platforms emerged. A definition of transcription is useful for understanding some of the competing elements when considering whether and how to transcribe digitized items. Huitfeldt and Sperberg-McQueen distinguish between transcription as an act, as a product, and as a relationship between documents.1 Cultural heritage institutions need to explicitly facilitate the creation and dissemination of each in order to host a successful transcription program. While crowdsourcing methods directly address the act of transcription, libraries are often better suited to produce viable representations of transcription products and relationships in digital repositories. Crowdsourcing thus becomes one of several methods or tools for libraries to develop successful transcription workflows.


Image from William Brewster’s diary from 1865 that identifies several birds by their common species names

Transcription helps bridge the gap between digitization and use by enhancing access through full-text search, enriching metadata collection, and opening collections to digital textual analysis. Digitized natural history manuscript items are largely hidden due to the lack of item-level description for most archival collections. While minimal processing is certainly the better option compared to maintaining an extensive backlog of unprocessed material, digitized handwritten documents are not discoverable based on their unique content without a machine-readable facsimile. Indexing transcriptions facilitates discovery of historical records and improves catalog search results. By offering full-text transcriptions, the digital collections are opened up to new types of searching, sorting, categorizing, and pattern finding. Research derived from these new data sets can illustrate changes over time across much larger magnitudes of collections and types of information resources.
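The indexing idea is simple enough to sketch. A minimal inverted index maps each word to the pages it appears on (the page IDs and diary text below are invented for illustration):

```python
import re
from collections import defaultdict

def build_index(transcriptions):
    """Minimal inverted index (word -> set of page ids): a toy illustration
    of how full-text transcription makes handwritten pages keyword-searchable."""
    index = defaultdict(set)
    for page_id, text in transcriptions.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index

# Invented page ids and text, loosely in the spirit of a field diary.
pages = {
    "diary-1865-p12": "A pair of bluebirds at Fresh Pond.",
    "diary-1865-p13": "Heard a song sparrow; no bluebirds today.",
}
index = build_index(pages)
print(sorted(index["bluebirds"]))  # ['diary-1865-p12', 'diary-1865-p13']
```

A production search system would add stemming, stop words, and ranking, but even this toy version turns handwriting images into something a catalog can query.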

Reflecting on Open Access and Code4Lib 2017

In considering how to consolidate my thoughts from Code4Lib 2017, I spent some time reviewing the pre-conference workshops and the interesting and directly relevant talks from last week. Ultimately, as I am sure many other attendees discovered, I found that the framework of the conference and a lot of our work as library technologists was best examined by Christina Harlow in her keynote “Resistance is Fertile.”1 There were many (many) other presentations and discussions throughout the conference that were inspiring, enlightening, and compelling, but Harlow synthesized the meaning behind what we all do and applied to it a language and a methodology for doing it better.

And it was remarkable. I think people even cried a bit. We all stood up at the end and clapped a lot.


And over the next few hours and days I thought about how BHL and my position as an NDSR resident fit into this framework, and how I can be an agent who advocates not just for Open Access to content but also for its ethical and operational background. Harlow keenly argues for investigating the transparency of library policies, if not to resolve inherent biases in programming, systems architecture, and design, then to encourage further democratizing the “means of production” (of datasets, of metadata, of documentation) in pursuit of accessibility and true openness.2

Transcription Tools: a survey

Field notebooks and diaries have historically been retained by natural history institutions as reference files for museum specimens and associated collecting events. More recently, however, researchers have begun to uncover vast historical data sets as part of their scholarship in scientific taxonomy, species distribution and occurrence, climate change studies, and the history of science. Field notebooks contain significant information related to scientific discovery and are rich sources of data describing biodiversity across space and time. They enhance our understanding of field expeditions by narrating meteorological events, documenting personal observations and emotional perspectives, illustrating habitats and specimens, and recording dates and locations. Unfortunately, much of this information is almost totally inaccessible. Even digitized collections require users to sift through hundreds or thousands of images and take highly detailed notes to extract their content.

Enter (hopefully) Citizen Scientists!  

By crowdsourcing the collection of this information and parsing it into sets of structured data, BHL users will be able to engage in qualitative analyses of scientists’ narratives as well as quantitative research across ranges of dates and geographical regions. Full-text transcriptions will allow us to index collections and provide keyword searching, and pulling facets out of this unstructured data will help make that access more meaningful and usable. The ultimate goal is for BHL to integrate taxon names, geographic locations, dates, scientists, and other types of observation and identification information with the published and manuscript items across BHL. By attaching this historical metadata to catalog records of published literature and archival collections, BHL will be able to provide a more complete picture of a given ecosystem at a given time.
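As one small illustration of facet-pulling, dates are among the easiest pieces of structure to lift out of transcribed text. A very rough first pass (the diary line is invented, and real workflows need far more date patterns than this):

```python
import re

# An invented diary line for illustration.
transcription = ("April 19, 1865. Cambridge. Saw a pair of bluebirds "
                 "and heard a song sparrow near Fresh Pond.")

# Pull out anything shaped like a "Month D, YYYY" date.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b")

dates = [m.group(0) for m in DATE.finditer(transcription)]
print(dates)  # ['April 19, 1865']
```

Taxon names, localities, and collectors need far richer tooling (gazetteers, name-finding services), but the principle is the same: structured facets extracted from unstructured transcriptions.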

To this end one of my first tasks when I arrived at MCZ was to familiarize myself with the current landscape of tools for building crowdsourcing, citizen science, and manuscript transcription projects. While there are several successful designs and models, in order to focus my scope I concentrated on those that met the following criteria:

  • Built or updated within the last three years. Granted there are some important lessons to be learned from some of the older projects, but I need some current references.
  • Use tools that are free or open source. BHL is committed to providing open access to biodiversity literature, and a good way to honor that is to focus on projects that share similar values. 
  • Have an existing volunteer base. While there is a high probability that this project will be used for outreach with BHL users, it is prudent to engage with dedicated volunteers that are already interested in and experts at transcription and citizen science.

I did not require that tools support specific markup or encoding for a few reasons:

  1. Projects generally ask volunteers to either transcribe documents or pull out structured data from them. While we might like to ask for both, there does not seem to be a sustainable model for this quite yet.
  2. Part of BHL’s current workflow for mining scientific names requires plain text (.txt) files with no markup and there is a reasonable chance that this process will be enhanced to pull out dates, locations, and other value additions.

The four tools that I ended up spending some significant time with were Ben Brumfield’s FromThePage, the Australian Museum’s DigiVol, the Smithsonian Institution’s Transcription Center, and the Zooniverse’s Project Builder and Scribe development framework. I should insert a disclaimer here: I am not starting completely from scratch with this research. MCZ has used both DigiVol and FromThePage for recent transcription projects, and everyone should go check out Science Gossip, the Missouri Botanical Garden’s Zooniverse project developed to generate keyword tags for illustrations in BHL.



FromThePage, DigiVol, and the SI Transcription Center all operate in fundamentally similar ways, with each providing different features for libraries and volunteers. FromThePage is a lightweight, open source, collaborative transcription platform. Its defining feature is its use of wiki-style markup to link references and subjects within texts and dynamically index terms. The design is optimized for archives projects, and it is the simplest tool to deploy quickly. It has a very clean interface for viewing, transcribing, and coding people, places, and subjects across a collection of documents. While the markup system is simple, powerful, and effective, it does not fit seamlessly into the existing BHL metadata structure. FromThePage seems to have been developed specifically for archives collections, which are not cataloged the way library collections are. The wiki tagging could be designed specifically for BHL (and can be exported as TEI-compliant XML), but would require a not insignificant amount of processing before uploading to the BHL portal.
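To show what that wiki-style markup buys you, here is my own sketch (not FromThePage’s actual code) of how double-bracket subject links can drive a dynamic index; the sample pages are invented:

```python
import re
from collections import defaultdict

# Wiki-style links: [[Subject]] or [[Subject|display text]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def index_subjects(pages):
    """Map each linked subject to the pages that mention it."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for match in WIKILINK.finditer(text):
            index[match.group(1).strip()].append(page_id)
    return dict(index)

pages = {  # invented sample pages
    "p1": "Collected near [[Fresh Pond]] with [[William Brewster|Brewster]].",
    "p2": "Returned to [[Fresh Pond]] at dawn.",
}
print(index_subjects(pages))
```

The subject name before the pipe is the canonical index entry, while the text after it is what readers see in place, which is why the same person can be displayed differently on every page yet indexed once.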

DigiVol was built by the Australian Museum as an Atlas of Living Australia project and combines a similarly simple and attractive viewing and transcription interface with tools for extracting specimen data from items. There is not a simple process for marking up text, but the platform features a form that invites volunteers to enter scientific names of specimens with the dates and locations of their collection or observation. This generates a CSV document that retains valuable information in a structured format. DigiVol is a tremendous tool for BHL’s current functionality and architecture, but it does not have the flexibility to support other types of structured data or display markup.

The Smithsonian’s Transcription Center is perhaps the most successful of these tools that are designed for extracting full text transcriptions from archival collections.



The Transcription Center generates JSON files from text entered into a single data field. Volunteers can utilize a WYSIWYG-like toolbar that applies some TEI-compliant markup but minimizes UI interference with the actual process of transcribing. Storing the data as JSON allows any type of data to live in one database field instead of being spread across several purpose-specific tables, and it can interact fairly easily with XML systems.
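The single-field approach is easy to demonstrate. In this sketch the field names are invented for illustration; the point is simply that heterogeneous transcription data can round-trip through one JSON-typed column:

```python
import json

# Field names here are invented; the idea is that markup, text, and
# workflow metadata can all live in one serialized database field.
record = {
    "page_id": "example-0001",
    "text": "Saw three herring gulls over the marsh.",
    "markup": [{"type": "taxon", "start": 10, "end": 23}],
    "volunteer_count": 4,
}

stored = json.dumps(record)    # what goes into the single database field
restored = json.loads(stored)  # what the application reads back out
print(restored["markup"][0]["type"])  # taxon
```

The trade-off is that the database can no longer query individual facets directly without JSON support, which is part of why the export format matters so much downstream.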


Or, almost. Unfortunately, and perhaps understandably considering the strength of the system, the Transcription Center is not available outside of the Smithsonian’s network. While Smithsonian Institution Libraries is a member of BHL, the need to include projects outside the scope of SI may be a significant drawback to integrating fully with the Transcription Center.



Finally, I discovered the Zooniverse. Originally designed for citizen scientists to extract structured data from extremely large data sets, the Zooniverse has recently embraced transcription and other humanities projects in its Scribe framework. Some of its recent forays include AnnoTate and Shakespeare’s World. The Zooniverse team has almost completely redesigned the model for a transcription platform, with varying degrees of success. Instead of inviting volunteers to type complete page transcriptions into a text box, they break up the workflow into three types of tasks: Mark, Transcribe, and Verify. Users Mark where they see text on the page to maintain the author’s explicit layout and formatting choices; a separate set of users Transcribe the text that was previously Marked, which preserves the relationship between pixels and text; and finally a third group reviews the Mark and Transcribe tasks for quality control. Output data can be harvested raw (from each task) or aggregated (from the whole set of Mark and Transcribe tasks for a given image) along with the level of Zooniverse’s confidence in the accuracy of the transcription. The output data is structured similarly to the Transcription Center’s (JSON), but is extracted as a CSV file, not via an RDBMS.
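The aggregation step can be pictured as a vote across volunteers. This is a sketch of the idea only, not Zooniverse’s actual aggregation code, and the transcription variants are invented:

```python
from collections import Counter

def aggregate(transcriptions):
    """Pick the most common version of a marked line and report agreement
    as a naive confidence score."""
    counts = Counter(t.strip() for t in transcriptions)
    best, n = counts.most_common(1)[0]
    return best, n / len(transcriptions)

# Three volunteers' versions of the same marked line (invented data).
line_versions = ["Ruffed Grouse, 2", "Ruffed Grouse, 2", "Ruffed Grouse. 2"]
text, confidence = aggregate(line_versions)
print(text, round(confidence, 2))  # Ruffed Grouse, 2 0.67
```

Real aggregation is considerably smarter (it aligns near-matches rather than requiring exact agreement), but even this naive version shows how redundancy across volunteers becomes a confidence measure.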

The Zooniverse relies on the concept of microtasking to break up labor intensive transcriptions that require high levels of intelligence and concentration.


From Victoria Van Hyning’s presentation at the Oxford Internet Institute

Splitting tasks into more manageable chunks with varying degrees of difficulty lets citizen scientists engage with the project at whatever level they desire. The idea relies on principles of gaming that ask for shorter time commitments in order to encourage volunteers to return. Breaking up the tasks also improves data quality by mitigating user fatigue and boredom. While the Scribe framework is currently in beta and does not come without its snafus, the Zooniverse has recently been awarded an IMLS grant to build out its audio and text transcription tools in 2017-2018.

[EDIT 3/16/2017: There are three directions of development for Zooniverse transcription platforms. Scribe, developed in partnership with New York Public Libraries for their project Emigrant City, breaks up the workflow into three explicit Mark, Transcribe, and Verify tasks; projects including AnnoTate and Shakespeare’s World utilize the microtasking functions that break up pages into lines and were developed using the Zooniverse Project Builder; and the third system (as in Operation War Diary) features interpretive tagging in addition to transcription.]

A final recommendation for a transcription tool will be largely informed by some of the choices that I and the other Residents propose for the future development of BHL. Will the appropriate data (keywords, dates, locations, etc.) continue to be mined from the full-text transcriptions? Or could there be some significant benefits to asking volunteers to pull out that structured information from the images in addition to transcribing? It could be useful to quickly provide access to this data in structured formats, but conversely, establishing a workflow for mining the text will allow staff more flexibility in determining what facets to include and to triage digitized items’ value additions independently from their transcription.

This is a very general overview of some of what I’ve discovered about transcription tools in the last few weeks. If you are familiar with or have used any of these tools, please leave a comment or shoot me an email! I am very interested in learning about both volunteers’ and libraries’ experiences with transcription projects.

Some resources that I found helpful:


Ben Brumfield’s blog “Manuscript Transcription” is a rich source for all types of discussions around transcribing documents.

“Crowdsourcing Transcription: FromThePage and Scripto.” The Chronicle of Higher Education, January 23, 2012.


Stephens, Rhiannon. “The DigiVol Program.” April 13, 2016.

Prater, Leonie. “DigiVol: Hub of Activity.” December 17, 2013.

Smithsonian Institution Transcription Center:

The entire issue 12:2 of Collections: A Journal for Museums and Archives Professionals is dedicated to the Transcription Center, and each article presents several important perspectives to consider.

The Zooniverse:

Bowler, Sue. “Zooniverse Goes Mainstream.” A&G, 54:1, February 1, 2013. DOI:

Kwak, Roberta. “Crowdsourcing for Shakespeare.” The New Yorker, January 16, 2017.

Van Hyning, Victoria. “Metadata Extraction and Full Text Transcription on the Zooniverse Platform.” Presentation to Linnean Society, October 10, 2016.

Van Hyning, Victoria. “Humanities and Text-based Projects at Zooniverse.” Presentation to Oxford Internet Institute, February 16, 2016.

Hello World!

Welcome to the NDSR at BHL blog!

Over the next 11 months we will be collaborating as National Digital Stewardship Residents on several projects to develop recommendations and best practices for enhancing tools, curation, and content stewardship for the Biodiversity Heritage Library. As recent graduates of Master’s programs in Library and Information Science, we are excited to contribute to the field of digital stewardship through our work on the Biodiversity Heritage Library and develop leadership skills through the Residency model.

Alicia Esquivel is the Resident at the Chicago Botanic Garden, where she is working on a content analysis of the quantity of literature in the field of biodiversity, the amount of that literature in the public domain, and the representation of each discipline within BHL, as well as an exploration of methodologies to scope the collections and identify areas where BHL may target development to better serve the research population.

Marissa Kings is the Resident at the Natural History Museum, Los Angeles County, where she is focusing on identifying high value tools and services used by large-scale digital libraries which might be applied to the next generation of BHL. She will also be exploring digitization workflows at NHMLAC and identifying items to be contributed to BHL.

Pamela McClanahan is the resident at Smithsonian Libraries where she will be conducting a user needs and usability analysis working with the larger taxonomic and biodiversity informatics community to determine user needs and services for providing increased value to BHL content. Pam will analyze this information and input to define recommendations and requirements for expanding the BHL digital library functionality.

Katie Mika, Resident at the Harvard University Museum of Comparative Zoology’s Ernst Mayr Library, is developing tools and methodologies for crowdsourcing full-text transcriptions and structured data from BHL’s manuscript items, including field notebooks, specimen collection records, correspondence, and diaries. Katie’s background is in Archives Management and building digital repositories to support description and access to digitized and born digital photograph, multimedia, and software collections.

Ariadne Rehbein is the Resident at the Missouri Botanical Garden, where she is focusing on natural history illustrations sourced from digitized biodiversity literature. Building upon the successful work of the “Art of Life” team members and citizen scientists, her project will incorporate user research and knowledge of digital scholarship to produce user interface requirements and a report on image discovery best practices.

As a cohort, we residents are collectively tasked with proposing options for substantial improvement to version 2 of BHL, on the understanding that the underlying data structures and metadata schemas will be somewhat, if not completely, rebuilt. We therefore have quite a bit of latitude to introduce cutting edge technology and incorporate various “wish list” features that BHL staff have collected over several months.

This blog will function as a dynamic record of our work with BHL and the NDSR program through December 2017. You can expect to read posts about our projects’ successes, challenges, and probably some failures in the next several months as well as some interesting discussions about biodiversity librarianship and content and data management in digital libraries. Occasionally we’ll also be posting about attending and presenting at professional conferences, participating at workshops, and engaging in other activities within the wider digital libraries community.

We also hope that this blog will serve as a tool to facilitate communication with other librarians and archivists and anyone interested in the future of BHL. To learn more about BHL or the NDSR program head over to the About page, which includes an overview of the IMLS supported “Foundations to Actions” grant that is funding each of our Residencies and the mission of the Biodiversity Heritage Library as well as some useful links.