Why Transcribe?

Digitization is not a new activity for libraries and cultural heritage institutions; it has become a critical tool for preserving and providing access to archival collections, including rare books, manuscripts, and photographs. The potential research value of digitized collections is not new either. However, translating images of content into machine-readable data that can be searched, sorted, and otherwise manipulated did not receive much attention until crowdsourcing, citizen science, and other community collaboration models and platforms emerged. A definition of transcription helps clarify some of the competing considerations in deciding whether and how to transcribe digitized items. Huitfeldt and Sperberg-McQueen distinguish between transcription as an act, as a product, and as a relationship between documents.1 Cultural heritage institutions need to explicitly facilitate the creation and dissemination of each in order to host a successful transcription program. While crowdsourcing methods directly address the act of transcription, libraries are often better suited to produce viable representations of transcription products and relationships in digital repositories. Crowdsourcing thus becomes one of several methods or tools for libraries to develop successful transcription workflows.

Image from William Brewster’s Diary from 1865 that identifies several birds by their common species names (http://biodiversitylibrary.org/page/40222552).

Transcription helps bridge the gap between digitization and use by enhancing access through full-text search, enriching metadata collection, and opening collections to digital textual analysis. Digitized natural history manuscript items are largely hidden because most archival collections lack item-level description. While minimal processing is certainly preferable to maintaining an extensive backlog of unprocessed material, digitized handwritten documents cannot be discovered through their unique content without a machine-readable version of the text. Indexing transcriptions facilitates discovery of historical records and improves catalog search results. Full-text transcriptions open digital collections to new types of searching, sorting, categorizing, and pattern finding. Research derived from these new data sets can illustrate changes over time across far larger collections and a wider range of information resources.
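As a rough illustration of why indexing transcriptions improves discovery, here is a minimal sketch of an inverted index in Python. It is not how BHL indexes its content; the page IDs, the sample text, and the function names are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of page IDs whose transcription contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page IDs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Invented sample transcriptions keyed by made-up page IDs (not real BHL data).
pages = {
    "page-001": "Saw two song sparrows near the river this morning.",
    "page-002": "Cold wind. No birds seen near the pond.",
    "page-003": "A song sparrow and a robin in the orchard.",
}

index = build_index(pages)
print(search(index, "song"))
print(search(index, "near river"))
```

Without a transcription, a page image contributes nothing to an index like this, which is why handwritten pages stay invisible to full-text search. A production search engine would add stemming, ranking, and name recognition on top of the same basic idea.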

Image and graph from CLIR report illustrating that the number of species observations recorded in field notebooks at a given location is typically larger than the number of specimens collected. From “Grinnell to GUIDs: Connecting Natural Science Archives and Specimens.”

This is particularly important for biodiversity heritage literature and archival collections because of their significant value in documenting species occurrences, botanical observations, climate patterns, and meteorological events. Transcriptions facilitate the manipulation of this data and support research that extracts knowledge from formal and informal collecting and observation events. The growing research interest in all types of natural history resources, including specimen records, species publications, and field recordings, can be further enhanced by integrating access to data and images across scientific disciplines, institutions, and resource types.2 The Biodiversity Heritage Library can connect content by pulling information from specimens, collecting events, and historical documentation into its portal and making it available to aggregators like GBIF and EOL. Initiatives for transcribing documents and records make this information available for digital use and should be developed as strategies optimized to support large-scale integration.

Transcription projects are time-consuming, intellectually intensive, and expensive for an organization to facilitate. Crowdsourcing, however, has been identified as a sustainable model for generating transcriptions for large collections and for institutions with diverse holdings, and it is an exciting way to gather metadata enhancements from a diverse range of users. Biodiversity research has a long history of relying on non-scientist community members to collect data. These citizen science programs and the resulting data are understood as a “public good that is generated through increasingly collaborative tools and resources while supporting public participation in science and Earth stewardship.”3 Tracking and understanding biodiversity at varying scales requires fine-grained data to be collected over regions and continents, years and decades. Professional scientists alone are not generally capable of delivering the volume of data, analysis, and interpretation needed to support large-scale biodiversity research questions.4 “Studying large-scale patterns in nature requires a vast amount of data to be collected across an array of locations and habitats over span of years or even decades.”5

Crowdsourcing transcriptions can be understood as a method of gathering data over wider geographical and temporal spaces. Once the existing data is transformed into a machine-readable format, field notes, collection lists, and observation notes become a powerful and rich source of biodiversity information. By transcribing field notes and generating structured data sets from them, scientists of yore can be recruited for current research projects. BHL’s content spans hundreds of years and the entire globe, creating a potential pool of observation data that can inform today’s research. Just as science departments have enlisted the public in creating scientific knowledge, crowdsourcing transcriptions creates global networks that can generate data to be analyzed for population trends, range changes, shifts in phenology, climate change, and more.

Transcriptions will allow BHL to extract data from digitized items to improve the discoverability of hidden collections. My NDSR project addresses goals similar to those of the Art of Life and Purposeful Gaming projects, which enriched item metadata to better facilitate access to collections. The Art of Life grant sought to “liberate natural history illustrations from the digitized books and journals in the online Biodiversity Heritage Library through the development of software tools for automated identification and description of visual resources.”6 Images in BHL were described structurally at the page level, facilitating navigation by human users and citation resolvers, but they lacked sufficient descriptive metadata to enable dynamic filtering and inquiry. The Art of Life grant project built new software tools and algorithms to automatically identify illustrations within the text pages of the BHL corpus and push those illustrations to crowdsourcing environments like Flickr and Wikimedia Commons for description. Similarly, full-text searching is significantly hampered by poor output from OCR software, and historic literature has proven particularly problematic because its varying fonts, typesetting, and layouts are difficult to render accurately. The Purposeful Gaming project was developed to find a way to quickly and efficiently harness large numbers of users to review and correct particularly problematic works by presenting the task as a game. Each project improves the discoverability of and access to digital texts by enriching descriptive metadata for items at the page level to support full-text searching, data mining, and markup of content in BHL collections.

The NDSR transcription project complements Art of Life and Purposeful Gaming by developing a similar method for generating machine-readable content that will enhance access to handwritten text, a final category of “hidden content” in BHL.

Screenshot from BHL Book Viewer for the Journals of William Brewster that shows the poor OCR output and the inability to index pages or Scientific Names without quality transcriptions (http://biodiversitylibrary.org/page/44700560).

The Internet’s speed, reach, temporal flexibility, anonymity, interactivity, and convergence bring people into conversation with each other, lower barriers to information by creating easier access to professional bodies of knowledge, increase access to useful tools, and enable an online participatory culture.7 By externalizing transcriptions of manuscript items we can leverage the collective intelligence and wisdom of crowds and bring a large and diverse set of skills, tools, and ideas to bear on archival materials and special collections. The Internet encourages ongoing co-creation of new ideas in which content is generated through a mix of bottom-up (from the people) and top-down (policy-makers, businesses, and media organizations) processes.8 Libraries are ideal institutions to encourage and utilize crowdsourcing initiatives due to their unique placement at the intersection of these processes. Libraries and cultural heritage institutions have the advantages of mission statements and codified ideologies dedicated to enriching the knowledge of the people, as well as the organizational structures to mobilize, energize, and capitalize reciprocally on the capabilities of their users. This symbiotic relationship is not only mutually beneficial, but is likely one of the spaces in which GLAMs can thrive in the digital age.


4 thoughts on "Why Transcribe?"

  1. A very interesting blog. With the growth of different platforms such as the Smithsonian Transcription Centre and the DigiVol project I am particularly interested to see how the Biodiversity Heritage Library solves the task of adding these completed transcriptions to the Biodiversity Heritage Library and any results from this integration. A lot of articles and studies have been done on crowdsourcing but very little (that I have seen) seems to have been done on the actual integration of the end product into a repository like BHL or the organisation that enabled the transcription. I would love to see an article that addresses that topic and includes information on the subsequent improvements to metadata that have resulted. But I think I’m getting ahead of the project! It seems to me that BHL should be considering two main issues. The first I’ve touched on above – the integration of transcriptions that have already been completed by other institutions & projects. BHL needs to take advantage of and integrate the results of other projects. I know the From the Page platform creator Ben Brumfield @benwbrum has recently had an issue concerning where to permanently store a completed transcription of several field books. It is all very well to have a transcription crowdsourced but if the institution stores it on their own servers it is frequently just as difficult to find and access as a book. BHL should have a system in place to be able to raise its hand and say “I’ll take that!”, to be a repository of biodiversity transcriptions as well as the digitised images of the field book itself. The second is how BHL will improve its OCR results, particularly with old or handwritten volumes. Although I did participate in the games that helped improve OCR content I much preferred transcribing with the Transcription Centre.

     My reason for this is that the games were too simple and they distanced me from the content – which was what I was interested in and my main reason for transcribing. If I wanted to have something simple to do I’d prefer to participate in ScienceGossip on Zooniverse or improving OCR text in Trove. Both these projects are easy but also require engagement and thinking about the content provided. But even those projects are second to my love of transcription of field books where I can fully engage in the journey the writer is undertaking and get to know the information given as well as have the challenge of solving the puzzle that is the individual’s handwriting. Transcription of handwritten journals is an adventure you undertake with the writer, time travelling to meet them, & requires a willingness to double check old species names, places and people which can lead to unanticipated discoveries. Unlike the Purposeful Gaming games it is not just typing. Any time you spend googling something can lead to unintended discoveries and this is one of my main motivations for doing transcriptions. BHL does need to improve the results of its OCR and also needs to be a repository for biodiversity transcriptions. I’ll be interested to see the results of your project and how BHL might solve these issues.


    • Thank you for your thoughtful comment Siobhan (as always!) – I’m certain that all of the reports generated from the NDSR projects with BHL will be made public in some fashion. I agree with much of what you’ve said, and it would be extremely valuable to see an article that focuses on the technical integration parts of crowdsourcing projects. I’ve discovered that many librarians and managers of GLAMs publish a lot on the process because developing a workflow and surveying the scene are early steps, and ones that can seem very overwhelming. It is nice to see that “expert” institutions and programs are starting to emerge such that smaller organizations can look for guidance or install an instance of an existing platform instead of starting from scratch.

      I’m certainly focusing on integrating existing transcriptions from other projects into BHL. It’s an important question because even if BHL hopes to implement a crowdsourcing program for transcriptions, there will always be member and affiliate institutions that prefer different methods/platforms. I’ve been working with Ari quite a bit recently because it could make sense to propose a coordinated plan for crowdsourcing activities instead of several separate projects for images and illustrations, transcriptions, and OCR corrections. While each might operate on a different platform, BHL might include a “landing page” of sorts to direct users and volunteers to content or to opportunities to contribute knowledge as part of a more synchronized project.


  2. Excellent post! I work mainly on identifying species in BHL’s Flickr albums, and I can easily understand why OCR would have a problem with these in particular. While the text is in a consistent font (usually), the illustrations often have engraved or handwritten type to identify the species. One of my wishes for BHL would be to see an aggregate of illustrations of different species which would link all the books as well. Right now, BHL has a searchable index of taxa that will lead to every work mentioning that taxon, which is fabulous! But the way BHL is searchable doesn’t parse out the illustrations, and the way that the illustrations are now, the text would be captured, but you wouldn’t know if there was also an illustration unless you go to the book and search around the page. Another bit of metadata that we discussed attaching to these illustrations is the geographic locality. EOL has a way of identifying habitat and range of species, and thankfully with certain machine tags in Flickr, we can add to this information which, for scientific purposes, helps with identifying shifts in habitat or loss of habitat. And perhaps most important are all the things that we cannot think of for research! One great thing about transcription is that it captures all the information and makes it searchable. Like Siobhan says, sometimes we just stumble across something wonderful in our Google searches. It would be great to have a searchable feature that is kind of like a word cloud, like they use on WordPress. Researchers could get a sense of how many times certain words or topics are repeated in a work which may help enlighten their research or even spark something new!


    • I agree that different types of data visualizations would be a wonderful feature! Hopefully that can be a next step once field notebooks are turned into texts and dynamic data instead of relatively static images. BHL is *very* interested in adding geo-references to its content. At the very least this would make it easier to add points on the map for GBIF and EOL species pages. I’m trying to figure out now if there’s a reasonable way to add this feature to transcriptions. It’s not very complicated to tag or encode text with location information, but how useful is this if we do it for field notes but can’t for the published literature?

      I hope BHL gets a coordinated transcription project up as soon as I finish up – I can’t wait to see what users figure out to do with the texts!


