Hello! We’ve been focusing on transforming our research into recommendation outlines that we presented to the BHL Tech Team last week. As we head into the final quarter of our residencies, we’ll be focusing on tweaking these ideas, developing workflows and proofs of concept, and finalizing our recommendations in a Best Practices White Paper by December. For this update, we wanted to give a preview of what some of these recommendations will look like and invite some preliminary feedback from the BHL NDSR Blog-o-sphere that we can consider as we move into these final months.
Last week, the Northeast Document Conservation Center (NEDCC) hosted its annual Digital Directions conference in Seattle, WA. The conference focuses on the creation and management of digital collections, and since one of my goals during my time as a Resident at the Natural History Museum of Los Angeles County (NHMLAC) is to create a project plan for digitizing materials, this seemed like a great place to get a foundation in the process. It also so happened that Seattle would experience the solar eclipse at 92% coverage, which was an added bonus!
This is a fairly incomplete post about the work that’s going on regarding adding BHL bibliography metadata to Wikidata. I hope to have several more of these posts before the end of the year!
Following some productive conversations on donating BHL bibliographic metadata to Wikidata, it became clear almost immediately that BHL’s data is not terribly useful without some serious munging. One of the biggest problems with BHL bibliographic metadata is that it comes from many different libraries and museums, legacy cataloging systems, and varying kinds of authority work. For example: BHL attaches Creator IDs to author names, which is useful for identification and for connecting titles and items to their authors, but the IDs are assigned automatically according to the character strings imported from specific fields in a library catalog’s MARC record. Despite (and perhaps because of) the use of varying authority files to control author name strings in institutional catalog records, different libraries have contributed items by the same author whose names are spelled, punctuated, and identified differently. BHL does not conduct authority control on its metadata, choosing instead to focus on improving access to items based on content rather than metadata. Fortunately, there are several ways to go about reconciling and disambiguating the data, and one of them is crowdsourcing.
BHL can use Wikidata to tell its users that “Packard, Alpheus S” (Creator ID: 82636), “Packard, A” (Creator ID: 59850), “Packard, A S” (Creator ID: 48286), “Packard, A. S. (Alpheus Spring), 1839-1905” (Creator ID: 1592), and “Packard, Alpheus Spring” (Creator ID: 56087) are all the same person without editing the spelling or legacy metadata from the catalog record.
One way is to use Wikidata as an identifier by adding a property for a BHL Creator ID in Wikidata (P4081) and adding a table in BHL for Wikidata identifiers that can be associated with those same Creator IDs. By adding identifiers, Wikidata becomes a more robust knowledge base that will improve the discoverability of BHL’s content by enriching its metadata externally and solving some metadata problems internally. While some of the reconciling can be done computationally using (still more) authority files, automated matching often misidentifies strings and isn’t very helpful when an author is not in that particular database. These errors are best caught by humans, whom Wikidata invites to edit mistakes directly and add identifiers. By adding Creator IDs to Wikidata and in turn adding Wikidata IDs to BHL, BHL can leverage the wisdom of the crowd to reconcile its author metadata.
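As a rough sketch of how such a crosswalk could work on BHL’s side (this is an illustration, not BHL’s actual implementation, and the QID below is a placeholder rather than Packard’s real Wikidata item), a table of Wikidata identifiers associated with Creator IDs lets all of the variant Packard records resolve to one person:

```python
# Hypothetical crosswalk: one Wikidata QID -> many BHL Creator IDs.
# "Q123456" is a placeholder QID, not an actual Wikidata item.
wikidata_to_creators = {
    "Q123456": {82636, 59850, 48286, 1592, 56087},  # the "Packard" variants
}

# Invert the table so any legacy Creator ID resolves to its shared QID.
creator_to_wikidata = {
    creator_id: qid
    for qid, creator_ids in wikidata_to_creators.items()
    for creator_id in creator_ids
}

def same_person(creator_a, creator_b):
    """True if two BHL Creator IDs resolve to the same Wikidata item."""
    qid_a = creator_to_wikidata.get(creator_a)
    return qid_a is not None and qid_a == creator_to_wikidata.get(creator_b)

print(same_person(82636, 1592))  # two Packard variants -> True
```

The legacy strings and Creator IDs are never edited; the Wikidata identifier simply sits alongside them as a statement of equivalence.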
In order to test this idea and start down a path that will hopefully lead to more BHL data in Wikidata, I worked with Andy Mabbett (User:Pigsonthewing) to add a representative set of 1,000 BHL Creator IDs to Wikidata, the first step of which was to disambiguate these authors. To procure a sample of 1,000 representative authors, I used the rbhl R package to interface with the BHL API and pull a random sample of authors with associated DOIs.1 The rbhl package is an rOpenSci tool and can be found on their GitHub. The R script I used can also be found on GitHub at: https://github.com/kmika11/BHL_Wikidata/blob/master/CreatorID/rbhl_CreatorIDScript.R . Once I was able to generate a table of author strings, Creator IDs, an associated title, and its DOI, I headed over to OpenRefine to start reconciling BHL Creator IDs. As you’ll remember from a few paragraphs ago, BHL doesn’t conduct authority control and relies instead on the work of partner institutions. This means that there are no external identifiers for authors in BHL. We chose to reconcile against VIAF IDs because VIAF has the most identifiers in Wikidata (for library resources at least). Once there were VIAF IDs, the Creator IDs could be added as P4081 property statements to author QIDs. The tool Mix’n’match makes the part of this process that requires some human thought pretty simple and somewhat fun!2
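To give a feel for what the reconciliation step involves, here is a minimal Python sketch of fuzzy string matching against an authority file (the VIAF ID below is invented, and the normalization is far cruder than what OpenRefine and Mix’n’match actually do):

```python
import difflib
import re

# Toy authority file: normalized name -> VIAF ID (the ID here is invented).
viaf = {
    "packard alpheus spring": "27861632",
}

def normalize(name):
    """Lowercase, strip life dates and punctuation, collapse whitespace."""
    name = re.sub(r"\d{4}-?\d{0,4}", "", name)  # drop dates like 1839-1905
    name = re.sub(r"[^\w\s]", " ", name)        # punctuation -> spaces
    return " ".join(name.lower().split())

def reconcile(author_string, cutoff=0.6):
    """Return the closest VIAF match for a raw BHL author string, or None."""
    matches = difflib.get_close_matches(
        normalize(author_string), viaf.keys(), n=1, cutoff=cutoff
    )
    return viaf[matches[0]] if matches else None

print(reconcile("Packard, A. S. (Alpheus Spring), 1839-1905"))
```

Automated matching like this is exactly where misidentifications creep in, which is why the human review that Mix’n’match provides matters.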
Now my next step is figuring out what the next steps are. There is some interest in adding New York Botanical Garden’s herbarium type specimens to Wikidata along with protologue literature from BHL, and perhaps field notebooks and other relevant collecting-event items. BHL also has quite a long list of taxon names (3,732,986 names) with metadata for the pages they come from. I don’t think it’s appropriate to push all of this data to Wikidata, but it is a significant dataset that could be useful in varying ways. Another issue is that resolving author strings to VIAF IDs is not an insignificant amount of work. Gerard Meijssen has raised the idea of using Open Library IDs, which are already resolved to VIAF and often to Wikidata, as a possible solution. BHL hosts its content on the Internet Archive, which is the creator of the Open Library. One would imagine it is a simple hop, skip, and a jump from BHL Creator IDs to Open Library IDs, but I’m still investigating whether that is, in fact, the case.
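If the crosswalks exist, the hop, skip, and jump amounts to chaining identifier tables, something like this toy sketch (every ID below is invented, and whether such clean one-to-one tables exist is exactly what is still being investigated):

```python
# Hypothetical crosswalk tables; all IDs below are invented placeholders.
bhl_to_openlibrary = {1592: "OL123456A"}
openlibrary_to_viaf = {"OL123456A": "27861632"}
viaf_to_wikidata = {"27861632": "Q123456"}

def chain(mapping_tables, key):
    """Follow a key through a sequence of crosswalk tables; None on any gap."""
    for table in mapping_tables:
        key = table.get(key)
        if key is None:
            return None
    return key

qid = chain([bhl_to_openlibrary, openlibrary_to_viaf, viaf_to_wikidata], 1592)
print(qid)
```

The `None`-on-gap behavior is the crux: any Creator ID that falls out of the chain still needs the manual reconciliation described above.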
Please jump in with any thoughts about Wikidata + BHL or what I’ve described above. I know that WordPress is not terribly conducive to discussion, but that’s how we’re set up for now. I do not claim to have an expert level grasp of Wikidata yet (or BHL for that matter), but this collaboration seems to be a constructive Open Data pursuit!
1. During this step I incorrectly assumed that BHL minted DOIs for all its content including individual articles. BHL does mint DOIs for monographs, and worked with BioStor to add 12,000 DOIs for articles.↩
2. The manual for using Mix n’ Match can be found at: https://meta.wikimedia.org/wiki/Mix%27n%27match/Manual ↩
Last week kicked off the six-day library extravaganza known as ALA Annual. The conference, hosted by the American Library Association, was held in Chicago, IL to discuss, learn, and exchange ideas about libraries on the theme “Transforming Our Libraries, Ourselves.” With 25,000 attendees, masses of sessions and talks, and a mountain of freebies, ALA can be an overwhelming experience, but we managed to find our way, and wanted to share what we did and learned there.
One of our main goals was to present our “Halfway Remarks” poster on behalf of all of the BHL NDSR Residents. Alicia Esquivel of the Chicago Botanic Garden and Ariadne Rehbein of the Missouri Botanical Garden attended and presented.
Last week I attended the Inaugural Digital Data in Biodiversity Research Conference, sponsored by iDigBio, the University of Michigan Museum of Zoology, the University of Michigan Herbarium, and the University of Michigan Museum of Paleontology. The conference brought together biodiversity researchers, data providers, data aggregators, collection managers, and librarians. We talked about creating digital biodiversity data, sharing it, and using it in research.
The presentations, posters and workshops highlighted research trends in biodiversity and projects that have open access missions similar to BHL’s. I was able to give a talk about using statistical analysis to calculate the size of biodiversity literature and present a poster about visually representing the collection at BHL.
The other week I participated in WikiCite 2017, a conference, summit, and hackathon event organized for members of the Wikimedia community to discuss ideas and projects surrounding the concept of adding structured bibliographic metadata to Wikidata to improve the quality of references in the Wikimedia universe. As a Wikidata editor and a librarian, I was pumped to be included in the functional and organizational conversations for WikiCite and learn more about how librarians and GLAMs can contribute.
The Basics (briefly and criminally simplified)
Galleries, Libraries, Archives, and Museums are institutions that collect, preserve, and make available information artefacts and cultural heritage items for use by the public. Before databases, librarians managed card catalogs to facilitate access; these were later translated into MAchine Readable Cataloging (MARC) digital records to create online catalogs (ca. 1970s-2000s). As items in collections are digitized, librarians et al. add descriptive, administrative, and technical/structural metadata to records and provide access to digital surrogates via a digital library or repository, depending on copyright. Metadata, however, is generally not subject to copyright and is often published by GLAMs for analysis and use via direct download, APIs, and, in more and more cases, as Linked Open Data. As a field, we’re still at the beginning of this transformation to Linked Open Data and have significant questions still to answer and thorny issues still to resolve.
Wikidata is a source of machine-readable, multilingual, structured data collected to support Wikimedia projects, licensed CC0 under the theory that simple statements of fact are not subject to copyright. Wikidata items are composed of statements that have properties and values. In the Linked Open Data world, these items are graphs with statements expressed as triples. As Wikimedians and Wikidata editors added more of this supporting structured data, the idea of adding bibliographic metadata to Wikidata started coming up. Essentially: “Here are some great structured data that are incredibly important to the functionality of Wikipedia; how can we add them to this repository that we’re creating in a usable way?” As many librarians (and really anyone who’s written a substantial research paper) are aware, citations are complicated.
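That item/statement model can be sketched minimally as triples. P31 (instance of), P214 (VIAF ID), and P4081 (BHL creator ID) are real Wikidata properties; the QID and the values attached to it below are placeholders:

```python
# A Wikidata item modeled as subject-property-value triples.
# "Q123456" is a placeholder item; the property IDs are real Wikidata ones.
triples = [
    ("Q123456", "P31", "Q5"),         # instance of: human
    ("Q123456", "P4081", "1592"),     # BHL creator ID (value illustrative)
    ("Q123456", "P214", "27861632"),  # VIAF ID (value invented here)
]

def values_for(item, prop):
    """All values a given item holds for a given property."""
    return [v for s, p, v in triples if s == item and p == prop]

print(values_for("Q123456", "P4081"))
```

Bibliographic metadata donated to Wikidata ends up in exactly this shape, which is what makes it queryable alongside everything else in the graph.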
For my NDSR project at Smithsonian Libraries, I’ll be gathering feedback from the users of BHL to help inform the next version of the digital library. I’ve had the opportunity to meet with several partners of BHL and sit in on BHL Member, Collection, and Tech Team meetings. Through these interactions, I’ve been able to identify three main groups to solicit feedback from for my research: (1) Consortium Users; (2) System Users; and (3) Individual Users.
1. Consortium Users: Contributors to BHL, including Members, Affiliates, Partners, staff, and volunteers
2. System Users: Organizations or individuals who interact with BHL for the purpose of enriching another system, whether via APIs (Application Programming Interfaces) or manually
3. Individual Users: Anyone visiting the BHL website to search for information to answer their research needs, such as scientists, collection managers, librarians, etc.
As a consortium of natural history and botanical libraries, BHL is made up of Members and Affiliates. Each of these consortium users is committed to the mission, vision, and key values of BHL, centered around free and open access to biodiversity literature.
Staff at these partner institutions participate in various ways, including scanning their biodiversity resources for addition to BHL and taking part in BHL working groups and committees as needed. As I’ve sat in on the meetings of some of these committees, I’ve been able to learn more about the BHL Members and Affiliates and their needs. The majority of these members are museums or libraries serving their own users as well. Members are looking for ways to streamline the process of digitizing their materials into BHL and of promoting and accessing their content through BHL.
Another important group is volunteers, whether working through Member institutions or with BHL as a whole. They assist with scanning and uploading content, managing social media, tagging illustrations with taxon names, and participating in crowdsourced transcription efforts, among other activities. These endeavors increase the visibility of BHL and enhance its content. Check out a couple of BHL’s most active volunteers on social media: Siobhan Leachman and Michelle Marshall and her Historical SciArt.
This post is brought to you by the BHL NDSR Cohort. I, Alicia, introduce our conference-packed month of April. Next, Ariadne recaps our DPLAFest presentation, followed by Pam’s overview of our NDSR Symposium panel discussion. Lastly, Marissa and Katie offer some feedback and reflections from our first round of presentations.
April was a busy month for all of us residents! We attended and presented at two conferences in two different cities: first, at the 4th annual DPLAFest in Chicago and then the NDSR Symposium in Washington D.C. the following week.
DPLAFest is organized by DPLA, the Digital Public Library of America, which provides free digital materials from America’s libraries, archives, museums, and cultural heritage institutions. The DPLA network is built on a “hub” model, which brings together digitized and born-digital content from across the country into a single access point. BHL serves as one of the content hubs for DPLA, which means BHL content gets passed along to DPLA. Our work with BHL connected mainly to the DPLAFest themes of digital libraries, open access content, and collaboration across types of institutions.
The 385 acres of the Chicago Botanic Garden (CBG) could not be maintained without the work of dedicated staff, hundreds of volunteers, and careful data management. During my residency at CBG, my mentor, Leora Siegel, arranged an introductory meeting with the head of Living Plant Documentation, Boyce Tankersley, to help me understand how the management of over 2.6 million plants is possible.
One of the few botanic gardens with AAM (American Alliance of Museums) accreditation, the Chicago Botanic Garden maintains records much like museums do; however, the collection items at CBG happen to be living (and thus can die, move, create new items, etc.). Each plant that enters the collection is given an accession number and is deemed either a member of the permanent collection or given “seasonal” status as part of a temporary collection (like the orchids that were on view in the orchid show that closed at the end of March). All of this data is managed through an internal database.
Digitization is not a new activity for libraries and cultural heritage institutions, and indeed has become a critical tool for preserving and providing access to archival collections including rare books, manuscripts, and photographs. The potential research value of digitized collections is also not a new phenomenon. However, translating images of content into machine readable data that can be searched, sorted, and otherwise manipulated had not received much attention until crowdsourcing, citizen science, and other types of community collaboration models and platforms were constructed. A definition of transcription is useful to understand some of the competing elements when considering whether and how to transcribe digitized items. Huitfeldt and Sperberg-McQueen distinguish between transcription as an act, as a product, and as a relationship between documents.1 Cultural heritage institutions need to explicitly facilitate the creation and dissemination of each in order to host a successful transcription program. While crowdsourcing methods directly address the act of transcription, libraries are often better suited to produce viable representations of transcription products and relationships in digital repositories. Crowdsourcing thus becomes one of several methods or tools for libraries to develop successful transcription workflows.
Transcription helps bridge the gap between digitization and use by enhancing access through full-text search, enriching metadata collection, and opening collections to digital textual analysis. Digitized natural history manuscript items are largely hidden due to the lack of item-level description for most archival collections. While minimal processing is certainly the better option compared to maintaining an extensive backlog of unprocessed material, digitized handwritten documents are not discoverable by their unique content without a machine-readable facsimile. Indexing transcriptions facilitates discovery of historical records and improves catalog search results. By offering full-text transcriptions, digital collections are opened up to new kinds of searching, sorting, categorizing, and pattern finding. Research derived from these new data sets can illustrate changes over time at much larger scales and across more types of collections and information resources.
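To make the indexing idea concrete, here is a minimal sketch of full-text indexing over transcriptions (the document IDs and text below are invented, and real repository search uses far more sophisticated engines):

```python
import re
from collections import defaultdict

# Toy transcriptions; IDs and text are invented for illustration.
transcriptions = {
    "fieldbook-001": "Collected three specimens of Danaus plexippus near the river.",
    "fieldbook-002": "Heavy rain; no specimens collected today.",
}

# Build an inverted index: word -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in transcriptions.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(doc_id)

def search(*terms):
    """Return IDs of documents containing every search term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

print(search("specimens", "collected"))
```

Until a transcription exists, neither notebook above is findable by the species name or collecting conditions it records; once indexed, both are.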