Defining the Scope of Biodiversity Literature

One of the first steps in performing a collection analysis is to define the scope of the collection. While my project focuses on analyzing the BHL corpus, that collection represents only a subset of all biodiversity literature. Once the scope of biodiversity literature is defined, we can start to understand the coverage of the BHL collection and identify gaps to target for future digitization.

The term “biodiversity” is a contraction of “biological diversity,” first used in 1986 during the planning meeting for the National Forum on BioDiversity.1 Simply put, biodiversity is “the variability among living organisms from all sources including, inter alia, terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part; this includes diversity within species, between species and of ecosystems.”2 In other words, all living things and their environments: quite a large scope.

To define this subject area for digitization purposes, the BHL Collections Committee created a Collection Development Policy that defines specific areas of interest to digitize based on BHL user needs. The committee defines BHL users as an interdisciplinary audience composed of “zoologists, botanists, evolutionary biologists, taxonomists, systematists, ecologists, natural history collections managers, scientific illustrators, biological science historiographers, and amateur scientists & hobbyists.”3 Based on these interests, the committee created an infographic to help distinguish between relevant content or “core literature” topics (blue) and supporting literature topics (green).


The types of information these subjects may cover include: species descriptions, distribution records, climate records, history of scientific discovery, information on extinct species, scientific observations, scientific illustrations, and ecosystem profiles. These types of data can be published in monographs (books) and serials (journals), or unpublished in field notebooks and diaries (handwritten records4).5

Now that we know what biodiversity literature is, how much of it is out there? That is the big question: how do you count all of the written materials, both in and out of copyright, about all living things and all the environments they inhabit? We can make some educated guesses.

In 2010, members of the BHL Collections Committee estimated the core literature (botanical, mycological and zoological) to consist of 495,000,000 pages.6 This estimate was calculated by first sizing the core literature in botany, which was chosen over zoology and mycology because of its superior documentation. Two extensive bibliographies were chosen for assessment: Taxonomic Literature, 2nd Edition (TL-2),7 which documents monographs published from 1753 to 1940, and Botanico-Periodicum-Huntianum (BPH), which documents serials published from 1665 to the present. The number of volumes in each was estimated (37,600 in TL-2 and 30,000 in BPH).8 The average number of pages per volume, based on a sample, was then used to estimate the total pages in each (15,040,000 in TL-2 and 96,000,000 in BPH).
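
The per-volume averages are not stated here, but they can be backed out from the figures given. Here is a minimal Python sketch, assuming each page total is simply the number of volumes multiplied by an average page count (the variable names are mine, not the committee's):

```python
# Figures quoted in the post for the two botanical bibliographies.
tl2_volumes, tl2_pages = 37_600, 15_040_000
bph_volumes, bph_pages = 30_000, 96_000_000

# Implied average pages per volume (assumes pages = volumes * average).
tl2_avg = tl2_pages / tl2_volumes
bph_avg = bph_pages / bph_volumes

# Combined botanical estimate across monographs and serials.
botany_pages = tl2_pages + bph_pages

print(f"TL-2: {tl2_avg:,.0f} pages per volume")
print(f"BPH: {bph_avg:,.0f} pages per title")
print(f"Botanical core literature: {botany_pages:,} pages")
```

Under that assumption, the sample implies about 400 pages per TL-2 volume and about 3,200 pages per BPH serial title, for a combined botanical estimate of 111,040,000 pages.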

The estimate of botanical literature is then used to estimate mycological and zoological literature based on the number of scientifically defined species. By determining the ratio of known botanical species to known mycological and zoological species9 (310,129 botanical species, 98,988 mycological species and 1,424,153 zoological species), the number of pages per species can be estimated for each category. Using these estimates, total pages for all species would be 497,574,779.
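
One way to read this step is as simple proportional scaling: work out pages per botanical species and apply that rate to the other two groups. The sketch below assumes that reading, plus a botanical total equal to the sum of the TL-2 and BPH page estimates. It is my own illustration, not the committee's actual calculation. This naive version lands well above 497,574,779, so the committee's method presumably involved adjustments that are not described in this post.

```python
# Species counts quoted in the post (Chapman 2009).
species = {
    "botanical": 310_129,
    "mycological": 98_988,
    "zoological": 1_424_153,
}

# Assumption: botanical pages = TL-2 estimate + BPH estimate.
botany_pages = 15_040_000 + 96_000_000

# Pages per known botanical species.
pages_per_species = botany_pages / species["botanical"]

# Naive scaling: apply the same pages-per-species rate to every group.
estimates = {group: round(count * pages_per_species)
             for group, count in species.items()}
total_pages = sum(estimates.values())

for group, pages in estimates.items():
    print(f"{group}: {pages:,} pages")
print(f"total: {total_pages:,} pages")
```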

This estimate can give us an idea of the size of the scope of biodiversity literature (and a number of pages to aspire to), but BHL is also interested in assessing areas of coverage based on geographic, taxonomic, and subject/discipline data points. Some methods I am exploring include:

  • using Library of Congress Subject Headings to determine distribution of BHL collections (there is some very exciting research about turning LCSH into hierarchical trees for browsing and searching collections that I would like to replicate for BHL!)
  • using taxonomic name data10 to analyze species coverage
  • using comprehensive subject bibliographies to assess coverage (BHL librarians performed a pilot study in 2015 using pteridological literature).
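
As a rough illustration of the third method, coverage against a bibliography can be expressed as the share of its titles already in the collection. The titles and the `coverage` helper below are hypothetical placeholders, not BHL data or an existing BHL tool. Real matching would also need to reconcile variant titles and editions.

```python
def coverage(bibliography, holdings):
    """Return (share of bibliography titles held, sorted list of gaps)."""
    # Normalize case and whitespace so trivially different strings match.
    wanted = {title.strip().lower() for title in bibliography}
    held = {title.strip().lower() for title in holdings}
    if not wanted:
        return 0.0, []
    gaps = sorted(wanted - held)
    return len(wanted & held) / len(wanted), gaps


# Hypothetical titles for illustration only.
bibliography = ["Fern Flora A", "Fern Flora B", "Fern Journal C", "Fern Monograph D"]
holdings = ["fern flora a", "Fern Journal C", "Unrelated Title"]

share, gaps = coverage(bibliography, holdings)
print(f"coverage: {share:.0%}")
print("gaps:", gaps)
```

The list of gaps is the useful part for digitization planning, since it names the titles still to be scanned.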

Stay tuned to read more about these projects as they develop!

1. [Issues in Science and Technology Librarianship]
2. [Convention on Biological Diversity]
3. [Collection Development Policy]
4. [See Katie’s previous post about using transcription tools for handwritten documents!]
5. [A large part of biodiversity information consists of datasets, which fall outside BHL’s collecting scope because that scope focuses on literature.]
6. [For reference, BHL currently (as of 2/22/17) has 51,362,213 pages online; check the counter at the bottom right of the BHL portal for updated stats!]
7. [For further info on turning TL-2 into a database (like BPH online)]
8. [BPH online estimates 34,000 titles. The increase may be due both to new journals indexed in the six years since this estimate and to some degree of error in the BHL estimate. But hey, they got pretty close!]
9. [According to A.D. Chapman’s Numbers of Living Species in Australia and the World, 2nd edn (2009).]
10. [BHL uses Global Names Architecture’s Global Names Recognition and Discovery (GNRD), a taxonomic name recognition algorithm, to search through all of the texts digitized in BHL and extract the scientific names.]


7 thoughts on “Defining the Scope of Biodiversity Literature”

  1. Great post! I’m excited to see what your research comes up with, especially as it concerns taxonomy. One thing I’ve noticed in the use of GNRD in BHL is that it misses species because of the way that formatting has changed. I was trying to locate one today (Agaricus ruthae), and it wasn’t caught by the search. I stumbled across a page mentioning this species here: Interestingly, it does catch the genus, but of course, there could be any number of species! Just thought I’d share that with you! If you have any other questions, please let me know! - Michelle Marshall 🙂


    • Hi Michelle! You’re right, GNRD does miss taxonomic names in BHL. If you look at the OCR from your example page, you can see that the name is being read as “ruthee” because of the æ (ash) letter. If the OCR isn’t correctly identifying words, there’s no way GNRD can catch them.

      For fun, I decided to run the PDF through GNRD directly to see if their OCR would catch anything that BHL’s didn’t (our OCR is performed by Internet Archive, which uses ABBYY, while GNRD uses Tesseract). Unfortunately, Agaricus ruthae still wasn’t recognized. Only 6 names were found:
      Nolanea infula
      Volvaria temperatus

      I then got curious about OCR software tools for other languages, specifically older languages and texts. I found this tool from Ryan Baumann for performing OCR on Latin texts, catering to fonts and characters from 1500-1800 (your example is from 1879), including the ash letter. I downloaded the beta version and the Latin training data and ran the page for English and Latin OCR. Unfortunately, Agaricus ruthae still wasn’t recognized! It would be interesting to see whether OCR tools for other languages that use the ash letter would read this species name correctly. I would guess the italicization is giving these OCR tools a hard time as well. Thanks for your comment!


      • Hey Alicia and Michelle! I just wanted to throw out there that part of the development for a transcription tool is going to include a method for correcting text. This is likely going to cover the OCR-generated text in addition to the crowdsourced content. Users will be able to make corrections to the OCR and transcriptions, and the content will be reindexed once submitted. It’s not a perfect solution, but it will be an improvement!


      • Thanks so much for the feedback and discussion! It’s amazing what OCR can do and how it will evolve. Those fancy letters, like the ash letter, prove to be problematic in other ways too, such as misspellings across taxonomic databases.


  2. Two other things to keep in mind about OCR quality: 1) Quality can depend on when the OCR file was generated. I checked IA, and their OCR file for this item was generated in 2009. OCR software has improved a lot since then, and we might get a different result if the item were run through a current version of ABBYY software. 2) OCR software settings (such as resolution, color mode, and compression) will affect quality (see this article). IA typically uses software settings that are optimized for higher throughput (i.e. large quantities), which can sometimes result in lower quality.

