Digital Data in Biodiversity Research

Last week I attended the Inaugural Digital Data in Biodiversity Research Conference sponsored by iDigBio, the University of Michigan Museum of Zoology, the University of Michigan Herbarium, and the University of Michigan Museum of Paleontology. The conference brought together biodiversity researchers, data providers, data aggregators, collection managers, and librarians. We talked about creating digital biodiversity data, sharing this data and using it in research.

The presentations, posters and workshops highlighted research trends in biodiversity and projects that have open access missions similar to BHL’s. I was able to give a talk about using statistical analysis to calculate the size of biodiversity literature and present a poster about visually representing the collection at BHL. 


I was excited to share my work on content analysis at BHL, and my progress on estimating the size of biodiversity literature (in other words, what is not in BHL?), with others in the digital biodiversity community. Answering this seemingly unanswerable question can help BHL understand its collection development goal and determine how close it is to reaching it.

In order to answer this question for BHL, I am borrowing a method from population ecology: capture-mark-recapture (CMR). In its most basic application, capture-mark-recapture requires you to collect (or capture) a sample of a population, mark that sample, release it back into the full population so it mixes back into the wild, and then collect a second sample. The second sample will contain a mixture of newly captured specimens and recaptured specimens (those carrying marks from the first capture). The proportion of marked specimens in the second sample is assumed to be representative of the proportion of the full population that the first sample made up, which allows the full population size to be estimated.
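As a rough illustration of that ratio, here is a minimal Python sketch of the simplest two-sample estimate, commonly called the Lincoln-Petersen estimator (one of the models named later in this post). The function name and the sample numbers are made up for illustration and are not results from my project.

```python
def lincoln_petersen(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Estimate total population size from two samples.

    first_sample  -- number captured and marked in the first sample
    second_sample -- number captured in the second sample
    recaptured    -- number in the second sample that already carried a mark
    """
    if recaptured <= 0:
        raise ValueError("at least one recapture is needed to form an estimate")
    if recaptured > min(first_sample, second_sample):
        raise ValueError("recaptures cannot exceed either sample size")
    # recaptured / second_sample is assumed to equal first_sample / population,
    # so population is approximately first_sample * second_sample / recaptured.
    return first_sample * second_sample / recaptured


# Hypothetical numbers: mark 100, capture 80 the second time, 20 of them marked.
estimate = lincoln_petersen(100, 80, 20)
print(estimate)  # 400.0
```

The fewer recaptures there are relative to the sample sizes, the larger the estimated population, which is why the overlap between samples matters so much.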

While this method has predominantly been used in biology, it has recently been adapted to bibliometrics to estimate the size of a particular body of literature. For example, in order to determine a stopping point for a systematic review, Kastner et al. (2009) used the capture-mark-recapture method to estimate the total number of articles in the domain of clinical decision support tools for osteoporosis disease management by querying four large bibliographic databases and analyzing their subsets.

In 2013, Lane et al. applied the same method to patient rounding practices in critical care medicine by querying four bibliographic databases. Khabsa and Giles (2014) used capture-mark-recapture methods to estimate the number of scholarly documents available on the web by analyzing counts from Google Scholar and Microsoft Academic Search. And in 2010, Ariño used probabilistic models to estimate the size of specimen-level data existing in natural history collections around the world by crossing data from multiple sets of biodiversity collection lists, finding commonalities between them, and estimating the likelihood of totally obscure data from the fraction of known data missing from specific datasets in the set. While each study applied a different model (such as the Lincoln-Petersen model, a Poisson regression model, or the Seber probabilistic model), they all follow the same general CMR approach.

Each of the example studies I am basing my work on had a narrower focus than BHL’s collection scope. BHL’s collection development policy states that BHL focuses on “materials most relevant to the study of biodiversity” and defines biodiversity as

“the variability among living organisms…and thus, includes all levels of organismic organization, from genes to ecosystems, as well as other disciplines affecting the study of the biodiversity of life on Earth.”

In order to explore this methodology in a manageable fashion, it is necessary for me to choose a subset of biodiversity literature, or a stratified random sample, instead of estimating the size of all biodiversity literature at once. The first step in my process is to define the “area” of my study. Next, I have to determine how to search for the organism. And finally, I will determine which statistical model to use.

The chosen “organism” can essentially be anything, and a timeframe should be specified that reflects the research of the organism. Since I am based at a botanic garden, choosing an organism from the plant kingdom seemed appropriate.

The genus Artemisia has somewhere between 200 and 400 species, including sagebrush, wormwood, and mugwort, and is one of the largest genera in the family Asteraceae. It spans the eastern and western hemispheres, won Herb of the Year from the International Herb Association in 2014, and its species have been mentioned in ancient Chinese texts and the Bible. The diversity of the genus, its geographic span, and the culinary uses of some species will allow me to encounter and work through many foreseeable issues with this statistical model, such as:

  • identifying common names in other languages
  • excluding irrelevant topics such as Italian baroque painting (Artemisia Gentileschi) and Greek mythology
  • excluding irrelevant content such as recipes (tarragon and absinthe).

After choosing a sample organism, it is important to understand the research trends surrounding that organism in order to determine an appropriate time frame to search.

For possible future iterations of this process at BHL, I suggest consulting a subject specialist for guidance here. Given my limited time frame, however, I relied on my own subject research and consultation with my project mentor at CBG in choosing the years 1950-1980 for my sample search. In preliminary searches into the genus Artemisia (using Colin W. Wright’s Artemisia as a reference guide), I came across multiple sources from the 1970s. Additionally, in 1972, the antimalarial compound artemisinin was isolated from Artemisia annua.

Now that I have selected the area of my study, I must determine how to search for the organism. Each search will be as extensive as possible: an attempt to gather all literature about the genus Artemisia from 1950-1980 that would be relevant to BHL. By conducting multiple searches in different places, I will essentially create a bibliography that I can compare to materials already in BHL. The significance of the collected citations lies in their overlap, because the number of items found by more than one search is what the CMR estimate is built on. Materials with limited access online, born-digital materials, and print materials are all included in my search.

The first place I have searched is WorldCat, a global network of library catalogs. I created a list of keywords that includes common names and misspellings of those names, and excluded keywords such as Gentileschi, painting, art, and mythology. I searched only for books and serials from 1950-1980. I exported all of the search results to EndNote, and after clearing duplicates I have 1706 items remaining for this search.
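To make the include/exclude idea concrete, here is a small Python sketch that assembles a generic boolean search string. The exact operator syntax differs between WorldCat and Google Scholar, so the `OR`/`NOT` form below is an assumption rather than either system's documented syntax, and the keyword lists are only a short illustrative subset.

```python
def build_query(include: list[str], exclude: list[str]) -> str:
    """Join included terms with OR and append each excluded term with NOT."""
    included = " OR ".join(f'"{term}"' for term in include)
    excluded = " ".join(f'NOT "{term}"' for term in exclude)
    # strip() handles the case where there are no excluded terms
    return f"({included}) {excluded}".strip()


# Illustrative subset of keywords, not the full list used in my search.
include_terms = ["Artemisia", "sagebrush", "wormwood", "mugwort"]
exclude_terms = ["Gentileschi", "painting", "mythology"]

print(build_query(include_terms, exclude_terms))
```

Keeping the keyword lists in one place like this makes it easier to run the same query against each source, which matters because the overlap between sources is only meaningful if the searches are comparable.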

I used the same query in Google Scholar which returned about 17,000 results. I’m currently encountering some issues with collecting all of these results at once, and I am exploring using the tool Publish or Perish to gather the 1000 most cited results.

The bibliographies from WorldCat and Google Scholar will be compared to each other and to the list of relevant Artemisia materials in BHL. Since BHL indexes scientific names through Global Names Recognition and Discovery, a list of materials about Artemisia can be pulled from that index. However, because each species in Artemisia is indexed separately (as seen here), I will most likely pull metadata from the scientific names data export, not the portal.
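One way to compare two citation lists is to reduce each record to a normalized key and count the keys that appear in both. The Python sketch below assumes each record is a (title, year) pair and that matching on a cleaned-up title plus year is good enough; real exports would likely need fuzzier matching. The function names and sample records are hypothetical.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def overlap_counts(list_a, list_b):
    """Return (unique items in A, unique items in B, items found in both)."""
    keys_a = {(normalize_title(title), year) for title, year in list_a}
    keys_b = {(normalize_title(title), year) for title, year in list_b}
    return len(keys_a), len(keys_b), len(keys_a & keys_b)


# Made-up records standing in for two exported bibliographies.
worldcat = [
    ("The Genus Artemisia", 1975),
    ("Sagebrush Ecology", 1962),
    ("Wormwood: A Study", 1958),
]
scholar = [
    ("the genus artemisia.", 1975),
    ("Mugwort Pollen", 1971),
]

print(overlap_counts(worldcat, scholar))  # (3, 2, 1)
```

The three counts are the inputs a two-sample CMR model needs: the size of each list and the number of items they share. With three sources (WorldCat, Google Scholar, and BHL), the same comparison could be run pairwise.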

The last step is to analyze these bibliographic lists and apply the appropriate statistical model given the conditions of the study. The example studies used Lincoln-Petersen, Seber probabilistic and Poisson regression models.

Check out the full slide presentation here and feel free to leave comments, questions, or suggestions here on the blog or email me directly at esquivelndsr (at) gmail (dot) com.

It was great to hear about other projects and issues in digital biodiversity data. There seemed to be a lot of conversations about digital pipelines and workflows and how different types of data move through different stages.

For example, the precursor to the analog-to-digital pipeline represented in this pyramid is the literal digitization process (scanning images, entering metadata into a digital repository, scanning literature, etc.), which turns specimens, literature and analog data into digital data. In order to become information, this data must be cleaned, structured and indexed so that it is usable and shareable. This information then has to be analyzed in order to become knowledge.

Biodiversity data currently exists in each of these states along the digital pipeline–as data, information or knowledge. There are specimen data waiting to be digitized, digitized data waiting to be normalized, information waiting to be analyzed and knowledge waiting to be published. Because of this state of multilevel existence, those in the biodiversity field are working towards solutions for multiple digitization stages at once.

For instance, the Paleobiology Database and MorphoSource are digital repositories for biodiversity data that also act as active databases. Each is a project-based archive that allows users to store and organize current data and use others’ data in the same environment. This type of multistage problem solving seems prevalent in digital biodiversity data, as does an interest in shortening or reworking the field-to-digital pipeline.

Dori Contreras of the University of California Museum of Paleontology presented on her field work and the process of getting her specimens and metadata into her research institution. She tested two different processing methods, batch processing and integration of tasks, and reorganized and customized her workspace to increase efficiency.

One of the most surprising discoveries from the conference was the wide range of data types and data formats being used in organismal biology. I was expecting to hear a lot about specimen and species occurrence datasets (which we did) but did not realize how prevalent 3D data is in this field.


Bothragonus swanii from the #ScanAllFish Project.

It was also exciting to hear about the Macaulay Library at the Cornell Lab of Ornithology, which holds photographs and audio and video recordings of birds that can inform behavioral research in ways that collected specimens cannot.

While larger datasets and more data points can support better research, they do not come without pitfalls. It is important to understand how to properly structure data and how to recognize bias in data. These issues were discussed at the conference as well, including in a presentation by Katelin D. Pearson of Florida State University on bias in botanical data and a presentation by Joan Damerow of the Field Museum of Natural History on the analysis of taxonomic data quality in GBIF.

For more information about the Inaugural Digital Data in Biodiversity Research Conference, head to the conference wiki to find slide presentations and session recordings.



Mastodon skull from the UM Paleontology Department.

Works Cited:

Ariño, A. H. (2010). “Approaches to estimating the universe of natural history collections data.” Biodiversity Informatics, 7: 81–92.

Kastner, M., Straus, S. E., McKibbon, K. A., and Goldsmith, C. H. (2009). “The capture-mark-recapture technique can be used as a stopping rule when searching in systematic reviews.” Journal of Clinical Epidemiology.

Khabsa, M., and Giles, C. L. (2014). “The number of scholarly documents on the public web.” PLOS ONE, 9(5): e93949.

Lane, D., Dykeman, J., Ferri, M., Goldsmith, C., and Stelfox, H. (2013). “Capture-mark-recapture as a tool for estimating the number of articles available for systematic reviews in critical care medicine.” Journal of Critical Care.


