Exploring Trends in the Source of US Bee Data in GBIF

Understanding how the sources of the data have changed over time.

9 min read · Mar 30, 2022

Background

There is a growing body of scientific research that uses data from the Global Biodiversity Information Facility (GBIF), which is a giant repository of biodiversity data compiled from a multitude of sources.

On one hand, this offers unique opportunities to access reams of previously inaccessible data. On the other hand, the data is rife with biases and idiosyncrasies that can potentially lead to spurious conclusions.

A while ago I posted a Twitter thread criticizing some of the conclusions of a high-profile paper on bee declines that relied on GBIF data. I delved into the data on the genus Perdita and concluded that, for many species that were not found recently, the absence was due to a shortage of taxonomists and a lack of digitized bee data. Ultimately I think that for Perdita, and likely many other bee groups, GBIF data is insufficient to detect declines because there simply isn’t enough of the right kind of data to tell if bees are declining or not.

That paper is certainly not alone in using GBIF data to draw broad conclusions. However, I focus on it because it is one of the most high-profile ones and it has been highly cited in the scientific literature and widely reported in the popular media. It also provides easily accessible data that allows for deeper exploration and the authors should be commended for making the data and code available. This is in contrast to many studies that do not make their data easily available, which creates a barrier to scientific progress, though it may shield them from criticism.

Exploring bee data trends in the United States

During my exploration of the Perdita data, I also examined some broader patterns of the source of GBIF data in the United States as a whole. I never shared it at the time, but I have decided to put it here because I think it tells an interesting story that also has implications for others who may want to use GBIF data to examine bee trends.

This examination is not so much about what the data says about trends in bee abundance and distribution. Instead, it’s about where the data comes from and what implications that may have for interpreting and understanding the data.

I downloaded the data from the paper using the official GBIF link. It’s a large file, weighing in at 4.7 GB. I tried opening it in Excel, and that did not work. So I had to dust off my Python skills in order to take a peek at the data. I’ve made the code available on GitHub for anyone who wants to take a look.
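A file this size does not have to fit in memory (or in Excel) to be explored. As a rough illustration, and not the code from the GitHub repository, here is a minimal sketch that streams a delimited file one row at a time using only the Python standard library. It assumes the download is tab-delimited with a header row; the column names in the sample (`species`, `countryCode`, `institutionCode`, `year`) are Darwin Core terms that I am assuming are present, and the sample rows are made up.

```python
import csv
import io


def iter_records(handle, delimiter="\t"):
    """Yield one dict per data row from an open, delimited text handle."""
    # QUOTE_NONE treats quote characters as ordinary text, which suits
    # tab-delimited exports whose fields are not quoted.
    reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Tiny in-memory stand-in for the real file (made-up rows, assumed columns).
sample = io.StringIO(
    "species\tcountryCode\tinstitutionCode\tyear\n"
    "Perdita minima\tUS\tBBSL\t1975\n"
    "\tUS\tUSGS\t2005\n"
)
rows = list(iter_records(sample))
print(len(rows), rows[0]["institutionCode"])
```

For the real file, the `io.StringIO` stand-in would be swapped for something like `open(path, newline="", encoding="utf-8")`, where `path` is wherever the download was saved. Because rows are yielded one at a time, memory use stays small as long as the rows are tallied as they stream past rather than collected into a list.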

Digging into the data

The first thing I did was get some basic summary statistics about the dataset in order to get a sense of how many records we’re dealing with:

1. Total number of bees in the entire dataset: 4,327,069
2. Total number of bees in dataset identified to species: 3,459,093
3. Total number of bee records from the US: 1,433,465
4. Total number of US bee records identified to species: 1,145,578
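Counts like the four above can be tallied in a single pass over the rows. The sketch below is one hedged way to do it, not necessarily how the numbers in this post were produced. It assumes that a non-empty `species` field means "identified to species" and that US records carry `countryCode == "US"`; both are assumptions about the file layout, and the sample rows are invented.

```python
def summarize(rows):
    """Count all records, species-level records, and their US subsets."""
    counts = {"all": 0, "all_species": 0, "us": 0, "us_species": 0}
    for row in rows:
        # Assumption: a non-blank species name means a species-level ID.
        has_species = bool((row.get("species") or "").strip())
        is_us = row.get("countryCode") == "US"
        counts["all"] += 1
        counts["all_species"] += has_species
        counts["us"] += is_us
        counts["us_species"] += is_us and has_species
    return counts


# Invented rows, for illustration only.
rows = [
    {"species": "Perdita minima", "countryCode": "US"},
    {"species": "", "countryCode": "US"},
    {"species": "Bombus terrestris", "countryCode": "GB"},
    {"species": "", "countryCode": "MX"},
]
summary = summarize(rows)
print(summary)
```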

The main thing I’m interested in here is the source of the data, or which institutions are reporting the data found in the dataset. For this part I’m only focusing on bees from the US, since that’s where I’m located and where I have expertise. The US data is also the source of one third of the species-level data in the entire dataset, so trends in the US bee data have an outsized impact on the dataset as a whole.

In total, there were 143 unique sources of US bee data. However, only 15 of them had more than 10,000 records and only 4 had more than 100,000 records. The sources, along with the total number of US bee records for each, are shown below.
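Tallying records per source and then asking how many sources clear a threshold is a small counting exercise. Here is a minimal sketch, assuming the source is identified by the `institutionCode` column; the codes and counts in the sample are invented and far smaller than the real ones.

```python
from collections import Counter


def source_counts(rows, country="US"):
    """Tally records per institutionCode for one country."""
    return Counter(
        row["institutionCode"] for row in rows if row.get("countryCode") == country
    )


def sources_above(counts, threshold):
    """Return the sources with more than `threshold` records, largest first."""
    return [name for name, n in counts.most_common() if n > threshold]


# Invented rows: three US sources plus some non-US records that are ignored.
rows = (
    [{"institutionCode": "USGS", "countryCode": "US"}] * 5
    + [{"institutionCode": "BBSL", "countryCode": "US"}] * 3
    + [{"institutionCode": "SEMC", "countryCode": "US"}] * 1
    + [{"institutionCode": "BBSL", "countryCode": "CA"}] * 4
)
counts = source_counts(rows)
print(len(counts), sources_above(counts, 2))
```

On the real data, `len(counts)` would be the number of unique sources, and `sources_above(counts, 10_000)` and `sources_above(counts, 100_000)` would give the two shorter lists described above.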

One of the most interesting things is that the #1 source of US bee data on GBIF is the USGS Native Bee Inventory and Monitoring Lab (BIML), with nearly 400,000 records. This is important because that lab didn’t start up until (relatively) recently. In fact, it only reports a total of 30 records before the year 2000 and doesn’t start reporting data in large numbers until the year 2002, when it reported 9,277 total US bee records.

The other top data sources (with greater than 100,000 US bee records) are the USDA’s Bee Biology and Systematics Laboratory (BBSL), the Snow Entomological Museum at the University of Kansas (SEMC), and the Illinois Natural History Survey (INHS).

Comparing pre- and post-2000 data

Since the year 2000 is roughly the dividing line for the emergence of the USGS Bee Monitoring Lab (and a nice round number), what happens when we compare the historic period (defined as 1951–2000) to the recent period (defined as 2001–2015)?

First, the total number of US specimens identified to species is roughly equal between the two time periods, with 408,587 from 1951–2000 and 438,137 from 2001–2015. However, the source of the data has essentially flipped: in the period 1951–2000, three museums (BBSL, SEMC, and INHS) reported 72% of the data, with 46.0% from BBSL, 15.3% from SEMC, and 10.7% from INHS.

In contrast, in the recent period (2001–2015), 69.4% of the data has come from the USGS Bee Monitoring Lab, with the top three museums of the historic period decreasing to a total of 10.0% of the total, with 9.4% from BBSL, 0.4% from SEMC, and 0.2% from INHS.
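The percentages for the two periods come down to counting records per source within a year range and dividing by the period total. The following is a hedged sketch of that calculation on invented rows. It assumes `year` arrives as a string (as it would from a CSV reader) and skips rows where it is blank; the source codes are placeholders.

```python
from collections import Counter


def in_period(row, start, end):
    """True if the row has a numeric year within start..end (inclusive)."""
    year = row.get("year") or ""
    return year.isdigit() and start <= int(year) <= end


def period_shares(rows, start, end):
    """Percent of records per source for one period, rounded to 0.1."""
    counts = Counter(
        row["institutionCode"] for row in rows if in_period(row, start, end)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {src: round(100 * n / total, 1) for src, n in counts.items()}


# Invented rows: museums dominate early, USGS dominates late.
rows = (
    [{"institutionCode": "BBSL", "year": "1975"}] * 3
    + [{"institutionCode": "SEMC", "year": "1960"}] * 1
    + [{"institutionCode": "USGS", "year": "2005"}] * 4
    + [{"institutionCode": "BBSL", "year": "2010"}] * 1
    + [{"institutionCode": "USGS", "year": ""}] * 2  # no year: skipped
)
historic = period_shares(rows, 1951, 2000)
recent = period_shares(rows, 2001, 2015)
print(historic, recent)
```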

The changes can be illustrated by plotting the proportion of specimens from the top four data sources over time, shown here spanning the time period from 1951–2018. That graph shows an initial mix of three sources, which is slowly dominated by BBSL data (in teal) up to the year 2000, whereupon the data becomes swamped by the USGS (in orange).

Proportion of identified specimens by the top four sources. USGS = United States Geological Survey Bee Inventory and Monitoring Lab, Maryland. BBSL = Bee Biology and Systematics Lab in Logan, Utah. SEMC = Snow Entomological Museum at University of Kansas. INHS = Illinois Natural History Survey, Champaign, Illinois.

Here is the same data but plotted to show the total numbers of identified specimens rather than their proportion:
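Both views, proportions and raw totals, can be derived from one table of yearly counts per source. The sketch below builds that table and converts a single year to proportions. It is only an illustration: the rows are invented, the source codes are placeholders, and I am assuming the proportions are taken among the listed sources only rather than among all US records.

```python
from collections import Counter, defaultdict


def yearly_counts(rows, sources):
    """Map year -> Counter of records per source, limited to `sources`."""
    table = defaultdict(Counter)
    for row in rows:
        year = row.get("year") or ""
        if row["institutionCode"] in sources and year.isdigit():
            table[int(year)][row["institutionCode"]] += 1
    return dict(table)


def to_proportions(counter):
    """Convert one year's counts into fractions of that year's total."""
    total = sum(counter.values())
    return {src: n / total for src, n in counter.items()}


top_four = {"USGS", "BBSL", "SEMC", "INHS"}
# Invented rows; the "OTHER" source is outside the top four and is ignored.
rows = (
    [{"institutionCode": "BBSL", "year": "1999"}] * 3
    + [{"institutionCode": "SEMC", "year": "1999"}] * 1
    + [{"institutionCode": "USGS", "year": "2005"}] * 3
    + [{"institutionCode": "BBSL", "year": "2005"}] * 1
    + [{"institutionCode": "OTHER", "year": "2005"}] * 6
)
counts = yearly_counts(rows, top_four)
shares_2005 = to_proportions(counts[2005])
print(counts[1999], shares_2005)
```

The `counts` table corresponds to the raw-totals view, and applying `to_proportions` to each year gives the proportional view.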

Implications for interpreting bee data

In general, I believe that the data generated by the USGS bee monitoring lab is not equivalent to the other top data sources. The other three data sources that have contributed at least 100,000 records to GBIF are what I would consider “traditional” museums, with large collections that are hubs of taxonomic research. In addition, all three are (or were) very bee-focused. For example, the USDA Bee Biology and Systematics Laboratory (where I did my PhD) is focused almost entirely on bees, the Snow Entomological Museum was home to Charles Michener, one of the greatest bee taxonomists, and the Illinois Natural History Survey was the home of Wally LaBerge, who was one of the most prolific bee taxonomists. All of these labs have rich histories of bee taxonomists performing research, training students, and building up their collections.

In contrast, the USGS Bee Monitoring Lab is not a traditional museum, as it focuses on speed and efficiency of processing and identifying specimens and it does not maintain a collection or preserve most of the specimens. Instead, it actively destroys the majority of its specimens.

There are some additional factors that would lead me to expect the data from the USGS Bee Monitoring Lab to be different from other sources. For example, as reported in Kammerer et al. (2020), the USGS lab primarily uses pan traps to catch bees, with 82% of the dataset reported as being collected from pan traps. I’ve criticized pan traps elsewhere because they tend to capture both large numbers of common species as well as many groups that are extremely difficult to identify. The predominance of pan traps in the USGS data is also important because pan traps are a relatively recent invention, having gained prominence only in the 1990s as a method for sampling bees. In contrast, most historic bee data was collected using insect nets, and it is well-documented that the two methods have their own unique biases. Further, entomology museums tend to use a broader mix of collecting methods employed by a greater diversity of collectors.

Finally, the USGS lab is located in Maryland, with a collection focus on the mid-Atlantic states. In contrast, the other top data sources are located in Utah (BBSL), Kansas (SEMC), and Illinois (INHS). These geographic differences make direct comparisons difficult because the eastern and western US have different bee faunas, with the western bee fauna being markedly more diverse. For example, the state of Utah alone has higher bee diversity than the entire eastern United States. This geographic bias is reflected in the data: after 1950, the BBSL dataset has 861 species from Utah whereas the USGS dataset has just 184 species from Utah. The pattern is similar for California species: after 1950 the BBSL dataset has 1081 species from California versus just 150 in the USGS dataset. Given that the entire USGS dataset has 1083 species from the whole United States, this is a significant difference.
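A per-state species count like the ones above amounts to collecting the distinct species names for one source, one state, and one time window. Here is a minimal sketch on invented rows. It assumes the state is in a `stateProvince` column and is spelled consistently (in practice this field is free text and may need cleaning), and the source codes are placeholders.

```python
def distinct_species(rows, source, state, after_year):
    """Set of species names for one source and state, after a given year."""
    found = set()
    for row in rows:
        year = row.get("year") or ""
        if (
            row["institutionCode"] == source
            and row["stateProvince"] == state
            and row["species"]
            and year.isdigit()
            and int(year) > after_year
        ):
            found.add(row["species"])
    return found


def rec(species, source, state, year):
    """Build one invented record (helper for this example only)."""
    return {
        "species": species,
        "institutionCode": source,
        "stateProvince": state,
        "year": year,
    }


rows = [
    rec("Perdita minima", "BBSL", "Utah", "1975"),
    rec("Perdita minima", "BBSL", "Utah", "1980"),  # same species, counted once
    rec("Osmia lignaria", "BBSL", "Utah", "1990"),
    rec("Osmia lignaria", "BBSL", "Utah", "1940"),  # too early
    rec("Osmia lignaria", "USGS", "Utah", "2005"),
    rec("", "USGS", "Utah", "2006"),  # not identified to species
    rec("Bombus impatiens", "USGS", "Maryland", "2005"),
]
bbsl_utah = distinct_species(rows, "BBSL", "Utah", 1950)
usgs_utah = distinct_species(rows, "USGS", "Utah", 1950)
print(len(bbsl_utah), len(usgs_utah))
```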

As a result of these methodological and geographic differences, I would expect the USGS data to have fewer species than the other major data sources even though it contains tens of thousands more specimens per year. This is largely borne out by the data.

Below, I have graphed the total number of bee species reported per year, and I’ve lumped the BBSL, SEMC, and INHS collections together (shown in grey) in order to avoid double-counting species. The USGS species are shown in orange, and species that are in both datasets are in black.

Raw number of species gathered by traditional museums (data from BBSL, SEMC, and INHS are pooled in the grey bars) and data gathered by USGS (orange bars). Species that were found by both the traditional museums and the USGS are shown in black.
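Pooling the three museums before counting is what prevents a species reported by two museums in the same year from being counted twice. One way to sketch this is with per-year sets: one pooled set for the museums, one for the USGS, and set differences and intersections for the three bar colors. The rows and source codes below are invented, and this is an illustration of the idea, not the exact code behind the figure.

```python
from collections import defaultdict


def species_overlap_by_year(rows, museum_codes, usgs_code):
    """Per year, count species seen only by museums, only by USGS, or both."""
    museum, usgs = defaultdict(set), defaultdict(set)
    for row in rows:
        year = row.get("year") or ""
        if not row["species"] or not year.isdigit():
            continue
        if row["institutionCode"] in museum_codes:
            # Pooled set: a species from two museums is stored once.
            museum[int(year)].add(row["species"])
        elif row["institutionCode"] == usgs_code:
            usgs[int(year)].add(row["species"])
    result = {}
    for year in sorted(set(museum) | set(usgs)):
        m, u = museum.get(year, set()), usgs.get(year, set())
        result[year] = {
            "museum_only": len(m - u),
            "usgs_only": len(u - m),
            "both": len(m & u),
        }
    return result


# Invented rows using placeholder species names A-D.
pairs = [
    ("A", "INHS", "1990"),
    ("A", "BBSL", "2005"), ("B", "BBSL", "2005"),
    ("B", "SEMC", "2005"), ("C", "SEMC", "2005"),
    ("C", "USGS", "2005"), ("D", "USGS", "2005"),
]
rows = [{"species": s, "institutionCode": c, "year": y} for s, c, y in pairs]
overlap = species_overlap_by_year(rows, {"BBSL", "SEMC", "INHS"}, "USGS")
print(overlap)
```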

Conclusion

Overall, this exploration of the data has made me feel more confident in my initial conclusion that GBIF data is not good for monitoring most bees. The source of the data has changed dramatically over time, with three traditional museums providing about 70% of the data in the historic period (1951–2000) and the USGS Bee Inventory and Monitoring Lab providing about 70% of the more recent data (2001–2015). And even though the USGS lab has collected far more specimens per year, its different focus, collecting methods, geographic biases, and recent origin make its data difficult to compare to that of the traditional museums. Indeed, it’s unclear how much these hundreds of thousands of collected specimens actually contribute to the goal of monitoring bees.

One of the big takeaways from this is that the institutional sources of US bee data in GBIF have changed dramatically over time. This makes it difficult to look at broad-scale changes in the US bee fauna, and researchers should be extremely wary of making conservation decisions based only on GBIF data. The changes in the source of the data are simply so substantial that they likely swamp most other effects. Unfortunately there is no easy way to fix this.

It’s worth noting that the bee data in GBIF is dwarfed by the data that is either undatabased or not reported to GBIF. Many museums with large bee collections, such as the American Museum of Natural History and the UC Riverside Entomology Collection, have few to no records on GBIF. Others, such as the BBSL collection, withhold hundreds of thousands of records from GBIF. However, even if the traditional bee museums digitized or reported more of their material, the lack of modern bee taxonomists and support for taxonomic research means that much of that data hasn’t been updated to modern taxonomic concepts. In other words, there isn’t much point in digitizing bees that were identified decades ago using out-of-date or incorrect species concepts.

Can these findings about US data be used to generalize about other countries? I don’t think so. For me, one of the big takeaways here is that the changes in data sources are due largely to historical factors. I would expect most countries and regions to have their own unique scientific history that may not necessarily match up with the trends in other countries.

Finally, because I’m only looking at GBIF data up through 2015, it doesn’t hit on another major change that GBIF data is undergoing: the extremely rapid rise of iNaturalist. With over 200,000 bee records (with 130,000 “research grade” observations to species) in the US from the year 2020 alone, iNaturalist has become the largest data source for bees in GBIF. While I believe that data is valuable, it comes with its own unique set of biases and limitations that make it even more difficult to compare to data from traditional natural history collections.