Scholarly Communication Analytics: Open Access Evidence in Unpaywall

Najko Jahn; Anne Hobert

Unpaywall, developed and maintained by the team of Impactstory, finds open access copies of scholarly literature (Piwowar et al. 2018). Providing DOIs to Unpaywall’s REST API not only returns open access full-text links, but also helpful metadata about the open access status of publications indexed in Crossref, a DOI registration agency. While the API only allows retrieving a limited number of records, Unpaywall also offers database snapshots for large-scale analysis, which more and more bibliometric databases and open access monitoring services utilise. However, documentation about how database and service providers work with these dumps is hard to find.

In this blog post, we describe how we loaded Unpaywall’s data dump into Google BigQuery, a cloud-based service that allows fast analysis of large datasets, and how we interfaced with BigQuery from R for our analysis. We wanted to know the extent of open access status information in Unpaywall and, in particular, how this information can be utilised for bibliometric research. In our case, we intend to match open access status information from Unpaywall with the Web of Science in-house database from the German Competence Center for Bibliometrics to determine factors influencing open access publication activities among German universities as part of our BMBF-funded research project OAUNI.

Store and analyse large datasets with Google BigQuery

The Unpaywall data dump from February 2019 comprises more than 100 million records amounting to a file size of more than 100 GB. Working with datasets of such a large size is generally non-trivial. We therefore chose Google’s BigQuery as a cloud-based solution. From our perspective BigQuery has several advantages. Firstly, it is a highly performant tool enabling us to query and manipulate large datasets very fast. We would not be able to achieve a similarly satisfying performance with a local database deployed on our standard laptops. Secondly, using this cloud-based service gives us the possibility to share access to our database with colleagues and collaborators. Finally, the already existing interfaces to BigQuery from R allow us to incorporate this environment into our familiar data analytics workflow.

As a preparatory step, we loaded the entire dataset into a local MongoDB database, exported relevant fields and rows for the study of the open access status of scholarly output as compressed JSON Lines files, and uploaded them to Google Cloud Storage. To import these files into BigQuery, we had to specify a schema, which we share in the source code repository of this blog. We used the BigQuery user interface, where the files are automatically decompressed and the corresponding tables are created.
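To illustrate the file format used in this step, here is a minimal sketch of writing and reading compressed JSON Lines from R with the jsonlite package. It is not the export we ran, which was done from MongoDB; the records and the temporary file are made up for illustration.

```r
library(jsonlite)

# Toy records standing in for exported Unpaywall rows (values are made up)
records <- data.frame(
  doi = c("10.1000/example.1", "10.1000/example.2"),
  year = c(2018L, 2017L),
  is_oa = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Write one JSON object per line into a gzip-compressed temporary file
jsonl_path <- tempfile(fileext = ".jsonl.gz")
stream_out(records, gzfile(jsonl_path), verbose = FALSE)

# Read the file back to confirm the round trip
records_back <- stream_in(gzfile(jsonl_path), verbose = FALSE)
records_back
```

Each line of the resulting file is a self-contained JSON object, which is why such files can be split and loaded in parallel by services like BigQuery.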

Unpaywall Overview

In R, we interface our Unpaywall dataset stored in Google BigQuery with the packages DBI and bigrquery.

# connect to Google BigQuery, where we imported the JSON Lines Unpaywall dump
library(DBI)
library(bigrquery)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "api-project-764811344545",
  dataset = "oadoi_full"
)

Our BigQuery project has two tables, one containing all records between 2008 and 2012, and another for more recent works published since 2013. When connecting with tbl() from dplyr, Google asks us to log in via a web browser or to supply a private access token to interface with our access-restricted database.

library(dplyr)
upw_08_12 <- tbl(con, "feb_19_mongo_export_2008_2012_full_all_genres")
upw_13_19 <- tbl(con, "feb_19_mongo_export_2013_Feb2019_full_all_genres")

bigrquery allows querying BigQuery tables using SQL or dplyr functions. The latter is convenient for us, because we have just started to learn SQL, but feel more experienced in the tidyverse, a popular collection of R packages following the Wickham-Grolemund approach to practise data science (Wickham and Grolemund 2017). Here’s an example where we call BigQuery with dplyr, which is part of the tidyverse, to obtain the first ten records from 2018. We restrict our search to journal articles, the most common genre in Unpaywall.

library(tidyverse)
upw_13_19 %>%
  filter(year == 2018, genre == "journal-article") %>%
  head(10)

#> # Source:   lazy query [?? x 13]
#> # Database: BigQueryConnection
#>     year genre     updated             published_date journal_is_in_d…
#>    <int> <chr>     <dttm>              <date>         <lgl>           
#>  1  2018 journal-… 2018-06-20 21:37:24 2018-01-01     FALSE           
#>  2  2018 journal-… 2019-01-26 20:22:21 2018-05-07     TRUE            
#>  3  2018 journal-… 2018-06-19 15:09:52 2018-03-23     TRUE            
#>  4  2018 journal-… 2019-01-15 22:31:55 2018-01-12     FALSE           
#>  5  2018 journal-… 2018-06-21 16:16:44 2018-04-01     FALSE           
#>  6  2018 journal-… 2018-06-18 19:20:28 2018-05-10     TRUE            
#>  7  2018 journal-… 2018-06-18 20:12:06 2018-01-01     FALSE           
#>  8  2018 journal-… 2019-01-15 13:40:41 2018-01-04     FALSE           
#>  9  2018 journal-… 2019-02-11 09:39:25 2018-12-14     TRUE            
#> 10  2018 journal-… 2018-12-09 03:14:10 2018-12-03     TRUE            
#> # … with 8 more variables: journal_is_oa <lgl>, journal_issns <chr>,
#> #   oa_locations <list>, doi <chr>, is_oa <lgl>, publisher <chr>,
#> #   journal_name <chr>, data_standard <int>
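Behind the scenes, dplyr verbs applied to a database table are translated to SQL by dbplyr. The following sketch shows this translation without needing BigQuery access: it uses a lazy table backed by dbplyr's default simulated connection, with a few column names borrowed from our schema. The SQL generated for the BigQuery backend may differ in detail, for example in how identifiers are quoted.

```r
library(dplyr)
library(dbplyr)

# A lazy stand-in table; no database connection is required
upw_sim <- lazy_frame(
  year = 2018L,
  genre = "journal-article",
  is_oa = TRUE
)

query <- upw_sim %>%
  filter(year == 2018, genre == "journal-article") %>%
  head(10)

# Render the SQL that dbplyr would send to the database
sql_text <- as.character(sql_render(query))
cat(sql_text)
```

With a real connection, show_query() prints the same kind of statement for any table created with tbl().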

Notice that our schema follows the Unpaywall data format. However, we excluded the large data field z_authors. Moreover, we did not consider the fields title, doi_url, which is redundant to the doi field, and best_oa_location, which is derived from the Open Access location object.

In this blog post, we will examine the following open access indicators:

is_oa: A logical value indicating whether an open access version of the article was found or not.
journal_is_in_doaj: A logical value indicating whether an article was published in a journal registered in the Directory of Open Access Journals (DOAJ).

The column oa_locations is a list-column that contains individual metadata about all open access full-text links found per article. By definition, open access provision is not limited to one route, but multiple copies of an article can be made freely available at the same time using various means (Suber 2012).

Here are the three data variables from the oa_locations object that we will focus on:

is_best: A logical value defined by Unpaywall’s algorithm that describes the most relevant open access location. The algorithm prioritises publisher-hosted content.
host_type: Is the open access full-text provided by a publisher or a repository?
evidence: How did Unpaywall find the open access full-text?

Open Access availability (`is_oa`)

To start with, we retrieve the number and proportion of journal articles with open access full-text published between 2008 and 2018 using Unpaywall’s most basic open access indicator is_oa, a logical value that is TRUE when at least one open access full-text was found. After grouping and summarising the is_oa observations by year with dplyr, collect() from dbplyr, the database backend of dplyr, loads the aggregated data from BigQuery into a local tibble. We use the lubridate package to transform the year variable to a date object.

library(lubridate)
oa_08_12 <- upw_08_12 %>%
  # query and aggregate with dplyr
  filter(genre == "journal-article") %>%
  group_by(year, is_oa) %>%
  summarise(n = n()) %>% 
  # load the data into a local tibble
  collect()
oa_13_18 <- upw_13_19 %>%
  # query and aggregate with dplyr
  filter(genre == "journal-article", year < 2019) %>%
  group_by(year, is_oa) %>%
  summarise(n = n()) %>% 
  # load the data into a local tibble
  collect()
my_df <- bind_rows(oa_08_12, oa_13_18) %>%
  # calculate proportion per year
  ungroup() %>%
  mutate(year = lubridate::ymd(paste0(year, "-01-01"))) %>%
  group_by(year, is_oa) %>%
  summarise(n = sum(n)) %>%
  mutate(prop = n / sum(n))
my_df

#> # A tibble: 22 × 4
#> # Groups:   year [11]
#>    year       is_oa       n  prop
#>    <date>     <lgl>   <int> <dbl>
#>  1 2008-01-01 FALSE 1454745 0.711
#>  2 2008-01-01 TRUE   590250 0.289
#>  3 2009-01-01 FALSE 1562765 0.701
#>  4 2009-01-01 TRUE   665951 0.299
#>  5 2010-01-01 FALSE 1724760 0.697
#>  6 2010-01-01 TRUE   749139 0.303
#>  7 2011-01-01 FALSE 1585092 0.654
#>  8 2011-01-01 TRUE   838014 0.346
#>  9 2012-01-01 FALSE 1636130 0.630
#> 10 2012-01-01 TRUE   962036 0.370
#> # … with 12 more rows
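The proportion per year relies on a detail of dplyr's grouping: after summarising a table grouped by year and is_oa, the result is still grouped by year, so sum(n) in the subsequent mutate() is the yearly total. A minimal sketch with the 2008 and 2009 counts copied from the output above (assuming dplyr 1.0.0 or later for the .groups argument):

```r
library(dplyr)

# Counts for 2008 and 2009 taken from the output above
counts <- tibble(
  year = c(2008, 2008, 2009, 2009),
  is_oa = c(FALSE, TRUE, FALSE, TRUE),
  n = c(1454745, 590250, 1562765, 665951)
)

props <- counts %>%
  group_by(year, is_oa) %>%
  # drop only the last grouping level, so the result stays grouped by year
  summarise(n = sum(n), .groups = "drop_last") %>%
  # sum(n) is therefore the total number of articles per year
  mutate(prop = n / sum(n)) %>%
  ungroup()
props
```

Without the remaining grouping by year, the shares would be computed relative to all years combined.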

In total, 31,159,960 journal articles published between 2008 and 2018 were included in Unpaywall. For 11,633,886 articles, Unpaywall was able to link a DOI to at least one freely available full-text (37 %). This means that around every third scholarly journal article published since 2008 is currently openly available.

Next, let’s plot the prevalence of open access to journal articles over time using the data visualisation package ggplot2, which is also part of the tidyverse. To make our ggplot object interactive, we use ggplotly() to turn it into a chart rendered with plotly, a JavaScript library. The tooltip presents the total number and percentage for each category and year. We use the package scales to format the y-axis.

library(scales)
plot_a <- my_df %>%
  # prepare label that we want to present as tooltip
  mutate(`Proportion in %` = round(prop * 100, 2)) %>%
  ggplot(aes(year, n, label = `Proportion in %`)) +
  geom_area(aes(fill = is_oa, group = is_oa),  alpha = 0.8) +
  labs(x = "Year published", y = "Journal Articles",
       title = "Open Access to Journal Articles") +
  scale_fill_manual("Is OA?",
                    values = c("#b3b3b3a0", "#56B4E9")) +
  scale_x_date(date_labels = "%y") +
  scale_y_continuous(labels = scales::number_format(big.mark = " ")) +
  theme_minimal(base_family = "Roboto") +
  theme(plot.margin = margin(30, 30, 30, 30)) +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(panel.border = element_blank())
# turn ggplot object into interactive plotly chart
plotly::ggplotly(plot_a, tooltip = c("label", "y"))

Figure 1: Open access to journal articles according to Unpaywall. Blue area represents journal articles with at least one freely available full-text, grey area represents toll-access articles.

While a general growth of journal articles and open access provision to them can be observed, there is a considerable decline in the number of journal articles published in 2018, presumably because of an indexing lag between Crossref and Unpaywall. The decline in open access full-text availability is even more pronounced, suggesting that some open access content is provided only after a certain period of time.

Unpaywall Open Access Hosting Types (`host_type`)

Using Unpaywall’s open access location types allows for a more detailed analysis of open access provision. In the following, we explore the variable host_type, showing whether Unpaywall found the open access full-text on a publisher’s website or in a repository. Furthermore, we specifically highlight articles from fully open access journals that are indexed in the Directory of Open Access Journals (DOAJ) as indicated by the journal_is_in_doaj variable. As a start, we only examine the best open access location per DOI, is_best. As said before, this variable is defined by Unpaywall’s algorithm that prioritises publisher-hosted content.

Instead of dplyr, we are now querying BigQuery with SQL. Beforehand, we built and tested the SQL queries in the BigQuery user interface. The SQL code is stored in separate files (host_type_08_12.sql and host_type_13_18.sql), which we share in the source code repository of this blog.

# join lines with newlines so that tokens on adjacent lines stay separated
host_type_08_12_query <- readLines("database/host_type_08_12.sql") %>%
  paste(collapse = "\n")
host_type_13_18_query <- readLines("database/host_type_13_18.sql") %>%
  paste(collapse = "\n")
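When a query file is read line by line, the separator used to rejoin the lines matters. The following sketch uses a hypothetical two-line query written to a temporary file (the table name some_table is made up) to show that joining with an empty string can fuse tokens from adjacent lines unless every line carries its own leading or trailing whitespace, whereas joining with a newline keeps them apart.

```r
# Write a small, hypothetical two-line query to a temporary file
sql_path <- tempfile(fileext = ".sql")
writeLines(c("SELECT year, COUNT(*) AS n", "FROM some_table"), sql_path)

sql_lines <- readLines(sql_path)

# Joining without a separator glues "n" and "FROM" together
joined_empty <- paste(sql_lines, collapse = "")

# Joining with a newline preserves the original token boundaries
joined_newline <- paste(sql_lines, collapse = "\n")

cat(joined_empty, "\n\n")
cat(joined_newline, "\n")
```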

After calling BigQuery using SQL with the DBI interface, we bind the two resulting data frames into one. Using case_when() from dplyr, we create a host column distinguishing between “DOAJ-listed Journal,” “Other Journals” and “Repositories only” open access provision.

host_type_08_12_query_df <- dbGetQuery(con, host_type_08_12_query)
host_type_13_18_query_df <- dbGetQuery(con, host_type_13_18_query)
host_type_df <-
  bind_rows(host_type_08_12_query_df, host_type_13_18_query_df) %>%
  mutate(
    host = case_when(
      journal_is_in_doaj == TRUE ~ "DOAJ-listed Journal",
      host_type == "publisher" ~ "Other Journals",
      host_type == "repository" ~ "Repositories only"
    )
  ) %>%
  mutate(year = lubridate::ymd(paste0(year, "-01-01")))
host_type_df

#> # A tibble: 34 × 5
#>    year       host_type  journal_is_in_doaj number_of_articles host   
#>    <date>     <chr>      <lgl>                           <int> <chr>  
#>  1 2011-01-01 publisher  TRUE                           170917 DOAJ-l…
#>  2 2011-01-01 repository FALSE                          178323 Reposi…
#>  3 2012-01-01 repository FALSE                          188662 Reposi…
#>  4 2012-01-01 publisher  TRUE                           219293 DOAJ-l…
#>  5 2012-01-01 publisher  FALSE                          554081 Other …
#>  6 2008-01-01 publisher  FALSE                          358552 Other …
#>  7 2008-01-01 publisher  TRUE                            81731 DOAJ-l…
#>  8 2008-01-01 repository FALSE                          149967 Reposi…
#>  9 2010-01-01 publisher  FALSE                          436375 Other …
#> 10 2010-01-01 repository FALSE                          173419 Reposi…
#> # … with 24 more rows
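The order of the conditions in case_when() matters here, because the first matching condition determines the label. A small sketch with made-up rows shows that a publisher-hosted article in a DOAJ-listed journal is labelled “DOAJ-listed Journal” rather than “Other Journals,” since the DOAJ test comes first:

```r
library(dplyr)

# Made-up rows covering the three cases
toy <- tibble(
  host_type = c("publisher", "publisher", "repository"),
  journal_is_in_doaj = c(TRUE, FALSE, FALSE)
)

toy_labelled <- toy %>%
  mutate(
    host = case_when(
      # checked first, so DOAJ journals never fall through to "Other Journals"
      journal_is_in_doaj ~ "DOAJ-listed Journal",
      host_type == "publisher" ~ "Other Journals",
      host_type == "repository" ~ "Repositories only"
    )
  )
toy_labelled
```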

To explore our data, we follow Claus Wilke’s excellent book “Fundamentals of Data Visualization” (Wilke 2019) and visualise our proportions separately as parts of the total. Again, our final ggplot graphic is transformed to an interactive plotly chart.

# calculate all oa articles per year
all_articles <- host_type_df %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(number_of_articles = sum(number_of_articles))

plot_b <-
  ggplot(host_type_df, aes(x = year, y = number_of_articles, text = paste0("Publication year: ", lubridate::year(year)))) +
  geom_bar(
    data = all_articles,
    aes(fill = "All OA Articles"),
    color = "transparent",
    stat = "identity"
  ) +
  geom_bar(aes(fill = "by Host"), color = "transparent", stat = "identity") +
  facet_wrap( ~ host, nrow = 1) +
  scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "") +
  labs(x = "Year", y = "OA Articles (Total)", title = "Open Access to Journal Articles by Unpaywall host") +
  scale_x_date(date_labels = "%y") +
  scale_y_continuous(labels = scales::number_format(big.mark = " ")) +
  # apply the complete theme first, so that later theme() tweaks are not overwritten
  theme_minimal(base_family = "Roboto") +
  theme(legend.position = "top",
        legend.justification = "right") +
  theme(plot.margin = margin(30, 30, 30, 30)) +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(panel.border = element_blank())
# turn ggplot object into interactive plotly chart
plotly::ggplotly(plot_b, tooltip = c("y", "text"))

Figure 2: Open access to journal articles by open access hosting location. Coloured bars represent the number of open access articles per host (“DOAJ-listed Journal,” “Other Journals,” “Repositories only”), grey bars the total number of journal articles indexed in Crossref for which Unpaywall was able to identify at least one openly available full-text.

The figure shows that most publisher-provided open access links were obtained from journals that were not indexed in the DOAJ: these are 6,531,822 articles, representing 56 % of all journal articles with openly available full-text identified by Unpaywall.

While the number of publications in DOAJ-indexed journals is rising constantly, open access provided by other journal types and repositories declined from 2017 to 2018. Indeed, a considerable number of journals delay open access provision (Laakso and Björk 2013). A prominent example is the journal Cell, where all articles are made freely available after an embargo period of twelve months. Also, self-archiving in repositories is often subject to embargo periods imposed by publishers, or researchers upload their publications later (Björk et al. 2013). A more detailed analysis of delayed open access, however, is challenging using Unpaywall data only, because Unpaywall has so far not tracked the point in time when articles were made open access.

Unpaywall Open Access Evidence Types (`evidence`)

The evidence field of the oa_locations object contains more detailed open access status information. We again use SQL queries stored in separate files (evidence_08_12.sql and evidence_13_18.sql), which can be found in the source code repository, and create a data.frame with the relevant fields.

library(tidyverse)
# define queries
evidence_08_12_query <- readLines("database/evidence_08_12.sql") %>%
  paste(collapse = "\n")
evidence_13_18_query <- readLines("database/evidence_13_18.sql") %>%
  paste(collapse = "\n")
# fetch records and bind them to one data frame
evidence_08_12 <- dbGetQuery(con, evidence_08_12_query)
evidence_13_18 <- dbGetQuery(con, evidence_13_18_query)
evidence_df <- bind_rows(evidence_08_12, evidence_13_18) %>%
  ungroup() %>%
  mutate(year = lubridate::ymd(paste0(year, "-01-01")))

The evidence field indicates how Unpaywall found the article at a specific location and identified it as open access, for example via PubMed Central or via license information from Crossref.

For each evidence type, we want to see how many articles were identified as open access in this way. To this end, we created the following table that shows the total number of articles per evidence type as well as their proportion and cumulative proportion with respect to the total number of all articles.

# calculate numbers and proportion of articles per evidence type
evidence_df %>%
  group_by(evidence) %>%
  summarize(N_records = sum(number_of_articles)) %>%
  arrange(desc(N_records)) %>%
  mutate(
    prop = N_records / sum(N_records) * 100,
    cum_prop = cumsum(prop)
  ) -> articles_per_type_df
articles_per_type_df %>%
  knitr::kable(
    col.names = c(
      "Evidence Types",
      "Number of Articles",
      "Proportion of all Articles in %",
      "Cumulative Proportion in %"
    ),
    format.args = list(big.mark = ","),
    caption = "Number of articles per evidence type. Columns show the total number per evidence type, the proportion of articles of this type with respect to the number of all articles and the cumulative proportion of articles associated with any of the above evidence types."
  )

Table 1: Number of articles per evidence type. Columns show the total number per evidence type, the proportion of articles of this type with respect to the number of all articles and the cumulative proportion of articles associated with any of the above evidence types.
Evidence Types	Number of Articles	Proportion of all Articles in %	Cumulative Proportion in %
open (via free pdf)	4415372	21.84	22
open (via page says license)	3404452	16.84	39
oa repository (via OAI-PMH doi match)	3247290	16.06	55
oa journal (via doaj)	2960898	14.65	69
oa repository (via OAI-PMH title and first author match)	2704088	13.38	83
oa repository (via pmcid lookup)	2103323	10.41	93
open (via crossref license)	985336	4.87	98
open (via page says Open Access)	126113	0.62	99
open (via crossref license, author manuscript)	82079	0.41	99
oa journal (via publisher name)	63234	0.31	99
oa repository (via OAI-PMH title and last author match)	60535	0.30	100
oa repository (via OAI-PMH title match)	57106	0.28	100
oa repository (via doi prefix)	3583	0.02	100
oa journal (via issn in doaj)	392	0.00	100
manual	27	0.00	100

The eight least frequent categories at the bottom of the table, each accounting for less than 1 % of all articles, together make up only 1.9 % of all articles. We therefore aggregate these evidence types into the category Other in the following.

# collate least frequent evidence types as 'Other'
articles_per_type_df %>%
  mutate(evidence = as_factor(evidence)) %>%
  mutate(evidence_grouped = fct_relevel(fct_other(evidence, keep = evidence[.$prop > 1]), "Other")) %>%
  group_by(evidence_grouped) %>%
  summarize(number_of_articles = sum(N_records)) %>%
  mutate(
    prop = number_of_articles / sum(number_of_articles) * 100,
    cum_prop = cumsum(number_of_articles) / sum(number_of_articles) *
      100
  ) -> articles_per_type_grouped_df
# group according to categorization with "Other"
evidence_grouped_df <- evidence_df %>%
  mutate(evidence = as_factor(evidence)) %>%
  mutate(
    evidence_grouped = factor(
      fct_other(evidence, keep = articles_per_type_grouped_df$evidence_grouped),
      levels = articles_per_type_grouped_df$evidence_grouped
    )
  ) %>%
  mutate(evidence_grouped = fct_relevel(fct_rev(evidence_grouped), "Other")) %>%
  group_by(evidence_grouped, is_best, year) %>%
  summarize(number_of_articles = sum(number_of_articles))
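The collation step can be sketched in isolation with fct_other() from forcats. The toy data below takes four evidence types and their proportions from Table 1; every type with a proportion of 1 % or less ends up in the level Other.

```r
library(dplyr)
library(forcats)

# Four evidence types with proportions (in %) taken from Table 1
toy_evidence <- tibble(
  evidence = c("open (via free pdf)", "oa journal (via doaj)",
               "oa repository (via doi prefix)", "manual"),
  prop = c(21.84, 14.65, 0.02, 0.00)
)

# Evidence types that keep their own level
frequent <- toy_evidence$evidence[toy_evidence$prop > 1]

toy_grouped <- toy_evidence %>%
  mutate(
    evidence_grouped = fct_other(as_factor(evidence), keep = frequent)
  ) %>%
  count(evidence_grouped)
toy_grouped
```

fct_other() appends the Other level at the end, which is why the code above uses fct_relevel() to move it to the front before plotting.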

So far, we have only examined the best open access location per DOI, indicated by is_best, a variable defined by Unpaywall’s algorithm that prioritises publisher-hosted content. However, evidence types in Unpaywall are not exclusive categories. On the contrary, many records are associated with several evidence types, because Unpaywall has found various ways to openly access the full-text. For this reason, the following figure distinguishes whether a given evidence type is classified as best open access location by Unpaywall or not. It is clearly visible that Unpaywall prioritises publisher-hosted content (open, oa_journal) over repository deposits (oa_repository), as stated on its website. However, the figure also shows that an existing free PDF version on the publisher’s website (likely even without licensing information) is prioritised over journal-level classifications, such as being indexed in the DOAJ.

evidence_grouped_df %>%
  group_by(evidence_grouped, is_best) %>%
  summarize(number_of_articles = sum(number_of_articles)) %>%
#create plot
  ggplot(aes(x = evidence_grouped, y = number_of_articles, fill = is_best)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "Is best?") +
  theme_minimal(base_family = "Roboto") +
  theme(plot.margin = margin(30, 30, 30, 30)) +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.border = element_blank()) +
  coord_flip() +
  labs(y = "Number of Open Access Articles", x = "Evidence Type",
       title = "Number of Open Access Articles per Unpaywall Evidence Type") -> plot_ev_types_is_best
#create interactive plot
plotly::ggplotly(plot_ev_types_is_best, tooltip = c("y"))

Figure 3: Number of articles per evidence type. Least frequent evidence types are collated as category Other. The number of articles for which the corresponding evidence type is classified as best_oa_location by Unpaywall is shown in blue.

To investigate the development of the most prevalent evidence types over time, we use a faceted graph. We again observe declines in the proportion of repository-based evidence types that are chosen as best location.

# use collated data frame
evidence_grouped_df %>%
  ggplot(aes(year, number_of_articles, fill = is_best, text = paste0("Publication year: ", lubridate::year(year)))) +
  geom_bar(stat = "identity") +
  facet_wrap( ~ fct_rev(evidence_grouped), ncol = 2) +
  scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "Is best?") +
  theme_minimal(base_family = "Roboto") +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(panel.border = element_blank()) +
  scale_x_date(date_labels = "%y") +
  labs(x = "Publication Year", y = "Number of Open Access Articles",
       title = "Unpaywall Open Access Evidence Categories per Year") -> plot_ev_types_per_year
#create interactive plot
plotly::ggplotly(plot_ev_types_per_year, tooltip = c("y", "text")) -> plotlyfacetfig
# move y-axes label to the left to ensure readability
plotlyfacetfig[['x']][['layout']][['annotations']][[2]][['x']] <- -0.08
plotlyfacetfig

Figure 4: Development of the number of articles per evidence type over time. Least frequent evidence types are collated as category Other. For each type, the total number of articles per year is shown for publication years from 2008 to 2018. The number of articles for which the corresponding evidence type is classified as best_oa_location by Unpaywall is highlighted in blue.

Overlap of Open Access Provision and Evidence Types

Many open access articles are accessible through a number of locations, including the publisher’s website and also one or more open access repositories. Unpaywall describes not just one, but all open access full-texts it discovers with useful metadata. In the following, we will analyse whether and to what extent the various open access indicators intersect. We start with an analysis of the overlap between host types, followed by determining set intersections between Unpaywall’s evidence types.

Overlap between Host Types

To present articles that are provided by both publishers and repositories, we use the BigQuery SQL function STRING_AGG to create a new variable where we concatenate the different host_types per open access article (for more details, see the full SQL queries host_type_intersect_08_12.sql and host_type_intersect_13_18.sql).

host_type_intersect_08_12_query <- readLines("database/host_type_intersect_08_12.sql") %>%
  paste(collapse = "\n")
host_type_intersect_13_18_query <- readLines("database/host_type_intersect_13_18.sql") %>%
  paste(collapse = "\n")
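To make the idea of the concatenation more tangible, here is a local analogue in dplyr that mimics what an aggregation like STRING_AGG does per article. It is only a sketch on made-up DOIs, not the query we ran on BigQuery; in particular, de-duplicating and sorting the host types before concatenating is an assumption made here so that each combination gets a single, stable label.

```r
library(dplyr)

# Hypothetical open access locations, one row per DOI and location
locations <- tibble(
  doi = c("10.1000/a", "10.1000/a", "10.1000/b", "10.1000/c", "10.1000/c"),
  host_type = c("repository", "publisher", "repository", "publisher", "publisher")
)

host_combinations <- locations %>%
  # keep each host type once per article
  distinct(doi, host_type) %>%
  group_by(doi) %>%
  # concatenate the sorted host types into one label per article
  summarise(host_type_count = paste(sort(host_type), collapse = ",")) %>%
  # count articles per combination
  count(host_type_count, name = "number_of_articles")
host_combinations
```

In this toy data, each of the three combinations "publisher", "publisher,repository" and "repository" occurs for exactly one article.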

Again, we call BigQuery and load the aggregated data into our local R session. This time, we do not want to present the total number of open access publications, but their relative share. We already obtained the total number of articles, which is stored in the my_df data.frame.

host_type_08_12_intersect_df <-
  dbGetQuery(con, host_type_intersect_08_12_query)
host_type_13_18_intersect_df <-
  dbGetQuery(con, host_type_intersect_13_18_query)
host_type_intersect <-
  bind_rows(host_type_08_12_intersect_df, host_type_13_18_intersect_df) %>%
  mutate(year = lubridate::ymd(paste0(year, "-01-01"))) %>%
  mutate(
    host = case_when(
      host_type_count == "publisher" ~ "Publisher only",
      host_type_count == "publisher,repository" ~ "Publisher & Repository",
      host_type_count == "repository" ~ "Repositories only"
    )
  ) %>%
  mutate(host = factor(
    host,
    levels = c("Publisher only", "Publisher & Repository", "Repositories only")
  ))
# obtain yearly publication volumes
host_type_intersect <- my_df %>%
  group_by(year) %>%
  summarise(all_articles = sum(n)) %>%
  # join with host type figures
  right_join(host_type_intersect, by = "year") %>%
  # calculate proportion
  mutate(prop = number_of_articles / all_articles)
host_type_intersect

#> # A tibble: 33 × 6
#>    year       all_articles host_type_count      number_of_artic… host 
#>    <date>            <int> <chr>                           <int> <fct>
#>  1 2008-01-01      2044995 publisher                      312392 Publ…
#>  2 2008-01-01      2044995 repository                     149967 Repo…
#>  3 2008-01-01      2044995 publisher,repository           127891 Publ…
#>  4 2009-01-01      2228716 publisher                      347502 Publ…
#>  5 2009-01-01      2228716 repository                     165372 Repo…
#>  6 2009-01-01      2228716 publisher,repository           153077 Publ…
#>  7 2010-01-01      2473899 publisher                      396449 Publ…
#>  8 2010-01-01      2473899 publisher,repository           179271 Publ…
#>  9 2010-01-01      2473899 repository                     173419 Repo…
#> 10 2011-01-01      2423106 publisher                      444220 Publ…
#> # … with 23 more rows, and 1 more variable: prop <dbl>

Let’s visualise the host type distribution including the overlap between publisher and repository-provided open access:

# get overall oa share
host_type_all <- host_type_intersect %>%
  group_by(year) %>%
  summarise(prop = sum(prop))
# make a ggplot graphic
plot_host_intersect <-
  ggplot(host_type_intersect, aes(x = year, y = prop, text = paste0("Publication year: ", year(year)))) +
  geom_bar(
    data = host_type_all,
    aes(fill = "All OA Articles"),
    color = "transparent",
    stat = "identity"
  ) +
  geom_bar(aes(fill = "by Host"), color = "transparent", stat = "identity") +
  facet_wrap(~ host, nrow = 1) +
  scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "") +
  labs(x = "Year", y = "OA Share",
       title = "Overlap between Open Access Host Types in Unpaywall") +
  scale_x_date(date_labels = "%y") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 5L)) +
  theme_minimal(base_family = "Roboto") +
  theme(plot.margin = margin(30, 30, 30, 30)) +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(panel.border = element_blank())
# turn ggplot object into interactive plotly chart
plotly::ggplotly(plot_host_intersect, tooltip = c("y", "text"))

Figure 5: Open access to journal articles by open access hosting location. Coloured bars represent the share of open access articles per Unpaywall host category (“publisher” and “repository”), grey bars the overall percentage of open access to journal articles indexed in Crossref according to Unpaywall. Because open access provision is not mutually exclusive, the overlap between “publisher” and “repository” hosted open access full-texts is also shown.

The figure shows that Unpaywall found most open access full-texts on publishers’ websites (82 %). 56 % of all open access full-texts were not archived in a repository. The overlap of open access provided by both routes also deserves attention: around 26 % of all open access articles were accessible from both publishers’ websites and repositories.

Overlap between Evidence Types

The categorisation of evidence types is not exclusive, either. Therefore, many records will be associated with more than one evidence type.

To better understand this overlap, we generate a new column specifying all found combinations of evidence types using concatenation with the SQL function STRING_AGG. The SQL queries we use (evidence_single_cat_08_12.sql and evidence_single_cat_13_18.sql) can be found in the source code repository.

library(tidyverse)
# define queries
evidence_single_cat_08_12_query <- readLines("database/evidence_single_cat_08_12.sql") %>%
  paste(collapse = "\n")
evidence_single_cat_13_18_query <- readLines("database/evidence_single_cat_13_18.sql") %>%
  paste(collapse = "\n")
# fetch records and bind them to one data frame
evidence_categories_08_12 <- dbGetQuery(con, evidence_single_cat_08_12_query)
evidence_categories_13_18 <- dbGetQuery(con, evidence_single_cat_13_18_query)
evidence_categories_df <- bind_rows(evidence_categories_08_12, evidence_categories_13_18) %>%
  group_by(ev_cat) %>%
  summarize(number_of_articles = sum(number_of_articles)) %>%
  arrange(desc(number_of_articles))
evidence_categories_df

#> # A tibble: 411 × 2
#>    ev_cat                                             number_of_artic…
#>    <chr>                                                         <int>
#>  1 open (via free pdf)                                         3290683
#>  2 oa repository (via OAI-PMH title and first author…           945958
#>  3 open (via page says license)                                 910028
#>  4 oa journal (via doaj)&open (via page says license)           687962
#>  5 open (via crossref license)                                  546154
#>  6 oa journal (via doaj)                                        515453
#>  7 oa journal (via doaj)&oa repository (via OAI-PMH …           513620
#>  8 oa repository (via OAI-PMH doi match)&oa reposito…           462086
#>  9 oa repository (via OAI-PMH doi match)                        462071
#> 10 oa repository (via OAI-PMH doi match)&oa reposito…           260696
#> # … with 401 more rows
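
The concatenation step can be mimicked in R on a small scale. The following sketch is not the SQL stored in the repository. It reproduces the idea with base R on made-up data, assuming one row per article and evidence type and "&" as separator, as seen in the output above. The object names `toy_evidence`, `combo_per_doi` and `toy_categories` are hypothetical.

```r
# Toy data: one row per article (doi) and evidence type; all values are made up
toy_evidence <- data.frame(
  doi = c("10.1/a", "10.1/a", "10.1/b", "10.1/c", "10.1/c", "10.1/d"),
  evidence = c(
    "oa journal (via doaj)", "open (via page says license)",
    "open (via free pdf)",
    "open (via page says license)", "oa journal (via doaj)",
    "open (via free pdf)"
  ),
  stringsAsFactors = FALSE
)

# Step 1: build one combination string per article. Sorting gives every
# combination a canonical order, so "A&B" and "B&A" are counted together.
combo_per_doi <- vapply(
  split(toy_evidence$evidence, toy_evidence$doi),
  function(x) paste(sort(unique(x)), collapse = "&"),
  character(1)
)

# Step 2: count the number of articles per combination
toy_categories <- as.data.frame(
  table(ev_cat = combo_per_doi),
  responseName = "number_of_articles",
  stringsAsFactors = FALSE
)
toy_categories
```

In this toy example, two articles carry a single evidence type and two carry the same pair of types, regardless of the order in which the types were recorded.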

We first illustrate, for each evidence type, the number of articles that correspond exclusively to this type and to no other. As before, the least frequent types are collated in the category Other.

# determine number of articles corresponding only to one evidence type
evidence_single_cat_df <- evidence_df %>%
  group_by(evidence) %>%
  summarize(number_of_articles = sum(number_of_articles)) %>%
  left_join(evidence_categories_df, by = c("evidence" = "ev_cat")) %>%
  rename(number_of_articles = number_of_articles.x, number_of_single_cat = number_of_articles.y) %>%
  mutate(number_of_articles = replace_na(number_of_articles, 0),
         number_of_single_cat = replace_na(number_of_single_cat, 0))
# aggregate least frequent types as category `Other`
evidence_single_cat_grouped_df <- evidence_single_cat_df %>%
  ungroup() %>%
  mutate(evidence = as_factor(evidence)) %>%
  mutate(
    evidence_grouped = factor(
      fct_other(evidence, keep = articles_per_type_grouped_df$evidence_grouped),
      levels = articles_per_type_grouped_df$evidence_grouped
    )
  ) %>%
  mutate(evidence_grouped = fct_relevel(fct_rev(evidence_grouped), "Other")) %>%
  group_by(evidence_grouped) %>%
  summarize(
    number_of_articles = sum(number_of_articles),
    number_of_single_cat = sum(number_of_single_cat)
  ) %>%
  # arrange data in long format to enable stacked bar plots
  mutate(multiple = number_of_articles-number_of_single_cat, single = number_of_single_cat) %>%
  select(evidence_grouped, single, multiple) %>%
  gather(is_single, number_of_articles, -evidence_grouped) %>%
  mutate(is_single = case_when(
    is_single == "single" ~ TRUE,
    is_single == "multiple" ~ FALSE
    )) %>%
  # rename column for correct display in the plot tooltip
  rename(proportion = number_of_articles)
# create aggregated proportions bar plot
evidence_single_cat_grouped_df %>%
  ggplot(aes(x = evidence_grouped, y = proportion, fill = is_single)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "Is unique?") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal(base_family = "Roboto") +
  theme(plot.margin = margin(30, 30, 30, 30)) +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.border = element_blank()) +
  coord_flip() +
  theme(legend.position = "top",
        legend.justification = "right") +
  labs(y = "Proportion of Articles", x = "Evidence Type",
       title = "Proportion of Articles per Evidence Type") -> plot_ev_types_is_single_prop
# create interactive plot
plotly::ggplotly(plot_ev_types_is_single_prop, tooltip = c("y"))

Figure 6: Proportion of articles per evidence type. The share of articles uniquely associated with the corresponding evidence type is shown in blue.
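
The quantities behind each bar follow from a simple subtraction. As a minimal sketch with made-up counts (the data frame `toy_counts` and its values are hypothetical and only reuse the column names from the code above):

```r
# Hypothetical counts per evidence type, made up for illustration
toy_counts <- data.frame(
  evidence = c("open (via free pdf)", "oa journal (via doaj)"),
  number_of_articles   = c(1000, 800),  # articles with this evidence type at all
  number_of_single_cat = c(700, 200),   # articles with this evidence type only
  stringsAsFactors = FALSE
)

# Articles that share the evidence type with at least one other type
toy_counts$multiple <- toy_counts$number_of_articles - toy_counts$number_of_single_cat

# Share of uniquely associated articles, i.e. the blue part of each bar
toy_counts$share_single <- toy_counts$number_of_single_cat / toy_counts$number_of_articles

toy_counts[, c("evidence", "multiple", "share_single")]
```

Because the bars are drawn with `position = "fill"`, each one is scaled to 100 %, so the figure shows such shares rather than absolute counts.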

It is interesting to see that the evidence type that appears most often as the only form of open access provision is an openly available PDF on the publisher’s website, meaning that no other evidence, such as license information from Crossref, was found. This confirms that Unpaywall’s open access detection has benefited from scraping publishers’ websites.

Moreover, a substantial number of articles are found only through repository-based evidence sources and are hence available only via the green route. Still, the figure shows a phenomenon that we also observed for host_type: repository-based evidence types often overlap with other evidence types.

Because Unpaywall prioritises publisher-provided open access, caution is in order when only the best_oa_location is used for categorisation: there are publisher-based evidence types that may not comply with how funders define open access journals, particularly with regard to license statements. On the other hand, a large number of articles seem to be identified as open access only through license information from Crossref, without an associated free PDF having been found. However, we are unsure whether Unpaywall performs all open access identification procedures for every single record indexed in Crossref, which would be required for such a comparison.

Visualising the intersections of multiple evidence types is difficult. Following Lex and Gehlenborg (2014) and Lex et al. (2014), we created an UpSet figure using the UpSetR package (Conway, Lex, and Gehlenborg 2017) to examine in more detail which of the 411 observed combinations of evidence types, including singletons, show up most frequently and how large these groups are. In theory, the 15 evidence types allow for 2^15 = 32,768 subsets, that is, 32,767 non-empty combinations.

To start with, we interface BigQuery and retrieve how often the combinations of evidence types occur. We store the corresponding SQL queries (evidence_overlap_08_12.sql and evidence_overlap_13_18.sql) and share the resulting datasets as .csv files (results_evidence_overlap_08_12.csv and results_evidence_overlap_13_18.csv). Next, we transform the data to the expression format expected by UpSetR, a named vector whose names list the evidence types of each combination and whose values give the number of articles. Lastly, the upset() function generates the graph. To keep the resulting figure readable, we display only the 15 largest combinations among the 7 most frequent evidence types.

library(UpSetR)
# fetch data for upset graph
evidence_categories_upset_08_12_query <- readLines("database/evidence_overlap_08_12.sql") %>%
  paste(collapse = "\n")
evidence_categories_upset_13_18_query <- readLines("database/evidence_overlap_13_18.sql") %>%
  paste(collapse = "\n")
evidence_categories_upset_08_12 <- dbGetQuery(con, evidence_categories_upset_08_12_query)
# export to csv
write_csv(evidence_categories_upset_08_12, "data/results_evidence_overlap_08_12.csv")
evidence_categories_upset_13_18 <- dbGetQuery(con, evidence_categories_upset_13_18_query)
# export to csv
write_csv(evidence_categories_upset_13_18, "data/results_evidence_overlap_13_18.csv")
evidence_categories_upset_df <- bind_rows(evidence_categories_upset_08_12, evidence_categories_upset_13_18) %>%
  group_by(ev_cat) %>%
  summarise(n = sum(number_of_articles))
# list with counts
evidence_categories_upset_list <- as.list(evidence_categories_upset_df$n)
# categories as list names
names(evidence_categories_upset_list) <- evidence_categories_upset_df$ev_cat
# convert to vector
evidence_categories_upset_expr <- unlist(evidence_categories_upset_list)
# draw the UpSet graph for the 7 largest sets and the 15 largest intersections
upset(
  fromExpression(evidence_categories_upset_expr),
  nsets = 7,
  nintersects = 15,
  order.by = "freq",
  show.numbers = FALSE,
  set_size.angles = 25
)

Figure 7: Most frequent combinations of evidence types. The barplot on the left displays the total number of articles per evidence type (“Set Size”). The central barplot shows the number of articles per overlap category (“Intersection Size”). Which evidence types contribute to each intersection is given by the black dots in the chart below.
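
The expression format passed to fromExpression() may be easier to grasp with a toy example. The sketch below uses made-up counts and shortened, hypothetical type names. It also shows, in base R, how the total per evidence type ("Set Size") relates to the counts per combination ("Intersection Size"). It is an illustration of the data structure, not a reimplementation of UpSetR.

```r
# Made-up counts; each name lists the evidence types of one combination,
# joined by "&", and each value is the number of articles with exactly
# that combination
toy_expr <- c(
  "free pdf"      = 5,
  "doaj"          = 2,
  "free pdf&doaj" = 3,
  "doaj&license"  = 4
)

# Split every combination into its member evidence types
members <- strsplit(names(toy_expr), "&", fixed = TRUE)
types <- sort(unique(unlist(members)))

# Set size: sum the counts of all combinations in which a type occurs
set_size <- vapply(
  types,
  function(type) {
    contains_type <- vapply(members, function(m) type %in% m, logical(1))
    sum(toy_expr[contains_type])
  },
  numeric(1)
)
set_size
```

A named vector such as `toy_expr` has the same structure as `evidence_categories_upset_expr` in the code above, which is handed to fromExpression() and then to upset().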

Discussion and Conclusion

In this blog post, we demonstrated how to analyse the Unpaywall data dump with Google BigQuery and R. Interfacing BigQuery with R has allowed us to integrate a high-performance and user-friendly database environment into our R data analytics workflow. Using this data management environment, we found 11.6 million journal articles published between 2008 and 2018 with open access full-texts, representing around one third of all articles indexed in Crossref for this period. Moreover, we found Unpaywall to be a suitable data source for open access analytics, because Unpaywall not only tags whether a publication is freely available, but also provides metadata describing how and where the open access full-text links were discovered.

Our Unpaywall data analysis revealed various open access location and evidence types, as well as large overlaps between them. Along with the likely influence of embargoed or delayed open access provision on some of these types, our analysis raises important questions about how to use Unpaywall data responsibly in bibliometric research and open access monitoring. Examining only Unpaywall’s best open access location favours publisher-provided open access, which, in turn, means that open access provided by repositories would be underestimated. Likewise, large overlaps between evidence categories can be observed. To allow for careful consideration, bibliometric research and open access monitoring must therefore be clear about how open access indicators were derived from Unpaywall.

In future, we will use these insights from our data analysis to work on a matching procedure between Unpaywall and the Web of Science in-house database from the German Competence Center for Bibliometrics in our OAUNI project. In doing so, we want to represent Unpaywall’s open access evidence as comprehensively as possible to allow for a pluralist view on open access to journal articles from researchers affiliated with German universities.

Acknowledgments

We are very grateful for helpful feedback on this post and ongoing discussions on the usage of Unpaywall data for bibliometric analyses with our project partners in Bielefeld, Niels Taubert and Elham Iravani, the related project OASE, in particular Nicholas Fraser and Philipp Mayr-Schlegel, Stephan Stahlschmidt and Aliakbar Akbaritabar from the DZHW, as well as Daniel Bangert and Birgit Schmidt from the SUB Göttingen.

We acknowledge financial support from the Federal Ministry of Education and Research of Germany (BMBF) in the framework Quantitative research on the science sector (Project: “OAUNI Entwicklung und Einflussfaktoren des Open-Access-Publizierens an Universitäten in Deutschland,” Förderkennzeichen: 01PU17023A).

References

Björk, Bo-Christer, Mikael Laakso, Patrik Welling, and Patrik Paetau. 2013. “Anatomy of Green Open Access.” Journal of the Association for Information Science and Technology 65 (2): 237–50. https://doi.org/10.1002/asi.22963.

Conway, Jake R., Alexander Lex, and Nils Gehlenborg. 2017. “UpSetR: An R Package for the Visualization of Intersecting Sets and Their Properties.” Bioinformatics 33 (18): 2938–40. https://doi.org/10.1093/bioinformatics/btx364.

Laakso, Mikael, and Bo-Christer Björk. 2013. “Delayed Open Access: An Overlooked High-Impact Category of Openly Available Scientific Literature.” Journal of the American Society for Information Science and Technology 64 (7): 1323–29. https://doi.org/10.1002/asi.22856.

Lex, A., N. Gehlenborg, H. Strobelt, R. Vuillemot, and H. Pfister. 2014. “UpSet: Visualization of Intersecting Sets.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983–92. https://doi.org/10.1109/TVCG.2014.2346248.

Lex, Alexander, and Nils Gehlenborg. 2014. “Points of View: Sets and Intersections.” Nature Methods 11 (July): 779. https://doi.org/10.1038/nmeth.3033.

Piwowar, Heather, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. 2018. “The State of OA: A Large-Scale Analysis of the Prevalence and Impact of Open Access Articles.” PeerJ 6: e4375. https://doi.org/10.7717/peerj.4375.

Suber, Peter. 2012. Open Access. MIT Press. https://mitpress.mit.edu/books/open-access.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc. https://r4ds.had.co.nz/.

Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media, Inc. https://serialmentor.com/dataviz/.
