We investigated more than 31 million scholarly journal articles published between 2008 and 2018 that are indexed in Unpaywall, a widely used open access discovery tool. Using Google BigQuery and R, we identified over 11.6 million journal articles with open access full-text links in Unpaywall, corresponding to an open access share of 37 %. Our data analysis revealed various open access location and evidence types, as well as large overlaps between them, raising important questions about how to responsibly re-use Unpaywall data in bibliometric research and open access monitoring.
Unpaywall, developed and maintained by the team of Impactstory, finds open access copies of scholarly literature (Piwowar et al. 2018). Querying Unpaywall’s REST API with DOIs returns not only open access full-text links, but also helpful metadata about the open access status of publications indexed in Crossref, a DOI registration agency. While the API only allows retrieving a limited number of records, Unpaywall also offers database snapshots for large-scale analysis, which more and more bibliometric databases and open access monitoring services utilise. However, documentation about how database and service providers work with these dumps is hard to find.
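For a single DOI, such an API lookup can be sketched with httr as follows; the DOI serves only as an example and the email address is a placeholder, since Unpaywall asks clients to identify themselves via an email parameter:

library(httr)
library(jsonlite)
# fetch the Unpaywall record for one DOI
resp <- GET("https://api.unpaywall.org/v2/10.1038/nature12373",
            query = list(email = "name@example.com"))
upw_record <- fromJSON(content(resp, as = "text"))
# TRUE if at least one open access full-text was found
upw_record$is_oa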
In this blog post, we describe how we loaded Unpaywall’s data dump into Google BigQuery, a cloud-based service that allows fast analysis of large datasets, and how we interfaced BigQuery for our analysis with R. We wanted to know the extent of open access status information in Unpaywall and, in particular, how this information can be utilised for bibliometric research. In our case, we intend to match open access status information from Unpaywall with the Web of Science in-house database from the German Competence Center for Bibliometrics to determine factors influencing open access publication activities among German universities as part of our BMBF-funded research project OAUNI.
The Unpaywall data dump from February 2019 comprises more than 100 million records amounting to a file size of more than 100 GB. Working with datasets of this size is generally non-trivial. We therefore chose Google’s BigQuery as a cloud-based solution. From our perspective, BigQuery has several advantages. Firstly, it is a highly performant tool that enables us to query and manipulate large datasets very quickly. We would not be able to achieve comparable performance with a local database deployed on our standard laptops. Secondly, using this cloud-based service allows us to share access to our database with colleagues and collaborators. Finally, the existing interfaces to BigQuery from R allow us to incorporate this environment into our familiar data analytics workflow.
As a preparatory step, we loaded the entire dataset into a local MongoDB database, exported the fields and rows relevant for the study of the open access status of scholarly output as compressed JSON Lines files, and uploaded them to Google Cloud Storage. To import these files into BigQuery, we had to specify a schema, which we share in the source code repository of this blog. We used the BigQuery user interface, where the files are automatically decompressed and the corresponding tables are created.
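As an alternative to the user interface, the import can also be scripted from R with bigrquery’s job functions. This is a sketch only; the project, dataset, bucket and file names are placeholders, and the schema specification is omitted for brevity:

library(bigrquery)
# destination table (placeholder identifiers)
upw_table <- bq_table("my-gcp-project", "unpaywall", "upw_08_12")
# load compressed JSON Lines files from Google Cloud Storage
bq_table_load(
  upw_table,
  source_uris = "gs://my-bucket/upw_08_12_*.jsonl.gz",
  source_format = "NEWLINE_DELIMITED_JSON",
  write_disposition = "WRITE_TRUNCATE"
)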
In R, we interface our Unpaywall dataset stored in Google BigQuery with the packages DBI and bigrquery. Our BigQuery project has two tables, one containing all records between 2008 and 2012, and another for more recent works published since 2013. When connecting with tbl() from dplyr, Google asks us to log in via a web browser or to supply a private access token to interface our access-restricted database.
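A minimal connection sketch, with placeholder project and dataset names:

library(DBI)
library(bigrquery)
library(dplyr)
# connect to the BigQuery dataset holding the Unpaywall tables
con <- dbConnect(bigrquery::bigquery(),
                 project = "my-gcp-project",
                 dataset = "unpaywall")
# lazy table handles for the two period tables
upw_08_12 <- tbl(con, "upw_08_12")
upw_13_19 <- tbl(con, "upw_13_19")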
bigrquery allows querying BigQuery tables using SQL or dplyr functions. The latter is convenient for us, because we have just started to learn SQL, but feel more experienced in the tidyverse, a popular collection of R packages following the Wickham-Grolemund approach to practise data science (Wickham and Grolemund 2017). Here’s an example where we call BigQuery with dplyr, which is part of the tidyverse, to obtain the first ten records from 2018. We restrict our search to journal articles, the most common genre in Unpaywall.
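A call like the following produces such a result (a sketch, using the upw_13_19 table handle from above):

upw_13_19 %>%
  # journal articles published in 2018
  filter(genre == "journal-article", year == 2018) %>%
  # first ten records; translated to a SQL LIMIT clause
  head(10)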
#> # Source: lazy query [?? x 13]
#> # Database: BigQueryConnection
#>     year genre     updated             published_date journal_is_in_d…
#>    <int> <chr>     <dttm>              <date>         <lgl>
#> 1 2018 journal-… 2018-06-20 21:37:24 2018-01-01 FALSE
#> 2 2018 journal-… 2019-01-26 20:22:21 2018-05-07 TRUE
#> 3 2018 journal-… 2018-06-19 15:09:52 2018-03-23 TRUE
#> 4 2018 journal-… 2019-01-15 22:31:55 2018-01-12 FALSE
#> 5 2018 journal-… 2018-06-21 16:16:44 2018-04-01 FALSE
#> 6 2018 journal-… 2018-06-18 19:20:28 2018-05-10 TRUE
#> 7 2018 journal-… 2018-06-18 20:12:06 2018-01-01 FALSE
#> 8 2018 journal-… 2019-01-15 13:40:41 2018-01-04 FALSE
#> 9 2018 journal-… 2019-02-11 09:39:25 2018-12-14 TRUE
#> 10 2018 journal-… 2018-12-09 03:14:10 2018-12-03 TRUE
#> # … with 8 more variables: journal_is_oa <lgl>, journal_issns <chr>,
#> # oa_locations <list>, doi <chr>, is_oa <lgl>, publisher <chr>,
#> # journal_name <chr>, data_standard <int>
Notice that our schema follows the Unpaywall data format. However, we excluded the large data field z-authors. Moreover, we did not consider the fields title, doi-url, which is redundant to the doi field, and best-oa-location, which is derived from the open access location object.
In this blog post, we will examine the following open access indicators:

is_oa: A logical value indicating whether an open access version of the article was found.

journal_is_in_doaj: A logical value indicating whether an article was published in a journal registered in the Directory of Open Access Journals (DOAJ).

The column oa_locations is a list-column that contains individual metadata about all open access full-text links found per article. By definition, open access provision is not limited to one route; multiple copies of an article can be made freely available at the same time through various means (Suber 2012). Here are the three data variables from the oa_locations object that we will focus on (a short sketch showing how to inspect them follows the list):

is_best: A logical value, defined by Unpaywall’s algorithm, that flags the most relevant open access location. The algorithm prioritises publisher-hosted content.

host_type: Is the open access full-text provided by a publisher or a repository?

evidence: How did Unpaywall find the open access full-text?
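A minimal sketch for inspecting these nested fields from R, assuming the upw_13_19 table handle from above and that collect() returns oa_locations as a list of data frames:

library(tidyr)
upw_13_19 %>%
  filter(is_oa, genre == "journal-article") %>%
  select(doi, oa_locations) %>%
  head(3) %>%
  collect() %>%
  # expand the list-column: one row per open access location
  unnest(oa_locations) %>%
  select(doi, is_best, host_type, evidence)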
Open access prevalence (is_oa)

To start with, we retrieve the number and proportion of journal articles with an open access full-text published between 2008 and 2018 using Unpaywall’s most basic open access indicator, is_oa, a logical value which is TRUE when at least one open access full-text was found. After filtering and summarising the is_oa observations by year with dplyr, collect() from dbplyr, the dplyr database backend, loads the aggregated data from BigQuery into a local tibble. We use the lubridate package to transform the year variable into a date object.
library(lubridate)
oa_08_12 <- upw_08_12 %>%
  # query and aggregate with dplyr
filter(genre == "journal-article") %>%
group_by(year, is_oa) %>%
summarise(n = n()) %>%
# load the data into a local tibble
collect()
oa_13_18 <- upw_13_19 %>%
  # query and aggregate with dplyr
filter(genre == "journal-article", year < 2019) %>%
group_by(year, is_oa) %>%
summarise(n = n()) %>%
# load the data into a local tibble
collect()
my_df <- bind_rows(oa_08_12, oa_13_18) %>%
# calculate proportion per year
ungroup() %>%
mutate(year = lubridate::ymd(paste0(year, "-01-01"))) %>%
group_by(year, is_oa) %>%
summarise(n = sum(n)) %>%
mutate(prop = n / sum(n))
my_df
#> # A tibble: 22 × 4
#> # Groups: year [11]
#> year is_oa n prop
#> <date> <lgl> <int> <dbl>
#> 1 2008-01-01 FALSE 1454745 0.711
#> 2 2008-01-01 TRUE 590250 0.289
#> 3 2009-01-01 FALSE 1562765 0.701
#> 4 2009-01-01 TRUE 665951 0.299
#> 5 2010-01-01 FALSE 1724760 0.697
#> 6 2010-01-01 TRUE 749139 0.303
#> 7 2011-01-01 FALSE 1585092 0.654
#> 8 2011-01-01 TRUE 838014 0.346
#> 9 2012-01-01 FALSE 1636130 0.630
#> 10 2012-01-01 TRUE 962036 0.370
#> # … with 12 more rows
In total, 31,159,960 journal articles published between 2008 and 2018 were included in Unpaywall. For 11,633,886 of these articles (37 %), Unpaywall was able to link the DOI to at least one freely available full-text. This means that around every third scholarly journal article published since 2008 is currently openly available.
Next, let’s plot the prevalence of open access to journal articles over time using the data visualisation package ggplot2, which is also part of the tidyverse. To make our ggplot object interactive, we turn it into a plotly chart using ggplotly(); plotly is a JavaScript charting library with an R interface. The tooltip presents the total number and percentage for each category and year. We use the package scales to format the y-axis.
library(scales)
plot_a <- my_df %>%
# prepare label that we want to present as tooltip
mutate(`Proportion in %` = round(prop * 100, 2)) %>%
ggplot(aes(year, n, label = `Proportion in %`)) +
geom_area(aes(fill = is_oa, group = is_oa), alpha = 0.8) +
labs(x = "Year published", y = "Journal Articles",
title = "Open Access to Journal Articles") +
scale_fill_manual("Is OA?",
values = c("#b3b3b3a0", "#56B4E9")) +
scale_x_date(date_labels = "%y") +
scale_y_continuous(labels = scales::number_format(big.mark = " ")) +
theme_minimal(base_family = "Roboto") +
theme(plot.margin = margin(30, 30, 30, 30)) +
theme(panel.grid.minor = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
theme(panel.border = element_blank())
# turn ggplot object into interactive plotly chart
plotly::ggplotly(plot_a, tooltip = c("label", "y"))
While a general growth in the number of journal articles and in open access provision can be observed, there is a considerable decline in the number of journal articles published in 2018, presumably because of an indexing lag between Crossref and Unpaywall. The decline in open access full-text availability is even more pronounced, suggesting that some open access content is only provided after a certain period of time.
Open access location types (host_type)

Using Unpaywall’s open access location types allows for a more detailed analysis of open access provision. In the following, we explore the variable host_type, which shows whether Unpaywall found the open access full-text on a publisher’s website or in a repository. Furthermore, we specifically highlight articles from fully open access journals that are indexed in the Directory of Open Access Journals (DOAJ), as indicated by the journal_is_in_doaj variable. As a start, we only examine the best open access location per DOI, flagged by is_best. As said before, this variable is defined by Unpaywall’s algorithm, which prioritises publisher-hosted content.
Instead of dplyr, we now query BigQuery with SQL. Beforehand, we built and tested the SQL queries in the BigQuery user interface. The SQL code is stored in separate files (host_type_08_12.sql and host_type_13_18.sql), which we share in the source code repository of this blog.
After calling BigQuery using SQL with the DBI interface, we bind the two resulting data frames into one. Using case_when() from dplyr, we create a host column distinguishing between “DOAJ-listed Journal”, “Other Journals” and “Repositories only” open access provision.
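Presumably, the query files are read in the same way as shown later in this post; a sketch, assuming the file paths from the repository:

# read the SQL queries from the files shared in the repository
host_type_08_12_query <- readLines("database/host_type_08_12.sql") %>%
  paste(collapse = "")
host_type_13_18_query <- readLines("database/host_type_13_18.sql") %>%
  paste(collapse = "")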
host_type_08_12_query_df <- dbGetQuery(con, host_type_08_12_query)
host_type_13_18_query_df <- dbGetQuery(con, host_type_13_18_query)
host_type_df <-
bind_rows(host_type_08_12_query_df, host_type_13_18_query_df) %>%
mutate(
host = case_when(
journal_is_in_doaj == TRUE ~ "DOAJ-listed Journal",
host_type == "publisher" ~ "Other Journals",
host_type == "repository" ~ "Repositories only"
)
) %>%
mutate(year = lubridate::ymd(paste0(year, "-01-01")))
host_type_df
#> # A tibble: 34 × 5
#> year host_type journal_is_in_doaj number_of_articles host
#> <date> <chr> <lgl> <int> <chr>
#> 1 2011-01-01 publisher TRUE 170917 DOAJ-l…
#> 2 2011-01-01 repository FALSE 178323 Reposi…
#> 3 2012-01-01 repository FALSE 188662 Reposi…
#> 4 2012-01-01 publisher TRUE 219293 DOAJ-l…
#> 5 2012-01-01 publisher FALSE 554081 Other …
#> 6 2008-01-01 publisher FALSE 358552 Other …
#> 7 2008-01-01 publisher TRUE 81731 DOAJ-l…
#> 8 2008-01-01 repository FALSE 149967 Reposi…
#> 9 2010-01-01 publisher FALSE 436375 Other …
#> 10 2010-01-01 repository FALSE 173419 Reposi…
#> # … with 24 more rows
To explore our data, we follow Claus Wilke’s excellent book “Fundamentals of Data Visualization” (Wilke 2019) and visualise our proportions separately as parts of the total. Again, our final ggplot graphic is transformed into an interactive plotly chart.
# calculate all oa articles per year
all_articles <- host_type_df %>%
ungroup() %>%
group_by(year) %>%
summarise(number_of_articles = sum(number_of_articles))
plot_b <-
ggplot(host_type_df, aes(x = year, y = number_of_articles, text = paste0("Publication year: ", lubridate::year(year)))) +
geom_bar(
data = all_articles,
aes(fill = "All OA Articles"),
color = "transparent",
stat = "identity"
) +
geom_bar(aes(fill = "by Host"), color = "transparent", stat = "identity") +
facet_wrap( ~ host, nrow = 1) +
scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "") +
labs(x = "Year", y = "OA Articles (Total)", title = "Open Access to Journal Articles by Unpaywall host") +
theme(legend.position = "top",
legend.justification = "right") +
scale_x_date(date_labels = "%y") +
scale_y_continuous(labels = scales::number_format(big.mark = " ")) +
theme_minimal(base_family = "Roboto") +
theme(plot.margin = margin(30, 30, 30, 30)) +
theme(panel.grid.minor = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
theme(panel.border = element_blank())
# turn ggplot object into interactive plotly chart
plotly::ggplotly(plot_b, tooltip = c("y", "text"))
The figure shows that most publisher-provided open access links were obtained from journals that were not indexed in the DOAJ: these are 6,531,822 articles, representing 56 % of all journal articles with openly available full-text identified by Unpaywall.
While the number of publications in DOAJ-indexed journals is rising constantly, open access provided by other journal types and by repositories declined from 2017 to 2018. Indeed, a considerable number of journals delay open access provision (Laakso and Björk 2013). A prominent example is the journal Cell, where all articles are made freely available after an embargo period of twelve months. Also, self-archiving in repositories is often subject to embargo periods imposed by publishers, or researchers upload their publications later (Björk et al. 2013). A more detailed analysis of delayed open access, however, is challenging using Unpaywall data alone, because Unpaywall has so far not tracked the point in time when articles were made open access.
Open access evidence types (evidence)

The evidence field of the oa_locations object contains more detailed open access status information. We again use SQL queries stored in separate files (evidence_08_12.sql and evidence_13_18.sql), to be found in the source code repository, and create a data frame with the relevant fields.
library(tidyverse)
# define queries
evidence_08_12_query <- readLines("database/evidence_08_12.sql") %>%
paste(collapse = "")
evidence_13_18_query <- readLines("database/evidence_13_18.sql") %>%
paste(collapse = "")
# fetch records and bind them to one data frame
evidence_08_12 <- dbGetQuery(con, evidence_08_12_query)
evidence_13_18 <- dbGetQuery(con, evidence_13_18_query)
evidence_df <- bind_rows(evidence_08_12, evidence_13_18) %>%
ungroup() %>%
mutate(year = lubridate::ymd(paste0(year, "-01-01")))
The evidence field indicates how Unpaywall found the article at a specific location and identified it as open access, for example via PubMed Central or via license information from Crossref.
For each evidence type we want to see how many articles were identified as open access in this way. To this end, we created the following table that shows the total number of articles per evidence type as well as their proportion and cumulative proportions with respect to the total number of all articles.
# calculate numbers and proportion of articles per evidence type
evidence_df %>%
group_by(evidence) %>%
summarize(N_records = sum(number_of_articles)) %>%
arrange(desc(N_records)) %>%
mutate(
prop = N_records / sum(N_records) * 100,
cum_prop = cumsum(prop)
) -> articles_per_type_df
articles_per_type_df %>%
knitr::kable(
col.names = c(
"Evidence Types",
"Number of Articles",
"Proportion of all Articles in %",
"Cumulative Proportion in %"
),
    format.args = list(big.mark = ","),
caption = "Number of articles per evidence type. Columns show the total number per evidence type, the proportion of articles of this type with respect to the number of all articles and the cumulative proportion of articles associated with any of the above evidence types."
)
| Evidence Types | Number of Articles | Proportion of all Articles in % | Cumulative Proportion in % |
|---|---:|---:|---:|
| open (via free pdf) | 4,415,372 | 21.84 | 22 |
| open (via page says license) | 3,404,452 | 16.84 | 39 |
| oa repository (via OAI-PMH doi match) | 3,247,290 | 16.06 | 55 |
| oa journal (via doaj) | 2,960,898 | 14.65 | 69 |
| oa repository (via OAI-PMH title and first author match) | 2,704,088 | 13.38 | 83 |
| oa repository (via pmcid lookup) | 2,103,323 | 10.41 | 93 |
| open (via crossref license) | 985,336 | 4.87 | 98 |
| open (via page says Open Access) | 126,113 | 0.62 | 99 |
| open (via crossref license, author manuscript) | 82,079 | 0.41 | 99 |
| oa journal (via publisher name) | 63,234 | 0.31 | 99 |
| oa repository (via OAI-PMH title and last author match) | 60,535 | 0.30 | 100 |
| oa repository (via OAI-PMH title match) | 57,106 | 0.28 | 100 |
| oa repository (via doi prefix) | 3,583 | 0.02 | 100 |
| oa journal (via issn in doaj) | 392 | 0.00 | 100 |
| manual | 27 | 0.00 | 100 |
The eight least frequent categories at the bottom of the table, each accounting for less than 1 % of all articles, together make up only 1.9 % of the total. We therefore aggregate these evidence types into the category Other in the following.
# collate least frequent articles as 'Other'
articles_per_type_df %>%
mutate(evidence = as_factor(evidence)) %>%
mutate(evidence_grouped = fct_relevel(fct_other(evidence, keep = evidence[.$prop > 1]), "Other")) %>%
group_by(evidence_grouped) %>%
summarize(number_of_articles = sum(N_records)) %>%
mutate(
prop = number_of_articles / sum(number_of_articles) * 100,
cum_prop = cumsum(number_of_articles) / sum(number_of_articles) *
100
) -> articles_per_type_grouped_df
# group according to categorization with "Other"
evidence_grouped_df <- evidence_df %>%
mutate(evidence = as_factor(evidence)) %>%
mutate(
evidence_grouped = factor(
fct_other(evidence, keep = articles_per_type_grouped_df$evidence_grouped),
levels = articles_per_type_grouped_df$evidence_grouped
)
) %>%
mutate(evidence_grouped = fct_relevel(fct_rev(evidence_grouped), "Other")) %>%
group_by(evidence_grouped, is_best, year) %>%
summarize(number_of_articles = sum(number_of_articles))
So far, we have only examined the best open access location per DOI, indicated by is_best, a variable defined by Unpaywall’s algorithm, which prioritises publisher-hosted content. However, evidence types in Unpaywall are not exclusive categories. On the contrary, many records are associated with several evidence types, because Unpaywall found various ways to openly access the full-texts. For this reason, the following figure distinguishes whether a given evidence type is classified as the best open access location by Unpaywall or not. It is clearly visible that Unpaywall prioritises publisher-hosted content (open, oa_journal) over repository depositions (oa_repository), as stated on their website. However, the figure also shows that a free pdf version on the publisher’s website (possibly even without licensing information) is prioritised over journal-level classifications such as being indexed in the DOAJ.
evidence_grouped_df %>%
group_by(evidence_grouped, is_best) %>%
summarize(number_of_articles = sum(number_of_articles)) %>%
#create plot
ggplot(aes(x = evidence_grouped, y = number_of_articles, fill = is_best)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "Is best?") +
theme_minimal(base_family = "Roboto") +
theme(plot.margin = margin(30, 30, 30, 30)) +
theme(panel.grid.minor = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(panel.grid.major.y = element_blank()) +
theme(panel.border = element_blank()) +
coord_flip() +
labs(y = "Number of Open Access Articles", x = "Evidence Type",
title = "Number of Open Access Articles per Unpaywall Evidence Type") -> plot_ev_types_is_best
#create interactive plot
plotly::ggplotly(plot_ev_types_is_best, tooltip = c("y"))
To investigate the development of the most prevalent evidence types over time, we use a faceted graph. We again observe declines in the proportion of repository-based evidence types that are chosen as best location.
# use collated data frame
evidence_grouped_df %>%
ggplot(aes(year, number_of_articles, fill = is_best, text = paste0("Publication year: ", lubridate::year(year)))) +
geom_bar(stat = "identity") +
facet_wrap( ~ fct_rev(evidence_grouped), ncol = 2) +
scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "Is best?") +
theme_minimal(base_family = "Roboto") +
theme(panel.grid.minor = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
theme(panel.border = element_blank()) +
scale_x_date(date_labels = "%y") +
labs(x = "Publication Year", y = "Number of Open Access Articles",
title = "Unpaywall Open Access Evidence Categories per Year") -> plot_ev_types_per_year
#create interactive plot
plotly::ggplotly(plot_ev_types_per_year, tooltip = c("y", "text")) -> plotlyfacetfig
# move y-axes label to the left to ensure readability
plotlyfacetfig[['x']][['layout']][['annotations']][[2]][['x']] <- -0.08
plotlyfacetfig
Many open access articles are accessible through a number of locations, including the publisher’s website as well as one or more open access repositories. Unpaywall describes not only one, but all open access full-texts it discovers with useful metadata. In the following, we will analyse whether and to what extent the various open access indicators intersect. We start with an analysis of the overlap between host types, followed by determining set intersections between Unpaywall’s evidence types.
To identify articles that are provided both by publishers and by repositories, we use the BigQuery SQL function STRING_AGG to create a new variable that concatenates the different host_types per open access article (for more details, see the full SQL queries host_type_intersect_08_12.sql and host_type_intersect_13_18.sql).
Again, we call BigQuery and load the aggregated data into our local R session. This time, we do not want to present the total number of open access publications, but their relative share. We already obtained the total yearly number of articles, which is stored in the my_df data frame.
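As before, we assume the queries are read from the repository files. The comment sketches what the STRING_AGG aggregation might look like; the field names are assumptions based on the Unpaywall schema:

# read the SQL queries from the files shared in the repository
host_type_intersect_08_12_query <- readLines("database/host_type_intersect_08_12.sql") %>%
  paste(collapse = "")
host_type_intersect_13_18_query <- readLines("database/host_type_intersect_13_18.sql") %>%
  paste(collapse = "")
# the core of such a query could look like:
#   SELECT year, host_type_count, COUNT(*) AS number_of_articles
#   FROM (
#     SELECT doi, year,
#       (SELECT STRING_AGG(DISTINCT host_type ORDER BY host_type)
#        FROM UNNEST(oa_locations)) AS host_type_count
#     FROM upw_08_12 WHERE is_oa)
#   GROUP BY year, host_type_count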
host_type_08_12_intersect_df <-
dbGetQuery(con, host_type_intersect_08_12_query)
host_type_13_18_intersect_df <-
dbGetQuery(con, host_type_intersect_13_18_query)
host_type_intersect <-
bind_rows(host_type_08_12_intersect_df, host_type_13_18_intersect_df) %>%
mutate(year = lubridate::ymd(paste0(year, "-01-01"))) %>%
mutate(
host = case_when(
host_type_count == "publisher" ~ "Publisher only",
host_type_count == "publisher,repository" ~ "Publisher & Repository",
host_type_count == "repository" ~ "Repositories only"
)
) %>%
mutate(host = factor(
host,
levels = c("Publisher only", "Publisher & Repository", "Repositories only")
))
# obtain yearly publication volumes
host_type_intersect <- my_df %>%
group_by(year) %>%
summarise(all_articles = sum(n)) %>%
# join with host type figures
right_join(host_type_intersect, by = "year") %>%
# calculate proportion
mutate(prop = number_of_articles / all_articles)
host_type_intersect
#> # A tibble: 33 × 6
#> year all_articles host_type_count number_of_artic… host
#> <date> <int> <chr> <int> <fct>
#> 1 2008-01-01 2044995 publisher 312392 Publ…
#> 2 2008-01-01 2044995 repository 149967 Repo…
#> 3 2008-01-01 2044995 publisher,repository 127891 Publ…
#> 4 2009-01-01 2228716 publisher 347502 Publ…
#> 5 2009-01-01 2228716 repository 165372 Repo…
#> 6 2009-01-01 2228716 publisher,repository 153077 Publ…
#> 7 2010-01-01 2473899 publisher 396449 Publ…
#> 8 2010-01-01 2473899 publisher,repository 179271 Publ…
#> 9 2010-01-01 2473899 repository 173419 Repo…
#> 10 2011-01-01 2423106 publisher 444220 Publ…
#> # … with 23 more rows, and 1 more variable: prop <dbl>
Let’s visualise the host type distribution including the overlap between publisher and repository-provided open access:
# get overall oa share
host_type_all <- host_type_intersect %>%
group_by(year) %>%
summarise(prop = sum(prop))
# make a ggplot graphic
plot_host_intersect <-
ggplot(host_type_intersect, aes(x = year, y = prop, text = paste0("Publication year: ", year(year)))) +
geom_bar(
data = host_type_all,
aes(fill = "All OA Articles"),
color = "transparent",
stat = "identity"
) +
geom_bar(aes(fill = "by Host"), color = "transparent", stat = "identity") +
facet_wrap(~ host, nrow = 1) +
scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "") +
labs(x = "Year", y = "OA Share",
title = "Overlap between Open Access Host Types in Unpaywall") +
scale_x_date(date_labels = "%y") +
scale_y_continuous(labels = scales::percent_format(accuracy = 5L)) +
theme_minimal(base_family = "Roboto") +
theme(plot.margin = margin(30, 30, 30, 30)) +
theme(panel.grid.minor = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
theme(panel.border = element_blank())
# turn ggplot object into interactive plotly chart
plotly::ggplotly(plot_host_intersect, tooltip = c("y", "text"))
The figure shows that Unpaywall found most open access full-texts on publishers’ websites (82 %). 56 % of all open access full-texts were not archived in a repository. The overlap of open access provided by both routes also deserves attention: around 26 % of all open access articles were accessible from both publishers’ websites and repositories.
The categorisation of evidence types is not exclusive either: many records are associated with more than one evidence type.
To better understand this overlap, we generate a new column specifying all found combinations of evidence types using concatenation with the SQL function STRING_AGG. The SQL queries we use (evidence_single_cat_08_12.sql and evidence_single_cat_13_18.sql) can be found in the source code repository.
library(tidyverse)
# define queries
evidence_single_cat_08_12_query <- readLines("database/evidence_single_cat_08_12.sql") %>%
paste(collapse = "")
evidence_single_cat_13_18_query <- readLines("database/evidence_single_cat_13_18.sql") %>%
paste(collapse = "")
# fetch records and bind them to one data frame
evidence_categories_08_12 <- dbGetQuery(con, evidence_single_cat_08_12_query)
evidence_categories_13_18 <- dbGetQuery(con, evidence_single_cat_13_18_query)
evidence_categories_df <- bind_rows(evidence_categories_08_12, evidence_categories_13_18) %>%
group_by(ev_cat) %>%
summarize(number_of_articles = sum(number_of_articles)) %>%
arrange(desc(number_of_articles))
evidence_categories_df
#> # A tibble: 411 × 2
#> ev_cat number_of_artic…
#> <chr> <int>
#> 1 open (via free pdf) 3290683
#> 2 oa repository (via OAI-PMH title and first author… 945958
#> 3 open (via page says license) 910028
#> 4 oa journal (via doaj)&open (via page says license) 687962
#> 5 open (via crossref license) 546154
#> 6 oa journal (via doaj) 515453
#> 7 oa journal (via doaj)&oa repository (via OAI-PMH … 513620
#> 8 oa repository (via OAI-PMH doi match)&oa reposito… 462086
#> 9 oa repository (via OAI-PMH doi match) 462071
#> 10 oa repository (via OAI-PMH doi match)&oa reposito… 260696
#> # … with 401 more rows
We first illustrate, for each evidence type (again collating the least frequent types in the category Other), the number of articles that correspond exclusively to this type and no other.
#determine number of articles corresponding only to one evidence type
evidence_single_cat_df <- evidence_df %>%
group_by(evidence) %>%
summarize(number_of_articles = sum(number_of_articles)) %>%
left_join(evidence_categories_df, by = c("evidence" = "ev_cat")) %>%
rename(number_of_articles = number_of_articles.x, number_of_single_cat = number_of_articles.y) %>%
mutate(number_of_articles = replace_na(number_of_articles, 0),
number_of_single_cat = replace_na(number_of_single_cat, 0))
#aggregate least frequent types as category `Other`
evidence_single_cat_grouped_df <- evidence_single_cat_df %>%
ungroup() %>%
mutate(evidence = as_factor(evidence)) %>%
mutate(
evidence_grouped = factor(
fct_other(evidence, keep = articles_per_type_grouped_df$evidence_grouped),
levels = articles_per_type_grouped_df$evidence_grouped
)
) %>%
mutate(evidence_grouped = fct_relevel(fct_rev(evidence_grouped), "Other")) %>%
group_by(evidence_grouped) %>%
summarize(
number_of_articles = sum(number_of_articles),
number_of_single_cat = sum(number_of_single_cat)
) %>%
#arrange data in order to enable stacked barplots
mutate(multiple = number_of_articles-number_of_single_cat, single = number_of_single_cat) %>%
select(evidence_grouped, single, multiple) %>%
gather(is_single, number_of_articles, -evidence_grouped) %>%
mutate(is_single = case_when(
is_single == "single" ~ TRUE,
is_single == "multiple" ~ FALSE
)) %>%
#rename column for correct display in plot
rename(proportion = number_of_articles)
#create aggregated proportions barplot
evidence_single_cat_grouped_df %>%
ggplot(aes(x = evidence_grouped, y = proportion, fill = is_single)) +
geom_bar(stat = "identity", position = "fill") +
scale_fill_manual(values = c("#b3b3b3a0", "#56B4E9"), name = "Is unique?") +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal(base_family = "Roboto") +
theme(plot.margin = margin(30, 30, 30, 30)) +
theme(panel.grid.minor = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(panel.grid.major.y = element_blank()) +
theme(panel.border = element_blank()) +
coord_flip() +
theme(legend.position = "top",
legend.justification = "right") +
labs(y = "Proportion of Articles", x = "Evidence Type",
title = "Proportion of Articles per Evidence Type") -> plot_ev_types_is_single_prop
#create interactive plot
plotly::ggplotly(plot_ev_types_is_single_prop, tooltip = c("y"))
It is interesting to see that the evidence type which most often appears as a unique form of open access provision is an openly available pdf on the publisher’s website, meaning that no other evidence, like license information from Crossref, was found. This confirms that Unpaywall’s open access detection has benefited from scraping publishers’ websites.
Moreover, a substantial number of articles is found only through repository-based evidence sources and is hence available only via the green route. Still, the figure shows a phenomenon that we also observed for the host_type, namely that repository-based evidence types often overlap with other evidence types.
Because of Unpaywall’s prioritisation of publisher-provided open access, caution is in order when only the best_oa_location is used for categorisation: there are publisher-based evidence types that may not comply with how funders define open access journals, particularly with regard to license statements. On the other hand, a large number of articles seems to be identified as open access only through license information from Crossref, without an associated free pdf having been found. However, we are unsure whether Unpaywall performs all open access identification procedures for every single record indexed in Crossref, which would allow for such a comparison.
Visualising the occurring intersections of multiple evidence types is difficult. Following Lex and Gehlenborg (2014) and Lex et al. (2014), we created an UpSet figure with the UpSetR package (Conway, Lex, and Gehlenborg 2017) to examine in more detail which of the 411 occurring combinations of evidence types, including singletons, show up most frequently and how large these groups are. In theory, up to 32,768 combinations of the 15 evidence types would be possible.
To start with, we interface BigQuery and retrieve how often the combinations of evidence types occur. We store the corresponding SQL queries (evidence_overlap_08_12.sql and evidence_overlap_13_18.sql) and share the resulting dataset as .csv files (results_evidence_overlap_08_12.csv and results_evidence_overlap_13_18.csv). Next, we transform the data into an UpSetR-compatible expression format, resulting in a named integer vector. Lastly, the upset() function generates the graph. To keep the resulting figure readable, we only display the 15 most frequent combinations of the 7 most frequent evidence types.
library(UpSetR)
# fetch data for upset graph
evidence_categories_upset_08_12_query <- readLines("database/evidence_overlap_08_12.sql") %>%
paste(collapse = "")
evidence_categories_upset_13_18_query <- readLines("database/evidence_overlap_13_18.sql") %>%
paste(collapse = "")
evidence_categories_upset_08_12 <- dbGetQuery(con, evidence_categories_upset_08_12_query)
# export to csv
write_csv(evidence_categories_upset_08_12, "data/results_evidence_overlap_08_12.csv")
evidence_categories_upset_13_18 <- dbGetQuery(con, evidence_categories_upset_13_18_query)
# export to csv
write_csv(evidence_categories_upset_13_18, "data/results_evidence_overlap_13_18.csv")
evidence_categories_upset_df <- bind_rows(evidence_categories_upset_08_12, evidence_categories_upset_13_18) %>%
group_by(ev_cat) %>%
summarise(n = sum(number_of_articles))
# list with counts
evidence_categories_upset_list <- as.list(evidence_categories_upset_df$n)
# categories as list names
names(evidence_categories_upset_list) <- evidence_categories_upset_df$ev_cat
# convert to vector
evidence_categories_upset_expr <- unlist(evidence_categories_upset_list)
upset(fromExpression(evidence_categories_upset_expr), nsets = 7, nintersects = 15, order.by = "freq", show.numbers = FALSE, set_size.angles = 25)
In this blog post, we demonstrated how to analyse the Unpaywall data dump with Google BigQuery and R. Interfacing BigQuery with R has allowed us to integrate a high-performance and user-friendly database environment into our R data analytics workflow. Using this data management environment, we found 11.6 million journal articles published between 2008 and 2018 with open access full-texts, representing around one third of all articles indexed in Crossref for this period. Moreover, we found Unpaywall to be a suitable data source for open access analytics, because Unpaywall not only tags whether a publication is freely available, but also provides metadata describing how and where the open access full-text links were discovered.
Our Unpaywall data analysis revealed various open access location and evidence types, as well as large overlaps between them. Along with the likely influence of embargoed or delayed open access provision on some of these types, our analysis raises important questions about how to responsibly use Unpaywall data in bibliometric research and open access monitoring. Examining only Unpaywall’s best open access location favours publisher-provided open access, which, in turn, means that open access provided by repositories would be underestimated. Likewise, large overlaps between evidence categories can be observed. To allow for careful consideration, bibliometric research and open access monitoring must therefore be clear about how open access indicators were derived from Unpaywall.
In the future, we will use these insights from our data analysis to work on a matching procedure between Unpaywall and the Web of Science in-house database from the German Competence Center for Bibliometrics in our OAUNI project. In doing so, we want to represent Unpaywall’s open access evidence as comprehensively as possible to allow for a pluralist view on open access to journal articles from researchers affiliated with German universities.
We are very grateful for helpful feedback on this post and ongoing discussions on the usage of Unpaywall data for bibliometric analyses with our project partners in Bielefeld, Niels Taubert and Elham Iravani, the related project OASE, in particular Nicholas Fraser and Philipp Mayr-Schlegel, Stephan Stahlschmidt and Aliakbar Akbaritabar from the DZHW, as well as Daniel Bangert and Birgit Schmidt from the SUB Göttingen.
We acknowledge financial support from the Federal Ministry of Education and Research of Germany (BMBF) in the framework Quantitative research on the science sector (Project: “OAUNI Entwicklung und Einflussfaktoren des Open-Access-Publizierens an Universitäten in Deutschland”, Förderkennzeichen: 01PU17023A).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Jahn & Hobert (2019, May 7). Scholarly Communication Analytics: Open Access Evidence in Unpaywall. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence/
BibTeX citation
@misc{jahn2019open,
  author = {Jahn, Najko and Hobert, Anne},
  title = {Scholarly Communication Analytics: Open Access Evidence in Unpaywall},
  url = {https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence/},
  year = {2019}
}