The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-coding and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020.
OpenAIRE has collected and interlinked scholarly data from various openly available sources for over ten years. In December 2019, this open science network released the OpenAIRE Research Graph (Manghi, Atzori, et al. 2019), a big scholarly data dump that contains metadata about more than 100 million research publications and 8 million datasets, as well as the relationships between them. These metadata are furthermore connected to open access locations and disambiguated information about persons, organisations and funders.
Like most big scholarly data dumps, the OpenAIRE Research Graph offers many data analytics opportunities, but working with it is challenging. One reason is the size of the dump. Although the OpenAIRE Research Graph is already split into several files, most of these data files are too large to fit into the memory of a moderately equipped laptop when imported directly into computing environments like R. Another challenge is the format. The dump consists of compressed XML files following the comprehensive OpenAIRE data model (Manghi, Bardi, et al. 2019), from which only certain elements may be needed for a specific data analysis.
In this blog post, I introduce the R package openairegraph, an experimental effort that helps to transform the large OpenAIRE Research Graph dumps into small, analysis-ready datasets. These tools are aimed at data analysts and researchers alike who wish to conduct their own analyses using the OpenAIRE Research Graph, but are wary of handling its large data dumps. Focusing on grant-supported research results from the European Commission’s Horizon 2020 framework programme (H2020), I present how to subset and analyse the graph using openairegraph. My analytical use case is to benchmark the open access activities of grant-supported projects affiliated with the University of Göttingen against the overall uptake across the H2020 funding activities.
What is openairegraph about?

So far, the R package openairegraph, which is available on GitHub as a development version, has two sets of functions. The first set provides helpers to split a large OpenAIRE Research Graph data dump into separate, de-coded XML records that can be stored individually. The other set consists of parsers that convert data from these XML files into a table-like representation following the tidyverse philosophy, a popular approach and toolset for data analysis with R (Wickham et al. 2019). Splitting, de-coding and parsing are essential steps before analysing the OpenAIRE Research Graph.
openairegraph can be installed from GitHub using the remotes (Hester et al. 2019) package:
library(remotes)
remotes::install_github("subugoe/openairegraph")
Several dumps from the OpenAIRE Research Graph are available on Zenodo (Manghi, Atzori, et al. 2019). So far, I have tested openairegraph with the dump h2020_results.gz, which comprises research outputs funded by H2020.
After downloading it, the file can be imported into R using the jsonlite package (Ooms 2014). The following example shows that each line contains a record identifier and the corresponding Base64-encoded XML file. Base64 is an encoding standard that represents binary data, here compressed XML records, in a text-based format.
library(jsonlite) # tools to work with json files
library(tidyverse) # tools from the tidyverse useful for data analysis
# download the file from Zenodo and store it locally
oaire <- jsonlite::stream_in(file("data/h2020_results.gz"), verbose = FALSE) %>%
tibble::as_tibble()
oaire
#> # A tibble: 92,218 × 2
#> `_id`$`$oid` body$`$binary` $`$type`
#> <chr> <chr> <chr>
#> 1 5dbc22f81e82127b58c41073 UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAA… 00
#> 2 5dbc22f9b531c546e838683d UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAA… 00
#> 3 5dbc22fa45e3122d97bdb313 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 4 5dbc22fa45e3122d97bdb31e UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 5 5dbc22fa4e0c061a4d17b85d UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 6 5dbc22fb81f3c12c00238e25 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 7 5dbc22fb895be12461552bf0 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 8 5dbc22fbe56570673e1bf884 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 9 5dbc22fc81f3c12bfe83e3b6 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00
#> 10 5dbc22fcb531c546e838688a UEsDBBQACAgIAIZiYU8AAAAAAAAAAAAA… 00
#> # … with 92,208 more rows
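Before de-coding the whole set, it is instructive to unpack a single record by hand. The following is a minimal sketch, not part of openairegraph, assuming the base64enc package is installed and that each body is a zip archive holding one XML file (the recurring "UEsDB…" prefix is the Base64 form of the zip signature "PK"):
library(base64enc) # decode Base64 strings
library(xml2)      # read XML documents
# decode the body of the first record into raw bytes
raw_body <- base64enc::base64decode(oaire$body$`$binary`[1])
rawToChar(raw_body[1:2]) # "PK", the zip file signature
# write the bytes to a temporary file and read the XML inside the archive
tmp <- tempfile(fileext = ".zip")
writeBin(raw_body, tmp)
xml_file <- unzip(tmp, list = TRUE)$Name[1]
record <- xml2::read_xml(unz(tmp, xml_file))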
The function openairegraph::oarg_decode() performs this splitting and de-coding for each record. Storing the records individually makes it possible to process the files independently of each other, which is a common approach when working with big data.
library(openairegraph)
openairegraph::oarg_decode(oaire, records_path = "data/records/",
limit = 500, verbose = FALSE)
openairegraph::oarg_decode() writes out each XML-formatted record as a zip file to a specified folder. Because the dumps are quite large, the function also has a limit parameter that restricts the number of records written, which is helpful for inspecting the output first. By default, a progress bar shows the current state of the process.
So far, there are four parsers available to consume the H2020 results set:

- openairegraph::oarg_publications_md() retrieves basic publication metadata, complemented by author details and access status
- openairegraph::oarg_linked_projects() parses grants linked to publications
- openairegraph::oarg_linked_ftxt() gives full-text links, including access information
- openairegraph::oarg_linked_affiliations() parses affiliation data

These parsers can be used alone, as shown in the sketch below, or chained together.
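For instance, here is a minimal sketch of applying a single parser to one record, assuming data/records already contains the files written by oarg_decode():
# read one de-coded record and apply a single parser
first_record <- list.files("data/records", full.names = TRUE)[1]
doc <- xml2::read_xml(first_record)
oarg_publications_md(doc)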
To chain the parsers, I first obtain the locations of the de-coded XML records.
openaire_records <- list.files("data/records", full.names = TRUE)
After that, I read each XML file using the xml2 (Wickham, Hester, and Ooms 2019) package and apply three parsers: openairegraph::oarg_publications_md(), openairegraph::oarg_linked_projects() and openairegraph::oarg_linked_ftxt(). I use the future (Bengtsson 2020b) and future.apply (Bengtsson 2020a) packages to read and parse these records simultaneously in multiple R sessions; running the code in parallel reduces the execution time.
library(xml2) # working with xml files
library(future) # parallel computing
library(future.apply) # functional programming with parallel computing
library(tictoc) # timing functions
openaire_records <- list.files("data/records", full.names = TRUE)
future::plan(multisession)
tic()
oaire_data <- future.apply::future_lapply(openaire_records, function(record_file) {
  # load the xml file
  doc <- xml2::read_xml(record_file)
  # apply the parsers
  out <- oarg_publications_md(doc)
  out$linked_projects <- list(oarg_linked_projects(doc))
  out$linked_ftxt <- list(oarg_linked_ftxt(doc))
  # use the file path as id
  out$id <- record_file
  out
})
toc()
#> 37.156 sec elapsed
oaire_df <- dplyr::bind_rows(oaire_data)
A note on performance: parsing the whole h2020_results.gz dump with these parsers took me around two hours on my MacBook Pro (Early 2015, 2.9 GHz Intel Core i5, 8 GB RAM, 256 GB SSD). I therefore recommend backing up the resulting data instead of un-packing the whole dump for each analysis. jsonlite::stream_out() writes the data frame to a text-based JSON file, preserving the list-columns per row.
jsonlite::stream_out(oaire_df, file("data/h2020_parsed_short.json"))
#> Processed 500 rows...
#> Complete! Processed total of 500 rows.
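To restore the backed-up data in a later session, the file can be streamed back in the same way it was created:
# read the parsed data back into R
oaire_df <- jsonlite::stream_in(file("data/h2020_parsed_short.json"), verbose = FALSE) %>%
  tibble::as_tibble()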
Usually, it is not individual researchers who sign grant agreements with the European Commission (EC), but the institutions they are affiliated with. Universities and other research institutions hosting EC-funded projects are therefore looking for ways to monitor their overall compliance with funder rules. In the case of the open access mandate in Horizon 2020, librarians are often assigned this task. Moreover, quantitative science studies have started to investigate the efficacy of funders’ open access mandates (Larivière and Sugimoto 2018).
In this use case, I will illustrate how to make use of the OpenAIRE Research Graph, which links grants to publications and open access full-texts, to benchmark compliance with the open access mandate against other H2020 funding activities.
As a start, I load a dataset that was compiled with the methods described above, using the whole h2020_results.gz dump. It contains 92,218 grant-supported research outputs. Here, I will focus on the prevalence of open access across H2020 projects, using metadata about the open access status of each publication and the related project information stored in the list-column linked_projects.
pubs_projects <- oaire_df %>%
filter(type == "publication") %>%
select(id, type, best_access_right, linked_projects) %>%
# transform to a regular data frame with a row for each project
unnest(linked_projects)
The dataset contains 84,781 literature publications from 9,008 H2020 projects.
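These figures can be reproduced from the data frame itself. A minimal sketch, assuming oarg_linked_projects() names the grant identifier column project_code (a hypothetical name, used here for illustration):
# count distinct publications and distinct projects
pubs_projects %>%
  summarise(
    publications = n_distinct(id),
    projects = n_distinct(project_code) # project_code is an assumed column name
  )
Which H2020 funding activity published the most?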
library(cowplot)
library(scales)
pubs_projects %>%
filter(funding_level_0 == "H2020") %>%
mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
group_by(funding_scheme) %>%
summarise(n = n_distinct(id)) %>%
mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
mutate(highlight = ifelse(funding_scheme %in% c("ERC", "RIA"), "yes", "no")) %>%
ggplot(aes(reorder(funding_fct, n), n, fill = highlight)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(
values = c("#B0B0B0D0", "#56B4E9D0"),
name = NULL) +
scale_y_continuous(
labels = scales::number_format(big.mark = ","),
expand = expansion(mult = c(0, 0.05)),
breaks = scales::extended_breaks()(0:25000)
) +
labs(x = NULL, y = "Publications", caption = "Data: OpenAIRE Research Graph") +
theme_minimal_vgrid(font_family = "Roboto") +
theme(legend.position = "none")
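Finally, to move from publication counts towards the compliance question raised at the outset, the open access share per funding activity can be derived from the same data frame. A minimal sketch, assuming best_access_right labels openly available publications as "Open Access":
# open access share per H2020 funding activity
pubs_projects %>%
  filter(funding_level_0 == "H2020") %>%
  group_by(funding_level_1) %>%
  summarise(
    pubs = n_distinct(id),
    # "Open Access" is an assumed label in best_access_right
    oa_share = n_distinct(id[best_access_right == "Open Access"]) / n_distinct(id)
  ) %>%
  arrange(desc(pubs))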