Accessing and analysing the OpenAIRE Research Graph data dumps

The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020.

Najko Jahn (https://twitter.com/najkoja), State and University Library Göttingen (https://www.sub.uni-goettingen.de/)
April 7, 2020

OpenAIRE has collected and interlinked scholarly data from various openly available sources for over ten years. In December 2019, this open science network released the OpenAIRE Research Graph (Manghi, Atzori, et al. 2019), a big scholarly data dump that contains metadata about more than 100 million research publications and 8 million datasets, as well as the relationships between them. These metadata are furthermore connected to open access locations and disambiguated information about persons, organisations and funders.

Like most big scholarly data dumps, the OpenAIRE Research Graph offers many opportunities for data analytics, but working with it is challenging. One reason is the size of the dump. Although the OpenAIRE Research Graph is already split into several files, most of these data files are too large to fit into the memory of a moderately equipped laptop when imported directly into computing environments like R. Another challenge is the format. The dump consists of compressed XML files following the comprehensive OpenAIRE data model (Manghi, Bardi, et al. 2019), of which only certain elements may be needed for a specific data analysis.

In this blog post, I introduce the R package openairegraph, an experimental effort that helps transform the large OpenAIRE Research Graph dumps into smaller datasets relevant for analysis. These tools aim at data analysts and researchers alike who wish to conduct their own analyses using the OpenAIRE Research Graph, but are wary of handling its large data dumps. Focusing on grant-supported research results from the European Commission’s Horizon 2020 framework programme (H2020), I present how to subset and analyse the graph using openairegraph. My analytical use case is to benchmark the open access activities of grant-supported projects affiliated with the University of Göttingen against the overall uptake across the H2020 funding activities.

What is the R package openairegraph about?

So far, the R package openairegraph, which is available on GitHub as a development version, has two sets of functions. The first set provides helpers to split a large OpenAIRE Research Graph data dump into separate, de-coded XML records that can be stored individually. The other set consists of parsers that convert data from these XML files to a table-like representation following the tidyverse philosophy, a popular approach and toolset for doing data analysis with R (Wickham et al. 2019). Splitting, de-coding and parsing are essential steps before analysing the OpenAIRE Research Graph.

Installation

openairegraph can be installed from GitHub using the remotes package (Hester et al. 2019):

library(remotes)
remotes::install_github("subugoe/openairegraph")

Loading a dump into R

Several dumps from the OpenAIRE Research Graph are available on Zenodo (Manghi, Atzori, et al. 2019). So far, I have tested openairegraph with the dump h2020_results.gz, which comprises research outputs funded by the European Commission’s Horizon 2020 funding programme (H2020).

After downloading it, the file can be imported into R using the jsonlite package (Ooms 2014). The following example shows that each line contains a record identifier and the corresponding Base64-encoded XML file. Base64 is an encoding scheme that represents binary data, in this case compressed XML records, in a text-based format.

library(jsonlite) # tools to work with json files
library(tidyverse) # tools from the tidyverse useful for data analysis
# download the file from Zenodo and store it locally
oaire <- jsonlite::stream_in(file("data/h2020_results.gz"), verbose = FALSE) %>%
  tibble::as_tibble()
oaire
#> # A tibble: 92,218 × 2
#>    `_id`$`$oid`             body$`$binary`                    $`$type`
#>    <chr>                    <chr>                             <chr>   
#>  1 5dbc22f81e82127b58c41073 UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAA… 00      
#>  2 5dbc22f9b531c546e838683d UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAA… 00      
#>  3 5dbc22fa45e3122d97bdb313 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#>  4 5dbc22fa45e3122d97bdb31e UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#>  5 5dbc22fa4e0c061a4d17b85d UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#>  6 5dbc22fb81f3c12c00238e25 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#>  7 5dbc22fb895be12461552bf0 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#>  8 5dbc22fbe56570673e1bf884 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#>  9 5dbc22fc81f3c12bfe83e3b6 UEsDBBQACAgIAIViYU8AAAAAAAAAAAAA… 00      
#> 10 5dbc22fcb531c546e838688a UEsDBBQACAgIAIZiYU8AAAAAAAAAAAAA… 00      
#> # … with 92,208 more rows
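
To illustrate what such a record looks like, here is a minimal sketch that de-codes the first record by hand; the openairegraph helper introduced in the next section automates this. The sketch assumes that the base64enc package is installed and relies on each decoded body being a zip archive holding a single XML record:

library(base64enc) # Base64 de-coding

# de-code the Base64-encoded body of the first record into raw bytes
raw_record <- base64enc::base64decode(oaire$body$`$binary`[1])
# store the raw bytes as a zip archive ...
writeBin(raw_record, "data/first_record.zip")
# ... and list its contents, which should reveal a single XML file
unzip("data/first_record.zip", list = TRUE)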

De-coding and storing OpenAIRE Research Graph records

The function openairegraph::oarg_decode() splits and de-codes each record. Storing the records individually makes it possible to process the files independently of each other, which is a common approach when working with big data.

library(openairegraph)
openairegraph::oarg_decode(oaire, records_path = "data/records/", 
  limit = 500, verbose = FALSE)

openairegraph::oarg_decode() writes out each XML-formatted record as a zip file to a specified folder. Because the dumps are quite large, the function furthermore has a parameter that allows setting a limit, which is helpful for inspecting the output first. By default, a progress bar presents the current state of the process.
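
A quick way to verify such a limited run is to count the record files written to the target folder:

# number of record files written by oarg_decode(), 500 in the limited run above
length(list.files("data/records"))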

Parsing OpenAIRE Research Graph records

So far, there are four parsers available to consume the H2020 results set; three of them are used in the example below.

These parsers can be used alone, or together like this:

First, I obtain the locations of the de-coded XML records.

openaire_records <- list.files("data/records", full.names = TRUE)

After that, I read each XML file using the xml2 package (Wickham, Hester, and Ooms 2019), and apply three parsers: openairegraph::oarg_publications_md(), openairegraph::oarg_linked_projects() and openairegraph::oarg_linked_ftxt(). I use the future (Bengtsson 2020b) and future.apply (Bengtsson 2020a) packages to read and parse these records simultaneously in multiple R sessions. Running the code in parallel reduces the execution time.

library(xml2) # working with xml files
library(future) # parallel computing
library(future.apply) # functional programming with parallel computing
library(tictoc) # timing functions

openaire_records <- list.files("data/records", full.names = TRUE)

future::plan(multisession)
tic()
oaire_data <- future.apply::future_lapply(openaire_records, function(files) {
  # load xml file
  doc <- xml2::read_xml(files)
  # parser
  out <- oarg_publications_md(doc)
  out$linked_projects <- list(oarg_linked_projects(doc))
  out$linked_ftxt <- list(oarg_linked_ftxt(doc))
  # use file path as id
  out$id <- files
  out
})
toc()
#> 37.156 sec elapsed
oaire_df <- dplyr::bind_rows(oaire_data)

A note on performance: Parsing the whole h2020_results dump with these parsers took me around two hours on my MacBook Pro (Early 2015, 2.9 GHz Intel Core i5, 8 GB RAM, 256 GB SSD). I therefore recommend backing up the resulting data instead of unpacking the whole dump for each analysis. jsonlite::stream_out() writes the data frame to a text-based JSON file in which the list-columns are preserved per row.

jsonlite::stream_out(oaire_df, file("data/h2020_parsed_short.json"))
#> Processed 500 rows...
#> Complete! Processed total of 500 rows.

Use case: Monitoring the Open Access Compliance across H2020 grant-supported projects at the institutional level

Usually, it is not individual researchers who sign grant agreements with the European Commission (EC), but the institutions they are affiliated with. Universities and other research institutions hosting EC-funded projects are therefore looking for ways to monitor their overall compliance with funder rules. In the case of the open access mandate in Horizon 2020 (H2020), librarians are often assigned this task. Moreover, quantitative science studies have started to investigate the efficacy of funders’ open access mandates (Larivière and Sugimoto 2018).

In this use case, I will illustrate how to make use of the OpenAIRE Research Graph, which links grants to publications and open access full-texts, to benchmark compliance with the open access mandate against other H2020 funding activities.

Overview

As a start, I load a dataset that was compiled from the whole h2020_results.gz dump following the methods described above.

oaire_df <-
  jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>%
  tibble::as_tibble()

It contains 92,218 grant-supported research outputs. Here, I will focus on the prevalence of open access across H2020 projects using metadata about the open access status of a publication and related project information stored in the list-column linked_projects.

pubs_projects <- oaire_df %>%
  filter(type == "publication") %>%
  select(id, type, best_access_right, linked_projects) %>%
  # transform to a regular data frame with a row for each project
  unnest(linked_projects) 
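
As a quick sanity check, the number of distinct publications and H2020 projects can be counted directly from this table (a minimal sketch using the columns created above):

pubs_projects %>%
  summarise(
    publications = n_distinct(id),
    h2020_projects = n_distinct(project_code[funding_level_0 == "H2020"])
  )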

The dataset contains 84,781 literature publications from 9,008 H2020 projects. Which H2020 funding activity published the most?

library(cowplot)
library(scales)
pubs_projects %>%
  filter(funding_level_0 == "H2020") %>% 
  mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
  group_by(funding_scheme) %>%
  summarise(n = n_distinct(id)) %>%
  mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
  mutate(highlight = ifelse(funding_scheme %in% c("ERC", "RIA"), "yes", "no")) %>%
  ggplot(aes(reorder(funding_fct, n), n, fill = highlight)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(
    values = c("#B0B0B0D0", "#56B4E9D0"),
    name = NULL) +
  scale_y_continuous(
    labels = scales::number_format(big.mark = ","),
    expand = expansion(mult = c(0, 0.05)),
    breaks =  scales::extended_breaks()(0:25000)
    ) +
  labs(x = NULL, y = "Publications", caption = "Data: OpenAIRE Research Graph") +
  theme_minimal_vgrid(font_family = "Roboto") +
  theme(legend.position = "none")

Figure 1: Publication Output of Horizon 2020 funding activities captured by the OpenAIRE Research Graph, released in December 2019.

Figure 1 shows that most publications in the OpenAIRE Research Graph originate from the European Research Council (ERC), Research and Innovation Actions (RIA) and Marie Skłodowska-Curie Actions (MSCA). On average, 10 articles were published per project. However, the publication performance per H2020 funding activity varies considerably (SD = 33).
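
These per-project figures can be derived from the same table (a minimal sketch that counts distinct publications per H2020 project):

pubs_projects %>%
  filter(funding_level_0 == "H2020") %>%
  group_by(project_code) %>%
  summarise(n_pubs = n_distinct(id)) %>%
  summarise(mean_pubs = mean(n_pubs), sd_pubs = sd(n_pubs))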

The European Commission mandates open access to publications. Let’s measure compliance with this policy per project using the OpenAIRE Research Graph:

library(rmarkdown)
oa_monitor_ec <- pubs_projects %>%
  filter(funding_level_0 == "H2020") %>%
  mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
  group_by(funding_scheme,
           project_code,
           project_acronym,
           best_access_right) %>%
  summarise(oa_n = n_distinct(id)) %>% # per pub
  mutate(oa_prop = oa_n / sum(oa_n)) %>%
  filter(best_access_right == "Open Access") %>%
  ungroup() %>%
  mutate(all_pub = as.integer(oa_n / oa_prop)) 
rmarkdown::paged_table(oa_monitor_ec)

In the following, this aggregated data, oa_monitor_ec, will provide the basis to explore variations among and within H2020 funding programmes.
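
For a first impression, the funding schemes can be ranked by their median project-level compliance (a sketch, restricted to projects with at least five publications, mirroring the figure below):

oa_monitor_ec %>%
  filter(all_pub >= 5) %>%
  group_by(funding_scheme) %>%
  summarise(
    projects = n_distinct(project_code),
    median_oa_share = median(oa_prop)
  ) %>%
  arrange(desc(median_oa_share))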

oa_monitor_ec %>%
  # only projects with at least five publications
  mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
  filter(all_pub >= 5) %>%
  ggplot(aes(fct_rev(funding_fct), oa_prop)) +
  geom_boxplot() +
  geom_hline(aes(
    yintercept = mean(oa_prop),
    color = paste0("Mean=", as.character(round(
      mean(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  geom_hline(aes(
    yintercept = median(oa_prop),
    color = paste0("Median=", as.character(round(
      median(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  scale_color_manual("H2020 OA Compliance", values = c("orange", "darkred")) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 5L),
                     expand = expansion(mult = c(0, 0.05))) +
  labs(x = NULL,
       y = "Open Access Percentage",
       caption = "Data: OpenAIRE Research Graph") +
  theme_minimal_vgrid(font_family = "Roboto") +
  theme(legend.position = "top",
        legend.justification = "right")

Figure 2: Open Access Compliance Rates of Horizon 2020 projects relative to funding activities, visualised as box plot. Only projects with at least five publications are shown individually.

About 77% of research publications under the H2020 open access mandate are openly available. Figure 2 highlights a generally high rate of compliance with the open access mandate; however, uptake levels vary across the funding schemes. In particular, ERC grants and Marie Skłodowska-Curie Actions show higher levels of compliance than the overall average.
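
The overall compliance figure quoted above can be reproduced from the publication-level data (a sketch in which a publication counts as openly available if its best_access_right is "Open Access"):

pubs_projects %>%
  filter(funding_level_0 == "H2020") %>%
  distinct(id, best_access_right) %>%
  summarise(oa_share = mean(best_access_right == "Open Access"))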

Because of these large variations, I want to put the open access rates of H2020-funded projects into context when presenting the share for projects affiliated with the University of Göttingen. Again, the analysis starts by loading the previously backed-up file with the decoded and parsed data and selecting project and access information from it.

oaire_df <- jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>%
  tibble::as_tibble()

pubs_projects <- oaire_df %>%
  select(id, type, best_access_right, linked_projects) %>%
  unnest(linked_projects) 
pubs_projects
#> # A tibble: 136,298 × 12
#>    id        type   best_access_right to    project_title      funder 
#>    <chr>     <chr>  <chr>             <chr> <chr>              <chr>  
#>  1 data-raw… publi… Open Access       proj… Planning and Eval… Europe…
#>  2 data-raw… publi… Open Access       proj… Cortical algorith… Europe…
#>  3 data-raw… publi… Open Access       proj… Human Brain Proje… Europe…
#>  4 data-raw… publi… Restricted        proj… Implementation of… Europe…
#>  5 data-raw… publi… Open Access       proj… The power of imag… Europe…
#>  6 data-raw… publi… Open Access       proj… A psychological a… Wellco…
#>  7 data-raw… publi… Open Access       proj… Effects of Nutrit… Europe…
#>  8 data-raw… publi… Open Access       proj… Aggression subtyp… Europe…
#>  9 data-raw… publi… Open Access       proj… Global trends in … Europe…
#> 10 data-raw… publi… Open Access       proj… Mapping gravitati… Europe…
#> # … with 136,288 more rows, and 6 more variables:
#> #   funding_level_0 <chr>, funding_level_1 <chr>, project_code <chr>,
#> #   project_acronym <chr>, contract_type <chr>, funding_level_2 <chr>

Next, I want to identify H2020 projects with participation from the university. There are at least two ways to obtain links between projects and organisations: one is the OpenAIRE Research Graph itself, which provides project details from 29 funders in a separate dump, project.gz; the other is to relate our dataset to the open data provided by CORDIS, the European Commission’s research information portal. For convenience, I am going to follow the second option.

# load local copy downloaded from the EC open data portal
cordis_org <-
  readr::read_delim(
    "data/cordis-h2020organizations.csv",
    delim = ";",
    locale = locale(decimal_mark = ",")
  ) %>%
  # data cleaning
  mutate_if(is.double, as.character) 

After loading the file, I am able to tag projects affiliated with the University of Göttingen.

ugoe_projects <- cordis_org %>%
  filter(shortName %in% c("UGOE", "UMG-GOE")) %>% 
  select(project_id = projectID, role, project_acronym = projectAcronym)

pubs_projects_ugoe <- pubs_projects %>%
  mutate(ugoe_project = funding_level_0 == "H2020" & project_code %in% ugoe_projects$project_id)
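
A quick look at how many publications and projects this flags for the university (a sketch using the columns defined above):

pubs_projects_ugoe %>%
  filter(ugoe_project) %>%
  summarise(
    publications = n_distinct(id),
    projects = n_distinct(project_code)
  )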

Let’s put it all together and benchmark the rates of compliance with the H2020 open access mandate using data from the OpenAIRE Research Graph. The plotly package (Sievert 2018) allows presenting the figure as an interactive chart.

# funding programmes with Uni Göttingen participation
ugoe_funding_programme <- pubs_projects_ugoe %>% 
  filter(ugoe_project == TRUE) %>%
  group_by(funding_level_1, project_code) %>% 
  # min 5 pubs
  summarise(n = n_distinct(id)) %>%
  filter(n >= 5) %>%
  distinct(funding_level_1, project_code)
goe_oa <- oa_monitor_ec %>%
  # min 5 pubs
  filter(all_pub >=5) %>%
  filter(funding_scheme %in% ugoe_funding_programme$funding_level_1) %>%
  mutate(ugoe = project_code %in% ugoe_funding_programme$project_code) %>%
  mutate(`H2020 project` = paste0(project_acronym, " | OA share: ", round(oa_prop * 100, 0), "%"))
# plot as interactive graph using plotly
library(plotly)
p <- ggplot(goe_oa, aes(funding_scheme, oa_prop)) +
  geom_boxplot() +
  geom_jitter(data = filter(goe_oa, ugoe == TRUE),
               aes(label = `H2020 project`),
             colour = "#AF42AE",
             alpha = 0.9,
             size = 3,
             width = 0.25) +
  geom_hline(aes(
    yintercept = mean(oa_prop),
    color = paste0("Mean=", as.character(round(
      mean(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  geom_hline(aes(
    yintercept = median(oa_prop),
    color = paste0("Median=", as.character(round(
      median(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  scale_color_manual(NULL, values = c("orange", "darkred")) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 5L)) +
  labs(x = NULL,
       y = "Open Access Percentage",
       caption = "Data: OpenAIRE Research Graph") +
  theme_minimal(base_family = "Roboto") +
  theme(legend.position = "top",
        legend.justification = "right")
plotly::ggplotly(p, tooltip = c("label"))

Figure 3: Open Access Compliance Rates of Horizon 2020 projects affiliated with the University of Göttingen (purple dots) relative to the overall performance of the funding activity, visualised as a box plot. Only projects with at least five publications were considered. Data: OpenAIRE Research Graph (Manghi, Atzori, et al. 2019)

Figure 3 shows that many H2020 projects with University of Göttingen participation have an uptake of open access to grant-supported publications that is above the average of their peer group. At the same time, some perform below expectation. Together, this provides valuable insight into open access compliance at the university level, especially for research support librarians who are in charge of helping grantees make their work open access. They can, for instance, point grantees to OpenAIRE-compliant repositories for self-archiving their works.

Discussion and conclusion

Using data from the OpenAIRE Research Graph dumps makes it possible to put the results of a specific data analysis into context. Open access compliance rates of H2020 projects vary. These variations should be considered when reporting compliance rates of specific projects under the same open access mandate.

Although the OpenAIRE Research Graph is a large collection of scholarly data, it is likely that it still does not provide the whole picture. OpenAIRE mainly collects data from open sources. It is still unknown how the OpenAIRE Research Graph compares to well-established toll-access bibliometrics data sources like the Web of Science in terms of coverage and data quality.

As a member of the OpenAIRE consortium, improving the re-use of the OpenAIRE Research Graph dumps has become a SUB Göttingen working priority. In the scholarly communication analysts team, we want to support this with a number of data analyses and outreach activities. In doing so, we will add more helper functions to the openairegraph R package. It targets data analysts and researchers who wish to conduct their own analysis using the OpenAIRE Research Graph, but are wary of handling its large data dumps.

If you would like to contribute, head over to the package’s source code repository and get started!

Acknowledgments

This work was supported by the European Commission [OpenAIRE-Advance - OpenAIRE Advancing Open Scholarship (777541)].

References

Bengtsson, Henrik. 2020a. Future.apply: Apply Function to Elements in Parallel Using Futures. https://CRAN.R-project.org/package=future.apply.
———. 2020b. Future: Unified Parallel and Distributed Processing in R for Everyone. https://CRAN.R-project.org/package=future.
Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan Tenenbaum. 2019. Remotes: R Package Installation from Remote Repositories, Including ’GitHub’. https://CRAN.R-project.org/package=remotes.
Larivière, Vincent, and Cassidy R. Sugimoto. 2018. “Do Authors Comply When Funders Enforce Open Access to Research?” Nature 562 (7728): 483–86. https://doi.org/10.1038/d41586-018-07101-w.
Manghi, Paolo, Claudio Atzori, Alessia Bardi, Jochen Schirrwagen, Harry Dimitropoulos, Sandro La Bruzzo, Ioannis Foufoulas, et al. 2019. “OpenAIRE Research Graph Dump.” Zenodo. https://doi.org/10.5281/zenodo.3516918.
Manghi, Paolo, Alessia Bardi, Claudio Atzori, Miriam Baglioni, Natalia Manola, Jochen Schirrwagen, and Pedro Principe. 2019. “The OpenAIRE Research Graph Data Model.” Zenodo. https://doi.org/10.5281/zenodo.2643199.
Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [Stat.CO]. https://arxiv.org/abs/1403.2805.
Sievert, Carson. 2018. Plotly for R. https://plotly-r.com.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Jim Hester, and Jeroen Ooms. 2019. Xml2: Parse XML. https://CRAN.R-project.org/package=xml2.


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Jahn (2020, April 7). Scholarly Communication Analytics: Accessing and analysing the OpenAIRE Research Graph data dumps. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020/

BibTeX citation

@misc{jahn2020accessing,
  author = {Jahn, Najko},
  title = {Scholarly Communication Analytics: Accessing and analysing the OpenAIRE Research Graph data dumps},
  url = {https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020/},
  year = {2020}
}