openairegraph.rmd
The workflow starts with loading a downloaded OpenAIRE Research Graph dump. The package then helps you decode it and split it into several locally stored files. Dedicated parsers extract data from these files.
OpenAIRE Research Graph dumps are JSON files in which each record contains an identifier and a Base64-encoded text string representing the metadata.

    library(jsonlite)
    library(tibble)
    # sample file delivered with this package
    dump_file <- system.file(
      "extdata", "h2020_results_short.gz",
      package = "openairegraph"
    )
    # a dump file is in json format
    loaded_dump <- jsonlite::stream_in(file(dump_file), verbose = FALSE)
    tibble::as_tibble(loaded_dump)
    #> # A tibble: 100 x 2
    #>    `_id`$`$oid`        body$`$binary`                                   $`$type`
    #>    <chr>               <chr>                                            <chr>
    #>  1 5dbc22f81e82127b58… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #>  2 5dbc22f9b531c546e8… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAAAAYm9kee0… 00
    #>  3 5dbc22fa45e3122d97… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #>  4 5dbc22fa45e3122d97… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #>  5 5dbc22fa4e0c061a4d… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #>  6 5dbc22fb81f3c12c00… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #>  7 5dbc22fb895be12461… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee2… 00
    #>  8 5dbc22fbe56570673e… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #>  9 5dbc22fc81f3c12bfe… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #> 10 5dbc22fcb531c546e8… UEsDBBQACAgIAIZiYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00
    #> # … with 90 more rows

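The Base64 strings printed above all begin with `UEsDB`, which is how the start of a zip archive looks after Base64 encoding. As a minimal sketch, assuming only that `jsonlite::base64_dec()` is available, the first characters of the string in row 1 can be decoded and compared with the zip magic bytes. The variable names are illustrative and not part of openairegraph.

```r
library(jsonlite)

# First 12 characters of the Base64 string shown in row 1 of the output above.
# Twelve Base64 characters decode to exactly nine bytes, so no padding is needed.
b64_prefix <- "UEsDBBQACAgI"
raw_prefix <- jsonlite::base64_dec(b64_prefix)

# A zip local file header starts with the bytes 50 4b 03 04 ("PK" 03 04).
zip_magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
is_zip <- identical(raw_prefix[1:4], zip_magic)
is_zip
#> [1] TRUE
```

This is consistent with each record being written out as a zip file in the decoding step.
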
`openairegraph::oarg_decode()` decodes these strings and saves them locally. It writes out each XML-formatted record as a zip file to a specified folder.

    library(openairegraph)
    # writes out each XML-formatted record as a zip file to a specified folder
    dir.create("data")
    oarg_decode(loaded_dump, limit = 10, records_path = "data/")

These files can be loaded using the xml2 package.

    library(xml2)
    # sample file delivered with this package
    dump_eg <- system.file(
      "extdata", "multiple_projects.xml",
      package = "openairegraph"
    )
    my_record <- xml2::read_xml(dump_eg)
    my_record
    #> {xml_document}
    #> <record>
    #> [1] <result xmlns:dri="http://www.driver-repository.eu/namespace/dri" xmlns:x ...

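A parsed record can also be explored directly with XPath before it is handed to one of the package's parsers. The sketch below uses a small made-up record instead of the bundled file. The element names `title` and `pid` and the attribute `classid` are assumptions for illustration and may not match the actual OpenAIRE schema.

```r
library(xml2)

# Hypothetical, heavily simplified record; not the real OpenAIRE schema.
toy_record <- xml2::read_xml(
  "<record>
     <result>
       <title>Example title</title>
       <pid classid='doi'>10.1234/example</pid>
     </result>
   </record>"
)

# Name of the root node
root_name <- xml2::xml_name(toy_record)

# Text of the first <title> element anywhere in the document
title <- xml2::xml_text(xml2::xml_find_first(toy_record, "//title"))

# Attribute and text of every <pid> element
pid_nodes <- xml2::xml_find_all(toy_record, "//pid")
pid_type <- xml2::xml_attr(pid_nodes, "classid")
pid_value <- xml2::xml_text(pid_nodes)
```

The same `xml_find_first()` and `xml_find_all()` calls can be applied to `my_record`, with XPath expressions adapted to the elements it actually contains.
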
So far, there are four parsers available to consume the H2020 results set:
- `openairegraph::oarg_publications_md()` retrieves basic publication metadata complemented by author details and access status
- `openairegraph::oarg_linked_projects()` parses grants linked to publications
- `openairegraph::oarg_linked_ftxt()` gives full-text links including access information
- `openairegraph::oarg_linked_affiliations()` parses affiliation data

    openairegraph::oarg_publications_md(my_record)
    #> # A tibble: 1 x 12
    #>   type  title journal publisher date_of_accepta… best_access_rig…
    #>   <chr> <chr> <chr>   <chr>     <chr>            <chr>
    #> 1 publ… Iden… PLoS G… Public L… 2017-06-01       Open Access
    #> # … with 6 more variables: embargo_enddate <chr>, resource_type <chr>,
    #> #   authors <list>, pids <list>, collected_from <list>, source_ids <list>

Author information

    openairegraph::oarg_publications_md(my_record)$authors
    #> [[1]]
    #> # A tibble: 64 x 5
    #>    author_full_name    author_order author_name author_surname author_orcid
    #>    <chr>                      <dbl> <chr>       <chr>          <chr>
    #>  1 Li, He                         1 He          Li             <NA>
    #>  2 Reksten, Tove Ragna            2 Tove Ragna  Reksten        <NA>
    #>  3 Ice, John A.                   3 John A.     Ice            <NA>
    #>  4 Kelly, Jennifer A.             4 Jennifer A. Kelly          <NA>
    #>  5 Adrianto, Indra                5 Indra       Adrianto       <NA>
    #>  6 Rasmussen, Astrid              6 Astrid      Rasmussen      0000-0001-7744-2…
    #>  7 Wang, Shaofeng                 7 Shaofeng    Wang           <NA>
    #>  8 He, Bo                         8 Bo          He             <NA>
    #>  9 Grundahl, Kiely M.             9 Kiely M.    Grundahl       <NA>
    #> 10 Glenn, Stuart B.              10 Stuart B.   Glenn          <NA>
    #> # … with 54 more rows

Persistent identifiers (PIDs) linked to a research publication

    openairegraph::oarg_publications_md(my_record)$pids
    #> [[1]]
    #> # A tibble: 3 x 2
    #>   pid_type pid
    #>   <chr>    <chr>
    #> 1 pmc      PMC5501660
    #> 2 doi      10.1371/journal.pgen.1006820
    #> 3 pmid     28640813

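Because `pids` is a list-column, the identifiers for one publication are reached by taking the first list element. As a sketch, the table printed above is rebuilt by hand here so that the example runs without the package. The `https://doi.org/` prefix is the standard DOI resolver, and `doi_url` is an illustrative name.

```r
# Hand-built copy of the pids table shown above
pids <- data.frame(
  pid_type = c("pmc", "doi", "pmid"),
  pid = c("PMC5501660", "10.1371/journal.pgen.1006820", "28640813"),
  stringsAsFactors = FALSE
)

# Keep the DOI and turn it into a resolvable link
doi <- pids$pid[pids$pid_type == "doi"]
doi_url <- paste0("https://doi.org/", doi)
doi_url
#> [1] "https://doi.org/10.1371/journal.pgen.1006820"
```

With the real data, `pids` would be `openairegraph::oarg_publications_md(my_record)$pids[[1]]`.
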

    openairegraph::oarg_linked_projects(my_record)
    #> # A tibble: 22 x 9
    #>    to    project_title funder funding_level_0 funding_level_1 funding_level_2
    #>    <chr> <chr>         <chr>  <chr>           <chr>           <chr>
    #>  1 proj… Genetische u… Deuts… Klinische Fors… <NA>            <NA>
    #>  2 proj… Modulation o… Natio… NATIONAL INSTI… <NA>            <NA>
    #>  3 proj… Functional E… Natio… NATIONAL INSTI… <NA>            <NA>
    #>  4 proj… HARMONIzatio… Europ… H2020           RIA             <NA>
    #>  5 proj… Genomics Core Natio… NATIONAL INSTI… <NA>            <NA>
    #>  6 proj… Susceptibili… Natio… NATIONAL INSTI… <NA>            <NA>
    #>  7 proj… Molecular Me… Natio… NATIONAL INSTI… <NA>            <NA>
    #>  8 proj… Genetic Link… Natio… NATIONAL INSTI… <NA>            <NA>
    #>  9 proj… Science in a… Natio… NATIONAL INSTI… <NA>            <NA>
    #> 10 proj… A Genetic Ri… Natio… NATIONAL INSTI… <NA>            <NA>
    #> # … with 12 more rows, and 3 more variables: project_code <chr>,
    #> #   project_acronym <chr>, contract_type <chr>

    openairegraph::oarg_linked_ftxt(my_record)
    #> # A tibble: 4 x 5
    #>   access_right collected_from         instance_type hosted_by           web_urls
    #>   <chr>        <chr>                  <chr>         <chr>               <list>
    #> 1 Open Access  PubMed Central         Article       Europe PubMed Cent… <chr [1…
    #> 2 Open Access  Publikationer från Up… Article       Publikationer från… <chr [1…
    #> 3 UNKNOWN      Sygma                  Article       Unknown Repository  <chr [1…
    #> 4 Open Access  DOAJ-Articles          Article       PLoS Genetics       <chr [3…

    openairegraph::oarg_linked_affiliations(my_record)
    #> # A tibble: 52 x 3
    #>    legal_name                                           country      trust_score
    #>    <chr>                                                <chr>        <chr>
    #>  1 Massachusetts Eye and Ear Infirmary                  United Stat… 0.8996
    #>  2 University of Santo Tomas Hospital                   Philippines  0.9
    #>  3 University of Birmingham                             United King… 0.8244
    #>  4 Cedars-Sinai Medical Center                          United Stat… 0.9
    #>  5 Oklahoma Baptist University                          United Stat… 0.8998
    #>  6 NATIONAL INSTITUTE OF HEALTH                         United Stat… 0.7938
    #>  7 University of Minnesota - University of Minnesota M… United Stat… 0.8433
    #>  8 Haukeland University Hospital                        Norway       0.9
    #>  9 Newcastle University                                 United King… 0.9
    #> 10 Johns Hopkins University                             United Stat… 0.8998
    #> # … with 42 more rows

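In the affiliations output, `trust_score` is a character column, so it needs converting before any numeric filtering. A minimal sketch using three rows copied from the output above follows. The `country` column is left out, and the 0.85 threshold is an arbitrary choice for illustration.

```r
# Three rows copied from the affiliations output above
affiliations <- data.frame(
  legal_name = c(
    "University of Birmingham",
    "Haukeland University Hospital",
    "NATIONAL INSTITUTE OF HEALTH"
  ),
  trust_score = c("0.8244", "0.9", "0.7938"),
  stringsAsFactors = FALSE
)

# Convert the character scores to numbers before comparing
affiliations$trust_score <- as.numeric(affiliations$trust_score)

# Keep affiliations at or above an arbitrary, illustrative threshold
high_trust <- affiliations[affiliations$trust_score >= 0.85, ]
high_trust$legal_name
#> [1] "Haukeland University Hospital"
```

With the real data, `affiliations` would be the result of `openairegraph::oarg_linked_affiliations(my_record)`.
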