The workflow starts with loading a downloaded OpenAIRE Research Graph dump. The package then helps you decode the dump and split it into several locally stored files. Dedicated parsers obtain data from these files.

Decode and split OpenAIRE Research Graph dumps

OpenAIRE Research Graph dumps are JSON files in which each record consists of a record identifier and a Base64-encoded text string representing the metadata.

library(jsonlite)
library(tibble)
# sample file delivered with this package
dump_file <- system.file(
  "extdata", "h2020_results_short.gz",
  package = "openairegraph"
)
# a dump file is in json format
loaded_dump <- jsonlite::stream_in(file(dump_file), verbose = FALSE)
tibble::as_tibble(loaded_dump)
#> # A tibble: 100 x 2
#>    `_id`$`$oid`        body$`$binary`                                   $`$type`
#>    <chr>               <chr>                                            <chr>   
#>  1 5dbc22f81e82127b58… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#>  2 5dbc22f9b531c546e8… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAAAAYm9kee0… 00      
#>  3 5dbc22fa45e3122d97… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#>  4 5dbc22fa45e3122d97… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#>  5 5dbc22fa4e0c061a4d… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#>  6 5dbc22fb81f3c12c00… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#>  7 5dbc22fb895be12461… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee2… 00      
#>  8 5dbc22fbe56570673e… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#>  9 5dbc22fc81f3c12bfe… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#> 10 5dbc22fcb531c546e8… UEsDBBQACAgIAIZiYU8AAAAAAAAAAAAAAAAEAAAAYm9kee1… 00      
#> # … with 90 more rows
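
The column headers in this output show that `_id` and `body` are nested data frames rather than plain columns. The following sketch mimics that layout with two made-up records (the identifiers and the placeholder body string are assumptions for illustration, not real dump content) and shows how the nested fields can be reached with backticks, or flattened with jsonlite::flatten().

```r
library(jsonlite)

# Two made-up records in the same layout as the dump; all values are placeholders
json_lines <- c(
  '{"_id":{"$oid":"toy-id-1"},"body":{"$binary":"UEsDBA==","$type":"00"}}',
  '{"_id":{"$oid":"toy-id-2"},"body":{"$binary":"UEsDBA==","$type":"00"}}'
)
toy_file <- tempfile(fileext = ".json")
writeLines(json_lines, toy_file)

# Read the toy file the same way as the sample dump
toy_dump <- jsonlite::stream_in(file(toy_file), verbose = FALSE)

# Non-syntactic names such as `_id`, `$oid` and `$binary` need backticks
record_ids <- toy_dump$`_id`$`$oid`
encoded_bodies <- toy_dump$body$`$binary`

# Alternatively, flatten the nested data frames into ordinary columns
flat_dump <- jsonlite::flatten(toy_dump)
names(flat_dump)
```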

openairegraph::oarg_decode() decodes these strings and saves them locally. It writes out each XML-formatted record as a zip file to a specified folder.
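
The encoded strings printed for the sample dump all begin with UEsDB, which is how the ZIP file signature looks in Base64. This suggests that each body is a Base64-encoded zip archive. A minimal check, assuming only jsonlite::base64_dec() and the first 12 characters copied from the sample output:

```r
library(jsonlite)

# First 12 characters of the encoded strings shown for the sample dump
encoded_prefix <- "UEsDBBQACAgI"
raw_prefix <- jsonlite::base64_dec(encoded_prefix)

# A ZIP local file header starts with the bytes 50 4b 03 04 ("PK" 0x03 0x04)
zip_signature <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
is_zip <- identical(raw_prefix[1:4], zip_signature)
is_zip
#> [1] TRUE
```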

library(openairegraph)
# writes out each XML-formatted record as a zip file to a specified folder
dir.create("data")
oarg_decode(loaded_dump, limit = 10, records_path = "data/")

These files can be loaded using the xml2 package.

library(xml2)
# sample file delivered with this package
dump_eg <- system.file(
  "extdata", "multiple_projects.xml",
  package = "openairegraph"
)
my_record <- xml2::read_xml(dump_eg)
my_record
#> {xml_document}
#> <record>
#> [1] <result xmlns:dri="http://www.driver-repository.eu/namespace/dri" xmlns:x ...
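
Once a record is parsed, xml2 functions can also be used to explore it directly, which helps when a field is not covered by the parsers. The sketch below uses a small hand-written document. Its element names and values are illustrative assumptions and much simpler than a real record; only the dri namespace URI is taken from the output above.

```r
library(xml2)

# Hand-written toy record; the structure is a simplified assumption
toy_record <- xml2::read_xml(
  '<record>
     <result xmlns:dri="http://www.driver-repository.eu/namespace/dri">
       <header><dri:objIdentifier>toy-id</dri:objIdentifier></header>
       <title>Toy title</title>
     </result>
   </record>'
)

# Namespace prefixes declared in the document
xml2::xml_ns(toy_record)

# Elements without a prefix can be addressed with plain XPath
toy_title <- xml2::xml_text(xml2::xml_find_first(toy_record, "//title"))

# Prefixed elements use the prefix reported by xml_ns()
toy_id <- xml2::xml_text(
  xml2::xml_find_first(toy_record, "//dri:objIdentifier")
)
```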

XML parsers

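So far, four parsers are available to consume the H2020 results set: oarg_publications_md(), oarg_linked_projects(), oarg_linked_ftxt() and oarg_linked_affiliations(). Author information and persistent identifiers are returned as list-columns of oarg_publications_md().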

Basic publication metadata

openairegraph::oarg_publications_md(my_record)
#> # A tibble: 1 x 12
#>   type  title journal publisher date_of_accepta… best_access_rig…
#>   <chr> <chr> <chr>   <chr>     <chr>            <chr>           
#> 1 publ… Iden… PLoS G… Public L… 2017-06-01       Open Access     
#> # … with 6 more variables: embargo_enddate <chr>, resource_type <chr>,
#> #   authors <list>, pids <list>, collected_from <list>, source_ids <list>

Author information

openairegraph::oarg_publications_md(my_record)$authors
#> [[1]]
#> # A tibble: 64 x 5
#>    author_full_name    author_order author_name author_surname author_orcid     
#>    <chr>                      <dbl> <chr>       <chr>          <chr>            
#>  1 Li, He                         1 He          Li             <NA>             
#>  2 Reksten, Tove Ragna            2 Tove Ragna  Reksten        <NA>             
#>  3 Ice, John A.                   3 John A.     Ice            <NA>             
#>  4 Kelly, Jennifer A.             4 Jennifer A. Kelly          <NA>             
#>  5 Adrianto, Indra                5 Indra       Adrianto       <NA>             
#>  6 Rasmussen, Astrid              6 Astrid      Rasmussen      0000-0001-7744-2…
#>  7 Wang, Shaofeng                 7 Shaofeng    Wang           <NA>             
#>  8 He, Bo                         8 Bo          He             <NA>             
#>  9 Grundahl, Kiely M.             9 Kiely M.    Grundahl       <NA>             
#> 10 Glenn, Stuart B.              10 Stuart B.   Glenn          <NA>             
#> # … with 54 more rows

Persistent identifiers (PIDs) linked to a research publication

openairegraph::oarg_publications_md(my_record)$pids
#> [[1]]
#> # A tibble: 3 x 2
#>   pid_type pid                         
#>   <chr>    <chr>                       
#> 1 pmc      PMC5501660                  
#> 2 doi      10.1371/journal.pgen.1006820
#> 3 pmid     28640813
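
The authors and pids columns are list-columns: every cell holds a tibble of its own, which is why the results above are printed under [[1]]. The sketch below rebuilds the pids output by hand (the one-row toy_md tibble is a stand-in for the parser result, and its title is a placeholder) and pulls out the DOI.

```r
library(tibble)

# Stand-in for the parser result, rebuilt from the pids output shown above
toy_md <- tibble::tibble(
  title = "placeholder title",
  pids = list(tibble::tibble(
    pid_type = c("pmc", "doi", "pmid"),
    pid = c("PMC5501660", "10.1371/journal.pgen.1006820", "28640813")
  ))
)

# [[1]] extracts the nested tibble of the first (here: only) record
pids <- toy_md$pids[[1]]

# Keep only the DOI
doi <- pids$pid[pids$pid_type == "doi"]
doi
#> [1] "10.1371/journal.pgen.1006820"
```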

Linked projects

openairegraph::oarg_linked_projects(my_record)
#> # A tibble: 22 x 9
#>    to    project_title funder funding_level_0 funding_level_1 funding_level_2
#>    <chr> <chr>         <chr>  <chr>           <chr>           <chr>          
#>  1 proj… Genetische u… Deuts… Klinische Fors… <NA>            <NA>           
#>  2 proj… Modulation o… Natio… NATIONAL INSTI… <NA>            <NA>           
#>  3 proj… Functional E… Natio… NATIONAL INSTI… <NA>            <NA>           
#>  4 proj… HARMONIzatio… Europ… H2020           RIA             <NA>           
#>  5 proj… Genomics Core Natio… NATIONAL INSTI… <NA>            <NA>           
#>  6 proj… Susceptibili… Natio… NATIONAL INSTI… <NA>            <NA>           
#>  7 proj… Molecular Me… Natio… NATIONAL INSTI… <NA>            <NA>           
#>  8 proj… Genetic Link… Natio… NATIONAL INSTI… <NA>            <NA>           
#>  9 proj… Science in a… Natio… NATIONAL INSTI… <NA>            <NA>           
#> 10 proj… A Genetic Ri… Natio… NATIONAL INSTI… <NA>            <NA>           
#> # … with 12 more rows, and 3 more variables: project_code <chr>,
#> #   project_acronym <chr>, contract_type <chr>

Linked full texts

openairegraph::oarg_linked_ftxt(my_record)
#> # A tibble: 4 x 5
#>   access_right collected_from         instance_type hosted_by           web_urls
#>   <chr>        <chr>                  <chr>         <chr>               <list>  
#> 1 Open Access  PubMed Central         Article       Europe PubMed Cent… <chr [1…
#> 2 Open Access  Publikationer från Up… Article       Publikationer från… <chr [1…
#> 3 UNKNOWN      Sygma                  Article       Unknown Repository  <chr [1…
#> 4 Open Access  DOAJ-Articles          Article       PLoS Genetics       <chr [3…
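
web_urls is a list-column because one full-text instance can have several URLs. A minimal sketch, assuming a toy tibble with placeholder hosts and URLs in place of the parser result, counts the URLs per instance and expands them to one row per URL using base R.

```r
library(tibble)

# Toy stand-in for the parser result; hosts and URLs are placeholders
toy_ftxt <- tibble::tibble(
  hosted_by = c("Repository A", "Journal B"),
  web_urls = list(
    "https://example.org/a",
    c("https://example.org/b1", "https://example.org/b2", "https://example.org/b3")
  )
)

# Number of URLs per full-text instance
n_urls <- lengths(toy_ftxt$web_urls)

# One row per URL: repeat each host as often as it has URLs
urls_long <- tibble::tibble(
  hosted_by = rep(toy_ftxt$hosted_by, times = n_urls),
  web_url = unlist(toy_ftxt$web_urls)
)
urls_long
```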

Affiliation data

openairegraph::oarg_linked_affiliations(my_record)
#> # A tibble: 52 x 3
#>    legal_name                                           country      trust_score
#>    <chr>                                                <chr>        <chr>      
#>  1 Massachusetts Eye and Ear Infirmary                  United Stat… 0.8996     
#>  2 University of Santo Tomas Hospital                   Philippines  0.9        
#>  3 University of Birmingham                             United King… 0.8244     
#>  4 Cedars-Sinai Medical Center                          United Stat… 0.9        
#>  5 Oklahoma Baptist University                          United Stat… 0.8998     
#>  6 NATIONAL INSTITUTE OF HEALTH                         United Stat… 0.7938     
#>  7 University of Minnesota - University of Minnesota M… United Stat… 0.8433     
#>  8 Haukeland University Hospital                        Norway       0.9        
#>  9 Newcastle University                                 United King… 0.9        
#> 10 Johns Hopkins University                             United Stat… 0.8998     
#> # … with 42 more rows
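
Note that trust_score is returned as a character column (<chr>), so it has to be converted before numeric comparisons. The sketch below re-types three rows from the output above into a toy tibble (country is left out) and keeps affiliations with a score of at least 0.9. The threshold is an arbitrary choice for illustration.

```r
library(tibble)

# Three rows copied from the output above; toy_aff stands in for the parser result
toy_aff <- tibble::tibble(
  legal_name = c(
    "Massachusetts Eye and Ear Infirmary",
    "University of Santo Tomas Hospital",
    "University of Birmingham"
  ),
  trust_score = c("0.8996", "0.9", "0.8244")
)

# Convert the character scores to numbers before comparing
toy_aff$trust_score <- as.numeric(toy_aff$trust_score)

# Keep affiliations at or above the (arbitrary) threshold
high_trust <- toy_aff[toy_aff$trust_score >= 0.9, ]
high_trust$legal_name
#> [1] "University of Santo Tomas Hospital"
```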