Interfacing the PID Graph with R

The PID Graph from DataCite interlinks persistent identifiers (PIDs) in research. In this blog post, I present how to interface this graph with R using the DataCite GraphQL API. As an illustration, I visualise the research information network of a person.

Najko Jahn (https://twitter.com/najkoja), State and University Library Göttingen (https://www.sub.uni-goettingen.de/)
2019-10-24

In 1965, Derek J. de Solla Price proposed studying the relationships between research articles using bibliographic references (Solla Price 1965). Ever since, scholars and librarians have been working on interrelating research activities and making such links discoverable.

In this context, the FREYA project, funded by the European Commission, connects and interlinks persistent identifier (PID) schemes. FREYA focuses, among other things, on PIDs for persons (ORCID), organisations (ROR), and publications, research data and software (DOI). The project has created a PID Graph, which connects various resources using persistent identifiers. A GraphQL interface provides access to these data.

Upon invitation of Martin Fenner, Technical Director of DataCite and FREYA team member, I attended the half-day workshop Project FREYA: connecting knowledge in the European Open Science Cloud, co-located with the 14th Plenary Meeting of the Research Data Alliance in Helsinki. Using data analytics as an outreach strategy, Martin prepared a large collection of Jupyter notebooks showcasing how the PID Graph can be interfaced using R and Python (Fenner 2019c). During the workshop, we presented two interactive notebooks deployed on mybinder.org, and invited workshop participants to re-run them in the web browser. The first notebook (Fenner 2019b) presents the overall indexing coverage of the PID Graph, while the second notebook demonstrates how to obtain data about a researcher's personal network (Fenner 2019a).

In this blog post, I want to expand on what I learned during the FREYA workshop. Although most participants were able to run the interactive Jupyter notebooks, some reported problems along the data transformation path. In the following, I therefore present a complementary approach to transforming and visualising data from the PID Graph with R, using tools from the popular tidyverse package collection.

Accessing the PID Graph using GraphQL

A first version of the PID Graph is accessible via the DataCite GraphQL API. GraphQL is a query language designed to request multiple connected resources in a single call. As an example, a query for the publications, research data and software of a particular researcher looks like this:


graphql_query <- '{
  person(id: "https://orcid.org/0000-0003-1444-9135") {
    id
    type
    name
    publications(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
    datasets(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
    softwareSourceCodes(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
  }
}'

Here, I query for publications, research data and software code authored by Scott Chamberlain, who is represented by his ORCID. I also retrieve relations between his research activities that are represented in the relatedIdentifier node. The query is stored in the R object graphql_query that will be used to interface the DataCite GraphQL API in the following.
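
To reuse the query for other researchers, the ORCID iD can be passed in as a parameter. The following is a minimal sketch using base R's sprintf(). The helper name build_person_query and the object graphql_query_short are my own and not part of ghql or the DataCite API, and the query is shortened to the publications connection for brevity.

```r
# Hypothetical helper (not part of ghql): build a person query for an ORCID iD
build_person_query <- function(orcid, first = 50) {
  # basic sanity check on the ORCID iD pattern (four blocks of four characters)
  stopifnot(grepl("^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$", orcid))
  sprintf('{
  person(id: "https://orcid.org/%s") {
    id
    name
    publications(first: %d) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
  }
}', orcid, as.integer(first))
}

graphql_query_short <- build_person_query("0000-0003-1444-9135")
cat(graphql_query_short)
```

The resulting string can be used in the same way as a hand-written query.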

To make GraphQL requests with R, Scott developed the R package ghql, which is maintained by rOpenSci. The package is not on CRAN, but can be installed from GitHub.


# Not on CRAN; install from GitHub with:
# remotes::install_github("ropensci/ghql")
library(ghql)

To initialize the client session and an empty query object, call:


cli <- GraphqlClient$new(
  url = "https://api.datacite.org/graphql"
)
qry <- Query$new()

Next, I register the query stored in graphql_query under the name getdata. It is sent to the API once cli$exec() is called.


qry$query("getdata", graphql_query)

The API returns the data as JSON. To execute the query and parse the response, I use the jsonlite package.


library(jsonlite)
my_data <- jsonlite::fromJSON(cli$exec(qry$queries$getdata))
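
To get a feeling for what fromJSON() returns without calling the API, here is a small sketch with a hand-written response that mimics the shape of the query result. The DOIs and the totalCount value are placeholders, not real data. The sketch also shows how totalCount can be compared with the number of returned nodes to spot results cut off by the first: 50 limit.

```r
library(jsonlite)

# Hand-written stand-in for an API response; DOIs and counts are placeholders
mock_json <- '{
  "data": {
    "person": {
      "id": "https://orcid.org/0000-0003-1444-9135",
      "publications": {
        "totalCount": 3,
        "nodes": [
          {"id": "https://doi.org/10.1234/a", "type": "ScholarlyArticle",
           "relatedIdentifiers": [{"relatedIdentifier": "10.1234/b"}]},
          {"id": "https://doi.org/10.1234/b", "type": "ScholarlyArticle",
           "relatedIdentifiers": []}
        ]
      }
    }
  }
}'
mock_data <- fromJSON(mock_json)
pubs <- mock_data$data$person$publications

# nodes is simplified to a data.frame with one row per output
str(pubs$nodes, max.level = 1)

# TRUE if the API holds more records than were returned
pubs$totalCount > nrow(pubs$nodes)
```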

Data Transformation

The parsed response is a nested list, which can be transformed to a data.frame using the tidyverse packages tidyr and dplyr. Here, I want to obtain all DOIs representing scholarly articles, datasets and software, including the relationships between them. Unlike the DOIs of the research outputs themselves, related identifiers of type DOI lack the https://doi.org/ prefix. For consistency with the overall dataset, the prefix is added.


library(dplyr)
library(tidyr)
my_df <- bind_rows(
  # publications
  my_data$data$person$publications$nodes,
  # dataset
  my_data$data$person$datasets$nodes,
  # software
  my_data$data$person$softwareSourceCodes$nodes
) %>%
  # get related identifiers
  unnest(cols = c(relatedIdentifiers), keep_empty = TRUE) %>%
  # unfortunately, related identifiers of type DOI lack DOI prefix
  mutate(to = ifelse(
    grepl("^10\\.", relatedIdentifier),
    paste0("https://doi.org/", relatedIdentifier),
    relatedIdentifier)
  )
head(my_df)

# A tibble: 6 x 4
  id                  type      relatedIdentifier    to               
  <chr>               <chr>     <chr>                <chr>            
1 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
2 https://doi.org/10… Scholarl… <NA>                 <NA>             
3 https://doi.org/10… Scholarl… <NA>                 <NA>             
4 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
5 https://doi.org/10… Scholarl… <NA>                 <NA>             
6 https://doi.org/10… Scholarl… <NA>                 <NA>             
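
The prefix step can be tried in isolation. The sketch below uses base R only; the helper name add_doi_prefix and all example identifiers except the first are made up for illustration. Note that an unescaped dot in a regular expression matches any character, so the pattern escapes it to avoid treating strings such as "1000-genomes" as DOIs.

```r
# Hypothetical helper: prepend the resolver prefix to bare DOIs only
add_doi_prefix <- function(x) {
  is_bare_doi <- !is.na(x) & grepl("^10\\.", x)
  ifelse(is_bare_doi, paste0("https://doi.org/", x), x)
}

ids <- c(
  "10.6084/m9.figshare.94090",          # bare DOI, gets the prefix
  "https://doi.org/10.1234/placeholder", # already prefixed, unchanged
  "1000-genomes",                        # not a DOI, unchanged
  NA                                     # no related identifier
)
prefixed <- add_doi_prefix(ids)
prefixed
```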

A network consists of nodes (vertices) and links (edges). Here, each node represents a research output, while links describe the relationships between outputs (relatedIdentifier).

Let’s create a node data.frame


my_nodes <- my_df %>%
  select(name = id, type) %>%
  distinct() %>%
  # person
  add_row(name = my_data$data$person$id, type = "Person")
head(my_nodes)

# A tibble: 6 x 2
  name                                         type            
  <chr>                                        <chr>           
1 https://doi.org/10.6084/m9.figshare.97215.v1 ScholarlyArticle
2 https://doi.org/10.6084/m9.figshare.94090    ScholarlyArticle
3 https://doi.org/10.6084/m9.figshare.94296    ScholarlyArticle
4 https://doi.org/10.6084/m9.figshare.94295.v2 ScholarlyArticle
5 https://doi.org/10.6084/m9.figshare.94089    ScholarlyArticle
6 https://doi.org/10.6084/m9.figshare.94217    ScholarlyArticle

and a data.frame with the relationships between these nodes, i.e. edges:


my_edges_pub <- my_df %>%
  select(source = id, target = to) %>%
  # keep only links whose target is itself one of the person's outputs
  filter(target %in% my_nodes$name)
# add relationships between the person and their outputs
my_edges <-
  tibble(source = my_data$data$person$id, target = my_nodes$name) %>%
  # no self loop
  filter(target != my_data$data$person$id) %>%
  bind_rows(my_edges_pub)
head(my_edges)

# A tibble: 6 x 2
  source                          target                              
  <chr>                           <chr>                               
1 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
2 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
3 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
4 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
5 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
6 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…

Network visualisation

For the graph visualisation, I use the popular network analysis package igraph. First, the node and edge data are transformed to an igraph object. I also want to remove potential loops (“self-links”).


library(igraph)
g <-
  graph_from_data_frame(d = my_edges,
                        vertices = my_nodes,
                        directed = FALSE)
#' remove potential loops
g <- igraph::simplify(g)
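
What simplify() does is easiest to see on a toy graph. In the following sketch, the edge list is invented for illustration. With its default arguments, simplify() removes self-loops and also merges duplicate edges between the same pair of nodes.

```r
library(igraph)

# Toy edge list: a-b appears twice, b-b is a self-loop
toy_edges <- data.frame(
  source = c("a", "a", "b", "b"),
  target = c("b", "b", "b", "c")
)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)
ecount(toy)

# defaults: remove.multiple = TRUE, remove.loops = TRUE
toy_simple <- igraph::simplify(toy)
ecount(toy_simple)
any(which_loop(toy_simple))
```

Only the edges a-b and b-c remain in toy_simple.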

Next, some visualisation parameters are defined, including node colours and labels. Here, node colours represent the person and the three different publication types.


#' define node colours
my_palette <-
  c("#6da7de", "#9e0059", "#dee000", "#d82222")
my_color <- my_palette[as.numeric(as.factor(V(g)$type))]
#' don't display label
V(g)$label = NA
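
The colour assignment relies on the fact that a factor stores its levels in sorted order and that as.integer() returns the position of each value among these levels. A base R sketch with an invented type vector makes the mapping explicit; the type names other than ScholarlyArticle and Person are assumptions about what the API returns.

```r
# Invented example vector of node types
toy_types <- c("Dataset", "Person", "ScholarlyArticle",
               "SoftwareSourceCode", "Dataset")
toy_factor <- factor(toy_types)
toy_palette <- c("#6da7de", "#9e0059", "#dee000", "#d82222")

# one colour per node, looked up by the position of its type among the levels
toy_colours <- toy_palette[as.integer(toy_factor)]
toy_colours

# which colour belongs to which type (same order as used in a legend)
setNames(toy_palette, levels(toy_factor))
```

Because the legend is built from the same levels, colours and legend entries stay aligned as long as the palette has at least as many entries as there are types.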

Finally, let’s visualise Scott’s publication network according to DataCite metadata.


# g has already been simplified above
plot(g, vertex.color = my_color,
     vertex.frame.color = my_color,
     edge.arrow.mode = 0)
legend(
  "bottomleft",
  legend = levels(as.factor(V(g)$type)),
  col = my_palette,
  bty = "n",
  pch = 20 ,
  pt.cex = 2.5,
  cex = 1,
  horiz = FALSE,
  inset = c(0.1,-0.1)
)