Scholarly Communication Analytics: Interfacing the PID Graph with R

Najko Jahn

doi:21.11101/0000-0007-DFAD-C

In 1965, Derek J. de Solla Price proposed to study the relationships between research articles using bibliographic references (Solla Price 1965). Ever since, scholars and librarians have been working on interrelating research activities and making such links discoverable.

In this context, the FREYA project, funded by the European Commission, connects and interlinks persistent identifier (PID) schemes. Among other things, FREYA focuses on PIDs for persons (ORCID), organisations (ROR), and publications, research data, and software (DOI). The project has created a PID Graph, which connects various resources using persistent identifiers. A GraphQL interface provides access to these data.

Upon invitation of Martin Fenner, Technical Director of DataCite and FREYA team member, I attended the half-day workshop Project FREYA: connecting knowledge in the European Open Science Cloud, co-located with the 14th Plenary Meeting of the Research Data Alliance in Helsinki. Using data analytics as an outreach strategy, Martin prepared a large collection of Jupyter notebooks showcasing how the PID Graph can be interfaced using R and Python (Fenner 2019c). During the workshop, we presented two interactive notebooks deployed on mybinder.org, and invited workshop participants to re-run them in the web browser. The first notebook (Fenner 2019b) presents the overall indexing coverage of the PID Graph, while the second notebook demonstrates how to obtain data about a personal researcher network (Fenner 2019a).

In this blog post, I want to expand on what I have learned during the FREYA workshop. Although most participants were able to run the interactive Jupyter notebooks, some reported problems along the data transformation path. In the following, I will therefore present a complementary approach to transforming and visualising data from the PID Graph with R, using tools from the popular tidyverse package collection.

Accessing the PID Graph using GraphQL

A first version of the PID graph is accessible via the DataCite GraphQL API. GraphQL is a query language designed to request multiple connections across resources at once. As an example, a query for accessing publications, research data and software by a particular researcher using the DataCite GraphQL API looks like this:

graphql_query <- '{
  person(id: "https://orcid.org/0000-0003-1444-9135") {
    id
    type
    name
    publications(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
    datasets(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
    softwareSourceCodes(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
  }
}'

Here, I query for publications, research data and software code authored by Scott Chamberlain, who is represented by his ORCID. I also retrieve relations between his research activities that are represented in the relatedIdentifier node. The query is stored in the R object graphql_query that will be used to interface the DataCite GraphQL API in the following.
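
The same kind of query can be generated for other researchers. The following is only a sketch: build_person_query() is a hypothetical helper of my own (it is not part of ghql or the DataCite API), and it fills an ORCID iD and a page size into a shortened version of the query above using base R's sprintf().

```r
# Hypothetical helper (not part of ghql or the DataCite API):
# builds a shortened person query for a given ORCID iD and page size.
build_person_query <- function(orcid, first = 50) {
  # accept either a bare ORCID iD or a full https://orcid.org/ URL
  if (grepl("^https://orcid\\.org/", orcid)) {
    orcid_url <- orcid
  } else {
    orcid_url <- paste0("https://orcid.org/", orcid)
  }
  # %s is replaced by the ORCID URL, %d by the page size
  sprintf(
    '{
  person(id: "%s") {
    id
    name
    publications(first: %d) {
      totalCount
      nodes {
        id
        type
      }
    }
  }
}',
    orcid_url, as.integer(first)
  )
}

# print the generated query text
cat(build_person_query("0000-0003-1444-9135", first = 10))
```

The resulting string can be used in place of graphql_query in the steps that follow.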

To make GraphQL requests with R, Scott developed the R package ghql, which is maintained by rOpenSci. The package is not on CRAN, but can be installed from GitHub.

# Not on CRAN. Install from GitHub:
# remotes::install_github("ropensci/ghql")
library(ghql)

To initialize the client session, call

cli <- GraphqlClient$new(
  url = "https://api.datacite.org/graphql"
)
qry <- Query$new()

Next, I register the query stored in graphql_query under the name getdata, so that it can be sent to the API.

qry$query("getdata", graphql_query)

Executing the query with cli$exec() returns the data as JSON. To parse the API response, I use the jsonlite package.

library(jsonlite)
my_data <- jsonlite::fromJSON(cli$exec(qry$queries$getdata))
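
To get a feeling for the parsed structure without calling the API, here is a small sketch using a made-up response fragment. The identifiers are placeholders and the fragment only assumes the shape implied by the query above; it is not real DataCite output. It also shows a simple check of whether the page size (first: ...) truncated the result, by comparing the number of returned nodes with totalCount.

```r
library(jsonlite)

# Made-up response fragment mimicking the shape of the API result;
# the identifiers are placeholders, not real records.
mock_json <- '{
  "data": {
    "person": {
      "id": "https://orcid.org/0000-0000-0000-0000",
      "publications": {
        "totalCount": 3,
        "nodes": [
          {"id": "https://doi.org/10.1234/a", "type": "ScholarlyArticle",
           "relatedIdentifiers": [{"relatedIdentifier": "10.1234/b"}]},
          {"id": "https://doi.org/10.1234/b", "type": "ScholarlyArticle",
           "relatedIdentifiers": []}
        ]
      }
    }
  }
}'
mock_data <- fromJSON(mock_json)

pubs <- mock_data$data$person$publications
# nodes is simplified to a data.frame with one row per output
str(pubs$nodes, max.level = 1)
# fewer rows than totalCount means that not all records were retrieved
nrow(pubs$nodes) < pubs$totalCount
```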

Data Transformation

The data is represented as a nested list, which can be transformed to a data.frame using the tidyverse packages tidyr and dplyr. Here, I want to obtain all DOIs representing scholarly articles, datasets and software, including the relationships between them. Unlike the DOIs of the research outputs themselves, related identifiers of type DOI lack the https://doi.org/ resolver prefix. For consistency with the overall dataset, the prefix will be added.

library(dplyr)
library(tidyr)
my_df <- bind_rows(
  # publications
  my_data$data$person$publications$nodes,
  # dataset
  my_data$data$person$datasets$nodes,
  # software
  my_data$data$person$softwareSourceCodes$nodes
) %>%
  # get related identifiers
  unnest(cols = c(relatedIdentifiers), keep_empty = TRUE) %>%
  # unfortunately, related identifiers of type DOI lack the https://doi.org/ prefix
  mutate(to = ifelse(
    # escape the dot so that only identifiers starting with "10." match
    grepl("^10\\.", relatedIdentifier),
    paste0("https://doi.org/", relatedIdentifier),
    relatedIdentifier)
  )
head(my_df)

# A tibble: 6 x 4
  id                  type      relatedIdentifier    to               
  <chr>               <chr>     <chr>                <chr>            
1 https://doi.org/10… Scholarl… <NA>                 <NA>             
2 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
3 https://doi.org/10… Scholarl… <NA>                 <NA>             
4 https://doi.org/10… Scholarl… <NA>                 <NA>             
5 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
6 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
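
The prefix step from the pipeline above can be tried in isolation. This is a minimal sketch on a made-up vector of related identifiers, using base R only. The dot in the pattern is escaped here, because an unescaped dot matches any character and would also flag strings such as "100abc".

```r
# Made-up related identifiers: two bare DOIs, one URL and one missing value
related <- c("10.1234/abc", "https://example.org/x", NA, "10.5678/def")

# TRUE for identifiers that start with "10."; grepl() returns FALSE for NA
is_bare_doi <- grepl("^10\\.", related)

# prepend the resolver prefix to bare DOIs and leave everything else untouched
normalised <- ifelse(is_bare_doi, paste0("https://doi.org/", related), related)
normalised
```

Missing values stay missing, which matters because keep_empty = TRUE retains outputs without any related identifier.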

A network consists of nodes (vertices) and links (edges). Each node represents a research output or the person, while links describe the relationships between them (relatedIdentifier).

Let’s create a node data.frame

my_nodes <- my_df %>%
  select(name = id, type) %>%
  distinct() %>%
  # person
  add_row(name = my_data$data$person$id, type = "Person")
head(my_nodes)

# A tibble: 6 x 2
  name                                         type            
  <chr>                                        <chr>           
1 https://doi.org/10.6084/m9.figshare.97222    ScholarlyArticle
2 https://doi.org/10.6084/m9.figshare.94217.v2 ScholarlyArticle
3 https://doi.org/10.6084/m9.figshare.94296    ScholarlyArticle
4 https://doi.org/10.6084/m9.figshare.94090    ScholarlyArticle
5 https://doi.org/10.6084/m9.figshare.94295.v2 ScholarlyArticle
6 https://doi.org/10.6084/m9.figshare.97215.v1 ScholarlyArticle

and a data.frame with the relationships between these nodes, i.e. edges:

my_edges_pub <- my_df %>%
  select(source = id, target = to) %>%
  # keep only links whose target is among the retrieved nodes
  filter(target %in% my_nodes$name)
# add relationships between the person and the outputs
my_edges <-
  tibble(source = my_data$data$person$id, target = my_nodes$name) %>%
  # no self loop
  filter(target != my_data$data$person$id) %>%
  bind_rows(my_edges_pub)
head(my_edges)

# A tibble: 6 x 2
  source                          target                              
  <chr>                           <chr>                               
1 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
2 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
3 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
4 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
5 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
6 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
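
Why filter the edges against the node list? As far as I know, igraph's graph_from_data_frame() refuses edge lists that mention vertices missing from the vertex data.frame. The following sketch uses made-up toy data and base R only to show the idea behind that filter step.

```r
# Made-up toy network: one person and two outputs
toy_nodes <- data.frame(
  name = c("person", "paper", "dataset"),
  type = c("Person", "ScholarlyArticle", "Dataset"),
  stringsAsFactors = FALSE
)
# the last edge points to an output that is not among the nodes
toy_edges <- data.frame(
  source = c("person", "person", "paper", "paper"),
  target = c("paper", "dataset", "dataset", "external"),
  stringsAsFactors = FALSE
)

# keep only edges whose two end points are both known nodes
known <- toy_edges$source %in% toy_nodes$name &
  toy_edges$target %in% toy_nodes$name
toy_edges[known, ]
```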

Network visualisation

For the graph visualisation, I use the popular network analysis package igraph. First, the node and edge data are transformed to an igraph object. I also want to remove potential loops (“self-links”).

library(igraph)
g <-
  graph_from_data_frame(d = my_edges,
                        vertices = my_nodes,
                        directed = FALSE)
#' remove potential loops
g <- igraph::simplify(g)
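
With its default arguments, simplify() does more than remove loops: it also merges multiple edges between the same pair of nodes. A small sketch on a made-up edge list illustrates both effects.

```r
library(igraph)

# Made-up edge list with one self-loop (a -- a) and one duplicated edge (a -- b)
toy <- graph_from_data_frame(
  data.frame(
    from = c("a", "a", "a", "b"),
    to   = c("a", "b", "b", "c")
  ),
  directed = FALSE
)
# number of edges before simplification
ecount(toy)

# drop the loop and merge the duplicated edge
toy_simple <- simplify(toy)
# number of edges after simplification
ecount(toy_simple)
# check that no loops are left
any(which_loop(toy_simple))
```

If repeated links between two outputs should be kept, simplify(g, remove.multiple = FALSE) removes loops only.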

Next, some visualisation parameters are defined, including node colours and labels. Here, node colours represent the person and the three different publication types.

#' define node colours
my_palette <-
  c("#6da7de", "#9e0059", "#dee000", "#d82222")
my_color <- my_palette[as.numeric(as.factor(V(g)$type))]
#' don't display label
V(g)$label = NA
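
The colour assignment above relies on how R orders factor levels: as.factor() sorts the distinct types alphabetically, and as.numeric() then yields the position of each type in that order, which is used as an index into the palette. A minimal sketch with made-up type labels (the exact labels returned by the API are an assumption here):

```r
# Made-up node types in arbitrary order
toy_types <- c("ScholarlyArticle", "Person", "Dataset",
               "SoftwareSourceCode", "Dataset")
toy_palette <- c("#6da7de", "#9e0059", "#dee000", "#d82222")

toy_factor <- as.factor(toy_types)
# levels are sorted alphabetically, so level i is drawn with toy_palette[i]
levels(toy_factor)

# one colour per node, looked up by level position
toy_colours <- toy_palette[as.numeric(toy_factor)]

# lookup table type -> colour, in the order a legend would show it
setNames(toy_palette[seq_along(levels(toy_factor))], levels(toy_factor))
```

Because the legend is built from the same factor levels, colours and legend entries stay in the same order.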

Finally, let’s visualise Scott’s publication network according to DataCite metadata.

# g has already been simplified above
plot(g, vertex.color = my_color,
     vertex.frame.color = my_color,
     edge.arrow.mode = 0)
legend(
  "bottomleft",
  legend = levels(as.factor(V(g)$type)),
  col = my_palette,
  bty = "n",
  pch = 20 ,
  pt.cex = 2.5,
  cex = 1,
  horiz = FALSE,
  inset = c(0.1,-0.1)
)

Discussion and Outlook

Using data analytics is a great outreach activity to promote the PID Graph. During the workshop, participants were able to run the interactive notebooks with analytical code. This gave them hands-on experience of how to interface the graph with GraphQL. It also led to great discussions about the PID Graph’s indexing coverage and potential use cases. In particular, participants raised the issue of yet-incomplete PID metadata coverage. In our example, for instance, we likely miss a considerable number of Scott’s software projects linked with a DOI, because the underlying metadata records lack his ORCID.

Besides the fruitful discussion about PID coverage in the metadata, I had the feeling that many participants struggled to follow the steps for data transformation. Therefore, I decided to DRY up the code from the initial notebook, that is, to reduce repetition, using the packages tidyr and dplyr from the tidyverse. I hope that such an approach will make the examples clearer.

In the future, the FREYA team will continuously extend the indexing coverage of the PID Graph in collaboration with related research graph activities from OpenAIRE (Manghi et al. 2019) and the Wikibase community. On 25 October, there will be a joint meeting of large data providers for Open Science Graphs at the RDA 14th Plenary. Together with the Software Sustainability Institute, FREYA will hold a day-long hackathon on 4 December at the British Library to further improve data analytics using the PID Graph.

Acknowledgments

I would like to thank Martin Fenner, Kristian Garza, Slava Tykhonov, and Maaike de Jong for having me at the workshop, and their valuable help with the analysis and the use of the PID Graph with Jupyter Notebooks.

References

Fenner, Martin. 2019a. “FREYA PID Graph for a Specific Researcher.” DataCite. https://doi.org/10.14454/628M-3882.

———. 2019b. “FREYA PID Graph Key Performance Indicators (KPIs).” DataCite. https://doi.org/10.14454/3BPW-W381.

———. 2019c. “Using Jupyter Notebooks with GraphQL and the PID Graph.” https://doi.org/10.5438/HWAW-XE52.

Manghi, Paolo, Alessia Bardi, Claudio Atzori, Miriam Baglioni, Natalia Manola, Jochen Schirrwagen, and Pedro Principe. 2019. “The OpenAIRE Research Graph Data Model.” https://doi.org/10.5281/ZENODO.2643199.

Solla Price, D. J. de. 1965. “Networks of Scientific Papers.” Science 149 (3683): 510–15. https://doi.org/10.1126/science.149.3683.510.
