The PID Graph from DataCite interlinks persistent identifiers (PIDs) in research. In this blog post, I will present how to interface this graph using the DataCite GraphQL API with R. To illustrate this, I will visualise the research information network of a person.
In 1965, Derek J. de Solla Price proposed to study the relationships between research articles using bibliographic references (Solla Price 1965). Ever since, scholars and librarians have been working on interrelating research activities and making such links discoverable.
In this context, the FREYA project, funded by the European Commission, connects and interlinks persistent identifier (PID) schemes. FREYA focuses, among other things, on PIDs for persons (ORCID), organisations (ROR), publications, research data, and software (DOI). The project has created a PID Graph, which connects various resources using persistent identifiers. A GraphQL interface provides access to these data.
At the invitation of Martin Fenner, Technical Director of DataCite and FREYA team member, I attended the half-day workshop Project FREYA: connecting knowledge in the European Open Science Cloud, co-located with the 14th Plenary Meeting of the Research Data Alliance in Helsinki. Using data analytics as an outreach strategy, Martin prepared a large collection of Jupyter notebooks showcasing how the PID Graph can be interfaced using R and Python (Fenner 2019c). During the workshop, we presented two interactive notebooks deployed on mybinder.org, and invited workshop participants to re-run them in the web browser. The first notebook (Fenner 2019b) presents the overall indexing coverage of the PID Graph, while the second notebook demonstrates how to obtain data about a personal researcher network (Fenner 2019a).
In this blog post, I want to expand on what I have learned during the FREYA workshop. Although most participants were able to run the interactive Jupyter notebooks, some reported problems along the data transformation path. In the following, I will therefore present a complementary approach to transforming and visualising data from the PID Graph with R, using tools from the popular tidyverse package collection.
A first version of the PID graph is accessible via the DataCite GraphQL API. GraphQL is a query language designed to request multiple connections across resources at once. As an example, a query for accessing publications, research data and software by a particular researcher using the DataCite GraphQL API looks like this:
graphql_query <- '{
  person(id: "https://orcid.org/0000-0003-1444-9135") {
    id
    type
    name
    publications(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
    datasets(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
    softwareSourceCodes(first: 50) {
      totalCount
      nodes {
        id
        type
        relatedIdentifiers {
          relatedIdentifier
        }
      }
    }
  }
}'
Here, I query for publications, research data and software code authored by Scott Chamberlain, who is represented by his ORCID. I also retrieve relations between his research activities that are represented in the relatedIdentifier node. The query is stored in the R object graphql_query, which will be used to interface the DataCite GraphQL API in the following.
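The query repeats the same selection for publications, datasets and software. As a minimal sketch, the string could also be generated for any ORCID iD. The helper name build_person_query and its arguments are my own invention for illustration; they are not part of ghql or the DataCite API.

```r
# Hypothetical helper (not from the original notebooks): build the person
# query for a given ORCID iD and page size using base R only.
build_person_query <- function(orcid, first = 50) {
  # Selection shared by all three connections
  node_fields <- sprintf(
    "(first: %d) { totalCount nodes { id type relatedIdentifiers { relatedIdentifier } } }",
    as.integer(first)
  )
  connections <- c("publications", "datasets", "softwareSourceCodes")
  sprintf(
    '{ person(id: "%s") { id type name %s } }',
    orcid,
    paste0(connections, node_fields, collapse = " ")
  )
}

generated_query <- build_person_query("https://orcid.org/0000-0003-1444-9135")
cat(generated_query)
```

GraphQL ignores insignificant whitespace, so the single-line string should be equivalent to the hand-written query above.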
To make GraphQL requests with R, Scott developed the R package ghql, which is maintained by rOpenSci. The package is not on CRAN, but can be installed from GitHub.
# Not on CRAN.
# Install from GitHub:
# remotes::install_github("ropensci/ghql")
library(ghql)
To initialize the client session, call:
cli <- GraphqlClient$new(
  url = "https://api.datacite.org/graphql"
)
qry <- Query$new()
Next, I register the query stored in graphql_query with the query object under the name getdata. It is sent to the API once the client executes it.
qry$query("getdata", graphql_query)
The API returns JSON. To parse the response, I use the jsonlite package.
library(jsonlite)
my_data <- jsonlite::fromJSON(cli$exec(qry$queries$getdata))
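To get a feel for the structure that fromJSON() produces without calling the API, here is a small sketch with a hand-written mock response. The identifiers are placeholders, and the mock only mimics the fields requested in the query, so the real response may contain more.

```r
library(jsonlite)

# Hand-written mock of an API response; all identifiers are placeholders
mock_response <- '{
  "data": {
    "person": {
      "id": "https://orcid.org/0000-0000-0000-0000",
      "publications": {
        "totalCount": 2,
        "nodes": [
          {"id": "https://doi.org/10.1234/a", "type": "ScholarlyArticle",
           "relatedIdentifiers": [{"relatedIdentifier": "10.1234/b"}]},
          {"id": "https://doi.org/10.1234/c", "type": "ScholarlyArticle",
           "relatedIdentifiers": []}
        ]
      }
    }
  }
}'

mock_data <- fromJSON(mock_response)

# The nodes array is simplified to a data.frame, one row per output
pub_nodes <- mock_data$data$person$publications$nodes
str(pub_nodes, max.level = 1)
```

The relatedIdentifiers column holds one entry per output, which is why it has to be unnested before it can be used as an edge list.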
The data is represented as a nested list, which can be transformed to a data.frame using the tidyverse packages tidyr and dplyr. Here, I want to obtain all DOIs representing scholarly articles, datasets and software including the relationships between them. Unlike the DOIs for research outputs, related identifiers of type DOI lack the DOI prefix. For consistency with the overall dataset, the prefix will be added.
library(dplyr)
library(tidyr)
my_df <- bind_rows(
  # publications
  my_data$data$person$publications$nodes,
  # dataset
  my_data$data$person$datasets$nodes,
  # software
  my_data$data$person$softwareSourceCodes$nodes
) %>%
  # get related identifiers
  unnest(cols = c(relatedIdentifiers), keep_empty = TRUE) %>%
  # unfortunately, related identifiers of type DOI lack DOI prefix
  # (the dot is escaped so that only a literal "10." matches)
  mutate(to = ifelse(
    grepl("^10\\.", relatedIdentifier),
    paste0("https://doi.org/", relatedIdentifier),
    relatedIdentifier)
  )
head(my_df)
# A tibble: 6 x 4
id type relatedIdentifier to
<chr> <chr> <chr> <chr>
1 https://doi.org/10… Scholarl… <NA> <NA>
2 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
3 https://doi.org/10… Scholarl… <NA> <NA>
4 https://doi.org/10… Scholarl… <NA> <NA>
5 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
6 https://doi.org/10… Scholarl… 10.6084/m9.figshare… https://doi.org/…
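The prefix step can be checked in isolation. The following sketch wraps it in a small function, add_doi_prefix, which is a name I made up for this example. It escapes the dot in the pattern, because an unescaped "^10." would also match strings such as "100-year-flood".

```r
# Hypothetical helper: prepend the resolver prefix to bare DOIs only
add_doi_prefix <- function(x) {
  # "\\." matches a literal dot, so only values starting with "10." qualify
  is_bare_doi <- grepl("^10\\.", x)
  ifelse(is_bare_doi, paste0("https://doi.org/", x), x)
}

ids <- c(
  "10.6084/m9.figshare.97222",     # bare DOI, gets the prefix
  "https://example.org/record/1",  # already a URL, left untouched
  NA,                              # missing values stay missing
  "100-year-flood"                 # starts with "10" but is not a DOI
)
prefixed <- add_doi_prefix(ids)
print(prefixed)
```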
A network consists of nodes (vertices) and links (edges). Here, each node represents a research output or the person, while links describe the relationships between them (relatedIdentifier).
Let’s create a node data.frame:
my_nodes <- my_df %>%
  select(name = id, type) %>%
  distinct() %>%
  # person
  add_row(name = my_data$data$person$id, type = "Person")
head(my_nodes)
# A tibble: 6 x 2
name type
<chr> <chr>
1 https://doi.org/10.6084/m9.figshare.97222 ScholarlyArticle
2 https://doi.org/10.6084/m9.figshare.94217.v2 ScholarlyArticle
3 https://doi.org/10.6084/m9.figshare.94296 ScholarlyArticle
4 https://doi.org/10.6084/m9.figshare.94090 ScholarlyArticle
5 https://doi.org/10.6084/m9.figshare.94295.v2 ScholarlyArticle
6 https://doi.org/10.6084/m9.figshare.97215.v1 ScholarlyArticle
and a data.frame with the relationships between these nodes, i.e. edges:
my_edges_pub <- my_df %>%
  select(source = id, target = to) %>%
  # we only observe links between them
  filter(target %in% my_nodes$name)
#' let's add relationships between person and outputs
my_edges <-
  tibble(source = my_data$data$person$id, target = my_nodes$name) %>%
  # no self loop
  filter(target != my_data$data$person$id) %>%
  bind_rows(my_edges_pub)
head(my_edges)
# A tibble: 6 x 2
source target
<chr> <chr>
1 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
2 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
3 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
4 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
5 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
6 https://orcid.org/0000-0003-14… https://doi.org/10.6084/m9.figshare…
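To make the filtering logic easier to follow, here is a sketch with toy data. All identifiers are made up; the point is only that links to outputs outside the node set, as well as missing links, are dropped, while every output is connected to the person.

```r
library(dplyr)

# Toy data; identifiers are invented for illustration
toy_df <- tibble(
  id = c("doi:A", "doi:A", "doi:B", "doi:C"),
  to = c("doi:B", "doi:X", NA, "doi:A")
)
toy_person <- "orcid:P"

# One node per distinct output
toy_nodes <- toy_df %>%
  select(name = id) %>%
  distinct()

# Keep only links whose target is one of the person's own outputs:
# "doi:X" is unknown and NA is missing, so both rows are removed
toy_output_edges <- toy_df %>%
  select(source = id, target = to) %>%
  filter(target %in% toy_nodes$name)

# Connect the person to every output, then append the output-to-output links
toy_edges <- tibble(source = toy_person, target = toy_nodes$name) %>%
  bind_rows(toy_output_edges)

print(toy_edges)
```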
For the graph visualisation, I use the popular network analysis package igraph. First, the node and edge data are transformed to an igraph object. I also want to remove potential loops (“self-links”).
library(igraph)
g <-
  graph_from_data_frame(d = my_edges,
                        vertices = my_nodes,
                        directed = FALSE)
#' remove potential loops
g <- igraph::simplify(g)
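A toy graph illustrates what simplify() does. With its default settings it removes self-loops and also merges multiple edges between the same pair of nodes. The node names and types below are invented for this sketch.

```r
library(igraph)

# Toy node and edge tables; names and types are made up
toy_nodes <- data.frame(
  name = c("a", "b", "c"),
  type = c("Person", "Dataset", "Dataset")
)
toy_edges <- data.frame(
  source = c("a", "b", "a", "b"),
  target = c("b", "a", "a", "c")  # a-b twice (undirected), plus one self-loop a-a
)

g_toy <- graph_from_data_frame(d = toy_edges, vertices = toy_nodes, directed = FALSE)
n_before <- ecount(g_toy)

# Drop the self-loop and collapse the duplicated a-b edge
g_simple <- igraph::simplify(g_toy)
n_after <- ecount(g_simple)

print(c(before = n_before, after = n_after))
```

Vertex attributes such as type are kept, so they remain available for colouring the nodes later on.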
Next, some visualisation parameters are defined, including node colours and labels. Here, node colours represent the person and the three different publication types.
#' define node colours
my_palette <-
  c("#6da7de", "#9e0059", "#dee000", "#d82222")
my_color <- my_palette[as.numeric(as.factor(V(g)$type))]
#' don't display label
V(g)$label = NA
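The colour assignment relies on the position of each type among the sorted factor levels. The sketch below reproduces this on a toy vector and compares it with an explicit named lookup. The type names Dataset and SoftwareSourceCode are my assumption about what the API returns; only ScholarlyArticle and Person appear in the output above.

```r
my_palette <- c("#6da7de", "#9e0059", "#dee000", "#d82222")

# Toy vector of node types (Dataset and SoftwareSourceCode are assumed names)
toy_types <- c("ScholarlyArticle", "Person", "Dataset", "SoftwareSourceCode", "Dataset")

# Approach used in this post: index the palette by the factor level position
by_factor <- my_palette[as.numeric(as.factor(toy_types))]

# Alternative sketch: an explicit lookup table from type to colour
type_colours <- c(
  Dataset            = "#6da7de",
  Person             = "#9e0059",
  ScholarlyArticle   = "#dee000",
  SoftwareSourceCode = "#d82222"
)
by_name <- unname(type_colours[toy_types])

print(data.frame(type = toy_types, by_factor = by_factor, by_name = by_name))
```

When all four types are present, both approaches agree. The named lookup keeps colours stable if a person has, say, no datasets, whereas the factor-based index would shift the remaining types to earlier palette entries.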
Finally, let’s visualise Scott’s publication network according to DataCite metadata.
plot(simplify(g), vertex.color = my_color,
     vertex.frame.color = my_color,
     edge.arrow.mode = 0)
legend(
  "bottomleft",
  legend = levels(as.factor(V(g)$type)),
  col = my_palette,
  bty = "n",
  pch = 20,
  pt.cex = 2.5,
  cex = 1,
  horiz = FALSE,
  inset = c(0.1, -0.1)
)
Using data analytics is a great outreach activity to promote the PID Graph. During the workshop, participants were able to run the interactive notebooks with analytical code. This gave them hands-on experience of how to interface the graph with GraphQL. It also led to great discussions about the PID Graph’s indexing coverage and potential use cases. In particular, participants raised the issue of still-incomplete PID metadata coverage. In our example, for instance, we likely miss a considerable number of Scott’s software projects linked with a DOI, because the underlying metadata records lack his ORCID.
Besides the fruitful discussion about PID coverage in the metadata, I had the feeling that many participants struggled to follow the data transformation steps. Therefore, I decided to dry out the code from the initial notebook, that is, to reduce repetition, using the packages tidyr and dplyr from the tidyverse. I hope that this approach makes the examples clearer.
In the future, the FREYA team will continuously extend the indexing coverage of the PID Graph in collaboration with related research graph activities from OpenAIRE (Manghi et al. 2019), and the Wikibase community. On 25 October, there will be a joint meeting of large data providers for Open Science Graphs at the RDA 14th Plenary. Together with the Software Sustainability Institute, FREYA will hold a day-long hackathon on 4 December at the British Library so as to further improve data analytics using the PID graph.
I would like to thank Martin Fenner, Kristian Garza, Slava Tykhonov, and Maaike de Jong for having me at the workshop, and their valuable help with the analysis and the use of the PID Graph with Jupyter Notebooks.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Jahn (2019, Oct. 24). Scholarly Communication Analytics: Interfacing the PID Graph with R. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/datacite_graph/
BibTeX citation
@misc{jahn2019interfacing,
  author = {Jahn, Najko},
  title = {Scholarly Communication Analytics: Interfacing the PID Graph with R},
  url = {https://subugoe.github.io/scholcomm_analytics/posts/datacite_graph/},
  year = {2019}
}