We investigated OpenAlex and found over four million records with contradictory open access metadata. To illustrate this issue, we applied Unpaywall’s methodology to OpenAlex data. The comparison revealed a shift: over one million journal articles published in 2023 that were labelled “closed” in OpenAlex were reclassified as “gold”, “hybrid”, “green”, or “bronze”.
Over the last few months, we have switched our data source for open access analytics from Unpaywall to OpenAlex. Both open scholarly data services are developed by OurResearch and use a similar metadata format for describing open access full-texts. However, OpenAlex provides monthly data dumps, which we find particularly helpful because the release of free Unpaywall snapshots appears to have been discontinued since March 2022.
While transitioning from Unpaywall to OpenAlex, we noticed more than four million OpenAlex records with contradictory open access metadata. This blog post aims to explore this issue. To better understand this, we reimplemented Unpaywall’s open access classification using OpenAlex data, and compared our relabelled open access status information against OpenAlex’ existing data.
OpenAlex provides various methods for identifying open access literature. Within the work object, the open_access and best_oa_location elements, among others, contain information about the open access status at the article level. The sources object, on the other hand, gives information about the open access model of a journal.
The issue we have identified is that filtering for open access works with is_oa:true returns more than four million records whose oa_status is marked as closed, which is inconsistent with OpenAlex’s own documentation.
According to this documentation, OpenAlex follows Unpaywall’s methodology, tagging openly available works (is_oa) and qualifying their open access status (oa_status) with one of the labels “gold”, “hybrid”, “green”, or “bronze”.
In case no open access full-text could be found, the open access status is marked as “closed”.
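The contradiction described above can be checked record by record. The following is a minimal Python sketch, not part of our BigQuery workflow: it assumes each work is available as a dictionary shaped like the OpenAlex work object, reduced to the open_access element. The function name is_contradictory and the sample DOIs are hypothetical and only serve as illustration.

```python
def is_contradictory(work: dict) -> bool:
    """Return True if a work is flagged as open access but labelled "closed"."""
    # A missing or null open_access element is treated as "no information".
    open_access = work.get("open_access") or {}
    return (
        open_access.get("is_oa") is True
        and open_access.get("oa_status") == "closed"
    )


# Hypothetical sample records (made-up DOIs), reduced to the relevant fields.
works = [
    {"doi": "10.0000/example-a", "open_access": {"is_oa": True, "oa_status": "gold"}},
    {"doi": "10.0000/example-b", "open_access": {"is_oa": True, "oa_status": "closed"}},
    {"doi": "10.0000/example-c", "open_access": {"is_oa": False, "oa_status": "closed"}},
]

flagged = [work["doi"] for work in works if is_contradictory(work)]
print(flagged)
```

Only the second sample record combines is_oa set to true with the status “closed”, so it is the only one flagged.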
To better understand this issue, we analysed the most recent OpenAlex snapshot from October 2023. After importing the data into our BigQuery data warehouse, we created a subset focusing on journal articles published since 2013, excluding retractions and non-scholarly content published in journals.
CREATE OR REPLACE TABLE
  `subugoe-collaborative.resources.oalex_cr_journal_articles_13_23` AS (
  SELECT
    doi,
    publication_year,
    open_access,
    best_oa_location,
    sources.is_oa AS journal_is_oa,
    sources.is_in_doaj AS journal_is_in_doaj,
    sources.host_organization_name AS publisher_name
  FROM
    `subugoe-collaborative.openalex.works`
  LEFT JOIN
    `subugoe-collaborative.openalex.sources` AS sources
  ON
    primary_location.source.id = sources.id
  WHERE
    type_crossref = "journal-article"
    AND is_paratext = FALSE
    AND is_retracted = FALSE
    AND publication_year BETWEEN 2013
    AND 2023 )
We then analysed the open access prevalence over the years, aggregating the record counts across both is_oa and oa_status.
SELECT
  COUNT(DISTINCT doi) AS articles,
  publication_year,
  open_access.is_oa,
  open_access.oa_status
FROM
  `subugoe-collaborative.resources.oalex_cr_journal_articles_13_23`
GROUP BY
  open_access.is_oa,
  open_access.oa_status,
  publication_year
ORDER BY
  publication_year DESC
The resulting figure shows the distribution of open access evidence in OpenAlex over the years. All possible open access status values, as known from Unpaywall, were also represented in OpenAlex. The figure also presents the number of records with (blue bar chart stacks) or without (grey bar chart stacks) open access full-text according to the information provided by is_oa. Notably, the bulk of contradictory open access information could be found in records representing journal articles published in 2023, with 1,197,013 articles tagged as open access, but assigned the open access status “closed”.
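For readers without access to a data warehouse, the same aggregation can be approximated in plain Python. This is a rough sketch under the assumption that the subset has been exported as rows of (doi, publication_year, is_oa, oa_status); the sample rows and DOIs are invented for illustration. As in the SQL query, each DOI is counted only once per group.

```python
from collections import defaultdict

# Hypothetical rows: (doi, publication_year, is_oa, oa_status).
rows = [
    ("10.0000/example-a", 2023, True, "closed"),
    ("10.0000/example-a", 2023, True, "closed"),  # duplicate row, counted once
    ("10.0000/example-b", 2023, True, "gold"),
    ("10.0000/example-c", 2022, False, "closed"),
]

# Collect distinct DOIs per (year, is_oa, oa_status) group,
# mirroring COUNT(DISTINCT doi) ... GROUP BY in the query above.
dois_per_group = defaultdict(set)
for doi, year, is_oa, oa_status in rows:
    dois_per_group[(year, is_oa, oa_status)].add(doi)

article_counts = {group: len(dois) for group, dois in dois_per_group.items()}

# Print the groups with the most recent publication year first.
for group, n in sorted(article_counts.items(), key=lambda item: item[0][0], reverse=True):
    print(group, n)
```

Groups in which is_oa is true while oa_status is “closed” correspond to the contradictory records discussed above.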
To address this inconsistency, we reimplemented Unpaywall’s open access classification methodology. The SQL code snippet shows how we approached reclassification.
CREATE OR REPLACE TABLE
  `subugoe-collaborative.resources.oalex_reclassify_oa` AS (
  SELECT
    DISTINCT doi,
    publication_year,
    open_access.is_oa,
    open_access.oa_status,
    CASE
      WHEN best_oa_location IS NULL THEN "closed"
      WHEN best_oa_location.source.type = "repository" THEN "green"
      WHEN (journal_is_in_doaj = TRUE OR journal_is_oa = TRUE) THEN "gold"
      WHEN (journal_is_in_doaj = FALSE AND journal_is_oa = FALSE)
        AND best_oa_location.license IS NOT NULL THEN "hybrid"
      WHEN (journal_is_in_doaj = FALSE AND journal_is_oa = FALSE)
        AND best_oa_location.license IS NULL THEN "bronze"
      ELSE
        NULL
    END
    AS oa_new
  FROM
    `subugoe-collaborative.resources.oalex_cr_journal_articles_13_23` )
Because is_oa was used inconsistently with the open access status labels, we relied on the best_oa_location element instead to determine the availability of at least one open access full-text. If this metadata element was absent, we categorised the work as “closed”. For open access works not exclusively provided by a repository (“green”), we used open access journal information from the source object to distinguish between “gold”, “hybrid”, and “bronze”.
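The decision order of the CASE expression can also be written as a small Python function, which may help when testing the rules on individual records. This is an illustrative sketch rather than the code we ran: the function name reclassify is hypothetical, and best_oa_location is assumed to be a dictionary with the source.type and license fields used in the SQL. None stands in for SQL NULL, so a journal flag that is missing satisfies neither the “gold” nor the “hybrid”/“bronze” conditions.

```python
from typing import Optional


def reclassify(
    best_oa_location: Optional[dict],
    journal_is_in_doaj: Optional[bool],
    journal_is_oa: Optional[bool],
) -> Optional[str]:
    """Mirror the CASE expression: the first matching rule wins."""
    # No open access location at all: the work is closed.
    if best_oa_location is None:
        return "closed"
    # The best location is a repository copy.
    source = best_oa_location.get("source") or {}
    if source.get("type") == "repository":
        return "green"
    # Published in a fully open access journal.
    if journal_is_in_doaj is True or journal_is_oa is True:
        return "gold"
    # Not an open access journal: the licence separates hybrid from bronze.
    if journal_is_in_doaj is False and journal_is_oa is False:
        if best_oa_location.get("license") is not None:
            return "hybrid"
        return "bronze"
    # Journal information is missing, corresponding to ELSE NULL.
    return None


# Hypothetical example: licensed article in a subscription journal.
location = {"source": {"type": "journal"}, "license": "cc-by"}
print(reclassify(location, journal_is_in_doaj=False, journal_is_oa=False))
```

Because the rules are evaluated in order, a repository copy is labelled “green” before any journal information is considered, exactly as in the SQL snippet.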
After reclassification, we calculated the updated open access statistics.
SELECT
  COUNT(DISTINCT doi) AS n,
  oa_status,
  oa_new,
  publication_year
FROM
  `subugoe-collaborative.resources.oalex_reclassify_oa`
GROUP BY
  oa_status,
  oa_new,
  publication_year
The following figure compares OpenAlex’s open access classification (black bars) with our approach (pink bars). Notably, after reclassification many journal articles published in 2023 that were previously tagged as “closed” received one of the open access values “gold”, “hybrid”, “green”, or “bronze”.
Overall, we reclassified a total of 4,087,711 records representing journal articles published since 2013, with 1,257,175 of them being published in 2023. The following figure demonstrates changes in open access status after our reclassification for 2023. The “gold” category gained 607,896 additional records in 2023, “hybrid” gained 340,351, “green” gained 96,156, and “bronze” gained 211,065. The figure also highlights that we not only relabelled records that previously belonged to the “closed” category but that there were also changes between other categories.
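Movements between categories can be tallied from pairs of original and reclassified status. The sketch below uses invented pairs and counts every record whose label changed, then sums the inflow per new category. It reports gross inflow only, which is an assumption on our part and may not match how the figures above were derived.

```python
from collections import Counter

# Hypothetical (oa_status, oa_new) pairs, one per record.
pairs = [
    ("closed", "gold"),
    ("closed", "gold"),
    ("closed", "hybrid"),
    ("bronze", "hybrid"),  # a change between two open access categories
    ("gold", "gold"),      # unchanged, ignored below
    ("closed", "closed"),  # unchanged, ignored below
]

# Count only records whose status changed after reclassification.
transitions = Counter((old, new) for old, new in pairs if old != new)

# Sum the records that moved into each new category (gross inflow).
inflow = Counter()
for (old, new), n in transitions.items():
    inflow[new] += n

print(transitions)
print(inflow)
```

In this toy example “hybrid” receives one record from “closed” and one from “bronze”, which illustrates that not all changes originate in the “closed” category.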
Analysing and reclassifying open access data in OpenAlex revealed inconsistencies in the actual implementation. The is_oa filter, which indicates the availability of open access full texts, did not always match the open access status information.
In response, we share this detailed problem description to contribute to the ongoing improvement of OpenAlex, a scholarly data source that we enjoy working with on a daily basis. As a practical suggestion in the meantime, we recommend not relying solely on the open access information provided. Instead, we suggest reclassifying open access status information based on OpenAlex’ comprehensive metadata about open access full-text availability, for example by reusing the code snippets provided within this blog post.
This work is funded by the Bundesministerium für Bildung und Forschung (BMBF) projects KBMINE (16WIK2101F) and KBOPENBIB (16WIK2301E). We acknowledge the support of the German Competence Center for Bibliometrics.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Jahn, et al. (2023, Nov. 7). Scholarly Communication Analytics: Analysing and reclassifying open access information in OpenAlex. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/oalex_oa_status/
BibTeX citation
@misc{jahn2023analysing, author = {Jahn, Najko and Haupka, Nick and Hobert, Anne}, title = {Scholarly Communication Analytics: Analysing and reclassifying open access information in OpenAlex}, url = {https://subugoe.github.io/scholcomm_analytics/posts/oalex_oa_status/}, year = {2023} }