Analysing and reclassifying open access information in OpenAlex

We investigated OpenAlex and found over four million records with contradictory open access metadata. To illustrate the issue, we applied Unpaywall’s methodology to OpenAlex data. The comparison revealed a shift: over one million journal articles published in 2023 that were labelled “closed” in OpenAlex were reclassified as “gold”, “hybrid”, “green”, or “bronze”.

Najko Jahn, Nick Haupka, Anne Hobert (State and University Library Göttingen, https://www.sub.uni-goettingen.de/)
2023-11-07

Over the last few months, we have switched our data source for open access analytics from Unpaywall to OpenAlex. Both open scholarly data services are developed by OurResearch and use a similar metadata format for describing open access full-texts. However, OpenAlex provides monthly data dumps, which we find particularly helpful because the release of free Unpaywall snapshots appears to have been discontinued since March 2022.

While transitioning from Unpaywall to OpenAlex, we noticed more than four million OpenAlex records with contradictory open access metadata. This blog post explores the issue. To better understand it, we reimplemented Unpaywall’s open access classification using OpenAlex data and compared our relabelled open access status information against OpenAlex’s existing data.

What is the issue?

OpenAlex provides various methods for identifying open access literature. Within the work object, the open_access and best_oa_location elements, among others, contain information about the open access status at the article level. The sources object, on the other hand, gives information about the open access model of a journal.

The issue we have identified is that filtering for open access works with is_oa:true returns more than four million records whose oa_status is marked as closed, which contradicts OpenAlex’s own documentation. According to that documentation, OpenAlex follows Unpaywall’s methodology, tagging openly available works (is_oa) and qualifying their open access status (oa_status) using the following labels:

gold: published in a fully open access journal
green: toll-access on the publisher landing page, but a free copy is available in an open access repository
hybrid: free under an open license in a toll-access journal
bronze: free to read on the publisher landing page, but without an identifiable license

In case no open access full-text could be found, the open access status is marked as “closed”.
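
The contradiction described above can be expressed as a small consistency check. The following Python sketch is illustrative only: the function name, the sample records, and the placeholder DOIs are assumptions, and the record layout merely imitates the open_access element of an OpenAlex work.

```python
from typing import Optional

# Open access labels that imply a freely available full-text.
OA_LABELS = {"gold", "hybrid", "green", "bronze"}


def is_contradictory(is_oa: Optional[bool], oa_status: Optional[str]) -> bool:
    """Return True when the is_oa flag and the oa_status label disagree."""
    if is_oa is None or oa_status is None:
        # Assumption: missing metadata is not counted as a contradiction.
        return False
    return is_oa != (oa_status in OA_LABELS)


# Hypothetical records with placeholder DOIs, for illustration only.
records = [
    {"doi": "10.1234/a", "open_access": {"is_oa": True, "oa_status": "closed"}},
    {"doi": "10.1234/b", "open_access": {"is_oa": True, "oa_status": "gold"}},
    {"doi": "10.1234/c", "open_access": {"is_oa": False, "oa_status": "closed"}},
]

flagged = [
    rec["doi"]
    for rec in records
    if is_contradictory(rec["open_access"]["is_oa"], rec["open_access"]["oa_status"])
]
```

In this toy sample only the first record is flagged, because it is tagged as open access while carrying the status “closed”.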

Understanding the issue

To better understand this issue, we analysed the most recent OpenAlex snapshot from October 2023. After importing the data into our BigQuery data warehouse, we created a subset focusing on journal articles published since 2013, excluding retractions and non-scholarly content published in journals.

CREATE OR REPLACE TABLE
  `subugoe-collaborative.resources.oalex_cr_journal_articles_13_23` AS (
  SELECT
    doi,
    publication_year,
    open_access,
    best_oa_location,
    sources.is_oa AS journal_is_oa,
    sources.is_in_doaj AS journal_is_in_doaj,
    sources.host_organization_name AS publisher_name
  FROM
    `subugoe-collaborative.openalex.works`
  LEFT JOIN
    `subugoe-collaborative.openalex.sources` AS sources
  ON
    primary_location.source.id = sources.id
  WHERE
    type_crossref = "journal-article"
    AND is_paratext = FALSE
    AND is_retracted = FALSE
    AND publication_year BETWEEN 2013 AND 2023 )

We then analysed the open access prevalence over the years, aggregating the record counts across both is_oa and oa_status.

SELECT
  COUNT(DISTINCT doi) AS articles,
  publication_year,
  open_access.is_oa,
  open_access.oa_status
FROM
  `subugoe-collaborative.resources.oalex_cr_journal_articles_13_23`
GROUP BY
  open_access.is_oa,
  open_access.oa_status,
  publication_year
ORDER BY
  publication_year DESC

The resulting figure shows the distribution of open access evidence in OpenAlex over the years. All possible open access status values, as known from Unpaywall, were also represented in OpenAlex. The figure also presents the number of records with (blue bar chart stacks) or without (grey bar chart stacks) open access full-text according to the information provided by is_oa. Notably, the bulk of contradictory open access information could be found in records representing journal articles published in 2023, with 1,197,013 articles tagged as open access, but assigned the open access status “closed”.
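
For readers who want to reproduce the aggregation behind this figure outside BigQuery, the following Python sketch mirrors the GROUP BY query on a list of flattened records. The function name and the sample data (including the placeholder DOIs) are assumptions made for illustration; this is not the code we ran.

```python
from collections import defaultdict


def count_by_oa_evidence(records):
    """Count distinct DOIs per (publication_year, is_oa, oa_status)."""
    dois = defaultdict(set)
    for rec in records:
        if rec["doi"] is None:
            # COUNT(DISTINCT doi) ignores NULLs. Unlike SQL, a group that has
            # only missing DOIs is omitted here instead of reported as 0.
            continue
        key = (rec["publication_year"], rec["is_oa"], rec["oa_status"])
        dois[key].add(rec["doi"])
    counts = {key: len(values) for key, values in dois.items()}
    # Newest publication year first, as in ORDER BY publication_year DESC.
    return dict(sorted(counts.items(), key=lambda item: item[0][0], reverse=True))


# Hypothetical sample; the first record is deliberately duplicated.
sample = [
    {"doi": "10.1234/a", "publication_year": 2023, "is_oa": True, "oa_status": "closed"},
    {"doi": "10.1234/a", "publication_year": 2023, "is_oa": True, "oa_status": "closed"},
    {"doi": "10.1234/b", "publication_year": 2023, "is_oa": True, "oa_status": "gold"},
    {"doi": "10.1234/c", "publication_year": 2022, "is_oa": False, "oa_status": "closed"},
]

counts = count_by_oa_evidence(sample)
```

Collecting DOIs in sets ensures that a duplicated record is counted only once, as COUNT(DISTINCT doi) does in the query.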

Reclassification and analysis of changes

To address this inconsistency, we reimplemented Unpaywall’s open access classification methodology. The following SQL snippet shows how we approached the reclassification.

CREATE OR REPLACE TABLE
  `subugoe-collaborative.resources.oalex_reclassify_oa` AS (
  SELECT
    DISTINCT doi,
    publication_year,
    open_access.is_oa,
    open_access.oa_status,
    CASE
      WHEN best_oa_location IS NULL THEN "closed"
      WHEN best_oa_location.source.type = "repository" THEN "green"
      WHEN (journal_is_in_doaj = TRUE OR journal_is_oa = TRUE) THEN "gold"
      WHEN (journal_is_in_doaj = FALSE AND journal_is_oa = FALSE)
        AND best_oa_location.license IS NOT NULL THEN "hybrid"
      WHEN (journal_is_in_doaj = FALSE AND journal_is_oa = FALSE)
        AND best_oa_location.license IS NULL THEN "bronze"
      ELSE NULL
    END AS oa_new
  FROM
    `subugoe-collaborative.resources.oalex_cr_journal_articles_13_23` )

Because is_oa is used inconsistently with the open access status labels, we relied on the best_oa_location element instead to determine whether at least one open access full-text is available. If this metadata element was absent, we categorised the work as “closed”. If the best open access location was a repository, we labelled the work “green”. For the remaining open access works, we used open access journal information from the sources object to distinguish between “gold”, “hybrid”, and “bronze”.
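
The decision order of the CASE expression can also be read as a plain function. The Python sketch below is an illustrative translation under the assumption that each record is a dictionary shaped like the table columns; the function name and the example inputs are hypothetical. It keeps the SQL behaviour for missing journal information: if either journal flag is unknown and none is TRUE, no label is assigned.

```python
from typing import Optional


def reclassify(
    best_oa_location: Optional[dict],
    journal_is_in_doaj: Optional[bool],
    journal_is_oa: Optional[bool],
) -> Optional[str]:
    """Mirror the CASE expression; branches are evaluated top to bottom."""
    if best_oa_location is None:
        return "closed"  # no open access full-text location recorded
    source = best_oa_location.get("source") or {}
    if source.get("type") == "repository":
        return "green"  # best copy is hosted by a repository
    if journal_is_in_doaj is True or journal_is_oa is True:
        return "gold"  # fully open access journal
    if journal_is_in_doaj is False and journal_is_oa is False:
        # Toll-access journal: an open license separates hybrid from bronze.
        if best_oa_location.get("license") is not None:
            return "hybrid"
        return "bronze"
    return None  # journal flags unknown (NULL in SQL): no label assigned


# Hypothetical inputs, for illustration only.
example_hybrid = reclassify(
    {"source": {"type": "journal"}, "license": "cc-by"}, False, False
)
example_unknown = reclassify(
    {"source": {"type": "journal"}, "license": None}, None, False
)
```

Because the repository check comes before the journal checks, a work whose best open access location is a repository is labelled “green” even if it appears in an open access journal, exactly as in the SQL above.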

After reclassification, we calculated the updated open access statistics.

SELECT
  COUNT(DISTINCT doi) AS n,
  oa_status,
  oa_new,
  publication_year
FROM
  `subugoe-collaborative.resources.oalex_reclassify_oa`
GROUP BY
  oa_status,
  oa_new,
  publication_year

The following figure compares OpenAlex’s open access classification (black bars) with our approach (pink bars). Notably, many journal articles published in 2023 that were previously tagged as “closed” received one of the open access values “gold”, “hybrid”, “green”, or “bronze” after reclassification.

Overall, we reclassified a total of 4,087,711 records representing journal articles published since 2013, of which 1,257,175 were published in 2023. The following figure shows the changes in open access status after our reclassification for 2023. The “gold” category gained 607,896 additional records in 2023, “hybrid” gained 340,351, “green” gained 96,156, and “bronze” gained 211,065. The figure also highlights that we not only relabelled records that previously belonged to the “closed” category, but that records also moved between other categories.
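
Movements between categories can be summarised from the rows returned by the comparison query, which pair the original status (oa_status) with the new one (oa_new). The Python sketch below computes a net change per category from such rows. The function name and the toy numbers are assumptions for illustration and are not taken from our results; a net change may also differ from the gains reported above, depending on how those were derived.

```python
from collections import Counter


def net_change(transitions):
    """Sum movements per category from (oa_status, oa_new, n) rows."""
    change = Counter()
    for old, new, n in transitions:
        if old == new:
            continue  # unchanged records do not move between categories
        change[old] -= n  # records leaving their original category
        change[new] += n  # records arriving in the new category
    return dict(change)


# Toy numbers, invented for illustration only.
toy_transitions = [
    ("closed", "gold", 5),
    ("closed", "green", 2),
    ("bronze", "hybrid", 1),
    ("gold", "gold", 10),
]

toy_result = net_change(toy_transitions)
```

Since every relabelled record leaves one category and enters another, the net changes always sum to zero, which is a quick sanity check on the transition counts.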