Recent Changes in Document type classification in OpenAlex compared to Web of Science and Scopus

In June 2024, we published a preprint on the classification of document types in OpenAlex and compared it with the scholarly databases Web of Science, Scopus, PubMed and Semantic Scholar. In this follow-up study, we want to investigate further developments in OpenAlex and compare the results with the proprietary databases Scopus and Web of Science.

Nick Haupka (State and University Library Göttingen, https://www.sub.uni-goettingen.de/), Sophia Dörner (State and University Library Göttingen, https://www.sub.uni-goettingen.de/), Najko Jahn (State and University Library Göttingen, https://www.sub.uni-goettingen.de/)
2024-09-04

In June 2024, we submitted an analysis of publication and document types in OpenAlex in comparison with the proprietary databases Web of Science and Scopus and the open data sources Semantic Scholar and PubMed (Haupka et al. 2024). We found substantial differences between these databases: while Web of Science and Scopus provided a comprehensive set of document types to describe works published in journals, OpenAlex supported only a comparatively limited number of types. Notably, OpenAlex lacked a distinction between research articles and reviews, which can be crucial when calculating citation indicators. In line with related studies (Alperin et al. 2024), we also observed discrepancies in the number of publications when restricting to certain document types.

Meanwhile, in late May and late July 2024, OpenAlex introduced extended approaches to obtain publication and document types. Among the four new categories were preprints and reviews. Using PubMed, OpenAlex identified approximately 4 million journal articles as editorials, errata, letters, preprints, reviews, or retractions.

Of course, we wanted to know how these improvements affect our findings. We therefore re-applied our approach to the updated data. Using works published in journals between 2012 and 2022, we demonstrate that OpenAlex’s recent changes provide a more nuanced set of document types for differentiating scholarly works. However, the comparison with Web of Science and Scopus reveals that considerable differences remain.

Data and Methods

Following our preprint, we performed a pairwise comparison of journal publications from 2012 to 2022 indexed in OpenAlex with those indexed in Web of Science and Scopus. To investigate the changes made in OpenAlex, we also compared data from the OpenAlex August 2023 and July 2024 snapshots. Scopus and Web of Science data were retrieved from the German Competence Network of Bibliometrics, using the April 2024 snapshots. For Web of Science, we used the Core Collection. We matched items between the databases by DOI after normalising DOIs to lowercase. Overall, the intersection of OpenAlex and Scopus comprised 24,704,172 records, and the intersection of OpenAlex and Web of Science comprised 21,775,771 records.
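The DOI-based matching described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the record layout (dictionaries with `doi` and `type` keys), the function names, the whitespace stripping, and the example DOIs are all assumptions made for the sketch. Only the lowercase normalisation is taken from the text.

```python
def normalise_doi(doi):
    """Lowercase a DOI; stripping whitespace is an added assumption."""
    if not doi:
        return None
    return doi.strip().lower() or None


def index_by_doi(records):
    """Build a lookup from normalised DOI to record, skipping records without a DOI."""
    index = {}
    for record in records:
        key = normalise_doi(record.get("doi"))
        if key is not None:
            # If a DOI occurs twice, the later record wins (a simplification).
            index[key] = record
    return index


def match_by_doi(left, right):
    """Return the records present in both sources, keyed by normalised DOI."""
    left_index = index_by_doi(left)
    right_index = index_by_doi(right)
    shared = left_index.keys() & right_index.keys()
    return {doi: (left_index[doi], right_index[doi]) for doi in sorted(shared)}


# Toy records with made-up DOIs, for illustration only.
openalex = [
    {"doi": "10.1000/ABC123", "type": "article"},
    {"doi": "10.1000/xyz789", "type": "review"},
    {"doi": None, "type": "article"},
]
scopus = [
    {"doi": "10.1000/abc123", "type": "Article"},
    {"doi": "10.1000/QRS456", "type": "Letter"},
]

matched = match_by_doi(openalex, scopus)
print(len(matched))
```

Because the comparison is case-insensitive, the first toy record matches even though the two sources spell its DOI differently; records without a DOI never enter the intersection.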

Then, we categorised works based on their document type information into two categories: research discourse and editorial discourse. The research discourse category now also includes publications of type “preprint”, which was added to OpenAlex in May 2024. The mapping tables used for reclassifying the document types can be found in the appendix of Haupka et al. (2024).
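As a rough sketch of this reclassification step, the snippet below maps document types to the two categories. The type lists here are hypothetical and abbreviated for illustration; the mapping tables actually used are those in the appendix of Haupka et al. (2024), and the `not assigned` fallback is likewise an assumption of this sketch.

```python
# Hypothetical, abbreviated mapping for illustration only; the mapping
# tables actually used are in the appendix of Haupka et al. (2024).
RESEARCH_TYPES = {"article", "review", "preprint"}
EDITORIAL_TYPES = {"editorial", "letter", "erratum", "retraction", "paratext"}


def discourse_category(document_type):
    """Map a database-specific document type to one of the two categories."""
    key = document_type.strip().lower()
    if key in RESEARCH_TYPES:
        return "research discourse"
    if key in EDITORIAL_TYPES:
        return "editorial discourse"
    # Types outside the illustrative lists are left unassigned.
    return "not assigned"


for doc_type in ["Preprint", "letter", "dataset"]:
    print(doc_type, "->", discourse_category(doc_type))
```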

Findings

Figure 1 illustrates OpenAlex document type changes in comparison with Scopus. Before the introduction of the more nuanced set of document types, OpenAlex tagged 24,559,634 items (99.42%) as articles; afterwards, this number fell to 22,132,347 (89.59%). Scopus tagged 20,777,473 items (84.11%) as articles. OpenAlex assigned the type review to 1,511,172 items (6.12%), whereas Scopus assigned it to 1,776,555 items (7.19%).

Figure 1: Comparison of OpenAlex and Scopus for publication years 2012-2022

Figure 2 illustrates the same for the comparison of OpenAlex with Web of Science. Here, OpenAlex tagged 21,673,833 items (99.53%) as articles before the introduction of the more nuanced set of document types and 19,500,710 (89.55%) after. In Web of Science, 17,266,997 items (79.29%) were tagged as articles. OpenAlex assigned the document type review to 1,362,290 items (6.26%), whereas Web of Science tagged 1,242,472 items (5.71%) as such.

Figure 2: Comparison of OpenAlex and Web of Science for publication years 2012-2022
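The article shares for the current OpenAlex snapshot can be recomputed from the counts given in the text by dividing each count by the size of the respective DOI-matched intersection. The sketch below does this and derives the percentage-point gap between OpenAlex and each comparison database. The function and variable names are illustrative assumptions; the counts are those reported above.

```python
def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Sizes of the DOI-matched intersections reported in Data and Methods.
TOTALS = {"Scopus": 24_704_172, "Web of Science": 21_775_771}

# Items tagged as articles: (OpenAlex after the changes, comparison database).
ARTICLES = {
    "Scopus": (22_132_347, 20_777_473),
    "Web of Science": (19_500_710, 17_266_997),
}


def article_gap(database):
    """Percentage-point gap in article share between OpenAlex and a database."""
    total = TOTALS[database]
    openalex_count, other_count = ARTICLES[database]
    return round(share(openalex_count, total) - share(other_count, total), 2)


for name in TOTALS:
    openalex_count, other_count = ARTICLES[name]
    print(name, share(openalex_count, TOTALS[name]),
          share(other_count, TOTALS[name]), article_gap(name))
```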

Overall, Figures 1 and 2 demonstrate that even after the introduction of a more nuanced set of document types, OpenAlex still tags a higher proportion of items as articles than the commercial data sources. This difference is more pronounced in the comparison with Web of Science (about 10 percentage points) than in the comparison with Scopus (about 5 percentage points). Scopus tags a higher proportion of items as reviews than OpenAlex, whereas Web of Science tags a slightly lower proportion. Both Scopus and Web of Science still tag more items as editorial content than OpenAlex. In total, 340,998 items tagged as editorials or letters in Scopus and 656,366 items tagged as editorial material or letters in Web of Science are tagged as articles in OpenAlex.
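Counts of this kind come from cross-tabulating the document types of matched items. The following sketch shows one way to do so with a handful of made-up type pairs; the data, the function names, and the set of labels treated as editorial content or letters are assumptions for illustration and do not reproduce the figures above.

```python
from collections import Counter

# Labels treated as editorial content or letters in this sketch (an assumption).
EDITORIAL_OR_LETTER = {"editorial", "editorial material", "letter"}


def crosstab(pairs):
    """Count each (OpenAlex type, other database type) combination."""
    return Counter((oa.lower(), other.lower()) for oa, other in pairs)


def article_but_editorial(pairs):
    """Count items tagged as article in OpenAlex but as editorial or letter elsewhere."""
    table = crosstab(pairs)
    return sum(
        n for (oa, other), n in table.items()
        if oa == "article" and other in EDITORIAL_OR_LETTER
    )


# Made-up type pairs for DOI-matched items: (OpenAlex, comparison database).
toy_pairs = [
    ("article", "Article"),
    ("article", "Editorial Material"),
    ("article", "Letter"),
    ("review", "Review"),
    ("editorial", "Editorial Material"),
]

print(article_but_editorial(toy_pairs))
```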

When grouping the document types into the two categories research discourse and editorial discourse, we found that even after the introduction of a more nuanced set of document types in OpenAlex, the proportion of items labelled as editorial discourse is still about 3% lower compared to Scopus and Web of Science, as shown in the tables below.

Discussion and Outlook

Our updated analysis demonstrates a noticeable improvement in the classification of document types in OpenAlex when compared with Scopus and Web of Science. Compared to data from 2023, the discrepancy in the classification of items has decreased slightly. This indicates a convergence of the classification system in OpenAlex towards those of the proprietary databases, with an enhanced coverage of reviews and editorial materials. In addition, the rule-based string matching for recognising paratexts, introduced and revised by OpenAlex, resulted in more texts being categorised as editorial material than before. However, the results also show that the curation of document types has not yet been finalised.

Finally, we would like to point out that there is no correct classification system per se. Rather, the different classification systems applied by the database operators each bring advantages and disadvantages. In Semantic Scholar and PubMed, for example, publications are labelled as clinical studies and case reports, which in Scopus, Web of Science and OpenAlex are predominantly assigned to the document type article. Differentiating these publications has the potential to increase the quality of bibliometric studies based on the analysed databases. Also, the results from this analysis are only partially comparable with those from our preprint, because in the preprint we worked with a more restrictive set that included publications from Semantic Scholar and PubMed.

Funding

This work is funded by the Bundesministerium für Bildung und Forschung (BMBF) project KBOPENBIB (16WIK2301E). We acknowledge the support of the German Competence Center for Bibliometrics.

References

Alperin, Juan Pablo, Jason Portenoy, Kyle Demes, Vincent Larivière, and Stefanie Haustein. 2024. “An Analysis of the Suitability of OpenAlex for Bibliometric Analyses.” arXiv. https://doi.org/10.48550/arXiv.2404.17663.
Haupka, Nick, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, and Philipp Mayr. 2024. “Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar.” arXiv. https://arxiv.org/abs/2406.15154.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Haupka, et al. (2024, Sept. 4). Scholarly Communication Analytics: Recent Changes in Document type classification in OpenAlex compared to Web of Science and Scopus. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/openalex_document_types/

BibTeX citation

@misc{haupka2024recent,
  author = {Haupka, Nick and Dörner, Sophia and Jahn, Najko},
  title = {Scholarly Communication Analytics: Recent Changes in Document type classification in OpenAlex compared to Web of Science and Scopus},
  url = {https://subugoe.github.io/scholcomm_analytics/posts/openalex_document_types/},
  year = {2024}
}