Is Semantic Scholar suitable for enriching references in OpenAlex?

Author: Nick Haupka

Published: October 27, 2025

DOI: 10.59350/8t2s7-vtw86

Abstract
This blog post explores whether metadata from Semantic Scholar can be used to enrich references in OpenAlex. To do this, I compare the reference counts in Semantic Scholar with those in OpenAlex. Analysing over 37 million journal articles published between 2015 and 2023, I found strong evidence of potential benefits from integrating reference data from Semantic Scholar into OpenAlex.

Introduction

References are an essential part of bibliometric research. For example, references are required to calculate bibliometric indicators such as the journal impact factor and the h-index. University rankings also rely on a proper recording of references, as the number of citations an institution receives is usually used as a criterion. The coverage of references within a bibliometric database therefore plays a significant role.

As Culbert et al. (2025) have demonstrated, OpenAlex has source reference counts comparable to those of Scopus and Web of Science. However, that does not mean that the coverage of references is complete. In this blog post, I use Semantic Scholar to identify additional references for journal articles that could be used to enrich the references in OpenAlex.

Semantic Scholar, developed by the Allen Institute for Artificial Intelligence, is a promising data source for enriching bibliometric data, as it makes a large portion of its metadata available under a free license (ODC-BY). However, the quality of metadata has been little studied to date compared to data sources like Crossref and OpenAlex.

This blog post addresses the following research question:

  • RQ: How does the number of (source) references in OpenAlex and Semantic Scholar differ?

Data and Method

The following analysis is based on data from OpenAlex (Snapshot 08/2025) and Semantic Scholar (Snapshot 02/2025). Data was accessed via the data warehouse of the SUB Göttingen, which is powered by Google BigQuery.

The present study focuses on articles and reviews in journals. For convenience, journal items were filtered solely by the document and publication type recorded in OpenAlex, and the publication year was likewise taken from OpenAlex. Journal items were matched between the two data sources via the Digital Object Identifier (DOI). The analysis covers journal items published from 2015 to 2023. To obtain the number of references from each data source, the precomputed field was used (for example, ‘referencecount’ in Semantic Scholar). However, the precalculated number may differ from the number of references actually contained in a database. For example, a journal article may have a reference count of 18 while all of its references are missing from the database because the publisher decided to withhold this information. To capture such cases, I also calculate the so-called source reference count, that is, the number of references that are linked to other records within the same database.
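The filtering and matching step described above can be sketched roughly as follows. This is a minimal illustration and not the code used for the analysis, which ran in BigQuery. The flat record layout, the key `source_type` and the function names are assumptions made for this example; `type`, `publication_year`, `referenced_works_count` and ‘referencecount’ follow the field names of the two data sources.

```python
from typing import Optional

# Common resolver prefixes that may precede a bare DOI.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip a leading resolver prefix, if any."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def in_scope(row: dict) -> bool:
    """Keep journal articles and reviews published from 2015 to 2023.

    The key 'source_type' is a hypothetical flattened field for this sketch.
    """
    return (
        row.get("type") in {"article", "review"}
        and row.get("source_type") == "journal"
        and 2015 <= row.get("publication_year", 0) <= 2023
    )


def match_by_doi(openalex_rows: list, s2_rows: list) -> list:
    """Inner-join OpenAlex and Semantic Scholar records on the normalized DOI."""
    s2_by_doi = {}
    for row in s2_rows:
        doi = normalize_doi(row.get("doi"))
        if doi:
            s2_by_doi[doi] = row

    matched = []
    for row in openalex_rows:
        if not in_scope(row):
            continue
        doi = normalize_doi(row.get("doi"))
        if doi in s2_by_doi:
            matched.append({
                "doi": doi,
                "oal_ref_count": row["referenced_works_count"],
                "s2_ref_count": s2_by_doi[doi]["referencecount"],
            })
    return matched
```

Normalizing the DOI before joining matters because one source may store the identifier as a full URL and the other as a bare string.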

The source reference count in Semantic Scholar was retrieved by counting items in the field ‘references’ that refer to other items within Semantic Scholar. Unlike the reference count, the actual references from Semantic Scholar are not contained in the data warehouse of the SUB Göttingen, but in an external dataset (available through the Semantic Scholar Datasets API). Because the Semantic Scholar reference dataset is quite large (about 1 TB), I used the High Performance Cluster of the GWDG Göttingen and a purpose-written Python script to calculate the number of source references. The aggregated data was then loaded into BigQuery. Retrieving the source reference count in OpenAlex is far simpler, as OpenAlex only reports the source reference count (field: ‘referenced_works_count’).
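To illustrate the counting step, here is a minimal sketch. It is not the script used for this analysis (that script is in the repository linked under Data and Code Availability). It assumes the reference data arrives as gzip-compressed JSON Lines shards in which each record carries the keys `citingcorpusid` and `citedcorpusid`, and that an unresolved reference has a null `citedcorpusid`. Both the layout and the function names are assumptions made for this example.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_source_references(lines: Iterable[str]) -> Counter:
    """Count, per citing paper, the references that resolve to another record.

    Assumed record layout: {"citingcorpusid": int, "citedcorpusid": int | null}.
    A reference with a null 'citedcorpusid' is treated as not linked.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        citing = record.get("citingcorpusid")
        cited = record.get("citedcorpusid")
        if citing is not None and cited is not None:
            counts[citing] += 1
    return counts


def count_from_shards(directory: str) -> Counter:
    """Stream every *.gz shard in a directory and merge the per-paper counts."""
    total = Counter()
    for path in sorted(Path(directory).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            total.update(count_source_references(handle))
    return total
```

Streaming the shards line by line keeps memory use proportional to the number of citing papers rather than to the size of the dataset. For a dataset of about 1 TB, the shards would in practice be processed in parallel.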

Results

Figure 1a displays a comparison of reference counts in Semantic Scholar and OpenAlex for the publication years 2015 to 2023, based on a shared corpus of approximately 37.5 million journal articles. Reference counts precalculated by Semantic Scholar are shown with a dashed line. The precalculated reference count in Semantic Scholar exceeds the number counted in OpenAlex for the publication years 2015 to 2019, but drops slightly afterwards. The number of source references in Semantic Scholar is lower than in OpenAlex across all publication years. In total, 982,585,826 references were counted in OpenAlex and 994,263,247 references were precalculated in Semantic Scholar. The number of source references in Semantic Scholar totals 896,752,043. Note that, although OpenAlex has a greater number of source references than Semantic Scholar overall, I found about 9 million journal articles with more source references in Semantic Scholar than in OpenAlex (for the chosen time period).
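Two quantities follow directly from these figures: the share of Semantic Scholar's precalculated references that are actually linked to another record (about 90% according to the totals above), and the set of articles for which Semantic Scholar holds more source references than OpenAlex. The sketch below shows how both could be derived. The per-article keys `oal_source_ref_count` and `s2_source_ref_count` and the function names are hypothetical and chosen only for this example.

```python
# Totals reported in the text for the shared corpus (2015 to 2023).
OAL_SOURCE_REFS = 982_585_826
S2_PRECALCULATED_REFS = 994_263_247
S2_SOURCE_REFS = 896_752_043


def linked_share(source_refs: int, precalculated_refs: int) -> float:
    """Fraction of precalculated references that are linked to another record."""
    return source_refs / precalculated_refs


def enrichment_candidates(rows):
    """Yield articles whose Semantic Scholar source reference count exceeds OpenAlex's.

    Each row is a dict with the hypothetical keys 'doi',
    'oal_source_ref_count' and 's2_source_ref_count'.
    """
    for row in rows:
        surplus = row["s2_source_ref_count"] - row["oal_source_ref_count"]
        if surplus > 0:
            yield {"doi": row["doi"], "surplus": surplus}


if __name__ == "__main__":
    share = linked_share(S2_SOURCE_REFS, S2_PRECALCULATED_REFS)
    print(f"Linked share of Semantic Scholar references: {share:.1%}")
```

The per-article comparison matters because the aggregate totals hide these cases: OpenAlex can lead overall while individual articles still gain references from Semantic Scholar.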

In contrast to Figure 1a, Figure 1b shows a comparison of average reference counts in Semantic Scholar and OpenAlex. Again, reference counts precalculated by Semantic Scholar are shown with a dashed line. While the figure shows the same trends as Figure 1a, a slight decline in the average reference counts can be observed from 2021 onwards. On average, OpenAlex counts 25.52 references per journal item, while Semantic Scholar counts 25.83 references and 23.29 source references per article.

Figure 1: Comparison of references in OpenAlex and Semantic Scholar based on a shared corpus.

To examine the differences between Semantic Scholar and OpenAlex in more detail, I break down the precalculated reference count and the source reference count by publisher (see Figures 2a and 2b).

Figure 2: Comparison of references per publisher in Semantic Scholar and OpenAlex.

I also compiled a list of the most common publishers for which OpenAlex or Semantic Scholar has the larger source reference count (see Tables 1 and 2). OpenAlex records more source references than Semantic Scholar for the publishers Elsevier (about 12,600,000 references), Springer (about 9,160,000 references) and Wiley (about 7,460,000 references). Conversely, Semantic Scholar has more source references for the publishers Frontiers Media (about 800,000 references), SAGE (about 512,000 references) and Taylor & Francis (about 269,000 references).

Table 1: The 10 most frequently appearing publishers for which OpenAlex has a higher source reference count than Semantic Scholar

| oal_source_ref_sum | s2_source_ref_sum | publisher_id | publisher_name | diff |
|---|---|---|---|---|
| 221,098,550 | 208,460,916 | https://openalex.org/P4310320990 | Elsevier BV | 12,637,634 |
| 58,524,445 | 46,853,039 | None | None | 11,671,406 |
| 91,912,174 | 82,755,155 | https://openalex.org/P4310319900 | Springer Science+Business Media | 9,157,019 |
| 75,121,716 | 67,664,721 | https://openalex.org/P4310320595 | Wiley | 7,456,995 |
| 25,816,934 | 21,628,791 | https://openalex.org/P4310320006 | American Chemical Society | 4,188,143 |
| 18,203,600 | 14,255,451 | https://openalex.org/P4310320556 | Royal Society of Chemistry | 3,948,149 |
| 19,562,074 | 15,847,098 | https://openalex.org/P4310311648 | Oxford University Press | 3,714,976 |
| 9,996,522 | 6,651,643 | https://openalex.org/P4310320261 | American Physical Society | 3,344,879 |
| 17,313,110 | 15,205,127 | https://openalex.org/P4310320256 | BioMed Central | 2,107,983 |
| 15,143,213 | 13,396,255 | https://openalex.org/P4310319965 | Springer Nature | 1,746,958 |

Table 2: The 10 most frequently appearing publishers for which Semantic Scholar has a higher source reference count than OpenAlex (surplus represented by a minus)

| oal_source_ref_sum | s2_source_ref_sum | publisher_id | publisher_name | diff |
|---|---|---|---|---|
| 27,245,366 | 28,043,664 | https://openalex.org/P4310320527 | Frontiers Media | -798,298 |
| 6,050,726 | 6,629,422 | https://openalex.org/P4310319811 | Emerald Publishing Limited | -578,696 |
| 19,780,360 | 20,292,611 | https://openalex.org/P4310320017 | SAGE Publishing | -512,251 |
| 707,064 | 1,069,281 | https://openalex.org/P4310320466 | OMICS Publishing Group | -362,217 |
| 38,428,778 | 38,697,631 | https://openalex.org/P4310320547 | Taylor & Francis | -268,853 |
| 349,881 | 612,388 | https://openalex.org/P4310320855 | Sciencedomain International | -262,507 |
| 6,447,765 | 6,696,983 | https://openalex.org/P4310319869 | Hindawi Publishing Corporation | -249,218 |
| 1,507,041 | 1,719,281 | https://openalex.org/P4310319798 | Association for Computing Machinery | -212,240 |
| 1,493,723 | 1,655,240 | https://openalex.org/P4310317820 | Karger Publishers | -161,517 |
| 2,069,755 | 2,221,403 | https://openalex.org/P4310320000 | Thieme Medical Publishers (Germany) | -151,648 |
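The publisher-level sums and differences in Tables 1 and 2 could be derived from the per-article counts roughly as follows. This is only a sketch of the aggregation logic. The analysis itself ran in BigQuery, and the per-article keys `oal_source_ref_count` and `s2_source_ref_count` as well as the function name are hypothetical.

```python
from collections import defaultdict


def publisher_differences(rows):
    """Sum source reference counts per publisher and compute OpenAlex minus Semantic Scholar.

    Each row is a dict with the keys 'publisher_name',
    'oal_source_ref_count' and 's2_source_ref_count' (hypothetical names).
    A positive 'diff' means OpenAlex holds more source references.
    """
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        sums = totals[row.get("publisher_name")]
        sums[0] += row["oal_source_ref_count"]
        sums[1] += row["s2_source_ref_count"]

    result = [
        {
            "publisher_name": name,
            "oal_source_ref_sum": oal_sum,
            "s2_source_ref_sum": s2_sum,
            "diff": oal_sum - s2_sum,
        }
        for name, (oal_sum, s2_sum) in totals.items()
    ]
    # Largest OpenAlex surplus first; Semantic Scholar surpluses end up last.
    result.sort(key=lambda entry: entry["diff"], reverse=True)
    return result
```

Sorting by `diff` in descending order yields the ordering of Table 1 at the top of the list, while reading the list from the end yields the ordering of Table 2.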

Finally, I compare the reference count and the source reference count in Semantic Scholar and OpenAlex by topic, using the topics from OpenAlex (see Table 3 and Table 4). For this purpose, I use the domain of the primary topic, which is one of Health Sciences, Social Sciences, Life Sciences and Physical Sciences. For Health Sciences, Semantic Scholar counts on average 22.4 source references per article, while OpenAlex counts 23.9 source references. The picture differs slightly for the precomputed reference count, where Semantic Scholar exceeds OpenAlex (24.1 versus 23.9 references on average). The same applies to the Social Sciences, where the number of references precalculated by Semantic Scholar is greater than the number in OpenAlex. However, when looking at the average number of source references, OpenAlex surpasses Semantic Scholar (16.0 versus 14.5 source references on average).

Table 3: References contained in OpenAlex and precalculated by Semantic Scholar per OpenAlex domain.

| oal_ref_sum | oal_ref_avg | s2_ref_sum | s2_ref_avg | primary_topic |
|---|---|---|---|---|
| 191,680,092 | 36.625827 | 190,953,855 | 36.487059 | Life Sciences |
| 235,953,838 | 23.902660 | 237,839,461 | 24.093678 | Health Sciences |
| 407,198,668 | 31.643654 | 400,246,309 | 31.103382 | Physical Sciences |
| 145,744,398 | 16.035306 | 163,572,762 | 17.996845 | Social Sciences |
| 2,008,830 | 1.398958 | 1,650,860 | 1.149666 | None |

Table 4: Source references contained in OpenAlex and Semantic Scholar per OpenAlex domain.

| oal_source_ref_sum | oal_source_ref_avg | s2_source_ref_sum | s2_source_ref_avg | primary_topic |
|---|---|---|---|---|
| 191,680,092 | 36.625827 | 177,680,421 | 33.950799 | Life Sciences |
| 235,953,838 | 23.902660 | 221,164,605 | 22.404477 | Health Sciences |
| 407,198,668 | 31.643654 | 364,822,452 | 28.350572 | Physical Sciences |
| 145,744,398 | 16.035306 | 132,044,159 | 14.527958 | Social Sciences |
| 2,008,830 | 1.398958 | 1,040,406 | 0.724543 | None |

Conclusion

This analysis reveals notable differences between the reference counts in OpenAlex and Semantic Scholar. In particular, discrepancies were found between publishers, with Semantic Scholar counting more references than OpenAlex for publishers like Taylor & Francis and BMJ. The proportion of source references is higher in OpenAlex than in Semantic Scholar, although about 9 million journal articles were identified for which the source reference count is greater in Semantic Scholar than in OpenAlex. This finding is relevant for bibliometric research, as including this open metadata may enable more accurate citation analysis in OpenAlex.

During my analysis, I also discovered some inconsistencies between the Semantic Scholar website and its API. For example, references for some journal articles are missing in the API but are shown on the website. In other cases, references are missing in the API with the note: “Notice: The following paper fields have been elided by the publisher: {‘references’}”. However, this observation was not analysed further.

Data and Code Availability

The source code for the underlying analysis can be found on GitHub: https://github.com/naustica/oal_s2_ref/.

Data is available in the Data Warehouse of the SUB Göttingen: https://subugoe.github.io/scholcomm_analytics/data.html.

References

Culbert, Jack H., Anne Hobert, Najko Jahn, Nick Haupka, Marion Schmidt, Paul Donner, and Philipp Mayr. 2025. “Reference Coverage Analysis of OpenAlex Compared to Web of Science and Scopus.” Scientometrics 130 (4): 2475–92. https://doi.org/10.1007/s11192-025-05293-3.

Citation

BibTeX citation:
@article{haupka2025,
  author = {Haupka, Nick},
  title = {Is {Semantic} {Scholar} Suitable for Enriching References in
    {OpenAlex?}},
  journal = {Scholarly Communication Analytics},
  date = {2025-10-27},
  url = {https://subugoe.github.io/scholcomm_analytics/posts/s2_reference_analysis/s2_reference_analysis.html},
  doi = {10.59350/8t2s7-vtw86},
  langid = {en},
  abstract = {This blog post explores the question whether metadata from
    Semantic Scholar can be used to enrich references in OpenAlex. To do
    this, I will compare the reference numbers in Semantic Scholar with
    those in OpenAlex. Analysing over 37 million journal articles
    between 2015 and 2023, I found strong evidence of potential benefits
    from integrating reference data from Semantic Scholar into
    OpenAlex.}
}
For attribution, please cite this work as:
Haupka, Nick. 2025. “Is Semantic Scholar Suitable for Enriching References in OpenAlex?” Scholarly Communication Analytics, October. https://doi.org/10.59350/8t2s7-vtw86.