Open Scholarly Data @ SUB Göttingen - Overview

We use Google BigQuery to work with large open scholarly datasets. Our main data sources are Unpaywall, Crossref and OpenAlex.

An overview of our data warehouse, including the procedures used to load the data into BigQuery, can be found below.

Anyone with a Google Cloud account can view and query our publicly available Open Scholarly Data warehouse on BigQuery. Note that Google charges you for the number of bytes processed by each query (currently $6.25 per TB).
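Because billing depends on the bytes a query scans, it can help to estimate the cost before running anything. The following is a minimal sketch in Python, not part of our loading procedures. It assumes the `google-cloud-bigquery` client library is installed, that "1 TB" is billed as 2**40 bytes, and that the price stays at the figure quoted above. `WAREHOUSE_PROJECT` and `YOUR_BILLING_PROJECT` are hypothetical placeholders; this page does not state the project ID.

```python
def estimate_cost_usd(bytes_processed: int, usd_per_tb: float = 6.25) -> float:
    """Convert a byte count into an approximate on-demand query cost in USD."""
    # Assumption: "1 TB" is billed as 2**40 bytes; adjust if pricing changes.
    return bytes_processed / 2**40 * usd_per_tb


def dry_run_bytes(sql: str, billing_project: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    # Imported lazily so estimate_cost_usd works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    # dry_run=True validates the query and reports bytes, but bills nothing.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


# Example usage (placeholders are hypothetical, replace them before running):
#   sql = "SELECT * FROM `WAREHOUSE_PROJECT.cr_instant.snapshot` LIMIT 10"
#   n_bytes = dry_run_bytes(sql, billing_project="YOUR_BILLING_PROJECT")
#   print(f"{n_bytes} bytes, about ${estimate_cost_usd(n_bytes):.2f}")
```

A dry run needs valid credentials for a billing project but is not charged itself, so it is a cheap way to check a query against a large table before committing to it.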

Status Crossref

Current Snapshot (cr_instant)

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2025/05 | all.json.tar.gz | cr_instant.snapshot | schema_crossref.json | Repo | 11.06.2025 | All | 170.078.997

Historical Snapshots (cr_history)

Info: Only includes publications with type ‘journal-article’

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2018/04 | all.json.tar.gz | cr_history.cr_apr18 | schema_crossref.json | Repo | 20.02.2022 | 2013-2018 | 16.766.035
2019/04 | all.json.tar.gz | cr_history.cr_apr19 | schema_crossref.json | Repo | 29.10.2021 | 2013-2019 | 20.715.644
2020/04 | all.json.tar.gz | cr_history.cr_apr20 | schema_crossref.json | Repo | 29.10.2021 | 2013-2020 | 25.334.525
2021/04 | all.json.tar.gz | cr_history.cr_apr21 | schema_crossref.json | Repo | 29.10.2021 | 2013-2021 | 30.579.119
2022/04 | all.json.tar.gz | cr_history.cr_apr22 | schema_crossref.json | Repo | 14.05.2022 | 2013-2022 | 35.939.195
2023/04 | all.json.tar.gz | cr_history.cr_apr23 | schema_crossref.json | Repo | 07.05.2023 | 2013-2023 | 41.767.461
2024/04 | all.json.tar.gz | cr_history.cr_apr24 | schema_crossref.json | Repo | 07.05.2024 | 2013-2024 | 47.709.184
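The historical Crossref tables follow a regular naming pattern (`cr_history.cr_aprYY`), and the row counts are written with dots as thousands separators. The sketch below shows one way to derive a table ID from a snapshot year, parse a published row count, and build a `COUNT(*)` query to compare against it. The function names are illustrative and not part of our procedures, and `warehouse_project` is a hypothetical placeholder for the project ID, which is not stated on this page.

```python
def cr_history_table(snapshot_year: int) -> str:
    """Return the table ID of the April Crossref snapshot for a given year."""
    # The table above lists April snapshots for 2018 through 2024 only.
    if not 2018 <= snapshot_year <= 2024:
        raise ValueError("historical Crossref snapshots listed here cover 2018 to 2024")
    return f"cr_history.cr_apr{snapshot_year % 100:02d}"


def parse_row_count(text: str) -> int:
    """Parse a row count written with dots as thousands separators."""
    return int(text.replace(".", ""))


def count_query(warehouse_project: str, table: str) -> str:
    """Build a query that counts the rows of a table in the warehouse."""
    # warehouse_project is a placeholder supplied by the caller.
    return f"SELECT COUNT(*) AS n FROM `{warehouse_project}.{table}`"


# Rows added between the 2023 and 2024 snapshots, from the published counts.
growth = parse_row_count("47.709.184") - parse_row_count("41.767.461")
```

Comparing the result of `count_query` with the parsed "Number of rows" value is a simple sanity check that you are querying the snapshot you intended.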

Status Unpaywall

Current Snapshot (upw_instant)

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2024/11 | unpaywall_snapshot_2024-11-27T031702.jsonl.gz | upw_instant.snapshot | bq_schema_nov24.json | Repo | 23.06.2025 | 2008-2025 | 94.924.816

Historical Snapshots (upw_history)

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | upw_history.upw_Mar18_08_20 | bq_schema_mar18.json | Repo | 29.10.2021 | 2008-2018 | 36.557.043
2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | upw_history.upw_Feb19_08_19 | bq_schema_feb19.json | Repo | 10.11.2021 | 2008-2019 | 42.143.979
2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | upw_history.upw_Feb20_08_20 | bq_schema_feb20.json | Repo | 30.10.2021 | 2008-2020 | 49.717.710
2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | upw_history.upw_Feb21_08_21 | bq_schema_feb21.json | Repo | 29.10.2021 | 2008-2021 | 58.437.927
2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_history.upw_Mar22_08_22 | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819

Status Semantic Scholar

Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2025-02-25 | papers/ | semantic_scholar.papers | s2_papers_schema.json | Repo | 04.03.2025 | All | 224.566.486
2025-02-25 | venues/ | semantic_scholar.venues | s2_venues_schema.json | Repo | 04.03.2025 | All | 194.578
2025-02-25 | abstracts/ | semantic_scholar.abstracts | s2_abstracts_schema.json | Repo | 04.03.2025 | All | 108.246.108

Status OpenAlex

Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2025-05-29 | authors/ | openalex.authors | schema_openalex_author.json | Repo | 15.06.2025 | All | 103.480.180
2025-05-30 | funders/ | openalex.funders | schema_openalex_funders.json | Repo | 15.06.2025 | All | 32.437
2025-05-30 | institutions/ | openalex.institutions | schema_openalex_institutions.json | Repo | 15.06.2025 | All | 114.883
2025-05-30 | publishers/ | openalex.publishers | schema_openalex_publishers.json | Repo | 15.06.2025 | All | 10.741
2025-05-30 | sources/ | openalex.sources | schema_openalex_sources.json | Repo | 15.06.2025 | All | 260.798
2025-05-26 | topics/ | openalex.topics | schema_openalex_topics.json | Repo | 15.06.2025 | All | 4.516
2025-05-29 | works/ | openalex.works | schema_openalex_work.json | Repo | 19.06.2025 | All | 267.516.817

Status OPENBIB

Snapshot | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2025-05-01 | openbib.publishers | schema_openbib_publishers.json | Repo | 11.04.2025 | 2014-2024 | 373
2025-05-01 | openbib.publishers_relation | schema_openbib_publishers_relation.json | Repo | 11.04.2025 | 2014-2024 | 212
2025-05-01 | openbib.funding_information | schema_openbib_funding_information.json | Repo | 14.04.2025 | 2020-2024 | 9.255
2025-05-01 | openbib.document_types | schema_openbib_document_types.json | Repo | 28.03.2025 | 2014-2024 | 56.063.628
2025-05-01 | openbib.kb_a_addr_inst | schema_openbib_kb_a_addr_inst.json | Repo | 14.04.2025 | All | 9.903.725
2025-05-01 | openbib.kb_s_addr_inst | schema_openbib_kb_s_addr_inst.json | Repo | 14.04.2025 | All | 9.900.278
2025-05-01 | openbib.kb_inst | schema_openbib_kb_inst.json | Repo | 14.04.2025 | All | 2.759
2025-05-01 | openbib.kb_inst_trans | schema_openbib_kb_inst_trans.json | Repo | 28.03.2025 | All | 91
2025-05-01 | openbib.kb_sectors | schema_openbib_kb_sectors.json | Repo | 28.03.2025 | All | 22
2025-05-01 | openbib.jct_articles | schema_openbib_jct_articles.json | Repo | 14.04.2025 | 2018-2025 | 1.996.190
2025-05-01 | openbib.jct_esac | schema_openbib_jct_esac.json | Repo | 14.04.2025 | 2018-2025 | 1.285
2025-05-01 | openbib.jct_institutions | schema_openbib_jct_institutions.json | Repo | 14.04.2025 | 2018-2025 | 28.007
2025-05-01 | openbib.jct_journals | schema_openbib_jct_journals.json | Repo | 14.04.2025 | 2018-2025 | 491.218

Status OpenAlex Document Type classification by SUB Göttingen

Info: Only includes publications with type ‘article’ or ‘review’ and primary source type ‘journal’

Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows
2025-05-29 | works/ | resources.document_classification_may25 | schema_document_types.json | Repo | 15.06.2025 | 2014-2025 | 60.762.537

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. Figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".