Open Scholarly Data @ SUB Göttingen - Overview

We use Google BigQuery to work with large open scholarly datasets. Our main data sources are Unpaywall, Crossref and OpenAlex.

An overview of our data warehouse, including the procedures used to load the data into BigQuery, can be found below.

Anyone with a Google Cloud account can view and query our publicly available Open Scholarly Data warehouse on BigQuery. Note that Google charges you for the number of bytes each query processes (currently $6.25 per TB).
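
Because billing depends on bytes processed, it can help to estimate a query's cost before running it. The following is a minimal sketch, not part of our loading procedures. The function names are hypothetical, the rate is the one quoted above and may change, and treating 1 TB as 2**40 bytes is an assumption. The `dry_run_bytes` helper assumes the `google-cloud-bigquery` package is installed and credentials are configured; a dry run reports the bytes a query would scan without executing it.

```python
# Assumption: rate taken from the text above; Google may change it at any time.
USD_PER_TB = 6.25
# Assumption: "1 TB" is billed as 2**40 bytes.
BYTES_PER_TB = 2**40


def estimate_query_cost_usd(bytes_processed, usd_per_tb=USD_PER_TB,
                            bytes_per_tb=BYTES_PER_TB):
    """Return the approximate on-demand cost in USD for a given byte count."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must not be negative")
    return bytes_processed / bytes_per_tb * usd_per_tb


def dry_run_bytes(sql):
    """Ask BigQuery how many bytes a query would process, without running it.

    Requires the google-cloud-bigquery package and valid credentials,
    so the import is kept local to this function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

Under these assumptions, a query that scans 512 GiB gives `estimate_query_cost_usd(512 * 2**30)`, which evaluates to 3.125 USD. Selecting only the columns you need reduces the bytes processed, and with them the cost.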

Status Crossref

Current Snapshot (cr_instant)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2025/01 | all.json.tar.gz | cr_instant.snapshot | schema_crossref.json | Repo | 11.02.2025 | 2013-2025 | 52.717.946 |

Historical Snapshots (cr_history)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2018/04 | all.json.tar.gz | cr_history.cr_apr18 | schema_crossref.json | Repo | 20.02.2022 | 2013-2018 | 16.766.035 |
| 2019/04 | all.json.tar.gz | cr_history.cr_apr19 | schema_crossref.json | Repo | 29.10.2021 | 2013-2019 | 20.715.644 |
| 2020/04 | all.json.tar.gz | cr_history.cr_apr20 | schema_crossref.json | Repo | 29.10.2021 | 2013-2020 | 25.334.525 |
| 2021/04 | all.json.tar.gz | cr_history.cr_apr21 | schema_crossref.json | Repo | 29.10.2021 | 2013-2021 | 30.579.119 |
| 2022/04 | all.json.tar.gz | cr_history.cr_apr22 | schema_crossref.json | Repo | 14.05.2022 | 2013-2022 | 35.939.195 |
| 2023/04 | all.json.tar.gz | cr_history.cr_apr23 | schema_crossref.json | Repo | 07.05.2023 | 2013-2023 | 41.767.461 |
| 2024/04 | all.json.tar.gz | cr_history.cr_apr24 | schema_crossref.json | Repo | 07.05.2024 | 2013-2024 | 47.709.184 |
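
The row counts on this page use dots as thousands separators, so `16.766.035` means roughly 16.8 million rows, not a decimal number. As a small illustration, the sketch below parses the historical Crossref counts listed above and computes how many rows each snapshot gained over the previous one. The function and variable names are hypothetical; the figures are copied from the table.

```python
def parse_row_count(text):
    """Convert a count such as '16.766.035' (dots as thousands separators) to an int."""
    return int(text.replace(".", ""))


# Snapshot label -> row count, copied from the historical Crossref table.
CROSSREF_HISTORY = {
    "2018/04": "16.766.035",
    "2019/04": "20.715.644",
    "2020/04": "25.334.525",
    "2021/04": "30.579.119",
    "2022/04": "35.939.195",
    "2023/04": "41.767.461",
    "2024/04": "47.709.184",
}


def snapshot_growth(counts):
    """Return (snapshot, rows, increase over previous snapshot) for each snapshot after the first."""
    labels = list(counts)
    rows = [parse_row_count(counts[label]) for label in labels]
    return [
        (labels[i], rows[i], rows[i] - rows[i - 1])
        for i in range(1, len(labels))
    ]


if __name__ == "__main__":
    for snapshot, rows, increase in snapshot_growth(CROSSREF_HISTORY):
        print(f"{snapshot}: {rows:,} rows (+{increase:,})")
```

Note that each snapshot covers a longer publication period than the one before it (for example 2013-2018 versus 2013-2019), so the increase reflects both the added year and records newly registered for earlier years.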

Status Unpaywall

Current Snapshot (upw_instant)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_instant.snapshot | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Historical Snapshots (upw_history)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | upw_history.upw_Mar18_08_20 | bq_schema_mar18.json | Repo | 29.10.2021 | 2008-2018 | 36.557.043 |
| 2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | upw_history.upw_Feb19_08_19 | bq_schema_feb19.json | Repo | 10.11.2021 | 2008-2019 | 42.143.979 |
| 2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | upw_history.upw_Feb20_08_20 | bq_schema_feb20.json | Repo | 30.10.2021 | 2008-2020 | 49.717.710 |
| 2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | upw_history.upw_Feb21_08_21 | bq_schema_feb21.json | Repo | 29.10.2021 | 2008-2021 | 58.437.927 |
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_history.upw_Mar22_08_22 | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Status Semantic Scholar

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2024-05-28 | papers/ | semantic_scholar.papers | / | Repo | 10.06.2024 | All | 218.668.220 |
| 2024-05-28 | venues/ | semantic_scholar.venues | / | Repo | 10.06.2024 | All | 194.578 |

Status OpenAlex

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2024-12-31 | authors/ | openalex.authors | schema_openalex_author.json | Repo | 07.01.2025 | All | 101.693.809 |
| 2025-01-01 | funders/ | openalex.funders | schema_openalex_funders.json | Repo | 07.01.2025 | All | 32.437 |
| 2025-01-01 | institutions/ | openalex.institutions | schema_openalex_institutions.json | Repo | 07.01.2025 | All | 110.553 |
| 2025-01-01 | publishers/ | openalex.publishers | schema_openalex_publishers.json | Repo | 07.01.2025 | All | 10.741 |
| 2025-01-01 | sources/ | openalex.sources | schema_openalex_sources.json | Repo | 07.01.2025 | All | 260.811 |
| 2024-12-30 | topics/ | openalex.topics | schema_openalex_topics.json | Repo | 07.01.2025 | All | 4.516 |
| 2024-12-31 | works/ | | schema_openalex_work.json | Repo | 07.01.2025 | All | 262.630.159 |

Status OpenAlex Document Type classification by SUB Göttingen

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2024-12-31 | works/ | resources.classification_article_reviews_december24 | schema_document_types.json | Repo | 10.01.2025 | 2014-2024 | 58.240.262 |


If you see mistakes or want to suggest changes, please create an issue on the source repository.


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".