Open Scholarly Data @ SUB Göttingen - Overview

We use Google BigQuery to work with large open scholarly data. Our main data sources are Unpaywall, Crossref and OpenAlex.

An overview of our data warehouse including procedures to load the data into BigQuery can be found below.

Anyone with a Google Cloud account can view and query our publicly available Open Scholarly Data warehouse on BigQuery. Note that Google charges for the number of bytes processed by each query (currently $6.25 per TiB under on-demand pricing).
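Because billing depends on bytes processed, it can help to estimate the cost of a query before running it. The following is a minimal sketch, not part of our tooling. It assumes the quoted rate of $6.25 applies per TiB (2^40 bytes), and it assumes you have the `google-cloud-bigquery` Python client installed and configured with your own credentials. The function names `estimate_cost_usd` and `dry_run_bytes` are illustrative only.

```python
TIB = 2**40  # bytes in one tebibyte (assumed billing unit)


def estimate_cost_usd(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Convert a number of processed bytes into an approximate on-demand cost.

    The default rate mirrors the price quoted above; adjust it if pricing changes.
    """
    return bytes_processed / TIB * usd_per_tib


def dry_run_bytes(sql: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it.

    Requires the google-cloud-bigquery package and valid credentials.
    """
    # Imported lazily so estimate_cost_usd works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

A typical use would be `estimate_cost_usd(dry_run_bytes(sql))`, where `sql` is the query you intend to run. A dry run itself is not billed, so the estimate can be checked as often as needed.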

Status Crossref

Current Snapshot (cr_instant)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024/10 | all.json.tar.gz | cr_instant.snapshot | schema_crossref.json | Repo | 08.11.2024 | 2013-2024 | 50.954.931 |

Historical Snapshots (cr_history)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2018/04 | all.json.tar.gz | cr_history.cr_apr18 | schema_crossref.json | Repo | 20.02.2022 | 2013-2018 | 16.766.035 |
| 2019/04 | all.json.tar.gz | cr_history.cr_apr19 | schema_crossref.json | Repo | 29.10.2021 | 2013-2019 | 20.715.644 |
| 2020/04 | all.json.tar.gz | cr_history.cr_apr20 | schema_crossref.json | Repo | 29.10.2021 | 2013-2020 | 25.334.525 |
| 2021/04 | all.json.tar.gz | cr_history.cr_apr21 | schema_crossref.json | Repo | 29.10.2021 | 2013-2021 | 30.579.119 |
| 2022/04 | all.json.tar.gz | cr_history.cr_apr22 | schema_crossref.json | Repo | 14.05.2022 | 2013-2022 | 35.939.195 |
| 2023/04 | all.json.tar.gz | cr_history.cr_apr23 | schema_crossref.json | Repo | 07.05.2023 | 2013-2023 | 41.767.461 |
| 2024/04 | all.json.tar.gz | cr_history.cr_apr24 | schema_crossref.json | Repo | 07.05.2024 | 2013-2024 | 47.709.184 |
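To check the row counts of the historical Crossref snapshots yourself, one option is a single query that counts rows per table. The sketch below only builds the SQL text. The table names are taken from the list above, while `HOSTING_PROJECT_ID` is a placeholder: the ID of the Google Cloud project hosting the warehouse is not stated on this page, so you have to substitute it. The helper name `build_row_count_query` is hypothetical.

```python
# Historical Crossref snapshot tables, as listed above.
CR_HISTORY_TABLES = [
    "cr_history.cr_apr18",
    "cr_history.cr_apr19",
    "cr_history.cr_apr20",
    "cr_history.cr_apr21",
    "cr_history.cr_apr22",
    "cr_history.cr_apr23",
    "cr_history.cr_apr24",
]


def build_row_count_query(project_id: str, tables: list[str]) -> str:
    """Return a BigQuery standard SQL statement that counts rows per table.

    project_id is the ID of the project hosting the datasets; it is a
    placeholder here and must be supplied by the caller.
    """
    parts = [
        f"SELECT '{table}' AS snapshot_table, COUNT(*) AS n_rows "
        f"FROM `{project_id}.{table}`"
        for table in tables
    ]
    # One SELECT per snapshot table, combined into a single result set.
    return "\nUNION ALL\n".join(parts)


sql = build_row_count_query("HOSTING_PROJECT_ID", CR_HISTORY_TABLES)
print(sql)
```

The resulting statement can be pasted into the BigQuery console or passed to a client library. It uses `COUNT(*)` only, so it does not depend on any column names from the schema.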

Status Unpaywall

Current Snapshot (upw_instant)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_instant.snapshot | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Historical Snapshots (upw_history)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | upw_history.upw_Mar18_08_20 | bq_schema_mar18.json | Repo | 29.10.2021 | 2008-2018 | 36.557.043 |
| 2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | upw_history.upw_Feb19_08_19 | bq_schema_feb19.json | Repo | 10.11.2021 | 2008-2019 | 42.143.979 |
| 2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | upw_history.upw_Feb20_08_20 | bq_schema_feb20.json | Repo | 30.10.2021 | 2008-2020 | 49.717.710 |
| 2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | upw_history.upw_Feb21_08_21 | bq_schema_feb21.json | Repo | 29.10.2021 | 2008-2021 | 58.437.927 |
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_history.upw_Mar22_08_22 | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Status Semantic Scholar

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-05-28 | papers/ | semantic_scholar.papers | / | Repo | 10.06.2024 | All | 218.668.220 |
| 2024-05-28 | venues/ | semantic_scholar.venues | / | Repo | 10.06.2024 | All | 194.578 |

Status OpenAlex

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-10-30 | authors/ | openalex.authors | schema_openalex_author.json | Repo | 06.11.2024 | All | 101.053.289 |
| 2024-10-31 | funders/ | openalex.funders | schema_openalex_funders.json | Repo | 06.11.2024 | All | 32.437 |
| 2024-10-31 | institutions/ | openalex.institutions | schema_openalex_institutions.json | Repo | 06.11.2024 | All | 109.815 |
| 2024-10-31 | publishers/ | openalex.publishers | schema_openalex_publishers.json | Repo | 06.11.2024 | All | 10.250 |
| 2024-10-31 | sources/ | openalex.sources | schema_openalex_sources.json | Repo | 06.11.2024 | All | 254.533 |
| 2024-10-28 | topics/ | openalex.topics | schema_openalex_topics.json | Repo | 06.11.2024 | All | 4.516 |
| 2024-10-30 | works/ | openalex.works | schema_openalex_work.json | Repo | 06.11.2024 | All | 260.574.437 |

Status OpenAlex Document Type classification by SUB Göttingen

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-10-30 | works/ | resources.classification_article_reviews_october_2024 | schema_document_types.json | Repo | 06.11.2024 | All | 151.719.141 |

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".