Open Scholarly Data @ SUB Göttingen - Overview

We use Google BigQuery to work with large open scholarly datasets. Our main data sources are Unpaywall, Crossref and OpenAlex.

An overview of our data warehouse, including the procedures used to load the data into BigQuery, can be found below.
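The actual loading procedures live in the repository linked in each table's Procedure column. As a rough illustration only, the sketch below assembles a `bq load` call for a newline-delimited JSON snapshot from a table name, a source URI and a schema file. The helper name `build_bq_load_command` and the bucket `gs://example-bucket` are hypothetical; the table, file and schema names are taken from the Unpaywall table on this page.

```python
import shlex


def build_bq_load_command(table, source_uri, schema_file, replace=False):
    """Return the argument list for a `bq load` call on newline-delimited JSON.

    `table` must be given as 'dataset.table', as in the tables on this page.
    """
    if table.count(".") != 1:
        raise ValueError("expected table as 'dataset.table'")
    cmd = ["bq", "load", "--source_format=NEWLINE_DELIMITED_JSON"]
    if replace:
        # Overwrite the destination table instead of appending to it.
        cmd.append("--replace")
    cmd += [table, source_uri, schema_file]
    return cmd


command = build_bq_load_command(
    table="upw_instant.snapshot",
    # Hypothetical bucket; only the file name comes from this page.
    source_uri="gs://example-bucket/unpaywall_snapshot_2022-03-09T083001.jsonl.gz",
    schema_file="bq_schema_mar22.json",
    replace=True,
)

# Print the command for inspection; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building the argument list separately from running it makes it easy to review the command before anything is written to the warehouse. Archives such as the Crossref `all.json.tar.gz` would presumably need to be unpacked and converted first, which this sketch does not cover.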

Status Crossref

Current Snapshot (cr_instant)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2024/03 | all.json.tar.gz | cr_instant.snapshot | schema_crossref.json | Repo | 09.04.2024 | 2013-2024 | 47.185.018 |

Historical Snapshots (cr_history)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2018/04 | all.json.tar.gz | cr_history.cr_apr18 | schema_crossref.json | Repo | 20.02.2022 | 2013-2018 | 16.766.035 |
| 2019/04 | all.json.tar.gz | cr_history.cr_apr19 | schema_crossref.json | Repo | 29.10.2021 | 2013-2019 | 20.715.644 |
| 2020/04 | all.json.tar.gz | cr_history.cr_apr20 | schema_crossref.json | Repo | 29.10.2021 | 2013-2020 | 25.334.525 |
| 2021/04 | all.json.tar.gz | cr_history.cr_apr21 | schema_crossref.json | Repo | 29.10.2021 | 2013-2021 | 30.579.119 |
| 2022/04 | all.json.tar.gz | cr_history.cr_apr22 | schema_crossref.json | Repo | 14.05.2022 | 2013-2022 | 35.939.195 |
| 2023/04 | all.json.tar.gz | cr_history.cr_apr23 | schema_crossref.json | Repo | 07.05.2023 | 2013-2023 | 41.767.461 |
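The row counts in the historical Crossref table use a period as the thousands separator (for example, 16.766.035 means about 16.8 million rows). A minimal sketch of how these values could be parsed and compared between consecutive snapshots follows. The helper name `parse_count` is illustrative, and the differences are net changes in table size, not counts of newly published works, because each snapshot covers a different range of years.

```python
def parse_count(text):
    """Convert a count written with '.' as thousands separator to an int."""
    return int(text.replace(".", ""))


# Snapshot labels and row counts copied from the historical Crossref table.
history = [
    ("2018/04", "16.766.035"),
    ("2019/04", "20.715.644"),
    ("2020/04", "25.334.525"),
    ("2021/04", "30.579.119"),
    ("2022/04", "35.939.195"),
    ("2023/04", "41.767.461"),
]

counts = [(snapshot, parse_count(rows)) for snapshot, rows in history]

# Net difference in rows between each snapshot and the one before it.
growth = [
    (later[0], later[1] - earlier[1])
    for earlier, later in zip(counts, counts[1:])
]

for snapshot, added in growth:
    print(f"{snapshot}: {added:+,} rows compared with the previous snapshot")
```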

Status Unpaywall

Current Snapshot (upw_instant)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_instant.snapshot | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Historical Snapshots (upw_history)

| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | upw_history.upw_Mar18_08_20 | bq_schema_mar18.json | Repo | 29.10.2021 | 2008-2018 | 36.557.043 |
| 2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | upw_history.upw_Feb19_08_19 | bq_schema_feb19.json | Repo | 10.11.2021 | 2008-2019 | 42.143.979 |
| 2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | upw_history.upw_Feb20_08_20 | bq_schema_feb20.json | Repo | 30.10.2021 | 2008-2020 | 49.717.710 |
| 2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | upw_history.upw_Feb21_08_21 | bq_schema_feb21.json | Repo | 29.10.2021 | 2008-2021 | 58.437.927 |
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_history.upw_Mar22_08_22 | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Status OpenAlex

| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---|---|---|---|---|---|---|---|
| 2024-03-27 | authors/ | openalex.authors | schema_openalex_author.json | Repo | 08.04.2024 | All | 90.556.187 |
| 2024-03-28 | funders/ | openalex.funders | schema_openalex_funders.json | Repo | 08.04.2024 | All | 32.437 |
| 2024-03-28 | institutions/ | openalex.institutions | schema_openalex_institutions.json | Repo | 08.04.2024 | All | 107.716 |
| 2024-03-28 | publishers/ | openalex.publishers | schema_openalex_publishers.json | Repo | 08.04.2024 | All | 10.249 |
| 2024-03-28 | sources/ | openalex.sources | schema_openalex_sources.json | Repo | 08.04.2024 | All | 252.375 |
| 2024-03-25 | topics/ | openalex.topics | schema_openalex_topics.json | Repo | 08.04.2024 | All | 4.516 |
| 2024-03-27 | works/ | openalex.works | schema_openalex_work.json | Repo | 08.04.2024 | All | 250.129.733 |

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. Figures reused from other sources are not covered by this license and can be recognized by a note in their caption: "Figure from ...".