We use Google BigQuery to work with large open scholarly data. Our main data sources are Unpaywall, Crossref and OpenAlex.
An overview of our data warehouse, including the procedures used to load the data into BigQuery, can be found below.
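As a rough illustration of what a load step can look like, the following sketch assembles a `bq load` call for a newline-delimited JSON snapshot. It is not the procedure from the linked repositories. The table, file and schema names are taken from the Unpaywall status below; the `gs://example-bucket/` location and the helper `build_bq_load_command` are hypothetical, and the sketch assumes the snapshot file already conforms to the schema file.

```python
import shlex
from typing import List


def build_bq_load_command(table: str, source_uri: str, schema_file: str) -> List[str]:
    """Return the argument list for loading newline-delimited JSON with the bq CLI."""
    return [
        "bq",
        "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        table,        # destination, written as dataset.table
        source_uri,   # local path or gs:// URI of the snapshot file
        schema_file,  # JSON file describing the table schema
    ]


# The bucket name is a placeholder (assumption), not a real location.
command = build_bq_load_command(
    table="upw_instant.snapshot",
    source_uri="gs://example-bucket/unpaywall_snapshot_2022-03-09T083001.jsonl.gz",
    schema_file="bq_schema_mar22.json",
)

# Print the command for review instead of executing it.
print(shlex.join(command))
```

Building the argument list separately from running it keeps the command easy to inspect before anything is written to the warehouse.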
Status Crossref
Current Snapshot (cr_instant)
2024/03 | all.json.tar.gz | cr_instant.snapshot | schema_crossref.json | Repo | 09.04.2024 | 2013-2024 | 47.185.018 |
Historical Snapshots (cr_history)
2018/04 | all.json.tar.gz | cr_history.cr_apr18 | schema_crossref.json | Repo | 20.02.2022 | 2013-2018 | 16.766.035 |
2019/04 | all.json.tar.gz | cr_history.cr_apr19 | schema_crossref.json | Repo | 29.10.2021 | 2013-2019 | 20.715.644 |
2020/04 | all.json.tar.gz | cr_history.cr_apr20 | schema_crossref.json | Repo | 29.10.2021 | 2013-2020 | 25.334.525 |
2021/04 | all.json.tar.gz | cr_history.cr_apr21 | schema_crossref.json | Repo | 29.10.2021 | 2013-2021 | 30.579.119 |
2022/04 | all.json.tar.gz | cr_history.cr_apr22 | schema_crossref.json | Repo | 14.05.2022 | 2013-2022 | 35.939.195 |
2023/04 | all.json.tar.gz | cr_history.cr_apr23 | schema_crossref.json | Repo | 07.05.2023 | 2013-2023 | 41.767.461 |
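The record counts in these listings use a dot as the thousands separator. The short sketch below, with a hypothetical helper `parse_count`, converts the Crossref historical counts listed above to integers and computes the difference between consecutive snapshots. Because each snapshot also covers one more year, the differences should be read as a rough indication of growth only.

```python
def parse_count(text: str) -> int:
    """Convert a dot-grouped count such as '47.185.018' to an integer."""
    return int(text.replace(".", ""))


# Table names and record counts as listed for the historical Crossref snapshots.
crossref_history = [
    ("cr_apr18", "16.766.035"),
    ("cr_apr19", "20.715.644"),
    ("cr_apr20", "25.334.525"),
    ("cr_apr21", "30.579.119"),
    ("cr_apr22", "35.939.195"),
    ("cr_apr23", "41.767.461"),
]

counts = [(name, parse_count(raw)) for name, raw in crossref_history]

# Difference in record count between each snapshot and its predecessor.
growth = [
    (later[0], later[1] - earlier[1])
    for earlier, later in zip(counts, counts[1:])
]

for name, added in growth:
    print(f"{name}: +{added:,} records")
```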
Status Unpaywall
Current Snapshot (upw_instant)
2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_instant.snapshot | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |
Historical Snapshots (upw_history)
2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | upw_history.upw_Mar18_08_20 | bq_schema_mar18.json | Repo | 29.10.2021 | 2008-2018 | 36.557.043 |
2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | upw_history.upw_Feb19_08_19 | bq_schema_feb19.json | Repo | 10.11.2021 | 2008-2019 | 42.143.979 |
2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | upw_history.upw_Feb20_08_20 | bq_schema_feb20.json | Repo | 30.10.2021 | 2008-2020 | 49.717.710 |
2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | upw_history.upw_Feb21_08_21 | bq_schema_feb21.json | Repo | 29.10.2021 | 2008-2021 | 58.437.927 |
2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_history.upw_Mar22_08_22 | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |
Status OpenAlex
2024-03-27 | authors/ | openalex.authors | schema_openalex_author.json | Repo | 08.04.2024 | All | 90.556.187 |
2024-03-28 | funders/ | openalex.funders | schema_openalex_funders.json | Repo | 08.04.2024 | All | 32.437 |
2024-03-28 | institutions/ | openalex.institutions | schema_openalex_institutions.json | Repo | 08.04.2024 | All | 107.716 |
2024-03-28 | publishers/ | openalex.publishers | schema_openalex_publishers.json | Repo | 08.04.2024 | All | 10.249 |
2024-03-28 | sources/ | openalex.sources | schema_openalex_sources.json | Repo | 08.04.2024 | All | 252.375 |
2024-03-25 | topics/ | openalex.topics | schema_openalex_topics.json | Repo | 08.04.2024 | All | 4.516 |
2024-03-27 | works/ | openalex.works | schema_openalex_work.json | Repo | 08.04.2024 | All | 250.129.733 |
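One way to check a loaded table against the record counts listed above is a plain `COUNT(*)` query. The sketch below only builds the query strings for three of the smaller OpenAlex tables. The helper `count_query` is hypothetical, and it assumes the tables are addressed as `dataset.table` within a default project.

```python
import re

# Accept only simple dataset.table names so nothing unexpected ends up in the SQL.
_TABLE_NAME = re.compile(r"^[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def count_query(table: str) -> str:
    """Return a BigQuery SQL statement that counts the rows of one table."""
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT COUNT(*) AS n FROM `{table}`"


# Expected record counts as listed in the OpenAlex status.
expected = {
    "openalex.topics": 4516,
    "openalex.publishers": 10249,
    "openalex.funders": 32437,
}

for table, n in expected.items():
    print(f"{count_query(table)}  -- expected {n}")
```

Running each statement in BigQuery and comparing `n` with the listed value would show whether a load completed as expected.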
Corrections
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".