We use Google BigQuery to work with large-scale open scholarly data. Our main data sources are Unpaywall, Crossref and OpenAlex.
An overview of our data warehouse, including the procedures used to load the data into BigQuery, can be found below.
Anyone with a Google Cloud account can view and query our publicly available Open Scholarly Data warehouse on BigQuery. Note that Google charges for the number of bytes processed by each query (currently $6.25 per TB).
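Because billing depends on the bytes a query processes, it can help to estimate the cost before running anything. The following is a minimal sketch rather than an official calculator. The function name `estimate_query_cost` is hypothetical, and the sketch assumes that the quoted rate of $6.25 is billed per tebibyte (2^40 bytes). Free quotas and minimum charges are ignored.

```python
TIB = 2 ** 40  # bytes per tebibyte (assumption: the rate is billed per TiB)


def estimate_query_cost(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Return the approximate on-demand cost in USD for one query."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / TIB * usd_per_tib


if __name__ == "__main__":
    # Example: a query that scans 200 GiB
    scanned = 200 * 2 ** 30
    print(f"{estimate_query_cost(scanned):.2f} USD")
```

The number of bytes can be obtained without incurring charges through a dry run, for example with the `google-cloud-bigquery` Python client by passing `QueryJobConfig(dry_run=True)` and reading `total_bytes_processed` from the returned job.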

Crossref (dataset cr_instant):

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2024/10 | all.json.tar.gz | cr_instant.snapshot | schema_crossref.json | Repo | 08.11.2024 | 2013-2024 | 50.954.931 |

Crossref, historical snapshots (dataset cr_history):

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2018/04 | all.json.tar.gz | cr_history.cr_apr18 | schema_crossref.json | Repo | 20.02.2022 | 2013-2018 | 16.766.035 |
2019/04 | all.json.tar.gz | cr_history.cr_apr19 | schema_crossref.json | Repo | 29.10.2021 | 2013-2019 | 20.715.644 |
2020/04 | all.json.tar.gz | cr_history.cr_apr20 | schema_crossref.json | Repo | 29.10.2021 | 2013-2020 | 25.334.525 |
2021/04 | all.json.tar.gz | cr_history.cr_apr21 | schema_crossref.json | Repo | 29.10.2021 | 2013-2021 | 30.579.119 |
2022/04 | all.json.tar.gz | cr_history.cr_apr22 | schema_crossref.json | Repo | 14.05.2022 | 2013-2022 | 35.939.195 |
2023/04 | all.json.tar.gz | cr_history.cr_apr23 | schema_crossref.json | Repo | 07.05.2023 | 2013-2023 | 41.767.461 |
2024/04 | all.json.tar.gz | cr_history.cr_apr24 | schema_crossref.json | Repo | 07.05.2024 | 2013-2024 | 47.709.184 |
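
The row counts in these tables use a period as the thousands separator, and the "Last Changed" dates are written day.month.year. The sketch below shows one way to turn both into native Python values, for instance to compare snapshots. The helper names `parse_row_count` and `parse_changed_date` are hypothetical and not part of the warehouse tooling.

```python
from datetime import date, datetime


def parse_row_count(text: str) -> int:
    """Convert a count such as '50.954.931' to an integer."""
    return int(text.strip().replace(".", ""))


def parse_changed_date(text: str) -> date:
    """Convert a 'Last Changed' value such as '08.11.2024' (DD.MM.YYYY)."""
    return datetime.strptime(text.strip(), "%d.%m.%Y").date()


if __name__ == "__main__":
    # Values copied from the Crossref rows above
    first = parse_row_count("16.766.035")  # snapshot 2018/04
    last = parse_row_count("47.709.184")   # snapshot 2024/04
    print(f"Rows added between snapshots: {last - first}")
    print(parse_changed_date("07.05.2024").isoformat())
```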

Unpaywall (dataset upw_instant):

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_instant.snapshot | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Unpaywall, historical snapshots (dataset upw_history):

Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | upw_history.upw_Mar18_08_20 | bq_schema_mar18.json | Repo | 29.10.2021 | 2008-2018 | 36.557.043 |
2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | upw_history.upw_Feb19_08_19 | bq_schema_feb19.json | Repo | 10.11.2021 | 2008-2019 | 42.143.979 |
2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | upw_history.upw_Feb20_08_20 | bq_schema_feb20.json | Repo | 30.10.2021 | 2008-2020 | 49.717.710 |
2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | upw_history.upw_Feb21_08_21 | bq_schema_feb21.json | Repo | 29.10.2021 | 2008-2021 | 58.437.927 |
2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | upw_history.upw_Mar22_08_22 | bq_schema_mar22.json | Repo | 14.03.2022 | 2008-2022 | 67.424.819 |

Semantic Scholar (dataset semantic_scholar):

Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2024-05-28 | papers/ | semantic_scholar.papers | / | Repo | 10.06.2024 | All | 218.668.220 |
2024-05-28 | venues/ | semantic_scholar.venues | / | Repo | 10.06.2024 | All | 194.578 |

OpenAlex (dataset openalex):

Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2024-10-30 | authors/ | openalex.authors | schema_openalex_author.json | Repo | 06.11.2024 | All | 101.053.289 |
2024-10-31 | funders/ | openalex.funders | schema_openalex_funders.json | Repo | 06.11.2024 | All | 32.437 |
2024-10-31 | institutions/ | openalex.institutions | schema_openalex_institutions.json | Repo | 06.11.2024 | All | 109.815 |
2024-10-31 | publishers/ | openalex.publishers | schema_openalex_publishers.json | Repo | 06.11.2024 | All | 10.250 |
2024-10-31 | sources/ | openalex.sources | schema_openalex_sources.json | Repo | 06.11.2024 | All | 254.533 |
2024-10-28 | topics/ | openalex.topics | schema_openalex_topics.json | Repo | 06.11.2024 | All | 4.516 |
2024-10-30 | works/ | openalex.works | schema_openalex_work.json | Repo | 06.11.2024 | All | 260.574.437 |

Additional resources (dataset resources):

Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
---|---|---|---|---|---|---|---|
2024-10-30 | works/ | resources.classification_article_reviews_october_2024 | schema_document_types.json | Repo | 06.11.2024 | All | 151.719.141 |
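
The "Table" column lists each table as dataset.table. To query one, BigQuery also needs the Google Cloud project that hosts the dataset. The sketch below builds a fully qualified, backtick-quoted table reference and a simple row-count query. The project ID is a placeholder because it is not stated in this overview, and the function names and the conservative identifier check are illustrative assumptions.

```python
import re

# Hypothetical placeholder: the Google Cloud project ID that hosts the
# warehouse is not stated in this overview.
PROJECT_ID = "your-project-id"

# Conservative check (assumption): letters, digits and underscores only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_table(table: str, project: str = PROJECT_ID) -> str:
    """Turn 'dataset.table' from the Table column into `project.dataset.table`."""
    parts = table.split(".")
    if len(parts) != 2 or not all(_IDENTIFIER.match(p) for p in parts):
        raise ValueError(f"expected 'dataset.table', got {table!r}")
    dataset, name = parts
    return f"`{project}.{dataset}.{name}`"


def count_rows_sql(table: str, project: str = PROJECT_ID) -> str:
    """Build a query that counts the rows of one table."""
    return f"SELECT COUNT(*) AS n_rows FROM {qualified_table(table, project)}"


if __name__ == "__main__":
    print(count_rows_sql("openalex.works"))
```

The resulting string can be pasted into the BigQuery console or passed to a client library. Comparing its result with the "Number of rows" column is a quick way to confirm that the expected snapshot is being queried.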
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".