WAG runs several big data pipelines that feed various data products.
Though the pipelines themselves largely do not run in R, they are organised here into an R package.
Design
WAG is a relatively small team of data analysts, serving academic and librarian stakeholders with various data products.
The data engineering of our pipelines must respect these constraints:
- Our most important and scarcest resource is developer time.
- Our most important and hardest target is reproducibility.
Priorities
From these constraints, it follows that:
- Cheap compute is good, but convenience is better. Our workloads are comparatively small; labor costs are a much bigger driver.
- Special-purpose tools are good, but standardising on fewer tools is better. Given our small and sometimes churning team, we can support only a few tools.
- Working prototypes are good, but reproducibility is better. For our academic (as well as librarian) stakeholders, reproducibility trumps all else.
- Interactive, one-off results are good, but automation, testing and documentation are better. Given churn (and vacation, context switching, etc.), we must avoid low bus factors. Data pipelines, especially, must be designed so that they can be run and maintained without the original developer.
ELT
Our data pipelines follow an extract-load-transform (ELT) paradigm. They are centered on a “data river” (or data lake) hosted on the Google Cloud Platform (GCP).
- Data river
  - Data is extracted from sources in its rawest form into GCP Cloud Storage (for long-term, versioned coldline storage); see the first sketch after this list.
  - Data is then loaded into GCP BigQuery. If the source data is schemaless or noncompliant, it is loaded without a schema, with entire unparsed entries as cells (second sketch below).
- Data warehouse
  - Data is then transformed into a canonical form with a well-defined schema in GCP BigQuery (third sketch below).
- Data mart
  - Data is then further transformed on GCP BigQuery according to the shared needs of WAG's data products (fourth sketch below).
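As a minimal sketch of the extract step, using googleCloudStorageR (one R client for Cloud Storage); the source URL, bucket name and object path are hypothetical placeholders, and versioning and the coldline storage class are assumed to be configured on the bucket itself:

```r
library(googleCloudStorageR)

# Hypothetical source and bucket; substitute your own.
source_url <- "https://example.org/source/records.json"
raw_file <- tempfile(fileext = ".json")

# Extract: pull the source in its rawest form, with no parsing.
download.file(source_url, destfile = raw_file, mode = "wb")

# Load the raw object into the data river bucket. Versioning and
# the coldline storage class are bucket-level settings, not set here.
gcs_upload(
  file = raw_file,
  bucket = "wag-data-river",                 # hypothetical bucket name
  name = "records/2024-01-01/records.json"   # hypothetical object path
)
```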
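One way to load schemaless or noncompliant data so that each unparsed entry lands in a single cell is to declare a one-column STRING schema and a field delimiter that cannot occur in the data. A sketch with bigrquery, assuming hypothetical project, dataset and table names, and assuming the CSV load options below suit the source:

```r
library(bigrquery)

# Hypothetical project, dataset and table names.
raw_table <- bq_table("wag-project", "river", "records_raw")

# Load each line of the object as one unparsed STRING cell:
# a single-column schema plus a control-character delimiter that
# is assumed never to appear in the source.
bq_table_load(
  raw_table,
  source_uris = "gs://wag-data-river/records/2024-01-01/records.json",
  source_format = "CSV",
  fields = bq_fields(list(bq_field("raw", "STRING"))),
  field_delimiter = "\u0001",  # assumption: never occurs in the data
  quote = ""                   # disable CSV quoting
)
```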
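The transform into the warehouse then happens entirely inside BigQuery. A sketch, assuming the raw cells hold JSON and using hypothetical dataset, table and field names:

```r
library(bigrquery)

# Parse the raw cells into a canonical, well-typed warehouse table.
# Dataset, table and field names here are hypothetical.
bq_project_query(
  "wag-project",
  query = "
    CREATE OR REPLACE TABLE warehouse.records AS
    SELECT
      JSON_VALUE(raw, '$.id')                       AS id,
      JSON_VALUE(raw, '$.title')                    AS title,
      SAFE_CAST(JSON_VALUE(raw, '$.year') AS INT64) AS year
    FROM river.records_raw
  "
)
```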
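Marts are derived from the warehouse in the same way; one more sketch with hypothetical names, aggregating for downstream data products:

```r
library(bigrquery)

# Derive a mart table serving shared needs of several data products.
# All names are hypothetical.
bq_project_query(
  "wag-project",
  query = "
    CREATE OR REPLACE TABLE mart.records_per_year AS
    SELECT year, COUNT(*) AS n_records
    FROM warehouse.records
    GROUP BY year
    ORDER BY year
  "
)
```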