Mining and analysing invoice data from Elsevier relative to hybrid open access

Publishers rarely make publication fee spending for hybrid journals transparent. Elsevier is a remarkable exception, as the publisher provides open and machine-readable data relative to its central invoicing with funding bodies and fee waivers at the article level. This blogpost illustrates how to mine Elsevier full-texts for these data with the data science tool R and presents new insights by analysing the resulting dataset: of 70,657 articles published open access in 1,753 hybrid journals from 2015 to date, around one third of the publication fees were paid through central agreements. Nevertheless, the majority of funding sources for hybrid open access remains unclear.

Najko Jahn https://twitter.com/najkoja (State and University Library Göttingen) https://www.sub.uni-goettingen.de/
Nov 25, 2019

Introduction and background

In September 2018, cOAlition S, a group of international research funders, announced its widely discussed Plan S. According to its principles, research funding organisations aligned in cOAlition S will cover open access publication fees, also known as article-processing charges (APC), but they have expressed the intent to suspend financial support for such fees in hybrid journals. The exception is publishing within the controlled setting of transformative agreements. These are institutional or consortial agreements that repurpose subscription expenditures for open access publishing in order to drive the transition of subscription-based journal publishing to fully open access. Research performing organisations have initiated transformative agreements in recent years as a strategy to rein in uncontrolled and unmonitored spending on publication fees in hybrid journals and to accelerate the open access transition.

With the aforementioned financial restrictions in place from 2021, cOAlition S also intends to monitor compliance with the Plan S principles. To date, however, the monitoring of spending on open access publishing in hybrid journals has been limited by a lack of data on these financial transactions. Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that many authors already do not pay publication fees themselves, keeping track of these funding streams is challenging because publishers rarely share invoice data (Björk 2017). Nor do all funders and research organisations track payment information or make it available to fill this gap, despite examples like the British Charity Open Access Fund or the Open APC Initiative (Jahn and Tullney 2016).

At the SUB Göttingen, we will address this lack of transparency in a new project funded by the Deutsche Forschungsgemeinschaft (DFG) in the context of its programme “Open Access Transition Agreements” (Holzer 2017). Building on our pilot project, the interactive Shiny app Hybrid OA Journal Monitor, this project will investigate the data needs of German library consortia and how they can be addressed through metadata requirements in transformative agreements. Case studies and data products will monitor levels of compliance with policy recommendations. Here, invoice data will be essential to make the various funding streams of open access publishing in hybrid journals visible.

Against this background, this blogpost presents a dataset comprising publicly available invoice data relative to open access articles in hybrid journals published by Elsevier, a major publisher of scholarly journals. This dataset brings together metadata from Crossref and information retrieved from open access full-texts. Using open data and tools, the methods applied to obtain the data address the challenges of discovering open access articles in hybrid journals (Laakso and Björk 2016), including related funding and affiliation information. I will argue that Elsevier’s approach of sharing invoice recipients serves as an example of good business practice for other publishers offering hybrid open access options and central open access agreements. It is, thus, relevant for standardisation efforts like the “ESAC Workflow Recommendations for Transformative Agreements” (Geschuhn and Stone 2017).

To demonstrate the potential of publisher-provided data to enable monitoring of Plan S compliance, transformative agreements and the transition of subscription journals to open access, the dataset will be used to analyse the number and the proportion of open access articles in Elsevier hybrid journals. Drawing on Elsevier’s funding information, I will also investigate whether Elsevier sent invoices to authors or to funders and research organisations that, presumably, have either a central payment agreement or a transformative agreement with Elsevier, or whether the fees were waived. Moreover, text-mined author email domains will provide a rough approximation of the affiliation of the first corresponding author, an important data point for delineating open access funding: it is now standard practice for the first or submitting corresponding author, or her institution, to take on responsibility for payment of the related open access publishing fees (Geschuhn and Stone 2017). Finally, the publisher-provided invoice data will be compared with crowd-sourced spending data from the Open APC Initiative.

To allow for a data-driven discussion about Elsevier’s approach and its potential for monitoring Plan S compliance and transformative agreements, I made the resulting dataset openly available on GitHub along with the source code used to obtain the data.

Methods

To start, I used the Elsevier publication fee price list, an openly available PDF document, to determine the current hybrid open access journals in Elsevier’s journal portfolio. The rOpenSci tabulizer package (Leeper 2018) allowed me to extract data about these journals from this file.

Then, I interfaced the Crossref REST API with the R package rcrossref (Chamberlain et al. 2019). The first API call retrieved facet field counts for license URLs and the yearly article volumes for the period 2015-19 for every journal. After matching Creative Commons license URLs indicating open access articles, a second API call retrieved article-level metadata per journal. Next, I used the metadata field delay-in-days to exclude open access articles published after an embargo period (“delayed open access”). Because a few records had different date formats, which were used for the delay calculation by Crossref, I allowed for a lag of 31 days.
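The two Crossref steps can also be sketched outside of R. The following Python snippet is not the rcrossref code used for this post; it is a minimal illustration, assuming that license facet counts are requested per ISSN and that a license record carries the Crossref fields `URL` and `delay-in-days`. The helper names `facet_query_url` and `is_immediate_cc` are hypothetical.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"
MAX_DELAY_DAYS = 31  # lag tolerated because of inconsistent date formats


def facet_query_url(issn, from_year=2015, until_year=2019):
    """Build a works query that returns license facet counts for one journal."""
    params = {
        "filter": f"issn:{issn},from-pub-date:{from_year},until-pub-date:{until_year}",
        "facet": "license:*",
        "rows": 0,  # only the facet counts are needed, no article records
    }
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def is_immediate_cc(license_record, max_delay=MAX_DELAY_DAYS):
    """True for a Creative Commons license that applies without an embargo."""
    url = license_record.get("URL", "")
    delay = license_record.get("delay-in-days", 0)
    return "creativecommons.org/licenses" in url and delay <= max_delay
```

Articles whose Creative Commons license only starts after more than 31 days would be treated as delayed open access and dropped by such a filter.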

Elsevier participates in the Crossref Text and Data Mining Services (Crossref-TDM) and provides access to full-texts as HTML and XML documents. Surprisingly, the XML representation contains not only the full-text but also embedded metadata, including information about open access sponsorship in the <core> node:

<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
  BMBF - German Federal Ministry of Education and Research
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
  http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>

Snapshot of open access metadata in Elsevier XML full-texts. https://api.elsevier.com/content/article/PII:S0169409X18301479?httpAccept=text/xml

After downloading the Elsevier full-texts with the crminer package (Chamberlain 2018), I extracted the above-highlighted open access information from the XML documents.
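To make the extraction step more concrete, here is a small Python sketch that pulls the open access fields out of an XML string. It is an illustration only, not the R code behind the dataset. It assumes the element names from the snapshot above and ignores XML namespaces, because I do not rely on a documented schema; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OA_FIELDS = (
    "openaccess",
    "openaccessArticle",
    "openaccessType",
    "openArchiveArticle",
    "openaccessSponsorName",
    "openaccessSponsorType",
    "openaccessUserLicense",
)


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}name' to 'name'."""
    return tag.rsplit("}", 1)[-1]


def extract_oa_metadata(xml_text):
    """Return the first occurrence of each open access field in the document."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name in OA_FIELDS and name not in found:
            found[name] = (element.text or "").strip()
    return found


# Sample taken from the snapshot shown above, wrapped in a single root node.
SAMPLE = """<core>
<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
  BMBF - German Federal Ministry of Education and Research
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
  http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>
</core>"""

metadata = extract_oa_metadata(SAMPLE)
print(metadata["openaccessSponsorType"], "|", metadata["openaccessSponsorName"])
```

Stripping whitespace matters here, because the sponsor name and the license URL are wrapped in line breaks in the source XML.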

Moreover, I parsed the first occurrence of an author email, assuming that email domains roughly indicate the affiliation of the relevant corresponding author at the time of publication. The package urltools (Keyes et al. 2019) made it possible to extract email domains and to split them into meaningful parts.
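The email parsing can be approximated as follows. This Python sketch is not the urltools-based code used for the dataset: the regular expression is a simplification, `KNOWN_SUFFIXES` is a tiny hand-picked stand-in for the full public suffix list, and the example address is invented.

```python
import re

# Illustrative subset only; a real analysis should load the public suffix list.
KNOWN_SUFFIXES = {"ac.uk", "edu", "com", "de", "nl", "se"}

# Simplified pattern: local part, "@", then a host with at least one dot.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def first_email_host(text):
    """Return the host of the first email address in the text, or None."""
    match = EMAIL_PATTERN.search(text)
    return match.group(1).lower() if match else None


def split_host(host, suffixes=KNOWN_SUFFIXES):
    """Split a host into tld, suffix, domain and subdomain."""
    labels = host.lower().split(".")
    suffix_start = len(labels) - 1  # fall back to the last label
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in suffixes:
            suffix_start = start  # earliest match is the longest suffix
            break
    return {
        "host": ".".join(labels),
        "tld": labels[-1],
        "suffix": ".".join(labels[suffix_start:]),
        "domain": ".".join(labels[suffix_start - 1:]),
        "subdomain": ".".join(labels[:suffix_start - 1]),
    }


# Invented example text, not taken from an Elsevier article.
host = first_email_host("E-mail address: jane.doe@med.cornell.edu (J. Doe).")
print(split_host(host))
```

Checking multi-label suffixes first is what keeps a host ending in “ac.uk” from being reduced to the bare top-level domain “uk”.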

Finally, to measure the overlap between crowd-sourced and publisher-provided invoice data, I downloaded spending data from the Open APC Initiative (Aasheim et al. 2019). To my knowledge, the Open APC Initiative maintains the largest evidence-base for institutional spending on open access publication fees.

Throughout the data analysis, I used tools from the Tidyverse (Wickham et al. 2019). Data were gathered on 15 November 2019. To make this project more reproducible, I shared it as a research compendium using the holepunch package (Ram 2019). A research compendium contains data, code, and text associated with it (Marwick, Boettiger, and Mullen 2018). The research compendium belonging to this blog post is accessible here: https://github.com/subugoe/elsevier_hybrid_volume

Dataset characteristics

In the following data analysis, I will be using two files that I compiled. The first file, journal_facets.json, contains the publication volume per Elsevier journal offering hybrid open access. It furthermore summarises the various license URLs found through Crossref per Elsevier journal.

The second file, elsevier_hybrid_oa_df.csv, comprises article-level data. Each row holds information for a single hybrid open access article published in a hybrid journal, and the columns represent:

Variable          Description
doi               DOI
license           Open Content License
issued            Earliest publication date
issued_year       Earliest publication year
issn              ISSN, a journal identifier
journal_title     The title of the journal
journal_volume    Yearly publication volume
tdm_link          Link to the XML full-text
oa_sponsor_type   Invoice recipient type
oa_sponsor_name   Institution that directly received an invoice
oa_archive        Was open access provided through Elsevier’s open archive programme, in which articles are made openly available after an embargo?
host              Email host, e.g. med.cornell.edu
tld               Top-level domain, e.g. edu
suffix            Extracted suffix from domain name as defined by the public suffix list, e.g. ac.uk
domain            Email domain, e.g. cornell.edu
subdomain         Email subdomain, e.g. med

It should be noted, however, that Elsevier had not provided official documentation of its open access and invoice data at the time of writing.

Results

In total, 1,753 out of 1,990 hybrid journals published at least one open access article from 2015 to date, corresponding to about 88% of journal titles in Elsevier’s hybrid journal portfolio. In these journals, 70,657 articles were published open access. The total share of hybrid open access in the publication volume of Elsevier journals was 2.4%.

What is the uptake of open access in Elsevier hybrid journals?

The open access share varied across Elsevier hybrid journals. Figure 1, which replicates a boxplot aesthetic from The Economist magazine using the ggeconodist package (Rudis 2019), shows a slow but steady hybrid open access uptake. The median open access proportion was around 3% in the first eleven months of 2019.

Figure 1: Open access uptake in Elsevier journals per year in percent, visualised as a diminutive distribution chart. Since 2015, most hybrid journals have had a slow uptake rate of open access articles. In general, open access via the hybrid open access publishing model played a marginal role in the context of Elsevier’s total publication volume. Data Sources: Crossref, Elsevier B.V.
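The yearly median behind such a chart can be computed per journal and year. The Python sketch below shows one way to do so, assuming one row per journal and year with the number of open access articles and the yearly publication volume; the function name and the toy numbers are invented and do not come from the dataset.

```python
from collections import defaultdict
from statistics import median


def median_oa_share_by_year(rows):
    """rows: iterable of (journal, year, oa_articles, journal_volume)."""
    shares = defaultdict(list)
    for _journal, year, oa_articles, volume in rows:
        if volume > 0:  # skip journal-years without any articles
            shares[year].append(100 * oa_articles / volume)
    return {year: median(values) for year, values in sorted(shares.items())}


# Invented toy numbers, for illustration only.
toy_rows = [
    ("Journal A", 2018, 1, 100),
    ("Journal B", 2018, 3, 100),
    ("Journal A", 2019, 1, 100),
    ("Journal B", 2019, 3, 100),
    ("Journal C", 2019, 10, 200),
]
print(median_oa_share_by_year(toy_rows))
```

Taking the median of per-journal shares, rather than dividing total open access articles by total volume, keeps a few very large journals from dominating the picture.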

How many payments for open access articles in hybrid journals were facilitated by central invoicing?

In most cases, Elsevier sent invoices for hybrid open access publication fees to individual authors (59%). For around 33% of articles, the publisher directly billed funders and research organisations. Elsevier granted publication fee waivers to 6.2% of open access articles in hybrid journals.
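Shares like these can be derived by counting the `oa_sponsor_type` column of the article-level file. The following Python sketch uses a few invented rows instead of the real CSV. Only the label “FundingBody” appears in the XML snapshot above; the labels “Author” and “ElsevierWaived” are assumptions made for this illustration, as is the function name.

```python
import csv
import io
from collections import Counter


def share_by_invoicing_type(csv_text, column="oa_sponsor_type"):
    """Return the percentage of articles per invoicing type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Articles without sponsor information are counted as "unknown".
    counts = Counter((row.get(column) or "unknown") for row in reader)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.most_common()}


# Invented rows with placeholder DOIs, for illustration only.
toy_csv = "\n".join(
    ["doi,oa_sponsor_type"]
    + [f"10.1234/author{i},Author" for i in range(5)]
    + [f"10.1234/funder{i},FundingBody" for i in range(3)]
    + ["10.1234/waived0,ElsevierWaived", "10.1234/missing0,"]
)
print(share_by_invoicing_type(toy_csv))
```

Keeping an explicit “unknown” bucket makes it visible when the reported shares do not add up to 100%.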

Figure 2 shows the annual development per invoicing type. Inspired by Claus O. Wilke’s “Fundamentals of Data Visualization” (Wilke 2019), each type is visualised separately as a part of the total. The figure reveals a general growth of open access articles in hybrid journals. It illustrates that this development was mainly driven by billing individual authors, while central invoicing stagnated. Also, the number of fee-waived articles remained more or less constant from 2015 to date.

Figure 2: Development of fee-based open access publishing in Elsevier hybrid journals by invoicing type. Colored bars represent the invoice recipient, or whether the fee was waived. Grey bars show the total number of hybrid open access articles published in Elsevier journals from 2015 to date. Data Sources: Crossref, Elsevier B.V.

The following interactive visualisation (Figure 3), created with the echarts4r package (Coene 2019), lets you browse the invoicing data. I recommend using a recent Chrome browser to interact with the visualisation.

Figure 3: Breakdown of Elsevier hybrid open access journal articles by invoice recipient. Each rectangle represents an invoicing type and can be broken down by recipient. Data Source: Elsevier B.V.

Clicking on “Agreement” shows the funders or research organisations that paid for open access publication fees as part of a central or transformative agreement. In total, Elsevier disclosed 74 different institutions that received an invoice for open access publication. Not surprisingly, mostly British and Dutch funders or consortia paid for hybrid open access in Elsevier hybrid journals. The German Federal Ministry of Education and Research (BMBF) is, however, also represented despite the current boycott from most universities and research organisations in Germany (Else 2018). In fact, the BMBF is not part of the Alliance of Science Organisations in Germany, whose members want to negotiate a transformative agreement with Elsevier (Mittermaier 2017). Since 2018, the BMBF has financially supported 181 open access articles that appeared in 129 Elsevier hybrid journals according to data from the publisher.

Who published hybrid open access in Elsevier journals?

In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first corresponding author, a data point used to delineate open access funding (Geschuhn and Stone 2017).

Figure 4: Email domain analysis of first corresponding authors publishing open access in Elsevier hybrid journals. Around every fourth open access article in an Elsevier hybrid journal from 2015 to date had a corresponding author affiliated with a UK-based academic institution. Data Source: Elsevier B.V.

Figure 4 presents a breakdown by email domain suffix. In total, 67,900 email addresses were retrieved and parsed from Elsevier full-texts, corresponding to a share of 96%. Most corresponding author emails originate from academic institutions in the UK (“.ac.uk”), reflecting the country’s leading role in supporting hybrid open access (Pinfield, Salter, and Bath 2015). They are followed by domains from commercial organisations (“.com”), and US-American institutions of higher education (“.edu”). The figure illustrates that European institutions from Germany (“.de”), the Netherlands (“.nl”), and Sweden (“.se”) were also well represented. In total, 330 domain suffixes were retrieved.

In the following figure, a hierarchical, interactive treemap visualises the distribution of the email domains (see Figure 5). It appears that this distribution roughly represents the overall national research landscapes measured by publication output. However, the dominance of domains from commercial organisations, mostly email providers like “gmail.com” or the Chinese “163.com” and “126.com”, highlights the limitations of inferring eligible funding institutions from author email addresses.

Figure 5: Email domain analysis of first corresponding authors publishing open access in Elsevier hybrid journals. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. Data Source: Elsevier B.V.

How does Elsevier invoice data compare to spending information from the Open APC Initiative?

Finally, I was interested in the overlap between publisher-provided invoice data from Elsevier and institutional spending data from the Open APC Initiative. In total, the Open APC Initiative tracked 8,213 out of 70,657 published open access articles in hybrid journals, corresponding to a share of 12%. Institutional expenditures for these articles amounted to 24,008,889 € according to Open APC data. However, the Open APC Initiative listed 683 additional open access articles. One likely explanation is that the Crossref metadata representing these articles did not meet my criteria; another explanation could be that they appeared in journals that recently transitioned from hybrid to fully open access (e.g. the journal “NeuroImage”). At the journal level, the overlap was 58%.
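At its core, this comparison is a set operation on DOIs. The Python sketch below illustrates the idea with invented DOIs; it is not the code used for the post, and the helper names `normalise_doi` and `compare_sources` are hypothetical. DOIs are lower-cased and stripped of common resolver prefixes before matching, because they are case-insensitive and may be recorded in different forms.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi):
    """Lower-case a DOI and drop a leading resolver prefix, if any."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def compare_sources(publisher_dois, openapc_dois):
    """Count articles found in both sources and those only in Open APC."""
    publisher = {normalise_doi(d) for d in publisher_dois}
    openapc = {normalise_doi(d) for d in openapc_dois}
    both = publisher & openapc
    share = round(100 * len(both) / len(publisher), 1) if publisher else 0.0
    return {
        "in_both": len(both),
        "openapc_only": len(openapc - publisher),
        "share_tracked_percent": share,
    }


# Invented DOIs, for illustration only.
result = compare_sources(
    ["10.1016/A", "https://doi.org/10.1016/b", "10.1016/c", "10.1016/d"],
    ["10.1016/a", "10.1016/B", "10.1016/x"],
)
print(result)
```

Applied to the figures reported above, 8,213 of 70,657 articles corresponds to 11.6%, which rounds to the stated share of 12%.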

Figure 6 presents the annual development of spending disclosure relative to open access articles in Elsevier hybrid journals as reported in the Open APC Initiative, grouped by invoicing type. The Open APC Initiative mostly tracked articles covered under central invoicing agreements. The figure also suggests that some invoices billed to authors were covered by institutions participating in Open APC. Generally, the results confirm a delay between invoicing and reporting to the Open APC Initiative (Jahn and Tullney 2016). Surprisingly, Open APC listed institutional payments for 13 articles for which Elsevier reported that the related fee was waived.

Figure 6: Development of fee-based open access publishing in Elsevier hybrid journals by invoicing type and disclosure of institutional payment by the Open APC Initiative. Grey bars show the total number of hybrid open access articles published by invoicing type from 2015 to date. Colored bars represent the number of articles that are also tracked in Open APC. Data Sources: Crossref, Elsevier B.V., Open APC Initiative.

Figure 7 presents the gap between publisher-provided invoice data and Open APC for the ten largest contributing funding bodies. It highlights that British funders had the largest overlap rates, which reflects Open APC efforts to re-use openly available spending data from these institutions (Pieper and Broschinski 2018). On the other hand, Open APC did not track Dutch (“VSNU”), U.S. (“Melinda & Bill Gates Foundation”) or European funding activities (“European Research Council”) for hybrid open access publication fees.

Figure 7: Proportion of fee-based open access articles in Elsevier hybrid journals disclosed by the Open APC Initiative. Blue areas represent an overlap in spending data availability; grey areas reflect centrally paid articles that were not present in the Open APC data. Data Sources: Crossref, Elsevier B.V., Open APC Initiative.

Discussion and conclusion

In this blog post, I have illustrated how to obtain the invoice data that Elsevier embeds in its full-texts. These data can be used to determine whether Elsevier sent invoices to authors, or to funders or research organisations that have a central payment agreement or a transformative agreement with Elsevier, or whether the fee was waived. Providing such machine-readable data makes funding streams for hybrid open access more transparent.

At the same time, the data analysis highlights various critical aspects related to open access publishing in hybrid journals. Despite increased funding activities, only a small proportion of journal articles were made openly available under this model. Furthermore, Elsevier sent the majority of invoices directly to the authors. This practice not only imposes administrative burdens and costs on all parties involved, but also conceals funding sources for publication fees. Existing spending data from funders and research organisations can only partly close this gap. Moreover, publishers offer different kinds of funding opportunities for hybrid open access at the same time, including central invoicing. However, it is likely that not all agreements with central invoicing as they currently stand meet the Plan S requirements for transformative agreements.

Implementation of Plan S is underway to change current practices of funding open access publication in hybrid journals. Because Elsevier’s current transparency about its invoicing is a remarkable exception, workflow guidelines for transformative agreements should consider recommending the publisher’s example of sharing invoice data as good business practice for publishers. Although future work needs to tackle the remaining questions about data quality and coverage, publisher-provided invoice data make publishers more accountable and extend the evidence base relative to hybrid open access. As a result, the data analysis presented here provides a basis to improve the monitoring of funding streams in the context of transformative agreements.

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft, project “Hybrid OA Dashboards: Mehrwertorientierte Analytics-Anwendungen zur Förderung der Kostentransparenz bei Transformationsverträgen”, project id 416115939.

References

Aasheim, Jens Harald, Benjamin Ahlborn, Chelsea Ambler, Magdalena Andrae, Jochen Apel, Hans-Georg Becker, Roland Bertelmann, et al. 2019. The Open APC Initiative. Bielefeld University Library. https://github.com/OpenAPC/openapc-de.
Björk, Bo-Christer. 2017. “Growth of Hybrid Open Access, 2009-2016.” PeerJ 5: e3878. https://doi.org/10.7717/peerj.3878.
Chamberlain, Scott. 2018. Crminer: Fetch ’Scholary’ Full Text from ’Crossref’. https://CRAN.R-project.org/package=crminer.
Chamberlain, Scott, Hao Zhu, Najko Jahn, Carl Boettiger, and Karthik Ram. 2019. Rcrossref: Client for Various ’CrossRef’ ’APIs’. https://CRAN.R-project.org/package=rcrossref.
Coene, John. 2019. Echarts4r: Create Interactive Graphs with ’Echarts JavaScript’ Version 4. http://echarts4r.john-coene.com/.
Dallmeier-Tiessen, Suenje, Robert Darby, Bettina Goerner, Jenni Hyppoelae, Peter Igo-Kemenes, Deborah Kahn, Simon C. Lambert, et al. 2011. “Highlights from the SOAP Project Survey. What Scientists Think about Open Access Publishing.” http://arxiv.org/abs/1101.5260.
Else, Holly. 2018. “Dutch Publishing Giant Cuts Off Researchers in Germany and Sweden.” Nature 559 (7715): 454–55. https://doi.org/10.1038/d41586-018-05754-1.
Geschuhn, Kai, and Graham Stone. 2017. “It’s the Workflows, Stupid! What Is Required to Make ‘Offsetting’ Work for the Open Access Transition.” Insights: The UKSG Journal 30 (3): 103–14. https://doi.org/10.1629/uksg.391.
Holzer, Angela. 2017. “Wozu Open-Access-Transformationsverträge?” O-Bib. Das Offene Bibliotheksjournal 4: 87–95. https://doi.org/10.5282/o-bib/2017H2S87-95.
Jahn, Najko, and Marco Tullney. 2016. “A Study of Institutional Spending on Open Access Publication Fees in Germany.” PeerJ 4 (August): e2323. https://doi.org/10.7717/peerj.2323.
Keyes, Os, Jay Jacobs, Drew Schmidt, Mark Greenaway, Bob Rudis, Alex Pinto, Maryam Khezrzadeh, et al. 2019. Urltools: Vectorised Tools for URL Handling and Parsing. https://CRAN.R-project.org/package=urltools.
Laakso, Mikael, and Bo-Christer Björk. 2016. “Hybrid Open Access—a Longitudinal Study.” Journal of Informetrics 10 (4): 919–32. https://doi.org/10.1016/j.joi.2016.08.002.
Leeper, Thomas J. 2018. Tabulizer: Bindings for Tabula PDF Table Extractor Library. https://cran.r-project.org/package=tabulizer.
Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using r (and Friends).” PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3192v2.
Mittermaier, Bernhard. 2017. “From the DEAL Engine Room — an Interview with Bernhard Mittermaier.” LIBREAS.Library Ideas. https://libreas.eu/ausgabe32/mittermaier_en/.
Pieper, Dirk, and Christoph Broschinski. 2018. “OpenAPC: A Contribution to a Transparent and Reproducible Monitoring of Fee-Based Open Access Publishing Across Institutions and Nations.” Insights: The UKSG Journal 31. https://doi.org/10.1629/uksg.439.
Pinfield, Stephen, Jennifer Salter, and Peter A. Bath. 2015. “The ‘Total Cost of Publication’ in a Hybrid Open-Access Environment: Institutional Approaches to Funding Journal Article-Processing Charges in Combination with Subscriptions.” Journal of the Association for Information Science and Technology 67 (7): 1751–66. https://doi.org/10.1002/asi.23446.
Ram, Karthik. 2019. Holepunch: Configure Your r Project for ’Binderhub’. https://github.com/karthik/holepunch.
Rudis, Bob. 2019. Ggeconodist: Create Diminutive Distribution Charts. https://github.com/hrbrmstr/ggeconodist.
Solomon, David J., and Bo-Christer Björk. 2011. “Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal.” Journal of the Association for Information Science and Technology 63 (1): 98–107. https://doi.org/10.1002/asi.21660.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wilke, Claus O. 2019. Fundamentals of Data Visualization. O’Reilly. https://serialmentor.com/dataviz/.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Jahn (2019, Nov. 25). Scholarly Communication Analytics: Mining and analysing invoice data from Elsevier relative to hybrid open access. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice/

BibTeX citation

@misc{jahn2019mining,
  author = {Jahn, Najko},
  title = {Scholarly Communication Analytics: Mining and analysing invoice data from Elsevier relative to hybrid open access},
  url = {https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice/},
  year = {2019}
}