Publishers rarely make publication fee spending for hybrid journals transparent. Elsevier is a remarkable exception, as the publisher provides open and machine-readable data relative to its central invoicing with funding bodies and fee waivers at the article level. This blogpost illustrates how to mine Elsevier full-texts for these data with the data science tool R and presents new insights by analysing the resulting dataset: of 70,657 articles published open access in 1,753 hybrid journals from 2015 to date, around one third of the publication fees were paid through central agreements. Nevertheless, the majority of funding sources for hybrid open access remains unclear.
In September 2018, cOAlition S, a group of international research funders, announced its widely discussed Plan S. According to its principles, research funding organisations aligned in cOAlition S will cover open access publication fees, also known as article-processing charges (APC), but they expressed the intent to suspend financial support of such fees associated with open access publishing in hybrid journals. Exceptions are cases within the controlled setting of transformative agreements. These are institutional or consortial agreements that repurpose subscription expenditures for open access publishing in order to drive the transition of subscription-based journal publishing to fully open access. Research performing organisations initiated transformative agreements in recent years as a strategy to rein in uncontrolled and unmonitored spending on publication fees in hybrid journals and to accelerate the open access transition.
With the aforementioned financial restrictions in place from 2021, cOAlition S also intends to monitor compliance with the Plan S principles. To date, however, the monitoring of spending for open access publishing in hybrid journals has been limited, due to a lack of data around these financial transactions. Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that many authors already do not pay publication fees themselves, keeping track of these funding streams is challenging, because publishers rarely share invoice data (Björk 2017). Nor do all funders and research organisations track the payment information or make it available to fill this gap, despite examples like the British Charity Open Access Fund or the Open APC Initiative (Jahn and Tullney 2016).
At the SUB Göttingen, we will address this lack of transparency in a new project funded by the Deutsche Forschungsgemeinschaft (DFG) in the context of its programme “Open Access Transition Agreements” (Holzer 2017). Building on our pilot project, the interactive Shiny app Hybrid OA Journal Monitor, this project will investigate the data needs of German library consortia and how they can be addressed through metadata requirements in transformative agreements. Case studies and data products will monitor levels of compliance with policy recommendations. Here, invoice data will be essential to make the various funding streams of open access publishing in hybrid journals visible.
Against this background, this blogpost presents a dataset comprising publicly available invoice data relative to open access articles in hybrid journals published by Elsevier, a major publisher of scholarly journals. This dataset brings together metadata from Crossref and information retrieved from open access full-texts. The methods used to obtain the data address the challenges of discovering open access articles in hybrid journals (Laakso and Björk 2016), including related funding and affiliation information, using open data and tools. I will argue that Elsevier’s approach of sharing invoice recipients serves as an example of good business practice for other publishers offering hybrid open access options and central open access agreements. It is, thus, relevant for standardisation efforts like the “ESAC Workflow Recommendations for Transformative Agreements” (Geschuhn and Stone 2017).
To demonstrate the potential of publisher-provided data to enable monitoring of Plan S compliance, transformative agreements and the transition of subscription journals to open access, the dataset will be used to analyse the number and the proportion of open access articles in Elsevier hybrid journals. Drawing on Elsevier’s funding information, I will also investigate whether Elsevier sent invoices to authors or to funders and research organisations that, presumably, have either a central payment agreement or a transformative agreement with Elsevier, or whether the fees were waived. Moreover, text-mined author email domains will provide a rough approximation of the affiliation of the first corresponding author, an important data point for delineating open access funding; it is now standard practice for the first, or submitting, corresponding author, or her institution, to take on responsibility for payment of the relevant open access publishing fees (Geschuhn and Stone 2017). Finally, the publisher-provided invoice data will be compared with crowd-sourced spending data from the Open APC Initiative.
To allow for a data-driven discussion about Elsevier’s approach and its potential for monitoring Plan S compliance and transformative agreements, I made the resulting dataset openly available on GitHub along with the source code used to obtain the data.
As a start, I used the Elsevier publication fee price list, an openly available PDF document, to determine the current hybrid open access journals in Elsevier’s journal portfolio. The rOpenSci tabulizer package (Leeper 2018) allowed me to extract data about these journals from this file.
Then, I interfaced the Crossref REST API with the R package rcrossref (Chamberlain et al. 2019). The first API call retrieved facet field counts for license URLs and the yearly article volumes for the period 2015-19 for every journal. After matching Creative Commons license URLs indicating open access articles, a second API call retrieved article-level metadata per journal. Next, I used the metadata field delay-in-days to exclude open access articles published after an embargo period (“delayed open access”). Because a few records had different date formats, which were used for the delay calculation by Crossref, I allowed for a lag of 31 days.
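The two Crossref steps described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the rcrossref code used for this post. The helper names `facet_query_url` and `is_immediate_cc`, the example ISSN (read off the PII shown further below), and the toy DOIs are assumptions made for illustration; the `license` and `delay-in-days` fields follow the Crossref works metadata.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"
MAX_DELAY_DAYS = 31  # lag tolerated in the text for inconsistent date formats


def facet_query_url(issn, from_date="2015-01-01"):
    """Build a works query that returns only facet counts (rows=0) for one journal.

    Hypothetical helper: the exact query used for the post may differ.
    """
    params = {
        "filter": f"issn:{issn},from-pub-date:{from_date}",
        "facet": "license:*,published:*",
        "rows": 0,
    }
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def is_immediate_cc(record, max_delay=MAX_DELAY_DAYS):
    """True if a Creative Commons license on the record applies within max_delay days."""
    for lic in record.get("license", []):
        url = lic.get("URL", "")
        delay = lic.get("delay-in-days", 0)
        if "creativecommons.org" in url and delay <= max_delay:
            return True
    return False


# Toy records shaped like Crossref works metadata; the DOIs are made up.
CC_BY = "http://creativecommons.org/licenses/by/4.0/"
records = [
    {"DOI": "10.1234/immediate", "license": [{"URL": CC_BY, "delay-in-days": 0}]},
    {"DOI": "10.1234/lagged", "license": [{"URL": CC_BY, "delay-in-days": 28}]},
    {"DOI": "10.1234/embargoed", "license": [{"URL": CC_BY, "delay-in-days": 365}]},
    {"DOI": "10.1234/tdm-only", "license": [
        {"URL": "https://www.elsevier.com/tdm/userlicense/1.0/", "delay-in-days": 0}
    ]},
]

url = facet_query_url("0169-409X")
immediate = [r["DOI"] for r in records if is_immediate_cc(r)]
```

In this sketch, the embargoed record and the record carrying only a text-mining license are dropped, while the record with a 28-day delay is kept because it falls within the 31-day tolerance.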
Elsevier participates in the Crossref Text and Data Mining Services (Crossref-TDM) and provides access to full-texts as HTML and XML documents. Surprisingly, the XML representation not only contains the full-text, but also comprises embedded metadata including information about open access sponsorship in the <core> node:
    <openaccess>1</openaccess>
    <openaccessArticle>true</openaccessArticle>
    <openaccessType>Full</openaccessType>
    <openArchiveArticle>false</openArchiveArticle>
    <openaccessSponsorName>BMBF - German Federal Ministry of Education and Research</openaccessSponsorName>
    <openaccessSponsorType>FundingBody</openaccessSponsorType>
    <openaccessUserLicense>http://creativecommons.org/licenses/by/4.0/</openaccessUserLicense>
Snapshot of open access metadata in Elsevier XML full-texts. https://api.elsevier.com/content/article/PII:S0169409X18301479?httpAccept=text/xml
After downloading the Elsevier full-texts with the crminer package (Chamberlain 2018), I extracted the above-highlighted open access information from the XML documents.
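To illustrate the extraction step, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`; it is not the R code used for this post. The element names are taken from the snapshot above, whereas the wrapper element, the namespace URI and the function names are placeholders invented for the example. Matching on local names is one simple way to stay independent of whatever namespaces the real documents declare.

```python
import xml.etree.ElementTree as ET

# Element names as they appear in the snapshot of Elsevier's XML metadata.
OA_FIELDS = (
    "openaccess",
    "openaccessArticle",
    "openaccessType",
    "openArchiveArticle",
    "openaccessSponsorName",
    "openaccessSponsorType",
    "openaccessUserLicense",
)

# Toy document: wrapper element and namespace URI are hypothetical placeholders.
SAMPLE_XML = """<article xmlns="http://example.org/hypothetical-namespace">
  <core>
    <openaccess>1</openaccess>
    <openaccessArticle>true</openaccessArticle>
    <openaccessType>Full</openaccessType>
    <openArchiveArticle>false</openArchiveArticle>
    <openaccessSponsorName>BMBF - German Federal Ministry of Education and Research</openaccessSponsorName>
    <openaccessSponsorType>FundingBody</openaccessSponsorType>
    <openaccessUserLicense>http://creativecommons.org/licenses/by/4.0/</openaccessUserLicense>
  </core>
</article>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_oa_metadata(xml_text):
    """Return the first value found for each open access field."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name in OA_FIELDS and name not in found:
            found[name] = (element.text or "").strip()
    return found


meta = extract_oa_metadata(SAMPLE_XML)
```

Here `meta["openaccessSponsorType"]` holds the invoice recipient type and `meta["openaccessSponsorName"]` the institution named by the publisher; fields missing from a document are simply absent from the result.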
Moreover, I parsed the first occurrence of an author email, assuming that email domains roughly indicate the affiliation of the relevant corresponding author at the time of publication. The package urltools (Keyes et al. 2019) made it possible to extract email domains and to split them into meaningful parts.
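The email parsing can be approximated as follows. This Python sketch is an illustration only: the post relied on the R package urltools and the full public suffix list, while the code below uses a deliberately tiny, hand-picked suffix set (`KNOWN_SUFFIXES`), a simple regular expression and a made-up email address. All names in it are assumptions.

```python
import re

# Tiny stand-in for the public suffix list; the real list is far longer.
KNOWN_SUFFIXES = {"ac.uk", "edu", "com", "de", "nl", "se"}

# Captures the host part of the first thing that looks like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def first_email_host(text):
    """Return the lower-cased host of the first email address in text, or None."""
    match = EMAIL_RE.search(text)
    return match.group(1).lower() if match else None


def split_host(host):
    """Split an email host into subdomain, domain, suffix and top-level domain."""
    labels = host.lower().split(".")
    cut = len(labels) - 1  # fallback: treat the last label as the suffix
    for i in range(1, len(labels)):
        # Scanning from the left finds the longest known suffix first.
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            cut = i
            break
    return {
        "host": ".".join(labels),
        "tld": labels[-1],
        "suffix": ".".join(labels[cut:]),
        "domain": ".".join(labels[cut - 1:]),
        "subdomain": ".".join(labels[:cut - 1]),
    }


# Made-up correspondence note, shaped like a typical article footnote.
note = "Corresponding author. E-mail address: jane.doe@med.cornell.edu (J. Doe)."
parts = split_host(first_email_host(note))
```

Taking only the first address mirrors the assumption stated above that it belongs to the relevant corresponding author; articles that list several corresponding authors would need extra handling.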
Finally, to measure the overlap between crowd-sourced and publisher-provided invoice data, I downloaded spending data from the Open APC Initiative (Aasheim et al. 2019). To my knowledge, the Open APC Initiative maintains the largest evidence-base for institutional spending on open access publication fees.
Throughout the data analysis, I used tools from the Tidyverse (Wickham et al. 2019). Data were gathered on 15 November 2019. To make this project more reproducible, I shared it as a research compendium using the holepunch package (Ram 2019). A research compendium contains data, code, and text associated with it (Marwick, Boettiger, and Mullen 2018). The research compendium belonging to this blog post is accessible here: https://github.com/subugoe/elsevier_hybrid_volume
In the following data analysis, I will be using two files that I compiled. The first file, journal_facets.json, contains the publication volume per Elsevier journal offering hybrid open access. It furthermore summarises the various license URLs found through Crossref per Elsevier journal.
The second file, elsevier_hybrid_oa_df.csv, comprises article-level data. Each row holds information for a single open access article published in a hybrid journal, and the columns represent:
| Variable | Description |
|---|---|
| doi | DOI |
| license | Open Content License |
| issued | Earliest publication date |
| issued_year | Earliest publication year |
| issn | ISSN, a journal identifier |
| journal_title | The title of the journal |
| journal_volume | Yearly publication volume |
| tdm_link | Link to the XML full-text |
| oa_sponsor_type | Invoice recipient type |
| oa_sponsor_name | Institution that directly received an invoice |
| oa_archive | Was open access provided through Elsevier’s open archive programme, in which articles are made openly available after an embargo? |
| host | Email host, e.g. med.cornell.edu |
| tld | Top-level domain, e.g. edu |
| suffix | Extracted suffix from domain name as defined by the public suffix list, e.g. ac.uk |
| domain | Email domain, e.g. cornell.edu |
| subdomain | Email subdomain, e.g. med |
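As a rough illustration of how the article-level file can be summarised, the following Python sketch tallies the share of articles per invoice recipient type. It reads a small in-memory sample rather than the real file. The column names follow the table above and "FundingBody" appears in the XML snapshot, but the other category labels ("Author", "ElsevierWaived"), the DOIs and the function name are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for elsevier_hybrid_oa_df.csv with only three columns.
# DOIs are made up; "Author" and "ElsevierWaived" are assumed category labels.
SAMPLE_CSV = """doi,oa_sponsor_type,oa_sponsor_name
10.1234/a,FundingBody,BMBF - German Federal Ministry of Education and Research
10.1234/b,Author,
10.1234/c,Author,
10.1234/d,ElsevierWaived,
"""


def share_by_type(rows):
    """Percentage of articles per invoice recipient type, rounded to one decimal."""
    counts = Counter(row["oa_sponsor_type"] or "unknown" for row in rows)
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = share_by_type(rows)
```

To run the same tally on the published dataset, the `io.StringIO` sample would be replaced by an opened file handle for elsevier_hybrid_oa_df.csv; rows without a recipient type are counted as "unknown" rather than dropped.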
It should be noted, however, that Elsevier did not provide an official documentation of its open access and invoice data at the time of writing of this blogpost.
In total, 1,753 out of 1,990 hybrid journals published at least one open access article from 2015 to date, corresponding to about 88% of journal titles in Elsevier’s hybrid journal portfolio. In these journals, 70,657 articles were published open access. The total share of hybrid open access in the publication volume of Elsevier journals was 2.4%.
The open access share varied across Elsevier hybrid journals. Figure 1, which replicates the boxplot aesthetics of The Economist magazine using the ggeconodist package (Rudis 2019), shows a slow but steady uptake of hybrid open access. The median open access proportion was around 3% in the first eleven months of 2019.
In most cases, Elsevier sent invoices for hybrid open access publication fees to individual authors (59%). For around 33% of articles, the publisher directly billed funders and research organisations. Elsevier granted publication fee waivers to 6.2% of open access articles in hybrid journals.
Figure 2 shows the annual development per invoicing type. Inspired by Claus O. Wilke’s “Fundamentals of Data Visualisation” (Wilke 2019), each type is visualised separately as a part of the total. The figure reveals a general growth of open access articles in hybrid journals. It illustrates that this development was mainly driven by billing individual authors, while central invoicing stagnated. Also, the number of fee-waived articles remained more or less constant from 2015 to date.
The following interactive visualisation (Figure 3), created with the echarts4r package (Coene 2019), lets you browse the invoicing data. I recommend using a recent Chrome browser to interact with the visualisation.
Clicking on “Agreement” shows the funders or research organisations that paid for open access publication fees as part of a central or transformative agreement. In total, Elsevier disclosed 74 different institutions that received an invoice for open access publication. Not surprisingly, mostly British and Dutch funders or consortia paid for hybrid open access in Elsevier hybrid journals. The German Federal Ministry of Education and Research (BMBF) is, however, also represented despite the current boycott from most universities and research organisations in Germany (Else 2018). In fact, the BMBF is not part of the Alliance of Science Organisations in Germany, whose members want to negotiate a transformative agreement with Elsevier (Mittermaier 2017). Since 2018, the BMBF has financially supported 181 open access articles that appeared in 129 Elsevier hybrid journals according to data from the publisher.
In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first corresponding author, a data point used to delineate open access funding (Geschuhn and Stone 2017).
Figure 4 presents a breakdown by email domain suffix. In total, 67,900 email addresses were retrieved and parsed from Elsevier full-texts, corresponding to a share of 96%. Most corresponding author emails originate from academic institutions in the UK (“.ac.uk”), reflecting the country’s leading role in supporting hybrid open access (Pinfield, Salter, and Bath 2015). They are followed by domains from commercial organisations (“.com”), and US-American institutions of higher education (“.edu”). The figure illustrates that European institutions from Germany (“.de”), the Netherlands (“.nl”), and Sweden (“.se”) were also well represented. In total, 330 domain suffixes were retrieved.
In the following figure, a hierarchical, interactive treemap visualises the distribution of the email domains (see Figure 5). It appears that this distribution roughly represents the overall national research landscapes measured by publication output. However, the dominance of domains from commercial organisations, mostly email providers like “gmail.com” or the Chinese “163.com” and “126.com”, highlights the limitations of this approach to infer eligible funding institutions with author email addresses.
Finally, I was interested in the overlap between publisher-provided invoice data from Elsevier and institutional spending data from the Open APC Initiative. In total, the Open APC Initiative tracked 8,213 out of 70,657 published open access articles in hybrid journals, corresponding to a share of 12%. Institutional expenditures for these articles amounted to 24,008,889 € according to Open APC data. However, the Open APC Initiative listed 683 additional open access articles. One likely explanation is that the Crossref metadata representing these articles did not meet my criteria; another explanation could be that they appeared in journals that recently transitioned from hybrid to fully open access (e.g. the journal “NeuroImage”). At the journal level, the overlap was 58%.
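The overlap calculation boils down to comparing two sets of DOIs. The Python sketch below shows one way to do this; it is an illustration and not the code used for the post. The normalisation rules, the function names and the toy DOIs are assumptions, while the final line simply re-derives the reported 12% from the counts given above.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi):
    """Lower-case a DOI and strip a resolver or 'doi:' prefix, if any."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def overlap(publisher_dois, openapc_dois):
    """Count DOIs found in both sources and those listed only by Open APC."""
    pub = {normalise_doi(d) for d in publisher_dois}
    apc = {normalise_doi(d) for d in openapc_dois}
    both = pub & apc
    return {
        "both": len(both),
        "only_openapc": len(apc - pub),
        "share_pct": round(100 * len(both) / len(pub), 1) if pub else 0.0,
    }


# Toy example with made-up DOIs.
result = overlap(
    ["10.1234/A", "10.1234/b", "10.1234/c", "10.1234/d"],
    ["https://doi.org/10.1234/a", "10.1234/x"],
)

# Reported counts from the text: 8,213 of 70,657 articles.
reported_share = round(100 * 8213 / 70657)
```

DOIs are case-insensitive, so normalising before the comparison avoids undercounting matches when the two sources differ in capitalisation or prefix style.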
Figure 6 presents the annual development of spending disclosure relative to open access articles in Elsevier hybrid journals as reported to the Open APC Initiative, grouped by invoicing type. The Open APC Initiative mostly tracked articles covered under central invoicing agreements. The figure also suggests that some invoices billed to authors were covered by institutions participating in Open APC. Generally, the results confirm a delay between invoicing and reporting to the Open APC Initiative (Jahn and Tullney 2016). Surprisingly, Open APC listed institutional payments for 13 articles for which Elsevier reported that the relevant fee was waived.
Figure 7 presents the gap between publisher-provided invoice data and Open APC for the ten largest contributing funding bodies. It highlights that British funders had the largest overlap rates, which reflects Open APC efforts to re-use openly available spending data from these institutions (Pieper and Broschinski 2018). On the other hand, Open APC did not track Dutch (“VSNU”), U.S. (“Melinda & Bill Gates Foundation”) or European funding activities (“European Research Council”) for hybrid open access publication fees.
In this blog post, I have illustrated how it is possible to obtain invoice data from Elsevier, which are embedded in full-texts. These data can be used to determine whether Elsevier sent invoices to authors, to funders or research organisations that have a central payment agreement or a transformative agreement with Elsevier, or whether the fee was waived. Providing such machine-readable data makes funding streams for hybrid open access more transparent.
At the same time, the data analysis highlights various critical aspects related to open access publishing in hybrid journals. Despite increased funding activities, only a small proportion of journal articles were made openly available under this model. Furthermore, Elsevier sent the majority of invoices directly to the authors. This practice not only imposes administrative burdens and costs on all parties involved, but also conceals funding sources for publication fees. Existing spending data from funders and research organisations can only partly close this gap. Moreover, publishers offer different kinds of funding opportunities for hybrid open access at the same time, including central invoicing. However, it is likely that not all agreements with central invoicing as they currently stand meet the Plan S requirements for transformative agreements.
Implementation of Plan S is underway to change current practices of funding open access publication in hybrid journals. Because Elsevier’s current transparency about its invoicing is a remarkable exception, workflow guidelines for transformative agreements should consider adopting the publisher’s example of sharing invoice data as a recommended good business practice for publishers. Although future work needs to tackle the remaining questions about data quality and coverage, publisher-provided invoice data make publishers more accountable and extend the evidence base relative to hybrid open access. As a result, the data analysis presented here provides a basis to improve the monitoring of funding streams in the context of transformative agreements.
This work was supported by the Deutsche Forschungsgemeinschaft, project “Hybrid OA Dashboards: Mehrwertorientierte Analytics-Anwendungen zur Förderung der Kostentransparenz bei Transformationsverträgen”, project id 416115939.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Jahn (2019, Nov. 25). Scholarly Communication Analytics: Mining and analysing invoice data from Elsevier relative to hybrid open access. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice/
BibTeX citation
@misc{jahn2019mining, author = {Jahn, Najko}, title = {Scholarly Communication Analytics: Mining and analysing invoice data from Elsevier relative to hybrid open access}, url = {https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice/}, year = {2019} }