Re: Comparing the Wellcome OA Policy and the RCUK (draft) Policy

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Thu, 19 May 2005 23:18:53 +0100

On Thu, 19 May 2005, Robert Terry (Wellcome Trust) wrote:

> [http://www.wellcome.ac.uk/doc_WTX025191.html ]
> [Here is] why funding institutions believe that central repositories are
> the best scientific solution for access to the research papers and the
> data they fund.

The issue is not whether central repositories are the *best* way to access
papers. The issue is whether, in order to get the Wellcome-funded papers
into a central archive like PMC or UKPMC, it was necessary or desirable
to require that they be self-archived directly in PMC/UKPMC, rather than
requiring that they be self-archived in the researcher's own institutional
repository (IR), from which PMC could then *harvest* the data.

Institutional self-archiving propagates across all disciplines and
institutions; central self-archiving does not. There can be many central
archives, providing access to the same harvested content, or mixes and
matches thereof. The institutions are the research providers. Their own
IRs are the natural primary locus for their own research output.

> The Trust and the Research Councils have institutions of their own
> (i.e. not part of a university) and so PMC will be their institutional
> repository and it is important to remember that the Trust operates
> globally supporting 4000 researchers in more than 40 countries -
> we need a repository that meets all our needs today and PMC offers that.

No disagreement whatsoever about that! It is good that the Trust creates
a central archive for its funded research, both as (1) a central means
of access and as (2) a back-up direct locus for self-archiving by those
researchers whose institutions do not yet have their own IRs, or whose
institutions are part of the Trust itself.

[But, please, let's leave the Research Councils -- I mean RCUK --
out of this, as they have not, fortunately, opted for the central
self-archiving route! If RCUK decide to create a RCUK central archive,
for purposes (1) and (2), that's just fine, for they will be able to
harvest their contents from the UK IRs that will be the primary locus
for institutional self-archiving, thanks to the (hoped for!) optimal
RCUK policy (in contrast to the needlessly non-optimal Wellcome policy).]

> The reason the Wellcome Trust, the Medical Research Council, the
> Biotechnology and Biological Research Council, the British Heart
> Foundation and the Arthritis Research Campaign with support from JISC
> are interested in exploring the possibility of establishing a UK portal
> for PubMed Central is that we want a long-term digital archive (i.e. not
> Word or PDF files but XML files) that will integrate the research
> literature with the data.

That is splendid, highly desirable, and highly commendable! But why on
earth did you pursue it at the cost of requiring central self-archiving,
when you would have had *precisely* the same benefits by requiring
institutional self-archiving -- with central harvesting, and the central
archive also serving as a back-up for direct self-archiving by those
whose institutions did not yet have an IR? The outcome and benefits
(for the Wellcome Trust et al., and for users) would have been exactly
identical, but the benefits for the growth of OA as a whole would have been
incomparably greater.

> We fund research from a scientific
> perspective, not its geographical location, and we want to ensure that
> when the literature is searched the search engine can go deeper than the
> metadata and provide links between, for example, genome sequence,
> chemical compounds or MRI scan images embedded in an article and
> databases such as PubChem and Genbank. It will move between the
> databases and PMC and vice versa - a Japanese or French team working on
> a gene but not publishing in English will be able to discover other
> research groups working on the same sequence. Teams working on drug
> compounds but investigating different uses will be able to discover who
> else is working on that compound either by searching the literature or
> the database.

These are all wonderful, desirable features, but they have *nothing* to do with
the point at issue!

> PMC already offers this functionality and that's vital to enhance the
> potential that the Internet offers. The life sciences have already
> moved beyond the need to read a word document on a local website.
> Institutional repositories may never offer the same degree of
> functionality until every single institution uses the same ingestion and
> storage system - OAI only links the metadata to files that might be in
> Word or PDF which may be unreadable in the years to come.

Fine. So harvest and enhance their metadata and full-texts. The point
is that if the Wellcome self-archiving requirement had thereby put its
weight and momentum behind institutional self-archiving, its influence
would have propagated far beyond the subset of the biomedical literature
that Wellcome funds. And without losing a single one of the desirable
features you describe.

> PMC will be the institutional repository for the National Institutes of
> Health, they are already talking to organisations in Japan and France
> and hopefully UK PMC will bring in the major funders of the life
> sciences in the UK - the database will be a truly interoperable global
> resource.

Robert: We are repeating ourselves, and you keep talking about apples
(PMC research) whereas I keep talking about fruit (*all research*), and
how the Wellcome self-archiving requirement could have helped increase
the quantity of Open Access fruit without sacrificing or compromising
or complicating a single one of the benefits you keep listing.

> This doesn't preclude the universities using PMC to populate their own
> repositories

You have said this to me before, and you have heard my reply: The
university archives are almost empty (15%). To fill them, we need a
self-archiving *requirement*. Wellcome can only require that its own
fundees self-archive, but if it requires them to self-archive in their
own institutional IRs, the self-archiving practice propagates. If instead
they are required to self-archive specifically in PMC -- adding that
universities are "welcome" to harvest it back if they wish -- it turns
everything upside-down! Universities, the actual research-providers,
who are hardly self-archiving now and need every bit of incentive to do so,
are instead invited to become self-harvesters of their own output, from PMC!

Well, as far as OA is concerned, there is no gain in universities
harvesting back their own OA if it is already OA. What's already OA is
not the problem. What is *not* yet OA is the problem. *That* can't be
harvested back from PMC! And that is not increased one whit by the Wellcome
mandate if it is (needlessly) PMC-centric!

Moreover, it is not only illogical and impractical, but contrary to the
spirit and functional power of the OAI-PMH -- the Open Archives Initiative
Protocol for Metadata *Harvesting* (the very glue that is holding all of
OA together) http://www.openarchives.org/ -- for the *data-providers*
(the research institutions) to harvest back their *own* data from
the central *service-providers* (who have enhanced and repackaged
the data in various ways)! That has the whole OAI harvesting function
backwards. Institutions are the data-providers and central archives
like PMC (and why just one? why not many, with different enhancements,
functionalities, services?) are meant to be the service-providers,
for whom the OAI-PMH was created in 1999 specifically so the OAI data
would be interoperable, so it can be harvested and enhanced by central
OAI service-providers like PMC.
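
To make the direction of harvesting concrete, here is a minimal sketch
(mine, not anything PMC or any IR actually runs) of what a central
service-provider does to one institutional data-provider: it issues an
OAI-PMH ListRecords request and reads back the Dublin Core metadata.
The verb, the arguments and the XML namespaces are those of the
protocol; the repository address ir.example.org, the function names
and the sample record are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build the ListRecords request a harvester sends to one IR."""
    if resumption_token:
        # In the protocol, resumptionToken is an exclusive argument:
        # it is sent on its own to fetch the next batch.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("{%s}record" % OAI_NS):
        identifier = rec.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        title = rec.findtext(".//{%s}title" % DC_NS)
        records.append((identifier, title))
    token = root.findtext(".//{%s}resumptionToken" % OAI_NS)
    # An absent or empty token means the list is complete.
    return records, ((token or "").strip() or None)


# Hypothetical response from a hypothetical IR, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:ir.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical eprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://ir.example.org/oai"))
    print(parse_list_records(SAMPLE))
```

A central archive would simply repeat this over its list of IRs,
following each resumption token until none is returned, and then
enhance what it has gathered. Nothing in the loop requires the author
to have deposited centrally in the first place.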

> but it has to be realised that science operates at the
> subject level and different disciplines move at different speeds and
> have different requirements.

Correct. So Wellcome would be doing its bit to accelerate OA provision
in the biomedical disciplines that it funds by requiring self-archiving
(as it does) and enhancing the data and metadata centrally in PMC
(as it does). But why on earth lose -- for no reason whatsoever -- the
potential to extend the scope and reach of its admirable self-archiving
requirement far *beyond* the biomedical disciplines it funds: by requiring
institutional self-archiving (except where an immediate back-up archive
is needed) and simply harvesting its fruits *with no loss whatsoever in
any of the desirable functionality you describe*?

> What is the most commonly held up example
> of an OA archive? arXive a subject based repository mirrored around the
> globe driven by the needs of the researchers not by their employing
> institution.

The Physics Arxiv became the (central) locus of a *practice* that
physicists were already engaging in 15 years ago, in on-paper days, which is
to share preprints of their work centrally, prior to publication. This
central preprint-sharing practice evolved, naturally, with the advent
of the Web, into self-archiving in a central archive. At exactly
the same time, however, another discipline, computer science, was
also self-archiving and sharing its papers, but not centrally! They
were self-archiving distributedly, institutionally, and it was all soon
being *harvested* centrally, by Citeseer http://citeseer.ist.psu.edu/
(harvested the hard way, before OAI-interoperability had made harvesting
far easier and more powerful and efficient). And Citeseer has more than
twice as many papers as Arxiv. And Arxiv in turn has far fewer physics
papers than the total number that are actually being self-archived: Physnet
http://de.physnet.net/PhysNet/physdoc.html is a central archive and data
enhancer, harvesting physics papers from Arxiv *and* from thousands of
distributed institutional servers (most not even OAI-compliant yet),
for a total of over 2,000,000 OA papers, not just the 320,000 in Arxiv.

    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/4536.html

Moreover, Citeseer too, and Physnet (and OAIster
http://oaister.umdl.umich.edu/o/oaister/) and many other central
harvesters (some subject-based, some multidisciplinary) are likewise
accessible around the world, and mirrored and cached as needed. Those
properties have nothing whatsoever to do with whether the primary
data-providers self-archive locally or centrally.

> For the life sciences the need is for integration between
> the literature and the data - as a strategic decision a global PMC
> offers the best long term solution: high quality ingestion and checking
> of papers, additional functionality by integrating literature with data,
> a more refined search facility tailored to the needs of life science
> researchers and a global data set.

I hope it is apparent by now how we keep talking at cross-purposes. You
keep speaking about virtues that I do not for a moment deny, but you
don't address my core point, which is that these self-same virtues can
be had, at no further cost, with no loss (and untold further benefits),
by "ingesting" (harvesting) the papers from IRs. Yet the virtues keep
being cited as a justification for making self-archiving mandates central
instead of institutional. (And the institutions are being "invited" to
ingest back their own output, if they like -- when the whole point has
been about a needlessly lost opportunity to make the Wellcome mandate
maximise OA by maximising the self-archiving of research output well
beyond the scope of the specific research Wellcome happens to fund.)

> This all creates the environment to make it happen so what about the
> encouragement? Well, from 1 October 2005 all new Trust grants will have
> to deposit their papers in PMC (we are currently working with NCBI to
> adapt the NIH ingestion system to allow this to happen) and from 1
> October 2006 we will extend this condition to all our existing grant
> holders. We are allowing a maximum delay of 6 months i.e. many
> journals, and growing, appear in PMC immediately after publication this
> is the maximum delay - an immediate release is still the preferred
> option.

One tiny parametric change in the Wellcome Policy could have made (and
still could make!) all the difference in the world: Make it a
grant-fulfillment condition of Wellcome funding that the fundees must
deposit their papers in their own IRs immediately upon acceptance for
publication (leaving it up to the fundees whether they set access as
"Open Access" or
"Institution-Internal Access") and then let Wellcome harvest ("ingest")
it 6 months later. (For fundees who have no IR yet, let them deposit it
in UKPMC immediately upon acceptance for publication, and Wellcome can
decide how soon it wants to set access as OA.)

That covers all bases -- and kills two birds (making Wellcome research
OA in PMC and helping to get all other research OA in the researchers'
own IRs) with one and the same stone. (The same suggestion was made to
NIH, and ignored.)

    "A Simple Way to Optimize the NIH Public Access Policy"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/4091.html

> This allows for a period of adjustment in a rapidly changing market
> something that the learned societies, an important component of the life
> science community, appear to require. These are pragmatic, realistic
> and necessary steps none of which limit the development of institutional
> repositories.

With 92% of journals having already given their green light to immediate
institutional self-archiving, the main thing that is missing is an immediate
institutional self-archiving mandate.

    http://romeo.eprints.org/stats.php

81% of authors (as we know from two international, interdisciplinary JISC
studies) would comply willingly, 14% more would comply reluctantly. Only
5% of authors say they would not comply if their employers and/or their funders
required self-archiving.

    http://www.ecs.soton.ac.uk/~harnad/Temp/alma-amst.pdf

For the 8% of journals that are not yet green, the solution is to deposit
their metadata and full texts immediately upon acceptance into the
author's IR, and leave it up to the author whether to make the full-text
immediately OA, or to set it Institution-Internal Access for the time
being, and meanwhile email the eprint to all would-be users who email
for the eprint on the basis of the metadata (author, journal, date,
title, etc.), which are of course OA in any case.

    http://www.ecs.soton.ac.uk/~harnad/Temp/berlin3-harnad.ppt

*That*'s what the margin for adjustment would be if Wellcome had mandated
institutional self-archiving. Let's hope you will reconsider it. And let's
hope (even more hopefully) that RCUK will mandate the right thing from
the outset.

All of these points have already been made clearly and forcefully in the
JISC reports that addressed these very questions. One hopes that, having
commissioned these studies, JISC (and RCUK) will heed their outcome:

    Swan, Alma and Needham, Paul and Probets, Steve and Muir, Adrienne and
    O'Brien, Ann and Oppenheim, Charles and Hardy, Rachel and Rowland,
    Fytton (2005) Delivery, Management and Access Model for E-prints
    and Open Access Journals within Further and Higher Education. JISC
    Report. http://cogprints.org/4122/

    Swan, Alma and Needham, Paul and Probets, Steve and Muir, Adrienne
    and Oppenheim, Charles and O'Brien, Ann and Hardy, Rachel and
    Rowland, Fytton and Brown, Sheridan (2005) Developing a model for
    e-prints and open access journal content in UK further and higher
    education. Learned Publishing. http://cogprints.org/4120/

Stevan Harnad
Received on Thu May 19 2005 - 23:18:53 BST
