Re: Central versus institutional self-archiving

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Sun, 8 Aug 2004 14:32:48 +0100

On Sun, 8 Aug 2004, Richard Durbin wrote:

> Stevan Harnad wrote:
>
>sh> I think the House Appropriations Committee was less wise in going on
>sh> to specify *where* grantees should self-archive their articles to make
>sh> them OA (in PubMed). Surely it is enough to mandate that they should
>sh> be made OA! For reasons discussed in an early posting in the American
>sh> Scientist Open Access Forum (reproduced below), it no longer makes any
>sh> difference where an article is self-archived, as long as the Archive is
>sh> OAI-compliant. In this regard, the recommendations of the UK Parliamentary
>sh> Science and Technology Committee
>sh> http://www.publications.parliament.uk/pa/cm200304/cmselect/cmsctech/399/39903.htm
>sh> which were released within a fortnight of the US recommendations were
>sh> wiser (though otherwise very similar). The UK did not stipulate that funded
>sh> research must be self-archived in a central OA Archive, only that it
>sh> must be self-archived, hence OA. (In fact, they expressed a preference for
>sh> Institutional Self-Archiving.)
>
> I disagree entirely with this. I believe that central open-access
> archiving is far superior to distributed open access archiving.

It is not clear whether Richard Durbin is here disagreeing with me (when
I suggest that the UK committee has made the wiser decision to prefer
institutional self-archiving over central self-archiving) or disagreeing
with the UK Committee's recommendation itself. I expect he means both.

But before replying to Richard, I wish to point out that there are two
underlying issues here.

    (1) OA THROUGH (1a) INSTITUTIONAL OA SELF-ARCHIVING OR THROUGH (1b)
    CENTRAL OA SELF-ARCHIVING: The first is a strategic disagreement about
    whether it is (i) central, discipline-based self-archiving or (ii)
    distributed, institution-based self-archiving -- i.e., which shade of
    "Green" -- that will more quickly and surely provide us with 100%
    OA (Open Access to all 2.5 million annual articles in all 24,000
    peer-reviewed journals).

The answer is: Both are needed, helpful and welcome, but there are
strong reasons why we should put most of our efforts into institutional
self-archiving.

The main reason is that institutional self-archiving has far greater
and faster potential to grow and generalize across disciplines and
institutions to generate 100% OA:

Central discipline-based archives are nonexistent for most disciplines,
complicated to create and maintain, and have big overheads and little
clout with authors.

Institutional archives are easy to create and maintain. Institutions
are many, house all disciplines, and have the clout to implement a
self-archiving mandate with their employees.

The OAI interoperability/harvesting protocol has made the two kinds
of archives completely equivalent functionally. Search engine features
are trivial and all of them can be easily implemented with either form
of self-archiving.

The real problem is providing content (OA), not providing search
functionality. It is institutions, not disciplines, that provide
the content, hence they are the ones in the position to implement
the mandate to provide OA, across all their disciplines.

    (2) OA THROUGH (2a) OA SELF-ARCHIVING ("Green") OR THROUGH (2b) OA
    JOURNAL-PUBLISHING ("Gold"): The second underlying disagreement is
    about whether the target is indeed 100% OA (to which self-archiving
    -- of either shade of Green -- will get us) or the target is in
    fact not 100% OA itself at all, but specifically OA *publishing*
    (Gold), in which case self-archiving (of either shade of Green)
    could at best only be a way-station to the real goal, or might at
    worst even be a competitor or obstacle to that goal.

The answer is: Both are needed, helpful and welcome, but there are strong
reasons why we should put most of our efforts into Green rather than Gold.

The main reason is that self-archiving has far greater potential to grow
and generalize across disciplines and institutions to generate 100% OA.

There is already at least three times more OA from OA self-archiving
(Green) (15%) than from OA journal-publishing (Gold) (5%).

It is far surer and easier to self-archive existing articles published in
existing non-OA journals (95%, of which 84% of those surveyed are already
Green) than to create new OA journals or try to persuade existing non-OA
journals to become OA journals (5%).

What is urgently needed by research and researchers now is 100% OA.

The reader should bear these two underlying themes -- (1) and (2) --
in mind in weighing the following comments from Richard Durbin:

> I know the OAI protocol allows search of distributed archives, but (a)
> its coverage is currently very poor, with no indication to me of how it
> will increase

Of course OAI coverage is very poor: That is because far too few
researchers are as yet providing OA to their articles! Remedying that
is what the two self-archiving mandates in question (US and UK) are about!

OA is currently being provided (3/4 of it via OA self-archiving,
1/4 via OA publishing) for somewhere between 10-20% of the 2.5 million
articles published annually in the world's 24,000 peer-reviewed journals
(although Swan & Brown's 2004 survey reported that 39% of authors have
self-archived at least one of their articles).

Swan & Brown's survey also pointed out the remedy for this, in
a finding I have by now quoted many many times! They asked their
sample of authors:

    "how they would feel if their employer or funding body required
    them to deposit copies of their published articles in... [OA
    archives]. The vast majority... said they would do so willingly..."

So this is precisely the missing "indication of how it will increase"
that Richard says does not exist!

Moreover, both the self-archiving recommendations under discussion here
-- the UK's institution-based one and the US's PubMed-based one -- are
recommendations to mandate precisely the remedy that authors themselves
have already indicated will induce them to self-archive willingly!

And, please note: so far this is all neutral, as between the two forms
of self-archiving, institutional and central.

    Swan, A. & Brown, S.N. (2004) JISC/OSI Journal Authors Survey
    Report. http://www.jisc.ac.uk/uploaded_documents/JISCOAreport1.pdf
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3628.html

    Swan, A. & Brown, S.N. (2004) Authors and open access
    publishing. Learned Publishing 2004:17(3) 219-224.
    http://www.ingentaselect.com/rpsv/cw/alpsp/09531513/v17n3/s7/

Richard Durbin then continues:

> (b) all current tools that have been proposed to me are hopeless in
> performance (quality and time) compared to Pubmed searching.

Pubmed is a wonderful resource:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

So is Pubmed Central (PMC)
http://www.pubmedcentral.nih.gov/

They are not the same.

    PubMed indexes over 14 million biomedical articles
    from 18,687 journals (back to the 1950's).
    http://www.ncbi.nlm.nih.gov/entrez/citmatch_help.html#JournalLists

To search in Pubmed is to search most of the biomedical literature,
for most of it is in there. Unfortunately, however, it is mostly only
in there in the form of metadata! The full-texts (for 4388 of these
journals) are accessible only via toll-access through the publisher.
http://www.ncbi.nlm.nih.gov/entrez/journals/loftext_noprov.html

Only a tiny minority of those full-texts are OA. PMC, in contrast, has
only OA journal full-texts, but there are only 158 such journals (i.e., 3%
of Pubmed's 4388 full-text journals):

    PMC indexes 12,798 biomedical articles
    from 158 journals (59 of them BioMed Central [BMC] journals)
    http://www.pubmedcentral.nih.gov/front-page/fp.fcgi

So if Richard is comparing distributed OAI search
(presumably using http://oaister.umdl.umich.edu/ )
with Pubmed search, he is comparing apples with oranges!

If Richard is comparing OAIster search with PMC search, then he
is comparing

     search across one impoverished pan-disciplinary corpus (OAIster,
     with its 3,420,891 records from 327 institutions [*including PMC*!],
     containing at most 10-20% of the content of the current annual 2.5
     million articles published in the world's 24,000 journals)

     with

     search in one of OAIster's own impoverished subsets (containing 3%
     of the content of the biomedical subset of Pubmed's 4388 full-text
     journals: the 158 PMC OA journals).

From this comparison Richard concludes that Pubmed (or PMC) search is
better (which of course it is!)! But Pubmed and PMC are not only better
because of their better search features (which can all, of course, be
fully duplicated by OAIster and by any other OAI search engine, whenever
we wish to implement them!): They are also better because PubMed (not PMC)
covers most of the biomedical literature: Except that unfortunately 97%
of that is not OA!

So, so far Richard has given no rationale at all for preferring central
self-archiving in general (or self-archiving in PMC in particular).
He has merely pointed out the obvious fact that Pubmed is a bigger and
better search engine, for a far bigger corpus -- which is mostly not OA!

Search engines and search engine features are *trivial*. Any and every
one of them can be implemented by any engine that so chooses. What
OAIster is missing most right now is not enhanced search-features but
increased content! And there is nothing OAIster itself can do about that:
It depends on authors self-archiving. That is what the recommended US
and UK mandates are meant to induce authors to do (in their own interests,
and the interests of research).

What is (or should be) at issue here is the relative advantages of
mandating central vs. distributed self-archiving -- in terms of their
likelihood and speed of generating 100% OA. So far, nothing Richard has
written has any bearing whatsoever on that fundamental question. Currently
impoverished search engine features are utterly irrelevant; currently
impoverished content, on the other hand, is absolutely fundamental,
and the first, second, third, and nth priority!

(If I were a biomedical researcher, by the way, I would search PMC first
today, and then supplement the search with OAIster, because whereas PMC
is the better search engine, OAIster has the wider coverage, including
all of BMC *plus* whatever is self-archived elsewhere!)

> The only useful articles I have found in repeated OAI searches in
> broad areas of molecular biology, bioinformatics and genomics have
> been in PMC (because they are Gold or 6-month Gold), and OAI searches
> have given them back poorly, encrusted with junk.

By all means search PMC first! You'll still only get 3% of the
literature that way, but then you can supplement it with OAIster
and get that 3% ("encrusted") plus perhaps somewhat more too...

> Search is what matters. We learnt this lesson early with genomic data.
> The value of openly available sequence data is in having it powerfully
> searchable, and that happened when it was deposited centrally.

The value in having openly available data is having it openly available.
The search power can be provided easily whenever it is needed; it is
getting the open availability that is the hard part.

No point starting from an inapplicable premise: For the genome, we
had the open data (suitably tagged, presumably!), and needed only the
search power. For the 2.5 million annual articles, we only have 10-20%
of the OA, and what we need is more OA, not more search-power!

Nor does the eventual search power depend *one bit* on housing all the
OA content in one place! Here is where Richard profoundly underestimates
the potential power of OAI interoperability, focussing instead only on
its current implementation -- when the real problem today is the missing
80-90% of the OA content, not the missing but easily provided
functionality!

The reasons distributed institution-based self-archiving is more likely
to generate that remaining 80-90% of the content have been discussed many
times in this Forum:

    "Central vs. Distributed Archives"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/0293.html

    "Central versus institutional self-archiving"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3206.html

Here again are the three main reasons why preferential institutional
self-archiving is the optimal strategy. For the details see:

    "Re: Mandating OA around the corner?"
     http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/3873.html

    (i) Given the existence since 1999 of the OAI archive-interoperability
    standard we are now in the age of distributed digital archives,
    all made interoperable by compliance with the OAI protocol
    http://www.openarchives.org/

This means it no longer makes any difference *where* a paper is
archived, as long as the archive is OAI-compliant. Harvesters
and search engines can treat distributed content exactly as they
do centralized content.
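
To make the interoperability point concrete, here is a minimal sketch
(not any existing harvester's code) of how a service could fold records
from an institutional archive and a central archive into one index. It
assumes both archives return OAI-PMH 2.0 ListRecords responses in
unqualified Dublin Core (oai_dc). The sample XML, the identifiers, the
archive labels and the helper names are all hypothetical, hand-written
for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def sample_response(identifier, title):
    """Build a tiny hand-written ListRecords response (illustrative only)."""
    return f"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>{identifier}</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>{title}</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text, archive_name):
    """Extract identifier and title from every record in one response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append({"id": identifier, "title": title, "archive": archive_name})
    return records


def merge(harvests):
    """Merge several harvests into one index keyed by OAI identifier."""
    index = {}
    for records in harvests:
        for record in records:
            index[record["id"]] = record
    return index


# Hypothetical identifiers: one institutional archive, one central archive.
institutional = parse_list_records(
    sample_response("oai:eprints.example-univ.org:101", "Paper A"), "institutional")
central = parse_list_records(
    sample_response("oai:central.example.org:7", "Paper B"), "central")

index = merge([institutional, central])
titles = sorted(record["title"] for record in index.values())
print(titles)
```

Once the records are in the index, nothing in it depends on which kind
of archive supplied them; the archive label is kept only for display.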

    (ii) The reason institutional OA self-archiving should be made the
    preferred option for the NIH self-archiving mandate is that that is
    the way to generalize the effects of NIH policy on OA in general,
    across disciplines and institutions.

Institutions share the access/impact benefits of OA with their own
authors, and they also share the costs of lost impact. If
research-funders mandate self-archiving, that guarantees OA for their
funded research, but what about the rest? If it all goes into a
dedicated central archive it is much less likely to propagate the OA
effect to other disciplines at the same institution than if it goes
into each institution's own OA archives.

http://archives.eprints.org/eprints.php?action=analysis

    (iii) Another reason central self-archiving should not be specifically
    stipulated or preferred is that a number of the 84% of journals
    that have given their Green Light to self-archiving have specified
    that this must be institutional self-archiving, not 3rd-party central
    archiving. http://romeo.eprints.org/publishers.html

This 3rd reason is minor (because the distinctions are silly
and groundless and can probably be ignored), but needlessly and
counterproductively mandating that the self-archiving must be central
would also needlessly raise this minor obstacle.

> Second, I keep hearing that Gold is 5% and Green 84%. But well under 5%
> of the articles in the 84% Green articles are actually made open access,
> at least articles of interest to me in fields of interest to me. So
> currently, Gold central archiving is more, not less successful than Green
> OAI distributed archiving in terms of article coverage, and it looks more
> promising to me.

This is simply incorrect, factually. About 5% of articles are OA via Gold,
and about 3 times as many (c. 15%) are OA via Green. (I had said that about
10%-20% of all articles are OA. That +/- 50% uncertainty applies to
both of these figures: Gold: 2.5%-5% and Green: 7.5%-15%.) But
the ratio is at least 3/1 Green/Gold.
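
The arithmetic behind those ranges can be checked in a few lines. This
is only a sketch using the figures quoted in this posting (10%-20%
total OA, split roughly 3/4 Green and 1/4 Gold); the variable names are
illustrative.

```python
# Figures quoted in this posting: 10%-20% of annual articles are OA,
# about 3/4 of that via self-archiving (Green) and 1/4 via OA journals (Gold).
total_oa_range = (10.0, 20.0)   # percent of all articles, low and high estimate
green_share = 0.75
gold_share = 0.25

green_range = tuple(total * green_share for total in total_oa_range)
gold_range = tuple(total * gold_share for total in total_oa_range)

# The ratio does not depend on which end of the 10%-20% range is correct.
green_to_gold = green_share / gold_share

print("Green:", green_range, "Gold:", gold_range, "ratio:", green_to_gold)
```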

On the other hand, it is true that 15% actual Green falls far short of
84% OA (let alone 100% OA). In other words, authors are not yet taking
up their own publishers' Green Light to self-archive in anywhere near
sufficient numbers -- and that is another reason why the US and UK
mandate recommendations are so timely, and so needed.

But Richard's quote above already begins to show the conflation between
the two distinct questions I mentioned at the outset:

   (1) OA THROUGH (1a) INSTITUTIONAL OA SELF-ARCHIVING OR THROUGH
                  (1b) CENTRAL OA SELF-ARCHIVING:

   (2) OA THROUGH (2a) OA SELF-ARCHIVING ("Green") OR THROUGH
                  (2b) OA JOURNAL-PUBLISHING ("Gold")

It may well be that Richard (in his genomic research, where the data
are already openly accessible) has found more useful articles among
the 158 Gold OA journals in PMC than among any articles from the 4230
non-OA journals that may also have been self-archived individually by
their authors in other archives. But the real question Richard should
be asking himself is what percentage of *all* the articles that he needs
are currently OA!

We know 3% of the biomedical literature is published in OA journals and
archived in PMC already. What we are talking about is how to raise that 3%
to 100%. Let us suppose we agree that a worldwide self-archiving mandate
along the lines proposed by the UK and US Committees is the way to get
that remaining 97% self-archived. What reason has Richard given for
mandating that they be self-archived in PMC? He likes the search engine
and PMC has most of the existing biomedical full-text journal content (3%!).

But what about the fact that a great deal of journal content is not biomedical?
And that a great deal of biomedical journal content is not funded by NIH?
And that a great deal of the biomedical journal content funded by NIH is
not published in the 158 biomedical OA Journals?

The self-archiving mandate is not a mandate to publish in OA Journals!
It is a mandate to make your article OA by self-archiving it *regardless
of which journal it is published in*! In fact, it is a mandate that is
aimed specifically at the content of the *non-OA* journals (all 95% of
them). The OA journal articles are already OA!

So Richard thinks that the fact that the PMC Archive contains the
158 OA journals and has good search features is a reason
why all the non-OA journal articles should be self-archived in PMC too!

It would be helpful if Richard explained why this is a reason, especially
in view of the diametrically opposing three reasons (i)-(iii) given
above -- together with the fact that the question (above) of where to
self-archive (1: institutional or central) is not at all the same question
as the question of where to publish (2: Green or Gold). "Central" does
not equal "Gold"! Far from it! In fact, the question is whether it makes
sense to mandate that individual articles from the 84% Green journals
must be self-archived centrally rather than institutionally.

> Few of the 84% currently support central archiving (most
> restrict to author's "own" web site, which is reasonably interpreted as
> institutional). However, if NIH mandate conversion to central Green I am
> told that the top journals that currently either are only local Green
> or no-SA at all will have no problem converting to central Green, and
> the others will follow.

This observation is circular: Of course if NIH mandates central
self-archiving specifically, this will exert pressure on the 84%
Green journals to give their Green Light to central self-archiving
too.

But those journals *already* give their Green Light to institutional
self-archiving now. And since not a single valid reason has been given
for preferring central to institutional self-archiving (while at least
3 have been given for preferring institutional self-archiving, which
already has the Green Light of 84% of the journals surveyed), to mandate
central self-archiving specifically is simply to put a gratuitous obstacle
in the path of OA (for no reason).

In contrast, if NIH mandates institutional self-archiving as the preferred
option (as well as offering PMC self-archiving as the back-up, in case
the author has no suitable institutional archive yet) then the effect of
the mandate will be not only to ensure that all current NIH-funded
content published in the 84% Green journals becomes OA, but also to exert
pressure on the remaining 16% of Gray publishers to go Green, or risk
losing their NIH authors to the 84% Green journals.

And because the mandate is institutional rather than central and
specific to PMC, it has a good chance of propagating to non-NIH-funded
and non-biomedical research at the same institutions too, as the UK
recommendation already does.

Because the institutional archives are all OAI-compliant, there is no
problem with PMC harvesting their metadata either. That could even be
part of the NIH stipulation:

    "Full-text must go in OAI-compliant OA Archive, preferably
    author-institutional. Metadata must also be deposited in PMC."

(Probably the easiest way is to set the metadata up to be harvested by
PMC from the OAI Archives automatically.)

Then all the extra power of PMC search would immediately accrue to all
those distributed OA/OAI articles.
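
As a rough sketch of what such automatic harvesting involves on the
requesting side, the function below builds OAI-PMH ListRecords request
URLs. The base URL is a made-up example, the function name is
hypothetical, and nothing here describes how PMC actually operates.
The only things taken from the OAI-PMH 2.0 specification are the verb
and argument names and the rule that resumptionToken is an exclusive
argument.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for one archive.

    from_date (YYYY-MM-DD) restricts the harvest to records changed since
    the last visit. When a resumption token is supplied it must be the
    only argument besides the verb, so the other arguments are dropped.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical institutional archive endpoint.
BASE = "http://eprints.example.org/oai"

first_page = list_records_url(BASE, from_date="2004-08-01")
next_page = list_records_url(BASE, resumption_token="abc123")
print(first_page)
print(next_page)
```

A harvester would request the first URL, read any resumptionToken in
the response, and keep requesting follow-up pages until no token is
returned.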

> The biological community is well on the way towards central archiving.

It is 3% along the way! The rest of the world is 10%-20% along the way.
What is the point?

> Please Stevan and other idealists on this group, stop acting to derail
> this. Central self-archiving is what has succeeded for physics, not
> distributed self-archiving.

Far from acting to derail this (and far from being idealists), the
advocates of distributed institutional self-archiving are trying to
*accelerate* self-archiving, and hence 100% OA, and we have reasons
and evidence -- reasons and evidence that you have to take into account
and counter with counter-reasons and counter-evidence, if you wish to
recommend otherwise. So far you have only said that PMC search is better
and you have found more useful OA articles in genomics in PMC than
elsewhere!

Yes, self-archiving in physics began centrally (in 1991). At about
the same time, it was also beginning distributedly, for example, in
computer science (and many other disciplines). The Physics Arxiv
http://www.arxiv.org/
has since reached about a quarter of a million centralized papers,
and Citeseer has harvested at least a half-million distributed
papers. http://citeseer.ist.psu.edu/cs

But the rate of growth of the Physics Arxiv has remained steadily linear
since 1991:
http://arxiv.org/show_monthly_submissions
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0043.gif

and that's just too slow! In contrast, distributed institutional
self-archiving seems to be growing faster (now curvilinear upward,
but after a much later and slower start).
http://archives.eprints.org/eprints.php?action=analysis
http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0043.gif
(The figures are a bit of a cheat, because they include all archives,
but that is the nature of distributed vs central archiving!)

What is needed to speed all of that up -- so we get from 10%-20% OA to
100% OA before we are all brain-dead! -- is that authors' institutions
and research-funders mandate that they self-archive their journal
articles. We already know from Swan & Brown's (2004) survey that the
majority of authors will willingly comply with such a mandate.

> Please don't use biology as a guinea-pig for a technology that as far
> as I can see has not yet proved itself for any discipline.

Mandating self-archiving is not about a *technology*: it is about a
policy, and a practice!

The problem is not with the search engines or with the locus of the
impoverished content we have so far: The problem is that the content of
all loci is impoverished. What is needed is more content, fast.

This is not a technological problem. It merely requires implementing
a mandatory self-archiving policy. And the most powerful and general
such policy -- the one that will apply to and propagate across all the
disciplines most quickly and surely -- is institutional self-archiving
(with central self-archiving always available as a back-up, and central
harvesting always made possible thanks to OAI-compliance).

> I agree that Green is somewhat better than nothing but central Green is
> much much better than distributed Green - Gold is a simple way to get
> central Green.

The task of sorting out the underlying logic and causality here is
rather daunting! "Green" means self-archiving one's own journal articles to
make them OA (regardless of whether the journal in which they are
published is Green or Gold).

We are talking about mandating self-archiving, hence mandating Green;
not about mandating Gold: i.e., mandating self-archiving of whatever
journal articles one publishes, not mandating that they must be published
in a Gold journal.

And we are disagreeing over whether it would be better to mandate
self-archiving (Green) in Distributed Institution-Based OA Archives or
Central Discipline-Based OA Archives (like PMC in biomedical disciplines).

Mandating Green (self-archiving) looks as if it will generate OA.
(It generates OA, not Gold, i.e., not OA journal publishing.) The
question is whether to prefer distributed or central self-archiving.
This is not Green vs. Gold, it is distributed vs. central!

Nor is "Gold a simple way to get central Green" -- quite the
opposite! Gold is already OA! It is not for the sake of the 5% of articles
that are already being published in OA journals that self-archiving
needs to be mandated: Those articles are already OA!

Self-archiving needs to be mandated to provide OA for the *rest*: the
95% of articles that are *not* published in the existing 5% of journals
that are gold. Trying to squeeze that remaining 95% of articles
into the 5% of journals that are gold is a
non-starter. Creating 4230 new gold biomedical journals or converting
the existing 4230 to gold is that other long and uncertain road to OA,
the golden road, which is specifically *not* the road to which mandated
self-archiving pertains!

(If anything, the green road *might* ultimately prove to be the road
to gold too. But long before that it will have been the road that led
us to 100% OA. That green road should not be obstructed by trying to
direct all traffic centrally: As with the Information Highway itself
[the Internet], distributed content-provision is the fastest, surest,
and most natural to the medium. OAI and the medium itself will ensure
that the functionality accorded to the distributed content will be at
least as good as anything that can be accorded to centralized content.)

I have given the reasons institutional self-archiving is a better bet.
It is time for Richard to explain why not. Otherwise we are just bypassing
one another.

> So I appeal to everyone on this list to support central open access
> archiving for biology as the house recommends, and encourage the UK to
> go the same way with national or international archives, rather than
> promote a more distributed solution.

And I appeal to everyone (on the basis of the reasons and evidence
adduced) to support mandated self-archiving, as both the US and UK
have recommended, but to encourage the US to go the same way as the UK,
preferring distributed institutional self-archiving, for the reasons
given.

Stevan Harnad

UNIVERSITIES: If you have adopted or plan to adopt an institutional
policy of providing Open Access to your own research article output,
please describe your policy at:
        http://www.eprints.org/signup/sign.php

UNIFIED DUAL OPEN-ACCESS-PROVISION POLICY:
    BOAI-2 ("gold"): Publish your article in a suitable open-access
            journal whenever one exists.
            http://www.earlham.edu/~peters/fos/boaifaq.htm#journals
    BOAI-1 ("green"): Otherwise, publish your article in a suitable
            toll-access journal and also self-archive it.
            http://www.eprints.org/self-faq/
    http://www.soros.org/openaccess/read.shtml

AMERICAN SCIENTIST OPEN ACCESS FORUM:
A complete Hypermail archive of the ongoing discussion of providing
open access to the peer-reviewed research literature online (1998-2004)
is available at:
    http://www.cogsci.soton.ac.uk/~harnad/Hypermail/Amsci/index.html
        To join the Forum:
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
        Post discussion to:
    american-scientist-open-access-forum_at_amsci.org
Received on Sun Aug 08 2004 - 14:32:48 BST
