Re: Central versus institutional self-archiving from Joseph Halpern on 2004-11-03 (American-Scientist-Open-Access-Forum)

From: Joseph Halpern <halpern_at_cs.cornell.edu>
Date: Wed, 3 Nov 2004 09:16:01 -0500 (EST)

Just a very brief response to Stevan's note:

- Stevan says:

> Fewer keystrokes, more self-archiving. Accepted. But now can we talk
> about the vast, sluggish majority that does *no* self-archiving at all?
> That's why the self-archiving mandate is needed.

  For what it's worth, in CS, my anecdotal impression is that almost
  all papers that I want to get are freely available on the web
  (typically in citeseer or on authors' home pages or both; occasionally
  on CoRR, the CS part of the arXiv; hardly ever on departmental
  archives; never, as far as I can recall, on university archives). So
  it seems that, at least in CS, the vast, sluggish majority are
  self-archiving somehow. This is not to say that it's not worth
  encouraging similar behavior in other fields though!

- Stevan says:

> (8) What is certain is that if OAI-compliant self-archiving is to be
> mandated, it is institutions that are in the natural position to implement
> the mandate and monitor compliance (probably at the departmental level),
> for it is institutions (and not disciplines) that share with their own
> researchers the benefits of maximising research impact, and the costs of
> losing research impact.

  I agree that if there is going to be a mandate then it will have to come
  from either the universities or from funding agencies. My guess is
  that it will be even more effective coming from funding agencies, but
  that is not an argument against having universities mandate it as
  well. (I have actually been trying to convince the NSF to impose
  just such a mandate -- unsuccessfully so far.) If there is to be a
  mandate at all, my distinct preference would be that it be to archive
  on *some* OAI-compliant server, and not necessarily to archive on the
  university server.

- Stevan says:

> (4) Logically and practically, if there existed a central, OAI-compliant
> archive for each discipline (and some central entity to foot the costs
> and maintain the entire disciplinary archive in each case, as the
> Physics ArXiv does today), then it would make absolutely no difference
> whether authors self-archived in their disciplinary OAI archive or their
> institutional archive.

  I disagree with this, at least the way things stand currently. In the
  case of many subfields in physics, the real "publication" of a paper
  (in the sense of "making public") happens when it is posted on the
  arXiv. Posting a paper on an institutional archive has a very
  different effect (in terms of the paper being noticed) than posting
  it on the arXiv. Maybe at some point it won't make a difference (when
  all archives are linked into a centralized virtual archive), but now
  it does.

- Stevan says:

> (2) What functionality does Joe think an individual OAI archive can
> provide for users (I am not speaking about depositing authors) that
> an OAI harvester and service provider could not provide, and better?

  I'm perhaps not imaginative enough to come up with lots of examples,
  but the type of thing that I had in mind was that an art history
  archive might provide particularly good ways of relating reproductions
  that would be important for art historians. Similarly, a
  genomics/computational biology archive might include gene sequencing
  data and ways of accessing it. Clearly, both examples involve going
  beyond just a repository of papers, but an archive of papers in a field
  might well evolve in the direction of providing more than just a
  collection of papers.

- Stevan, responding to Tom Wilson, says:

> > Perhaps, also, the various disciplinary archives may vary in what
> > they accept

> What they accept? It is journals that accept, and the target of OA is
> the postprints accepted by the journals. The preprints are another matter
> and not central to OA.

  For me as a CS researcher, I'm often interested in the preprints that
  haven't yet been accepted by journals. (The situation is more complicated in CS
  because conference papers are often never published in journals.) And
  the arXiv definitely does have policies on what is acceptable, and it
  varies by discipline. For example, the policy on CoRR is to accept
  any paper with CS content, even if it's blatantly incorrect. The
  physics arXiv tries to be (a teeny bit) more selective.

-- Joe

Date: Tue, 2 Nov 2004 03:20:23 +0000 (GMT)
From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
To: BOAI Forum <boai-forum_at_ecs.soton.ac.uk>
cc: Joseph Halpern <halpern_at_cs.cornell.edu>
Subject: Re: What's happening in open archives?

On Sun, 31 Oct 2004, Prof. Tom Wilson wrote:

> Quoting Joseph Halpern <halpern_at_cs.cornell.edu>:
>
>jh> My guess is that CS researchers will typically not put their
>jh> papers on university servers unless required to do so, simply because of
>jh> laziness.

It is true of just about *all* researchers that they will typically
not put their papers on any server unless they are required to do so
(laziness). If the problem of achieving 100% OA were merely the problem of
getting those who already self-archive in some way or other (i.e., those
who are not lazy) to do it in some other way (be it central disciplinary
server, institutional server, departmental server, or home page) then
we would not need a self-archiving mandate at all, and we would be almost there!

It is important to keep this reality in mind in what follows, otherwise
all we are doing is meditating on our favorite way to self-archive,
rather than solving the problem of getting the non-self-archivers
to self-archive, so we can reach 100% OA.

I would suggest setting aside for the moment those who already
self-archive, and how they do it, and focussing on those who do not
(the lazy ones).

>jh> There's less overhead in putting a paper on your home page
>jh> than there is in putting it on a university server and authors know that,
>jh> once it's on citeseer, their paper is easily accessible (and, I would
>jh> guess, more likely to be seen than on a university server).

(1) The number of keystrokes it takes to self-archive a paper on one's
home-page may be a few (not many) fewer, but that is not the point: The
real problem (and the relevant laziness) is that of those who are *not*
doing those keystrokes *at all*, not that of those who are doing too few!

(2) Since the advent of the OAI protocol (1999), OAIster and citebase,
there is no difference whatsoever in either ease of accessibility or
likelihood of being seen, between a paper in an OAI archive (whether
institutional, disciplinary, or departmental) and a paper harvested by
citeseer. If anything, the advantage is the other way (because citeseer
is not OAI-compliant).

(3) Let us not mix up (i) the fact that citeseer is a (harvested) central
disciplinary archive that happens to be quite *populated* with (ii)
other facts about citeseer (such as that it is central, disciplinary,
or in the CS field).

(4) The salient feature of citeseer is that it is *harvested*. If
citeseer trawled for self-archived full-texts in physics or biology --
or even (surprisingly!) social science -- instead of computer science,
it would be populated too. (Possibly not as populated as in computer
science, but one can't be sure of that either.) Our webwide trawls for
OA full-texts using ISI-based citations in Biology and Social Science
are currently generating a hit rate of 10-15%.

    http://opcit.eprints.org/oacitation-biblio.html
    http://citebase.eprints.org/isi_study/
    http://www.crsc.uqam.ca/lab/chawki/ch.htm

(5) Hence what is really being compared here is not institutional
versus disciplinary archives, but harvested versus non-harvested
(full-text) archives.

(6) So let us not compare apples and oranges: The right comparison is
whether the probability (and rate) of reaching 100% OA is higher (a) if
authors do fewer keystrokes and we instead design more full-text
trawlers and harvesters like citeseer, or (b) if authors do a few more
keystrokes (to make their full-texts OAI-compliant) and then OAIster
(etc.) can just harvest their metadata, as they were designed to do.

(7) And this is entirely independent of whether self-archiving needs to
be mandated in order to ensure that we reach 100% OA soon enough.

(8) What is certain is that if OAI-compliant self-archiving is to be
mandated, it is institutions that are in the natural position to implement
the mandate and monitor compliance (probably at the departmental level),
for it is institutions (and not disciplines) that share with their own
researchers the benefits of maximising research impact, and the costs of
losing research impact.
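
As a rough illustration of what "harvesting their metadata" in point (6)
involves, here is a minimal Python sketch of the harvester side of the OAI
protocol: it builds a ListRecords request and reads unqualified Dublin Core
fields out of a response. The archive address, the record, and the helper
names (list_records_url, parse_records) are all hypothetical, and the
response is an abbreviated sample embedded in the code rather than fetched
from a live archive; a real harvester would also need to handle resumption
tokens and errors.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for one archive."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Extract identifier, title, creators and links from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
            "links": [e.text for e in rec.iterfind(".//dc:identifier", NS)],
        })
    return records

# Abbreviated, made-up response from a hypothetical institutional archive.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:archive.example.edu:42</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Postprint</dc:title>
          <dc:creator>Author, A.</dc:creator>
          <dc:identifier>http://archive.example.edu/42/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://archive.example.edu/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["id"], "|", record["title"], "|", record["links"])
```

The same two functions work unchanged whether the base URL belongs to an
institutional, departmental, or disciplinary archive, which is the sense in
which the location of the deposit does not matter to the harvester.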

Tom Wilson replies (to Joe Halpern):

> ...perhaps loyalty
> to a discipline is stronger than loyalty to an institution, which can
> vary over an academic career. And your comment, "unless required to do
> so" chimes in with my earlier point about academic authors needing
> some motivation to submit to institutional archives.

I'm afraid that several factors are again being mixed up here:

(1) "Loyalty to a discipline" is an abstraction, and an irrelevant one,
here: Disciplines do not count an author's publications, weigh their impact,
and employ and fund him accordingly. His institution does (and to a
certain extent his research funders do too). If an author elects
to self-archive so as to maximize his research's visibility, access,
usage and impact, this is primarily for the sake of his research itself,
and his own career, for which all the carrots and sticks are in the hands
of his institution (and funder), not his discipline. (So much for
self-archiving out of loyalty to one's discipline!)

(2) The author motivation in question is not specific to self-archiving
in his institutional archive: It is the motivation to self-archive at
all. (That motivation is so as to maximize his research's visibility,
access, usage and impact.)

(3) The author's institution's motivation likewise needs to be
taken into account; and that too is to maximize the visibility, access,
usage and impact of its own employees' research output. For authors and their
institutions, as noted, share in the benefits of enhanced research impact,
as well as in the costs of lost research impact. Hence it is authors'
institutions that wield the carrot/stick for maximizing impact through
self-archiving, and have both the means and the interest to monitor
compliance (both for themselves, and for their employees' research
funders, who likewise have a stake in the impact of the research they
fund, and often subsidise their fundees' institutions with substantial
overheads). Again, "discipline loyalty" has nothing to do with any of
this.

(4) Logically and practically, if there existed a central, OAI-compliant
archive for each discipline (and some central entity to foot the costs
and maintain the entire disciplinary archive in each case, as the
Physics ArXiv does today), then it would make absolutely no difference
whether authors self-archived in their disciplinary OAI archive or their
institutional archive. But this is not the case today: There are few
disciplinary archives, and the burden and complexity of creating and
maintaining them are substantially greater than offloading and
distributing the load on each individual university (and its departments)
for its own research output alone -- at far lower cost, per university,
along with a far more natural means of monitoring compliance.

(5) But because of the "laziness" problem noted earlier (let us henceforth
call it, more charitably, "sluggishness"), merely creating archives,
be they central or institutional, is not enough: Self-archiving needs to
be mandated, and compliance needs to be monitored and rewarded,
just as publishing itself needed to be mandated, and compliance needed
to be monitored and rewarded. And in both cases, the natural candidate
for mandating and monitoring the requisite practice (for the author's
own good!) is the author's institution (backed up by his research funder).
"Loyalty to a discipline" has absolutely nothing to do with it.

Two footnotes:

(6) Authors changing institutions is trivial in an OAI-compliant world. It
is simple for the author's new institution to automatically harvest the
metadata as well as the full-texts from the author's old OAI-compliant
institutional archive. (Wanting to remove one's work from the old
institution is as absurd as wanting to remove it from the shelves of one's
old library -- or any library!)

(7) An article's or author's "discipline" -- in a digital, distributed,
OAI-compliant world -- is not a *place* but a metadata tag (or rather
several of them, as few disciplines are hermetic and autonomous).
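
To make footnote (7) concrete, here is a small Python sketch, under invented
data, of a "discipline" treated as a metadata tag rather than a place:
records harvested from different institutional archives are grouped into
virtual disciplinary collections purely by their subject tags. The record
identifiers, titles, subject labels, and the helper name index_by_subject
are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records harvested from two institutional archives.
RECORDS = [
    {"id": "oai:univ-a.example:1", "title": "Paper one",
     "subjects": ["computer science", "economics"]},
    {"id": "oai:univ-b.example:7", "title": "Paper two",
     "subjects": ["physics"]},
    {"id": "oai:univ-b.example:9", "title": "Paper three",
     "subjects": ["computer science"]},
]

def index_by_subject(records):
    """Map each subject tag to the identifiers of the records carrying it."""
    index = defaultdict(list)
    for rec in records:
        for subject in rec["subjects"]:
            index[subject].append(rec["id"])
    return dict(index)

if __name__ == "__main__":
    for subject, ids in sorted(index_by_subject(RECORDS).items()):
        print(subject, "->", ids)
```

A paper with several subject tags simply appears in several virtual
collections, with no need to deposit it more than once.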

> But is there really 'less overhead' in putting something on your own
> home page? If I (and I am talking personally, rather than generally)
> put something on my own home page it involves a degree of labour in
> converting to html - if I simply send, say, a Word document to the
> organizer of the institutional archive, all that is needed is an
> e-mail attachment. But perhaps the perception that it IS more trouble
> is part of the problem??

This is just the extra-keystroke saga again: It takes a few keystrokes
to convert, a few to email, not many more to self-archive.

I wonder how many people who have expressed strong opinions about what is
and is not feasible/optimal have ever actually gone through the motions
of self-archiving one of their papers in an OAI-compliant archive? And
even those few keystrokes are an over-estimate, as all subsequent
papers can "clone" most of the repeated metadata and enter only what is
new. And the proof that the problem is not the *number* of keystrokes
but the sluggishness, simpliciter, is that even in institutions such as
St. Andrews, which have established a proxy self-archiving service that
will do all the keystrokes *for* the author

    "Let us Archive it for you!"
    http://eprints.st-andrews.ac.uk/proxy_archive.html

the cupboards are nearly bare, waiting for a self-archiving mandate
(just as there would be few publications at all, if not for the
"publish or perish" mandate). As the Swan & Brown (2004) survey
reports: the majority of authors state that they will willingly
self-archive if it is mandated (but not otherwise).

    http://www.ingentaselect.com/rpsv/cw/alpsp/09531513/v17n3/s7/

(I invite those authors who would like to actually see what they are
talking about when they say self-archiving calls for too many keystrokes
to self-archive one paper in http://demoprints.eprints.org/ and sample
self-archiving for themselves.)
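
The "cloning" of repeated metadata mentioned above can be pictured with a
short Python sketch. This is not how any particular archive software
implements it; the field names and the helper clone_metadata are
hypothetical, chosen only to show that a second deposit needs just the
fields that differ from the first.

```python
def clone_metadata(previous, **changes):
    """Copy a previous deposit's metadata and override only what is new."""
    record = dict(previous)   # shallow copy; the earlier deposit is not modified
    record.update(changes)    # only the fields that differ need to be typed in
    return record

# Hypothetical first deposit by an author.
FIRST_DEPOSIT = {
    "authors": ("Author, A.",),
    "affiliation": "Example University, Department of Computer Science",
    "subjects": ("computer science",),
    "title": "A first paper",
    "year": 2003,
}

if __name__ == "__main__":
    second = clone_metadata(FIRST_DEPOSIT, title="A second paper", year=2004)
    print(second)
```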

>jh> My own strong preference is for discipline-based archives, rather than for
>jh> institutional archives. The arXiv is extremely successful because, for
>jh> large areas of physics, that's *the* place to have your paper appear if
>jh> you want people to be aware of it.

Yes, but how long are we willing to keep waiting for other discipline-based
archives to be created and filled? The Physics ArXiv has been around
for nearly 15 years now, and no other has sprouted since. The only
two other comparable-sized disciplinary collections are *harvested* ones:
Citeseer (Computer Science) and RePEc (Economics). Harvesting, as noted
earlier, is for the already-converted (who have self-archived in any
which way already). The problem, however, is the sluggish, who are
still the vast majority -- in *every* discipline. They're the ones for
whom the self-archiving mandate is needed. (Even Physics, at Arxiv's
present linear growth rate, unchanged since 1991, will not be 100%
OA for at least another 10 years).

  http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving_files/Slide0043.gif

But employers cannot mandate the creation of central disciplinary
archives: They can only mandate the creation and filling of their own
archives, for each of their own disciplines (departments). Research
funders *can* create their own archives (not exactly disciplinary, but
dedicated to the research they fund), and NIH is on the verge of doing
just that; but even there, the mandate is far more likely to propagate
to non-NIH-funded research and to other disciplines if it is
implemented institutionally:

    "A Simple Way to Optimize the NIH Public Access Policy"
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/4091.html

(Besides: The likely causality is that Arxiv is today preferred by its
users because of the many papers that are in it -- not that the many
papers are in it because it was preferred by users: ArXiv was just a
natural online adaptation of a "self-archiving" practice among certain
physicists that preceded the Internet. If OAI had come before ArXiv,
the same physicist practice would naturally have been implemented
institutionally. It was pre-OAI functionality that dictated central
archiving back in 1991.)

>jh> It is also easier to tailor a
>jh> discipline-based archive to the needs of a discipline; I can well
>jh> imagine that different features might be appropriate for an archive of
>jh> computer science papers than for an archive of papers in art history.
>jh> Of course, if all archives eventually hook together, this point may become
>jh> moot, but I think it holds for now.

(1) Of course all OAI-compliant archives -- including ArXiv -- will
eventually hook together. (They already do, via OAIster, but since ArXiv
is the only one of them with critical mass, there's no advantage for
ArXiv-users motivating them to use it through OAIster, nor any advantage
in upgrading OAIster's functionality to match and exceed ArXiv's --
though that can and will easily be done when there are more archives
with critical mass in OAIster.)

    http://oaister.umdl.umich.edu/o/oaister/

(2) What functionality does Joe think an individual OAI archive can
provide for users (I am not speaking about depositing authors) that
an OAI harvester and service provider could not provide, and better?

Tom Wilson:

> Again, Thomas Krichel noted that the needs of disciplines would probably vary

How, specifically, (1) for the user, and (2) for the author?

> and I think that differences may also relate to such factors as the delay
> between submission and publication in a field,

How? The self-archiving of the pre-refereeing preprint is (and must remain)
optional for the author. The self-archiving of the peer-reviewed final
draft ("postprint") can and should be done the moment it is accepted
for publication. What are the discipline differences, and the factors,
and the role of the length of the delay between preprint and postprint?

> the significance of primacy of discovery in the discipline,
> and national cultures.

He who wants to ensure primacy can (optionally) self-archive the
preprint (or date-stamp it in some other way). So what is the point
here? Remember that the target of OA is the postprint, the target
literature being the peer-reviewed journal literature.

> Perhaps, also, the various disciplinary archives may vary in what
> they accept

What they accept? It is journals that accept, and the target of OA is
the postprints accepted by the journals. The preprints are another matter
and not central to OA.

> and if an archive has a policy of accepting anything it may play
> against that archive being accepted as a reliable source.

The mark of reliability is the journal that accepted the postprint.
Unrefereed preprints are tagged as such. Caveat emptor. Nor is this new.
Scholars and scientists have always been able to distinguish between
refereed publications and unrefereed drafts. Archives are merely
access-providers, not certifiers of reliability: Journals are the
certifiers of reliability.

And before the predictable next question is asked: "What certifies that
a postprint has indeed been accepted?" -- please read the self-archiving
FAQs on "Certification" and "Authentication":

    http://www.eprints.org/self-faq/#5.Certification
    http://www.eprints.org/self-faq/#2.Authentication

> There is obviously a researchable topic here and perhaps the more
> that is known about the appeal to authors in different disciplines
> and different countries of different modes of 'open access'
> availability, the easier it will be to devise policies that stand
> some chance of working in different contexts.

Open Access is already 10 years overdue. I suppose we could now turn
to researching disciplinary and national differences, but it seems
to me we'd be better off just going ahead and mandating self-archiving,
at last. It is unlikely that the outcome of the research would be
the only result that could counter-indicate doing this for any discipline,
namely, that researchers in that discipline would *not* benefit from
maximising the visibility, access, usage and impact of their research
output. For that is all that OA does, and is intended to do.

>jh> Full disclosure: I run the computer science part of the arXiv. Despite
>jh> the predominance of citeseer in the computer science community, I
>jh> believe that the CS arXiv plays an important role both because of issues of
>jh> copyright (when they post their papers on the arXiv, authors
>jh> explicitly give the arXiv permission to post papers) and because of
>jh> stability (since the Cornell library has assumed responsibility for
>jh> the arXiv, there's some assurance it will be around for the long term).
>
> Permanence is clearly important but, as the Deputy Director of a university
> computer services said to me when we were discussing archiving and I loosely
> used the words 'in perpetuity', "Nothing is for ever!" :-)

What needs to be forever is the journal's published version of the
article (the one subscribed to by libraries). The immediate purpose of
OA is immediate *access-provision* to all those would-be users whose
institutions cannot afford access to the journal's published version,
in order to maximize its visibility, access, usage and impact.
These author-provided self-archived supplementary versions (although
they can and will be preserved) do not have the primary preservation
burden, and it is a great mistake to delay access and impact on the
assumption that they do.

> Joseph Halpern wrote:
>
>jh> My own sense is that the more the better. What I've told the librarians
>jh> at Cornell is that I'd ultimately prefer one submission process that
>jh> would simultaneously put my papers on all relevant archives. But if I
>jh> were to choose just one archive, I'd choose a discipline-based archive.

This is all splendid, but preaching to the converted: The problem is the
vast, sluggish majority who do no keystrokes, and self-archive nowhere,
not those who don't do enough keystrokes, and don't self-archive in
enough places!

>tw> But is there really 'less overhead' in putting something on your own
>tw> home page?
>
>jh> Of course, this depends on the university. But typically, with
>jh> university archives, there are forms to fill out (title, author,
>jh> abstract, etc.). None of that is required on one's homepage (although
>jh> some of us do put that information there).

This is just haggling again over the number of keystrokes, when the
problem, again, is the vast, sluggish majority who still do no
self-archiving keystrokes at all.

>tw> But perhaps the perception that it IS more trouble is part of the problem??
>
>jh> It may well be in some cases. And certainly there's not much overhead
>jh> involved in any case. But we're all busy people. There's no question
>jh> that part of the success of citeseer is due to the fact that so little
>jh> overhead is involved.

Fewer keystrokes, more self-archiving. Accepted. But now can we talk
about the vast, sluggish majority that does *no* self-archiving at all?
That's why the self-archiving mandate is needed.

>jh> We're hoping to see journals archived on the arXiv. For journals, some
>jh> assurance of permanence is critical.

The only journals you'll see archived in ArXiv are OA journals (by
definition, unless you only mean preservation archiving for back issues).
OA journals are not the problem. Non-OA journals are, and they require
author self-archiving for the current impact of current articles. That
said, hosting OA journal-archiving too is a fine service -- if journals
can be persuaded to want it!

Stevan Harnad
Received on Wed Nov 03 2004 - 14:16:01 GMT
