How Green Open Access Supports Text- and Data-Mining
Stevan Harnad
Version with hyperlinked references:
http://openaccess.eprints.org/index.php?/archives/310-guid.html
SUMMARY: Data-mining robots like SciBorg can harvest Green OA
full-texts, self-archived in their authors' Institutional Repositories
(IRs) and "repurpose" them for better functionality. A Green publisher
has endorsed the author's posting of his Green OA postprint in his
own IR, free for all. The postprint is the author's own refereed,
revised final draft. The author can certainly revise that draft
further, making additional corrections, updates and enhancements,
including marking it up in XML and adding comments. Those corrections
need not be done by the author's own hands: They could be done by a
graduate student, a collaborator, a secretary, or a hired hand. The
author could also have SciBorg "repurpose" his postprint -- under
one trivial condition, easily fulfilled, which is that the locus
of the enhanced postprint, the URL from which users must download
it, remains the author's own IR, not a 3rd-party website. It would
be highly inimical to the progress of Green OA mandates to insist
instead that the Green publisher's endorsement to self-archive the
postprint in the author's IR is "not enough" -- that the author must
also successfully negotiate with the publisher the retention of the
right to assign to 3rd-party harvesters like SciBorg the right to
publish a "derivative work" derived from the author's postprint.
Peter Murray-Rust, in "Why Green Open Access does not support text-
and data-mining", wrote:
http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=702
> PM-R: "the first thing to do is to gather a corpus of documents... any
> other scientist should be able to have access to it. It therefore has
> to be freely distributable"
Agreed. So far this is just bog-standard OA. If the original
documents are self-archived as Green OA postprints in their
authors' Institutional Repositories (IRs), your SciBorg robot
http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html can harvest them
and data-mine them, and make the results freely accessible (but linking
back to the postprint in the author's IR whenever the full-text needs
to be downloaded).
> PM-R: "[At SciBorg] we are interested in machines understanding
> science"
Fine. Let your SciBorg machines harvest the Green OA full-texts and
"repurpose" them as they see fit.
> PM-R: "almost all articles are copyrighted and non-distributable.
> Publisher Copyright is a major barrier... you can't just go out and
> compile a wordlist or whatever as you may infringe copyright or
> invisible publisher contracts (we found that out the hard way)"
You can't do that if you are harvesting the publisher's proprietary
text, but you can certainly do that if you are harvesting the author's
Green OA postprints.
> PM-R: "PDFs are so awful... we have to repurpose them by converting to
> HTML, XML and so on"
Fine.
> PM-R: "Now the corpus is annotated. Expert humans go through line
> by line... It is this annotated corpus which is of most use to the
> scientific community"
Fine.
> PM-R: "So suppose I find 50 articles in 50 different repositories, all
> of which claim to be Green Open Access. I now download them, aggregate
> them and [SciBorg] repurpose[s] them. What is the likelihood that some
> publisher will complain? I would guess very high"
Complain about what, and to whom? A Green publisher has endorsed the
author's posting of his Green OA postprint in his IR, free for all. The
postprint is the author's own refereed, revised final draft. Now follow
me: Having endorsed the posting of that draft, does anyone imagine that
the publisher would have any grounds for objection if the author revised
it further, making additional corrections and enhancements? Of course
not. It's exactly the same thing: the author's Green OA postprint.
So what if the author decides to mark it up as XML and add comments? Any
grounds for objections? Again, no. Corrections, updates and enhancements
of the author's postprint are in complete conformity with posting his
postprint.
Suppose the author did not do those corrections with his own hands, but
had a graduate student, a secretary, or a hired hand do them for him,
and then posted the corrected postprint? Still perfectly fine.
Now suppose the author had your SciBorg "repurpose" his postprint: Any
difference? None -- except a trivial condition, easily fulfilled, which is
that the locus of the enhanced postprint, the URL from which users can
download it, should again be the author's IR, not a 3rd-party website
(that the publisher could then legitimately regard as a rival publisher
-- especially if they were selling access to the "repurposed" text).
So the solution is quite obvious and quite trivial: It is fine for the
SciBorg harvester to be the locus of the data-mining and enhancement of
each Green OA postprint. It can also be the means by which users search
and navigate the corpus. But SciBorg must not be the locus from which
the user accesses the full-text: The "repurposed" full-text must be
parked in the author's own IR, and retrieved from there whenever a user
wants to read and download it, rather than just to search and surf the
entire corpus via SciBorg.
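The division of labour just described can be sketched in a few lines of
Python. This is only a minimal illustration under stated assumptions, not
a description of how SciBorg actually works: the class names, the methods
and the example.org / example.edu hosts are all hypothetical. The point it
shows is that the harvester keeps the derived annotations and answers
searches, while every full-text download resolves to the author's IR.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class PostprintRecord:
    """One harvested Green OA postprint, as a harvester might index it (hypothetical)."""
    title: str
    ir_url: str  # location of the full text in the author's own IR
    annotations: list[str] = field(default_factory=list)  # derived terms and mark-up


class HarvesterIndex:
    """Hypothetical index: searchable at the harvester, full text served only by the IR."""

    def __init__(self, harvester_host: str) -> None:
        self.harvester_host = harvester_host
        self.records: list[PostprintRecord] = []

    def add(self, record: PostprintRecord) -> None:
        # Refuse any record whose full-text URL points at the harvester itself.
        if urlparse(record.ir_url).hostname == self.harvester_host:
            raise ValueError("full text must be hosted in the author's IR")
        self.records.append(record)

    def search(self, term: str) -> list[PostprintRecord]:
        # Searching and navigating the corpus happens at the harvester.
        term = term.lower()
        return [
            r for r in self.records
            if term in r.title.lower()
            or any(term in a.lower() for a in r.annotations)
        ]

    def download_url(self, record: PostprintRecord) -> str:
        # Reading and downloading always goes back to the author's IR.
        return record.ir_url


index = HarvesterIndex("harvester.example.org")  # assumed host name
paper = PostprintRecord(
    title="An example postprint",
    ir_url="https://ir.example.edu/12345/",  # assumed IR address
    annotations=["chemistry", "named entities"],
)
index.add(paper)
hits = index.search("chemistry")
links = [index.download_url(hit) for hit in hits]
```

In this sketch the "repurposed" draft itself would be whatever the author
has deposited at ir_url; the harvester stores only what it needs for
search and navigation.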
Not only does this all sound silly: it really is silly. In the online
age, it makes no functional difference at all where a document is
actually physically located, especially if the document is OA! But we
are still at the interface between the paper age and the OA era. So we
have to be prepared to go through a few silly rituals, to forestall any
needless fits of apoplexy, which always mean delay (for OA).
So the ritual is this: It would be highly inimical to the progress of
Green OA mandates to insist that the publisher's endorsement to
self-archive the postprint in the author's IR is not enough -- that the
author must also successfully negotiate with the publisher the retention
of the right to assign to 3rd-party harvesters like SciBorg the right to
publish a "derivative work" derived from the author's postprint. That
would definitely be the tail wagging the dog, insofar as OA is
concerned, and it would put authors off providing Green OA (and
hence put their institutions off mandating it) for a long time to come.
Instead, when SciBorg harvests a document from a Green OA IR, SciBorg
must make an arrangement with the author that the resultant "repurposed"
draft will be deposited by the author in the author's IR as an update of
the postprint. Then when a user of SciBorg wishes to retrieve the
"repurposed" draft, the downloading site must always be the author's IR,
not a draft hosted by and retrieved directly from SciBorg.
This ritual is ridiculous, and of course it is functionally unnecessary,
but it is pseudo-juridically necessary, during this imbecilic
interregnum, to keep all parties (publishers, lawyers, IP specialists,
institutions, authors) calm and happy -- or at least mutely resigned --
about the transition to the optimal and inevitable that is currently
taking place. Once it's over, and we have 100% Green OA, all this
papyrophrenic nonsense can be dropped.
Please, Peter, be prepared to adapt SciBorg to the exigencies of this
all-important (and all too slow-footed) transitional phase, rather than
trying to adapt the status quo to SciBorg, at the cost of still more
delays to OA.
> PM-R: "Only a rights statement actually on each document would allow
> us to create a corpus for NLP without fear of being asked to take it down"
No. Green OA authors with standard copyright agreements are not in a
position to license republication rights to SciBorg or any other 3rd
party. Let us be happy that they have provided Green OA at all, and let
SciBorg be the one to adapt to it for now, rather than vice versa.
Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and
Swan, A. (2007) Incentivizing the Open Access Research Web:
Publication-Archiving, Data-Archiving and Scientometrics. CTWatch
Quarterly 3(3).
http://eprints.ecs.soton.ac.uk/14418/
Stevan Harnad
Received on Wed Oct 17 2007 - 05:14:32 BST