On the Deep Disanalogy
Between Text and Software and
Between Text and Data
Insofar as Free/Open Access is Concerned
Stevan Harnad
It would be a *great* conceptual and strategic mistake for the movement
dedicated to open access to peer-reviewed research (BOAI)
http://www.soros.org/openaccess/ to conflate its sense of "free
vs. open" with the sense of "free vs. open" as it is used in the
free/open-source software movements. The two senses are not at all the
same, and importing the software-movements' distinction just adds to
the still widespread confusion and misunderstanding that there is in
the research community about toll-free access.
I will try to state it in the simplest and most direct terms possible:
Software is code that you use to *do* things. It may not be enough to
let you use the code for free to do things, because one of the things you
may want to do is to modify the code so it will do *other* things. Hence
you may need not only free use of the code, but the code itself has to
be open, so you can see and modify it.
There is simply *no counterpart* to this in peer-reviewed research
article use. None. Researchers, in using one another's articles, are
using and re-using the *content* (what the articles are reporting), and
not the *code* (i.e., the actual words in the text). Yes, they read the
text. Yes (within limits) they may quote it. Yes, it is helpful to be able
to navigate the code by character-string and boolean searching. But what
researchers are fundamentally *not* doing in writing their own articles
(which build on the articles they have read) is anything faintly analogous
to modifying the code for the original article!
I hope that that is now transparent, having been pointed out and written
in longhand like this. So if it is obvious that what researchers do with
the articles they read is not to modify the text in order to generate a
new text, as programmers may modify a program to generate a new program,
then where on earth did this open/free source/access conflation come from?
And there is a second conflation inherent in it, namely, a conflation
between research publishing (i.e., peer-reviewed journal articles) and
public data-archiving (scientific and scholarly databases consisting of
the raw and processed data on which the research reports are based).
Digital data archiving (e.g., the various genome databases, astrophysical
databases, etc.) is relatively new, and it is a powerful *supplement*
to peer-reviewed article publishing. In general, the data are not *in*
the published article, they are *associated with* it. In paper days, there
was not the page-quota or the money to publish all the data. And even
in digital days, there is no standardized practice yet of making the raw
data as public as the research findings themselves; but there is definite
movement in that direction, because of its obvious power and utility.
The point, however, is this: As of today, articles and data are not
the same thing. The 2,000,000 new articles appearing every year in the
planet's 20,000 peer-reviewed journals (the full-text literature that
-- as we cannot keep reminding ourselves often enough, apparently --
the open/free access movement is dedicated to freeing from access-tolls)
consist of articles only, *not* the research data on which the articles
are based.
Hence, today, the access problem concerns toll-access to the full-texts
of 2,000,000 articles published yearly, not access to the data on which
they are based (most of which are not yet archived online, let alone
published; and, when they *are* archived online, they are often already
publicly accessible toll-free!).
No doubt research practices will evolve toward making all data
accessible to would-be users, along with the articles reporting the
research findings. This is quite natural, and in line with researchers'
desire to maximize the use and hence the impact of their research. What
may happen is that journals will eventually include some or all of the
underlying data as part of the peer-reviewed publication itself (there
may even be "peer-reviewed data"), but in an online digital supplement
only, rather than in the paper edition.
(What is *dead-certain*, though, is that, as this happens, authors will
not be idiotic enough to sign over copyright for their research data to
their publishers, the same way they have been signing over copyright
for the texts of their research reports! So let's not even waste time on
that implausible hypothetical contingency. The research community may be
slow off the mark in reaching for the free-access that is already within
its grasp in the online era, but they have not altogether taken leave
of their senses!)
But that bridge (digital data supplements), if it ever comes, can be
crossed if/when we get to it. Right now, when we are talking about
the peer-reviewed literature to which we are trying to free access, we
are talking about *articles* and not about *data*. Hence, exactly as
in the conflation of text with software in the invalid and misleading
open/free source analogy, the conflation of open/free full-text access to
the refereed literature with hypothetical questions about data-access
and data re-use and re-analysis capability is likewise invalid and
misleading. Article-access and data-access are different, and it is only
the first that is at issue today.
Open/free access -- (in this flurry of definitional fussiness and fancy
one no longer knows which word to use!) -- to the refereed research
literature is already vastly overdue, even though it has been 100%
within our practical reach for several years now.
http://cogprints.soton.ac.uk/documents/disk0/00/00/16/85/index.html
Research usage and impact and productivity are still being needlessly
lost daily, in untold quantities, because of access-denial by
toll-barriers. Why on earth do we keep wasting our time, energy
and attention on minor diversions and irrelevancies, while keeping
the solution to the real, pressing problem on hold, as we ponder the
ramifications of incoherent analogies with software and with
data-archiving, when there is a real job to be done: freeing (sic)
full-text access to the planet's yearly 2,000,000 peer-reviewed research
articles, now!
http://www.nature.com/nature/debates/e-access/Articles/harnad.html
I will now quote/comment this latest variant of that Protean microbe
that keeps on infecting us with Zeno's Paralysis in our progress along
the road to the optimal and inevitable. In the past, the source of this
persistent virus and its ever-mutating variants had been the adversaries
of free access (some toll-access publishers), as well as its over-timorous
potential beneficiaries (researchers, librarians, administrators).
http://www.ecs.soton.ac.uk/~harnad/Tp/resolution.htm#8
But now the paralysis-inducing bug is also originating from the ranks of
free-access activists, who risk balkanizing the free-access movement by
driving an ideological wedge between "free" and "open," despite the fact
that nothing substantive is to be gained, and only more time to be lost
thereby. I will pass to quote/comment mode to illustrate this:
On Thu, 14 Aug 2003, Matthew Cockerill wrote:
> The open source software community [uses] the shorthand 'free, as in beer'
The open/free distinction in software is based on the modifiability of the
code. This is irrelevant to refereed-article full-text. (And the beer
analogy was silly and uninformative in both cases! Lots of laughs, but
little light cast.)
> Sure, if you are given some limited access to something and that access is
> 'free, as in beer', that can be very useful.
> In the world of software, say, that would apply to Windows Media Player,
> which you can download for free from the Microsoft website (even though the
> software itself is highly proprietary, and Microsoft would not take kindly
> to you reverse-engineering it or distributing a modified version).
This is all irrelevant to article-access, except that toll-access
publishers can, like every other product- or service-provider, use partial
or temporary give-aways as a marketing "hook." Temporary access is not free
access (or rather it is free access only while it is free). And partial
access is free only for whatever it is access to, not for what it is
not access to. (We're all "non-smokers" while we are asleep...)
But none of this provides any basis at all for the analogy with
proprietary code, as in software, nor with any need for code
modifiability, whatsoever.
> But free/open source software is more than 'free as in beer', it is 'free as
> in speech', and this offers hugely significant extra freedoms (which is why
> open source software has had such a revolutionary effect on the software
> industry).
This free-beer/free-speech analogy was already dubious in the software case
(not all programmers wish to give away their code [the freedom to produce
non-give-away products/services is a freedom too!], either for use or
for modification, or both; and my speech, whether spoken or written,
is spoken/written for you to hear/read, not for you to alter or to claim
to have been your own words, whether in unaltered or altered form; and
we are free to say or write what we like, as long as it is indeed our
own words and ideas [some of this enforceable by law, most of it only
enforceable by social convention -- these days with some help from
technology], etc., etc.).
But never mind. We will not try to repair another domain's incoherent
analogy here; but, please, let us not import it where it just sows still
more confusion in an already confused terrain: Refereed-research-article
authors (unlike the authors of most other forms of "written speech")
are not interested in earning access-royalties from the sale or use of
their words. They just want their words *used,* as much as possible. (That's
"research impact.") But to use their words is not to modify their *form*
(the code) and then re-issue them, perhaps as the modifier's own. To use
their words is to use their *content*, by incorporating that content
into the user's own content, in his *own* words, with proper source
attribution, so as to produce another text, another "written speech."
It would be nice if all programmers were willing and motivated to make
all their code free, not just for use, but for modification too. It would
also be nice if the writers of all words were willing and motivated to
make their words free, not just for use, but for modification too. But
alas humans and their egos and their selfish genes are monadic, not
distributed and diffuse, and their motivation is usually local, and quid
pro quo. So there will always be programmers who program only if it pays
by the unit-sold, and they may want the credit as well as the first-dibs
at modification and development. Nolo contendere there.
But the same is true of writers. Some will always want to be paid for
access to their words by the unit-sold, and virtually all will want to
keep their own words as their own alone.
http://cogprints.ecs.soton.ac.uk/archive/00001700/index.html
Refereed-article writers, however, don't want to be paid for access to
their words, any which way, because access-tolls reduce the usage of
their ideas and findings, and usage is what they really want to maximize
(because that research impact is what brings them their rewards, both
financial and scholarly/scientific). Because the words are in natural
language, there is no question of researchers concealing their code
(if they choose to publish at all). But what they want you freely using
is its *content* (with proper attribution). There is no question of your
modifying its form. As software does not have this form/content duality,
the analogy simply does not apply; it is incoherent.
> The Free Software Foundation defines these freedoms as:
> * The freedom to run the program, for any purpose (freedom 0).
Inapplicable to text: "Running the program" is accessing the text.
> * The freedom to study how the program works, and adapt it to your needs
> (freedom 1). Access to the source code is a precondition for this.
Irrelevant to text. You may study and use the *content* of my (giveaway,
refereed-article) text (with attribution) in any way you like, and you
may quote it (with attribution). That's all. And there all analogy
between text and software ends.
There are also many new software-based uses (indexing, search,
navigation, digitometric analyses) that one can make of online text,
which refereed-article authors also welcome, but the big hurdle is free
full-text access, and not these perks, which will come with the territory.
But no reprocessing of *my* text code in order to turn it into *your*
text code (other than via its content, as processed by your brain)!
(And remember that data, and data-processing, are not part of
refereed-article text.)
> * The freedom to redistribute copies so you can help your neighbor
> (freedom 2).
Moot for text, when all you need redistribute is the URL of its toll-free
full-text online.
> * The freedom to improve the program, and release your improvements to the
> public, so that the whole community benefits (freedom 3). Access to the
> source code is a precondition for this.
> (see http://www.gnu.org/philosophy/free-sw.html )
Irrelevant to refereed-article text. You may only improve on the content,
in text of your own, with proper attribution. (And again, data re-analysis
is an orthogonal matter.) Only *I* can improve on my own text.
> This philosophy fits exceptionally well with the needs of the scientific
> community to share and build on each others research, which is why very many
> academic software development projects are developed using an open source
> model.
Scientific *software*. But we were talking about scientific-article
*text*, and this was supposed to be an analogy! There is no counterpart
to collective software development at the article-code level. It is only
*content* that the scientific community develops collectively -- and even
that, only while faithfully tracking sources through citation (and
quotation, where verbatim text is used).
Nor did the collective, cumulative use of scientific content require any
cues from the software community! Open-source *content* has been the
rule with scholarship for centuries: That's why scholars *publish*. The
new question is only about online-access to their content (via their
text). Please let's not forget or obscure that fundamental new question
in this welter of free-associative digital analogies of doubtful relevance
and coherence.
> BioMed Central's policy of Open Access is based on giving the scientific
> community a similarly broad freedom to make use of the research articles
> that we publish.
The scientific community already has the freedom to make use of
published articles. What it lacks is toll-free access to their texts!
> This includes giving access to the structured form of the articles,
We're back to XML mark-up again: a perk, a welcome perk, but we first,
and far more urgently, need the basics, namely, toll-free access to the
full-text. Please let us focus on that, rather than getting side-tracked
onto perks, especially those that make it seem as if free access were
somehow not enough, somehow not "truly open." We do not have free access
today. We don't need advice on the shortcomings of free access; we need
help in getting free access, as soon as possible.
> and giving the right to redistribute and create derivative works
> from the articles.
I've already replied to this in an earlier posting: When the full-text
is online and toll-free, the only relevant mode of "redistribution" is
to distribute the URL. Ditto for "derivative works." Quotes, as always,
require attribution. And text without attribution may be neither "re-used"
nor modified. So what is really the point here?
> This isn't just a philosophical issue - it has practical implications:
>
> e.g. in the August 14 issue of Nature (Vol 424 p727), Donat Agosti, from the
> American Museum of Natural History, New York, laments the fact that the
> www.antbase.org database of ant taxonomy is missing much critical
> information because a large fraction of all descriptions of new ant species
> are covered by publisher copyright.
I couldn't follow this. If the database is toll-free, the database is
toll-free. If making the database useful requires toll-free access to
the full-text of refereed-articles, then the full-text of
refereed-articles needs to be made accessible toll-free! We knew that
already! What is the point of all these further free-associations and
free-floating analogies? We are running in circles instead of breaking
out of the circle.
> In a true Open Access environment, not only could Antbase link to the
> articles on the publishers web site, but it could also make use the images
> and the text within those published descriptions to compile a universal and
> authoritative catalog of Ant taxonomy.
Translation: We need free access not only to the database, but to the
full-text. This can be clearly seen without conflating the two. (Please
jettison this "true open access" locution, or save it for when we at
last have universal false-but-toll-free full-text access, and we have
nothing more urgent left to do than to optimize it further. My guess
is that the rest will already have come with the territory of its own
accord. But please, let's go for the territory, before the "truth"
[see Keats quote at the end of this posting]).
> Finally, to respond to Sally's point questioning the benefits of
> deposition in a standard repository:
I re-read Sally Morris's point, and I now see that (in agreeing on #5)
I misconstrued it as addressing only the trivial differences between
the types of "databases" -- "archives," "repositories": how we unfailingly
prefer to fuss with and multiply terminological trivia instead of
staying focussed on matters of substance! -- in which a full-text might
be deposited (e.g., Eprints vs Dspace, or central vs. institutional). I
now realize that Sally was referring there to BioMedCentral's (BMC's)
[requirement? recommendation?] that BMC authors self-archive their BMC
full-texts in an open-access database such as PubMed Central. Hence what
my reply to Sally should have been was this:
>sm> 5) Whether the item and/or its metadata are deposited in certain
>sm> types of databases (this last seems to me supremely irrelevant)
I agree it's irrelevant, if by "certain
type" you mean, say, Eprints vs. Dspace.
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2670.html
But it's certainly not irrelevant whether the item (full-text)
is deposited in *some* type of database *at all*, for if it
is not deposited in a free-access database of *some* type,
it is not free access!
Whether that database type is institutional and distributed,
disciplinary and central, or the toll-free access database of an
open-access or a toll-access publisher is an implementational
and strategic matter. And whether or not that database is
OAI-compliant is a matter of functionality and efficiency
(interoperable OAI-compliant databases greatly preferred!).
> Although theoretically it might not matter where something is available, or
> in what format, it should be clear that in practical terms these are
> absolutely vital issues.
Absolutely vital *relative to what*? In practical terms, we do not
have free full-text online access to most of the refereed literature
(2,000,000 annual articles, in 20,000 refereed journals) today. What
is absolutely vital is getting that free access, now, and putting an
end at last to the needless daily impact-loss that continues until that
happens. Whether that free access is via this type of archive or that,
and has or lacks these perks or those, is certainly not the absolutely
vital issue today. On the contrary, foregrounding such minor details
when we still lack the basics, and thereby raising the goal post for
what we should all be aiming for, slows and diverts rather than speeds
progress.
Free access, now! Never mind the rest until we have those long-overdue
basics in hand, at last!
> So for example, theoretically, every DNA sequencing
> lab could put up its own web page and make available the sequences they
> themselves have obtained, using their own choice of format. The scientific
> community would thereby have free access to all those DNA sequences.
Correct. And this has absolutely *nothing* to do with the free-access
movement, which is about toll-free access to the 2M articles in the 20K
toll-access journals, not about data-archiving, which is a parallel but
independent development that proceeds apace, and does not need
free-access's (or publishers') permission! (Data-archiving, on the
other hand, might help accelerate article-archiving!)
http://www.ecs.soton.ac.uk/~harnad/Temp/data-archiving.htm
> But in
> fact, the deposition of all DNA sequences in a standard format with Genbank
> has a truly enormous benefit in practical terms, and has served as a crucial
> foundation for the development of tools to mine the genome. PubMed Central's
> role as a repository for biomedical research articles is very much
> analogous to Genbank's role as a repository for DNA sequence data.
An archive is an archive. There is an analogy (as well as a
complementarity) between data-archives and article-archives, but the
big difference is that both data archiving and data-archives are (1)
new, and (2) do not have a prior tradition and current status quo of
being non-free, whereas articles are (1) old, and (2) do have a prior
tradition and current status quo of being non-free. Publishers' relatively
new toll-based online article-archives are also non-free. So the relevant
point about article archiving is that article-archives should be free.
"that is all ye know on earth, and all ye need to know"
Stevan Harnad
NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
or
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/index.html
Discussion can be posted to: american-scientist-open-access-forum_at_amsci.org
Received on Sat Aug 16 2003 - 23:52:27 BST