Re: Nature launches web debate "Future e-access to the primary literature"

From: Declan Butler <dbutler_at_cybercable.fr>
Date: Wed, 5 Sep 2001 22:27:34 +0100

      06 September 2001
      Nature 413, 1 - 3 (2001) © Macmillan Publishers Ltd.

The future of the electronic scientific literature

The Internet's transformation of scientific communication has only begun,
but already much of its promise is within reach. The vision below may change
in its detail, but experimentation and lack of dogmatism are undoubtedly the
way forward.


"The Internet is easier to invent than to predict" is a maxim that time has
proven to be a truism. Much the same might be said of scientific publishing
on the Internet, the history of which is littered with failed predictions.
Technological advance itself will, of course, bring dramatic changes — and
it is a safe bet that bright software minds will punctually overturn any
vision. But it is becoming clear that developing common standards will be
critical in determining both the speed and extent of progress towards a
scientific web.

'Standards' for managing electronic content are hardly a riveting topic for
researchers. But they are key to a host of issues that affect scientists,
such as searching, data mining, functionality and the creation of stable,
long-term archives of research results. Moreover, just as the Internet and
web owe their success to agreed network protocols on which others were able
to build, common standards in science will provide a foundation for a
diversity of publishing models and experiments and be a better alternative
to 'one-size-fits-all' solutions.

This explains why the Open Archives Initiative (OAI), one of many
alternatives now on offer to scientists for disseminating their work, has
broadened its focus from e-prints to promoting common web standards for
digital content.

The reason is that some of the most promising emerging technologies will
realize their full potential only if they are adopted consensually by
entire communities. At the level of the online scientific 'paper', one
major change, for example, is a shift in format to make papers more
computer-readable. Searches will become much more powerful; tables and
figures will cease to be flat, lifeless objects, and users will instead be
able to query and manipulate them with suites of online visualization and
data-analysis tools.

This is being made possible by Extensible Markup Language (XML), which
allows a document to be tagged with machine-readable 'metadata', in effect
converting it into a sort of mini-database. Most web pages today are coded
in HTML, but HTML describes only a page's appearance. Whereas HTML
specifies title and author information simply as headings, such as:
<H1> The future of the electronic scientific literature </H1>
<H3>by John Smith</H3>
XML specifies these in a way that computers can understand:
<articletitle>The future of the electronic scientific literature</articletitle>
<author><firstname>John</firstname> <lastname>Smith</lastname></author>
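
As a rough illustration of what such tagging buys, the following Python
sketch (using only the standard xml.etree library and the hypothetical tag
names from the example above) reads the title and author back out of the
record as data, rather than as text to be displayed:

import xml.etree.ElementTree as ET

# The record reuses the hypothetical tags from the example above.
record = """
<article>
  <articletitle>The future of the electronic scientific literature</articletitle>
  <author><firstname>John</firstname> <lastname>Smith</lastname></author>
</article>
"""

root = ET.fromstring(record)
title = root.findtext("articletitle")
author = root.findtext("author/firstname") + " " + root.findtext("author/lastname")
print(title)   # The future of the electronic scientific literature
print(author)  # John Smith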

The possibilities for tagging are endless. But a major need now is for
stakeholders to agree on common metadata standards for the basic structure
of scientific papers. This would allow more specific queries to be made
across large swathes of the literature. Indeed, what is above all hampering
the usefulness of today's online journals, e-print archives and scientific
digital libraries is the lack of means to federate these resources through
unified interfaces.

The OAI has agreed metadata standards to facilitate improved searching
across participating archives, which can therefore be queried by users as if
they were one seamless site. The OAI is attractive compared with centralized
archives in that it allows any group to create an archive that, by adhering
to common standards, becomes part of a greater whole. The idea is catching
on: it is supported by the Digital Library Federation (DLF), a consortium of
US libraries and agencies, including the Online Computer Library Center.
CrossRef, a collaboration of 78 learned society and commercial publishers,
in which Nature's publishers are taking a leading role, is also actively
developing common metadata standards that would allow better cross-searching
of the 3 million articles they hold.
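
As an illustration of how such federation works in practice, here is a
rough Python sketch of harvesting minimal metadata from an OAI-compliant
archive; the repository address is invented, but the ListRecords request
and the oai_dc (Dublin Core) metadata format are part of the OAI harvesting
protocol itself:

import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical repository address; ListRecords and oai_dc are standard OAI parameters.
BASE_URL = "http://archive.example.org/oai"
REQUEST = BASE_URL + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(REQUEST) as response:
    tree = ET.fromstring(response.read())

def local_name(tag):
    # Strip the XML namespace so the same code works across protocol versions.
    return tag.rsplit("}", 1)[-1]

# Print the core Dublin Core fields of every harvested record.
for element in tree.iter():
    if local_name(element.tag) in ("title", "creator", "date"):
        print(local_name(element.tag) + ": " + (element.text or ""))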

Minimal options
As metadata are expensive to create — it is estimated that tagging papers
with even minimal metadata can add as much as 40% to costs — OAI is
developing its core metadata as a lowest common denominator to avoid putting
an excessive burden on those who wish to take part. But even these skimpy
metadata already allow one to improve retrieval. This strategy is sensible
as it acknowledges the fact that the value and nature of scientific
information are heterogeneous.

Minimal metadata will suffice for much of the literature. But there will
increasingly be sophisticated and novel forms of publications built around
highly organized communities working off large, shared data sets. These hubs
will stand out by their large investment in rich metadata and sophisticated
databases. The future electronic landscape should see such high added-value
hubs evolving as overlays to vast but largely automated literature archives
and databases.

At such an early stage of development, it is essential to avoid dogmatic
solutions. Not all papers will warrant the costs of marking up with
metadata, nor will much of the grey literature, such as conference
proceedings or the large internal documentation of government agencies. Many
high-cost, low-circulation print journals could be replaced by digital
libraries. Overheads would be kept low, and economics suggests that the
cheapest means of handling the bulk of the literature may be automated
digital libraries. Tags automatically generated from machine analysis of the
text, for example, might minimize the quantity of manual metadata needed.
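
A deliberately naive Python sketch of the idea, based on simple word
counting with an invented stop-word list; real systems use far more
sophisticated text analysis:

import re
from collections import Counter

# Small, illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is", "that", "for", "are"}

def candidate_tags(text, n=5):
    # Keep the most frequent content words as candidate subject tags.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 3)
    return [term for term, _ in counts.most_common(n)]

abstract = ("Signalling proteins regulate gene expression; microarray "
            "analysis of gene expression reveals signalling pathways.")
print(candidate_tags(abstract))   # e.g. ['gene', 'expression', 'signalling', ...]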

Or take ResearchIndex, software produced by the computer company NEC, which
builds digital libraries with little human intervention. It gathers
scientific papers from around the web and, using simple rules based on
document formatting, can extract the title, abstract, author and references.
It interprets the latter, and can conduct automatic citation analyses for
all the papers indexed. Such digital libraries will also provide new tools,
for example to generate new metrics based on user behaviour, which will
complement and even surpass citation rankings and impact factors.
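
A toy Python version of that kind of rule-based processing, reduced to its
simplest form; the 'papers' and the extraction rule are invented for the
example, and ResearchIndex itself is far more elaborate:

from collections import Counter

def extract_references(paper_text):
    # Crude formatting rule: everything after a 'References' heading,
    # one reference per non-empty line.
    _, _, tail = paper_text.partition("References")
    return [line.strip() for line in tail.splitlines() if line.strip()]

papers = [
    "Results...\nReferences\nSmith J. The future of e-literature. 2001.",
    "Other results...\nReferences\nSmith J. The future of e-literature. 2001.\n"
    "Jones A. Metadata standards. 2000.",
]

# Tally how often each reference is cited across the collection.
citation_counts = Counter(ref for paper in papers
                          for ref in extract_references(paper))
for reference, count in citation_counts.most_common():
    print(count, reference)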

At the other end of the spectrum, specialized communities organized around
shared data sets will produce highly sophisticated electronic
'publications', making it much more arduous for authors to submit
information because of the amount and detail they will be required to enter
in machine-readable form. Take the Alliance for Cellular Signaling (AfCS), a
10-year, multimillion-dollar, multidisciplinary project run by a consortium
of 20 US institutions. It is taking a systems view of proteins involved in
signalling, and integrating large amounts of data into models that will
piece together how cellular signalling functions as a whole in the cell.
Here, authors would be required to input information, for example, on the
protocols, tissues, cell types, specific concentration factors used and the
experimental outcomes. Inputs would be chosen from menus of strictly defined
terms and ranges, corresponding to predefined knowledge representations and
vocabularies for cell signalling.
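
In software terms, such a submission system amounts to validating each
entry against agreed vocabularies before it is accepted. The sketch below
is purely illustrative: the field names and allowed terms are invented, not
those actually used by the AfCS.

# Invented controlled vocabulary, for illustration only.
CONTROLLED_VOCABULARY = {
    "cell_type": {"cardiac myocyte", "B lymphocyte"},
    "tissue":    {"heart", "spleen"},
    "protocol":  {"ligand stimulation", "knockdown"},
}

def validate_submission(entry):
    # Return the fields whose values fall outside the agreed vocabulary.
    return [field for field, allowed in CONTROLLED_VOCABULARY.items()
            if entry.get(field) not in allowed]

submission = {"cell_type": "cardiac myocyte",
              "tissue": "heart",
              "protocol": "ligand simulation"}   # typo: not an allowed term

print(validate_submission(submission))   # ['protocol']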

The idea is that, rather than simply producing their own data, communities
instead create a vast, shared pool of well-structured information, and
benefit by being able to make much more powerful queries, simulations and
data mining. A series of 'molecule pages' would also pull together virtually
all published data and literature about individual molecules in relation to
signalling.

Indeed, the high-throughput nature of much of modern research means that,
increasingly, important results can be fully expressed only in electronic
rather than print format. Systems biology in particular is driving research
that seeks to describe the function of whole pathways and networks of genes
and proteins, and to cover scales ranging from atoms and molecules to
organisms. Increasingly, the literature and biological databases will
converge to create new forms of publications. Other disciplines stand to
benefit, too.

Helping machines make sense of science on the web
Many communities, including the AfCS, are building ontologies to underpin
such schemes. Ontologies mean different things to different people, but they
are in effect representations that attempt to hard-code human knowledge
about a topic and its intrinsic relationships in ways that computers can
use. The microarray community has been very active in this area. The
Microarray Gene Expression Database group has coordinated global standards;
as a result, users will be able to query vast shared data sets to find all
experiments that use a specified type of biological material, test the
effects of a specified treatment or measure the expression of a specified
gene, and much more.
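
A simplified Python illustration of the kind of query that shared standards
make possible; the experiment records below are invented:

# Invented experiment records tagged with standardized fields.
experiments = [
    {"id": "E1", "material": "liver tissue", "treatment": "heat shock"},
    {"id": "E2", "material": "liver tissue", "treatment": "none"},
    {"id": "E3", "material": "cell line",    "treatment": "heat shock"},
]

# Find every experiment using a given biological material and treatment.
matches = [e["id"] for e in experiments
           if e["material"] == "liver tissue" and e["treatment"] == "heat shock"]
print(matches)   # ['E1']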

One major problem is that genes and proteins often have different names in
different organisms, and these often say little about what they do. To get
round this problem, the Gene Ontology (GO) Consortium is creating tree-like
ontologies of the 'molecular function', 'biological process' and 'cellular
component' of gene products. All genes involved in 'DNA repair', for
example, would be mapped to the corresponding GO term, irrespective of their
name or source organism. A microarray gene-expression analysis that
previously yielded only names of expressed genes would in addition carry
mapped GO terms that might reveal, say, that half the genes are involved in
'protein folding'. GO terms can also help to federate disparate databases.
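
An illustrative sketch of that mapping in Python; the gene names and their
GO annotations below are invented for the example, whereas in practice they
would come from the GO Consortium's curated annotation files:

# Invented gene-to-GO-term annotations, for illustration only.
GO_ANNOTATIONS = {
    "geneA": {"protein folding"},
    "geneB": {"protein folding", "response to stress"},
    "geneC": {"DNA repair"},
    "geneD": {"protein folding"},
}

expressed_genes = ["geneA", "geneB", "geneC", "geneD"]

# What fraction of the expressed genes are annotated to 'protein folding'?
folding = [g for g in expressed_genes
           if "protein folding" in GO_ANNOTATIONS.get(g, set())]
print(len(folding) / len(expressed_genes))   # 0.75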

Ontologies can also be used to tag literature automatically, and will be
particularly useful for grey literature and archival material for which
manual tagging was not justified. Papers tagged automatically with concepts
can be matched, grouped into topic maps and mined. By breaking down
terminological barriers between disciplines, this should also enhance
interdisciplinary understanding and even serendipity. Nature is actively
investigating such possibilities.

The GO ontologies are still very incomplete, however, and the internal
relationships need to be enriched. Moreover, caution is required against
prematurely pigeon-holing gene functions, given the uncertainty of most
annotations. Ontologies are also the focus of intensive research in
computing science, and biology is not yet up to speed on this. Efforts such
as GO and the Bio-Ontologies Consortium deserve support. Indeed, given the
shortcomings of existing ontologies and controlled vocabularies, there may
be a case for creating a more organized international effort to ensure
economy of effort, interoperability and sharing of expertise.

The advent of structured papers that are increasingly held in literature
databases blurs further the distinction between the scientific paper and
entries in biological databases. Already, entries in the biological
databases are often hyperlinked to relevant articles in the literature and
vice versa, and CrossRef is developing standards for such linking. As text
becomes more structured, it will be possible to increase the sophistication
of linking, data manipulation and retrieval.
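
CrossRef's linking is built on Digital Object Identifiers (DOIs), so in its
simplest form a cross-link is just a DOI carried in a database entry and
turned into a resolvable address; the entry and DOI below are made up for
illustration:

# The dx.doi.org resolver redirects a DOI to the publisher's copy of the article.
DOI_RESOLVER = "http://dx.doi.org/"

def doi_to_url(doi):
    # Build a clickable link from a bare DOI string.
    return DOI_RESOLVER + doi

database_entry = {"gene_product": "example kinase",
                  "cited_doi": "10.1234/hypothetical.2001"}   # made-up DOI
print(doi_to_url(database_entry["cited_doi"]))
# http://dx.doi.org/10.1234/hypothetical.2001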

Biological databases and journals have evolved relatively independently of
one another. Database annotations lack the prestige of published papers;
indeed, their value is largely ignored by citation metrics, and their upkeep
is often regarded as a thankless task. Database curation has consequently
lacked the quality control typical of good journals. The convergence between
databases and the literature means that database annotators and curators
will increasingly perform the functions of journal editors and reviewers,
while publishers will develop sophisticated database platforms and tools.

New ways in
Database- and metadata-driven systems will move publication interfaces
beyond simple keyword searches towards ones that reflect the structure of
biological information. Visualization tools for chromosomal location,
biochemical pathways and structural interactions may become the obvious
portals to the wider literature, given that there are far fewer protein
structures or gene sequences than there are articles about them. As Mark
Gerstein, a bioinformaticist at Yale University, points out: "One might 'fly
through' a large three-dimensional molecular structure, such as the
ribosome, where various surface patches would be linked to publications
describing associated chemical binding studies."

Future electronic literature will therefore be much more heterogeneous than
the current journal system, and dogmatic solutions should be
resisted. It is significant and sensible that both CrossRef and OAI have
made key strategic choices favouring openness and adaptability. They seek to
federate distributed actors rather than to create centralized structures.
They also make their work independent of the type of content, leaving it
flexible enough to incorporate and link seamlessly not just papers but news,
books and other media.

Crucially, both OAI and CrossRef have also decided to build systems
independent of the economic mechanisms surrounding that content. Many
publishers, in particular some learned societies, may be willing to make
their content free, perhaps after a certain delay. Others are exploring
business models where authors or sponsors pay, which would allow free access
to articles on publication. The open technological frameworks also mean that
particular communities, such as scientists with specific metadata needs for
their discipline, are free to build in more complex data structures; the
higher overheads incurred may require charging for added-value services.

Neutrality
The OAI and CrossRef strategies therefore differ fundamentally from more
centralized systems such as PubMed Central (PMC), operated by the US
National Library of Medicine, and E-Biosci, being developed by the European
Molecular Biology Organization.

But PMC and E-Biosci highlight the urgent need to index the full text of
papers and their metadata and not just abstracts, as is the practice of
PubMed and other aggregators. Services that require publishers to deposit
full text only for indexing and improving search are useful.

Unfortunately, PMC, unlike E-Biosci, confounds this primarily technological
issue with an economic one, by requiring that all text be made available
free after, at most, one year. It is regrettable that PMC has not in the
first instance sought full-text indexing itself as a goal, as this in itself
would be an immediate boon to researchers. It would also probably have been
more successful in attracting publishers.

The reality is that all of those involved in scientific publishing are in a
period of intense experimentation, the outcome of which is difficult to
predict. Getting there will require novel forms of collaboration between
publishers, databases, digital libraries and other stakeholders. It would be
unwise to put all of one's eggs in the basket of any one economic or
technological 'solution'. Diversity is the best bet.

This Opinion article has been inspired by many of the contributions to
Nature's web forum on "Future e-access to the primary literature". The
current table of contents of the forum can be found at the following
address: http://www.nature.com/nature/debates/e-access/

----------------------------------------------------------------------------
Nature © Macmillan Publishers Ltd 2001 Registered No. 785998 England.