Boolean full-text search, institutional mandates and musical chairs

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Sat, 4 Mar 2006 14:14:28 +0000 (GMT)

---------- Forwarded message ----------
Date: Fri, 3 Mar 2006 23:37:58 +0000 (GMT)
From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
To: Belinda Weaver <b.weaver_at_library.uq.edu.au>
Subject: Boolean full-text search, institutional mandates and musical chairs

Dear Belinda,

Thank you very much for your detailed and thoughtful reply.
Here are some comments:

On Fri, 3 Mar 2006, Belinda Weaver wrote:

> Dear Professor Harnad
> Someone passed on to me the email below where you appear critical of the
> University of Queensland's Fez development. I think many of your remarks are
> based on misunderstandings of what we are doing now and what the point of
> our project is, so we thought we would try to set these out for you so you
> can see for yourself that your aims and ours are not at all at odds with
> each other, but are probably going in the same direction.
>
>> SH: Subjects? Subject search?
>> What utter flotsam! A 1980's library-aided Dialog search hang-over. (In
>> the online/OA millennium, it is Google-style full-text Boolean search, not
>> subject search any more.) Completely obsolete; alive now only in
>> library cataloguers' imaginations. (Also, no one ever made subject
>> catalogue cards for journal articles!)
>
> I would take issue with this statement - a very large number of searches on
> our ePrintsUQ repository are by (controlled) subject terms. The thesaurus we
> use, the Australian Standard Research Classification, is well known to
> Australian academics as it is used for reporting publications annually to
> DEST, when applying for ARC grants and so on. Therefore when looking for
> materials, many people follow that approach of going directly to the ASRC
> code that interests them. That way they get publications that are
> specifically and substantially about, say, signal processing rather than
> items that might simply mention signal processing in passing.

That is undoubtedly useful for Australian researchers familiar with ASRC
and wishing to search Australian research therewith. But surely
Australia's goal in making its research output Open Access is not only
or mainly to make it usable by and for Australian researchers! Surely
the main target usership for Australian research output -- like the main
target usership for the rest of the world's research output -- is the rest
of the world. And the rest of the world will neither wish to search all
and only Australian research output (on, say "signal processing"), nor
will they be familiar with or desirous of using ASRC. They will be
searching all research, worldwide, and via OAI harvesters, not an
Australia-specific collection.

Moreover, I am certain that when they do search the world OA literature,
the classification, if any, will come from automatic AI-based
classification derived from the full-texts, citations, affiliations and
keywords, not from a pre-tagged classification like ASRC. Nor would I
want to discourage authors (who are already too sluggish about
self-archiving their articles, fearful of too much time, keystrokes and
complications) with having to subject-classify their work. I am certain
the inverted full-text search, augmented by AI-based classification and
the other tags, on the *harvested* collection, will easily beat any
laborious and by now very dated self-classification at source (by either
author or institutional librarians). I think time and effort spent on
that will alas be entirely wasted in the OA/OAI world.

> Keywords go in and out of fashion.
> Repositories that rely on keyword search alone will
> quickly fill up with 'dark matter' - where no search on current keywords
> actually retrieves certain records that nevertheless may be relevant.

Agreed. So keywords, as noted, are mere supplements to the real data,
which is inverted full-text, plus any AI processing and classification
and analysis, citation, co-citation, latent semantic indexing, text
similarity metrics, etc.
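
To make "text similarity metrics" concrete, here is a minimal sketch,
assuming plain term-frequency cosine similarity. It is only one simple
example of such a metric, not latent semantic indexing and not an AI
classifier. The function names and the two-document corpus are
hypothetical, invented for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic word tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = term_counts(text_a), term_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, corpus):
    """Return the id of the corpus document most similar to the query."""
    return max(corpus, key=lambda doc_id: cosine_similarity(query_text, corpus[doc_id]))


# Hypothetical full-texts (illustration only, no hand-assigned subject codes).
corpus = {
    "dsp": "adaptive filters for signal processing of noisy audio signals",
    "careers": "career planning advice for job seekers changing careers",
}
best = most_similar("noise reduction filters in audio signal processing", corpus)
```

The point of the sketch is only that relatedness is computed from the
texts themselves; nothing in it depends on a depositor having tagged
either document beforehand.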

> Controlled subject search is actually far more efficient in finding items in
> a database than full text searching by keyword.

But keywords are not the alternative: Boolean search on inverted
full-text is, augmented by all the further techniques I mentioned, none
of them based on hand pre-classification at source. (Or are you thinking
of merely a bibliographic database, without the full-texts? But that is
not at all what OA is about!)

> Keyword searches return many
> results, many of borderline relevance or no relevance at all. Subject
> searches result in fewer, more targeted results.

I am talking of neither keyword nor subject search but boolean search on
inverted full-text, augmented by AI-classification, citation links,
co-citation, similarity indices, etc.

> Material that is not
> subject-classified may end up being virtually lost. An example - the book
> 'What colour is your parachute' is about career planning. It is classified
> in our library catalogue with terms such as career planning, jobs, and so
> on. Without that classification, who would find that title as the relevant
> keywords a searcher would use do not appear anywhere in the title.

We are not talking here about books and book titles but about the
full texts of journal articles, searchable by boolean full-text
search. It is, as I stated before, a profound mis-analogy to apply
book-title classification techniques of yore to today's online full-text
journal-article texts (which are, we must ever remind ourselves, OA's
primary target content).

> Journal articles in the arts are full of titles that bear little
> relevance to their contents. Subject classification helps locate those.

But far, far, far more helpful is boolean search on the inverted
full-texts of those articles. *Titles alone* may bear little relevance
to the contents, but surely the full-text contents themselves do!
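
A minimal sketch of the difference, assuming a toy in-memory inverted
index and Boolean AND queries. The two articles, their ids and the
helper names are hypothetical, made up for illustration; a real
harvester's index would be far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Inverted index: map each word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_and(index, *words):
    """Boolean AND: ids of documents that contain every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()


def search_not(index, all_ids, word):
    """Boolean NOT: ids of documents that do not contain the word."""
    return set(all_ids) - index.get(word.lower(), set())


# Hypothetical articles: the first has a title unrelated to its content.
titles = {
    "a1": "What Colour Is the Wind?",
    "a2": "Notes on Signal Processing",
}
full_texts = {
    "a1": "What Colour Is the Wind? This article reviews adaptive filters "
          "for signal processing in wind turbines.",
    "a2": "Notes on Signal Processing. A survey of signal processing "
          "curricula; filters are mentioned only in passing.",
}

title_index = build_index(titles)
full_index = build_index(full_texts)

title_hits = search_and(title_index, "signal", "processing")
full_hits = search_and(full_index, "signal", "processing")
narrowed = search_and(full_index, "signal") & search_not(full_index, full_texts, "turbines")
```

Here the title-only index misses article a1, while the same query on
the full-text index finds it, and a further Boolean term narrows the
result again without any subject code.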

> In any case, we offer both - controlled subject searching, keyword
> searching, so users can choose. Why not offer multiple pathways to data?
> There seems little harm in that.

Because the subject classification is unnecessary, time-consuming,
and redundant, and wastes either the author's time or the
documentalist's, delaying and deterring self-archiving.

Because most searching is not even done on the Australian corpus, but on
a harvested worldwide corpus, of which the Australian corpus is merely a
small subset.

Because full-text search easily beats subject and keyword search every
time.

Because AI-based subject-classification tags, derived automatically
from the full-texts and citations, rather than time-wasting hand-tagging
by the depositor, and based on the world OA corpus, not just the OZ
subset, provide all the "multiple pathways" one might need or wish.

Because there is harm in anything that risks needlessly deterring
self-archiving at a historic time when it is still hovering at 15%.

> And while card catalogues for journal
> articles might not have existed, almost all journal databases do use
> controlled subject headings to facilitate searching, and this is the method
> I always use in journal databases - keyword search first to identify
> possibly useful records. When one record seems particularly good, I extract
> the subject terms allocated, and re run the search on that term, limiting to
> that field. I always find records that keyword search did not retrieve and
> the records I do get are much more substantially on topic than the keyword
> alone results.

Belinda, how many of your journal databases offer boolean search
of the *full-texts* of the articles (rather than just the titles and
abstracts)? That is what we are talking about in OA IRs, and that is
the benchmark to beat.

>> Australia-only search?
>
> The ARROW project is addressing this and that is not the objective of
> ePrintsUQ, eScholarshipUQ for APSR, UQ eSpace or any other current UQ
> project. The document that Leslie refers to was for an earlier project
> proposal that has considerably changed since its first suggestion. The aim
> of the UQ portion of the APSR test-bed project is to capture the entire
> research output of this institution - not just the publications, but the
> supporting data (datasets, digitised scans and x-rays, images, audio and
> video files and so on) so that other researchers can not only read the
> research output as published in journal articles and conference papers, but
> also see the datasets from which the conclusions drawn in publications were
> taken. This is an attempt to mirror the eScience agenda in the UK where
> researchers have taken data collected by researchers and re-used it in ways
> the original researchers might not have imagined. We see this very much as a
> contribution to the research community internationally.

It is admirable and desirable to capture all research output, indeed all
digital output. But I am only talking about OA's target output, which
consists of the full texts of all journal articles published by
Australian authors. What's missing and urgently needed now is the
remaining 85% of those -- not their accompanying accoutrements -- though
those are of course extremely welcome too! The focus, target, and
overwhelming priority, is those article full-texts.

Australian research journal article content -- OA's target -- is an
extremely specific and important one, for Australian research impact and
progress. It is not just an arbitrary subset of Australian digital
output, and should not be treated as such. Australia, like the rest of
the world, is only spontaneously self-archiving about 15% of its annual
research article full-texts. This needs to be augmented to 100%, and
that 100% needs to be integrated into the world research article
full-text output in order to have its full impact and usage.

Subject hand-classification is a waste of time and a deterrent to
self-archiving. And the collection should not be thought of as an
Australia-only collection (except for Australia-only book-keeping and
performance evaluation purposes). It should be thought of and provided
as Australia's component of the worldwide harvest, for global searching
along with the rest of the harvest.

The primary objective is searching those research articles, not searching
all manner of digital content. And the point is that when researchers
worldwide are searching journal articles they have no need or interest
to focus on an Australia-only corpus.

> It is a pity that you have drawn conclusions from a project that was
> stillborn. It seems that the inevitable confusion between Fez/Fedora as a
> repository platform, UQ eSpace as a portal for exposing the content and the
> notion of the eScholarship project (albeit an earlier version) may obscure
> the key fact that the choice of which underlying platform is used is almost
> completely irrelevant. Thank you for alerting us to the fact that the older
> documentation on that site is not accurate and doesn't describe what the
> project is actually about. We will certainly be removing it and replacing
> with a document which is up to date.

I'm not sure I was able to follow all that. I agree that the repository
software is irrelevant -- as long as it facilitates rather than retards
the primary goal, which is the self-archiving of the full-texts of 100%
of Australian research article output -- first, for worldwide research
accessibility, visibility, usage and impact and, second, for internal
auditing, record-keeping, and performance review.

> Regarding the Fez/Fedora platform
> We built this to be an open source, open access repository system that could
> do the job of holding and making accessible our ePrint-type materials, as
> well as hold other collections as mentioned above - datasets, etc. Our UQ
> eSpace repository is not yet launched nor fully populated, therefore
> comparisons with our ePrintsUQ repository are unhelpful. We hope to migrate
> all our holdings - ePrintsUQ, our e-theses and our datasets to UQ eSpace
> when it is fully operational and ready for launch. It is not yet an open
> system, and should not be judged as such. We also intend to use UQ eSpace to
> facilitate the electronic delivery of research for the Australian RQF. It is
> working extremely well in that role though again this work would not be
> visible to outsiders.

Again, I could not follow all of that, but I certainly wish the project
well. What I have understood is that the Eprint collection, in one
existing archive (what percentage of UQ current research article output
is being self-archived in it?) is being migrated to another archive (not
altogether clear why migrating, rather than augmenting it to 100%) and
that that new archive will have more kinds of contents (fine, but why
another archive, why migration? Archives are OAI-compliant; they can be
harvested and integrated, they need not be migrated; what they need is
more of their respective contents). And that the new archive will feed
RQF (but why not feed RQF from the existing archive, augmented to 100%?).
Plus all the needless subject tagging...

It sounds like what was needed was a UQ mandate to fill the original UQ
archive to 100% with UQ article output, an RQF feed from that archive,
perhaps OAI harvesting and integration with other UQ OAI archives, with
other contents (or just adding the other content to the original UQ archive)
and no subject classification...

> We also hope to be able to use it for a range of
> purposes in the coming years, not least deposit of all DEST-reportable
> publications or links to them elsewhere.

It is not at all clear why this needed migration and a new platform,
rather than 100% filling of the original UQ archive with all DEST content,
plus a DEST feed.

> The Fez/UQ eSpace initiative will
> deliver much more open access research from this institution than
> previously, and we see this as a good thing, as I am sure you do.

I do, but what I don't quite see is why/how a *platform change* increases
article content. It seems to me that it is increasing article content
that increases article content...

> The job
> that ePrintsUQ has done has been great in terms of making research available
> and increasing its visibility. Fez will only take that further, and our
> migration from the eprints.org platform to Fez is not an attempt to get away
> from exposing our research output but a way of exposing even further
> research via an interface that more clearly meets our institutional
> reporting, storage, preservation metadata, security, archiving and other
> needs.

This too, I could not quite follow, in the OAI age. OAI picks
up everything that is OAI and can be harvested and integrated with
everything else that is OAI. More article content could have filled the
original article UQ archive; other kinds of content of other feeds could
have been added to it; or it could have been co-harvested and integrated
with other content in other UQ archives. (An archive, unlike a physical
library, is a virtual, distributed thing, in the OAI age, not a fixed
locus.) And the various RQF and DEST feeds are simple add-ons, being
done with such archives in the UK and elsewhere, as needed, without the
need for migration.
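
A minimal sketch of that harvest-and-integrate idea, assuming two
hypothetical OAI-PMH archives whose ListRecords responses carry Dublin
Core (oai_dc) metadata. The base URL, identifiers, titles and helper
names are invented for illustration, and the responses are canned
strings rather than fetched over the network. Resumption tokens, error
responses and deleted records are deliberately left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for an archive."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract identifier and title from each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(OAI + "header/" + OAI + "identifier")
        title = record.findtext(".//" + DC + "title")
        records.append({"identifier": identifier, "title": title})
    return records


def harvest(responses):
    """Merge the records of several archives into one combined list."""
    merged = []
    for xml_text in responses:
        merged.extend(parse_records(xml_text))
    return merged


def sample_response(identifier, title):
    """Canned single-record response standing in for a real archive."""
    return f"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>{identifier}</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>{title}</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


url = list_records_url("http://archive-a.example.org/oai")
merged = harvest([
    sample_response("oai:archive-a.example.org:1", "An article in archive A"),
    sample_response("oai:archive-b.example.org:7", "A dataset description in archive B"),
])
```

The two archives stay where they are; only their metadata is gathered
into one list, which is the sense in which integration does not
require migration.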

My commentary, by the way, was not motivated by the needless migration
from Eprints to Fez, but by any needless modification and action other
than the action that is really needed most, namely, increasing
article content from the spontaneous worldwide baseline rate of 15%
to 100%. Migration and subject classification are simply dissipating
efforts in the wrong place, while doing nothing for OA growth and
momentum. Playing musical chairs instead of playing more music!

> The capacity and extra functionality that Fez has delivered has gone a long
> way towards institutionalising mandatory reporting of research here at UQ
> which is a very desirable aim which we have pursued unsuccessfully up till
> now. The ease of input and flexibility that Fez has demonstrated as an RQF
> tool has made it an attractive option for other research reporting jobs here
> at UQ, which can only increase the body of research available from UQ openly
> to the world.

I am for anything that successfully generates a UQ self-archiving
mandate!

So if it is truly easier for authors to self-archive in Fez, and this has
induced mandatory self-archiving at UQ, that's splendid!

But my guess is that self-archiving in Fez is in reality not one bit
easier than self-archiving in Eprints, and that mandating self-archiving
with Fez is not one bit harder than mandating self-archiving with
Eprints. It was always the keystrokes that were missing, and needed
mandating. The RQF reporting could of course be done either way too. And
the subject classification (likewise feasible for both) is equally
unnecessary and a deterrent in both. (So it still sounds like musical
chairs to me, but if the outcome is indeed mandated 100% music, I have
no complaints: Is it?)

Best wishes,

Stevan

> Belinda Weaver,
> Coordinator, ePrintsUQ,
> The University of Queensland Library,
> The University of Queensland,
> Brisbane Australia 4072.
> T: +617 3365 8281
> F: +617 3365 7930
> E: b.weaver_at_library.uq.edu.au
> W: http://eprint.uq.edu.au/
>
> From: Stevan Harnad [mailto:harnad_at_ecs.soton.ac.uk]
> Sent: Wednesday, 1 March 2006 10:42 PM
>
> Dear All,
>
> http://espace.library.uq.edu.au/
> UQ's Fez sounds very much like a lot of stufFez and nonsense (with a strong
> a-priori library-speculative feel!)
>
> Subjects? Subject search?
>
> What utter flotsam! A 1980's library-aided Dialog search hang-over. (In the
> online/OA millennium, it is google-style full-text boolean search, not
> subject search any more.) Completely obsolete; alive now only in library
> cataloguers' imaginations. (Also, no one ever made subject catalogue cards
> for journal articles!)
>
> Australia-only search?
>
> Balderdash -- as foolish as the idea of Soton-only, UT-only, or UQ-only
> search. Search by external users, that is (i.e., the main clientele for
> research output, including Australian research output). The lines that are
> being foolishly crisscrossed here are external global search vs. internal local
> record-keeping (whether institution internal, or
> national-assessment-internal [RQF]) and external use. An OZ-specific
> "collection" for the former (external-world global usage/search) is
> irrelevant and a misconception and the latter (local internal/national
> record-keeping and assessment) can and should be done at the metadata and
> harvesting level, not by imagining that one big OZ super-archive is needed.
>
> I think this is very timely, as it illustrates exactly the kind of absurd
> thinking that the library community (possibly hand in hand with the central
> assessor community, blinkered as they too are) is bringing down on all of
> Oz. It's all a-priorist (i.e., from an armchair, no reality-testing, no
> insight, no critical reflection, mostly with heads screwed on in 180-degree
> retro-position, looking squarely at -- and emulating -- the obsolete past).
>
> Richard is right on point. Hurrah for "piecemeal" solutions -- as long as
> the piece is the OA research article output of a single institution or
> department and the archive is OAI-compliant. If you go to the "eprints
> community" in UQ's Fez, you find the tiny bit that *all* of it should have
> been about, and focussed on as a priority. The pertinent point of comparison:
>
> UQ espace eprints community (the eprints subset of UQ's Fez archive, about
> 350 papers)
> http://espace.library.uq.edu.au/list.php?community_pid=UQ:1
> versus
> eprintsUQ (UQ's eprint archive, about 3000 papers)
> http://eprint.uq.edu.au/
>
> EprintsUQ had been a good start, and needed only a UQ mandate to make it
> work; instead it is to be abandoned (as was University of California's
> Eprints archive way back in 2001) and replaced by the Fez omnibus, heading
> off in all directions.
>
> What *should* be heading off in a different direction, is the research
> community, if it had any sense: It should be filling its own dedicated OAR,
> dedicated to OA articles. But of course the research community (as Richard
> also points out!) does not have any sense either -- otherwise they wouldn't
> need a mandate to get them to stop sitting on their idle fingers and do the
> 40 minutes per year keystroking it would take to put an end to all this
> absurdity once and for all.
>
> It is ironic that catchy quips propagate much more readily than sense: I had
> originally floated the term "espace" ironically, for "empty space" when I
> had warned way back in (Feb 2003) that we should not be fussing over which
> of the (then only two) archive softwares to use, but about how to fill them
> (and fill them with their target OA contents)!
>
> EPrints, DSpace or ESpace? (Feb 2003)
> http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2671.html
>
> Here we are in Mar 2006, still fussing over which software to use (now with
> an overflow of perhaps a dozen to choose from) and our archives still
> unfilled (filling with other than OA content).
>
> http://archives.eprints.org/?action=search&query=espace&submit=Search
> Grrrr...
>
> S
>
> On 1-Mar-06, at 2:54 AM, Leslie Carr wrote:
>
> In response to Arthur's distribution of the UQ/RQF comments I have dug
> around the UQ documentation somewhat and found the UQ Fez project proposal.
>
> http://espace.library.uq.edu.au/documentation/templates/proposal.html and
> http://www.library.uq.edu.au/escholarship/destproposal.pdf
>
> The former sets the scene: we have lots of repositories with lots of
> functionality but specific 'nuances' and we want an UberRepository. The
> second document is particularly interesting as they provide more detail. In
> particular the following tell-tale collections/searching misinformation:
>
> "Piecemeal solutions to building greater accessibility to Australian digital
> research have encompassed the creation of institutional ePrint repositories
> in some Australian universities, the Australian Digital Theses Project, and
> the setting up of ePress initiatives in some Australian universities to
> create new opportunities for scholarly communication. While these are all
> worthy projects, their fragmented nature means that researchers must approach
> each individually and master different layouts, search facilities and subject
> schema to discover their contents. The US Mellon-funded OAIster project
> (http://oaister.umdl.umich.edu/o/oaister/) does allow cross-searching of many
> of these initiatives, but searches cannot be searched by subject, which is a
> limitation that the proposed eScholarship@UQ initiative plans to resolve."
>
> Also, this objective to establish a national portal to all Australian
> research. Were you aware of this, Arthur?
>
> "The eScholarship@UQ project will aim to develop an integrated entry point
> to the vast body of under-reported research, initially within a single
> institution and then nationally. The gateway will harvest metadata from
> existing open access repositories, such as those already established for
> ePrints, and will also establish mechanisms to identify, capture, organise,
> manage - and in some cases, digitise - other forms of research, such as
> exists in individual academic Web pages, on departmental or School Web
> servers and in existing research datasets in both the sciences and
> humanities, among others. The project's goals will be to encourage better
> reporting of academic research outputs, to centralise access to information
> about research, to improve its visibility and usability, and to add value to
> existing information by standardising subject classifications across material
> harvested from a range of different sources. This would involve mapping
> existing institutional subject classification schemes to the Australian
> Standard Research Classification. It provides a comprehensive, robust
> thesaurus specifically designed for Australian research output. Where
> metadata harvested by eScholarship@UQ lacks the appropriate thesaurus
> descriptors, automatic and semi-automatic subject mapping techniques will be
> applied during the harvesting and upload process."
>
> To be honest, I don't think that they have asked for nearly enough resources
> to accomplish this, but I am surprised that it is politically acceptable!
> --
> Les
>
>
Received on Mon Mar 06 2006 - 04:06:08 GMT