Re: Interoperability - subject classification/terminology from David Goodman on 2003-03-11 (American-Scientist-Open-Access-Forum)

From: David Goodman <dgoodman_at_PRINCETON.EDU>
Date: Tue, 11 Mar 2003 13:06:44 -0500

Stevan, I agree with your conclusion: it would be both confusing and
wasteful to develop elaborate local classification schemes.

As for the more specific difficulties of access, you simply do not appear
to realize how difficult much of this can be in practice, especially for
beginners.
My journal examples were meant to indicate the difficulty of classifying,
even roughly, papers in those disciplines, not just journals.
Resolving journal names with a fuzzy match will work -- 95% of the time.
A good library intends to locate and retrieve not 95%, but 100%,
and can usually accomplish 98 or 99% for journal articles.
(Considerably less for informally published items, depending on area.)
Google and other citation-index-based systems are very good, but not
that good.
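
To make the fuzzy-match point concrete, here is a minimal sketch, assuming
Python's standard difflib module. The title list, the function name
resolve_journal, and the 0.6 cutoff are all hypothetical, chosen only for
illustration; this is not how any particular library system or citation
resolver actually works.

```python
import difflib

# Hypothetical authority list of journal titles (illustrative only).
JOURNALS = [
    "Physical Review B",
    "Journal of Chemical Physics",
    "Brain Research",
    "Brain Research Bulletin",
]


def resolve_journal(name, cutoff=0.6):
    """Return the closest known title, or None if nothing clears the cutoff."""
    # Compare case-insensitively, but return the title as listed.
    lowered = {title.lower(): title for title in JOURNALS}
    matches = difflib.get_close_matches(
        name.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(resolve_journal("Brain Reserch"))  # a simple misspelling
print(resolve_journal("xyzzy"))          # nothing remotely similar
```

A small misspelling resolves cleanly, but heavily abbreviated or
half-remembered names may fall below whatever cutoff is chosen, and
lowering the cutoff invites wrong matches instead. That trade-off is
where the last few percent go.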

Actually, there is a partial solution to locating known items
that we both know about: to eliminate primary reliance on journal names,
indexes, and so forth and go with links and DOIs. The real difficulty
comes when you want to identify items that you do not already know about.

There is a certain tendency that we all have to underestimate other
people's problems.

Stevan Harnad wrote:
>
> On Mon, 10 Mar 2003, David Goodman wrote:
>
> > The reason I suggested classification is that various people in the
> > subjects covered have told me that they use this archive by checking
> > everything in their subject classification each day, and that the current
> > rather straight-forward classification suits them fine.
>
> I assume that "this archive" refers to the Physics ArXiv, which is a
> global, discipline-based archive. Some users monitor some topics daily
> or weekly, and there are ways to accommodate their needs that include a
> subject taxonomy. (Whether that taxonomy, and the classification of the
> papers within it, is best done, in our online digital era, by human
> classifiers and/or authors, rather than by a text-processing algorithm,
> is another question.)
>
> I was not referring, however, to global, discipline-based archives,
> but to local, institutional archives. For local search and use they
> certainly don't need a global taxonomy; and as bits of a harvested
> distributed worldwide "virtual archive" they are surely better sorted
> and navigated globally by cross-archive search tools than by local
> classification schemes.
>
> > People work in various ways, especially for current awareness. One of the
> > many virtues of systems such as this is that they can be designed to be
> > adaptable to individuals.
>
> The current-awareness alerting system (likewise probably better if based
> on text-processing algorithms rather than human classifiers and/or
> authors) is not the same issue as the question of whether or not there
> is any need to develop a classification system for local institutional
> refereed-research output archives. (The Eprints software, for example,
> has an alerting capability but no elaborate classification system.)
>
> > I did not mention Boolean full-text searching, only because I assumed it.
> > Stevan, would anyone design such a system without it--still, now?
>
> Not only is the boolean capability there with all inverted digital
> full-text, but (I'm betting) it can beat any human classification scheme
> (with the help of the right text-processing algorithms).
>
> > And I remain much less sanguine than you about the ability to accommodate
> > all the fields of science -- let alone all academic knowledge -- in a
> > single relatively simple system.
>
> In one (local, institutional) archiving system or in one classification
> system? I am sanguine about the first (though not necessarily all squashed
> into a single university archive: many interoperable departmental ones
> will probably work better) whereas I consider the second unnecessary and
> a waste of time (beyond a very rudimentary, first-cut classification
> scheme): Computational algorithms on the full-text should do the rest. Not
> human classifiers (including the authors). Remember: we are talking about
> journal articles, not books or other works. Who ever searched the journal
> literature on the basis of a fixed human classification of it? (And if
> they did, how much mileage did they really get out of that taxonomy,
> compared to computational sorting based on full-text analysis?)
>
> > Anyone who has ever worked in a library can tell you about the
> > unreliability of a rough arrangement by discipline and journal name.
>
> Unreliability for what? Ambulatory, analog search? Of course. But we
> are talking about digital data and digital search. Who searches the
> journal system by taxonomy rather than, say, boolean word-search?
>
> > What subject is Phys Rev B (Condensed Matter)? or J Chem Phys? or Brain
> > Research?
>
> Who cares?
>
> If I am looking for stuff on neuropeptides, my boolean search will
> retrieve any papers from the latter two journals regardless, as long as
> they contain the indicators my algorithm specifies.
>
> > And if you always remember journal names correctly, I congratulate you but
> > wish you weren't unique. All your plans--as is inevitable--are shaped by
> > your own preferences. So would mine be, but at least I realize
> > it--sometimes.
>
> No need to remember journal names correctly (fuzzy matches can be
> fine-tuned -- see http://paracite.eprints.org/) and (in my
> opinion) no longer any need for any prefabricated a-priori human
> taxonomies (in searching the refereed research journal literature) --
> though a-posteriori algorithmic ones can be generated on the fly.
>
> Stevan Harnad

--
Dr. David Goodman
Princeton University Library
and
Palmer School of Library and Information Science, Long Island University
e-mail: dgoodman_at_princeton.edu

Received on Tue Mar 11 2003 - 18:06:44 GMT
