Re: Use of Navigational Tools in a Repository from Leslie Carr on 2006-03-10 (American-Scientist-Open-Access-Forum)

From: Leslie Carr <lac_at_ecs.soton.ac.uk>
Date: Fri, 10 Mar 2006 00:57:32 +0000

On 9 Mar 2006, at 16:19, Stevan Harnad wrote:

> How well do search engines index the OA repositories?
> Frank McCown, Xiaoming Liu, Michael L. Nelson, Mohammad
> Zubair (2006) Search Engine Coverage of the OAI-PMH
> Corpus, IEEE Internet Computing, March/April 2006.
> http://library.lanl.gov/cgi-bin/getfile?LA-UR-05-9158.pdf
>
> Of this OAI-PMH corpus, Yahoo indexed 65%, followed by Google (44%)
> and MSN (7%). Twenty-one percent of the resources were not indexed
> by any of the three search engines.

I have just examined 6 weeks of historic web logs from the
eprints.ecs.soton.ac.uk repository (midnight on 30/1/05 to 0600 on
16/3/05), when the repository had 9372 eprints.

In those six weeks there were 595279 downloads from the Google bot
alone, of which 393132 were of eprint "abstract" pages or eprint
document files.

In that time, the Google bot crawled 9353 of the repository's 9372
eprints (99.8% coverage). On average it crawled each eprint 42 times
- equivalent to once a day, every day, for the whole six-week
observation period.
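
A minimal sketch of how coverage and average-crawl figures of this
kind could be computed from a web log. It is not the script used for
the numbers above. It assumes the Apache "combined" log format, a
crawler that identifies itself with "Googlebot" in its user-agent
string, and a URL layout of /<eprint id>/ for abstract pages and
/<eprint id>/<doc>/<file> for document files. The function names
crawl_counts and coverage are hypothetical, and all three assumptions
would need checking against the real logs.

```python
import re
from collections import Counter

# ASSUMPTION: Apache "combined" log format; the real logs may differ.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# ASSUMPTION: /<id>/ is an abstract page and /<id>/<doc>/<file> is a
# document file, so the leading number identifies the eprint.
EPRINT_PATH = re.compile(r'^/(?P<id>\d+)(?:/|$)')


def crawl_counts(lines, agent_marker="Googlebot"):
    """Count crawler requests per eprint id.

    Requests for navigational pages (anything not starting with a
    numeric id) are ignored. HTTP status is not checked here.
    """
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or agent_marker not in m.group("agent"):
            continue
        p = EPRINT_PATH.match(m.group("path"))
        if p:
            counts[int(p.group("id"))] += 1
    return counts


def coverage(counts, total_eprints):
    """Return (fraction of eprints crawled, mean crawls per crawled eprint)."""
    crawled = len(counts)
    hits = sum(counts.values())
    mean = hits / crawled if crawled else 0.0
    return crawled / total_eprints, mean
```

With the figures quoted above, the same arithmetic gives 9353 / 9372
(about 99.8% coverage) and 393132 / 9353 (about 42 crawls per eprint).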

Why is this observation apparently so different from that reported
by McCown et al? Firstly, the figures quoted from the paper are
averages. The paper gives a table of 10 representative repositories
whose Google percentage varies between 100% and 1.3%. Secondly, we
are measuring different things - McCown et al tested a statistical
sample of the search engine's index by query, whereas I have examined
the actions of the search engine's crawler and ASSUMED that a
document that is crawled must be indexed.

However, these results highlight the apparent difference in behaviour
between repositories. A difference that causes most of the content of
some repositories to be overlooked by a search engine is a crucial
one for Open Access.

What causes this difference? Is it an intrinsic feature of the
repository software, or a side effect of the organisation of the
repository and the interlinking of the pages that it exposes? Is it
all down to the sets of navigational pages that are provided
internally? (Does the subject classification pull its weight in this
aspect of a repository?) Or is the difference rather in the context
in which the repository is situated? Will a repository that is
well-linked into its community (with high pagerank scores) behave
differently from an isolated repository?

---
Les

Received on Fri Mar 10 2006 - 15:47:34 GMT
