The fundamental importance of capturing cited-reference metadata in Institutional Repository deposits
On 22-Jan-09, at 5:18 AM, Francis Jayakanth wrote on the eprints-tech
list:
Till recently, we used to include references for all the uploads that
are happening into our repository. While copying and pasting metadata
content from the PDFs, we don't directly paste the copied content onto
the submission screen. Instead, we first copy the content onto an
editor like notepad or wordpad and then copy the content from an
editor on to the submission screen. This is specially true for the
references.
Our experience has been that when the references are copied and pasted
on to an editor like notepad or wordpad from the PDF file, invariably
non-ascii characters found in almost every reference. Correcting the
non-ascii characters takes considerable amount of time. Also, as to be
expected, the references from difference publishers are in different
styles, which may not make reference linking straight forward. Both
these factors forced us take a decision to do away with uploading of
references, henceforth. I'll appreciate if you could share your
experiences on the said matter.
The items in an article's reference list are among the most important
metadata, second only to the equivalent information about the
article itself. Indeed they are the canonical metadata: authors,
year, title, journal. If each Institutional Repository (IR) has those
canonical metadata for every one of its deposited articles as well as
for every article cited by every one of its deposited articles, that
creates the glue for distributed reference interlinking and metric
analysis of the entire distributed OA corpus webwide, as well as a
means of triangulating institutional affiliations and even name
disambiguation.
Yes, there are some technical problems to be solved in order to
capture all references, such as they are, and to filter out noise, but
those technical problems are well worth solving (and the solutions
well worth sharing) for the great benefits they will bestow.
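As a rough illustration of the noise-filtering step, here is a minimal
Python sketch that tidies a reference string pasted from a PDF. It is
not taken from EPrints or any other repository software; the function
name clean_reference and the PUNCT_MAP table are hypothetical, and the
table covers only a few typographic characters that commonly turn up
after copy-and-paste. Accented letters in author names are legitimate
and are deliberately kept.

```python
import re
import unicodedata

# Hypothetical, deliberately small table of typographic characters that
# often appear when text is copied out of a PDF (an assumption, not a
# complete list).
PUNCT_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00ad": "",                   # soft hyphen
}

def clean_reference(text: str) -> str:
    """Return a tidied copy of one pasted reference string."""
    # NFKC expands compatibility characters (ligatures such as U+FB01,
    # non-breaking spaces) and composes base letter + combining accent.
    text = unicodedata.normalize("NFKC", text)
    # Replace typographic punctuation with plain equivalents.
    text = text.translate(str.maketrans(PUNCT_MAP))
    # Drop control and format characters, keeping newlines and tabs so
    # that they can be collapsed to single spaces below.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    # Collapse runs of whitespace left over from PDF line breaks.
    return re.sub(r"\s+", " ", text).strip()
```

A depositor-side script could run each pasted reference through such a
function before it reaches the submission screen, so that the manual
correction step is limited to whatever the table does not yet cover.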
The same is true for handling the numerous (but finite) variant
formats that references may take: Yes, there are many, including
different permutations in the order of the key components,
abbreviations, incomplete components etc., but those too are finite,
can be solved once and for all to a very good approximation, and the
solution can be shared and pooled across the distributed IRs and
their software. And again, it is eminently worthwhile to make the
relatively small effort to do this, because the dividends are so
vast.
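To make the "finite variant formats" point concrete, here is a minimal
Python sketch that maps two common reference layouts onto the same
canonical fields (authors, year, title, journal). The two patterns, the
PATTERNS list and the function parse_reference are illustrative
assumptions only; a shared solution would need many more patterns (or
a trained parser), and the sample references in the usage note are
made up.

```python
import re
from typing import Optional

# Each pattern handles one layout; all of them expose the same named
# groups, so downstream code sees one canonical record shape.
PATTERNS = [
    # Author-year layout: "Doe, J. (2001). A title. Journal Name, 1, 1-10."
    re.compile(
        r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.?\s*"
        r"(?P<title>[^.]+)\.\s*(?P<journal>[^,.]+)"
    ),
    # Year-last layout: "Doe J. A title. Journal Name. 2001;1:1-10."
    re.compile(
        r"^(?P<authors>[^.]+)\.\s*(?P<title>[^.]+)\.\s*"
        r"(?P<journal>[^.]+)\.\s*(?P<year>\d{4})"
    ),
]

def parse_reference(ref: str) -> Optional[dict]:
    """Try each known layout in turn; return canonical fields or None."""
    ref = ref.strip()
    for pattern in PATTERNS:
        match = pattern.match(ref)
        if match:
            return {key: value.strip()
                    for key, value in match.groupdict().items()}
    # Unrecognised layout: leave it for manual handling or a new pattern.
    return None
```

For example, parse_reference("Doe, J. (2001). An example title.
Journal of Examples, 1, 1-10.") and parse_reference("Doe J. An example
title. Journal of Examples. 2001;1:1-10.") both yield the year "2001",
the title "An example title" and the journal "Journal of Examples".
Because each new layout is just one more entry in the list, such a
pattern set is the kind of thing that could be pooled across IRs.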
I hope the IR community in general -- and the EPrints community in
particular -- will make the relatively small, distributed,
collaborative effort it takes to ensure that this all-important OA
glue unites all the IRs in one of their most fundamental functions.
(Roman Chyla has since replied to eprints-tech with one potential
solution: "The technical solution has been there for quite some time,
look at citeseer where all the references are extracted automatically
(the code of the citeseer, the old version, was available upon
request - I dont know if that is the case now, but it was in the
past). That would be the right way to go, imo. I think to remember
one citeseer-based library for economics existed, so not only the
computer-science texts with predictable reference styles are possible
to process. With humanities it is yet another story.")
Stevan Harnad
Received on Fri Jan 23 2009 - 00:28:17 GMT