Predicting later citation counts from very early data

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Mon, 21 Apr 2008 07:47:13 -0400

On 20-Apr-08, at 9:47 AM, Peter Suber wrote:

      Hi Stevan:  Yesterday I tried to send the message below
      to the OACI list.  But I got an error message suggesting
      that the list has been discontinued.  

      Instead of predicting citations from early downloads, as
      you've done, this team predicts citations from properties
      of the article.

      Prediction of citation counts for clinical articles at
      two years using data available within three weeks of
      publication: retrospective cohort study, BMJ, February
      21, 2008. http://dx.doi.org/10.1136/bmj.39482.526713.BE

      Conclusion:  Citation counts can be reliably predicted at
      two years using data within three weeks of publication.


Hi Peter,

I am forwarding your post instead to the Sigmetrics
list: SIGMETRICS_at_LISTSERV.UTK.EDU

This interesting article finds (using multiple regression analysis)
that a number of metrics available immediately upon publication
predict citations two years later.

      1274 articles from 105 journals published from January to
      June 2005, randomly divided into a 60:40 split to provide
      derivation and validation datasets. 20 article and
      journal features, including ratings of clinical relevance
      and newsworthiness, routinely collected by the McMaster
      online rating of evidence system, compared with citation
      counts at two years. The derivation analysis showed that
      the regression equation accounted for 60% of the
      variation (R2=0.60, 95% confidence interval 0.538 to
      0.629). This model applied to the validation dataset gave
      a similar prediction (R2=0.56, 0.476 to 0.596, shrinkage
      0.04; shrinkage measures how well the derived equation
      matches data from the validation dataset). Cited articles
      in the top half and top third were predicted with 83% and
      61% sensitivity and 72% and 82% specificity. Higher
      citations were predicted by indexing in numerous
      databases; number of authors; abstraction in synoptic
      journals; clinical relevance scores; number of cited
      references; and original, multicentred, and therapy
      articles from journals with a greater proportion of
      articles abstracted. Conclusion:  Citation counts can be
      reliably predicted at two years using data within three
      weeks of publication.
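
To make the derivation/validation logic in the abstract concrete, here is a
minimal sketch in Python. It is not the authors' code: the function names,
the ordinary least-squares fit via NumPy, and the synthetic data are all
assumptions for illustration. Shrinkage is taken here as the difference
between the derivation and validation R2 values, which is consistent with
the reported 0.60 - 0.56 = 0.04.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y accounted for by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_and_validate(X, y, derivation_fraction=0.6, seed=0):
    """Fit a linear regression on a random derivation subset and apply the
    same equation, unchanged, to the held-out validation subset.

    Returns (R2 on derivation set, R2 on validation set, shrinkage).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    order = rng.permutation(n)
    n_deriv = int(round(derivation_fraction * n))
    deriv, valid = order[:n_deriv], order[n_deriv:]

    A = np.column_stack([np.ones(n), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(A[deriv], y[deriv], rcond=None)

    r2_deriv = r_squared(y[deriv], A[deriv] @ coef)
    r2_valid = r_squared(y[valid], A[valid] @ coef)
    return r2_deriv, r2_valid, r2_deriv - r2_valid

def sensitivity_specificity(actual_high, predicted_high):
    """Sensitivity and specificity of a yes/no prediction, for example
    'this article will be in the top half by citations'."""
    actual_high = np.asarray(actual_high, dtype=bool)
    predicted_high = np.asarray(predicted_high, dtype=bool)
    tp = np.sum(actual_high & predicted_high)
    fn = np.sum(actual_high & ~predicted_high)
    tn = np.sum(~actual_high & ~predicted_high)
    fp = np.sum(~actual_high & predicted_high)
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic stand-in data (NOT the study's data): 1274 "articles",
# three made-up predictors, and a noisy linear outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(1274, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=1274)

r2_deriv, r2_valid, shrinkage = fit_and_validate(X, y)
print(f"derivation R2={r2_deriv:.2f}, validation R2={r2_valid:.2f}, "
      f"shrinkage={shrinkage:.2f}")
```

A small shrinkage means the equation derived on one subset holds up about
as well on articles it has not seen, which is what the abstract reports.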


This finding reinforces the importance of taking into account as many
predictor metrics as possible, though a number of the metrics do seem
specific to clinical medical articles. The (apparently already known)
high correlation with physician ratings for clinical relevance is a
variable specific to this field. (The metrics used are listed at the
end of this message.)

We might perhaps make a distinction between static and dynamic
metrics. This study was based largely on static metrics, in that they
are fixed as of the day of publication. Dynamic metrics like early
downloads (which have also been found to predict later citations)
were not included (the Perneger study was cited but the Brody et al
study was not), nor were early citation growth metrics (also
predictive of later citations).

      Perneger TV. Relation between online "hit counts" and
      subsequent citations: prospective study of research
      papers in the BMJ. BMJ
      2004;329:546-7. doi:10.1136/bmj.329.7465.546


      Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web
      Usage Statistics as Predictors of Later Citation Impact.
      Journal of the American Association for Information
      Science and Technology (JASIST) 57(8) pp. 1060-1072.
      http://eprints.ecs.soton.ac.uk/10713/
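
As a rough illustration of how a dynamic metric such as early downloads
might be checked against later citations, here is a minimal sketch. It is
not the method of either study cited above: the function name, the log
transform (chosen here only because count data tend to be skewed), and the
toy numbers are assumptions.

```python
import numpy as np

def log_correlation(early_downloads, later_citations):
    """Pearson correlation between log(1 + downloads) and
    log(1 + citations), one value per article."""
    x = np.log1p(np.asarray(early_downloads, dtype=float))
    y = np.log1p(np.asarray(later_citations, dtype=float))
    return float(np.corrcoef(x, y)[0, 1])

# Toy, made-up counts for five hypothetical articles.
downloads = [10, 40, 5, 120, 60]
citations = [1, 6, 0, 15, 4]
print(log_correlation(downloads, citations))
```

A static metric (fixed on the day of publication) could be passed to the
same function; the difference is only in when the predictor becomes
available.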


Journal impact factor was not included either, because it was not
available for a large number of journals in the sample.

To my mind, the article reinforces the importance of validating all
these metrics, not just against one another, but against peer
evaluations, in all fields, as in the RAE 2008 database:

      Harnad, S. (2007) Open Access Scientometrics and the UK
      Research Assessment Exercise. In Proceedings of 11th
      Annual Meeting of the International Society for
      Scientometrics and Informetrics 11(1), pp. 27-33, Madrid,
      Spain. Torres-Salinas, D. and Moed, H. F., Eds.
      http://eprints.ecs.soton.ac.uk/13804/


Stevan Harnad

----------------------------------------------------------------------------
      Predictor variable                                  Hypothesised influence

      Article specific from external sources:

      No of authors                                       More authors
      Residence of first author in North America          North America
      No of pages                                         Longer article
      No of references in bibliography                    More references
      No of participants                                  More participants
      Structured abstract                                 Structured abstracts
      Length of abstract                                  Longer
      Multicentre studies                                 If multicentred
      Original article rather than systematic review      If systematic review
      Dealing with therapy                                If therapy

      Article specific from internal sources:

      No of disciplines chosen relevant to article        More disciplines
        (breadth of interest)
      Average relevance scores over all raters            Higher scores
      Average newsworthiness scores over all raters       Higher scores
      Average time taken by raters to rate article        More time
      Whether article was selected for abstraction        If yes
        in 1 of 3 synoptic journals
      No of views per email alert sent                    More views per alert

      Journal specific using internal data:

      Proportion of articles that passed criteria (2005)  Higher proportion
      Proportion abstracted by 3 synoptic journals        Higher proportion

      Journal specific using external data:

      No of databases that index journal                  More databases

Received on Mon Apr 21 2008 - 12:52:03 BST
