Sunday, November 13, 2011

Google Scholar (still) sucks

(This is a follow-up to my previous post on the topic.)

I was encouraged by the appearance of two R-based Scholar-scrapers within a week of each other. One, by Kay Cichini, converts the pages to text and scrapes from there (there's a slightly hacked version by Tony Breyal on github). The other, by Tony Breyal (github version here), uses XPath.

I started poking around with these functions -- they each do some things I like and have some limitations.
  • Cichini's version:
    • is based on plain old text-scraping, which is easy for me to understand.
    • has a nice loop for fetching multiple pages of results automatically.
    • has (to me) a silly output format -- her code automatically generates a word cloud, and can dump a csv file to disk if requested. It would be easy, and would make more sense, to break this up into separate functions: a scraper that returns a data frame and a word-cloud creator that accepts a data frame as input (see the sketch after this list) ...



  • Breyal's version:
    • is based on XPath, which seems more magical to me but is probably more robust in the long run.
    • extracts the number of citations.
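
As a rough sketch of what I mean by splitting things up: the scraper (whichever implementation) would just return a plain data frame, and the word cloud would come from a separate function that takes such a data frame. Nothing below is either author's actual code -- the function name scrape_scholar and the 'title' column are assumptions, and the word-counting is deliberately crude.

    library(wordcloud)   ## Ian Fellows' wordcloud package

    ## hypothetical scraper interface: Cichini's or Breyal's code would live in
    ## scrape_scholar(query), which returns a data frame with a 'title' column

    ## separate step: build a word cloud from whatever data frame the scraper returns
    gs_wordcloud <- function(df, min.freq = 2) {
        words <- tolower(unlist(strsplit(df$title, "[^[:alpha:]]+")))
        words <- words[nchar(words) > 3]          ## very crude stop-word filter
        freq  <- sort(table(words), decreasing = TRUE)
        wordcloud(names(freq), as.numeric(freq), min.freq = min.freq)
        invisible(freq)
    }

    ## e.g.:  gs_wordcloud(scrape_scholar("metapopulation synchrony"))
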
Neither of them does what I really want, which is to extract the full bibliographic information. When I looked more closely at what GS actually gives you, though, I got frustrated again. The full title is available, but the rest of the bibliographic information is severely truncated: the author list and the publication (source) title are both cut off if they are too long (e.g. check out this search).

Since the "save to [reference manager]" links are available on the page (e.g. this link to BibTeX information: see these instructions on setting a fake cookie), one could in principle go and visit them all -- but this is where we run into trouble. Google Scholar's robots.txt file contains the line Disallow: /scholar, which under the robot-exclusion protocol means that scripts are not allowed to visit links starting with http://scholar.google.ca/scholar.bib... as in the example above. (I sketch a quick R check of these rules a bit further down.) Google Scholar also blocks IP addresses that make too many rapid queries (this is mentioned on the GS help page, and on the aforementioned Python scraper page). It would be easy to circumvent that by pausing appropriately between retrievals, but I'm not comfortable writing general-purpose code to do so.

So: Google Scholar offers a reduced amount of information on the page it returns, and prohibits us from spidering the pages that would give the full bibliographic information. Argh.

As a side effect of this, I did take a quick look for existing bibliographic-information-handling packages in R (with sos::findFn("bibliograph*")) and found:
  • CITAN: a Scopus-centric package that uses a SQLite backend and does heavy-duty bibliometric analysis (h-indices, etc.)
  • RISmed: PubMed-centric; defines a Reference class (seems sensible, but geared pretty narrowly towards article-type references) and imports RIS format (a common tagged format used by ISI and others)
  • ris: a similar (?) package without the PubMed interface
  • bibtex: parses BibTeX files
  • RMendeley from the ROpenSci project
So: there's a little more infrastructure out there, but nothing (it seems) that will do what I want without breaking or bending rules.
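
(Speaking of those rules: here's roughly what checking them from R looks like. This is just a crude prefix match against the Disallow lines, not a full robot-exclusion-protocol parser, and the exact URL will vary with your country domain.)

    ## fetch Google Scholar's robots.txt and pull out the Disallow rules
    robots <- readLines("http://scholar.google.com/robots.txt", warn = FALSE)
    rules  <- sub("^Disallow:[[:space:]]*", "",
                  grep("^Disallow:", robots, value = TRUE))
    rules  <- rules[nzchar(rules)]     ## drop empty (allow-everything) entries

    ## is a given path covered by any Disallow rule?
    path_blocked <- function(path) any(substring(path, 1, nchar(rules)) == rules)
    path_blocked("/scholar.bib")       ## TRUE -- covered by "Disallow: /scholar"
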
  • ISI is big and evil and explicitly disallows scripted access.
  • PubMed doesn't cover ecology as well as I'd like.
  • I might be able to use Scopus but would prefer something Open (this is precisely why GS's cripplage annoys me so much).
  • Mendeley is nice, and perhaps has most of what I really want, but ideally I would prefer something with systematic coverage [my understanding is that the Mendeley database contains only whatever users have bothered to add to their personal libraries ...]
  • I wonder if JSTOR would like to play ... ?
If anyone's feeling really bored, here are the features I'd like (a rough sketch of the output structure follows the list):
  • scrape or otherwise save information in a variety of useful fields (author, date, source title, title, keywords, abstract?)
  • save/identify various types (e.g. article/book chapter etc.)
  • allow dump to CSV file
  • citation information would be cool -- e.g. to generate co-citation graphs -- but might get big
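
For concreteness, here's the sort of output structure I have in mind; the column names below are my own guesses, not any existing package's API.

    ## skeleton of the record the scraper would return, one row per reference
    empty_biblio <- function() {
        data.frame(author   = character(0),
                   date     = character(0),
                   source   = character(0),   ## journal / book title
                   title    = character(0),
                   type     = character(0),   ## "article", "book chapter", ...
                   keywords = character(0),
                   abstract = character(0),
                   cites    = integer(0),     ## citation counts, if available
                   stringsAsFactors = FALSE)
    }
    ## dumping to CSV is then one line:
    ##   write.csv(results, file = "scholar_results.csv", row.names = FALSE)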

I wonder if it's worth complaining to Google?