Tuesday, February 15, 2011

Does Google Scholar suck or am I just bad at it?

I would love to use Google Scholar [GS] more effectively, both for cool bibliometric stuff (which Thomson/Reuters/ISI/Web of Knowledge/Web of Science doesn't want me to do) and for routine searches (for which I find the WoS interface much easier to use). Thomson/Reuters seems like a big nasty information-monopolizing company (their lawsuit against Zotero for allowing the conversion of EndNote styles to Zotero styles is a case in point). It's not clear that Google isn't one too, but for now they seem like a better choice on philosophical grounds ... at least they claim to be trying Not to be Evil ...

(I know this is essentially whining, and that the bottom line is that I am entitled to a full refund of the money I paid to use GS, but it is frustrating when a tool falls short of its potential ...)

Scraping/automated uses

I can think of lots of cool little projects that I could do if I had access to an easy source of bibliographic/bibliometric information in a convenient form. For example:
  • tracking the growth of citations to the R project, subdivided by discipline (and perhaps national affiliation of authors?)
  • looking at changes over time in use of key terms in ecology ("mutualism", "symbiosis", "facilitation", etc.), subdivided by ecosystem (marine/terrestrial/freshwater: it might be a little tricky to scrape this information ...); subdividing by national affiliation might be interesting here too, as well as (of course) by journal
  • getting summary information on publications by the members of an academic department (over time, as a function of academic age, by impact factor ...)
  • redoing some of the examples in Bolker et al. 2008 Trends in Ecology and Evolution (sorry, paywalled link) about use of different GLMM methods in ecology
  • analyzing citation counts/h-index etc. for an individual researcher (similar to ISI WoS's citation analysis)
Unfortunately, there is no really easy way (that I know of) to scrape this information from Google. There is much moaning on the Web about Google's failure to provide an API for web search, which leaves people writing custom scrapers in Python, Perl, etc. (none so far specifically in R; I guess if I really wanted to do this I could write an interface to one of these scrapers). I vaguely recall that some of the moaners speculate that Google withholds an API as a condition of its agreements with publishers (i.e., so as not to make the information too open?).
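
For what it's worth, the analysis end of these projects is the easy part once the citation counts are in hand. Here is a minimal sketch in Python of the h-index calculation mentioned in the list above (the largest h such that h papers have at least h citations each). The function name and the sample counts are made up for illustration; nothing here talks to GS or WoS.

```python
from typing import Iterable


def h_index(citation_counts: Iterable[int]) -> int:
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break  # counts only go down from here, so h cannot grow further
    return h


if __name__ == "__main__":
    # Made-up citation counts for one hypothetical researcher.
    print(h_index([10, 8, 5, 4, 3]))  # prints 4
```

Getting the counts in the first place is the part that still needs a scraper or an export from somewhere.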

I know that GS restricts scraping by limiting searches to (something like) 100 at a time; this can of course be overcome by searching in chunks (which could be done automatically). There are no terms of use stated explicitly for GS, that I can find ...
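
The "searching in chunks" idea might look something like the sketch below. This is only the bookkeeping: `fetch_page` is a hypothetical callable that you would have to supply (GS has no official API that I know of), the limit of 100 is my "something like" guess from above, and the pause between requests is just a politeness assumption.

```python
import time
from typing import Callable, Iterator, List, Tuple


def chunk_offsets(total: int, chunk_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (start, count) pairs that cover `total` results in chunks."""
    if total < 0 or chunk_size <= 0:
        raise ValueError("total must be >= 0 and chunk_size must be > 0")
    for start in range(0, total, chunk_size):
        yield start, min(chunk_size, total - start)


def fetch_all(
    fetch_page: Callable[[int, int], List[dict]],
    total: int,
    chunk_size: int = 100,
    pause: float = 0.0,
) -> List[dict]:
    """Collect up to `total` results by calling fetch_page(start, count) repeatedly.

    `fetch_page` is a placeholder for whatever scraper or service is used.
    """
    results: List[dict] = []
    for start, count in chunk_offsets(total, chunk_size):
        page = fetch_page(start, count)
        results.extend(page)
        if len(page) < count:
            break  # the source ran out of results early
        time.sleep(pause)  # wait between requests
    return results
```

For example, `list(chunk_offsets(250))` gives `[(0, 100), (100, 100), (200, 50)]`, i.e. three requests to cover 250 results.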

Interactive use

Even more useful (and less of an opportunity for distracting side projects) would be a better/more useful interface. (Some of these capabilities may well be available already, but I haven't figured out how to use them.  I would be very happy to have pointers in the right direction.) Typical things that I would like to be able to do in GS that I can't/don't know how to do:
  • sort by date! (There is a restrict by date option, but not a sorting option)
  • sort by other interesting fields (number of citations, journal, etc.)
  • compact/"spreadsheet" view (title,authors,type,year,source ...) -- including export to spreadsheet or export to Zotero via little checkboxes in the margin ...
  • restrict to peer-reviewed sources (and/or type (book chapter, conference proceedings, ...) and/or sort by type)
  • easy "cited by" search, like WoS's "cited reference search"
  • something akin to WoS's "author finder" (search by author name but restrict by affiliation and/or subject area: I know that pieces of this are available but not nearly as easily)
  • I'd like to be able to get straight to an "Advanced search" window that looks like this, with the date restriction field easily accessible
  • some way to restrict by journal that is smart about synonyms
  • restrict by source (open access, e.g. JSTOR/PubMed; journals available from my library ...)
Presumably a lot of this interface could be piggybacked on what GS already offers, by someone who was skilled in Perl/PHP etc.
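
To make the sorting and "spreadsheet" wishes above a bit more concrete, here is a rough sketch in Python (rather than Perl/PHP) of what the post-processing could look like once search results have somehow been parsed into records. The `Record` fields follow the columns I listed (title, authors, type, year, source) plus a citation count; the class, the function names, and the assumption that results are already parsed are all mine, not anything GS provides.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class Record:
    """One search result; field names are illustrative, not a GS format."""
    title: str
    authors: str
    pub_type: str
    year: int
    source: str
    citations: int


def sort_records(records: List[Record], key: str = "year",
                 descending: bool = True) -> List[Record]:
    """Return a new list sorted by any field name, e.g. 'year' or 'citations'."""
    return sorted(records, key=lambda r: getattr(r, key), reverse=descending)


def to_csv(records: List[Record]) -> str:
    """Render records as CSV text with a header row, ready for a spreadsheet."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(Record)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder records, not real publications.
    demo = [
        Record("Paper A", "Author One", "article", 2005, "Journal X", 12),
        Record("Paper B", "Author Two", "book chapter", 2010, "Book Y", 3),
    ]
    print(to_csv(sort_records(demo, key="citations")))
```

Sorting by date is then just `sort_records(results, key="year")`, and restricting by type is a one-line filter on `pub_type`.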
 

      4 comments:

      1. I found a link that might be useful for this. You have to scroll past some stuff about taking advice to get to the real content, but it's there:
        http://scienceblogs.com/gregladen/2011/02/how_to_mine_data_from_the_inte.php#more

      2. Have a look at rOpenSci, especially the Rmendeley package. http://ropensci.org/project-overview/

      3. New links:

        https://bitbucket.org/fccoelho/scholarscrap/changeset/b5020c74d233#chg-scholar/scholar/spiders/scholar_spyder.py

        asociologist.com/2012/01/02/google-scholar-scraper/

      4. and http://stackoverflow.com/questions/7523961/google-scholar-with-matlab/
