(I know this is essentially whining, and that the bottom line is that I am entitled to a full refund of the money I paid to use GS, but it is frustrating when a tool falls short of its potential ...)
Scraping/automated uses
I can think of lots of cool little projects that I could do if I had access to an easy source of bibliographic/bibliometric information in a convenient form. For example:
- tracking the growth of citations to the R project, subdivided by discipline (and perhaps national affiliation of authors?)
- looking at changes over time in the use of key terms in ecology ("mutualism", "symbiosis", "facilitation", etc.), subdivided by ecosystem (marine/terrestrial/freshwater: it might be a little tricky to scrape this information ...); subdividing by national affiliation might be interesting here too, as well as (of course) by journal
- getting summary information on publications by the members of an academic department (over time, as a function of academic age, by impact factor ...)
- redoing some of the examples in Bolker et al. 2008, Trends in Ecology and Evolution (sorry, paywalled link), about the use of different GLMM methods in ecology
- analyzing citation counts/h-index etc. for an individual researcher (similar to ISI WoS's citation analysis)
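As a small illustration of the last item, here is a minimal sketch of the h-index calculation, assuming the per-paper citation counts have already been obtained somehow. The function name `h_index` and the sample counts are made up for illustration; nothing here talks to GS or WoS.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most to least cited
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one researcher's papers
counts = [25, 8, 5, 3, 3, 0]
print(h_index(counts))  # 3
```

With these made-up counts, three papers have at least three citations each, but there are not four papers with at least four, so the result is 3.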
I know that GS restricts scraping by limiting searches to (something like) 100 at a time; this can of course be overcome by searching in chunks (which could be done automatically). There are no explicitly stated terms of use for GS that I can find ...
Some existing scrapers and related discussion:
- http://code.google.com/p/google-scholar-perl/ (a GS scraper module for perl)
- http://code.activestate.com/recipes/523047-search-google-scholar/ (Python recipes/interfaces)
- http://www.academicproductivity.com/2009/google-scholar-api/ (general complaints, with a link to a PHP (?) scraper)
- http://codeandculture.wordpress.com/2010/05/13/cited-reference-search-time-series/ (Perl/curl/bash script)
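The "searching in chunks" idea mentioned above can be sketched as follows. This is only the bookkeeping part: it splits a desired number of results into (start, count) pairs. The chunk size of 100 is the "something like" figure from the text, not a confirmed limit, and the name `chunk_offsets` is hypothetical. How the pairs would map onto actual query parameters is left open, since no request is made here.

```python
def chunk_offsets(total_wanted, chunk_size=100):
    """Yield (start, count) pairs that together cover total_wanted results."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_wanted, chunk_size):
        # The last chunk may be shorter than chunk_size
        yield start, min(chunk_size, total_wanted - start)


# Assumed example: 250 results wanted, at most 100 per search
chunks = list(chunk_offsets(250))
print(chunks)  # [(0, 100), (100, 100), (200, 50)]
```

A scraper would then loop over these pairs, run one search per pair, and concatenate the results.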
Even more useful (and less of an opportunity for distracting side projects) would be a better/more useful interface. (Some of these capabilities may well be available already, but I haven't figured out how to use them. I would be very happy to have pointers in the right direction.) Typical things that I would like to be able to do in GS that I can't/don't know how to do:
- sort by date! (There is a restrict-by-date option, but not a sorting option)
- sort by other interesting fields (number of citations, journal, etc.)
- compact/"spreadsheet" view (title,authors,type,year,source ...) -- including export to spreadsheet or export to Zotero via little checkboxes in the margin ...
- restrict to peer-reviewed sources (and/or type (book chapter, conference proceedings, ...) and/or sort by type)
- easy "cited by" search, like WoS's "cited reference search"
- something akin to WoS's "author finder" (search by author name but restrict by affiliation and/or subject area: I know that pieces of this are available but not nearly as easily)
- I'd like to be able to get straight to an "Advanced search" window that looks like this, with the date restriction field easily accessible
- some way to restrict by journal that is smart about synonyms
- restrict by source (open access, e.g. JSTOR/PubMed; journals available from my library ...)
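If the results were available as plain records, the sorting and "spreadsheet" items in the list above would be easy to do locally. Here is a rough sketch under that assumption. The records, their field names, and the helper names `sort_records` and `to_csv` are all invented placeholders, not anything GS provides.

```python
import csv
import io

# Hypothetical records; the fields mirror the "spreadsheet" columns listed above
records = [
    {"title": "Paper A", "authors": "Author One", "type": "article",
     "year": 2005, "source": "Journal X", "citations": 40},
    {"title": "Paper B", "authors": "Author Two", "type": "book chapter",
     "year": 2010, "source": "Book Y", "citations": 3},
    {"title": "Paper C", "authors": "Author Three", "type": "article",
     "year": 2008, "source": "Journal Z", "citations": 17},
]


def sort_records(records, key, descending=True):
    """Return a new list of records sorted by the given field."""
    return sorted(records, key=lambda r: r[key], reverse=descending)


def to_csv(records, fields):
    """Render the chosen fields as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


fields = ["title", "authors", "type", "year", "source"]
newest_first = sort_records(records, "year")
most_cited_first = sort_records(records, "citations")
print(to_csv(newest_first, fields))
```

The same `sort_records` call handles date, citation count, or journal, which is all the "sort by" wishes really ask for.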
I found a link that might be useful for this. You have to scroll past some stuff about taking advice to get to the real content, but it's there:
http://scienceblogs.com/gregladen/2011/02/how_to_mine_data_from_the_inte.php#more
Have a look at rOpenSci, especially the Rmendeley package. http://ropensci.org/project-overview/
New links:
https://bitbucket.org/fccoelho/scholarscrap/changeset/b5020c74d233#chg-scholar/scholar/spiders/scholar_spyder.py
asociologist.com/2012/01/02/google-scholar-scraper/
and http://stackoverflow.com/questions/7523961/google-scholar-with-matlab/