Sunday, November 13, 2011

Google Scholar (still) sucks

(This is a follow-up to my previous post on the topic.)

I was encouraged by the appearance of two R-based Scholar-scrapers within a week of each other. One, by Kay Cichini, converts the result pages to text mode and scrapes from there (there's a slightly hacked version by Tony Breyal on github). The other, by Tony Breyal (github version here), uses XPath.

I started poking around with these functions -- they each do some things I like and have some limitations.
  • Cichini's version:
    • is based on plain old text-scraping, which is easy for me to understand.
    • has a nice loop for fetching multiple pages of results automatically.
    • has (to me) a silly output format -- her code automatically generates a word cloud, and can dump a CSV file to disk if requested. It would be easy, and would make more sense, to break this up into separate functions: a scraper that returns a data frame, and a word-cloud creator that accepts a data frame as input ...



  • Breyal's version:
    • is based on XPath, which seems more magical to me but is probably more robust in the long run
    • extracts numbers of citations
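The split suggested above (a scraper that returns a data frame, with the word cloud as a separate step) might look something like this sketch. `scrape_gs` here is a hypothetical placeholder for either author's actual scraping code -- it just returns canned results -- and the word-frequency step uses only base R rather than the wordcloud package:

```r
## Sketch of the proposed refactoring: the scraper and the summary are
## separate functions connected by a plain data frame.
## scrape_gs() is a hypothetical stand-in for the real scraping code;
## here it just returns canned example results.
scrape_gs <- function(query, pages = 1) {
  data.frame(title = c("Ecology of foo", "Foo and bar dynamics"),
             year  = c(2009, 2010),
             cites = c(12, 3),
             stringsAsFactors = FALSE)
}

## Word frequencies from the titles: the input is any data frame with
## a 'title' column, so it no longer cares where the data came from.
title_word_freq <- function(df) {
  words <- tolower(unlist(strsplit(df$title, "[^[:alpha:]]+")))
  sort(table(words[nchar(words) > 2]), decreasing = TRUE)
}

freq <- title_word_freq(scrape_gs("ecology"))
## 'freq' could then be fed to wordcloud::wordcloud(), dumped with
## write.csv(), etc.
```

The point of the design is just that the data frame becomes the interface: any scraper that produces one can be combined with any summary that consumes one.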
Neither of them does what I really want, which is to extract the full bibliographic information. However, when I looked more closely at what GS actually gives you, I got frustrated again. The full title is available, but the rest of the bibliographic information comes only in a severely truncated form; the author list and the publication (source) title are both cut short if they are too long (!!: e.g. check out this search).

Since the "save to [reference manager]" links are available on the page (e.g. this link to BibTeX information: see these instructions on setting a fake cookie), one could in principle go and visit them all, but ... this is where we run into trouble. Google Scholar's robots.txt file contains the line Disallow: /scholar, which according to the robot-exclusion protocol means that we're not allowed to use a script to visit links starting with http://scholar.google.ca/scholar.bib... as in the example above. Google Scholar does block IP addresses that make too many rapid queries (this is mentioned on the GS help page, and on the aforementioned Python scraper page). It would be easy to circumvent this by pausing appropriately between retrievals, but I'm not comfortable with writing general-purpose code to do that.

So: Google Scholar offers a reduced amount of information on the pages it returns, and prohibits us from spidering those pages to retrieve the full bibliographic information. Argh.

As a side effect of this, I did take a quick look for existing bibliographic information-handling packages in R (with sos::findFn("bibliograph*")) and found:
  • CITAN: a Scopus-centric package that uses a SQLite backend and does heavy-duty bibliometric analysis (h-indices, etc.)
  • RISmed: PubMed-centric; defines a Reference class (sensible, but geared pretty narrowly toward article-type references) and imports RIS format (a common tagged format used by ISI and others)
  • ris: a similar (?) package without the PubMed interface
  • bibtex: parses BibTeX files
  • RMendeley from the ROpenSci project
So: there's a little more infrastructure out there, but nothing (it seems) that will do what I want without breaking or bending rules.
  • ISI is big and evil and explicitly disallows scripted access.
  • PubMed doesn't cover ecology as well as I'd like.
  • I might be able to use Scopus but would prefer something Open (this is precisely why GS's cripplage annoys me so much).
  • Mendeley is nice, and perhaps has most of what I really want, but ideally I would prefer something with systematic coverage [my understanding is that the Mendeley databases would have everything that everyone has bothered to include in their personal databases ...]
  • I wonder if JSTOR would like to play ... ?
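Incidentally, the robots.txt check discussed above is easy to mechanize. This sketch does only naive prefix matching on Disallow: lines -- not the full robot-exclusion protocol, which also has user-agent sections, Allow: rules, and wildcards:

```r
## Naive robots.txt check: does any "Disallow:" prefix match the path?
## A sketch only -- it ignores Allow: lines, wildcards, and per-agent
## sections, so it is not a full robot-exclusion-protocol parser.
is_disallowed <- function(path, robots_txt) {
  lines <- strsplit(robots_txt, "\n")[[1]]
  dis <- sub("^Disallow:[[:space:]]*", "",
             grep("^Disallow:", lines, value = TRUE))
  dis <- dis[nzchar(dis)]
  any(vapply(dis, function(d) startsWith(path, d), logical(1)))
}

robots <- "User-agent: *\nDisallow: /scholar"
is_disallowed("/scholar.bib", robots)   ## TRUE: scripts should stay away
is_disallowed("/citations", robots)     ## FALSE
```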
If anyone's feeling really bored, here are the features I'd like:
  • scrape or otherwise save information to a variety of useful fields (author, date, source title, title, keywords, abstract?)
  • save/identify various types (e.g. article/book chapter etc.)
  • allow dump to CSV file
  • citation information would be cool -- e.g. to generate co-citation graphs -- but might get big
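A minimal version of that wish list is really just a data frame schema plus write.csv(); the field names here are my guesses at what would be useful, not any standard:

```r
## Hypothetical schema for scraped references: one row per item, with
## the fields from the wish list above (names are illustrative only).
empty_refs <- function() {
  data.frame(author   = character(0),
             date     = character(0),
             source   = character(0),   # journal / book title
             title    = character(0),
             type     = character(0),   # article, book chapter, ...
             keywords = character(0),
             abstract = character(0),
             stringsAsFactors = FALSE)
}

## Add one made-up record, then dump to CSV.
refs <- rbind(empty_refs(),
              data.frame(author = "Doe, J.", date = "2011",
                         source = "Journal of Examples",
                         title = "An example article",
                         type = "article", keywords = "example",
                         abstract = NA_character_,
                         stringsAsFactors = FALSE))
write.csv(refs, file.path(tempdir(), "refs.csv"), row.names = FALSE)
```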

I wonder if it's worth complaining to Google?

Monday, May 2, 2011

Unicode symbols in R

A friend asked me this morning if there was a way to plot a symbol in R (as a plotting character) representing a half-filled circle. I didn't know, but I figured it out (perhaps it's demonstrated elsewhere -- the ability to use Unicode symbols was added in 2008 or so -- but I didn't stumble across it). First, looking at this list of Unicode shapes indicated that I wanted Unicode symbol 25D1. Then looking at ?points indicated that I could use a negative value to select a Unicode code point (in this case -0x25D1L: the 0x prefix lets me enter the value as hexadecimal, and the L denotes an integer constant). So

plot(1,1,pch=-0x25D1L)
plot(1,1,pch=-as.hexmode("25D1"))
plot(1,1,pch=-9681L)

all work equivalently. Here's a little function for displaying a range of Unicode symbols, to help find interesting ones:

TestUnicode <- function(start="25a0", end="25ff", ...)
  {
    ## convert hex strings (or numbers) to code-point values
    nstart <- as.hexmode(start)
    nend <- as.hexmode(end)
    r <- nstart:nend               ## full range of code points to draw
    s <- ceiling(sqrt(length(r)))  ## side length of a square grid
    par(pty="s")                   ## square plotting region
    plot(c(-1,(s)), c(-1,(s)), type="n", xlab="", ylab="",
         xaxs="i", yaxs="i")
    grid(s+1, s+1, lty=1)
    ## negative pch values select Unicode code points; try() keeps
    ## going if a particular symbol fails on the current device
    for(i in seq(r)) {
      try(points(i%%s, i%/%s, pch=-1*r[i],...))
    }
  }

TestUnicode()
TestUnicode(9500,9900)  ## some cool spooky stuff in here!
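Base R can convert between the decimal and hexadecimal forms of these code points, which helps when reading mixed lists like the one in the next paragraph:

```r
## decimal <-> hexadecimal code-point conversion in base R
strtoi("25D1", base = 16L)   ## 9681, the half-filled circle above
sprintf("%X", 9749)          ## "2615" (HOT BEVERAGE)
```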
One thing to keep in mind is that you should carefully test whatever symbols you decide to use with whatever graphics device/display/printing setup you plan to use, as not all platforms render all Unicode symbols properly.

With a little more work I could change TestUnicode() to do proper indexing, so that it would be easier to figure out which symbol was which.

Watch for my next paper, in which I will use Unicode symbols 9748/x2614 ('UMBRELLA WITH RAIN DROPS'), 9749/x2615 ('HOT BEVERAGE'), 9763/x2623 ('BIOHAZARD SIGN'), and 9764/x2624 ('CADUCEUS') to represent my data ...

PS This worked fine on my primary 'machine' (Ubuntu 10.04 under VMWare on MacOS X.6), but under MacOS X.6 most of the symbols were not resolved. The friend for whom I worked this out has also stated that it didn't work under his (unstated) Linux distribution ... feel free to post in comments below if this works on your particular machine/OS combination. There is a remote possibility that this could be done with Hershey fonts as well (see this page on the R wiki for further attempts at symbol plotting), but I don't know how thorough the correspondence is between the Hershey fonts and the Unicode symbol set ...

PPS I asked about this on StackOverflow and got a useful answer from Gavin Simpson, referencing some notes by Paul Murrell: use cairo_pdf. This should work on any Linux installation with the Pango libraries, I think. In principle it could work on MacOS (and/or Windows?) with Pango installed as well, but I haven't tried ...

Friday, April 8, 2011

BUGS and related

(People who aren't interested in automated Markov chain Monte Carlo sampling, or have no idea what it is, should probably stop reading now.)

Got an e-mail from a student trying to get the WinBUGS examples in Chapter 7 of the book to work, on MacOS under WINE.  Here's what I told her:

My advice, unless you really need WinBUGS, would be to switch to JAGS: there are some cases where WinBUGS is faster, and a few things that WinBUGS does that JAGS doesn't, but on the whole JAGS is very good. In fact, I have mostly been using JAGS rather than WinBUGS these days, because I'm now mostly working in Linux (Ubuntu) under VMWare on a Mac -- I couldn't face installing WINE under VMWare (and hence dealing with two levels of virtualization ...).
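For anyone making the switch, a minimal JAGS session from R looks something like the following sketch. The model is just a toy normal-mean example (nothing from the book), and the rjags calls are shown commented out since they need the JAGS library installed:

```r
## A toy JAGS model (normal likelihood, vague priors on the mean and
## precision), written to a file the way jags.model() expects.
model_string <- "
model {
  for (i in 1:N) { y[i] ~ dnorm(mu, tau) }
  mu  ~ dnorm(0, 0.0001)
  tau ~ dgamma(0.001, 0.001)
}"
model_file <- file.path(tempdir(), "toymodel.bug")
writeLines(model_string, model_file)

## The actual fitting step (requires JAGS + the rjags package):
## library(rjags)
## m <- jags.model(model_file, data = list(y = rnorm(20, 5), N = 20),
##                 n.chains = 2)
## update(m, 1000)                                  ## burn-in
## samp <- coda.samples(m, c("mu", "tau"), n.iter = 5000)
## summary(samp)   ## coda summary methods, as with WinBUGS output
```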


Despite my generally good feelings about the idea of BUGS (a useful general-purpose language for expressing multilevel models, with several independently implemented backends for automatically generating computational samplers from specified models: I say this in my ESA forum paper [paywalled link; contact me if you want it and can't get it]), I find the BUGS/R ecosystem a bit of a mess.

  • WinBUGS is old and very stable, and has a few nice extensions like GeoBUGS (don't know if these exist in OpenBUGS yet ... ?), but (1) it is free ("gratis") but not open source ("libre"), and (2) it only runs under WINE on non-Windows platforms
  • OpenBUGS is newer, shinier, and more open, but (1) it doesn't run on the Mac (at least, the home page says "for Windows and Linux personel [sic] computers"), and (2) it's not clear (to me at least) how to run it from within R on Linux (again from the home page, "At present the BRugs R functions do not work under Linux"). After some more poking around, I see that there is some recently developed stuff that looks like it will make BRugs and OpenBUGS work on Linux as well as Windows (but not Mac??) -- but this is still sort of in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard'. (This will supposedly, eventually, make it to CRAN -- Emmanuel Charpentier posted to the R list a while ago mentioning this stuff ...)
  • the R interfaces are somewhat hard to navigate, too. coda is old and stable and fairly straightforward, but figuring out the various versions of plots and fitted objects that come out of R2WinBUGS, R2jags, rjags (the latter two are different!), and arm is a bit tricky.
Someone (other than me!) ought to spend some time documenting this and cleaning it up ...

Tuesday, February 15, 2011

Does Google Scholar suck or am I just bad at it?

I would love to use Google Scholar [GS] more effectively, both for cool bibliometric stuff (which Thomson/Reuters/ISI/Web of Knowledge/Web of Science doesn't want me to do) and for routine searches (for which I find the WoS interface much easier to use). Thomson/Reuters seems like a big nasty information-monopolizing company (their lawsuit against Zotero for allowing the conversion of EndNote to Zotero styles is a case in point). It's not clear that Google isn't such a company too, but for now they seem like a better choice on philosophical grounds ... at least they claim to be trying Not to be Evil ...

(I know this is essentially whining, and that the bottom line is that I am entitled to a full refund of the money I paid to use GS, but it is frustrating when a tool falls short of its potential ...)

Scraping/automated uses

I can think of lots of cool little projects that I could do if I had access to an easy source of bibliographic/bibliometric information in a convenient form. For example:
  • tracking the growth of citations to the R project, subdivided by discipline (and perhaps national affiliation of authors?)
  • looking at changes over time in the use of key terms in ecology ("mutualism", "symbiosis", "facilitation", etc.), subdivided by ecosystem (marine/terrestrial/freshwater: it might be a little tricky to scrape this information ...); subdividing by national affiliation might be interesting here too, as well as (of course) by journal
  • getting summary information on publications by the members of an academic department (over time, as a function of academic age, by impact factor ...)
  • redoing some of the examples in Bolker et al. 2008 Trends in Ecology and Evolution (sorry, paywalled link) about the use of different GLMM methods in ecology
  • analyzing citation counts/h-index etc. for an individual researcher (similar to ISI WoS's citation analysis)
Unfortunately, there is no really easy way (that I know of) to scrape this information from Google. There is much moaning on the Web about Google's failure to provide an API for web search, which leaves people writing custom scrapers in Python, Perl, etc. (none so far specifically in R; I guess if I really wanted to do this I could write an interface to one of these scrapers). I vaguely recall that some of the moaners speculate that Google withholds an API as a condition of its agreements with publishers (i.e., not making the information too open?).
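The h-index analysis in the last item above, at least, is trivial once you have the citation counts; a minimal sketch in base R:

```r
## h-index: the largest h such that h papers have >= h citations each.
## With counts sorted in decreasing order, the comparison against
## 1, 2, 3, ... is TRUE exactly h times.
h_index <- function(cites) {
  cites <- sort(cites, decreasing = TRUE)
  sum(cites >= seq_along(cites))
}

h_index(c(10, 8, 5, 4, 3))   ## 4: four papers with >= 4 citations each
h_index(c(25, 8, 5, 3, 3))   ## 3
```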

I know that GS restricts scraping by limiting searches to (something like) 100 at a time; this can of course be overcome by searching in chunks (which could be done automatically). There are no terms of use stated explicitly for GS, that I can find ...
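For what it's worth, the chunking itself is a one-liner. This sketch assumes GS pages its results with a start= URL parameter and ten results per page (my reading of its URL scheme, which may change), and deliberately builds the URLs without fetching anything, given the rate-limiting and robots.txt concerns above:

```r
## Build paginated Google Scholar query URLs (no fetching done here).
## The start= parameter and 10-per-page default are assumptions about
## GS's URL scheme; 'q' is the search term.
gs_page_urls <- function(q, nresults = 100, per_page = 10) {
  starts <- seq(0, nresults - 1, by = per_page)
  sprintf("http://scholar.google.com/scholar?q=%s&start=%d",
          utils::URLencode(q, reserved = TRUE), starts)
}

urls <- gs_page_urls("zooplankton mutualism", nresults = 30)
length(urls)   ## 3
```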
Interactive use

Even more useful (and less of an opportunity for distracting side projects) would be a better/more useful interface. (Some of these capabilities may well be available already, but I haven't figured out how to use them.  I would be very happy to have pointers in the right direction.) Typical things that I would like to be able to do in GS that I can't/don't know how to do:
  • sort by date! (There is a restrict by date option, but not a sorting option)
  • sort by other interesting fields (number of citations, journal, etc.)
  • compact/"spreadsheet" view (title,authors,type,year,source ...) -- including export to spreadsheet or export to Zotero via little checkboxes in the margin ...
  • restrict to peer-reviewed sources (and/or type (book chapter, conference proceedings, ...) and/or sort by type)
  • easy "cited by" search, like WoS's "cited reference search"
  • something akin to WoS's "author finder" (search by author name but restrict by affiliation and/or subject area: I know that pieces of this are available but not nearly as easily)
  • I'd like to be able to get straight to an "Advanced search" window that looks like this, with the date restriction field easily accessible
  • some way to restrict by journal that is smart about synonyms
  • restrict by source (open access, e.g. JSTOR/PubMed; journals available from my library ...)
Presumably a lot of this interface could be piggybacked on what GS already offers, by someone who was skilled in Perl/PHP etc.
 

Sunday, February 6, 2011

Post #1

I have been persuaded (perhaps wrongly) that it would be a good idea if I had a blog, even for extremely random/sporadic postings. We'll see what happens: I have never been a good journal-keeper. I had a pseudo-blog going at the website for my book, http://emdbolker.wikidot.com/blog, for a while, but that had a few annoying formatting issues and was hard for anyone to find.

I wanted to call this "commonplace", after commonplace books, which I have always thought were a neat idea. commonplace.blogspot.com was already taken; ominously, I see that the owner of that blog posted once in October 2000 and never again (!)

Things I hope to post here: thoughts about ecology/evolution/epidemiology, statistics, R (I don't know how well the Google blog format will handle technical things like code fragments ...), and (for fun) about words, or folk music and dance, or other stuff that I spend my time thinking about.

Enjoy.