How Scholarly is Google Scholar? A Comparison of Google Scholar to Library Databases
American Library Association Annual Conference
Jared Howland, Rebecca Boughan and Tom Wright
Given June 2008 in Anaheim, CA
- 7 Subject specialists from 3 disciplines:
- 3 from sciences
- 2 from humanities
- 2 from social sciences
- Blind to purpose of study
- Asked them to give us 3 things…
- A question they would typically receive from a student (humanities)
- A structured query they would use to search a database
- The database they would use to search for that question
- This is what things looked like after we got all the information back from the librarians
- Then we took that information and used it in 2 ways.
- The first was to actually run the search query in the suggested database.
- We put the first 30 citations into a bibliographic citation manager and saved all of the actual full text
- We chose 30 because usability studies (Jakob Nielsen) tell us that fewer than 1% of users ever go beyond the third page of results, and very few people ever change the defaults (i.e., once they run a search they stick with it, success or failure).
- Most of our DBs present 10 results per page, so 30 results should be a large enough sample to represent the result set the majority of our users will actually see after performing a search.
- We ran the same query in Google Scholar and again saved the results in a bibliographic manager.
- We used Zotero to quickly export all of the results.
- We also saved the full text of each citation for later use in our study.
- So, the first searches we ran in the native DBs and GS were for the query given to us by the librarian
- The second set of searches we ran was to see if the citations we found in the DB were available in GS and vice versa
- Here is the same screenshot we saw just a minute ago.
- We took the bibliographic information for each citation and searched for the citation within Google Scholar.
- We then did the same thing in reverse.
- We took the 30 results from GS and searched for each citation within the database
- This allowed us to later calculate something we called “exclusivity”
- We put each citation into 1 of 3 possible “exclusivity” categories: found only in GS, found only in the database, or found in both
- This slide shows the proportion of citations within our study that overlap. As you can see, on average GS not only had a larger result set overall but also had more citations available exclusively than the databases did.
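- Below is a minimal Python sketch of how this kind of overlap/exclusivity tally can be computed; the citation keys and counts are made up for illustration and are not data from the study.

```python
# Hypothetical citation keys for one search query; in the study these were the
# bibliographic records exported from Zotero for the DB search and the GS search.
gs_results = {"smith2006", "jones2007", "lee2005", "chan2004"}
db_results = {"jones2007", "lee2005", "patel2003"}

# The three "exclusivity" categories: found only in Google Scholar,
# found only in the library database, or found in both.
only_gs = gs_results - db_results
only_db = db_results - gs_results
in_both = gs_results & db_results

total = len(gs_results | db_results)
for label, group in [("GS only", only_gs), ("DB only", only_db), ("Both", in_both)]:
    print(f"{label}: {len(group)} ({len(group) / total:.0%} of unique citations)")
```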
- So now we had the citations from the database and the citations from Google Scholar. We used the bibliographic manager to generate a list of references that we input into an Excel spreadsheet. Then, using a random number table, we completely randomized the order of the citations for each subject specialist.
- Finally, to deliver the content to the librarians in the way that would be easiest for them to evaluate, we saved the full text of each citation according to its randomly assigned citation number. Then we used Excel to create hyperlinks to the full text of each citation and delivered this list, along with the full text, on a CD to the subject librarians. We asked them to evaluate each citation using a rubric, which we provided in hard copy. As you can see, the subject librarians could only see the citation number and the bibliographic information. By clicking on the hyperlinked citation number, the full text of that citation would appear, and the subject librarians could easily rate the citation on the rubric.
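- Below is a rough Python sketch of this randomization and hyperlink step. The study itself used a random number table and Excel; the use of `random.shuffle`, the citation fields, and the file names here are illustrative assumptions, not the actual workflow.

```python
import random

# Hypothetical citations for one subject specialist; in the study these came
# from the bibliographic manager (Zotero) exports of the DB and GS searches.
citations = [
    {"title": "Article A", "source": "GS"},
    {"title": "Article B", "source": "DB"},
    {"title": "Article C", "source": "Both"},
]

# Randomize the presentation order so the specialist cannot tell whether a
# citation came from Google Scholar or from the library database.
random.shuffle(citations)

# Assign each citation a number and link it to its saved full-text file
# (the study delivered Excel hyperlinks to full-text files on a CD).
for number, citation in enumerate(citations, start=1):
    citation["id"] = number
    citation["fulltext"] = f"fulltext/{number:03d}.pdf"
    print(number, citation["title"], "->", citation["fulltext"])
```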
- (Presenter note: have the full text appear on this slide after a click, to simulate linking from the provided document.)
- This screen shows the rubric that we used. It is based on a rubric that has been widely used to evaluate print resources (Alexander, 1999).
- Alexander, J. E., & Tate, M. A. (1999). Web wisdom: How to evaluate and create information quality on the Web. Mahwah, NJ: Lawrence Erlbaum Associates.
- We asked each subject librarian to assign each citation a score between 1 and 3 in each of 6 categories (1 = below average, 2 = average, 3 = above average).
- These six categories were:
- Accuracy – which looks at how reliable and error-free the information is
- Authority – specifically the credentials of the author and publisher
- Objectivity – looking for bias in how the information is presented
- Currency – is the information up to date?
- Coverage – how deep is the coverage?
- And finally, Relevancy – how well does the citation relate to the research question?
- This resulted in a total possible score of 18 for each citation (6 categories × 3 points each) - we called this a scholarliness score
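- To make the scoring arithmetic concrete, here is a tiny sketch; the example ratings are made up.

```python
# Example ratings for one citation (made up); each rubric category is scored 1-3.
ratings = {
    "accuracy": 3,
    "authority": 2,
    "objectivity": 2,
    "currency": 3,
    "coverage": 1,
    "relevancy": 3,
}

scholarliness = sum(ratings.values())  # maximum possible score: 6 * 3 = 18
print(scholarliness)                   # 14 for this example citation
```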
- We used this statistical model to evaluate the data. Essentially this formula says 2 important things about the way we used the data:
- We controlled for the differences between the way librarians grade
- We controlled for the differences in how exclusively the citation was available
- This allowed us to pinpoint and measure any differences there may have been between disciplines in our data as well as any differences that can be attributed to the source of the citations
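- The actual formula appears on the slide rather than in these notes. As a hedged sketch only, a model of the general shape described above (librarian, exclusivity, discipline, and source effects on the scholarliness score) could be written as:

```latex
% A sketch only; the effect names and subscripts are assumptions, not the
% formula from the slide.
\[
  y_{ijkl} = \mu + \lambda_i + \gamma_j + \delta_k + \sigma_l + \varepsilon_{ijkl}
\]
% y_{ijkl}            scholarliness score of a citation
% \mu                 overall mean score
% \lambda_i           effect of librarian i (controls for rater grading differences)
% \gamma_j            effect of exclusivity category j (GS only, DB only, both)
% \delta_k            effect of discipline k (sciences, humanities, social sciences)
% \sigma_l            effect of the citation's source l (Google Scholar vs. database)
% \varepsilon_{ijkl}  residual error
```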
- Citations found only in GS had, on average, a 17.6% higher scholarliness score than citations found only in the DB
- Citations found in both GS and the DB scored even higher, on average, than citations found only in GS
- We found no statistically significant difference in the scholarliness scores between disciplines (i.e., humanities citations in GS are just as scholarly as science citations found in GS)
- This study can only be extrapolated statistically to the specific topics and subject specialists used in this study
- A more robust statistical methodology would need to be employed to make these results generally applicable
- We are encouraged by the results we received and feel that they would probably hold up, but we cannot say so until another study is done
- If we had to do it over again, we would have increased the Likert scale on our rubric from 1-3 to 1-7 or 1-10
- This would have allowed for a more nuanced statistical analysis and made it easier to spot significant differences, if any, between GS and databases
- Our scholarliness calculation, ultimately, was based on the subjective opinions of librarians with subject expertise.
- There are lots of ways to create a scholarliness score (citation counts, impact factors, etc.). Which is best is still debatable
- Our study compared GS to individual library databases. A more appropriate comparison may be GS to federated search tools.