April 30, 2007
Google Books: What’s Not to Like?
By Robert B. Townsend
The Google Books project promises to open up a vast amount of older literature, but a closer look at the material on the site raises real worries about how well it can fulfill that promise and what its real objectives might be.
Over the past three months I spent a fair amount of time on the site as part of a research project on the early history of the profession, and from a researcher’s point of view I have to say the results were deeply disconcerting. Yes, the site offers up a number of hard-to-find works from the early 20th century with instant access to the text. And yes, for some books it offers a useful keyword search function for finding a reference that might not be in the index. But my experience suggests the project is falling far short of its central promise of exposing the literature of the world, and is instead piling mistake upon mistake with little evidence of basic quality control. The problems I encountered fit into three broad categories—the quality of the scans is decidedly mixed, the information about the books (the “metadata” in info-speak) is often erroneous, and the public domain is curiously restricted.
Poor Scan Quality
My reading of the materials was not scientific or comprehensive, by any means, but a significant number of the books I encountered included basic scanning errors. For instance, the site currently offers a version of the Report of the Committee of Ten from 1893 (the start of the great curriculum chase for the secondary schools). It amounts to a catalog of scanning errors: Google has double-scanned pages (page 3 appears twice, for instance), pulled in pages improperly so they are now unreadable (page 147 now appears between pages 164 and 166), and cut off some pages (page 146, for example).
I’ve digitized a number of the AHA’s old publications and appreciate that scanners don’t always work as they should and pages can often get jammed. But even fairly rudimentary quality controls should catch those problems before they go live online. After years of implementing those kinds of quality checks here—precisely because friends in the library community took me to task about their necessity—I find it passing strange that so many libraries are joining in Google’s headlong rush to digitize without similar quality requirements.
Faulty Metadata
Beyond the fundamental quality of the scanning, a more significant problem is the incredibly poor descriptive information attached to many of the books on the site (the “metadata”). This is particularly evident in the serial publications, where having the proper name and date of a publication is especially important. Take for example a volume of History Teacher’s Magazine that is labeled as a volume of Social Studies (the name the magazine took in 1934) and dated as published in 1953 (even though it seems to be from 1917).
These kinds of problems have two unfortunate effects. First, they make it more difficult to place a particular work in time, and thus to actually locate a particular item “discovered” by using Google Books. At the same time, in many instances you will be unable to inspect public domain items more closely, because the erroneous date places the information on the wrong side of the copyright line.
Truncated Public Domain
These problems are exacerbated by Google’s rather peculiar views on copyright. While taking an expansive view of copyright for recent works, it has taken a very narrow view about books that actually are in the public domain. As I have always understood it (and the U.S. Copyright Office confirms), “works by the U.S. government are not eligible for U.S. copyright protection.” But Google locks all government documents published after 1922* behind the same wall as any other copyrighted work. Among other things, that locks up works that should be in the public domain, such as the AHA’s Annual Report (published by the Government Printing Office from 1890 to 1993) and circulars from the U.S. Bureau of Education. The problem is compounded by the often errant data about when these materials were published—which places these works even further beyond reach.
For more than a year now, Siva Vaidhyanathan, a cultural historian and media scholar at New York University, has been objecting that the rush to digitize is moving far in advance of considered thought. His concerns seemed rather abstract when I first heard them last year, but working with Google Books over the past few months made his objections seem much more tangible and worrying.
What particularly troubles me is the likelihood that these problems will just be compounded over time. From my own modest experience here at the AHA, I know how hard it is to go back and correct mistakes online when the imperative is always to move forward, to add content and inevitably pile more mistakes on top of the ones already buried one or two layers down. With Google adding in more than 3,000 new books each day, the growth in the number of mistakes seems that much higher.
The problem of quality control only deepens my most basic worry about the larger rush to digitize every scrap of information—that we are adding to the pile much faster than the technology can advance to extract the information in a useful or meaningful way. When I have asked people who know a lot more about the technology than I do about this problem, they tend to wave their hands and mumble about “brilliant scientists” and “technological progress.” Forgive me if I remain unconvinced. Even as someone fairly proficient with Boolean search terms, I find a lot of the results from Google Books (and Google more generally) to be just page after page of useless and irrelevant information. I find it increasingly hard to believe that Google can add tens of thousands of additional books each month to the information pile—many containing basic mistakes in content and metadata—and the information results will actually grow better over time.
So I have to ask, what’s the rush? In Google’s case the answer seems clear enough. Like any large corporation with a lot of excess cash, the company seems bent on scooping up as much market share as possible, driving competition off the board and increasing the number of people seeing (and clicking on) its highly lucrative ads. But I am not sure why the rest of us should share the company’s sense of haste. Surely the libraries providing the content, and anyone else who cares about a rich digital environment, need to worry about the potential costs of creating a “universal library” that is filled with mistakes and an impenetrable smog of information. Shouldn’t we ponder the costs to history if the real libraries take error-filled digital versions of particular books and bury the originals in a dark archive (or the dumpster)? And what is the cost to historical thinking if the only substantive information one can glean out of Google is precisely the kind of narrow facts and dates that make history classes such a bore? The future will be here soon enough. Shouldn’t we make sure we will be happy when we get there?
* Thanks to Ralph Luker for this correction.

You didn’t mention my pet peeve. In my work, I need to basically fact-check some historical info. The snippet view for copyrighted works would be, if not ideal, then sufficient for my objectives. That is, if the snippet actually included the search terms requested with a little surrounding text. However, more often than not some text other than what one asked for is highlighted, but one can’t, of course, scroll up or down in the snippet to see adjacent passages. So one is left wondering: now what? This is now more than just incidental. I’ve reported it to Google and they respond that it’s still beta so be patient.
— Jim Roan May 2, 08:49 AM
One important point to remember is that this is only the first step of a larger project that will ultimately be beneficial for all of us – especially those of us without ready access to large university or even large public libraries.
We can’t take the traditional academic approach to this project – otherwise, it would simply never get done. I’m on a National Archives committee grappling with these very issues – how do we keep track of, organize, and access all of our information? We create so much more every day, and it will take all of us working together to make whatever product becomes available the best that it can be.
Unlike the print world, these digital images are not necessarily the “final” word on any particular item. And while there are good technical specs on scanning, the size of this project means there are going to be mistakes.
It’s actually better to have some of the material available than none at all – even if some of the pages are badly scanned.
Furthermore, it’s not necessarily Google’s responsibility to organize it the way that we would. One significant advantage of the digital world and the digital community is that we can all work collaboratively. For example, I was initially quite skeptical of wikis but am now finding significant uses for them – including organizing conference discussions in a way that everyone can contribute and each person does not necessarily have to rely on a gatekeeper and the resulting time delay that may occur. (No, this does NOT mean I allow students to cite wiki references in their work – that’s not their most appropriate use.)
So, why don’t we discuss what we can add to this project along with our suggestions for improvement?
Too many projects in the digital world are not interested in the “past” and I for one am grateful that Google decided to enter this particular venue of the past that will benefit all of us in the future.
— Kelly Woestman May 2, 09:21 AM
One can add that Google is blocking users outside the US when they wish to consult worldwide PD works:
http://archiv.twoday.net/stories/2922570/
— Klaus Graf May 2, 09:55 AM
Another problem that I think is a pretty big one is the complete lack of persistency and sustainability. As far as I know, Google hasn’t even made clear publicly whether their “identifier” will be a permanent one (though a Google Books representative did tell me in a chat at the Frankfurt Book Fair last year that it would be).
— Andreas Praefcke May 2, 10:36 AM
One thinks of the excitement in ca. 1495 when Aldus Manutius began his great project of converting imperiled Greek and Latin manuscripts into type, thus both preserving them and disseminating their learning. He wisely then adopted his famous dolphin and anchor logo and the motto, festina lente, “make haste slowly.”
— Norman Fiering May 2, 12:22 PM
I am a cataloger librarian, and, unfortunately, I understood immediately what the problem with “History Teacher’s Magazine” is: the metadata associated with it is following older cataloging rules for serial publications. The rush to digitize means that no clean-up or updating of the records is happening, so nonsensical results such as this example will occur frequently. The current thinking in the world of library cataloging is that full-text solves all problems and abbreviated metadata, preferably machine-generated or input by the lowest levels of staff, should be sufficient. Quality—and correct—metadata “doesn’t scale,” as they say. I’m ashamed that these economies of scale reflect so poorly upon my institution, its metadata, and Google Books, which I generally support as a project that needs to be done.
Regarding the U.S. government publications, my understanding (and I may be wrong) is that there can be copyrighted material within U.S. governmental publications when that material was provided by a non-governmental body or researcher.
There is no doubt in my mind that without the participation of Google, my institution could not embark on a digitization project of this scale. I imagine this is the case for most large research libraries. As much as I wish there were better means for quality control and more care for the metadata, I know this won’t happen, and the digital scholars of the future will need to develop new skills in deciphering the digitized past.
— Kay Teel May 2, 01:42 PM
Mr. Townsend says about Google, “any large corporation with a lot of excess cash the company seems bent on scooping up as much market share as possible, driving competition off the board and increasing the number of people seeing (and clicking on) its highly lucrative ads.”
And Mr. Vaidhyanathan is quoted: “Do we want all our information needs to be met and filtered through a lens that ultimately has profit as its main aim?” Hmm, what publishing houses in the past 500 years were NOT trying to make a profit? I’m more astonished at a company like Brill for its $185 books than at Google.
— Mark Conrad May 2, 02:46 PM
The problem of metadata is not as forbidding as it would appear. Every book in Google is from a library, so there is (or should be) a catalog record for every book. Therefore, it is “only” a matter of linking them together. This is a lot of work, but much less than re-cataloging everything, and lower-level staff could do it.
Also, the problems of scanning quality are appreciable, but we must see what happens when these errors are reported. Will Google be willing to fix the mistakes? I personally doubt it since the goal of the project is: “What is Google Book Search? Search the full text of books to find ones that interest you and learn where to buy or borrow them.” See: http://books.google.com/googlebooks/about.html
Google may not consider a “few” errors in scanning to be problems. For someone who wants to use the item, it’s a disaster, but if the purpose is to “learn where to buy or borrow them,” they may take another view. An interested person can always borrow the book or buy it.
Quite a different view from a user in India or Africa.
— James Weinheimer May 3, 02:57 AM
I come from the perspective of someone who has been involved in digitization projects since 1994 (http://www.ProjectWittenberg.org, http://www.ctsfw.edu/library/probono.php) and am quite used to text projects with both promise and limitations (including my own!). Most of the frustration we experience comes when we use a project for a purpose other than the one for which it was devised. With Google, the basic purpose is for users to be able to locate information within the covers of a book and so be able to decide if it is worth reading. It is not intended as a replacement for the physical text, which, as Robert Townsend points out, it certainly cannot be. It is not even a good digital text, since no one can cut-and-paste from the works, nor do the images as a rule have enough resolution to convert to text via OCR.
It does serve its stated purpose quite well, however. I’ve been teaching our students to use it to gauge the quality of books before they use them for research. For example, if you search Google Books for the author and title of a work over five years old, you sometimes discover that it has been cited in a newer work. When a view greater than snippet is offered, you can even see how the author of the later work uses the title you searched for.
What this means for historians and librarians is that Google Books does not, nor will it ever, replace digital projects intended to make actual texts fully available. What can distinguish our projects are efforts to provide quality scans, careful editing of OCR texts, and detailed metadata. Humans are still better at the bibliographic art than machines, and the lessons learned by librarians in the 19th century still serve us well in the 21st. Yet computers still make reliable student workers.
— Bob Smith May 3, 09:04 AM
Many thanks to everyone who has commented on this post so far. My goal was to open a conversation about some of the competing costs in the project (and I include compromises on quality among the costs), and whether Google found a balance that really benefits scholars. The responses so far indicate that I am not alone in this; the e-mails and blog posts suggest a good deal of latent concern about the issue.
Kelly Woestman raises an important point about the need to provide more information to scholars at institutions that lack the resources to support research. As the recent MLA report on tenure and promotion in the humanities points out, the expectations for scholars are rising but the resources are not. So digitizing the materials of the past can help fill that urgent need. But this is not a simple either/or choice. Turning a blind eye to the quality issue, dismissing it as some kind of archaic “traditional academic” value, or suggesting that scholars just need to learn “new skills,” seems incredibly short-sighted to me. I worry about whether a scholar will be able to cite something from Google Books if it opens itself to the kinds of quality criticisms leveled at Wikipedia. I worry about whether the cultural record will be improved if libraries start to deaccession books because they are seemingly available through Google Books, without checking the quality of the materials that are actually available. (I do appreciate that libraries suffer from their own financial imperatives.) And I worry about the value of a finding aid that requires you to know the entire publishing history of a particular work in order to actually find something.
It’s hard, I know, to balance the desire to post more content with a commitment to ensuring that content will stand the test of time and future research. I understand that we have to make choices and compromises to get the job done. My concern is that some of the high-toned rhetoric about this project, and the simplistic choice between quality and content implied in some of the responses, stands in the way of a substantive discussion about what this project (and other digitization projects) will mean for the future of scholarship in history.
— Robert B. Townsend May 3, 12:06 PM
Townsend may have a point about the “Truncated Public Domain”—Google treats all works published between 1923 and 1963 as being copyrighted when we know that at least two-thirds of them are not—but his characterization of copyright law with respect to government works is incorrect. He has totally misconstrued the law and thereby has furthered a common misconception about the nature of U.S. Government works that are ineligible for copyright protection. These noncopyrightable public domain works must be works of the U.S. Government, i.e., works of its officials and employees produced as part of their jobs. The fact that a publication, such as the AHA’s Annual Report, is printed by the Government Printing Office or published and distributed by some agency of the government is irrelevant. If a work is produced by a private individual or some nongovernmental (albeit federally incorporated) organization, such as the AHA, the work qualifies for copyright even if it is published by the government. Something published by the Government is not ipso facto a Government publication. Thus, Townsend’s example intending to show unwarranted conduct by Google is poorly chosen. Google may be wrong in other places, but not here. (It turns out, however, that all of these AHA reports up to March 1989 are in fact in the public domain; but this is because they were published without proper copyright notice and not because they qualified as U.S. Government works.)
Paul Newman, J.D., Ph.D.
IP Specialist
University of Michigan
— Paul Newman May 3, 04:33 PM
I’m a huge fan of Google Books and agree that it’s unfair to criticize it for not being what it does not purport to be.
But I think it is fair to make two points:
Besides the mechanical errors in scanning (missing pages etc.), the OCR Google uses to allow for full text searching is extremely unreliable. E.g. “-ants-” becomes “-anis-”; “-el-” becomes “-ei-”. It cannot handle diacritics (vowels with umlauts are often “recognized” as the sequence “ii”); long “s” in older texts turns into “f”. It might one day be possible to automatically correct some—or even all—of this, but for now, while searches yield all kinds of interesting material, don’t imagine that you are in fact performing a full text search of what has been scanned.
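To make the idea concrete, here is a minimal sketch (in Python, with made-up confusion pairs and a toy word list, nothing Google has published) of the kind of dictionary-checked substitution that such automatic correction might involve:

# A sketch of dictionary-checked correction for common OCR confusions.
# The confusion pairs and the word list below are illustrative assumptions only.

CONFUSIONS = [
    ("f", "s"),    # long s misread as "f" (e.g., "Wiffens" for "Wissens")
    ("ii", "ü"),   # umlauted vowel misread as the sequence "ii"
    ("ei", "el"),  # "-el-" misread as "-ei-"
    ("ni", "nt"),  # "-ants-" misread as "-anis-"
]

def candidate_corrections(token):
    """Spellings reachable by undoing one class of confusion throughout the token."""
    return {token.replace(wrong, right) for wrong, right in CONFUSIONS if wrong in token}

def correct(token, lexicon):
    """Keep a recognized token; otherwise accept a candidate found in the lexicon."""
    if token.lower() in lexicon:
        return token
    for candidate in candidate_corrections(token):
        if candidate.lower() in lexicon:
            return candidate
    return token  # no safe correction found; leave the OCR output alone

lexicon = {"wissens", "grüssen"}     # a real system would need a large historical wordlist
print(correct("Wiffens", lexicon))   # -> "Wissens"
print(correct("griissen", lexicon))  # -> "grüssen"

Even a crude pass like this would only repair tokens whose corrected form happens to be in the word list, which is why the harder cases (proper names, older spellings) would still slip through.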
And in view of Google’s stated desire to “organize all the world’s information,” one can only wonder what this can mean when, for example, one finds among Key Words and Phrases for Galignani’s New Paris Guide (Paris, 1868) “fauteuils” (“chairs”) and “metres”; or when, under the heading Related Books for Pammachius by Thomas Naogeorgus (a 16th century neo-Latin playwright) the second and third entries are an edition of the 2nd century poet Plautus and one of the complete works of Cicero!
Well, someone else has invoked Aldus Manutius, so let me invoke Gutenberg—the inventor of printing and of the typographical error (with which his Bible teems)!
— David Sullivan May 3, 08:18 PM
Many thanks to Dr. Newman for correcting my reading of the law. But my predecessors at the AHA entered into the relationship with the federal government precisely because they wanted and expected the thousands of pages of primary documents, bibliographies, and reports on the state of history in America published in the Annual Reports to enter the public domain. Perhaps their understanding of the legalities was as wrong as my own, but I would not be keeping faith with their intentions if I failed to object to practices so contrary to their goals.
— Robert B. Townsend May 3, 10:34 PM
I urge all those concerned about quality control in scanning and rendering of digital images to text to read Nicholson Baker’s “Double Fold.”
Although written well before the Google project, “Double Fold” details problems that have occurred with digital conversion of scanned page images to text.
Anyone who works with scanning software knows that the conversion to text without human intervention has some degree of error.
The aforementioned should be a concern to those who see no need for books (at least those scanned and digitally accessible) in the future.
— Maurice J. Freedman, MLS, PhD May 3, 10:35 PM
I think you bring up some great points here. There is a danger of people – especially students – relying too heavily on this project, and I have seldom seen a text that is not error-ridden. And the format makes it far too easy to take the information out of context.
We need to worry about quality control – we have seen time and time again large-scale projects proceed without people ever taking the time to fix the early mistakes, and that does cause HUGE problems down the road. In this case, it can cause problems not only with the project itself, but also with the historical record.
A note about gov docs – I think in part that Google made its blanket (and incorrect) decision because there are many cases in which the GPO partnered or contracted out with a private publisher – and that work (all or in part) is not in the public domain (sadly). It was easier to just make the call that way instead of trying to decide, title by title, which works are truly in the public domain and which are not. It’s very frustrating – especially for those of us who wanted to be able to look at older public hearings!
— sarah maximiek May 4, 01:23 PM
RE: metadata – I think the issue is the lack of it, rather than faulty cataloging. The date-limiting field of Book Search’s advanced search consistently fails – for example, a search limited to “1972 to 1972” returns the following item:
http://polaris.umuc.edu/library/instruction/GoogleUniverse/images/GoogleBook3.gif
As in Scholar, where Google appears to have treated the highly structured data publishers likely provided as large masses of undifferentiated text (making effective fielded searching impossible), Book Search appears capable of parsing only a few parts of records as distinct – title, author, URL, possibly publisher – but by and large offers a difficult-to-control search of full text. Google has said recently that they have no intention of improving the search tool or results sorting/saving options, so I don’t think it will ever become a core tool for serious searchers.
— Ryan Shepard May 4, 02:57 PM
The biggest problem with Google Books’ treatment of public domain material is that even what is public domain has restricted terms of use. Despite the text itself being unrestrictable, the terms of use of Google Books prohibit many important uses, such as downloading large batches of related texts, and reusing the scanned images in other contexts.
Brewster Kahle of the Internet Archive has explained the problems eloquently:
http://www.futureofthebook.org/blog/archives/2006/11/brewster_kahle_on_the_google_b.html
— Sage Ross May 4, 04:09 PM
Just a few additions:
1. The reason I like Google Books is that if I know the book I am looking for, I can often find a full text of it without leaving my desk. In other words, I use it in exactly the way Google doesn’t intend it to be used.
This is because that way doesn’t seem to work very well: key words and phrases are senseless jumbles; and related books are (in my use) seldom very closely if at all related.
On the other hand, despite the alarming rate of OCR error, the texts are for the most part perfectly legible.
2. As to that error rate, as anyone who has worked with manuscripts knows, some kinds of error are more frequent than others because of similar (or confusing) letter-forms.
A quick test is to compare results for impossible vs possible forms of words.
E.g., for “Wissens” (of knowledge) vs. “Wiffens” (the same word with a long s), the impossible form occurs 10% of the time.
But the impossible forms of the Latin “illorum” (of those things/their) add up to close to 25% of instances.
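The arithmetic behind that test is simple enough to sketch in a few lines of Python; the hit counts and the impossible variant spellings below are hypothetical placeholders, not real Google Books figures:

# Estimate an OCR error rate by comparing hits for the correct spelling
# against hits for spellings that cannot occur in the language.
# All counts and variant spellings here are made-up placeholders.

def ocr_error_rate(correct_hits, impossible_hits):
    """Share of all occurrences that appear under a known-impossible spelling."""
    bad = sum(impossible_hits.values())
    total = correct_hits + bad
    return bad / total if total else 0.0

print(ocr_error_rate(9000, {"Wiffens": 1000}))                   # 0.10, i.e., 10%
print(ocr_error_rate(7500, {"iliorum": 1500, "illornm": 1000}))  # 0.25, i.e., 25%

Note that this only counts the errors you can name in advance; the true rate is at least this high.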
So my concerns are two-fold: these errors will never be corrected; and the Google version of a book will become in many cases the only digital version.
— David Sullivan May 4, 07:46 PM
An important problem with Google Books is its scanned full text.
— Bujar Kocana May 4, 08:29 PM
Perhaps, as Kelly Woestman implies, Google is waiting for Human Computation to supply the metadata. Though I am not sure how quality control fits in with the “wisdom of crowds.”
I mean, if patents are going this way, are copyright “problems,” which are difficult and troublesome, going to be solved by collaborative input?
— John Meier May 7, 08:27 AM
I find it difficult to fault Google on its error rates when I find librarian-prepared data to be so flawed. As a librarian, I find errors or substantive omissions in more than 90% of the MARC copy I encounter.
Would libraries do a better job at scanning? As has been pointed out on numerous occasions, libraries do not have the technical expertise, equipment or salary infrastructures to support quality reformatting.
Libraries have become megalithic tombs that are collapsing under their outmoded methodologies of dealing with information, methodologies tied to times when the growth rates of the creation of information were far more linear. Central to the inefficiency of libraries is MARC, a 40-year-old technology designed to produce cards for a card catalog. It is still central to the ways libraries organize information.
Google’s initiatives reaffirm the notion that information access has become too valuable to be left in the hands of libraries. As access to information becomes more market driven, we are likely to see more market-driven strategies, leaving some of us driving a Yugo and others a Lexus. The real danger we face is that libraries have abdicated their responsibilities to the public and placed their trust in the marketplace. There is nothing that ensures that, in time, the marketplace will be as benevolent as it is now, in what I see to be the honeymoon stage. Far too many could be left driving Yugos or, perhaps even worse, left without any viable means of access.
— Karl Miller, DMA May 7, 12:04 PM
For insight on related developments in the library world concerning access to scholarly information resources see Thomas Mann’s article, “What is Going on at the Library of Congress?” (http://www.guild2910.org/AFSCMEWhatIsGoingOn.pdf).
— Beth Guay, MLS May 8, 11:04 AM
Professor Townsend makes a number of excellent points, and some of us in the library profession have been making them for a number of years. (I have made similar points about this and other mass digitization projects in my FOOL’S GOLD: WHY THE INTERNET IS NO SUBSTITUTE FOR A LIBRARY, due out at the end of this month.) The question is not whether we will have an electronic future but whether we’ll have the one we want or the one we have simply settled for. My fear is that we have already accepted the latter, for no other reason than that when we come to worship at our new electronic idol, we are hard-pressed to criticize. When some do, we immediately brand them as Luddites, or worse. We are racing to a future where “history” means the most recent decade and a half—and with little hope of retrieving more.
— Mark Y. Herring May 8, 12:40 PM
There are some very interesting views here. I wanted to comment on the “try before you buy” aspect of Google’s promotion of the service. One of my uses is to do exactly that – if I’m considering purchase, seeing some of the content of a book so easily can be a very useful aid in decision making. But then it is very frustrating when the preview omits the references.
— Raewyn Adams May 9, 11:57 PM
i’m a big supporter of google’s scanning project.
a huge supporter.
the lack of sufficient quality-control is very bad.
even if google has the determination to go back
and correct the errors, my experience has shown
that process frequently introduces new errors.
in addition, it’s much more expensive than simply
finding and fixing all the glitches while the book
is still laying right there in the scanning bay.
metadata, shmetadata, the books are in libraries
that should have full metadata on ‘em already.
and o.c.r. results can be fixed later rather easily.
but scans that are missing, blurred, truncated,
or flawed in any number of other ways are bad.
-bowerbird
— bowerbird May 10, 01:50 PM
I cannot reproduce the errors noted by Townsend. To illustrate his first point (Poor Scan Quality) he highlights errors in “a version of the Report of the Committee of Ten from 1893.” I believe that there is only one version of this report accessible through Google Book Search—Stanford University’s copy. Page 3 (the first page of the body of the report—it is not numbered) is repeated, but perhaps it is repeated in the original at Stanford. Pages 146 and 147 are intact and completely legible. To use one instance of scanning error to condemn such a huge and obviously valuable project is recklessly irresponsible of Townsend. For him to misrepresent the one instance of error that he cites suggests that he is a… Townsend begins “My reading of the materials was not scientific or comprehensive…” Well then, have the humility and good sense to refrain from bold assertions and gross generalizations. I use Google Books, and yes, there are errors. But they represent a tiny fraction of the pages scanned, and often can be worked around by extrapolation. Google Books creates opportunities for discovery and convenience that are truly exhilarating. Opposition to Google Books is based on fear and ignorance. In the current example (academic history), think of how much easier it will be to ferret out instances of plagiarism, fraudulent citation, and other kinds of sloppy or dishonest scholarship.
— Mead Cain May 15, 10:26 AM
Our group, Free Government Information, has posted some material from the Federal Judicial Center publication, Copyright Law, 2nd Edition that might be helpful to this discussion.
What bothers me most as a documents librarian is that Google gives the snippet treatment even to items that SCREAM “in-house agency publication,” like The Monthly Catalog of United States Government Publications.
I think that Google’s treatment of federal government publications encourages a “pay per view” mind set towards government information that this country simply can’t afford intellectually.
Having said that, perhaps some of the libraries getting digital copies back from Google will be less timid about work that is largely (if not completely) public domain.
— Daniel Cornwall May 15, 04:32 PM
The Google Books issue came up for me in my capacity as a librarian trying to solve a problem for a patron asking why a public domain government document “available on Google” wasn’t truly available. Google gives users a very strong indication that everything not available is not in the public domain; Google’s FAQ states: “If the book is in the public domain and therefore out of copyright, you can page through the entire book and even download it and read it offline.”
As Paul Newman points out in a comment above, what government publications are in the public domain is somewhat complicated, but not so much so that pre-1900 scanned government documents (linked on my blog) shouldn’t be available on GooglePrint.
As Daniel Cornwall mentions above, in-house produced gov docs are often not available except in snippet format—yet there is a link to purchase. The grey area around public domain material exists, but then why doesn’t Google state that?
I hope the library partners will take Daniel Cornwall’s approach when it comes to their own copies of digital materials. However, my experience as a librarian requesting a PDF portion of a Google-scanned public domain gov doc (available on Google only as tasty snippets) from a partner library was that all such inquiries are handled by Google. Timid exists.
While some may see pointing out issues and concerns with Google Print as giving in to “fear and ignorance,” I view this critique as essential, whether one is actually against Google Books entirely, one wants to be cautious about its limitations, or one wants it to be the best product it can be for consumers/patrons.
— Raizel * May 17, 05:23 PM
While perusing Google Books for a library project, I tried to search for older items that would be hard to scan. On one of my first few searches I found a very bad scan, with most pages containing text cut off at the bottom: http://books.google.com/books?vid=OCLC64961393&id=rz8CAAAAQAAJ
A few random searches for other books returned scanning flaws in every single book (although none as bad as this example).
And if you click the “Find this book in a library” link for the example above, it takes you to a record that doesn’t match the displayed book. If this link is only supposed to be doing a general search, the wording seems to imply that it should link to exactly the same book you are viewing on Google.
— G.F. May 18, 12:41 PM
Mead Cain’s comment highlights another intriguing problem that I only hinted at in the original survey, which is the problem of authoritative citation in or to Google Books. I look in there and still see the problem page 147 plain enough, but it is not easy to get to. If you type 147 into the Page field at the top, for instance, it takes you to a page 147 that looks (mostly) fine. But if you type 145 into the same field, it takes you to a page 163 that is followed two pages later by the page 147 I identified. As some of Cain’s rhetoric suggests, the ability to cite and follow evidence has a moral dimension in scholarship. Here again, Google Books falls short. And this does not even get into the question of whether Google could change or alter the electronic file at some time in the future.
Despite Cain’s rather heated efforts to reduce this to a simple choice between Google Books and nothing, my argument is not against Google per se. My objection is to the gap between the way Google is executing its project and the rhetoric about its goals that some people seem so eager to believe. Surely as scholars we are capable of subtler distinctions, and a more precise weighing of our choices and options.
— Robert B. Townsend May 18, 01:32 PM
1. Anyone can scan public domain materials.
2. Ultimately the consumer market which accesses digitally converted replicas of printed materials will dictate to the producer the level of quality and accuracy which is required/acceptable.
— Paul Jeffko May 18, 08:25 PM