Tuesday, March 06, 2007

Google Book Project

Thomas Rubin, the associate general counsel for Microsoft, lambasted Google’s approach to copyright protection, characterizing it as ‘cavalier’ in comments delivered at the Association of American Publishers conference in New York. Those of us in publishing have a first-hand understanding of this opinion, and other segments of media are rapidly coming to the realization that even obvious content ownership isn’t enough to preclude Google from adopting, and more importantly making money off, content under copyright. Google is probably the only company that was willing to take on the significant legal risks associated with the purchase of YouTube, for example.

Publishers have elected to sue Google to protect their content rights and the content rights of their authors. At the same time, publishers have engaged with Google as participants in the Google Scholar program. Here publishers are equal partners, and (I assume) the acquisition of content by Google was negotiated in good faith, with results that have been good to great for both parties (Springer, Cambridge University). It is also no bad thing that Google’s content (digitization) programs have spurred other similar content initiatives, particularly those of some of the larger trade and academic publishers.

The continued area of friction is the digitization project that Google initiated to scan all the books in as many libraries as were willing to participate. This is where publishers got upset. They were not consulted nor asked permission, they cannot approve the quality of the scanning, they will not participate in any revenue generated, and they cannot take for granted that the availability of the scanned books will not undercut any potential revenues they may generate on their own. The books in question are the majority of those published after 1925 or so (it's actually 1923: thanks to Shatzkin for noticing my error) and which are still likely to be under copyright protection of some sort.

Having said that, let’s get one thing straight: having all the books that exist in library stacks (or deep storage) available in electronic form, so that they can be indexed, searched, reassembled, found at all and generally resourced in an easy way, is a good thing and an important step forward and opportunity for libraries and library patrons. Ideally, it would lead to one platform (network) providing equal access to high-quality, indexed e-book content which any library patron would be able to access via their local library. Sadly, while the vision is still viable, the execution represented by the Google library program is not going to get us there.

Setting aside the copyright issue, the Google library program has been going on now for approximately 24 months, and results and feedback are starting to show that the reality of the program is not living up to its promise. According to this post from Tim O’Reilly, the scans are not of high quality and, importantly, are not sufficient to support academic research. Assuming this is universally true (?), the program represents a fantastic opportunity lost for patrons, libraries and Google. BowerBird, via O’Reilly, states:

umichigan is putting up the o.c.r. from its google scans, for the public-domain books anyway, so the other search engines will be able to scrape that text with ease. what you will find, though, if you look at it (for even as little as a minute or two) is that the quality is so inferior it's almost worthless

Could Google suffer more embarrassment as disillusionment grows over the program? Perhaps, but I doubt it will force them to rethink their methodology. It would represent a huge act of humility for Google to ‘return to the table’ with publishers and libraries to rethink the project, with the intention of resolving the copyright issues and agreeing on a better way to process and tag the content. To suggest that they become less a content repository and more a navigator or ‘switchboard’, as O’Reilly phrases it, is beyond expectation; however, were they to change course in this way they would immediately reap benefits with all segments of the publishing and library communities. O’Reilly, a strong supporter of the Google program, believes the search engines (Google, Yahoo, others) will ‘lose’ if they continue to create content repositories that are not ‘open’.

Ironically, the lawsuit by the AAP could actually have a beneficial impact on the process of digitization. As some have noted, we may have underestimated the difficulty of finding relevant materials and resources once there is more content to search (this assumes full text is available for search). Initiatives are underway, particularly by the Library of Congress, to address the bibliographic (metadata) requirements of a world with a lot more content, and perhaps some of these bibliographic activities will lead to a better approach to digitization of the more recent content (post-1923). Regrettably, some believe that since there may be only one chance to scan the materials in libraries, we may have already lost the opportunity to make these (older) materials accessible to users in an easy way.


Tomorrow: just what is the universe of titles in the post-1923 ‘bucket’? The supporters of the Google project speak about a universe of 30 million books, but deeper analysis suggests the number is wildly exaggerated.