Today I will be discussing the Web-at-Risk project to harvest and preserve born-digital government and political information. During this presentation I hope to provide an overview of Web harvesting: specifically, what it is and why you or your institution would want to do it. I will also explain the Web-at-Risk project and its activities, discuss the issues involved in harvesting born-digital materials, give examples of some harvesting tools and services, and point you to a few resources for more information.
Web harvesting: what and why?
First, what is Web harvesting? It is the automated capture of Web-published material, sometimes referred to as born-digital information. These files can be PDFs, HTML pages, Windows Media files, GIFs, JPEGs, and so on. Now, why would you want to harvest this material? There are several reasons: to capture material in danger of disappearing, to capture a particular event or moment in time, and/or to build a collection of similar or related materials.
To capture materials in danger of disappearing. We know that the Web is not the most stable environment: information appears, disappears, and is modified all the time. In terms of government and political information, a change in administration or an election win or loss can mean a Web site scrubbed of certain publications or taken down completely. The CyberCemetery at the University of North Texas (http://govinfo.library.unt.edu) is a good example: it captures the Web sites of federal government commissions and agencies that are going out of business, and houses the files on its servers to ensure continued access to them.
To capture a particular event, or moment in time. At the end of President Bush’s first term in office, the National Archives conducted a harvest of all .gov and .mil Web sites (http://webharvest.gov/collections/peth04/). They also conducted an end-of-session harvest at the close of the 109th Congress (http://webharvest.gov/collections/congress109th/). In the aftermath of Hurricane Katrina, numerous blogs, Web sites, etc., sprang up. The Web-at-Risk team used this opportunity to test their harvest tools, and to ensure that the sites were captured for posterity.
To build a collection of similar or related materials. Researchers may be interested in water management policies or flood control in particular areas, and how they may change over time. Or a library could be interested in capturing the Web sites of political parties in a particular country or region.
The Web-at-Risk project is funded by a grant from the National Digital Information Infrastructure and Preservation Program (NDIIPP), and project partners are the California Digital Library, the University of North Texas, and New York University. The purpose of the project is to build tools that will allow librarians to ‘capture, curate and preserve Web-based government and political information.’ While the project includes programmers, developers, and curators, among others, today I’m going to focus on the role of curators.
The Web-at-Risk curators are librarians who are familiar with the content. Our role is to develop collection plans and to test the capture tools built by the Web-at-Risk team. The collection plans are important in determining the scope and content of each harvest; they are where decisions are made that can affect the entire project. More on those in a moment.
All of the Web-at-Risk collection plans are available from the Web-at-Risk Wiki, at http://wiki.cdlib.org/WebAtRisk/tiki-index.php?page=WebCollectionPlans.
Three examples of collections identified by Web-at-Risk curators are:
The CyberCemetery at the University of North Texas (http://govinfo.library.unt.edu): this collection consists of the Web sites of federal agencies that no longer exist. Here, entire Web sites are captured once, just before an agency or commission shuts its doors.
The UCLA Online Campaign Literature Archive (http://digital.library.ucla.edu/campaign/): this site provides access to captured Web sites from California “local, state, and federal offices, and ballot measures affecting the Los Angeles area.”
The Islamic and Middle Eastern Political Web, from Stanford University: while this collection is not yet available online, its plan shows how and why a group of sites with a shared focus (political parties inside and outside the borders of Islamic and Middle Eastern countries) would be of interest to researchers and kept for research purposes.
Web harvesting: issues involved
When creating the collection plan, there are several decisions that curators must make. First, content must be identified. What do you want to capture? Will the collection be based on content, publisher, or subject? A combination of these?
Another consideration is the depth of the capture. How much information should be captured? An entire Web site, or only certain levels? If it is determined that an entire Web site will be harvested, should any external links be included in the capture?
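To make the depth decision concrete, here is a minimal sketch of how a crawler applies a depth limit and an external-link policy. This is illustrative only: the function and site names (`crawl`, `agency.example`) are my own, and the "site" is an in-memory dictionary rather than a live server, so nothing here reflects how the Web-at-Risk tools are actually implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, start, max_depth, follow_external=False):
    """Breadth-first crawl over `site` (a dict of URL -> HTML),
    stopping max_depth link-hops from `start`. Links to a different
    host are skipped unless follow_external is True."""
    start_host = urlparse(start).netloc
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            html = site.get(url)
            if html is None:
                continue  # missing or off-site page: recorded but not fetched
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                target = urljoin(url, href)
                if not follow_external and urlparse(target).netloc != start_host:
                    continue
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen

# A tiny stand-in for a live site: URL -> page HTML.
site = {
    "http://agency.example/": '<a href="/reports/">Reports</a>'
                              '<a href="http://partner.example/">Partner</a>',
    "http://agency.example/reports/": '<a href="/reports/2007.pdf">2007 report</a>',
    "http://agency.example/reports/2007.pdf": "",
}

# Depth 1 captures the home page and the reports index; the external
# partner site is excluded unless follow_external=True.
print(sorted(crawl(site, "http://agency.example/", max_depth=1)))
```

Raising `max_depth` to 2 would pull in the 2007 report as well, which is exactly the kind of trade-off a collection plan has to spell out: each additional level can multiply the number of files captured.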
Number and frequency of captures
How often should the information be captured? Is this a one-time snapshot, or should it be captured periodically in order to harvest new or changed information?
Simply because a file or files are made available on the Web does not mean that there are no copyright concerns. Does the site owner or publisher need to be contacted for permission before harvesting the content you’ve identified?
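Copyright permission is a legal question that no tool can answer for you, but a related, machine-readable signal that harvesters conventionally honor is a site's robots.txt file, which states the owner's crawl policy. As a small sketch (the robots.txt content and site name below are invented for illustration), Python's standard-library `urllib.robotparser` module can check whether a given URL is open to crawlers:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for a hypothetical agency site: crawlers are
# welcome everywhere except the /drafts/ section.
robots_txt = """\
User-agent: *
Disallow: /drafts/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyHarvester", "http://agency.example/archives/report.pdf"))  # True
print(rp.can_fetch("MyHarvester", "http://agency.example/drafts/memo.html"))     # False
```

Note that a permissive robots.txt is not a copyright license; it only tells you what the site operator is willing to have crawled, so contacting the publisher may still be necessary.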
Once the scope of the harvest has been determined, you’ll also need to decide how to harvest: that is, whether to use available open source tools and do it yourself, or to employ a Web harvesting service.
There are several harvesters available for use, and many of them can be found via the Web-at-Risk wiki. Two that I’m familiar with are Heritrix and HTTrack, both open source products. A third, the Web Curator Tool, has recently been developed and tested by the Web-at-Risk project team.
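These tools differ in their details, but the scope decisions described above map onto crawler options in broadly similar ways. Purely as an illustration (using GNU wget rather than the tools named here, simply because its options are compact; the site name and settings are placeholders), a do-it-yourself capture of one site, two link-levels deep, might look like:

```shell
# Mirror agency.example two link-levels deep, staying on that host,
# rewriting links so the copy can be browsed locally, and pausing
# between requests to be polite to the server. Illustrative only.
wget --recursive --level=2 \
     --domains=agency.example \
     --no-parent \
     --page-requisites \
     --convert-links \
     --wait=1 \
     http://agency.example/
```

Each flag corresponds to a collection-plan decision: `--level` is the depth of capture, `--domains` excludes external links, and `--wait` reflects a politeness policy toward the publisher's server.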
Heritrix (http://crawler.archive.org/) is the harvester used by the Internet Archive, and is the harvester being used in the Web-at-Risk project. It stores harvested content in ARC files, and requires more massaging of harvested content before it can be reproduced on a different Web server.
The Web Curator Tool (http://webcurator.sourceforge.net/) was developed by the National Library of New Zealand and the British Library. This tool manages the entire process of harvesting Web content, including the scope of content, permissions, and the harvest itself. The Web-at-Risk curators were asked to test this product, though I am not sure whether an analysis of that testing has been completed.
As the practice of Web harvesting has evolved, harvesting services have developed. These services, such as the model developed in the Web-at-Risk project, allow clients to determine the scope, etc., and pay someone else to harvest and host the Web-published material they are interested in.
Archive-It (http://www.archive-it.org/) is a subscription service of the Internet Archive that allows institutions to create their own Web archives without having to host them. Some state governments are using this service, as are other institutions looking to preserve Web content without devoting too many in-house resources to the effort.
OCLC’s Digital Archive (http://www.oclc.org/digitalarchive/default.htm) is another harvesting service. In addition to harvesting entire sites, this product also allows individual publications to be harvested. In addition, the Web Archiving Service is being developed as part of the Web-at-Risk project; more information about it is on the Web-at-Risk wiki (http://wiki.cdlib.org/WebAtRisk/tiki-index.php).
I’ve gone through quite a bit of information in a short time, but there is much more out there for further research. The Web-at-Risk Wiki has a wealth of information about the project, including the collection plans developed by curators and presentations by the project team. Web harvesting can be a complex undertaking, and I hope that I’ve given you some things to think about before embarking on such an initiative. Thank you.
About the author
Valerie D. Glenn is the Government Documents Librarian at the University of Alabama Libraries.
Copyright ©2007, First Monday.
Copyright ©2007, Valerie D. Glenn.
Preserving Government and Political Information: The Web-at-Risk Project by Valerie D. Glenn
First Monday, volume 12, number 7 (July 2007).