Even though many personal archiving projects are done by individuals, the process does have a cost when it is done for institutional or commercial purposes. Three speakers addressed this issue; the first was Jeff Ubois (who was also the conference’s program chair).
He said that there will always be a deficit: we will always produce more data than we save. In some cases, it might be cheaper to rescan data on demand than to store the scanned data; this is particularly true for DNA data, which produces massive datasets. There are many cost models for archives and perpetual storage, and some of them are quite complex. And of course, digitizing is only the first step; cataloging, indexing, and long-term storage all carry costs of their own. Ubois presented some European data showing the following costs for data scanning:
|Material|Cost|
|---|---|
|Text pages|0.1 – 0.8/page|
|Images|0.72 – 6/page|
|Audio|6.42 – 78.84/hour|
|Videotape|5.46 – 120/hour|
|Film|8 – 1040/hour|
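To see how wide these ranges become for a whole collection, the per-unit figures in the table above can be multiplied out. The following is a minimal sketch, not anything presented at the session: the function name, the dictionary keys, and the sample collection are hypothetical, and the amounts are in whatever currency the European data used.

```python
# Per-unit scanning cost ranges (low, high) copied from the table above.
# The source does not name the currency, so none is assumed here.
SCAN_COST_RANGES = {
    "text_page": (0.1, 0.8),
    "image": (0.72, 6.0),
    "audio_hour": (6.42, 78.84),
    "videotape_hour": (5.46, 120.0),
    "film_hour": (8.0, 1040.0),
}


def estimate_scanning_cost(collection):
    """Return (low, high) total scanning cost for {material: quantity}."""
    low = sum(SCAN_COST_RANGES[m][0] * qty for m, qty in collection.items())
    high = sum(SCAN_COST_RANGES[m][1] * qty for m, qty in collection.items())
    return low, high


# Hypothetical collection: 10,000 text pages, 500 images, 40 hours of audio.
example = {"text_page": 10_000, "image": 500, "audio_hour": 40}
low, high = estimate_scanning_cost(example)
print(f"Estimated scanning cost: {low:,.2f} to {high:,.2f}")
```

Even for this small made-up collection, the high estimate is almost nine times the low one, which illustrates why budgeting for digitization is difficult.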
One European committee studied the cost to digitize Europe’s cultural heritage and suggested that a “digital dark age” would ensue if archiving were left exclusively to the commercial sector. Some archiving partnerships between commercial and non-commercial organizations have been attempted, with varying success. Ubois suggested that there is an opportunity for libraries and museums to support self-archiving efforts, collaborate to develop appropriate software, and collect small endowments (a form of “crowdsource funding”).
David Rosenthal, Chief Scientist of the LOCKSS (Lots of Copies Keep Stuff Safe) project hosted at Stanford University, presented three models for long-term data preservation. Storage is the major cost in all of them.
- Rent space on a cloud server, such as Amazon’s S3 service.
- Monetize the content by selling ads on it.
- Endow the data by collecting capital up front to pay for perpetual storage.
In Rosenthal’s opinion, Kryder’s Law (magnetic disk storage densities double annually) will continue, at least for a decade. He thinks that the desktop PC market will disappear, so the market for consumer disk drives will shrink. Storage technology is moving forward, but slowly. When solid-state disks become available, they will become the storage medium of choice: although they are much more expensive than magnetic storage, they will yield large savings in power, cooling, and space, and they will last longer.
Rosenthal favors endowing data storage, but this model also has disadvantages. Once payment has been made, there is little leverage to make the storage organization continue its work, unless an escrow service is in place to audit the organization and make sure that it is actually archiving the data. And if a service fails, the data must be transferred to another service, so reserves are needed to pay for such transfers. Rosenthal suggests that archives should be endowed with 70 times the raw storage costs, but the perceived high up-front cost for value received makes it a difficult marketing problem.
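The arithmetic behind an endowment can be sketched with a simple geometric series. This is a deliberately simplified illustration, not Rosenthal's model: it assumes that the yearly storage cost falls by a fixed fraction each year (a Kryder's-Law-style decline), that the endowment earns a fixed interest rate, and it ignores transfers, audits, and reserves. All function and parameter names are hypothetical. The 70x figure from the talk is applied separately as a flat rule of thumb.

```python
def endowment_for_perpetual_storage(annual_cost, cost_decline, interest_rate):
    """Present value of paying for storage forever under simplifying assumptions.

    annual_cost   -- storage cost in the first year
    cost_decline  -- assumed fraction by which the cost drops each year (0.5 = halves)
    interest_rate -- assumed yearly return on the endowment (0.02 = 2%)

    Year t costs annual_cost * (1 - cost_decline)**t, discounted by
    (1 + interest_rate)**t, so the total is a geometric series.
    """
    ratio = (1 - cost_decline) / (1 + interest_rate)
    if not 0 <= ratio < 1:
        raise ValueError("costs must shrink faster than they are discounted")
    return annual_cost / (1 - ratio)


def rule_of_thumb_endowment(raw_storage_cost, multiplier=70):
    """Flat rule from the talk: endow 70 times the raw storage cost."""
    return raw_storage_cost * multiplier


# If costs halve every year and the endowment earns nothing,
# two years' worth of the first-year cost is enough.
print(endowment_for_perpetual_storage(100, 0.5, 0.0))

# If costs fall only 10% a year, the required endowment is far larger,
# even with a 2% return.
print(endowment_for_perpetual_storage(100, 0.1, 0.02))

print(rule_of_thumb_endowment(1000))
```

The point of the sketch is the sensitivity: the slower storage costs fall, the larger the up-front sum must be, which is one reason a conservative multiplier can look expensive to the people being asked to pay it.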
Brewster Kahle, Founder of the Internet Archive, gave the audience the benefit of his experiences. Hardware costs for the Internet Archive are about 20% of total operating costs. The remainder goes to user interface development, format conversions, and personnel. Many of the personnel costs are roughly independent of the size of the archive. A major fact in our favor is that a petabyte of data is a huge amount (it can hold about 20 channels of TV broadcasts for 10 years), and it is virtually impossible to write that much text. Archiving projects must be non-commercial and non-profit if they are to last up to 15 years.
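As a rough sanity check on the petabyte comparison, one can work out the per-channel bitrate it implies. This is a back-of-the-envelope sketch under stated assumptions: a decimal petabyte (10^15 bytes), a 365.25-day year, and continuous recording on all 20 channels. None of these details were given in the talk.

```python
PETABYTE_BYTES = 10**15                 # assumption: decimal petabyte
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year length


def implied_bitrate(total_bytes, channels, years):
    """Bits per second per channel if total_bytes holds all channels for `years`."""
    channel_seconds = channels * years * SECONDS_PER_YEAR
    return total_bytes * 8 / channel_seconds


bits_per_second = implied_bitrate(PETABYTE_BYTES, channels=20, years=10)
print(f"Implied bitrate: {bits_per_second / 1e6:.2f} Mbit/s per channel")
```

Under these assumptions the figure works out to a little over 1 Mbit/s per channel, which is plausible for heavily compressed video.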
Scanning costs for a box of paper range from $100 to $750. If the pages must be fed into the scanner manually, each one will cost about 25¢. Video costs about $15/video-hour; film costs about $300/program-hour. Costs for personal materials are reasonable: books and microfilm each cost about 10¢/page. LP records are about $10/disk, and cassette tapes are about $10/hour. For born-digital material, an “Upload” button has been added to the Internet Archive website which allows anyone to put up anything, as long as it is not offensive.
The endowment model is under study; endowing a terabyte of storage should cost about $2,000. It costs about $1 million to $2 million to start archiving a new media type.
Steve Griffin, a Program Director at the National Science Foundation (NSF), closed this session with some perspectives on funding. All of the data becoming available is opening up new research universes. Are funding practices keeping up with the opportunities? In the 1990s, networks were meant to connect supercomputers to get more cycles for simulations. Now we have the Internet, which has changed its character as more content has become available. Government agencies cannot change their funding practices quickly on their own; there must be a loud campaign by research organizations if they are to get what they want.
Columnist, Information Today and Conference Circuit Blog Editor