Philippe Boucher's Rendez-vous with . . . Celia White
Director of Content and Services of the Legacy Tobacco Documents Library University of California San Francisco
celia.white@library.ucsf.edu
http://legacy.library.ucsf.edu/
By Philippe Boucher
Rendez-vous 133
Monday, March 11 2002
PB: Thank you, Celia, for accepting our rendez-vous.
May I ask you to introduce yourself?
Celia White: I'm Celia White, Director of Content and Services of the Legacy Tobacco Documents Library (LTDL) (http://legacy.library.ucsf.edu/index.html). I'm a librarian, and I've been working with tobacco documents since 1998, when I was Lead Abstractor on Roswell Park's Marketing to Youth project. Since July 1999, I've been running the Tobacco Control Archive (TCA) at the University of California, San Francisco Library (http://www.library.ucsf.edu/tobacco/).
Because I've been immersed in the content and management of tobacco industry documents, I know a little about many topics they cover, but don't consider myself a specialist in any one area. My goals in regard to tobacco control are to educate the public about the insidious behavior of the tobacco industry, and increase everyone's access to this formerly secret information.
Q1. Can you go back to the origins of this project? Why and when was the library created, how did Legacy get involved, and who else is a sponsor?
Celia White: The decision to create what would become the LTDL had its roots at the second Tobacco Documents Planning Meeting, held in Minnesota in February 2000. At that meeting, all attendees declared that it should be a central priority to mount all the tobacco documents in one location which would be permanent, stable, have financial and institutional support, and which would offer a unified search interface of all collections. In January 2001, the American Legacy Foundation committed funding to accomplish this goal, and in January 2002, the LTDL went live. We also received support from the Tobacco-Related Disease Research Program, and the Robert Wood Johnson Foundation.
Q2. Can you tell us about the services that are provided? How are the documents archived? Who does what? How big is the staff?
Celia White: The LTDL offers simple and advanced searching of the full collection of documents as they were on the websites as of July 1999 from Philip Morris, R.J. Reynolds, Brown & Williamson, the Tobacco Institute, the American Tobacco Company, and the Council for Tobacco Research. Documents added to industry websites after July 1999 will be added to our site over the next year. We also plan to offer integrated searching of the collections presently hosted at the Tobacco Control Archive: the UCSF B&W documents, the Joe Camel Collection, and so forth.
Because we received our data from the National Association of Attorneys General (NAAG), which administers the Master Settlement Agreement, we maintain the original format of the data, and the TIF images they provided us are archived under standard conditions here at the Library.
Our team consists of one librarian, two programmers, an information systems manager, a system administrator, and sometimes a technical architect. All of us work very hard to make sure everything works, and to improve services when we can.
Q3. I can see the need to save all this info, but at the same time I wonder how many people will be willing to dig in. I am probably a bad example, but I am very easily discouraged and I don't have that much patience, at least for this type of investigation. Are many people using the library for in-depth research?
Celia White: I think so. The site's been up for about a month, and we receive tens of thousands of hits each day, a number far exceeding usual use of Library collections. I can also tell, by the kinds of questions we receive from users, that the LTDL is being utilized for detailed inquiries.
We also hope that people who'd been frustrated by using other tobacco document sites will visit the Legacy Library and learn how much easier it is here. We will also be mounting an online tutorial to ease the learning curve, later this Spring.
Q4. While you offer a small selection of "popular documents", someone like Anne Landman (is she mentioned somewhere on the site?) has been working hard to share, almost every day, one document among the many she has searched. Don't you think such an approach would make the documents more accessible for the general public? Including the possibility to receive such info via list-serve (like Anne does with smokescreen.org). Could the library partner with Anne and other researchers like her to provide such services?
Celia White: I love Anne's work! Our focus at the library is somewhat different. As with any type of research, it is good to get information from several sources. I am a subscriber to Anne's list-serve and recommend it to anyone doing work in this area.
We will be updating the popular documents section regularly on our site.
Q5. There are 24 million pages available. How long will it take to exploit them? Do you have a sense of how many can prove valuable and how many are not? How much gold -if any- eventually hides within this huge pile? Did such documents show up since the library was established?
Celia White: Remember, this is one of the largest digital libraries in the world. "Mining" this collection under any circumstances is a heady prospect. Investigation in many areas, from hard science to business and history, is presently being funded on an international level. Several documents which have been removed from the industry websites are available through the Legacy Library. Proposals for programmatic data mining are also being discussed all the time. In many cases, the technology is not yet where we need it to be.
For example, our original intention was to create searchable text from all the images we have. Our evaluation and trial of Optical Character Recognition software options for the very large scale of our document collection showed initial promise. However, evaluation of the results and return of accurate character recognition was very disappointing. The OCR created from tobacco industry documents is extremely "dirty" (stray characters, low word and character recognition count, and so forth), because the images were created from documents which are usually at least third or fourth generation copies, often several decades old, and which have a variety of typefaces and layouts. Added to this are the "scarring" which occurs as a document ages, and the marginalia of handwritten notes, stamps, and the sheer volume of pages which bear these characteristics (24-40 million pages).
While we do not intend to pursue OCR at this time, we are hopeful that improvement of technology and tools in this area may one day make the creation of searchable full-text of these documents a worthwhile endeavor. This would lead to other opportunities to utilize automatic indexing software, and other types of programmatic data organization.
Q6. Is there anything else you would like to add?
Celia White: I'm moving on from the Library to pursue some opportunities to work more deeply with tobacco industry documents, but am very proud of what we've accomplished. Your audience can always contact us by visiting the Contact us page at the Legacy library: http://legacy.library.ucsf.edu/cgi/help . Keep seeking the truth!
PB: Thank you Celia for taking the time to be with us today.
Rendez-vous is supported by a contract from the Robert Wood Johnson Foundation