Construction and analysis of the University of Georgia Tobacco Documents Corpus
Abstract
This dissertation provides a detailed description of the construction and analysis of
the University of Georgia Tobacco Documents Corpus, a representative corpus of tobacco-industry
documents designed to serve as a norm of written tobacco-industry discourse for
the University of Georgia Tobacco-Documents Project (2001–2004). The Tobacco Documents
Corpus was constructed as part of the National Cancer Institute, National Institutes
of Health, U.S. Department of Health and Human Services (NIH-NCI) grant 1 RO1
CA87490-01, ‘Linguistic Analyses of Tobacco Industry Documents.’ This description is provided
primarily to demonstrate the viability of the premise that large document sets can be
managed and described, without extensive review of individual texts, by combining methods
from Corpus Linguistics, Humanities Computing, and Statistics. Secondarily, it provides the
specifics of the project necessary to 1) properly
implement the resultant corpus as a norm for comparison studies and interpret related
data, and 2) use the Tobacco Documents Corpus as a model for similar projects. In particular,
this work presents the underlying theory, implementation, and results of each step
in the process of corpus creation and description, from the initial sampling and conversion
of documents, through the statistical description and analysis of the resultant corpus, and
ultimately (although in a limited form) to the distribution of the corpus and associated analyses
via Compact Disc and the Internet (http://www.tobaccodocs.uga.edu/TDC). Subtopics
addressed include category theory (categorization and classification), statistical sampling,
text markup using Extensible Markup Language (XML), text extraction using Extensible
Stylesheet Language (XSL) and XSL transformations (XSLT), tokenizing, parsing, count
methods, and proportions analysis. To a limited extent, this work addresses scripting using
the Python programming language as a tool for corpus construction and analysis, and the
Internet as a means for displaying corpus data and analyses. Based on the overall success of
the Tobacco Documents Corpus, this process description is expected to contribute
to the developing field of Corpus Linguistics, particularly in the area of large-scale
document analysis and text mining.
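
The later steps named in the abstract (text extraction from XML markup, tokenizing, count methods, and proportions) can be illustrated with a minimal Python sketch. It is not the project's actual code. The element names (`document`, `header`, `body`), the sample text, and the function names are all hypothetical, and the standard-library `xml.etree.ElementTree` module stands in for the XSLT-based extraction the dissertation describes.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: these element names are illustrative only and are
# not taken from the Tobacco Documents Corpus schema.
SAMPLE_XML = """<document id="example-001">
  <header>Internal memo</header>
  <body>The filter reduces tar. The filter is new.</body>
</document>"""


def extract_text(xml_string, tag="body"):
    """Return the text content of every <tag> element, joined by spaces."""
    root = ET.fromstring(xml_string)
    return " ".join("".join(el.itertext()) for el in root.iter(tag))


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def proportions(tokens):
    """Map each token type to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


text = extract_text(SAMPLE_XML)
tokens = tokenize(text)
counts = Counter(tokens)
props = proportions(tokens)

# Show the three most frequent types with their raw counts and proportions.
for word, count in counts.most_common(3):
    print(word, count, round(props[word], 3))
```

Expressing counts as proportions of total tokens is what would let figures from a single document be compared against a reference corpus of a different size.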
URI
http://purl.galileo.usg.edu/uga_etd/darwin_clayton_m_200805_phd
http://hdl.handle.net/10724/24588