Construction and analysis of the University of Georgia Tobacco Documents Corpus
Darwin, Clayton Martin
This dissertation provides a detailed description of the construction and analysis of the University of Georgia Tobacco Documents Corpus, a representative corpus of tobacco-industry documents designed to serve as a norm of written tobacco-industry discourse for the University of Georgia Tobacco-Documents Project (2001–2004). The Tobacco Documents Corpus was constructed as part of the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (NIH-NCI) grant 1 RO1 CA87490-01, ‘Linguistic Analyses of Tobacco Industry Documents.’ This description is provided primarily as a means of demonstrating the viability of the project's premise: that it is possible to manage and describe large document sets, without extensive review of individual texts, by using a combination of methods from Corpus Linguistics, Humanities Computing, and Statistics. Secondarily, it provides the specifics of the project necessary to 1) properly implement the resultant corpus as a norm for comparison studies and interpret related data, and 2) use the Tobacco Documents Corpus as a model for similar projects. In particular, this work presents the underlying theory, implementation, and results of each step in the process of corpus creation and description, from the initial sampling and conversion of documents, through the statistical description and analysis of the resultant corpus, and ultimately (although in a limited form) to the distribution of the corpus and associated analyses via Compact Disc and the Internet (http://www.tobaccodocs.uga.edu/TDC). Subtopics addressed include category theory (categorization and classification), statistical sampling, text markup using Extensible Markup Language (XML), text extraction using Extensible Stylesheet Language (XSL) and XSL transformations (XSLT), tokenizing, parsing, count methods, and proportions analysis.
To a limited extent, this work addresses scripting using the Python programming language as a tool for corpus construction and analysis, and the Internet as a means for displaying corpus data and analyses. Based on the overall success of the Tobacco Documents Corpus, it is believed that this process description will be a contribution to the developing field of Corpus Linguistics, particularly in the area of large-scale document analysis and text-mining.