    Construction and analysis of the University of Georgia Tobacco Documents Corpus

    Date
    2008-05
    Author
    Darwin, Clayton Martin
    Abstract
    This dissertation provides a detailed description of the construction and analysis of the University of Georgia Tobacco Documents Corpus, a representative corpus of tobacco-industry documents designed to serve as a norm of written tobacco-industry discourse for the University of Georgia Tobacco-Documents Project (2001–2004). The Tobacco Documents Corpus was constructed as part of the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (NIH-NCI) grant 1 RO1 CA87490-01, ‘Linguistic Analyses of Tobacco Industry Documents.’ This description is provided primarily to demonstrate the viability of the project's premise: that it is possible to manage and describe large document sets, apart from extensive review of individual texts, by using a combination of Corpus Linguistics, Humanities Computing, and Statistics methods. Secondarily, it provides the specifics of the project necessary to 1) properly implement the resultant corpus as a norm for comparison studies and interpret related data, and 2) use the Tobacco Documents Corpus as a model for similar projects. In particular, this work presents the underlying theory, implementation, and results of each step in the process of corpus creation and description, from the initial sampling and conversion of documents, through the statistical description and analysis of the resultant corpus, and ultimately (although in a limited form) to the distribution of the corpus and associated analyses via Compact Disc and the Internet (http://www.tobaccodocs.uga.edu/TDC). Subtopics addressed include category theory (categorization and classification), statistical sampling, text markup using Extensible Markup Language (XML), text extraction using Extensible Stylesheet Language (XSL) and XSL transformations (XSLT), tokenizing, parsing, count methods, and proportions analysis. To a limited extent, this work addresses scripting using the Python programming language as a tool for corpus construction and analysis, and the Internet as a means for displaying corpus data and analyses. Based on the overall success of the Tobacco Documents Corpus, it is believed that this process description will be a contribution to the developing field of Corpus Linguistics, particularly in the area of large-scale document analysis and text-mining.
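    As a rough illustration of what "tokenizing, count methods, and proportions analysis" against a norm corpus can look like in Python, the sketch below counts one word in a sample text and in a reference text and compares the two relative frequencies with a pooled two-proportion z statistic. It is not taken from the dissertation: the token pattern, the function names (`tokenize`, `two_proportion_z`, `compare_to_norm`), and the choice of the z statistic are all assumptions made for the example, and the actual procedures used for the corpus may differ.

```python
import math
import re
from collections import Counter

# Assumed token pattern: runs of letters, optionally joined by a hyphen
# or apostrophe (e.g. "tobacco-industry"). Digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens as a list."""
    return TOKEN_RE.findall(text.lower())


def two_proportion_z(count_a, total_a, count_b, total_b):
    """Pooled two-proportion z statistic for count_a/total_a vs count_b/total_b."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both totals must be positive")
    pooled = (count_a + count_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # The word is absent from (or makes up all of) both texts.
        return 0.0
    return (count_a / total_a - count_b / total_b) / std_err


def compare_to_norm(sample_text, norm_text, word):
    """Compare the relative frequency of `word` in a sample against a norm text."""
    sample_tokens = tokenize(sample_text)
    norm_tokens = tokenize(norm_text)
    sample_counts = Counter(sample_tokens)
    norm_counts = Counter(norm_tokens)
    return two_proportion_z(
        sample_counts[word], len(sample_tokens),
        norm_counts[word], len(norm_tokens),
    )
```

    A positive value means the word is proportionally more frequent in the sample than in the norm, and a negative value means it is less frequent. For example, `compare_to_norm(document_text, corpus_text, "nicotine")` would return such a score, where `document_text` and `corpus_text` are hypothetical strings holding the document under study and the reference corpus.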
    URI
    http://purl.galileo.usg.edu/uga_etd/darwin_clayton_m_200805_phd
    http://hdl.handle.net/10724/24588
    Collections
    • University of Georgia Theses and Dissertations
