dc.contributor.author: Darwin, Clayton Martin
dc.date.accessioned: 2014-03-04T03:19:17Z
dc.date.available: 2014-03-04T03:19:17Z
dc.date.issued: 2008-05
dc.identifier.other: darwin_clayton_m_200805_phd
dc.identifier.uri: http://purl.galileo.usg.edu/uga_etd/darwin_clayton_m_200805_phd
dc.identifier.uri: http://hdl.handle.net/10724/24588
dc.description.abstract: This dissertation provides a detailed description of the construction and analysis of the University of Georgia Tobacco Documents Corpus, a representative corpus of tobacco-industry documents designed to serve as a norm of written tobacco-industry discourse for the University of Georgia Tobacco-Documents Project (2001–2004). The Tobacco Documents Corpus was constructed as part of the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (NIH-NCI) grant 1 RO1 CA87490-01, ‘Linguistic Analyses of Tobacco Industry Documents.’ This description is provided primarily as a means of demonstrating the viability of the given premise, that it is possible to manage and describe large document sets, apart from extensive review of individual texts, by using a combination of Corpus Linguistics, Humanities Computing, and Statistics methods. Secondarily, it provides the specifics of the project necessary to 1) properly implement the resultant corpus as a norm for comparison studies and interpret related data, and 2) use the Tobacco Documents Corpus as a model for similar projects. In particular, this work presents the underlying theory, implementation, and results of each step in the process of corpus creation and description, from the initial sampling and conversion of documents, through the statistical description and analysis of the resultant corpus, and ultimately (although in a limited form) to the distribution of the corpus and associated analyses via Compact Disc and the Internet (http://www.tobaccodocs.uga.edu/TDC). Subtopics addressed include category theory (categorization and classification), statistical sampling, text markup using Extensible Markup Language (XML), text extraction using Extensible Stylesheet Language (XSL) and XSL transformations (XSLT), tokenizing, parsing, count methods, and proportions analysis. To a limited extent, this work addresses scripting using the Python programming language as a tool for corpus construction and analysis, and the Internet as a means for displaying corpus data and analyses. Based on the overall success of the Tobacco Documents Corpus, it is believed that this process description will be a contribution to the developing field of Corpus Linguistics, particularly in the area of large-scale document analysis and text-mining.
dc.language: eng
dc.publisher: uga
dc.rights: public
dc.subject: Corpus Linguistics
dc.subject: Humanities Computing
dc.subject: Markup schema
dc.subject: Statistical sampling
dc.subject: Text mining
dc.subject: Tobacco documents
dc.subject: Dissertations (academic)
dc.title: Construction and analysis of the University of Georgia Tobacco Documents Corpus
dc.type: Dissertation
dc.description.degree: PhD
dc.description.department: Linguistics
dc.description.major: Linguistics
dc.description.advisor: William A. Kretzschmar
dc.description.committee: William A. Kretzschmar
dc.description.committee: Marlyse Baptista
dc.description.committee: Michael A. Covington
dc.description.committee: Donald L. Rubin

