Training-less ontology-based text categorization
Janik, Maciej Grzegorz
In this dissertation, we present a novel approach to automatic text categorization that combines features of traditional text categorization approaches with knowledge discovery methods from the Semantic Web. The proposed method, OmniCat, uses ontological knowledge to categorize text documents. More specifically, it relies on the named entities and relationships recognized in the document text to categorize the document according to the taxonomy of the ontology in use. Traditional approaches to text categorization use machine learning to train a classifier on a pre-classified set of training documents in order to learn the distinguishing features of each category. The resulting classifier is then used to classify previously unseen documents. One distinguishing feature of OmniCat is that, unlike traditional text categorization methods, it does not rely on the existence of a training set and uses only formal knowledge from the ontology to perform categorization. In the proposed approach, the ontology effectively becomes the classifier. This makes OmniCat a training-less ontology-based text categorization method. A classification category in OmniCat is defined as an ontology fragment and can be seen as a context of interest for categorization. Categorization in this model consists of transforming the text document into a semantic graph and finding its best semantic fit among the defined contexts. Contexts can be redefined dynamically as the user's interests change. Another distinguishing feature of OmniCat is that a change of classification contexts requires neither re-training the classifier nor supplying a new training set of documents. OmniCat simply classifies documents according to the newly supplied contexts of interest. The experimental results are highly encouraging.
OmniCat achieves high categorization accuracy, within the range of the traditional text categorization methods it was tested against. We believe that as richer and more comprehensive domain ontologies become available, the proposed approach offers a suitable alternative to traditional text categorization methods. The implementation of the proposed categorization method builds upon two necessary components, also created as part of this dissertation. The first is BRAHMS, a high-performance main-memory ontology storage engine, which provides an extensive and very fast API for querying large ontologies. The second is SPARQLeR, an extension of the SPARQL query language with regular path expressions for semantic association discovery. The algorithms and querying methods developed for SPARQLeR became an integral part of the proposed ontology-based categorization. The dissertation describes OmniCat, its main contribution, together with BRAHMS and SPARQLeR, the two components integral to the created prototype. We provide a detailed description of each system and the results of the performed experiments.
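The general idea of categorizing without a training set can be illustrated with a minimal Python sketch. It is not OmniCat's actual algorithm, and it does not use the BRAHMS or SPARQLeR APIs. The toy ontology, the label table, the `Context` class, and the scoring rule (matched entities plus matched relationships) are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Toy ontology as (subject, predicate, object) triples; every name is invented.
ONTOLOGY = {
    ("Player_X", "plays_for", "Team_A"),
    ("Team_A", "competes_in", "League_L"),
    ("Designer_D", "designed", "Lang_P"),
}

# Surface forms for a naive entity recognizer (hypothetical stand-in).
LABELS = {
    "player x": "Player_X",
    "team a": "Team_A",
    "league l": "League_L",
    "designer d": "Designer_D",
    "lang p": "Lang_P",
}


@dataclass(frozen=True)
class Context:
    """A category defined as an ontology fragment (here: a set of entities)."""
    name: str
    entities: frozenset


def recognize_entities(text):
    """Return ontology entities whose label occurs in the text."""
    lowered = text.lower()
    return {entity for label, entity in LABELS.items() if label in lowered}


def document_graph(entities):
    """Keep the ontology triples whose endpoints both occur in the document."""
    return {(s, p, o) for (s, p, o) in ONTOLOGY if s in entities and o in entities}


def fit(context, entities, edges):
    """Assumed scoring rule: matched entities plus relationships inside the context."""
    matched_entities = len(entities & context.entities)
    matched_edges = sum(
        1 for s, _, o in edges if s in context.entities and o in context.entities
    )
    return matched_entities + matched_edges


def categorize(text, contexts):
    """Return the name of the best-fitting context, or None if nothing matches."""
    entities = recognize_entities(text)
    edges = document_graph(entities)
    scored = [(fit(c, entities, edges), c.name) for c in contexts]
    best_score, best_name = max(scored, default=(0, None))
    return best_name if best_score > 0 else None


contexts = [
    Context("sports", frozenset({"Player_X", "Team_A", "League_L"})),
    Context("programming", frozenset({"Designer_D", "Lang_P"})),
]
print(categorize("Player X scored twice for Team A.", contexts))
```

In this sketch, changing the categories means passing a different list of `Context` objects to `categorize`; no model is retrained, which mirrors the property described in the abstract. Ties between contexts are broken arbitrarily by name here, whereas a real system would need a principled rule.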