Supporting link analysis using advanced querying methods on semantic web databases
There is an increasing demand for technologies that can help organizations unearth actionable knowledge from their data assets. This demand continues to drive a flurry of activity in data mining research, where the emphasis is on technologies that can identify patterns in data. However, in addition to the “patterns” view of data, other data and knowledge perspectives are required to support the broad range of complex analytical tasks found in contemporary applications. For example, in some applications in homeland security, bioinformatics, business and other investigative domains, many tasks are focused on “connecting the dots”. For this genre of applications, support for identifying, revealing and analyzing links or relationships between groups of entities (link analysis) is crucial. Mainstream database systems currently do not support such analyses, so existing solutions rely on exporting data from the database into custom applications for analysis. This approach incurs additional overhead and precludes exploiting the other mature technologies offered by today’s database systems. This thesis argues for database support for link analysis by providing an appropriate interpretation for such information requests in a graph database model. It addresses several key database issues with respect to supporting such queries. First, it identifies a number of querying constructs that are crucial to supporting link analysis applications and proposes a formal query language called SPARQ2L that allows their expression. A formal semantics and a characterization of the computational complexity of SPARQ2L’s query constructs are also presented. Second, it proposes a database storage model that supports efficient processing of queries while being tolerant of data persistence.
The storage model combines a graph linearization strategy rooted in algebraic techniques for solving path problems with a set of heuristics for node and edge clustering that aim to minimize external path lengths. Third, it proposes a novel relevance model, SemRank, which exploits the “machine processible semantics” of data in ascribing relative importance to query results and offers a flexible, “modulative ranking” model that enables serendipitous knowledge discovery.