System oriented techniques for high-performance anti-spam solutions
Email has become a crucial part of life as the Internet has developed. However, a massive influx of spam has threatened the usefulness of email communication. Many anti-spam techniques have been developed, including machine learning, authentication, and collaboration. However, little has been done from a systems perspective to provide an effective, robust, and efficient anti-spam solution. The arms race between spammers and anti-spam researchers has brought new challenges to the design of modern anti-spam systems. This dissertation focuses on the systems aspects of the challenges that anti-spam researchers face in designing various anti-spam approaches. In particular, we attempt to provide solutions to the challenges in the collaborative approach, the stand-alone approach, and the sender-based approach. These challenges are 1) preserving the privacy of email content in collaboration, 2) achieving both high accuracy and high processing speed, and 3) selectively punishing email senders without exact knowledge of whether a given sender is a spammer or a normal user. We design a novel message transformation technique that preserves the privacy of email content while deriving resemblance information for collaborative email classification. We also carefully design a communication protocol that ensures email privacy during information exchange among the collaborating entities. The experimental results demonstrate comparable accuracy and greater robustness compared to the Bayesian and Distributed Checksum Clearinghouse approaches. This dissertation also proposes a new metric for privacy evaluation and demonstrates a system with excellent privacy preservation. The dissertation then explores the tradeoff between spam filtering accuracy and speed by using approximate classification. It demonstrates about one order of magnitude of speed improvement over two well-known spam filters, while achieving identical false positive rates and similar false negative rates.
For cost-based approaches, we propose pushing the spam filter to an early stage of the SMTP conversation and determining the cost based on email quality and spamming behavior. The experimental results show that, on state-of-the-art hardware, the proposed technique significantly limits the spammer's capabilities, even if the spammer possesses more CPU resources than a normal sender.