CLIENT SIDE MULTI-LINGUAL MODEL FOR ENHANCING PERFORMANCE IN SHORT MESSAGE SERVICE SPAM DETECTION
Abstract
Mobile phone users lose millions every year to short message service (SMS) spam, a social 
engineering technique that attempts to obtain sensitive information such as passwords, 
personal identification numbers and other private data by posing as a trustworthy entity through 
SMS. Spammers constantly develop new, sophisticated methods that render previous detection 
techniques obsolete. A serious deficiency of most SMS spam detection methods is their lack of 
satisfactory accuracy, reliability, performance and comprehensibility, especially when 
individual classifiers are used; these remain important aspects to consider when developing 
an optimal model. SMS spam detection using machine learning 
techniques is a new approach, especially on ubiquitous computing devices such as mobile 
phones. Moreover, designing SMS spam detection techniques for a mobile platform is a 
challenging task because of the non-stationary distribution of the data and the multilingual 
nature of users' text messages. Against this background, the research proposes a multi-stage 
ensemble hybrid, client-side, multilingual SMS spam detection model for a mobile environment 
using machine learning techniques. The model involves enhanced pre-processing, content-based 
feature engineering, multilingual natural language processing, and training and testing on 
the data. A hybrid ensemble machine learning method is used to combine the 
classifiers based on a combination algorithm. The multilingual message data combine 
secondary data from the University of California, Irvine public repository with primary data 
from local users and sampled local repositories in Kenya. Machine learning and data mining 
experiments are conducted using the Java-based Waikato Environment for Knowledge Analysis 
(WEKA). The results are analyzed and presented in the form of 
descriptive statistics. The effectiveness of the proposed model is empirically validated using 
ensemble classification methods, which gave an overall classification accuracy of 98.2606%. 
The results of this study demonstrate that the proposed ensemble model improves overall 
performance by increasing accuracy and reducing false positives.
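
The abstract does not say which combination algorithm is used to merge the base classifiers. 
As a minimal sketch only, the Java example below assumes simple majority voting over three 
hypothetical rule-based classifiers (an English keyword rule, a Swahili keyword rule and a 
long digit-run rule). The class name, the rules and the sample messages are illustrative 
assumptions, not the trained classifiers or the data used in the study.

```java
import java.util.List;
import java.util.function.Predicate;

public class MajorityVoteSketch {

    // Hypothetical base classifiers: each returns true when it votes "spam".
    // These toy rules stand in for trained models and are assumptions for illustration.
    static final List<Predicate<String>> BASE_CLASSIFIERS = List.of(
        msg -> msg.toLowerCase().contains("win"),   // English keyword rule
        msg -> msg.toLowerCase().contains("pesa"),  // Swahili keyword rule ("money")
        msg -> msg.matches(".*\\d{5,}.*")           // run of five or more digits
    );

    // Majority voting: a message is labelled spam when more than half
    // of the base classifiers vote spam.
    public static boolean isSpam(String message) {
        long spamVotes = BASE_CLASSIFIERS.stream()
                .filter(classifier -> classifier.test(message))
                .count();
        return spamVotes * 2 > BASE_CLASSIFIERS.size();
    }

    public static void main(String[] args) {
        System.out.println(isSpam("WIN cash now, send your PIN to 0700123456"));
        System.out.println(isSpam("Tuma pesa kwa namba 0700123456"));
        System.out.println(isSpam("Are we still meeting at noon?"));
    }
}
```

A single vote is not enough to flag a message in this sketch, which illustrates how 
combining classifiers can reduce false positives relative to any one rule used alone.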

