Sunday, July 21, 2019
Internet of Things Paradigm
Introduction

According to a 2016 statistical forecast, there are almost 4.77 billion mobile phone users globally, and the number is expected to pass five billion by 2019. [1] This growth is driven mainly by the increasing popularity of smartphones. In 2012, about a quarter of all mobile users were smartphone users, and this share is expected to double by 2018, which means there will be more than 2.6 billion smartphone users. More than a quarter of these smartphone users use Samsung or Apple smartphones.

As of 2016, there were 2.2 million apps in the Google app store and 2 million in the Apple store. Such explosive growth in apps offers potential benefits to developers and companies. The mobile application market generates about $88.3 billion in revenue.

Prominent exponents of the IT industry estimated that the IoT paradigm will generate $1.7 trillion in value added to the global economy in 2019. By 2020, the Internet of Things device market will be more than double the size of the smartphone, PC, tablet, connected car, and wearable markets combined.

Technologies and services belonging to the Internet of Things generated global revenues of $4.8 trillion in 2012 and will reach $8.9 trillion by 2020, growing at a compound annual growth rate (CAGR) of 7.9%.

Alongside this impressive market growth, malicious attacks have also increased dramatically. According to a Kaspersky Security Network (KSN) data report, there have been more than 171,895,830 malicious attacks from online resources worldwide. In the second quarter of 2016, KSN detected 3,626,458 malicious installation packages, which is 1.7 times more than in the first quarter of 2016. These attacks cover a broad range of types, such as RiskTool, AdWare, Trojan-SMS, Trojan-Dropper, Trojan, Trojan-Ransom, Trojan-Spy, Trojan-Banker, Trojan-Downloader, and Backdoor.
http://resources.infosecinstitute.com/internet-things-much-exposed-cyber-threats/#gref

Unfortunately, the rapid diffusion of the Internet of Things paradigm is not accompanied by a rapid improvement of efficient security solutions for those smart objects, while the criminal ecosystem is exploring the technology as a new attack vector.

Technological solutions belonging to the Internet of Things are forcefully entering our daily life. Let's think, for example, of wearable devices or the SmartTV. The greatest problem for the development of the paradigm is the low perception of cyber threats and of their possible impact on privacy.

Cybercriminals are aware of the difficulties the IT community faces in defining a shared strategy to mitigate cyber threats, and for this reason it is plausible that the number of cyber attacks against smart devices will rapidly increase.

As long as there is money to be made, criminals will continue to take advantage of opportunities to pick our pockets. While the battle with cybercriminals can seem daunting, it's a fight we can win. We only need to break one link in their chain to stop them dead in their tracks. Some tips for success:

- Deploy patches quickly
- Eliminate unnecessary applications
- Run as a non-privileged user
- Increase employee awareness
- Recognize our weak points
- Reduce the threat surface

Currently, the two major app store companies, Google and Apple, take different positions on spam app detection: one takes an active approach and the other a passive one. There is strong global demand for malware detection.

Background (Previous Study)

The paper "Early Detection of Spam Mobile Apps" was published by Dr. Suranga Seneviratne and his colleagues at the 2015 International World Wide Web Conference. In it, they emphasised the importance of early detection of malware and introduced a unique idea for detecting spam apps.
Every market applies its own policies for deleting applications from its store, and this is done through continuous human intervention. The researchers wanted to find the reasons and patterns behind the deleted apps and to identify spam apps.

The diagram simply illustrates how they approach early spam detection using manual labelling.

Data Preparation

A new dataset was prepared from a previous study [53]. The initial seed of 94,782 apps was curated from the lists of apps obtained from more than 10,000 smartphone users. Over about 5 months, the researchers collected metadata from the Google Play Store (application name, application description, and application category) for all the apps and discarded apps with non-English descriptions.

Sampling and Labelling Process

An important part of their research was manual labelling, the first methodology proposed, which makes it possible to identify the reasons behind app removal.

Manual labelling took about 1.5 months with 3 reviewers at NICTA. Each reviewer labelled apps using heuristic checkpoints, and the reasons agreed by majority vote are shown in Graph3. They identified 9 key reasons with heuristic checkpoints. The full list of checkpoints can be found in their technical report. (http://qurinet.ucdavis.edu/pubs/conf/www15.pdf)[]

In this report, we list only the checkpoints for the spam reason.

Graph3. Labelled spam data with checkpoint reason.

Checkpoint S1 - Does the app description describe the app function clearly and concisely?

In the previous study, 100 word-bigrams and word-trigrams that describe app functionality were compiled manually. Spam apps are highly likely to lack a clear description. Therefore, each description was compared against these 100 bigrams and trigrams, and the frequency of occurrence was counted.

Checkpoint S2 - Does the app description contain too much detail, incoherent text, or unrelated text?

Analysis of literary style, known as stylometry, was used to map checkpoint S2.
In the study, 16 features were listed in Table 2.

Table 2. Features associated with Checkpoint 2

    No.   Feature
    1     Total number of characters in the description
    2     Total number of words in the description
    3     Total number of sentences in the description
    4     Average word length
    5     Average sentence length
    6     Percentage of upper case characters
    7     Percentage of punctuations
    8     Percentage of numeric characters
    9     Percentage of common English words
    10    Percentage of personal pronouns
    11    Percentage of emotional words
    12    Percentage of misspelled words
    13    Percentage of words with alphabet and numeric characters
    14    Automatic readability index (AR)
    15    Flesch readability score (FR)

For the characterization, greedy feature selection [ ] was used with a decision tree classifier of maximum depth 10. The performance was optimized using the asymmetric F-measure [55].

They found that features 2, 3, 8, 9, and 10 were the most discriminative and that spam apps tend to have less wordy descriptions than non-spam apps. About 30% of spam apps had descriptions of fewer than 100 words.

Checkpoint S3 - Does the app description contain a noticeable repetition of words or keywords?

They used vocabulary richness to deduce spam apps.

Vocabulary Richness (VR) = $\frac{\text{number of unique words in the description}}{\text{total number of words in the description}}$

The researchers expected a low VR for spam apps because of keyword repetition. However, the result was the opposite of the expectation. Surprisingly, apps with a VR close to 1 were likely to be spam, and no non-spam app had a high VR. [ ] This might be due to the terse style of descriptions among spam apps.

Checkpoint S4 - Does the app description contain unrelated keywords or references?

A common spamming technique is to add unrelated keywords so that the app appears in more search results, and the topics of such keywords can vary significantly. To deal with this limitation, a new strategy was proposed: counting the mentions of popular application names in the app description.
In the previous research, the names of the top 100 apps were used for counting the number of mentions.

Only 20% of spam apps mentioned popular apps more than once in their description, whereas 40% to 60% of non-spam apps did. They found that many top apps have social media interfaces and fan pages to stay connected with users. Therefore, these mentions can serve as one identifier for discriminating spam from non-spam apps.

Checkpoint S5 - Does the app description contain excessive references to other applications from the same developer?

The feature here is the number of times a developer's other app names appear.

Only 10 spam apps were labelled under this checkpoint, because the descriptions contained links to the applications rather than the app names.

Checkpoint S6 - Does the developer have multiple apps with approximately the same description?

For this checkpoint, 3 features were considered:

- The total number of other apps developed by the same developer.
- The total number of those apps with an English description, used to measure description similarity.
- Whether descriptions from the same developer have a cosine similarity (s) of over 60%, 70%, 80%, and 90%.

Pre-processing was required to calculate the cosine similarity: [ ]

First, the words were converted to lower case and punctuation symbols were removed. Then each document was represented as a word frequency vector.

Cosine similarity equation:

$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$, where A and B are the word frequency vectors of two descriptions.

http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

They observed that the most discriminative feature was the similarity between app descriptions.

Only 10%-15% of non-spam apps had a description similarity of 60% with 5 other apps developed by the same developer. On the other hand, more than 27% of spam apps reached 60% description similarity. This evidence indicates the tendency of spam apps to have multiple clones with similar app descriptions.
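The pre-processing and similarity calculation described for checkpoint S6 can be sketched roughly as follows. This is only a minimal illustration in Python, not the code of the original study: the function names `word_frequency_vector` and `cosine_similarity` and the two sample descriptions are assumptions made up for this example.

```python
import math
import string
from collections import Counter


def word_frequency_vector(description):
    """Lower-case the text, strip punctuation, and count each word."""
    cleaned = description.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return Counter(cleaned.split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical descriptions of two apps from the same developer.
desc_1 = "Free wallpaper app. Download the best wallpaper!"
desc_2 = "Free ringtone app. Download the best ringtone!"
similarity = cosine_similarity(
    word_frequency_vector(desc_1), word_frequency_vector(desc_2)
)
print(f"{similarity:.2f}")
```

A value above a chosen threshold (60%, 70%, 80%, or 90% in the study) would then count the pair of descriptions as similar.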
Checkpoint S7 - Does the app identifier (appid) make sense and have some relevance to the functionality of the application, or does it appear to be auto-generated?

The application identifier (appid) is a unique identifier in the Google Play Store that follows the Java package naming convention. For example, the appid of Facebook is com.facebook.katana.

For 10% of spam apps, the average word length in the appid is higher than 10, which was the case for only 2%-3% of non-spam apps. No non-spam app had more than 20% non-letter bigrams in its appid, whereas 5% of spam apps did.

Training and Result

Of 1,500 randomly sampled apps, 551 (36.73%) were suspected to be spam. [ ]

Methods

Automation

We used checkpoints S1 and S2 for data management because of their comparability and because they had the highest level of agreement among reviewers. Because access to the descriptions was limited, only 100 samples were used for testing.

We automated checkpoints S1 and S2 according to the following algorithms. The collected data were modified with a log transformation. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

The most time-consuming part of writing the code was collecting the descriptions, which took more than two weeks to find and store. The raw data gave a description link for each appID. However, many of the descriptions could not be found because the version was old or the app was no longer available. So we searched for this information manually on the web and saved each description we found as a file named after its appID. (Diagram.) This allowed us to retrieve the descriptions more efficiently in the automation code.

S1 was automated using the 100 identified word-bigrams and word-trigrams that describe the functionality of applications. Because spam apps are highly likely not to have these phrases in their descriptions, we counted the number of occurrences in each application's description.
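The S1 counting step could look roughly like the sketch below. It is an illustration in Python rather than our original code: the names `FUNCTIONALITY_NGRAMS`, `tokenize`, and `count_ngram_occurrences` and the sample description are hypothetical, and only four of the 100 phrases are included to keep the example short.

```python
import re

# A few entries from the bigram/trigram list; the full study used 100.
FUNCTIONALITY_NGRAMS = ["this app", "allows you", "lets you", "makes it easy"]


def tokenize(description):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", description.lower())


def count_ngram_occurrences(description, ngrams=FUNCTIONALITY_NGRAMS):
    """Count how many times any listed bigram/trigram occurs."""
    tokens = tokenize(description)
    wanted = {tuple(ngram.split()) for ngram in ngrams}
    total = 0
    # Slide a window of each needed size (2 or 3 words) over the tokens.
    for size in {len(ngram) for ngram in wanted}:
        for start in range(len(tokens) - size + 1):
            if tuple(tokens[start:start + size]) in wanted:
                total += 1
    return total


# Hypothetical app description used only for illustration.
sample = "This app lets you edit photos. It makes it easy to share them."
print(count_ngram_occurrences(sample))
```

A description with a count of zero contains none of the functionality phrases, which is the signal S1 looks for.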
The full list of these bigrams and trigrams is given in Table 1.

Table 1. Bigrams and trigrams using the description of top apps

    play games    are available    is the game    app for android    you can
    get notified    to find    learn how    get your    is used to
    your phone    to search    way to    core functionality    a simple
    match your    is a smartphone    available for    app for    to play
    key features    stay in touch    this app    is available    that allows
    to enjoy    take care of    you have to    you to    can you beat
    buy your    is effortless    it's easy    to use    try to
    allows you    keeps you    action game    take advantage    tap the
    take a picture    save your    makes it easy    follow what    is the free
    is a global    brings together    choose from    is a free    discover more
    play as    on the go    more information    learn more    turns on
    is an app    face the challenges    game from    in your pocket    your device
    on your phone    make your life    with android    it helps    delivers the
    offers essential    is a tool    full of features    for android    lets you
    is a simple    it gives    support for    need your help    enables your
    game of    how to play    at your fingertips    to discover    brings you
    to learn    this game    play with    it brings    navigation app
    makes mobile    is a fun    your answer    drives you    strategy game
    is an easy    game on    your way    app which    on android
    application which    train your    game which    helps you    make your

S2 had the second-highest level of agreement among the three reviewers in the previous study. Among the 551 identified spam apps, 144 were confirmed under S2: 63 by agreement of all 3 reviewers and 81 by agreement of 2 reviewers.
We knew from the results of the previous research that the total number of words in the description, the percentage of numeric characters, the percentage of non-alphabet characters, and the percentage of common English words would be the most distinctive features. Therefore, we automated the total number of words in the description and the percentage of common English words using C++.

Algorithm 1. Counting the total number of bi/tri-grams in the description

In the literature [], 16 features were used to capture checkpoint S2. The characterization was done with a wrapper method using a decision tree classifier, and they found that 30% of spam apps had fewer than 100 words in their description, whereas only 15% of the most popular apps did. We extracted a simple but key point from their result: the number of words in the description and the percentage of common English words. This was developed in C++ as follows.

Algorithm 2. Counting the total number of words in the description

    #include <string>

    // Counts words by counting the spaces between them
    // (assumes words are separated by single spaces).
    int count_Words(const std::string& input_text) {
        if (input_text.empty()) return 0;
        int number_of_words = 1;
        for (std::size_t i = 0; i < input_text.size(); i++) {
            if (input_text[i] == ' ')
                number_of_words++;
        }
        return number_of_words;
    }

The percentage of common English words was not completed properly because of the difficulty of selecting a standard word list. However, here is the code that we will develop in a future study.

Algorithm 3. Calculate the percentage of common English words (CEW) in the description

    #include <fstream>
    #include <set>
    #include <sstream>
    #include <string>

    // Counts the words of input_text that appear in the CEW list.
    // cew_path is an assumed file holding one common word per line.
    int count_CEW(const std::string& input_text, const std::string& cew_path) {
        std::ifstream readFile(cew_path);
        std::set<std::string> common_words;
        std::string CEW;
        while (std::getline(readFile, CEW))
            common_words.insert(CEW);

        int number_of_words = 0;
        std::istringstream words(input_text);
        std::string word;
        while (words >> word) {
            if (common_words.count(word))
                number_of_words++;
        }
        return number_of_words;
    }

    // Multiply before dividing so integer division does not truncate to 0.
    int percentage(int c_words, int words) {
        return words == 0 ? 0 : (c_words * 100) / words;
    }

Normalization

The S1 and S2 variables ranged over [min, max]. Because the data were highly skewed, normalization was strongly required. This is different from database normalization, which is the process of organizing data into tables in such a way that the results of using the database are always unambiguous and as intended.
Such normalization is intrinsic to relational database theory, whereas the normalization used here rescales each feature value.

Using Excel, we normalized the data as shown in the following diagram. Through normalization, the transformed data fall between 0 and 1. The range of 0 to 1 was important for the later LVQ step.

Diagram. Excel spreadsheet of automated data (left) and normalized data (right)

After the transformation, we wanted to test the data to show how the LVQ algorithm works with the modified attributes. Therefore, we sampled only 100 records from the modified data set. Even though the result was not significant, it was an important test, because after this step we can add more attributes in a future study and adjust the calibration. We randomly sampled 50 entities each from the top 100 ranked apps and from the pre-identified spam data. The top 100 ranked apps were assumed to be highly likely non-spam.

Diagram.

Initial Results

We used Python to perform Learning Vector Quantization (LVQ).

LVQ is a prototype-based supervised classification algorithm that belongs to the field of Artificial Neural Networks. It can be applied to multi-class classification problems, and its prototypes are modified during the training process.

The information processing objective of the algorithm is to prepare a set of codebook (or prototype) vectors in the domain of the observed input data samples and to use these vectors to classify unseen examples.

An initially random pool of vectors was prepared and then exposed to training samples. A winner-take-all strategy was employed, in which one or more of the vectors most similar to a given input pattern are selected and adjusted to be closer to the input vector and, in some cases, further away from the winner for runners-up.
The repetition of this process results in a distribution of codebook vectors in the input space that approximates the underlying distribution of samples from the test dataset.

Because of the data size, our experiments were done using only the modified data set. We performed 10-fold cross-validation on the data. It gave an average accuracy of 56%, which was quite high compared with the previous study, considering that only two attributes were used to separate spam from non-spam.

The LVQ program consisted of 3 steps: [ ]

1. Euclidean Distance
2. Best Matching Unit
3. Training Codebook Vectors

1. Euclidean Distance

The distance between two rows of the dataset is required, where each row is a point in a multi-dimensional space.

The formula for calculating the distance between two rows x and y is

$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$

where the differences between the two rows are taken, squared, and summed over the p variables.

    from math import sqrt

    def euclidean_distance(row1, row2):
        distance = 0.0
        # The last column is the class label, so it is skipped.
        for i in range(len(row1) - 1):
            distance += (row1[i] - row2[i]) ** 2
        return sqrt(distance)

2. Best Matching Unit

Once the Euclidean distance from a row to every codebook vector has been calculated, the codebook vectors are sorted by distance and the closest one is returned.

    def get_best_matching_unit(codebooks, test_row):
        distances = list()
        for codebook in codebooks:
            dist = euclidean_distance(codebook, test_row)
            distances.append((codebook, dist))
        # Sort by distance; the first entry is the best matching unit.
        distances.sort(key=lambda tup: tup[1])
        return distances[0][0]

3. Training Codebook Vectors

Initial codebook vectors were constructed from random feature values in the training dataset.

    from random import randrange

    def random_codebook(train):
        n_records = len(train)
        n_features = len(train[0])
        # Each feature is copied from a randomly chosen training record.
        codebook = [train[randrange(n_records)][i] for i in range(n_features)]
        return codebook

Future work

While writing this report, I found that data collection from the Google Play Store can be automated using a Java client. This would increase the size of the dataset and could improve accuracy while saving considerable time.
Because of the small number of attributes and random samples, it is not appropriate to call the result of this research significant. However, a basic framework was developed for improving accuracy.

Acknowledgement

Last summer, I did some research reading work under the supervision of Associate Professor Julian Jang-Jaccard. I've had really great support from Julian and INMS. Thanks to the financial support I received from INMS, I could fully focus on my academic research, and I benefited a great deal from this amazing opportunity.

The following is a general report of my summer research:

At the beginning of summer, I studied the paper "A Detailed Analysis of the KDD CUP 99 Data Set" by M. Tavallaee et al. This gave me a basic idea of how to handle machine learning techniques.

Approach of KNN and LVQ

The main project followed the paper "Why My App Got Deleted: Detection of Spam Mobile Apps" by Suranga Seneviratne et al.

I have tried my best to keep this report simple yet technically correct. I hope I have succeeded in my attempt.
Reference

Appendix

Modified Data

    Number of words (thousands)    Bigram/trigram    Identified as spam (b) / not (g)
    0.084    0    b
    0.18     0    b
    0.121    0    b
    0.009    1    b
    0.241    0    b
    0.452    0    b
    0.105    1    b
    0.198    0    b
    0.692    1    b
    0.258    1    b
    0.256    1    b
    0.225    0    b
    0.052    0    b
    0.052    0    b
    0.021    0    b
    0.188    1    b
    0.188    1    b
    0.092    1    b
    0.098    0    b
    0.188    1    b
    0.161    1    b
    0.107    0    b
    0.375    0    b
    0.195    0    b
    0.112    0    b
    0.11     1    g
    0.149    1    g
    0.368    1    g
    0.22     1    g
    0.121    1    g
    0.163    1    g
    0.072    1    g
    0.098    1    g
    0.312    1    g
    0.282    1    g
    0.229    1    g
    0.256    1    g
    0.298    0    g
    0.092    0    g
    0.189    0    g
    0.134    1    g
    0.157    1    g
    0.253    1    g
    0.12     1    g
    0.34     1    g
    0.57     1    g
    0.34     1    g
    0.346    1    g
    0.126    1    g
    0.241    1    g
    0.162    1    g
    0.084    0    g
    0.159    0    g
    0.253    1    g
    0.231    1    g