Creating a model to detect malware using supervised learning algorithms

Topic: Malicious Files Identification via Machine Learning Models

TASK: You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM’s initial attempt to classify the samples.

Part 1 – General data preparation and cleaning.

a) Import the MLDATASET_PartiallyCleaned.xlsx into R Studio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.

b) Write the appropriate code in R Studio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows:
   i. For How.Many.Times.File.Seen, set all values = 65535 to NA;
   ii. Convert Threads.Started to a factor whose categories are given by
         1 = 1 thread started
         2 = 2 threads started
         3 = 3 threads started
         4 = 4 threads started
         5 = 5 or more threads started
       Hint: Replace all values greater than 5 with 5, then use the factor(.) function.
   iii. Log-transform using the log(.) function, and remove the original column from the dataset (unless you have overwritten it with the log-transformed data)
   iv. Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.

   Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.

c) Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as csv files, and these will need to be submitted along with your code.

Part 2 – Compare the performances of different machine learning algorithms

a) Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.
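The Part 1 cleaning and partitioning steps can be sketched in base R as follows. This is only a minimal illustration on a small made-up data frame, not the real dataset. Several things here are assumptions: the column Example.Numeric is a hypothetical stand-in for the variable in step iii (the brief does not name it), the seed 12345 is a placeholder for a student ID, the "YES"/"NO" outcome labels are invented, and taking 30% of the rows as the training set is one possible reading of the 30/70 split.

```r
# Toy stand-in for MLDATASET_PartiallyCleaned (all values are made up).
set.seed(12345)  # placeholder seed: replace with your student ID
n <- 200
MLDATASET <- data.frame(
  Sample.ID = seq_len(n),
  How.Many.Times.File.Seen = sample(c(1:50, 65535), n, replace = TRUE),
  Threads.Started = sample(1:8, n, replace = TRUE),
  Example.Numeric = rexp(n) + 1,  # hypothetical column for step iii
  Actually.Malicious = factor(sample(c("YES", "NO"), n, replace = TRUE))
)

# i. Treat the sentinel value 65535 as missing.
seen <- MLDATASET$How.Many.Times.File.Seen
MLDATASET$How.Many.Times.File.Seen[seen == 65535] <- NA

# ii. Cap thread counts at 5, then convert to a factor with five categories.
MLDATASET$Threads.Started <- factor(
  pmin(MLDATASET$Threads.Started, 5),
  levels = 1:5,
  labels = c("1", "2", "3", "4", "5+")
)

# iii. Log-transform, overwriting the original column so nothing needs removing.
MLDATASET$Example.Numeric <- log(MLDATASET$Example.Numeric)

# iv. Keep only rows with no missing values.
MLDATASET.cleaned <- na.omit(MLDATASET)

# c) Random partition: 30% of rows to training, the rest to test (assumed reading).
n.clean <- nrow(MLDATASET.cleaned)
train.idx <- sample(n.clean, size = round(0.3 * n.clean))
train.data <- MLDATASET.cleaned[train.idx, ]
test.data <- MLDATASET.cleaned[-train.idx, ]

# Export both sets; a temporary directory is used here, change the path as needed.
write.csv(train.data, file.path(tempdir(), "train.csv"), row.names = FALSE)
write.csv(test.data, file.path(tempdir(), "test.csv"), row.names = FALSE)
```

With the real data, the toy data frame would be replaced by the imported spreadsheet, and the seed must be set before the call to sample(.) so that the partition is reproducible.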
For each of your ML modelling approaches, you will need to:

a) Run the ML algorithm in R on the training set with Actually.Malicious as the outcome variable. EXCLUDE Sample.ID and Initial.Statistical.Analysis from the modelling process.

b) Perform hyperparameter tuning to optimise the model (except for the Binary Logistic Regression model):
   – Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches, even if you’re using the same search strategy as the workshop notes. Report on the search range(s) for hyperparameter tuning, which 𝑘-fold CV was used, the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (where appropriate).
   – If your selected tree model is Bagging, you must tune the nbagg, cp and minsplit hyperparameters, with at least 3 values for each.
   – If your selected tree model is Random Forest, you must tune the num.trees, mtry, min.node.size, and sample.fraction hyperparameters, with at least 3 values for each.

c) Evaluate the performance of each ML model on the test set. Provide the confusion matrices and report the following:
   – Sensitivity (the detection rate for actual malicious samples)
   – Specificity (the detection rate for actual non-malicious samples)
   – Overall Accuracy

d) Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony, accuracy, and to a lesser extent, interpretability should be taken into account.

e) Create a confusion matrix for the variable Initial.Statistical.Analysis in the test set. Recall that the data in this column correspond to TOBORRM’s initial attempt to classify the samples. Compare and comment on the performance of your optimal ML model in part d) to the initial analysis by the TOBORRM team.

Further details and instructions attached.

Expected outcome:
1.) Report (5 pages or less, excl. cover/contents page)
2.) Copy of R code
3.) Two csv files corresponding to test datasets

Helpful Links on LinkedIn Learning:
1. Data Analysis and Visualisation: Introduction
2. Introduction to Machine Learning and R
3. Introduction to the R Tidyverse
4. R Statistics Essential Training

Free textbook:
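For the evaluation steps in Part 2 c) and e), the confusion matrix and the three requested metrics can be computed with base R alone. The sketch below uses a hypothetical helper, evaluate_predictions, and made-up label vectors; the "YES"/"NO" labels and the choice of "YES" as the malicious class are assumptions, not taken from the dataset.

```r
# Hypothetical helper: confusion matrix plus sensitivity, specificity and accuracy.
# `positive` names the level that represents a malicious sample (assumed "YES").
evaluate_predictions <- function(actual, predicted, positive = "YES") {
  lv <- levels(actual)
  negative <- setdiff(lv, positive)
  predicted <- factor(predicted, levels = lv)  # align levels with `actual`
  conf.mat <- table(Predicted = predicted, Actual = actual)

  TP <- conf.mat[positive, positive]  # malicious, flagged as malicious
  TN <- conf.mat[negative, negative]  # clean, flagged as clean
  FP <- conf.mat[positive, negative]  # clean, flagged as malicious
  FN <- conf.mat[negative, positive]  # malicious, flagged as clean

  list(
    confusion = conf.mat,
    sensitivity = as.numeric(TP / (TP + FN)),
    specificity = as.numeric(TN / (TN + FP)),
    accuracy = as.numeric((TP + TN) / sum(conf.mat))
  )
}

# Made-up test-set labels for illustration only.
actual <- factor(
  c("YES", "YES", "YES", "NO", "NO", "NO", "NO", "YES", "NO", "NO"),
  levels = c("NO", "YES")
)
predicted <- factor(
  c("YES", "NO", "YES", "NO", "NO", "YES", "NO", "YES", "NO", "NO"),
  levels = c("NO", "YES")
)

res <- evaluate_predictions(actual, predicted)
print(res$confusion)
print(unlist(res[c("sensitivity", "specificity", "accuracy")]))
```

The same helper could be applied to the Initial.Statistical.Analysis column of the test set in place of predicted, which would put TOBORRM’s initial classification and the chosen model on the same three metrics for the comparison in part e).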