Advanced Semi-supervised Tensor Decomposition Methods for Malware Characterization

Maksim E. Eren

August 2024

Abstract

Malware continues to be one of the most dangerous and costly cyber threats to national security. As of last year, over 1.3 billion malware specimens had been documented, prompting the use of data-driven machine learning (ML) techniques for their analysis. However, existing ML approaches face significant barriers that limit their widespread adoption. These challenges include detecting novel malware, maintaining performance when little labeled data is available for training, and classifying malware under class imbalance, a scenario where malware families are unevenly represented in the dataset. This dissertation addresses these shortcomings by introducing three novel semi-supervised ML methods based on tensor decomposition. Our methods build on dimensionality reduction, hierarchical tensor decomposition, automatic model determination, and feature extraction, combined with selective classification, also known as the reject option. The reject option is a form of self-awareness that allows our models to abstain from making a decision under uncertainty, which in turn enables the detection of novel threats. In this dissertation, we present the foundational concepts underlying our methods and describe the three approaches we developed: the Random Forest of Tensors (RFoT), the HNMFk Classifier, and MalwareDNA. Additionally, we detail how our methods use High Performance Computing (HPC), multi-processing, and Graphics Processing Units (GPUs) for accelerated computation. We report experiments with all three methods in which we demonstrate stable task performance under extreme class imbalance, low quantities of labeled data, and very large numbers of malware families. We also report results for simultaneously classifying benign-ware and malware, classifying malware families, and detecting novel malware families. Our results are compared against state-of-the-art semi-supervised and supervised ML baselines on two datasets. We show that our methods surpass the performance of these baselines, at the cost of a higher abstention (reject-option) rate.

Type

Thesis

Publication

Ph.D. Dissertation in Computer Science at the University of Maryland, Baltimore County Department of Computer Science and Electrical Engineering

Keywords:

Tensors, Machine Learning, Ensemble, Semi-supervised, Malware, NMF, Reject Option

Citation:

Eren, M. E. Advanced Semi-supervised Tensor Decomposition Methods for Malware Characterization. Ph.D. Dissertation in Computer Science at the University of Maryland, Baltimore County Department of Computer Science and Electrical Engineering. 2024.

BibTeX:

@misc{eren2024Dissertation,
      title={Advanced Semi-supervised Tensor Decomposition Methods for Malware Characterization}, 
      author={M. E. {Eren}},
      year={2024},
      note={Ph.D. Dissertation in Computer Science at the University of Maryland, Baltimore County Department of Computer Science and Electrical Engineering. 2024.}
}