Tensor Decomposition for Cybersecurity
Tensor decomposition is a powerful unsupervised machine learning method used to extract hidden patterns from large datasets. This page gives a selected list of publications on tensor decomposition methods for cybersecurity and data privacy. The publications cover a diverse array of capabilities, showcasing the cutting-edge use of tensors in the detection of network and power grid anomalies, identification of spam e-mails, mitigation of credit card fraud, detection of malware, classification of malware families, identification of novel forms of malware, analysis of user behavior, and data privacy through federated learning techniques.
Anomaly Detection
General-Purpose Unsupervised Cyber Anomaly Detection via Non-Negative Tensor Factorization
Unsupervised Anomaly Detection - More Accurate and Precise - Large Datasets - Diverse set of Challenging Problems
Distinguishing malicious anomalous activities from unusual but benign activities is a fundamental challenge for cyber defenders. Prior studies have shown that statistical user behavior analysis yields accurate detections by learning behavior profiles from observed user activity. These unsupervised models are able to generalize to unseen types of attacks by detecting deviations from normal behavior, without knowledge of specific attack signatures. However, approaches proposed to date based on probabilistic matrix factorization are limited by the information conveyed in a two-dimensional space. Non-negative tensor factorization, on the other hand, is a powerful unsupervised machine learning method that naturally models multi-dimensional data, capturing complex and multi-faceted details of behavior profiles. Our new unsupervised statistical anomaly detection methodology matches or surpasses state-of-the-art supervised learning baselines across several challenging and diverse cyber application areas, including detection of compromised user credentials, botnets, spam e-mails, and fraudulent credit card transactions.
- Authors: Maksim E. Eren, Juston S. Moore, Erik Skau, Elisabeth Moore, Manish Bhattarai, Gopinath Chennupati, and Boian S. Alexandrov
- DOI: Link
- Pre-print: Link
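To make the idea of a tensor behavior profile concrete, here is a minimal NumPy sketch of non-negative tensor factorization on a toy user x host x hour count tensor. It is not the method from the paper: it uses plain multiplicative updates with a least-squares objective, and the tensor layout, the `ntf` and `reconstruct` helpers, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def unfold(X, mode):
    """Matricize X so that `mode` indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ntf(X, rank, n_iter=300, seed=0, eps=1e-9):
    """Non-negative CP factorization of a 3-way tensor (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) + eps for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            B, C = [factors[m] for m in range(3) if m != mode]
            KR = khatri_rao(B, C)
            numer = unfold(X, mode) @ KR
            denom = factors[mode] @ (KR.T @ KR) + eps
            factors[mode] *= numer / denom  # keeps every entry non-negative
    return factors

def reconstruct(factors):
    """Expected activity for every (user, host, hour) cell."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

# Toy "normal" activity: two user groups, each with its own hosts and hours.
rng = np.random.default_rng(1)
rates = np.zeros((20, 15, 12))
rates[:10, :8, :6] = 5.0   # group 1: users 0-9, hosts 0-7, hours 0-5
rates[10:, 8:, 6:] = 5.0   # group 2: users 10-19, hosts 8-14, hours 6-11
X_train = rng.poisson(rates).astype(float)

factors = ntf(X_train, rank=2)
expected = reconstruct(factors)

typical = expected[0, 0, 0]    # user 0 on one of its usual hosts and hours
unusual = expected[0, 10, 8]   # user 0 on a host and hour used only by group 2
rel_err = np.linalg.norm(X_train - expected) / np.linalg.norm(X_train)
print(f"typical={typical:.3f} unusual={unusual:.3f} rel_err={rel_err:.3f}")
```

Cells where the fitted profile expects almost no activity are the ones a detector of this kind would flag if events appeared there in later data.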
Electrical Grid Anomaly Detection via Tensor Decomposition
Unsupervised Anomaly Detection - Accurate and Precise SCADA Anomaly Detection - Large Data
Supervisory Control and Data Acquisition (SCADA) systems often serve as the nervous system for substations within power grids. These systems facilitate real-time monitoring, data acquisition, and control of equipment, and ensure smooth and efficient operation of the substation and its connected devices. As the dependence on these SCADA systems grows, so does the risk of potential malicious intrusions that could lead to significant outages or even permanent damage to the grid. Previous work has shown that dimensionality reduction-based approaches, such as Principal Component Analysis (PCA), can be used for accurate identification of anomalies in SCADA systems. While not specifically applied to SCADA, non-negative matrix factorization (NMF) has shown strong results at detecting anomalies in wireless sensor networks. These unsupervised approaches model the normal or expected behavior and detect unseen types of attacks or anomalies by identifying the events that deviate from the expected behavior. These approaches, however, do not model the complex and multi-dimensional interactions that are naturally present in SCADA systems. In contrast, non-negative tensor decomposition is a powerful unsupervised machine learning (ML) method that can model the complex and multi-faceted activity details of SCADA events. In this work, we present a novel application of the tensor decomposition method Canonical Polyadic Alternating Poisson Regression (CP-APR) with a probabilistic framework, which has previously shown state-of-the-art anomaly detection results on cyber network data, to identify anomalies in SCADA systems. We showcase that the use of statistical behavior analysis of SCADA communication with tensor decomposition improves the specificity and accuracy of identifying anomalies in electrical grid systems.
In our experiments, we model real-world SCADA system data collected from the electrical grid operated by Los Alamos National Laboratory (LANL), which provides transmission and distribution service through a partnership with Los Alamos County, and detect synthetically generated anomalies.
- Authors: Alexander B. Most, Maksim E. Eren, Boian S. Alexandrov, and Nigel Lawrence
- DOI: Link
- Pre-print: Link
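The probabilistic scoring step can be illustrated in isolation. The sketch below assumes a Poisson model whose rate for each tensor cell comes from non-negative CP factors, and scores an observed count by its upper-tail p-value. The factor values, the weights, and the choice of modes (source device, destination device, time-of-day bin) are invented for illustration and stand in for factors that a CP-APR fit would produce; the paper's exact scoring procedure may differ.

```python
import math
import numpy as np

def poisson_rate(factors, weights, index):
    """Rate for one tensor cell: sum_r weights[r] * prod_n factors[n][index[n], r]."""
    rate = weights.copy()
    for A, i in zip(factors, index):
        rate = rate * A[i]
    return float(rate.sum())

def poisson_sf(count, rate):
    """P(X >= count) for X ~ Poisson(rate), computed as 1 - CDF(count - 1)."""
    if count <= 0:
        return 1.0
    cdf = sum(math.exp(-rate) * rate**k / math.factorial(k) for k in range(count))
    return max(0.0, 1.0 - cdf)

# Hypothetical fitted model with two components; each factor column sums to 1
# and `weights` holds the expected total event count per component.
A_src = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]])   # 3 source devices
A_dst = np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.5]])   # 3 destination devices
A_bin = np.array([[0.9, 0.5], [0.1, 0.5]])               # 2 time-of-day bins
factors = [A_src, A_dst, A_bin]
weights = np.array([200.0, 40.0])

# (source, destination, time bin) -> observed count in the test period
events = [((0, 0, 0), 85), ((0, 0, 1), 30), ((2, 0, 0), 1)]
for index, count in events:
    rate = poisson_rate(factors, weights, index)
    p = poisson_sf(count, rate)
    print(f"cell={index} count={count} rate={rate:.1f} p-value={p:.3g}")
```

A small p-value means the observed count is much higher than the learned profile expects for that cell, so ranking events by ascending p-value gives an anomaly ordering.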
Malware Characterization
Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
Semi-supervised Method - Handles Extreme Class-imbalance - Handles Extreme Quantities of Classes - Maintains Performance with Low-quantities of Labels
Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With the HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can make abstaining predictions (a reject option), which yields promising results in the identification of novel malware families and helps maintain the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 malware families, both rare and prominent, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
- Authors: Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, and Boian S. Alexandrov
- DOI: Link
- Pre-print: Link
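The combination of clustering, a few known labels, and a reject option can be sketched in a few lines. The example below is not the HNMFk Classifier: it runs a single flat NMF with a fixed number of clusters (no hierarchy and no automatic model selection), and the function names, the `min_agreement` threshold, and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

ABSTAIN = "abstain"

def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Multiplicative-update NMF: X (samples x features) ~ W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    # Move the scale into W so that cluster memberships are comparable.
    scale = H.sum(axis=1, keepdims=True)
    return W * scale.T, H / scale

def predict_with_abstention(W, labels, min_agreement=0.75):
    """Give each sample the majority known label of its cluster, or abstain.

    labels[i] is a family name for labeled samples and None otherwise.
    A cluster with no labels, or with too little agreement, is rejected.
    """
    clusters = W.argmax(axis=1)
    decision = {}
    for c in np.unique(clusters):
        members = np.flatnonzero(clusters == c)
        known = [labels[i] for i in members if labels[i] is not None]
        if not known:
            decision[int(c)] = ABSTAIN
            continue
        family, votes = Counter(known).most_common(1)[0]
        decision[int(c)] = family if votes / len(known) >= min_agreement else ABSTAIN
    return [decision[int(c)] for c in clusters]

# Toy data: two groups of samples with disjoint feature sets. Only two
# samples of the first group are labeled; the second group is "novel".
rng = np.random.default_rng(0)
X = np.zeros((12, 10))
X[:6, :5] = rng.random((6, 5)) + 1.0
X[6:, 5:] = rng.random((6, 5)) + 1.0
labels = ["family_A", "family_A"] + [None] * 10

W, H = nmf(X, k=2)
preds = predict_with_abstention(W, labels)
print(preds)
```

Abstaining on clusters that contain no labeled samples is what lets a scheme like this surface candidate novel families instead of forcing them into a known class.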
Catch’em all: Classification of Rare, Prominent, and Novel Malware Families
Semi-supervised Method - Handles Class-imbalance - Detects Novel Malware - Detects Rare Malware
National security is threatened by malware, which remains one of the most dangerous and costly cyber threats. As of last year, researchers reported 1.3 billion known malware specimens, motivating the use of data-driven machine learning (ML) methods for analysis. However, shortcomings in existing ML approaches hinder their mass adoption. These challenges include detection of novel malware and the ability to perform malware classification in the face of class imbalance, a situation where malware families are not equally represented in the data. Our work addresses these shortcomings with MalwareDNA, an advanced dimensionality reduction and feature extraction framework. We demonstrate stable task performance under class imbalance for the following tasks: malware family classification and novel malware detection, with a trade-off in increased abstention or reject-option rate.
- Authors: Maksim E. Eren, Ryan Barron, Manish Bhattarai, Selma Wanna, Nicholas Solovyev, Kim Rasmussen, Boian S. Alexandrov, and Charles Nicholas
- DOI: Link
- Pre-print: Link
MalwareDNA: Simultaneous Classification of Malware, Malware Families, and Novel Malware
Semi-supervised Method - Handles Class-imbalance - Detects Novel Malware - Detects Rare Malware - Three Capabilities in One Model
Malware is one of the most dangerous and costly cyber threats to national security and a crucial factor in modern cyber-space. However, the adoption of machine learning (ML) based solutions against malware threats has been relatively slow. Shortcomings in the existing ML approaches are likely contributing to this problem. The majority of current ML approaches ignore real-world challenges such as the detection of novel malware. In addition, proposed ML approaches are often designed either for malware/benign-ware classification or malware family classification. Here we introduce and showcase preliminary capabilities of a new method that can perform precise identification of novel malware families, while also unifying the capability for malware/benign-ware classification and malware family classification into a single framework.
- Authors: Maksim E. Eren, Manish Bhattarai, Kim Rasmussen, Boian S. Alexandrov, and Charles Nicholas
- DOI: Link
- Pre-print: Link
Data Privacy
One-Shot Federated Group Collaborative Filtering
Unsupervised Method - One-shot Federated Learning - Recommender Systems
Non-negative matrix factorization (NMF) with missing-value completion is a well-known and effective Collaborative Filtering (CF) method used to provide personalized user recommendations. However, traditional CF relies on a privacy-invasive collection of user data to build a central recommender model. One-shot federated learning has recently emerged as a method to mitigate the privacy problem while addressing the traditional communication bottleneck of federated learning. In this paper, we present the first one-shot federated CF implementation, named One-FedCF, for groups of users or collaborating organizations. In our solution, the clients first apply local CF in parallel to build distinct, client-specific recommenders. Then, the privacy-preserving local item patterns and biases from each client are shared with the processor to perform joint factorization in order to extract the global item patterns. Extracted patterns are then aggregated to each client to build the local models via information retrieval transfer. In our experiments, we demonstrate our approach with two MovieLens datasets and show results competitive with the state-of-the-art federated recommender systems at a substantial decrease in the number of communications.
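The three-step flow in this abstract (local factorization, one upload for joint factorization, one download to rebuild local models) can be sketched with masked NMF. This is a loose illustration under simplifying assumptions, not the One-FedCF implementation: biases are omitted, the final step simply refits user factors against the global item patterns, and the helper names and synthetic ratings are hypothetical.

```python
import numpy as np

def masked_nmf(R, M, k, H=None, n_iter=200, seed=0, eps=1e-9):
    """Fit R ~ W @ H on observed entries only (M is a 0/1 mask).

    If H is given, it is held fixed and only the user factors W are learned.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((R.shape[0], k)) + eps
    fixed_H = H is not None
    if not fixed_H:
        H = rng.random((k, R.shape[1])) + eps
    RM = R * M
    for _ in range(n_iter):
        W *= (RM @ H.T) / ((M * (W @ H)) @ H.T + eps)
        if not fixed_H:
            H *= (W.T @ RM) / (W.T @ (M * (W @ H)) + eps)
    return W, H

def server_aggregate(local_item_patterns, k, seed=0):
    """Jointly factorize the stacked local item patterns into k global ones."""
    stacked = np.vstack(local_item_patterns)   # (clients * k) x items
    _, G = masked_nmf(stacked, np.ones_like(stacked), k, seed=seed)
    return G

# Synthetic setup: 3 clients share a 12-item catalogue; ratings are low rank
# and roughly 70% of each client's entries are observed.
rng = np.random.default_rng(42)
n_items, k = 12, 2
H_true = rng.random((k, n_items))
clients = []
for _ in range(3):
    R = rng.random((8, k)) @ H_true * 5.0
    M = (rng.random(R.shape) < 0.7).astype(float)
    clients.append((R, M))

# Step 1 (on each client): local CF; only the item patterns leave the client.
local_H = [masked_nmf(R, M, k, seed=i)[1] for i, (R, M) in enumerate(clients)]

# Step 2 (single upload): the server extracts global item patterns.
G = server_aggregate(local_H, k)

# Step 3 (single download): each client refits its user factors against G.
rmses = []
for R, M in clients:
    W, _ = masked_nmf(R, M, k, H=G)
    rmses.append(float(np.sqrt((((W @ G - R) * M) ** 2).sum() / M.sum())))
print("per-client RMSE on observed ratings:", [round(r, 3) for r in rmses])
```

In this sketch the raw rating matrices never leave the clients, and each client communicates with the server once in each direction, which is the property the one-shot setting is after.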