Automation for Information Security using Machine Learning

Stefan Thaler

promotor: prof. dr. Milan Petković (TU/e)
copromotor: dr. Vlado Menkovski (TU/e)
Eindhoven University of Technology
Date: 20 February 2019
Thesis: PDF


Information security addresses the protection of information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide confidentiality, integrity, and availability. One crucial part of information security is to collect data from these information systems and analyze it to understand which actions have been performed on them and whether those actions were malicious.

Such data analysis in information security comes in many forms. The most basic one involves the manual labor of a domain expert. This expert sifts through the data for clues relevant to the task that she is trying to complete. In many cases, this process of manually analyzing data is cumbersome, labor-intensive, and error-prone. Relying on manual labor is particularly problematic when the amount of data to be analyzed is large.

To address these data analysis challenges, security professionals often use software tools to aid them. Such tools automate repetitive data analysis tasks by applying rules that extract and aggregate valuable information from large volumes of data. As a result, such tools drastically reduce the effort required for data analysis. However, larger, more complex problems require a large number of rules, and the resulting rule set may be incomplete, may contain rules that contradict each other, or may not generalize well.
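As an illustration of such rule-based tooling, the following is a minimal sketch in Python. The rule names, the regular expressions, and the log line format are hypothetical and are not taken from the thesis. The sketch shows both the aggregation step and the incompleteness problem: any line that no rule anticipates ends up in the unmatched list.

```python
import re
from collections import Counter

# Hypothetical hand-written rules: each label maps to a regular expression
# that extracts fields from a matching log line.
RULES = {
    "failed_login": re.compile(r"Failed password for (?P<user>\S+) from (?P<ip>\S+)"),
    "accepted_login": re.compile(r"Accepted password for (?P<user>\S+) from (?P<ip>\S+)"),
}


def apply_rules(lines):
    """Count (rule label, source IP) pairs and collect lines no rule covers."""
    counts = Counter()
    unmatched = []
    for line in lines:
        for label, pattern in RULES.items():
            match = pattern.search(line)
            if match:
                counts[(label, match.group("ip"))] += 1
                break
        else:
            # No rule matched: the rule set is incomplete for this line.
            unmatched.append(line)
    return counts, unmatched
```

Every new kind of log line requires another hand-written rule, which is the maintenance burden described above.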

Machine learning is a sub-field of artificial intelligence that aims to learn rules from data to solve a task, rather than letting an expert define them manually. However, applying machine learning solutions to security problems is not straightforward, and a number of challenges need to be addressed. In this thesis, we focus on three main challenges: the lack of supervision, the use of additional domain knowledge, and the lack of contextual information. When used in a supervised way, machine learning requires labels, but in information security such labels may not be readily available, may be difficult to obtain, or may become obsolete because the context changes quickly. In many cases, additional domain knowledge is available in information security, but there are no standard mechanisms for integrating this knowledge into the learning process, which results in less efficient learning. Finally, in some information security scenarios the intent of an action matters, but it cannot be determined merely by analyzing the logs. For example, copying a project from a repository can be a benign or a malicious action: benign if the intent is to create a backup, malicious if the intent is to steal the project.

In this thesis, we work on two use cases that are representative of the three challenges that we want to address. The first use case deals with the information forensics problem of signature extraction from forensic logs. Forensic logs are semi-structured, high-dimensional sequences produced by the software processes on the investigated computer. These logs are typically large, and labeled data is typically not available because the processes that create these logs change frequently. Furthermore, additional domain knowledge about these logs is readily available, since the logs are created by software. The second use case deals with the detection of data theft. In this scenario, recording and analyzing users' actions alone may not be sufficient to determine whether a data theft has happened.
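To make the notion of a signature more concrete, the sketch below assumes that a signature is the fixed template of a log line, with its variable parts replaced by placeholders. This reading, the masking patterns, and the function names are assumptions made for illustration. The sketch is a simple regular-expression heuristic and does not represent the learned methods of the thesis.

```python
import re
from collections import defaultdict

# Assumed masking heuristics: patterns for typical variable fields.
# Order matters: IP addresses are masked before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(line):
    """Replace variable-looking tokens with placeholders to obtain a template."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line


def group_by_signature(lines):
    """Group log lines that share the same heuristic signature."""
    groups = defaultdict(list)
    for line in lines:
        groups[signature(line)].append(line)
    return dict(groups)
```

Heuristics of this kind fail when variable fields do not follow a recognizable pattern, for example user names or file paths, which is one reason to consider learning signatures from the data instead.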

To address these three challenges, we contribute in the following ways. First, we devise a deep-learning-based method for extracting signatures from forensic logs. This method uses labeled log lines to learn a model that is capable of extracting the signatures, which validates our assumption that such a model can learn the complex relationships within forensic logs. Secondly, we propose a method that clusters forensic log lines according to their signatures in an unsupervised way. Since this second method does not rely on human input, it addresses the lack of supervision.
Thirdly, we propose a method that efficiently combines available heuristic domain knowledge with knowledge from labeled data. We evaluated the approaches for the forensic use case on one forensic log and two large system logs. The experiments showed improved accuracy over baseline approaches and a more efficient use of human labor, since less expert input is needed. Thus, these methods reduce the need for supervision and provide a way to incorporate available domain knowledge into the machine learning process.
Finally, we devise a data-driven method for generating decoy project folders to detect data theft. These decoy projects look real but are intrinsically valueless, which renders any interaction with them suspicious. We confirm the believability of the generated decoys via a user study. Monitoring interactions with these decoys can serve as a complementary measure to detect data theft and thereby reduce the lack of contextual information.

Our overarching goal was to automate data analysis tasks in information security. To this end, we have empirically shown that we can use labels more efficiently, address log clustering tasks without labels, and create believable deceptive objects that could be used to gather additional contextual information. We believe that a similar strategy could be deployed for other data theft scenarios, with different types of objects. In the forensic use case, procedures similar to the ones demonstrated in this thesis could be applied to automate other tasks, for example malware analysis or the detection of phishing attacks. The proposed techniques could also be used for the analysis of other data types, such as spatial data.