John Musgrave, Alina Campan, Temesguen Messay-Kebede and David Kapp
Adv. Artif. Intell. Mach. Learn., 4 (1):2052-2076
John Musgrave : University of Cincinnati
Alina Campan : Northern Kentucky University
Temesguen Messay-Kebede : Air Force Research Lab, Wright-Patterson Air Force Base
David Kapp : Air Force Research Lab, Wright-Patterson Air Force Base
DOI: https://dx.doi.org/10.54364/AAIML.2024.41117
Article History: Received on: 20-Jan-24, Accepted on: 23-Feb-24, Published on: 20-Mar-24
Corresponding Author: John Musgrave
Email: musgrajw@mail.uc.edu
Citation: John Musgrave, Alina Campan, Temesguen Messay-Kebede, David Kapp, Boyang Wang (2024). Search and Retrieval in Semantic-Structural Representations of Novel Malware. Adv. Artif. Intell. Mach. Learn., 4 (1):2052-2076
In this study, we present a novel representation for binary programs that captures both semantic similarity and structural properties. The representation is built bottom-up and enables new methods of analysis. We show that binary executable programs can be searched and retrieved by similarity of behavioral properties, with an adjustable level of feature resolution. We begin by extracting data dependency graphs (DDGs), which represent both program structure and operational semantics. We then encode each program as a set of graph hashes that represent uniqueness up to isomorphism, a method we call DDG Fingerprinting. Next, we use k-Nearest Neighbors to search a metric space constructed from examples. This approach allows a quantitative analysis of patterns of program operation. By evaluating similarity of behavior, we are able to recognize patterns in novel malware whose functionality has not previously been identified. We present experimental results for search based on program semantics and structural properties, using a dataset of binary executables whose features were extracted with our representation. We show that the associated metric space allows an adjustable level of resolution. Feature resolution may be decreased to broaden search and retrieval or, as the search space is reduced, increased for accuracy and fine-grained analysis of malware behavior.
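The pipeline described in the abstract (graphs, then a set of graph hashes per program, then nearest-neighbor search) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the hash is a simple Weisfeiler-Lehman-style refinement, the distance between fingerprints is Jaccard distance on hash sets, and the `iterations` parameter stands in for the adjustable feature resolution. All function names and the toy graphs are hypothetical, and DDG extraction from a binary is not shown.

```python
import hashlib
from collections import Counter


def _digest(text):
    """Short, stable hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def wl_graph_hash(nodes, edges, iterations=2):
    """Weisfeiler-Lehman-style hash of a small labeled directed graph.

    nodes: dict mapping node id -> label (e.g. an instruction mnemonic)
    edges: iterable of (src, dst) dependency pairs
    Isomorphic graphs always receive the same hash; the converse is not
    guaranteed. `iterations` is an assumed stand-in for feature resolution:
    0 compares label multisets only, larger values include more structure.
    """
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    labels = dict(nodes)
    for _ in range(iterations):
        refined = {}
        for n in nodes:
            outs = ",".join(sorted(labels[m] for m in succ[n]))
            ins = ",".join(sorted(labels[m] for m in pred[n]))
            refined[n] = _digest(f"{labels[n]}|out:{outs}|in:{ins}")
        labels = refined

    # Hash the multiset of final labels so node ids do not matter.
    counts = sorted(Counter(labels.values()).items())
    return _digest(",".join(f"{lab}:{cnt}" for lab, cnt in counts))


def fingerprint(graphs, iterations=2):
    """Encode a program as the set of hashes of its graphs."""
    return frozenset(wl_graph_hash(n, e, iterations) for n, e in graphs)


def jaccard_distance(a, b):
    """Jaccard distance between two hash sets (a metric on finite sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn(query, corpus, k=3):
    """Return the k (name, distance) pairs in corpus closest to query."""
    scored = [(name, jaccard_distance(query, fp)) for name, fp in corpus.items()]
    scored.sort(key=lambda item: (item[1], item[0]))
    return scored[:k]


# Toy graphs standing in for extracted DDGs (hypothetical data).
g_mov_add = ({0: "mov", 1: "add"}, [(0, 1)])
g_mov_xor = ({0: "mov", 1: "xor"}, [(0, 1)])
g_push_call = ({0: "push", 1: "call"}, [(0, 1)])

corpus = {
    "sample_a": fingerprint([g_mov_add, g_mov_xor]),
    "sample_b": fingerprint([g_mov_add]),
    "sample_c": fingerprint([g_push_call]),
}
neighbors = knn(fingerprint([g_mov_add, g_mov_xor]), corpus, k=2)
```

In this sketch, lowering `iterations` makes structurally different graphs with the same instruction labels collide, which broadens retrieval; raising it separates them, which suits a narrower, finer-grained search. Whether the paper adjusts resolution in this way is not stated in the abstract.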