An Efficient Approach for Automating Threat Intelligence Analysis through Similarity Detection
Manually analyzing threats at scale requires considerable resources, yet the work is often repetitive and tedious for cybersecurity experts because many threats are variations of past ones. Furthermore, because analyzing threat intelligence (factors such as threat type, threat actor, and technique IDs) demands time and expertise, processing a large volume of threats is challenging in practice.
To alleviate these challenges, we propose the Deep Binary Profiler (DBP). The DBP segments assembly code into multiple functions and then transforms these functions into vectors using an embedding model. By calculating the similarity between each pair of functions, it is possible to trace how newly emerging threats have evolved from past threats and how certain functionalities have been reused in these evolutions. As a result, new threats can inherit threat intelligence from past threats, enabling the automatic analysis of these new threats.
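The matching step described above can be illustrated with a minimal sketch. It assumes the embedding model has already turned each function into a plain list of floats, and it uses cosine similarity with a fixed threshold; the abstract names neither the similarity measure nor a threshold, so both are assumptions, as are the function names (`sub_401000`), the helper `inherit_intelligence`, and the intelligence records, which are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def inherit_intelligence(new_functions, known_functions, threshold=0.9):
    """Match each new function to its most similar known function.

    new_functions:   dict of function name -> embedding vector
    known_functions: list of (vector, intelligence dict) pairs
    Returns a dict of function name -> (intelligence, similarity) for
    matches whose similarity is at or above `threshold` (assumed value).
    """
    inherited = {}
    for name, vec in new_functions.items():
        best_score, best_intel = -1.0, None
        for known_vec, intel in known_functions:
            score = cosine_similarity(vec, known_vec)
            if score > best_score:
                best_score, best_intel = score, intel
        if best_intel is not None and best_score >= threshold:
            inherited[name] = (best_intel, best_score)
    return inherited

# Hypothetical data: two previously analyzed functions and one new sample.
known = [
    ([1.0, 0.0, 0.2], {"threat_type": "ransomware", "technique_id": "T1486"}),
    ([0.0, 1.0, 0.0], {"threat_type": "backdoor", "technique_id": "T1071"}),
]
new_sample = {
    "sub_401000": [0.9, 0.1, 0.2],  # close to the first known function
    "sub_402000": [0.3, 0.3, 0.9],  # unlike anything seen before
}
matches = inherit_intelligence(new_sample, known)
```

In this toy run only `sub_401000` inherits a record; `sub_402000` stays unmatched and would be left for an expert to analyze. Note that the inner loop compares every new function against every stored one, so the cost grows with the size of the database.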
As the volume of threat intelligence stored in the database grows, the time required for similarity searches grows in proportion, and the volume keeps increasing as experts analyze new threats. To mitigate this, we quantized the stored function vectors to produce representative code vectors, which are then converted into strings termed Function Hashes (FHash). Because similar functions yield similar vector representations, they are quantized to the same code vector and therefore produce identical FHash strings. By comparing only functions with matching FHash values, we efficiently reduced the search space for similarity searches.
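A minimal sketch of the bucketing idea follows. The abstract does not specify the quantizer, so this example swaps in a simple uniform grid (each component rounded to a multiple of `step`) in place of whatever quantization DBP actually uses; `fhash`, `FHashIndex`, the step size, and the sample vectors are all hypothetical.

```python
from collections import defaultdict

def fhash(vector, step=0.25):
    """Quantize each component to a grid of width `step` and
    serialize the resulting integer codes as a string."""
    codes = [round(x / step) for x in vector]
    return "_".join(str(c) for c in codes)

class FHashIndex:
    """Groups stored function vectors by FHash so that a query is
    compared only against functions in the same bucket."""

    def __init__(self, step=0.25):
        self.step = step
        self.buckets = defaultdict(list)

    def add(self, name, vector):
        self.buckets[fhash(vector, self.step)].append((name, vector))

    def candidates(self, vector):
        # .get avoids creating an empty bucket for unseen hashes
        return self.buckets.get(fhash(vector, self.step), [])

index = FHashIndex()
index.add("known_a", [0.52, 0.98, 0.01])
index.add("known_b", [0.49, 1.02, -0.03])  # near known_a, same bucket
index.add("known_c", [-0.90, 0.10, 0.60])  # far away, different bucket

query = [0.50, 1.00, 0.00]
shortlist = [name for name, _ in index.candidates(query)]
```

Here the query is compared against two stored functions instead of three. A grid quantizer like this one can place two nearby vectors in different buckets when they straddle a cell boundary, which is one reason a learned codebook of representative vectors may be preferable in practice.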
In our experiments, we validated the proposed method using 2,803,509 assembly functions derived from 11,613 malware samples. Among the entire set of functions, 531,892 unique FHashes were extracted, representing approximately 19.0% of the total. This approach enabled a reduction of the overall search space by nearly 99.7%, thus enhancing the performance of similarity searches. Consequently, the DBP technique identifies threat variants in real time, allowing experts to concentrate their resources on analyzing novel threats.
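The share of unique FHashes and the search-space reduction are different quantities, and the short sketch below shows one plausible way to relate them. The abstract does not state how the reduction figure was measured; the assumption here is that a query is compared only against stored functions sharing its FHash, and that queries are distributed like the stored functions. The helper names and the toy bucket sizes are hypothetical.

```python
from collections import Counter

def unique_ratio(fhashes):
    """Fraction of FHash values that are distinct."""
    return len(set(fhashes)) / len(fhashes)

def search_space_reduction(fhashes):
    """Expected fraction of comparisons avoided when a query is
    compared only to its own bucket instead of to every function."""
    total = len(fhashes)
    bucket_sizes = Counter(fhashes)
    # A query lands in a bucket of size n with probability n / total
    # and is then compared against n functions.
    expected_candidates = sum(n * n for n in bucket_sizes.values()) / total
    return 1.0 - expected_candidates / total

# Toy distribution: 10 functions spread over 4 buckets.
toy = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
toy_unique = unique_ratio(toy)               # 4 distinct out of 10
toy_reduction = search_space_reduction(toy)  # 1 - 3.0 / 10

# Ratio computed from the counts reported in the abstract.
reported_ratio = 531_892 / 2_803_509
```

Under these assumptions the reduction depends on how evenly functions spread across buckets, not only on how many distinct FHashes exist, which is why a modest share of unique hashes can coexist with a much larger reduction in comparisons.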
Mr. Hyunjong Lee
Hyunjong Lee is an AI researcher at SANDSLab in South Korea, with a primary focus on applying AI/ML techniques to the field of cybersecurity. He earned his M.S. degree from Dankook University, South Korea, and has four years of experience working as an AI researcher. His research interests center around Representation Learning.
Mr. Chang-Gyun Kim
Chang-Gyun Kim is an assistant research engineer at KSign’s Security Technical Research Institute. He holds a Master’s degree in computing, specializing in artificial intelligence and machine learning, from Imperial College London. His current research interests include using artificial intelligence for malware classification and threat intelligence analysis.