Abstract: Machine learning-based Android malware classifiers achieve high accuracy in stationary environments but struggle with concept drift. The rapid evolution of malware, especially with new families, can depress classification accuracy to near-random levels. Previous research has largely centered on detecting drift samples, with expert-led label revisions on these samples to guide model retraining. However, these methods often lack a comprehensive understanding of malware concepts and provide limited guidance for effective drift adaptation, leading to unstable detection performance and high human labeling costs. To combat concept drift, we propose DREAM, a novel system that improves drift detection and establishes an explanatory adaptation process. Our core idea is to integrate classifier and expert knowledge within a unified model. To achieve this, we embed malware explanations (or concepts) within the latent space of a contrastive autoencoder, while constraining sample reconstruction based on classifier predictions. This approach enhances classifier retraining in two key ways: 1) capturing the target classifier’s characteristics to select more effective samples in drift detection and 2) enabling concept revisions that extend the classifier’s semantics to provide stronger guidance for adaptation. Additionally, DREAM eliminates reliance on training data during real-time drift detection and provides a behavior-based drift explainer to support concept revision. Our evaluation shows that DREAM effectively improves the drift detection accuracy and reduces the expert analysis effort in adaptation across different malware datasets and classifiers. Notably, when updating a widely-used Drebin classifier, DREAM achieves the same accuracy with 76.6% fewer newly labeled samples compared to the best existing methods.
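To make the core idea concrete, below is a minimal, hypothetical sketch (PyTorch; the layer sizes, loss terms, and weights are illustrative and not the paper's implementation) of an autoencoder whose latent space is aligned with explanation embeddings while reconstruction is constrained to preserve the target classifier's predictions.

```python
# Hypothetical sketch (not the authors' code): a contrastive autoencoder whose latent
# space is aligned with explanation/concept embeddings and whose reconstruction is
# constrained to keep the target classifier's prediction unchanged.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptAE(nn.Module):
    def __init__(self, n_features: int, n_latent: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(), nn.Linear(256, n_latent))
        self.dec = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(), nn.Linear(256, n_features))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def dream_style_loss(model, classifier, x, concept_emb, tau=0.1):
    """Illustrative objective: reconstruction + prediction consistency + contrastive concept alignment.
    concept_emb holds one explanation embedding per sample, with the same dimension as the latent code."""
    z, x_hat = model(x)
    recon = F.mse_loss(x_hat, x)                                           # reconstruct the sample
    pred_consist = F.mse_loss(classifier(x_hat), classifier(x).detach())   # keep classifier output stable
    sim = F.cosine_similarity(z.unsqueeze(1), concept_emb.unsqueeze(0), dim=-1) / tau
    contrast = F.cross_entropy(sim, torch.arange(len(x)))                  # pull each code to its own concept
    return recon + pred_consist + contrast
```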
On Benchmarking Code LLMs for Android Malware Analysis
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in various code intelligence tasks. However, their effectiveness for Android malware analysis remains underexplored. Decompiled Android malware code presents unique challenges for analysis, due to the malicious logic being buried within a large number of functions and the frequent lack of meaningful function names. This paper presents Cama, a benchmarking framework designed to systematically evaluate the effectiveness of Code LLMs in Android malware analysis. Cama specifies structured model outputs to support key malware analysis tasks, including malicious function identification and malware purpose summarization. Built on these, it integrates three domain-specific evaluation metrics—consistency, fidelity, and semantic relevance—enabling rigorous stability and effectiveness assessment and cross-model comparison. We construct a benchmark dataset of 118 Android malware samples from 13 families collected in recent years, encompassing over 7.5 million distinct functions, and use Cama to evaluate four popular open-source Code LLMs. Our experiments provide insights into how Code LLMs interpret decompiled code and quantify the sensitivity to function renaming, highlighting both their potential and current limitations in malware analysis.
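As an illustration of the kind of metric Cama computes, here is a toy consistency measure (our simplification, with illustrative names rather than Cama's actual API): the mean pairwise overlap of the functions an LLM flags as malicious across repeated runs on the same sample.

```python
# Toy consistency metric: how stable is the set of functions an LLM flags as malicious
# across repeated runs on the same decompiled sample?
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def consistency(runs: list) -> float:
    """Mean pairwise Jaccard overlap of flagged function sets over repeated runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Example: three runs over the same sample
runs = [{"a.b.sendSms", "c.d.encrypt"}, {"a.b.sendSms"}, {"a.b.sendSms", "c.d.encrypt"}]
print(round(consistency(runs), 2))  # 0.67
```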
LAMD: Context-driven Android Malware Detection and Classification with LLMs
Xingzhi Qian,
Xinran Zheng,
Yiling He📧,
Shuo Yang,
and Lorenzo Cavallaro
In 2025 IEEE Security and Privacy Workshops (SPW)
2025
Abstract: The rapid growth of mobile applications has escalated Android malware threats. Although there are numerous detection methods, they often struggle with evolving attacks, dataset biases, and limited explainability. Large Language Models (LLMs) offer a promising alternative with their zero-shot inference and reasoning capabilities. However, applying LLMs to Android malware detection presents two key challenges: (1) the extensive support code in Android applications, often spanning thousands of classes, exceeds LLMs’ context limits and obscures malicious behavior within benign functionality; (2) the structural complexity and interdependencies of Android applications surpass LLMs’ sequence-based reasoning, fragmenting code analysis and hindering malicious intent inference. To address these challenges, we propose LAMD, a practical context-driven framework to enable LLM-based Android malware detection. LAMD integrates key context extraction to isolate security-critical code regions and construct program structures, then applies tier-wise code reasoning to analyze application behavior progressively, from low-level instructions to high-level semantics, providing final prediction and explanation. A well-designed factual consistency verification mechanism is equipped to mitigate LLM hallucinations from the first tier. Evaluation in real-world settings demonstrates LAMD’s effectiveness over conventional detectors, establishing a feasible basis for LLM-driven malware analysis in dynamic threat landscapes.
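A rough, hypothetical sketch of the tier-wise reasoning loop with a factual-consistency check follows; the llm() callable and prompts are placeholders, not the paper's implementation.

```python
# Rough sketch of tier-wise reasoning with a factual-consistency check; llm() is a
# placeholder callable (prompt -> text).
from typing import Callable, Dict

def tierwise_analysis(functions: Dict[str, str], llm: Callable[[str], str]) -> str:
    # Tier 1: summarize each security-critical function from its decompiled code.
    summaries = {name: llm(f"Summarize the behavior of this function:\n{code}")
                 for name, code in functions.items()}
    # Factual consistency verification: keep only summaries the model confirms against the code.
    verified = {}
    for name, summary in summaries.items():
        answer = llm(f"Does this summary match the code? Answer yes or no.\n"
                     f"Code:\n{functions[name]}\nSummary:\n{summary}")
        if answer.strip().lower().startswith("yes"):
            verified[name] = summary
    # Tier 2: reason over verified function summaries to reach an app-level verdict.
    joined = "\n".join(f"- {name}: {s}" for name, s in verified.items())
    return llm(f"Given these function behaviors, is the app malicious? Explain.\n{joined}")
```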
Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution
Shuo Shao,
Yiming Li,
Hongwei Yao,
Yiling He,
Zhan Qin,
and Kui Ren
In Network and Distributed System Security Symposium (NDSS)
2025
Abstract: Ownership verification is currently the most critical and widely adopted post-hoc method to safeguard model copyright. In general, model owners exploit it to identify whether a given suspicious third-party model is stolen from them by examining whether it has particular properties ‘inherited’ from their released models. Currently, backdoor-based model watermarks are the primary and cutting-edge methods to implant such properties in the released models. However, backdoor-based methods have two fatal drawbacks: harmfulness and ambiguity. The former indicates that they introduce maliciously controllable misclassification behaviors (i.e., backdoors) into the watermarked released models. The latter denotes that malicious users can easily pass the verification by finding other misclassified samples, leading to ownership ambiguity.
Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey
Abstract: Modern language models (LMs) have been successfully employed in source code generation and understanding, leading to a significant increase in research focused on learning-based code intelligence, such as automated bug repair and test case generation. Despite their great potential, language models for code intelligence (LM4Code) are susceptible to potential pitfalls, which hinder realistic performance and further impact their reliability and applicability in real-world deployment. Such challenges drive the need for a comprehensive understanding: not just identifying these issues but delving into their possible implications and existing solutions to build more reliable language models tailored to code intelligence. Based on a well-defined systematic research approach, we conducted an extensive literature review to uncover the pitfalls inherent in LM4Code, identifying 121 primary studies from top-tier venues. After carefully examining these studies, we designed a taxonomy of pitfalls in LM4Code research and conducted a systematic study to summarize the issues, current solutions, implications, and challenges of different pitfalls for LM4Code systems. We developed a comprehensive classification scheme that dissects pitfalls across four crucial aspects: data collection and labeling, system design and learning, performance evaluation, and deployment and maintenance. Through this study, we aim to provide a roadmap for researchers and practitioners, facilitating their understanding and utilization of LM4Code in reliable and trustworthy ways.
Distilling Benign Knowledge with Fine-Grained AST Fragments for Precise Real-World Web Shell Detection
Mingzhe Gao,
Ligeng Chen,
Yiling He📧,
Yuhang Chen,
Lingyun Ying,
and Wang Yang
In 2025 IEEE/ACM 33rd International Symposium on Quality of Service (IWQoS)
2025
Abstract: Web shell detection has become increasingly crucial with the expansion of cloud computing, where automated malware analysis serves as a foundational approach. A key challenge in malware detection lies in balancing the reduction of false positives with maintaining detection accuracy amid rapid software ecosystem evolution. Existing methods require substantial expert intervention to mitigate false positives and often neglect the resource-intensive measures required to address model degradation caused by software updates. This study introduces ASTBAR, a novel method that extracts fine-grained AST fragments to distill benign behavioral knowledge from webserver software. By leveraging program structure and semantic analysis, ASTBAR generates fragment-level representations of benign samples and employs fragment matching to identify malware. Unlike prior techniques, ASTBAR achieves simultaneous improvements in precision, recall, and adaptability to software evolution. The evaluation results demonstrate that ASTBAR achieves an F1 score of 65.35%, outperforming the state-of-the-art methods by 10.39%. In a 12-month industrial deployment spanning over one million users, ASTBAR maintained a 97.63% recall rate while reducing false positives by 700+ cases daily (equivalent to 30 expert hours).
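A toy illustration of fragment-level benign knowledge, shown on Python ASTs for brevity (ASTBAR targets web shell languages, and its fragment definition, normalization, and matching are richer than this placeholder):

```python
# Toy illustration: hash statement-level AST subtrees as coarse fragments, distill a
# "benign knowledge" set from benign code, and flag fragments with no benign match.
import ast
import hashlib

def fragments(src: str) -> set:
    """Hash each statement-level subtree as a coarse AST fragment."""
    frags = set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.stmt):
            dumped = ast.dump(node, annotate_fields=False)
            frags.add(hashlib.sha1(dumped.encode()).hexdigest())
    return frags

# "Benign knowledge" distilled from a (tiny) benign corpus.
benign_db = set().union(*(fragments(s) for s in ["x = 1\nprint(x)"]))
# A suspicious sample: fragments with no benign match are candidates for further analysis.
sample = "import os\nos.system(input())"
unmatched = fragments(sample) - benign_db
print(f"{len(unmatched)} fragments without a benign match")
```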
RetouchUAA: Unconstrained Adversarial Attack via Realistic Image Retouching
Abstract: Deep Neural Networks (DNNs) are susceptible to adversarial examples. Conventional attacks generate controlled noise-like perturbations that fail to reflect real-world scenarios and are hard to interpret. In contrast, recent unconstrained attacks mimic natural image transformations occurring in the real world to produce perceptible but inconspicuous attacks, yet they compromise realism by neglecting image post-processing and leaving the attack direction uncontrolled. In this paper, we propose RetouchUAA, an unconstrained attack that exploits a real-life perturbation: image retouching styles, highlighting its potential threat to DNNs. Compared to existing attacks, RetouchUAA offers several notable advantages. Firstly, RetouchUAA excels in generating interpretable and realistic perturbations through two key designs: the image retouching attack framework and the retouching style guidance module. The former is a human-interpretable retouching framework custom-designed for adversarial attacks: it linearizes images while modelling the local processing and decision-making in human retouching behaviour, providing an explicit and reasonable pipeline for understanding the robustness of DNNs against retouching. The latter guides the adversarial image towards standard retouching styles, thereby ensuring its realism. Secondly, owing to the retouching decision regularization and the persistent attack strategy, RetouchUAA also exhibits strong attack capability and robustness against defenses, posing a serious threat to DNNs. Experiments on ImageNet, Places365, and CUB200 reveal that RetouchUAA achieves nearly 100% white-box attack success against three DNNs, while achieving a better trade-off between image naturalness, transferability, and defense robustness than baseline attacks.
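A hypothetical sketch of the retouching-as-perturbation idea follows, using a toy differentiable brightness/contrast/gamma adjustment; the paper's retouching model and style guidance are far richer than this simplification.

```python
# Hypothetical sketch: optimize a few global retouching parameters so the retouched
# image fools a classifier (untargeted), with a mild regularizer to keep edits "natural".
import torch
import torch.nn.functional as F

def retouch(x, params):
    """Differentiable global retouching controlled by three scalar parameters."""
    brightness, contrast, log_gamma = params
    x = torch.clamp(x + brightness, 0.0, 1.0)
    x = torch.clamp((x - 0.5) * torch.exp(contrast) + 0.5, 0.0, 1.0)
    return x.clamp(min=1e-6) ** torch.exp(log_gamma)

def retouch_attack(model, x, true_label, steps=100, lr=0.05, reg=0.1):
    params = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        logits = model(retouch(x, params))
        # maximize classification loss + penalty that keeps the retouching mild
        loss = -F.cross_entropy(logits, true_label) + reg * params.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return retouch(x, params).detach()
```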
FINER: Enhancing State-of-the-art Classifiers with Feature Attribution to Facilitate Security Analysis
Abstract: Deep learning classifiers achieve state-of-the-art performance in various risk detection applications. They explore rich semantic representations and are supposed to automatically discover risk behaviors. However, due to the lack of transparency, the behavioral semantics cannot be conveyed to downstream security experts to reduce their heavy workload in security analysis. Although feature attribution (FA) methods can be used to explain deep learning, the underlying classifier is still blind to what behavior is suspicious, and the generated explanation cannot adapt to downstream tasks, incurring poor explanation fidelity and intelligibility. In this paper, we propose FINER, the first framework for risk detection classifiers to generate high-fidelity and high-intelligibility explanations. The high-level idea is to gather explanation efforts from the model developer, the FA designer, and security experts. To improve fidelity, we fine-tune the classifier with an explanation-guided multi-task learning strategy. To improve intelligibility, we engage task knowledge to adjust and ensemble FA methods. Extensive evaluations show that FINER improves explanation quality for risk detection. Moreover, we demonstrate that FINER outperforms a state-of-the-art tool in facilitating malware analysis.
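A simplified sketch of fidelity-weighted ensembling of feature-attribution (FA) maps is shown below; the weighting scheme, method names, and scores are illustrative stand-ins rather than FINER's actual components.

```python
# Simplified sketch: normalize each FA map and combine them weighted by a per-method
# fidelity score, yielding one ensembled attribution map.
import numpy as np

def ensemble_attribution(attributions: dict, fidelity: dict) -> np.ndarray:
    """Combine FA maps, each weighted by its fidelity score."""
    total = sum(fidelity.values())
    out = np.zeros_like(next(iter(attributions.values())), dtype=float)
    for name, attr in attributions.items():
        a = np.abs(attr)
        a = a / (a.max() + 1e-12)              # bring maps onto a comparable scale
        out += (fidelity[name] / total) * a    # fidelity-weighted combination
    return out

maps = {"ig": np.random.rand(100), "lime": np.random.rand(100), "shap": np.random.rand(100)}
scores = {"ig": 0.8, "lime": 0.5, "shap": 0.7}
print(ensemble_attribution(maps, scores).shape)  # (100,)
```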
DeUEDroid: Detecting Underground Economy Apps Based on UTG Similarity
Zhuo Chen,
Jie Liu,
Yubo Hu,
Lei Wu,
Yajin Zhou,
Yiling He,
Xianhao Liao,
Ke Wang,
Jinku Li,
and Zhan Qin
In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA)
2023
Abstract: In recent years, the underground economy has proliferated in the mobile ecosystem. These underground economy apps (UEware for short) make profits from providing non-compliant services, especially in sensitive areas (e.g., gambling, porn, loan). Unlike traditional malware, most of them (over 80%) do not have malicious payloads. Due to their unique characteristics, existing detection approaches cannot effectively and efficiently mitigate this emerging threat. To address this problem, we propose a novel approach to effectively and efficiently detect UEware by considering their UI transition graphs (UTGs). Based on the proposed approach, we design and implement a system, named DeUEDroid, to perform the detection. To evaluate DeUEDroid, we collect 25,717 apps and build up the first large-scale ground-truth dataset (1,700 apps) of UEware. The evaluation result based on the ground-truth dataset shows that DeUEDroid can cover new UI features and statically construct precise UTGs. It achieves a 98.22% detection F1-score and 98.97% classification accuracy, a significantly better performance than the traditional approaches. The evaluation result involving 24,017 apps demonstrates the effectiveness and efficiency of UEware detection in real-world scenarios. Furthermore, the result also reveals that UEware are prevalent, i.e., 54% of apps in the wild and 11% of apps in the app stores are UEware. Our work sheds light on the future work of analyzing and detecting UEware. To engage the community, we have made our prototype system and the dataset available online.
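A toy, hypothetical sketch of comparing UTGs by the overlap of labeled UI transitions follows; the UI-state labels and the similarity measure are illustrative only, and DeUEDroid's actual features and classifier are richer.

```python
# Toy sketch: compare two UI transition graphs (UTGs) by the Jaccard overlap of their
# labeled (src, dst) transitions.
def transitions(utg: dict) -> set:
    """UTG as an adjacency dict of UI-state labels -> set of (src, dst) transition pairs."""
    return {(src, dst) for src, dsts in utg.items() for dst in dsts}

def utg_similarity(g1: dict, g2: dict) -> float:
    t1, t2 = transitions(g1), transitions(g2)
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 1.0

known_ueware = {"login": ["lobby"], "lobby": ["bet", "recharge"], "recharge": ["pay"]}
candidate    = {"login": ["lobby"], "lobby": ["bet"], "bet": ["pay"]}
print(round(utg_similarity(known_ueware, candidate), 2))  # 0.4
```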
MsDroid: Identifying Malicious Snippets for Android Malware Detection
Abstract: Machine learning has shown promise for improving the accuracy of Android malware detection in the literature. However, it is challenging to (1) stay robust towards real-world scenarios and (2) provide interpretable explanations for experts to analyse. In this article, we propose MsDroid, an Android malware detection system that makes decisions by identifying malicious snippets with interpretable explanations. We mimic a common practice of security analysts, i.e., filtering APIs before looking through each method, to focus on local snippets around sensitive APIs instead of the whole program. Each snippet is represented with a graph encoding both code attributes and domain knowledge and is then classified by a Graph Neural Network (GNN). The local perspective helps the GNN classifier concentrate on code highly correlated with malicious behaviors, and the information contained in the graphs supports a better understanding of those behaviors. Hence, MsDroid is more robust and interpretable in nature. To identify malicious snippets, we present a semi-supervised learning approach that only requires app-level labeling. The key insight is that malicious snippets only exist in malware and appear at least once in each malware sample. To make malicious snippets less opaque, we design an explanation mechanism to show the importance of control flows and to retrieve similarly implemented snippets from known malware. A comprehensive comparison with 5 baseline methods is conducted on a dataset of more than 81K apps in 3 real-world scenarios, including zero-day, evolution, and obfuscation. The experimental results show that MsDroid is more robust than state-of-the-art systems in all cases, with a 5.37% to 49.52% advantage in F1-score. Besides, we demonstrate that the provided explanations are effective and illustrate how the explanations facilitate malware analysis.
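A hypothetical illustration of the "local snippet" idea is sketched below, using a networkx ego graph around a sensitive API in a call graph; MsDroid's real snippets are attributed code graphs that are then classified by a GNN.

```python
# Hypothetical illustration: take the k-hop call-graph neighborhood around each
# sensitive API call site as a "snippet" (networkx ego graph).
import networkx as nx

SENSITIVE_APIS = {"SmsManager.sendTextMessage", "TelephonyManager.getDeviceId"}

def extract_snippets(call_graph: nx.DiGraph, radius: int = 2):
    """Return one subgraph per sensitive-API node, limited to its k-hop neighborhood."""
    return [nx.ego_graph(call_graph, node, radius=radius, undirected=True)
            for node in call_graph.nodes if node in SENSITIVE_APIS]

g = nx.DiGraph([("onCreate", "leakImei"),
                ("leakImei", "TelephonyManager.getDeviceId"),
                ("leakImei", "HttpURLConnection.connect")])
for snippet in extract_snippets(g):
    print(sorted(snippet.nodes))
```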