May 25, 2020 · Yiling He · Reading time ~10 minutes

IJCAI-2019-Aidroid

Contents

Out-of-sample Node Representation Learning for Heterogeneous Graph in Real-time Android Malware Detection
- Proposed Method
- Experimental Results

Out-of-sample Node Representation Learning for Heterogeneous Graph in Real-time Android Malware Detection

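A joint project with Tencent Security Lab, deployed in a system named AiDroid. The first author is also the author of HinDroid (best applied paper at KDD 2017); the main improvement here is handling out-of-sample nodes.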

Proposed Method

Feature Extraction

Dynamic Behavior Extraction: extract the sequences of API calls in the application framework from runtime executions of Android apps to capture their behaviors.

“TigerEyeing” trojan: connecting to the C&C server in order to fetch the configuration information;

(StartActivity, checkConnect, getPhoneInfo, receiveMsg, sendMsg, finishActivity)

In fact, AiDroid itself does not use the sequence information; it is used only in the comparison experiments.

Relation-based Feature Extraction
- R1: the app-invoke-API relation
- R2: the app-exist-IMEI relation
  
  IMEI (International Mobile Equipment Identity)
- R3: the app-certify-signature relation
- R4: the app-associate-affiliation relation
  
  Inferred from the package name, e.g. "com.tencent.mobileqq" -> app: mobileqq, affiliation: tencent.com
- R5: the IMEI-have-signature relation
- R6: the IMEI-possess-affiliation relation
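As a rough illustration of R1 to R6, the sketch below turns per-app records into typed edge sets. It is only a sketch under assumptions: the record fields (`package`, `apis`, `imeis`, `signature`) and both function names are hypothetical, the affiliation rule simply reverses the first two labels of the package name as in the example above, and deriving R5 and R6 from co-occurrence with an app is a guess, not something the post states.

```python
from collections import defaultdict

def split_package_name(package):
    """Guess (app, affiliation) from a package name.

    Assumption: the first two dot-separated labels are a reversed domain,
    and the last label is the app name.
    """
    parts = package.split(".")
    if len(parts) < 3:
        return package, None  # too short to infer an affiliation
    return parts[-1], f"{parts[1]}.{parts[0]}"

def build_relations(records):
    """Collect edges for R1..R6 from hypothetical per-app records."""
    relations = defaultdict(set)
    for rec in records:
        app = rec["package"]
        _, affiliation = split_package_name(app)
        for api in rec["apis"]:
            relations["R1"].add((app, api))               # app-invoke-API
        relations["R3"].add((app, rec["signature"]))      # app-certify-signature
        if affiliation:
            relations["R4"].add((app, affiliation))       # app-associate-affiliation
        for imei in rec["imeis"]:
            relations["R2"].add((app, imei))              # app-exist-IMEI
            relations["R5"].add((imei, rec["signature"])) # IMEI-have-signature (assumed via co-occurrence)
            if affiliation:
                relations["R6"].add((imei, affiliation))  # IMEI-possess-affiliation (assumed via co-occurrence)
    return relations

records = [{
    "package": "com.tencent.mobileqq",
    "apis": ["StartActivity", "sendMsg"],
    "imeis": ["imei-001"],
    "signature": "sig-A",
}]
relations = build_relations(records)
```

Each edge set can then be converted into an adjacency matrix for the corresponding relation in the heterogeneous graph.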

March 29, 2020 · Yiling He · Reading time ~7 minutes

S&P-2019-Asm2Vec

Contents

Paper Summary
The PV-DM Model
Asm2Vec
- CFG serialization
- The Asm2Vec model
My Review

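Paper Summary

Title: Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization

Main task: binary clone detection that is robust to code obfuscation and compiler optimization.

Main method: unsupervised assembly code representation learning. The model learns a vector representation for each function in the assembly code and, as a by-product, for each token.

It follows the PV-DM model from NLP (Doc2Vec), which learns document representations from tokens.
Difference: a document is sequentially laid out, whereas assembly code can be viewed as a graph and has a specific syntax.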

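The PV-DM Model

Paper: Distributed Representations of Sentences and Documents

The Distributed Memory version of Paragraph Vectors (PV-DM) jointly learns vector representations for each word and each paragraph.

It differs from the CBOW model (Word2Vec) in that the input includes one extra vector, the paragraph vector.

Given a corpus \(T\) made up of paragraphs, and a paragraph \(p\) made up of sentences \(s\), a sliding window of size \(2k+1\) is run over each sentence \(s\), advancing one word \(w\) per step. At each step, as shown in the figure below:

The middle word is the target and the words on either side are the context; predicting the target is a multi-class classification task.
The paragraph and the words are each mapped, through their IDs, to a corresponding vector.
These vectors are summed and averaged (then passed through a sigmoid activation), and a multi-class classifier such as softmax predicts the target word.
The classification error is back-propagated to update the word and paragraph vectors.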
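The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the names `P`, `W`, `U`, `windows` and `pvdm_step` are made up here, the sigmoid follows the description in this post, and a full softmax is used for simplicity where practical implementations typically use hierarchical softmax or negative sampling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def windows(sentence, k):
    """Yield (context_ids, target_id) for every window of size 2k+1."""
    for i in range(k, len(sentence) - k):
        yield sentence[i - k:i] + sentence[i + 1:i + k + 1], sentence[i]

def pvdm_step(P, W, U, par_id, context_ids, target_id, lr):
    """One SGD step; returns the cross-entropy loss before the update."""
    n = 1 + len(context_ids)
    # Look up the paragraph and context vectors, average, apply sigmoid.
    h = sigmoid((P[par_id] + W[context_ids].sum(axis=0)) / n)
    probs = softmax(U @ h)                 # multi-class prediction of the target
    loss = -np.log(probs[target_id])
    d_logits = probs.copy()
    d_logits[target_id] -= 1.0             # gradient of the loss w.r.t. the logits
    d_in = (U.T @ d_logits) * h * (1.0 - h) / n  # gradient w.r.t. each input vector
    U -= lr * np.outer(d_logits, h)
    P[par_id] -= lr * d_in
    np.subtract.at(W, context_ids, lr * d_in)    # safe if a word id repeats
    return loss

rng = np.random.default_rng(0)
vocab_size, n_paragraphs, dim, k = 5, 1, 8, 1
P = rng.normal(scale=0.1, size=(n_paragraphs, dim))  # paragraph vectors
W = rng.normal(scale=0.1, size=(vocab_size, dim))    # word vectors
U = rng.normal(scale=0.1, size=(vocab_size, dim))    # classifier weights
sentence = [0, 1, 2, 3, 4]                           # toy sentence of word ids

epoch_losses = []
for _ in range(200):
    epoch_losses.append(sum(
        pvdm_step(P, W, U, 0, ctx, tgt, lr=0.5) for ctx, tgt in windows(sentence, k)
    ))
```

On this toy sentence the summed loss per epoch should fall as training proceeds, since both the word vectors and the single paragraph vector receive gradient at every window.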

March 25, 2020 · Yiling He · Reading time ~3 minutes

NDSS-2020-DeepBinDiff

Contents

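Binary Diffing
- Problem Definition
- Approach
Binary Preprocessing
- Generating the program's ICFG
- Generating feature vectors for basic blocks
Basic Block Embeddings
Code Diffing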

Binary Diffing

Problem Definition

Given two binary programs \(p_1=(B_1, E_1)\) and \(p_2=(B_2, E_2)\), find the optimal code-block matching that maximizes the similarity between \(p_1\) and \(p_2\):

\[SIM(p_1,p_2)=\underset{m_1,m_2,...,m_k\in{M(p_1,p_2)}}{\max}\sum_{i=1}^ksim(m_i)\]

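Approach

DeepBinDiff splits the problem into two subtasks:

Find a metric \(sim(m_i)\) that quantifies the similarity of two basic blocks
- Unsupervised learning to generate embeddings
Find the optimal matching \(M(p_1,p_2)\) between the two sets of basic blocks
- k-hop greedy matching algorithm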
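To make the two subtasks concrete, here is a simplified sketch. It assumes cosine similarity over block embeddings as \(sim(m_i)\) and uses a plain global greedy one-to-one matching as a stand-in; it does not restrict candidates to k-hop neighbours in the ICFG, so it is not the k-hop algorithm itself. The function names and the `threshold` parameter are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_match(emb1, emb2, threshold=0.6):
    """Greedily pair blocks of p1 and p2 by descending similarity.

    Returns a list of (similarity, i, j) with each block used at most once.
    """
    candidates = sorted(
        ((cosine(u, v), i, j) for i, u in enumerate(emb1) for j, v in enumerate(emb2)),
        reverse=True,
    )
    used1, used2, matches = set(), set(), []
    for s, i, j in candidates:
        if s < threshold:
            break              # remaining pairs are even less similar
        if i in used1 or j in used2:
            continue           # keep the matching one-to-one
        used1.add(i)
        used2.add(j)
        matches.append((s, i, j))
    return matches

def total_similarity(matches):
    """The objective: sum of sim(m_i) over the chosen matching."""
    return sum(s for s, _, _ in matches)

# Toy embeddings: two blocks in p1, three blocks in p2.
emb1 = np.array([[1.0, 0.0], [0.0, 1.0]])
emb2 = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
matches = greedy_match(emb1, emb2)
score = total_similarity(matches)
```

Greedy selection does not guarantee the maximum of the objective in general; it trades optimality for speed, which is one reason a neighbourhood restriction such as k-hop is useful in practice.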

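Assumptions. The input binaries are assumed to be:

Stripped (symbol table and debug information removed), with no source code and no symbol information such as function names. Commercial off-the-shelf (COTS) software is usually stripped, and malware also tends to carry no symbol information.
Not packed, though they may have been transformed by different compiler optimizations.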
