Malware Detection by Eating a Whole EXE

1. Introduction

[ENG] MalConv is motivated by the fundamental limitations of traditional signature-based malware detection and by the practical difficulty of dynamic analysis. Current anti-virus technologies rely on manually crafted rules that are specific to particular malware families and cannot recognize new variants, which makes them increasingly ineffective against the millions of new malware samples discovered daily. Dynamic analysis is intuitive but costly: it has high computational requirements, sophisticated malware can detect the analysis environment, and the analysis environment can differ from the target environment.

The core innovation is to process the raw byte sequence of an entire executable without domain knowledge or feature engineering. This raises challenges not encountered in traditional machine learning domains: bytes are multi-modal (the same file mixes text, code, and images), spatial correlations are broken by discontinuities, sequences exceed two million time steps, and concept drift occurs at multiple levels over time.
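As a minimal sketch of what "raw bytes in, no feature engineering" means in practice, the snippet below turns file contents into a fixed-length sequence of integer tokens. The helper name `bytes_to_sequence`, the padding id 256, and the truncate-then-pad policy are assumptions made for illustration; the review does not say how files shorter or longer than the limit are handled.

```python
PAD_ID = 256  # assumed extra token id for padding; raw byte values occupy 0..255


def bytes_to_sequence(data: bytes, max_len: int = 2_000_000) -> list:
    """Turn raw file contents into a fixed-length list of integer tokens."""
    seq = list(data[:max_len])                   # truncate files longer than max_len
    seq.extend([PAD_ID] * (max_len - len(seq)))  # right-pad files shorter than max_len
    return seq


# Usage with a tiny max_len so the result is easy to inspect.
example = bytes_to_sequence(b"MZ\x90", max_len=5)
print(example)  # [77, 90, 144, 256, 256]
```

No parsing of the PE format happens here: the first two tokens are simply the byte values of the characters "M" and "Z".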

2. Related Work

2.1 Neural Networks for Long Sequences

[ENG] Applying neural networks to extremely long sequences is a significant computational challenge, and MalConv addresses it at an unprecedented scale. Prior work in this area includes WaveNet, which processes audio at up to 16,000 time steps per second, still two orders of magnitude fewer than the sequence lengths MalConv handles. ByteNet and similar architectures for machine translation also operate on much shorter sequences than the malware detection problem requires.

Dilated convolutions, popularized by WaveNet and ByteNet as a way to obtain wide receptive fields, were explored but found ineffective for binary data. Unlike spatially consistent domains such as images and audio, the values in the "holes" of a dilated convolution are not easily interpolated for binary content, which leads to poor performance. RNN-based approaches run into memory and computational complexity limits when dealing with sequences of this magnitude.
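To make the scale comparison concrete, here is a small worked calculation. It uses the standard receptive-field formula for stacked stride-1 dilated convolutions; the ten-layer, kernel-size-2 configuration is an illustrative assumption, not a setting taken from the review.

```python
def receptive_field(kernel_size: int, dilations: list) -> int:
    """Receptive field, in input steps, of stacked stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed example stack: kernel size 2, dilations doubling from 1 to 512.
stack_rf = receptive_field(2, [2 ** i for i in range(10)])
print(stack_rf)  # 1024

# Ratio between a two-million-step executable and one second of 16 kHz audio.
scale_ratio = 2_000_000 / 16_000
print(scale_ratio)  # 125.0
```

Even with exponentially growing dilations, one such stack sees about a thousand steps, while a single executable can be roughly 125 times longer than a second of audio.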

2.2 Neural Networks for Malware Detection

[ENG] Previous applications of neural networks to malware detection have relied heavily on domain knowledge and feature extraction, limiting their generalizability. Saxe and Berlin (2015) used histogram-based features including byte entropy values, ASCII string lengths, and PE metadata, discarding most information about actual binary content. Kolosnjaji et al. (2016) applied LSTM networks to API call sequences from dynamic analysis, focusing on only 60 kernel API calls.
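For contrast with raw-byte input, the following sketch computes two summary features of the kind described above: a byte histogram and a whole-file byte entropy. It is a simplified illustration, not the feature pipeline of Saxe and Berlin (2015); the function names are hypothetical.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Count how often each of the 256 byte values occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    n = len(data)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in byte_histogram(data) if c)


print(byte_entropy(bytes(range(256))))  # every value once: maximal entropy
print(byte_entropy(b"\x00" * 10))       # a single repeated value: zero entropy
```

Both features discard byte order entirely, which is the information loss the paragraph above points to.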

The work most closely related to MalConv in terms of feature representation is the PE-header network by Raff et al. (2017), which achieved high accuracy using only 300 bytes from PE headers. However, this approach still requires domain knowledge about executable file structure and cannot process the entire binary content. Most existing work relies on sophisticated emulation environments and manual feature engineering, creating barriers to reproduction and extension.

3. Training data

[ENG] The data for MalConv comes from two primary groups, collected in different ways so that evaluation is robust. Group B, provided by an anti-virus industry partner, represents files encountered on real machines: its training set contains 400,000 files split evenly between benign and malicious classes, and its test set contains 77,349 files, of which 40,000 are malicious and the remaining 37,349 are benign.

Group A data follows the conventional academic approach, with benign samples from clean Microsoft Windows installations and common applications, while malware comes from the VirusShare corpus. The Group A test set contains 43,967 malicious and 21,854 benign files. A critical finding revealed that training on Group A-style data results in severe overfitting, with models learning to recognize "from Microsoft" rather than genuinely benign characteristics.

An extended dataset of 2,011,786 binaries (1,000,020 benign and 1,011,766 malicious) was later obtained to demonstrate MalConv's continued improvement with increased training data, while byte n-gram approaches showed performance plateauing.
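The counts quoted above can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones given in this section.

```python
# Counts quoted in the review.
group_b_test_total, group_b_test_malicious = 77_349, 40_000
group_b_test_benign = group_b_test_total - group_b_test_malicious

group_a_test_malicious, group_a_test_benign = 43_967, 21_854
group_a_test_total = group_a_test_malicious + group_a_test_benign

extended_benign, extended_malicious = 1_000_020, 1_011_766
extended_total = extended_benign + extended_malicious


def malicious_share(malicious: int, total: int) -> float:
    """Fraction of a dataset that is labeled malicious."""
    return malicious / total


print(group_b_test_benign)  # 37349
print(group_a_test_total)   # 65821
print(extended_total)       # 2011786
print(round(malicious_share(group_a_test_malicious, group_a_test_total), 3))
print(round(malicious_share(group_b_test_malicious, group_b_test_total), 3))
```

The Group A test set is about two-thirds malicious while the Group B test set is close to balanced, which is worth keeping in mind when comparing raw accuracy across the two.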

4. Model Architecture

[ENG] The MalConv architecture is designed with three key requirements: scalability with sequence length, ability to consider both local and global context across entire files, and explanatory capability for analysis of flagged malware. The model processes raw byte sequences through an embedding layer that maps each byte (0-255) to an 8-dimensional learned feature vector, avoiding the false assumption that certain byte values are intrinsically closer than others.

The core architecture employs a gated convolution with 128 filters, a large filter width of 500 bytes, and an equally aggressive stride of 500, so that consecutive windows do not overlap. This design choice keeps GPU memory use manageable while enabling efficient data-parallel training. The gating follows Dauphin et al. (2016): the output of one convolution is multiplied element-wise by the sigmoid of a second convolution, which controls information flow.
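The gated convolution can be sketched in plain Python on toy sizes. This is a readability-first illustration with random weights and hypothetical function names; a real model would use a deep learning framework with learned parameters and the sizes given above (8-dimensional embeddings, 128 filters, width and stride 500).

```python
import math
import random


def num_windows(seq_len: int, width: int, stride: int) -> int:
    """Number of output positions of a 1-D convolution without padding."""
    return (seq_len - width) // stride + 1


def sigmoid(x: float) -> float:
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def conv1d(x, weights, bias, width, stride):
    """x: [T][E]; weights: [F][width][E]; bias: [F]. Returns [F][num_windows]."""
    out = []
    for f, kernel in enumerate(weights):
        row = []
        for start in range(0, len(x) - width + 1, stride):
            acc = bias[f]
            for w in range(width):
                acc += sum(k * v for k, v in zip(kernel[w], x[start + w]))
            row.append(acc)
        out.append(row)
    return out


def gated_conv1d(x, w_a, b_a, w_g, b_g, width, stride):
    """Element-wise product of one convolution with the sigmoid of another."""
    a = conv1d(x, w_a, b_a, width, stride)
    g = conv1d(x, w_g, b_g, width, stride)
    return [[av * sigmoid(gv) for av, gv in zip(ar, gr)] for ar, gr in zip(a, g)]


random.seed(0)
T, E, F, WIDTH = 12, 2, 3, 4  # toy sizes chosen for illustration only


def rand_bank():
    """Random filter bank of shape [F][WIDTH][E]."""
    return [[[random.uniform(-0.5, 0.5) for _ in range(E)]
             for _ in range(WIDTH)] for _ in range(F)]


x = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(T)]
out = gated_conv1d(x, rand_bank(), [0.0] * F, rand_bank(), [0.0] * F, WIDTH, WIDTH)
print(len(out), len(out[0]))             # 3 3
print(num_windows(2_000_000, 500, 500))  # 4000
```

With width and stride both 500, a two-million-byte input collapses to 4,000 positions per filter after a single layer, which is where the memory saving comes from.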

Temporal max-pooling extracts the maximum activation across the entire sequence, allowing the model to detect informative features regardless of their location within the binary. This design addresses the high positional variation in executable files where contents can be arbitrarily rearranged while maintaining functionality. The final component is a fully connected layer with softmax activation for binary classification.
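The position invariance of temporal max-pooling can be seen in a small example. The feature values and the weights of the final layer below are made up for illustration, and the function names are hypothetical.

```python
import math


def temporal_max_pool(features):
    """features: [F][T] -> [F], keeping each filter's maximum over all positions."""
    return [max(row) for row in features]


def softmax(logits):
    """Softmax with the usual max subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, fc_w, fc_b):
    """Max-pool over time, apply one fully connected layer, then softmax."""
    pooled = temporal_max_pool(features)
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(fc_w, fc_b)]
    return softmax(logits)


# Two filters, three windows. The second input holds the same windows in a different order.
features = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4]]
reordered = [[0.3, 0.1, 0.9], [0.4, 0.5, 0.2]]
fc_w = [[1.0, -1.0], [-1.0, 1.0]]  # made-up weights for the two output classes
fc_b = [0.0, 0.0]

probs = classify(features, fc_w, fc_b)
probs_reordered = classify(reordered, fc_w, fc_b)
print(probs == probs_reordered)  # True
```

Because only the per-filter maximum survives pooling, moving a 500-byte window elsewhere in the file leaves the prediction unchanged in this sketch.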

4.1 On Failed Architectures

[ENG] Extensive experimentation with alternative architectures revealed several fundamental challenges in processing extremely long sequences for malware detection. Deep convolutional networks with up to 13 layers suffered from gradient vanishing problems and required rapid compression of state size per layer due to memory constraints, ultimately inhibiting learning. The standard approach of doubling convolutional filters after each pooling round becomes computationally intractable with 2 million time steps.

Chunking approaches, where files are divided into 500-10,000 byte segments for independent processing, achieved reasonable training accuracies up to 95% but failed to generalize with test accuracies dropping to 65-80%. This failure occurs because much of a binary's content may be non-informative for maliciousness decisions, and training on random chunks encourages overfitting to training data rather than learning discriminative features.

RNN-based architectures performed poorly when applied after convolutions, as reshaping temporal CNN outputs into fixed-sized chunks imposes an artificial prior that activation patterns must appear at consistent frequencies. The malware "image" approach, treating bytes as grayscale pixels with arbitrary width selection, introduces false spatial correlations and fails to handle variable file sizes meaningfully.
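A toy example shows why the "image" view depends on an arbitrary width. The helper names are hypothetical and the byte string is just a counter, so each value equals its own file offset.

```python
def to_rows(data: bytes, width: int) -> list:
    """Lay a byte string out as rows of a grayscale image of the given width."""
    return [list(data[i:i + width]) for i in range(0, len(data), width)]


def offset_below(offset: int, width: int) -> int:
    """File offset of the pixel directly below the given offset."""
    return offset + width


data = bytes(range(12))
print(to_rows(data, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(to_rows(data, 3))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

# The vertical neighbour of offset 1 changes with the chosen width.
print(offset_below(1, 4), offset_below(1, 3))  # 5 4

# A file whose length is not a multiple of the width leaves a ragged last row.
print(to_rows(bytes(range(10)), 4)[-1])  # [8, 9]
```

Vertical adjacency is purely an artifact of the width, so a 2-D convolution over such an image correlates bytes that have no particular relationship in the file.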

5. Results

5.1 Malware classification

[ENG] MalConv performs strongly across multiple metrics and test sets, with comparable accuracy on the Group A and Group B evaluations. The model achieves 88.1% accuracy on Group A and 89.6% on Group B, with AUC scores of 98.5% and 95.8% respectively. Notably, MalConv shows the smallest performance difference between the two test groups among the compared models, which indicates robust features that generalize across different data distributions.
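Accuracy and AUC measure different things, which the following sketch illustrates on made-up scores. AUC is computed here with the pairwise-ranking definition (the probability that a random malicious sample scores higher than a random benign one); the function names and the data are illustrative.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a given decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a positive outranks a negative; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up model outputs

print(auc(labels, scores))                      # 0.75
print(accuracy(labels, scores, threshold=0.5))  # 0.75
print(accuracy(labels, scores, threshold=0.9))  # 0.5
```

AUC depends only on how samples are ranked, while accuracy also depends on where the threshold is placed, so a model can have a high AUC and still lose accuracy to a poorly calibrated threshold.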

Applying DeCov regularization improves model accuracy by up to 4.8 percentage points, primarily through better calibration of the decision threshold rather than a fundamental change in the learned concept. When trained on the extended dataset of roughly 2 million samples, MalConv's performance increases substantially: Group A accuracy improves to 94.0% and Group B to 90.9%, while Group B AUC reaches 98.2%.
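For readers unfamiliar with DeCov, the penalty as usually defined (Cogswell et al., 2016) is half the sum of squared off-diagonal entries of the covariance matrix of a layer's activations over a batch. The sketch below computes it in plain Python; which layer it is applied to and how it is weighted in MalConv are not stated in this review, so treat those as open details.

```python
def decov_loss(batch):
    """batch: [N][D] activations. Returns 0.5 * sum over i != j of C_ij squared."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue  # diagonal (variance) terms are not penalized
            c_ij = sum((row[i] - means[i]) * (row[j] - means[j])
                       for row in batch) / n
            loss += c_ij ** 2
    return 0.5 * loss


correlated = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]                 # two identical features
uncorrelated = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(decov_loss(correlated))    # positive: the features are redundant
print(decov_loss(uncorrelated))  # 0.0
```

The penalty pushes hidden units toward carrying non-redundant information, without constraining how large each individual activation is.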

Comparative analysis reveals that byte n-gram models, while achieving high Group B performance (92.5% accuracy, 97.9% AUC), show significant performance gaps between test groups and demonstrate brittleness to single-byte modifications. The PE-header network achieves slightly higher Group A accuracy (90.8%) but significantly reduced Group B performance, indicating limited feature diversity.

5.2 Manual Analysis

[ENG] MalConv's interpretability comes from a sparse Class Activation Map (sparse-CAM) approach that adapts the methodology of Zhou et al. (2016) to the extreme sequence lengths encountered in malware detection. Because the model uses global max-pooling instead of average pooling, the activation maps are naturally sparse: each convolutional filter contributes one 500-byte region, so at most 128 regions per binary are reported as important for the classification decision.
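The region-recovery step can be sketched as follows: for each filter, find the window that won the max-pool and convert its index back to a byte range. This covers only the localization part; attributing each region to the benign or malicious class is omitted. The function name and activation values are illustrative.

```python
WINDOW = 500  # filter width and stride described in the review


def top_regions(activations, window=WINDOW):
    """activations: [F][T]. Returns one (filter, start_byte, end_byte) per filter."""
    regions = []
    for f, row in enumerate(activations):
        t = max(range(len(row)), key=row.__getitem__)  # index of the max-pooled window
        regions.append((f, t * window, (t + 1) * window))
    return regions


# Made-up activations for two filters over three windows.
activations = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4]]
regions = top_regions(activations)
print(regions)  # [(0, 500, 1000), (1, 0, 500)]
```

With one region per filter, the number of reported regions can never exceed the number of filters, which matches the limit of 128 per binary.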

Analysis of 224 randomly selected binaries from Group A test set reveals that MalConv utilizes significantly more diverse information sources compared to byte n-gram approaches. While previous byte n-gram models obtained almost all information from PE-headers, MalConv derives only 58-61% of its information from this region, indicating broader feature utilization across different binary sections.

The model's activations are spread across the structure of the binary: .rsrc sections account for 16% and .text and CODE sections for 14%, which indicates that resource directories and executable code are both used as features. Notably, the model shows balanced activation patterns for UPX1 sections (which indicate packed executables) in both benign and malicious samples, suggesting it has learned to avoid the common but unhelpful association between packing and maliciousness.
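A per-section breakdown like the one above can be produced by mapping each selected region to the section that contains it. The sketch below assumes a hypothetical, hand-written section table and made-up region offsets; in practice the table would come from parsing the PE header.

```python
from collections import Counter


def section_of(offset, sections, header_end):
    """Name of the section containing a file offset, given (name, start, end) tuples."""
    if offset < header_end:
        return "PE-header"
    for name, start, end in sections:
        if start <= offset < end:
            return name
    return "other"


def section_shares(region_starts, sections, header_end):
    """Fraction of selected regions that fall into each section."""
    counts = Counter(section_of(o, sections, header_end) for o in region_starts)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical layout and region offsets, for illustration only.
header_end = 1024
sections = [(".text", 1024, 5000), (".rsrc", 5000, 8000)]
region_starts = [0, 500, 1500, 5500]

shares = section_shares(region_starts, sections, header_end)
print(shares)  # {'PE-header': 0.5, '.text': 0.25, '.rsrc': 0.25}
```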

5.3 The Failure of Batch-Normalization

[ENG] One of the most significant findings in MalConv research is the complete failure of batch normalization, a technique that typically accelerates convergence and improves generalization in deep learning. Models incorporating batch normalization consistently failed to learn, achieving at best 60% training accuracy and 50% test accuracy across multiple framework implementations including PyTorch, TensorFlow, Chainer, and Theano.

Kernel density estimation analysis reveals the root cause: binary executable data exhibits multi-modal activation distributions fundamentally different from the approximately Gaussian distributions assumed by batch normalization. While image processing networks show smooth, unimodal activation patterns suitable for normalization, MalConv activations display distinct multi-modal characteristics with multiple peaks.
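A toy calculation gives some intuition for why standardization does not help with multi-modal activations. It is not the kernel density analysis described above, just a sketch with made-up values: shifting and scaling a two-peaked sample leaves it two-peaked.

```python
import math


def standardize(values, eps=0.0):
    """Subtract the mean and divide by the standard deviation, as batch norm does per feature.

    eps is 0.0 here to keep the toy numbers exact; real implementations use a small positive value.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Made-up bimodal activations: half the batch near 0, half near 10.
bimodal = [0.0] * 50 + [10.0] * 50
normed = standardize(bimodal)

print(sorted(set(normed)))                     # [-1.0, 1.0]
print(sum(1 for v in normed if abs(v) < 0.5))  # 0
```

The normalized sample has zero mean and unit variance, yet no value lies near zero, the opposite of what a roughly Gaussian distribution would produce.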

This multi-modal nature stems from the diverse byte content within executables, where the same byte value can represent ASCII text, binary code, structured data, or embedded images depending on context. The violation of batch normalization's normality assumption leads to degraded performance, with the technique only functioning when trained on homogeneous sub-regions of 500-10,000 bytes, though these models still failed to generalize to test data.

6. Conclusion

[ENG] MalConv represents a paradigm shift in malware detection by demonstrating that neural networks can successfully learn to identify malicious software directly from raw byte sequences without domain knowledge. The model's ability to process entire PE files up to 2 million bytes establishes it as the first architecture capable of handling such extreme sequence lengths in cybersecurity applications.

The research contributions extend beyond malware detection to the broader machine learning community by identifying unique challenges in processing multi-modal sequential data and proposing effective solutions. The discovery of batch normalization's failure provides valuable insights for future work on non-standard data distributions, while the sparse-CAM interpretability approach offers a practical method for understanding model decisions on extremely long sequences.

Future research directions include developing architectures that better handle multi-modal data, improving memory efficiency for even longer sequences, and integrating semantic understanding of code structure. The model's demonstrated scalability with increased training data suggests continued performance improvements as larger datasets become available, positioning MalConv as a foundation for next-generation cybersecurity solutions.
