Malware Detection by Eating a Whole EXE

--

1. Introduction

[ENG] The introduction motivates MalConv by pointing to the fundamental limitations of signature-based malware detection and the complexity of dynamic analysis. Current anti-virus technologies rely on manually crafted rules that are specific to particular malware families and usually cannot recognize new variants, which makes them increasingly ineffective against the millions of new malware samples discovered daily. Dynamic analysis, while intuitive, brings significant challenges of its own: high computational requirements, the risk that sophisticated malware detects the analysis environment, and discrepancies between the analysis and target environments.

The core innovation lies in processing the raw byte sequence of an entire executable file without domain knowledge or feature engineering. This approach raises challenges not encountered in traditional machine learning domains: handling multi-modal byte information (text, code, images), managing spatial correlations with discontinuities, processing sequences exceeding two million time steps, and addressing multiple levels of concept drift over time.

--

2. Related work

2.1 Neural Networks for Long Sequences

[ENG] Applying neural networks to extremely long sequences is a significant computational challenge, and MalConv tackles it at an unprecedented scale. Prior work in this area includes WaveNet, which models raw audio at 16,000 time steps per second of signal; that is still roughly two orders of magnitude shorter than the two-million-step inputs MalConv targets. ByteNet and similar architectures for machine translation handle far shorter sequences than the malware detection problem requires.

The use of dilated convolutions, popularized by WaveNet and ByteNet for capturing wide receptive fields, was explored but found ineffective for binary data. Unlike spatially consistent domains such as images and audio, the values that fall in the "holes" of a dilated convolution are not easily interpolated for binary content, which leads to poor performance. RNN-based approaches run into memory and computational limits when dealing with sequences of this magnitude.
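
To make the scale gap concrete, the sketch below computes the receptive field of a stack of dilated 1-D convolutions with the standard formula `1 + sum((k - 1) * d)`, and the ratio between a two-million-byte file and one second of 16 kHz audio. The kernel size of 2 and the doubling dilation schedule are illustrative assumptions in the style of WaveNet, not values taken from the MalConv paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Illustrative WaveNet-style schedule: dilation doubles at every layer.
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
rf = receptive_field(kernel_size=2, dilations=dilations)
print(rf)  # 1024

# Ratio between a two-million-byte executable and one second of 16 kHz audio.
print(2_000_000 / 16_000)  # 125.0
```

Even ten doubling layers cover only about a thousand positions, which gives a sense of why a file-wide receptive field is hard to reach this way.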

2.2 Neural Networks for Malware Detection

[ENG] Previous applications of neural networks to malware detection have relied heavily on domain knowledge and feature extraction, limiting their generalizability. Saxe and Berlin (2015) used histogram-based features including byte entropy values, ASCII string lengths, and PE metadata, discarding most information about actual binary content. Kolosnjaji et al. (2016) applied LSTM networks to API call sequences from dynamic analysis, focusing on only 60 kernel API calls.
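
As an illustration of the kind of hand-engineered feature mentioned above, the following sketch computes the Shannon entropy of a window of bytes. It is a generic example of a byte-entropy value, offered as an assumption about what such a feature can look like; it is not the exact feature extraction used by Saxe and Berlin.

```python
import math
from collections import Counter


def byte_entropy(window: bytes) -> float:
    """Shannon entropy, in bits per byte, of the byte values in `window`."""
    if not window:
        return 0.0
    counts = Counter(window)
    total = len(window)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(byte_entropy(b"\x00" * 64))       # 0.0: a constant run carries no information
print(byte_entropy(bytes(range(256))))  # 8.0: every byte value equally likely
```

A summary statistic like this keeps a single number per window and discards which bytes occurred in which order, which is the information loss the review points to.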

The work most closely related to MalConv in terms of feature representation is the PE-header network by Raff et al. (2017), which achieved high accuracy using only 300 bytes from PE headers. However, this approach still requires domain knowledge about executable file structure and cannot process the entire binary content. Most existing work relies on sophisticated emulation environments and manual feature engineering, creating barriers to reproduction and extension.

--

3. Training data

[ENG] The training data for MalConv consists of two primary groups with distinct collection methodologies to ensure robust evaluation. Group B contains 400,000 files split evenly between benign and malicious classes, provided by an anti-virus industry partner and representing files encountered on real machines. The Group B test set includes 77,349 files with 40,000 malicious and the remainder benign samples.

Group A data follows the conventional academic approach, with benign samples from clean Microsoft Windows installations and common applications, while malware comes from the VirusShare corpus. The Group A test set contains 43,967 malicious and 21,854 benign files. A critical finding revealed that training on Group A-style data results in severe overfitting, with models learning to recognize "from Microsoft" rather than genuinely benign characteristics.

An extended dataset of 2,011,786 binaries (1,000,020 benign and 1,011,766 malicious) was later obtained to demonstrate MalConv's continued improvement with increased training data, while byte n-gram approaches showed performance plateauing.
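
The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below derives the implied benign count for the Group B test set, the Group A test total, and the share of malicious files in each test set, which is the accuracy a trivial "always malicious" guess would reach. The variable names are illustrative only.

```python
# Test-set and extended-set sizes as reported in the text above.
group_b_test_total = 77_349
group_b_test_malicious = 40_000
group_b_test_benign = group_b_test_total - group_b_test_malicious

group_a_test_malicious = 43_967
group_a_test_benign = 21_854
group_a_test_total = group_a_test_malicious + group_a_test_benign

extended_total = 1_000_020 + 1_011_766

print(group_b_test_benign)  # 37349
print(group_a_test_total)   # 65821
print(extended_total)       # 2011786

# Fraction of malicious files in each test set.
print(round(group_a_test_malicious / group_a_test_total, 3))  # 0.668
print(round(group_b_test_malicious / group_b_test_total, 3))  # 0.517
```

Group A's test set is roughly two-thirds malicious while Group B's is close to balanced, so raw accuracy figures on the two sets are not directly comparable without keeping this in mind.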

--

4. Model Architecture

[ENG] The MalConv architecture is designed with three key requirements: scalability with sequence length, ability to consider both local and global context across entire files, and explanatory capability for analysis of flagged malware. The model processes raw byte sequences through an embedding layer that maps each byte (0-255) to an 8-dimensional learned feature vector, avoiding the false assumption that certain byte values are intrinsically closer than others.
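
The embedding step can be pictured as a table lookup: each byte value selects one row of a 256 x 8 table, so byte 0x41 is not treated as "close to" byte 0x42 unless training makes their rows similar. The sketch below uses random numbers in place of learned weights; the names `embedding` and `embed` are hypothetical.

```python
import random

EMBED_DIM = 8    # embedding size stated in the text
NUM_BYTES = 256  # possible byte values 0..255

rng = random.Random(0)  # fixed seed; real weights would be learned by training
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(NUM_BYTES)]


def embed(data: bytes):
    """Map each byte to its row of the embedding table."""
    return [embedding[b] for b in data]


vectors = embed(b"MZ\x90\x00")
print(len(vectors), len(vectors[0]))  # 4 8
```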

The core architecture employs a gated convolution with 128 filters, using a large filter width of 500 bytes combined with an aggressive stride of 500, so that consecutive windows do not overlap. This design choice addresses GPU memory constraints while enabling efficient data-parallel training. The gated convolution follows Dauphin et al. (2016): the output of one convolution is multiplied element-wise by the sigmoid of a second convolution, which controls information flow.
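
A minimal sketch of the gating and striding described above, written in plain Python for a single filter over a single input channel. The real model applies 128 filters to 8-dimensional embeddings; the scalar input, the function names, and the toy weights here are simplifying assumptions for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def num_windows(length: int, width: int = 500, stride: int = 500) -> int:
    """Number of convolution outputs when no padding is applied."""
    if length < width:
        return 0
    return (length - width) // stride + 1


def gated_conv1d(x, w_a, b_a, w_b, b_b, width, stride):
    """Single-filter gated convolution over a list of scalars.

    output[t] = (w_a . window + b_a) * sigmoid(w_b . window + b_b)
    """
    outputs = []
    for start in range(0, len(x) - width + 1, stride):
        window = x[start:start + width]
        a = sum(w * v for w, v in zip(w_a, window)) + b_a
        g = sum(w * v for w, v in zip(w_b, window)) + b_b
        outputs.append(a * sigmoid(g))
    return outputs


print(num_windows(2_000_000))  # 4000

# Toy example: width 2, stride 2, gate weights of zero so the gate is 0.5.
out = gated_conv1d([1.0, 2.0, 3.0, 4.0], w_a=[1.0, 1.0], b_a=0.0,
                   w_b=[0.0, 0.0], b_b=0.0, width=2, stride=2)
print(out)  # [1.5, 3.5]
```

With width and stride both at 500, a two-million-byte input shrinks to 4,000 positions after one layer, which is what keeps the activation memory manageable.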

Temporal max-pooling extracts the maximum activation across the entire sequence, allowing the model to detect informative features regardless of their location within the binary. This design addresses the high positional variation in executable files where contents can be arbitrarily rearranged while maintaining functionality. The final component is a fully connected layer with softmax activation for binary classification.
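
The position independence of temporal max-pooling can be seen in a toy example: the pooled value is the same whether a strong activation occurs early or late in the file. The function name and the activation values below are hypothetical.

```python
def temporal_max_pool(activations):
    """Global max over time for each filter.

    `activations[f][t]` is the output of filter f at window t.
    Returns one (max_value, argmax_window) pair per filter.
    """
    pooled = []
    for per_filter in activations:
        best_t = max(range(len(per_filter)), key=per_filter.__getitem__)
        pooled.append((per_filter[best_t], best_t))
    return pooled


# The same strong activation at two different positions.
early = [[0.1, 0.9, 0.2, 0.0]]
late = [[0.0, 0.2, 0.1, 0.9]]

print(temporal_max_pool(early))  # [(0.9, 1)]
print(temporal_max_pool(late))   # [(0.9, 3)]
```

Only the value 0.9 would be passed on to the fully connected layer in both cases; the window index is kept here purely to show where it came from.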

4.1 On Failed Architectures

[ENG] Extensive experimentation with alternative architectures revealed several fundamental challenges in processing extremely long sequences for malware detection. Deep convolutional networks with up to 13 layers suffered from gradient vanishing problems and required rapid compression of state size per layer due to memory constraints, ultimately inhibiting learning. The standard approach of doubling convolutional filters after each pooling round becomes computationally intractable with 2 million time steps.

Chunking approaches, where files are divided into 500-10,000 byte segments for independent processing, achieved reasonable training accuracies up to 95% but failed to generalize with test accuracies dropping to 65-80%. This failure occurs because much of a binary's content may be non-informative for maliciousness decisions, and training on random chunks encourages overfitting to training data rather than learning discriminative features.

RNN-based architectures performed poorly when applied after convolutions, as reshaping temporal CNN outputs into fixed-sized chunks imposes an artificial prior that activation patterns must appear at consistent frequencies. The malware "image" approach, treating bytes as grayscale pixels with arbitrary width selection, introduces false spatial correlations and fails to handle variable file sizes meaningfully.
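
The point about arbitrary image width can be illustrated directly: when a byte string is laid out row by row, which byte ends up "below" another depends entirely on the chosen width. The helper names below are hypothetical and the 12-byte input is a toy stand-in for a real binary.

```python
def vertical_neighbor_offset(offset: int, width: int) -> int:
    """File offset of the pixel directly below `offset` in a row-major image."""
    return offset + width


def to_rows(data: bytes, width: int):
    """Reshape a byte string into rows of `width` pixels (last row may be short)."""
    return [list(data[i:i + width]) for i in range(0, len(data), width)]


data = bytes(range(12))
print(to_rows(data, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(to_rows(data, 6))  # [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]

# Byte 1 sits above byte 5 at width 4 but above byte 7 at width 6.
print(vertical_neighbor_offset(1, 4), vertical_neighbor_offset(1, 6))  # 5 7
```

A 2-D convolution would treat those vertical neighbors as related, even though the relationship is an artifact of the width rather than of the file's contents.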

--

5. Results

5.1 Malware classification

[ENG] MalConv shows strong performance across multiple metrics and test sets, with comparable accuracy on the Group A and Group B evaluations. The model achieves 88.1% accuracy on Group A and 89.6% on Group B, with AUC scores of 98.5% and 95.8% respectively. Notably, MalConv shows the smallest performance difference between test groups, which suggests that the learned features generalize well across different data distributions.

The application of DeCov regularization significantly improves model accuracy by up to 4.8 percentage points, primarily through better calibration of decision thresholds rather than fundamental concept changes. When trained on the extended 2 million sample dataset, MalConv's performance increases substantially: Group A accuracy improves to 94.0% and Group B to 90.9%, while Group B AUC reaches 98.2%.
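
The gap between accuracy and AUC in the figures above, and the remark about threshold calibration, can be illustrated with a small sketch. AUC depends only on how the scores rank malicious against benign samples, while accuracy also depends on where the decision threshold sits. The scores below are hypothetical and are not results from the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random malicious sample outscores a random benign one.

    Ties count as one half. Quadratic in sample count, fine for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 1 = malicious, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(auc(labels, scores))             # 0.9375
print(accuracy(labels, scores, 0.5))   # 0.75
print(accuracy(labels, scores, 0.65))  # 0.875
```

Moving the threshold from 0.5 to 0.65 raises accuracy without changing the ranking, and therefore without changing AUC.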

Comparative analysis reveals that byte n-gram models, while achieving high Group B performance (92.5% accuracy, 97.9% AUC), show significant performance gaps between test groups and demonstrate brittleness to single-byte modifications. The PE-header network achieves slightly higher Group A accuracy (90.8%) but significantly reduced Group B performance, indicating limited feature diversity.

5.2 Manual Analysis

[ENG] MalConv's interpretability comes from a sparse Class Activation Map (sparse-CAM) approach that adapts the methodology of Zhou et al. (2016) to the extreme sequence lengths encountered in malware detection. Because the model uses global max-pooling instead of average pooling, the activation maps are naturally sparse: each convolutional filter contributes exactly one 500-byte region, so at most 128 regions per binary are marked as important for the classification decision.
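
The localisation step described above can be sketched as follows: for each filter, find the window that won the max-pool and convert its index back to a byte range. This is a simplified illustration under the assumption of non-overlapping 500-byte windows; it does not show how each region is then attributed to the benign or malicious class, and the function name is hypothetical.

```python
def sparse_cam_regions(activations, width=500, stride=500):
    """Byte range selected by max-pooling for each filter.

    `activations[f][t]` is the output of filter f at window t.
    Returns one (filter_index, start_byte, end_byte) tuple per filter,
    with `end_byte` exclusive.
    """
    regions = []
    for f, per_filter in enumerate(activations):
        t = max(range(len(per_filter)), key=per_filter.__getitem__)
        start = t * stride
        regions.append((f, start, start + width))
    return regions


# Two hypothetical filters over a 2,000-byte file (four windows each).
acts = [[0.1, 0.7, 0.2, 0.0],
        [0.3, 0.1, 0.0, 0.9]]
print(sparse_cam_regions(acts))  # [(0, 500, 1000), (1, 1500, 2000)]
```

An analyst could then map each byte range onto the PE section that contains it, which is how per-section percentages like the ones reported in this section can be tallied.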

Analysis of 224 randomly selected binaries from Group A test set reveals that MalConv utilizes significantly more diverse information sources compared to byte n-gram approaches. While previous byte n-gram models obtained almost all information from PE-headers, MalConv derives only 58-61% of its information from this region, indicating broader feature utilization across different binary sections.

The model demonstrates sophisticated understanding of binary structure, with activations distributed across .rsrc sections (16%), .text and CODE sections (14%), indicating utilization of resource directories and executable code as features. Notably, the model shows balanced activation patterns for UPX1 sections (indicating packed executables) for both benign and malicious samples, suggesting it has learned to avoid the common but unhelpful association between packing and maliciousness.

5.3 The Failure of Batch-Normalization

[ENG] One of the most significant findings in MalConv research is the complete failure of batch normalization, a technique that typically accelerates convergence and improves generalization in deep learning. Models incorporating batch normalization consistently failed to learn, achieving at best 60% training accuracy and 50% test accuracy across multiple framework implementations including PyTorch, TensorFlow, Chainer, and Theano.

Kernel density estimation analysis reveals the root cause: binary executable data exhibits multi-modal activation distributions fundamentally different from the approximately Gaussian distributions assumed by batch normalization. While image processing networks show smooth, unimodal activation patterns suitable for normalization, MalConv activations display distinct multi-modal characteristics with multiple peaks.

This multi-modal nature stems from the diverse byte content within executables, where the same byte value can represent ASCII text, binary code, structured data, or embedded images depending on context. The violation of batch normalization's normality assumption leads to degraded performance, with the technique only functioning when trained on homogeneous sub-regions of 500-10,000 bytes, though these models still failed to generalize to test data.
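
A toy example helps show why fixing the mean and variance does not make a multi-modal distribution look Gaussian. The sketch below applies the usual batch-normalization formula to a deliberately bimodal batch; the values are hypothetical and this does not reproduce the kernel density analysis from the paper.

```python
import statistics


def batch_normalize(values, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical bimodal activations: half the batch at -5, half at +5.
batch = [-5.0] * 50 + [5.0] * 50
normed = batch_normalize(batch)

print(round(statistics.fmean(normed), 6))      # 0.0
print(round(statistics.pvariance(normed), 3))  # 1.0
print(sum(abs(v) < 0.5 for v in normed))       # 0
```

After normalization the batch has mean 0 and variance 1, yet no sample lies near the mean: the two modes survive, so the normalized activations still look nothing like a single bell curve.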

--

6. Conclusion

[ENG] MalConv represents a paradigm shift in malware detection by demonstrating that neural networks can successfully learn to identify malicious software directly from raw byte sequences without domain knowledge. The model's ability to process entire PE files up to 2 million bytes establishes it as the first architecture capable of handling such extreme sequence lengths in cybersecurity applications.

The research contributions extend beyond malware detection to the broader machine learning community by identifying unique challenges in processing multi-modal sequential data and proposing effective solutions. The discovery of batch normalization's failure provides valuable insights for future work on non-standard data distributions, while the sparse-CAM interpretability approach offers a practical method for understanding model decisions on extremely long sequences.

Future research directions include developing architectures that better handle multi-modal data, improving memory efficiency for even longer sequences, and integrating semantic understanding of code structure. The model's demonstrated scalability with increased training data suggests continued performance improvements as larger datasets become available, positioning MalConv as a foundation for next-generation cybersecurity solutions.
