Model Card for MalConv
Model Details
Model Description: This is a TensorFlow 2 implementation of the MalConv model, a deep neural network for malware detection from raw byte sequences. MalConv is a convolutional neural network (CNN) designed to classify executable files as either malicious or benign. It takes the raw bytes of an entire executable file as input, making it an end-to-end, feature-free malware detection model.
- Developed by: This implementation by [Your Name or Organization], based on the original work by Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles Nicholas.
- Model type: Binary Classification
- Language(s) (NLP): Not applicable
- License: MIT
- Finetuned from model: Not applicable
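As a rough illustration of what "raw bytes as input" means in practice, the sketch below turns a file's bytes into a fixed-length sequence of integer tokens. The maximum length of 2,000,000 bytes, the use of 0 as a padding token, and the helper names bytes_to_tokens and file_to_tokens are assumptions made for this example; the preprocessing in this repository may differ.

```python
from pathlib import Path

# Assumed values for illustration, not taken from this repository.
MAX_LEN = 2_000_000  # maximum number of bytes kept per file
PAD_ID = 0           # token reserved for padding


def bytes_to_tokens(data, max_len=MAX_LEN):
    """Map raw bytes to a fixed-length list of integer tokens."""
    # Shift every byte by one so real bytes occupy 1..256 and 0 stays free for padding.
    tokens = [b + 1 for b in data[:max_len]]
    # Right-pad short files so every sequence has the same length.
    tokens.extend([PAD_ID] * (max_len - len(tokens)))
    return tokens


def file_to_tokens(path, max_len=MAX_LEN):
    """Read an executable from disk and convert it to tokens."""
    return bytes_to_tokens(Path(path).read_bytes(), max_len)
```

Files longer than the limit are truncated and shorter files are padded, so every input to the network has the same shape. For real workloads an array library would be far faster than Python lists.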
Uses
Direct Use
This model is intended for classifying executable files as either malicious or benign.
Downstream Use
This model can be used as a base for further fine-tuning on other malware classification tasks.
Out-of-Scope Use
This model is designed for binary classification of executables and should not be used for other tasks such as malware generation or analysis of other file types.
Bias, Risks, and Limitations
This model's performance depends heavily on the dataset it was trained on. If the training data is not representative of the malware and benign software you expect to encounter, detection accuracy is likely to degrade. The model may also be susceptible to adversarial attacks, for example files whose bytes have been modified or appended specifically to evade detection.
Recommendations
Users should be aware of the potential biases and limitations of the model. It is recommended to train the model on a large and diverse dataset of malware and benign files.
How to Get Started with the Model
1. Data Preparation
Before training or tuning the model, you need to prepare your dataset. You can do this in two ways:
- CSV file: Create a CSV file with two columns, filepath and label. The filepath column should contain the absolute path to each executable file, and the label column should contain the corresponding label (0 for benign, 1 for malware).
- Directories: Organize your malware and benign files into separate directories.
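If your files are already sorted into two directories and you would rather work from a CSV, the two-column file described above can be generated with a few lines of Python. This is a minimal sketch: the helper name write_label_csv is hypothetical, and the header row and the recursive directory search are assumptions rather than documented requirements of src/train.py.

```python
import csv
from pathlib import Path


def write_label_csv(malware_dir, benign_dir, csv_path):
    """Write a filepath,label CSV (0 = benign, 1 = malware); return the row count."""
    rows = []
    for directory, label in ((benign_dir, 0), (malware_dir, 1)):
        # Walk the directory recursively and keep regular files only.
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file():
                # Store absolute paths, as the CSV format above requires.
                rows.append((str(path.resolve()), label))
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filepath", "label"])
        writer.writerows(rows)
    return len(rows)
```

Calling write_label_csv("/path/to/malware", "/path/to/benign", "/path/to/your/data.csv") would produce a file in the layout expected by the CSV option.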
2. Training
To train the MalConv model, use the src/train.py script. You can provide the training data using either a CSV file or directories.
Using a CSV file:
python src/train.py --csv /path/to/your/data.csv
Using directories:
python src/train.py --malware_dir /path/to/malware --benign_dir /path/to/benign
The trained model will be saved to models/malconv_model.h5 by default. You can change this with the --save_path argument.
3. Prediction
To make predictions on new executable files, use the src/predict.py script.
Predicting a single file:
python src/predict.py /path/to/your/model.h5 --file /path/to/your/executable.exe
Predicting a batch of files from a CSV:
python src/predict.py /path/to/your/model.h5 --csv /path/to/your/files.csv --output /path/to/your/predictions.csv
Training Details
Training Data: This model should be trained on a large and diverse dataset of malware and benign executable files. The original paper used a dataset of 1.2 million files. Another option is the DIKE dataset, which contains both benign and malicious PE and OLE files.
Training Procedure: The model was trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 512. The training procedure is described in detail in the original paper.
Evaluation
Testing Data: The model should be evaluated on a held-out test set of malware and benign executable files.
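One simple way to obtain such a held-out set is a stratified random split, which keeps the malware-to-benign ratio roughly equal in both parts. The sketch below is only an illustration: the function name stratified_split, the 20% test fraction, and the fixed seed are assumptions, not settings used by this repository.

```python
import random


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (filepath, label) pairs into train and test lists, class by class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction of every class for the test set.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

The test list should then be kept out of training entirely and used only for the final evaluation.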
Metrics: The model's performance can be evaluated using the following metrics:
- Accuracy
- Precision
- Recall
- F1-score
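The four metrics above all follow from the confusion-matrix counts. The sketch below computes them from true labels and thresholded predictions, treating malware (label 1) as the positive class; the function name classification_metrics is hypothetical and the code is independent of the scripts in this repository.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 with malware (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware flagged as malware
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed as benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as malware
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

In malware detection the cost of a false positive and a false negative usually differ, so precision and recall are worth reading separately rather than relying on accuracy alone.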
Results: The results of the evaluation will depend on the dataset used. The original paper reported an AUC of 0.99.
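AUC, unlike the thresholded metrics, is computed from the model's raw scores: it is the probability that a randomly chosen malware sample receives a higher score than a randomly chosen benign one. The sketch below implements that definition directly. The function name roc_auc is hypothetical, and the pairwise loop is quadratic, so it is meant for understanding and small test sets rather than large-scale evaluation.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of malware and benign scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # malware ranked above benign
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

For example, roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) gives 0.75, because three of the four malware/benign pairs are ranked correctly.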
Model Card Authors
Seokhee Chang
Model Card Contact
References
Citation
If you use this code in your research, please cite the original MalConv paper:
@article{raff2017malware_arxiv,
title={Malware Detection by Eating a Whole EXE},
author={Edward Raff and Jon Barker and Jared Sylvester and Robert Brandon and Bryan Catanzaro and Charles Nicholas},
year={2017},
eprint={1710.09435},
archivePrefix={arXiv},
primaryClass={cs.CR}
}