---
language: en
license: mit
library_name: tensorflow
tags:
- malware-detection
- text-classification
- tensorflow
- keras
datasets:
- custom
- iosifache/DikeDataset
metrics:
- accuracy
- precision
- recall
- f1
---

# Model Card for MalConv

## Model Details

**Model Description:** This is a TensorFlow 2 implementation of the MalConv model, a deep neural network for malware detection from raw byte sequences. MalConv is a convolutional neural network (CNN) designed to classify executable files as either malicious or benign. It takes the raw bytes of an entire executable file as input, making it an end-to-end, feature-free malware detection model.

* **Developed by:** This implementation by [Your Name or Organization], based on the original work by Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles Nicholas.
* **Model type:** Binary Classification
* **Language(s) (NLP):** Not applicable
* **License:** MIT
* **Finetuned from model:** Not applicable

## Uses

### Direct Use

This model is intended for classifying executable files as either malicious or benign.

### Downstream Use

This model can be used as a base for further fine-tuning on other malware classification tasks.

### Out-of-Scope Use

This model is designed for binary classification of executables and should not be used for other tasks such as malware generation or analysis of other file types.

## Bias, Risks, and Limitations

This model's performance is highly dependent on the dataset it was trained on. If the training data is not representative of the types of malware you are trying to detect, the model's performance will be poor. The model may also be susceptible to adversarial attacks.

### Recommendations

Users should be aware of the potential biases and limitations of the model. It is recommended to train the model on a large and diverse dataset of malware and benign files.

## How to Get Started with the Model

### 1. Data Preparation

Before training or tuning the model, you need to prepare your dataset. You can do this in two ways:

* **CSV File:** Create a CSV file with two columns: `filepath` and `label`. The `filepath` column should contain the absolute path to each executable file, and the `label` column should contain the corresponding label (0 for benign, 1 for malware).
* **Directories:** Organize your malware and benign files into separate directories.

### 2. Training

To train the MalConv model, use the `src/train.py` script. You can provide the training data using either a CSV file or directories.

**Using a CSV file:**

```bash
python src/train.py --csv /path/to/your/data.csv
```

**Using directories:**

```bash
python src/train.py --malware_dir /path/to/malware --benign_dir /path/to/benign
```

The trained model will be saved to `models/malconv_model.h5` by default. You can change this with the `--save_path` argument.

### 3. Prediction

To make predictions on new executable files, use the `src/predict.py` script.

**Predicting a single file:**

```bash
python src/predict.py /path/to/your/model.h5 --file /path/to/your/executable.exe
```

**Predicting a batch of files from a CSV:**

```bash
python src/predict.py /path/to/your/model.h5 --csv /path/to/your/files.csv --output /path/to/your/predictions.csv
```

## Training Details

**Training Data:** This model should be trained on a large and diverse dataset of malware and benign executable files. The original paper used a dataset of 1.2 million files. Another option is the [DIKE dataset](https://github.com/iosifache/dike-dataset), which contains both benign and malicious PE and OLE files.

**Training Procedure:** The model was trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 512. The training procedure is described in detail in the original paper.

## Evaluation

**Testing Data:** The model should be evaluated on a held-out test set of malware and benign executable files.
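
One way to obtain such a held-out set is to split the prepared CSV before training. The sketch below is not part of this repository: the helper names (`read_rows`, `stratified_split`, `write_rows`), the 20% test fraction, and the fixed seed are all illustrative assumptions. It uses only the Python standard library and assumes the two-column `filepath`/`label` layout described under Data Preparation.

```python
import csv
import random
from collections import defaultdict


def read_rows(path):
    """Read (filepath, label) pairs from a CSV with `filepath` and `label` columns."""
    with open(path, newline="") as handle:
        return [(row["filepath"], int(row["label"])) for row in csv.DictReader(handle)]


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping the benign/malware ratio in both parts.

    The default fraction and seed are illustrative, not values from this project.
    """
    by_label = defaultdict(list)
    for filepath, label in rows:
        by_label[label].append((filepath, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Hold out at least one file per class when the class has more than one file.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def write_rows(path, rows):
    """Write (filepath, label) pairs back out with the same two-column header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filepath", "label"])
        writer.writerows(rows)
```

For example, calling `stratified_split(read_rows("/path/to/your/data.csv"))` and writing the two parts with `write_rows` yields one CSV to pass to `src/train.py --csv` and one to keep aside for evaluation. Whether `src/predict.py --csv` accepts the extra `label` column is not stated here, so check the script before reusing the test CSV directly.
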

**Metrics:** The model's performance can be evaluated using the following metrics:

* Accuracy
* Precision
* Recall
* F1-score

**Results:** The results of the evaluation will depend on the dataset used. The original paper reported an AUC of 0.99.

## Model Card Authors

[Seokhee Chang]

## Model Card Contact

[cycloevan97@gmail.com]

## References

- [Malware Detection by Eating a Whole EXE (arXiv:1710.09435)](https://arxiv.org/abs/1710.09435)

## Citation

If you use this code in your research, please cite the original MalConv paper:

```
@article{raff2017malware_arxiv,
  title={Malware Detection by Eating a Whole EXE},
  author={Edward Raff and Jon Barker and Jared Sylvester and Robert Brandon and Bryan Catanzaro and Charles Nicholas},
  year={2017},
  eprint={1710.09435},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}
```