This is a textual dataset of 57,293 malware samples and 77,142 benign samples. The samples were gathered over the span of a year by collecting hexadecimal representations of binary code and labeling each one as either ‘benign’ or ‘malware’. Compiled by researchers at the University of Illinois and Blue Hexagon, the dataset is a concerted effort to help shift malware identification from deterministic, predetermined approaches (classical algorithms) to dynamic, learnable ones (deep learning). The website links to the dataset itself and to a paper that further discusses its motivation and use cases.
Authors:
Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang