DGA domain classifier (Keras-ONNX)#
This tutorial is related to the DGA domain classifier using H2O engine tutorial, but here Keras is used as the machine learning engine.
We are going to use the Keras framework to create a model capable of detecting whether a domain is malicious. Then, in order to register and use the Keras model in Devo, we will show how to convert it to ONNX format.
Requirements#
Python >= 3.7.
Devo table demo.ecommerce.data.
For convenience, we recommend creating a virtual environment to run the tutorial, or using the notebook provided.
Setup#
Let’s start by installing the required packages.
$ pip install devo-sdk
$ pip install devo-mlmodelmanager
$ pip install tensorflow
$ pip install tf2onnx
$ pip install scikit-learn
$ pip install numpy
$ pip install pandas
We can start coding with the required imports.
import os
import math
import numpy as np
import pandas as pd
import tensorflow as tf
import tf2onnx
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ
from devo_ml.modelmanager import create_client_from_token, engines
Declare some constants for convenience in the code.
# A valid Devo access token
DEVO_TOKEN = '<your_token_here>'
# URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
DEVO_API_URL = '<devo_api_url_here>'
# URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/
DEVO_MLMM_URL = '<devo_mlmm_url_here>'
# The domain to connect to, e.g. self
DOMAIN = '<your_domain_here>'
# The name of the model
MODEL_NAME = 'dga_classifier_onnx'
# The description of the model
MODEL_DESCRIPTION = 'DGA domain classifier (Keras-ONNX)'
# File to store the onnx model
MODEL_FILE = f'{MODEL_NAME}.onnx'
# The URL of a dataset to build the model
DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv"
VOWELS = "aeiouAEIOU"
# fix random seeds (NumPy and TensorFlow) for reproducibility
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)
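Hard-coding credentials is fine for a quick test, but a common alternative is to read them from the environment. The following is a minimal sketch of that idea. The environment variable names used here (DEVO_TOKEN, DEVO_API_URL, DEVO_MLMM_URL, DEVO_DOMAIN) and the helper setting() are assumptions made for this example; they are not defined by Devo or its SDK.

```python
import os


def setting(name, default):
    """Return the environment variable `name`, or `default` if unset or empty."""
    return os.environ.get(name) or default


# Hypothetical variable names, chosen only for this example.
DEVO_TOKEN = setting("DEVO_TOKEN", "<your_token_here>")
DEVO_API_URL = setting("DEVO_API_URL", "<devo_api_url_here>")
DEVO_MLMM_URL = setting("DEVO_MLMM_URL", "<devo_mlmm_url_here>")
DOMAIN = setting("DEVO_DOMAIN", "<your_domain_here>")
```

This keeps the token out of the source file and of any notebook that gets shared.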
Prepare the data#
This dataset will help us train our model once it has been built. The dataset has the form host;domain;class;subclass.
host;domain;class;subclass
000directory.com.ar;000directory;legit;legit
001fans.com;001fans;legit;legit
...
1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz
100bestbuy.com;100bestbuy;legit;legit
...
In the dataset preparation we will add the columns length, entropy and vowel_proportion for each domain, as well as the flag malicious, which indicates whether it is a DGA domain according to the class column value.
def entropy(s):
    # Shannon entropy (in bits) of the character distribution of s
    l = len(s)
    return -sum(map(lambda a: (a/l)*math.log2(a/l), Counter(s).values()))

domains = pd.read_csv(DATASET_URL, sep=';')
# discard rows without a subclass
domains = domains[~domains['subclass'].isna()]
domains['length'] = domains['domain'].str.len()
domains['entropy'] = domains['domain'].apply(entropy)
# count the vowels of each domain, then divide by its length
domains['vowel_proportion'] = 0
for v in VOWELS:
    domains['vowel_proportion'] += domains['domain'].str.count(v)
domains['vowel_proportion'] /= domains['length']
# every domain that is not legit is flagged as malicious
domains['malicious'] = domains['class'] != 'legit'
After preparation, our dataset of domains should look like this.
>>> domains.head()
                  host          domain  class  subclass  length   entropy  vowel_proportion  malicious
0  000directory.com.ar    000directory  legit     legit      12  3.022055          0.250000      False
1       000webhost.com      000webhost  legit     legit      10  2.846439          0.200000      False
2          001fans.com         001fans  legit     legit       7  2.521641          0.142857      False
3   01-telecharger.com  01-telecharger  legit     legit      14  3.324863          0.285714      False
4       010shangpu.com      010shangpu  legit     legit      10  3.121928          0.200000      False
Note
Be aware that our dataset is a pandas.DataFrame.
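To make the three features concrete, here is a small sketch that computes them for a single domain without pandas. It mirrors the preparation code above under the same definitions; the function name domain_features is made up for this example.

```python
import math
from collections import Counter

VOWELS = "aeiouAEIOU"


def entropy(s):
    """Shannon entropy (in bits) of the characters of s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def domain_features(domain):
    """Return the features used by the model for one domain string."""
    length = len(domain)
    vowels = sum(1 for ch in domain if ch in VOWELS)
    return {
        "length": length,
        "entropy": entropy(domain),
        "vowel_proportion": vowels / length,
    }


# First row of the table above: length 12, entropy ~3.022055, proportion 0.25
features = domain_features("000directory")
print(features)
```

Randomly generated names tend to have a higher entropy and fewer vowels than dictionary-like names, which is what the classifier is meant to pick up on.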
Build the model#
We are now ready to build the model. We will rely on a Keras Sequential model for that.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(
    10,
    input_dim=3,
    activation=tf.nn.relu,
    kernel_initializer='he_normal',
    kernel_regularizer=tf.keras.regularizers.l2(0.01)
))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(
    7,
    activation=tf.nn.relu,
    kernel_initializer='he_normal',
    kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001)
))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(
    5,
    activation=tf.nn.relu,
    kernel_initializer='he_normal',
    kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001)
))
model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))
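As a quick sanity check of the architecture above, the number of parameters can be worked out by hand. The sketch below is plain Python arithmetic, not a Keras call. It assumes the usual parameter counts: a Dense layer has one weight per input-output pair plus one bias per unit, and a BatchNormalization layer has four values per unit, two of which (the moving mean and variance) are not trainable. The helper names are made up for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_units):
    """gamma, beta, moving mean and moving variance for each unit."""
    return 4 * n_units


layers = [
    dense_params(3, 10),    # Dense(10, input_dim=3)
    batchnorm_params(10),   # BatchNormalization
    dense_params(10, 7),    # Dense(7)
    batchnorm_params(7),    # BatchNormalization
    dense_params(7, 5),     # Dense(5)
    dense_params(5, 2),     # Dense(2, softmax)
]
total = sum(layers)
# Moving mean and variance of both BatchNormalization layers are not trainable
non_trainable = 2 * 10 + 2 * 7
trainable = total - non_trainable
print(total, trainable, non_trainable)
```

These figures can be compared with the output of model.summary(). Dropout layers add no parameters, so they are left out.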
Before we can train our model we have to properly transform the data for Keras.
Y = domains['malicious']
# keep only the three numeric features: length, entropy, vowel_proportion
X = domains.drop(
    ['host', 'domain', 'class', 'subclass', 'malicious'],
    axis=1
)
# The categorical_crossentropy loss expects one-hot encoded targets.
lbl_clf = LabelEncoder()
Y_final = tf.keras.utils.to_categorical(lbl_clf.fit_transform(Y))
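To illustrate what the target transformation produces, the following pure-Python sketch imitates the two steps on a handful of flags. It is an illustration of the idea, not the scikit-learn or Keras implementation, and the helper names are made up. It assumes that the label encoder numbers the classes in sorted order, so False becomes 0 and True becomes 1.

```python
def encode_labels(flags):
    """Map each flag to the index of its class in sorted order."""
    classes = sorted(set(flags))          # [False, True]
    return [classes.index(f) for f in flags], classes


def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label index."""
    return [
        [1.0 if i == label else 0.0 for i in range(num_classes)]
        for label in labels
    ]


flags = [False, True, False]
labels, classes = encode_labels(flags)
targets = one_hot(labels, len(classes))
print(labels)   # [0, 1, 0]
print(targets)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Under that assumption, the first output of the softmax layer is the probability that a domain is not malicious, and the second is the probability that it is.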
Let’s train our model with our transformed datasets, X and Y_final.
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(X, Y_final, epochs=10, batch_size=7)
You will see the progress of the training in the output, something like this.
>>>
Epoch 1/10
19133/19133 [==============================] - 59s 3ms/step - loss: 0.4520 - accuracy: 0.8100
Epoch 2/10
19133/19133 [==============================] - 58s 3ms/step - loss: 0.4413 - accuracy: 0.8037
Epoch 3/10
19133/19133 [==============================] - 54s 3ms/step - loss: 0.4282 - accuracy: 0.8098
Epoch 4/10
19133/19133 [==============================] - 54s 3ms/step - loss: 0.4301 - accuracy: 0.8098
Epoch 5/10
19133/19133 [==============================] - 55s 3ms/step - loss: 0.4299 - accuracy: 0.8085
Epoch 6/10
19133/19133 [==============================] - 55s 3ms/step - loss: 0.4249 - accuracy: 0.8124
Epoch 7/10
19133/19133 [==============================] - 54s 3ms/step - loss: 0.4284 - accuracy: 0.8101
Epoch 8/10
19133/19133 [==============================] - 57s 3ms/step - loss: 0.4292 - accuracy: 0.8083
Epoch 9/10
19133/19133 [==============================] - 58s 3ms/step - loss: 0.4295 - accuracy: 0.8096
Epoch 10/10
19133/19133 [==============================] - 57s 3ms/step - loss: 0.4278 - accuracy: 0.8091
<keras.callbacks.History at 0x7f02e1620610>
Note
The Keras framework is beyond the scope of this tutorial. Please refer to the Keras API reference to learn more.
Transform to ONNX#
In order to register the model in Devo we need to transform it to ONNX format first.
We will use the tf2onnx tool to convert our Keras model to ONNX and save it.
# from_keras returns a tuple: (model_proto, external_tensor_storage)
onnx_model, _ = tf2onnx.convert.from_keras(model, opset=13, output_path=MODEL_FILE)
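Before registering the file, it can be useful to check that the exported model loads and accepts the expected input. The sketch below assumes the onnxruntime package is installed (pip install onnxruntime), which is not among the requirements of this tutorial. The function name check_onnx_model is made up for this example.

```python
def check_onnx_model(path, sample):
    """Run one sample through an ONNX file and return (input_shape, output)."""
    # Imported lazily so this block loads even where the optional
    # onnxruntime package is not installed.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    model_input = session.get_inputs()[0]
    # The model expects a float32 batch of shape (n_rows, n_features)
    batch = np.asarray([sample], dtype=np.float32)
    outputs = session.run(None, {model_input.name: batch})
    return model_input.shape, outputs[0]


# Example call, using the features of the first row of the dataset:
# shape, probabilities = check_onnx_model(MODEL_FILE, [12, 3.022055, 0.25])
```

If the conversion went as expected, the reported input shape should end in 3 (length, entropy, vowel_proportion) and the output should hold two probabilities for the sample.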
Register the model#
Once the model has been transformed and saved, it must be registered on the Devo platform in order to exploit it.
We will use the ML Model Manager Client for that.
# create the mlmm client
mlmm = create_client_from_token(DEVO_MLMM_URL, DEVO_TOKEN)
# register the model
mlmm.add_model(
    MODEL_NAME,
    engines.ONNX,
    MODEL_FILE,
    description=MODEL_DESCRIPTION,
    force=True
)
Note
Refer to the User’s Guide of this documentation to learn more about the ML Model Manager Client.
We now have everything ready to use our model, i.e. to detect malicious domains.
Classify domains#
One way to evaluate a model is to use the mlevalmodel(...) operator when querying a table. The mlevalmodel(...) operator is capable of evaluating machine learning models and is available in the Devo query engine.
We are going to use the demo.ecommerce.data table, which contains the referralUri field, from which we can extract the domain we want to check.
A query like the following does the job.
# The model was trained on three features, in this order:
# length, entropy and vowel_proportion (vowel count divided by length).
query = f'''from demo.ecommerce.data
  select split(referralUri, "/",2) as domain,
  float(length(domain)) as length,
  shannonentropy(domain) as entropy,
  float(countbyfilter(domain, "{VOWELS}")) / length as vowel_proportion,
  at(mlevalmodel("{DOMAIN}", "{MODEL_NAME}", [float4(length), float4(entropy), float4(vowel_proportion)]),0) as res,
  ifthenelse(res>0.5, "false", "true") as isMalicious
'''
Note
Refer to Build a query using LINQ to learn more about queries.
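The two less obvious steps of the query, extracting the domain and turning the model output into a label, can be mimicked in plain Python. This is only an illustration of the logic: it assumes that split(referralUri, "/", 2) behaves like splitting on "/" and taking the piece at index 2, and that res is the first softmax output, i.e. the probability that the domain is not malicious. The sample URI and the function names are made up for this example.

```python
def extract_domain(referral_uri):
    """Return the piece at index 2 after splitting on '/', or None."""
    parts = referral_uri.split("/")
    return parts[2] if len(parts) > 2 else None


def is_malicious_label(res):
    """res is the probability of the 'not malicious' class (index 0)."""
    return "false" if res > 0.5 else "true"


# "http://host/path" splits into ["http:", "", "host", "path"]
print(extract_domain("http://www.logcasts.com/index.html"))  # www.logcasts.com
print(is_malicious_label(0.9))  # false
print(is_malicious_label(0.1))  # true
```

This is why the query maps a value above 0.5 to "false": a high value at index 0 means the model considers the domain legitimate.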
Now we just need to create a Devo API client and launch the query.
With the Devo Python SDK we can, among other things, execute queries against the Devo platform easily and securely.
# create a Devo API client
api = Client(
    auth={"token": DEVO_TOKEN},
    address=DEVO_API_URL,
    config=ClientConfig(
        response="json/simple/compact",
        stream=True,
        processor=SIMPLECOMPACT_TO_OBJ
    )
)
# run the query over the last hour and print the verdict for each row
response = api.query(query=query, dates={'from': "now()-1*hour()"})
for row in response:
    print("domain: ", row['domain'], "isMalicious:", row['isMalicious'])
You will see a result like the following, depending on the contents of the demo.ecommerce.data table.
>>>
domain: www.logcasts.com isMalicious: false
domain: www.google.com isMalicious: false
domain: www.logtrust.com isMalicious: false
...
Note
Refer to Query API to learn more about the Devo Query API.
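Printing every row is enough for a demo, but in practice you will probably want a summary. The sketch below counts the verdicts and collects the flagged domains. It works on a list of dictionaries shaped like the rows printed above. The sample rows are made up for this example, and the "true" verdict on the last one is hypothetical, not an actual model result.

```python
from collections import Counter

# Sample rows shaped like the query results; the last verdict is hypothetical.
rows = [
    {"domain": "www.logcasts.com", "isMalicious": "false"},
    {"domain": "www.google.com", "isMalicious": "false"},
    {"domain": "www.logtrust.com", "isMalicious": "false"},
    {"domain": "1002n0q11m17h017r1shexghfqf.net", "isMalicious": "true"},
]


def summarize(rows):
    """Return (verdict counts, sorted list of distinct flagged domains)."""
    counts = Counter(row["isMalicious"] for row in rows)
    flagged = sorted(
        {row["domain"] for row in rows if row["isMalicious"] == "true"}
    )
    return counts, flagged


counts, flagged = summarize(rows)
print(dict(counts))
print(flagged)
```

With a streaming response the rows can only be iterated once, so collect them into a list first if you need both the printout and the summary.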