DGA domain scoring (GBM-ONNX)#
This tutorial shows how to perform real-time DGA domain classification using a machine learning model that outputs the classification probability (score).
We will use the Gradient Boosting algorithm from the scikit-learn library to create a model capable of detecting whether a domain is malicious. Then we will transform the model into ONNX format and modify its graph so that it outputs the classification score. Finally, we will register the model in the ML Model Manager to enable it in the Devo platform and exploit it through the Devo query engine.
Requirements#
Python >= 3.7.
Devo table demo.ecommerce.data.
For convenience, it is recommended to create a virtual environment to run the tutorial, or to use the provided notebook.
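For example, a virtual environment can be created and activated with the standard venv module (the environment name below is just an example):
$ python3 -m venv dga-tutorial
$ source dga-tutorial/bin/activate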
Setup#
Let’s start by installing the required packages. Open your favourite terminal and type the following command.
$ pip install devo-sdk \
devo-mlmodelmanager \
numpy \
onnx \
onnxruntime \
pandas \
scikit-learn \
skl2onnx
We are ready to start coding, so open your coding environment and start with the required imports.
import os
import math
import time
import numpy as np
import pandas as pd
from onnx import TensorProto
from onnx.defs import ONNX_ML_DOMAIN
from onnx.helper import make_node, make_tensor_value_info
from onnxruntime import InferenceSession
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from skl2onnx import convert_sklearn, to_onnx
from skl2onnx.common.data_types import FloatTensorType
from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ
from devo_ml.modelmanager import create_client_from_token, engines
Declare some constants for convenience in the code.
# A valid Devo access token
DEVO_TOKEN = '<your_token_here>'
# URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
DEVO_API_URL = '<devo_api_url_here>'
# URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/
DEVO_MLMM_URL = '<devo_mlmm_url_here>'
# The domain to connect to, e.g. self
DOMAIN = '<your_domain_here>'
# The name of the model
MODEL_NAME = 'dga_scoring'
# The description of the model
MODEL_DESCRIPTION = 'DGA domain label scoring'
# File to store the onnx model
MODEL_FILE = f'{MODEL_NAME}.onnx'
# The URL of a dataset to build the model
DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv"
# Random seed to initialize random variables
RANDOM_SEED = 42
Prepare the data#
This dataset will be used to train the model we are about to build. Each record has the form host;domain;class;subclass.
host;domain;class;subclass
000directory.com.ar;000directory;legit;legit
001fans.com;001fans;legit;legit
...
1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz
100bestbuy.com;100bestbuy;legit;legit
...
With the pandas library we can handle and transform data in a simple way, so create a pandas.DataFrame from the dataset.
df = pd.read_csv(DATASET_URL, sep=';')
This is the dataset as a pandas.DataFrame.
>>> df.head()
host domain class subclass
0 000directory.com.ar 000directory legit legit
1 000webhost.com 000webhost legit legit
2 001fans.com 001fans legit legit
3 01-telecharger.com 01-telecharger legit legit
4 010shangpu.com 010shangpu legit legit
We need to add the columns length, entropy and vowel_proportion for each domain, and also the flag malicious indicating whether it is a DGA domain according to the class column value.
def entropy(text):
"""Helper function to calculate the Shannon entropy of a text."""
prob = [float(text.count(c)) / len(text) for c in set(text)]
return -sum([p * math.log(p) / math.log(2.0) for p in prob])
# Discard rows without a subclass label
df = df[~df['subclass'].isna()]
# Feature engineering: domain length, vowel proportion and Shannon entropy
df['length'] = df['domain'].apply(lambda x: len(x))
df['vowel_proportion'] = df['domain'].apply(lambda x: sum([x.lower().count(v) for v in "aeiou"]) / len(x))
df['entropy'] = df['domain'].apply(lambda x: entropy(x))
# Binary target: 1 if the domain is a DGA domain, 0 if legit
df['malicious'] = df['class'].apply(lambda x: int(x != 'legit'))
This is the dataset ready to use.
>>> df.head()
host domain class subclass length vowel_proportion entropy malicious
0 000directory.com.ar 000directory legit legit 12 0.250000 3.022055 0
1 000webhost.com 000webhost legit legit 10 0.200000 2.846439 0
2 001fans.com 001fans legit legit 7 0.142857 2.521641 0
3 01-telecharger.com 01-telecharger legit legit 14 0.285714 3.324863 0
4 010shangpu.com 010shangpu legit legit 10 0.200000 3.121928 0
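As a quick intuition for why these features help, you can compare the entropy helper on a readable name and on a DGA-looking string taken from the dataset sample above; the random-looking string yields a noticeably higher value. This check is purely illustrative and not needed for the rest of the tutorial.
# Illustrative only: Shannon entropy of a readable name vs a DGA-like string
for name in ("facebook", "1002n0q11m17h017r1shexghfqf"):
    print(name, round(entropy(name), 4))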
Build the model#
We are now ready to build the model. We will rely on a sklearn GradientBoostingClassifier for that.
X_data = df[['length', 'vowel_proportion', 'entropy']].values
y_data = df['malicious'].values
# Split the data in test and train chunks
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, random_state=RANDOM_SEED)
model = GradientBoostingClassifier(random_state=RANDOM_SEED)
# Train the model
model = model.fit(X_train, y_train)
We can now check the accuracy of our model. We will use the F-score measure provided by the sklearn.metrics.f1_score function.
The X_test chunk we split off earlier allows us to validate the model.
# Validate how good is the model
pred_test = model.predict(X_test)
score = f1_score(y_test, pred_test)
>>> print(f'F1-Score: {score:.4f}')
F1-Score: 0.9199
The F1 score reaches its best value at 1 and its worst at 0. We obtained 0.9199, so our model has good accuracy. So far so good.
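For reference, the F1 score is the harmonic mean of precision and recall. If you want to see the two components behind that figure, scikit-learn provides them directly (this step is optional):
from sklearn.metrics import precision_score, recall_score

# F1 is the harmonic mean of precision and recall: 2 * p * r / (p + r)
precision = precision_score(y_test, pred_test)
recall = recall_score(y_test, pred_test)
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, '
      f'F1: {2 * precision * recall / (precision + recall):.4f}')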
Transform into ONNX#
In order to calculate the scoring we need to transform the model to ONNX format first. We will use the skl2onnx.to_onnx function of the sklearn-onnx library for that.
# Transform to ONNX format
onnx_model = to_onnx(
model,
X_train.astype(np.float32),
target_opset=13,
)
We now proceed to the calculation of the score. This is done by modifying the ONNX graph, removing the current output of the model and adding nodes to compute the desired output.
# Remove all defined outputs
while onnx_model.graph.output:
_ = onnx_model.graph.output.pop()
# Remove node ZipMap since it won't be necessary
n_nodes = len(onnx_model.graph.node)
for i in range(n_nodes):
if onnx_model.graph.node[i].name == 'ZipMap':
del onnx_model.graph.node[i]
break
# Define the outputs by adding proper nodes
onnx_model.graph.node.append(
make_node(
'Constant',
inputs=[],
outputs=['output_pos'],
value_int=0,
)
)
onnx_model.graph.node.append(
make_node(
'ArrayFeatureExtractor',
inputs=['probabilities', 'output_pos'],
outputs=['output_probability_at'],
domain=ONNX_ML_DOMAIN,
)
)
onnx_model.graph.output.append(
make_tensor_value_info(
name='output_probability_at',
elem_type=TensorProto.FLOAT,
shape=[-1, 1],
)
)
Note
Refer to the ONNX documentation to learn more about how to manipulate an ONNX graph.
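If you want to inspect the modified graph before using it, the onnx package offers a structural checker and a human-readable dump of the graph (an optional sanity check):
import onnx

# Verify the modified graph is structurally valid
onnx.checker.check_model(onnx_model)
# Print a readable summary of the graph, including the new output
print(onnx.helper.printable_graph(onnx_model.graph))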
We can check whether the transformed model works correctly by comparing the predictions of model and onnx_model.
# Predict with ONNX model
session = InferenceSession(onnx_model.SerializeToString())
input_name = session.get_inputs()[0].name
result = session.run(None, {input_name: X_test.astype(np.float32)})
onnx_scores = result[0].reshape(-1)
# Predict with model
scores = model.predict_proba(X_test)[:, 0]
# Compare predictions
threshold = 1e-3
prediction_validation = (np.abs(scores - onnx_scores) < threshold).all()
>>> print(f'Predictions are similar: {prediction_validation}')
Predictions are similar: True
Great, it seems our onnx_model is valid, so now let's save it.
with open(MODEL_FILE, 'wb') as fp:
fp.write(onnx_model.SerializeToString())
Register the model#
Once the model has been built and saved, it must be registered on the Devo platform in order to exploit it. We will use the ML Model Manager Client for that.
# create the mlmm client
mlmm = create_client_from_token(DEVO_MLMM_URL, DEVO_TOKEN)
# register the model
mlmm.add_model(
MODEL_NAME,
engines.ONNX,
MODEL_FILE,
description=MODEL_DESCRIPTION,
force=True
)
Note
Refer to the User’s Guide of this documentation to learn more about the ML Model Manager Client.
We are now ready to exploit our model, i.e. to score domains according to how malicious they are.
Scoring domains#
One way to evaluate a model is to use the mlevalmodel(...) operator when querying a table. This operator, available in the Devo query engine, is capable of evaluating machine learning models.
We are going to use the demo.ecommerce.data table, which contains the referralUri field, from which we can extract the domain we want to score.
A suitable query would be something like this.
query = f'''from demo.ecommerce.data
select
eventdate,
split(referralUri, "/", 2) as domain
group by domain every -
select
float4(length(domain)) as length,
float4(shannonentropy(domain)) as entropy,
float4(countbyfilter(domain, "aeiouAEIOU") / length) as vowel_proportion,
at(mlevalmodel(
"{DOMAIN}",
"{MODEL_NAME}",
[length, vowel_proportion, entropy]
), 0) as score
'''
Note
Refer to Build a query using LINQ to learn more about queries.
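If you want to spot-check a single domain locally before running the query, you can reproduce the same three features in Python and evaluate them against the saved ONNX file. This is a minimal sketch; the domain below is just an example, and the feature computation mirrors the query above.
def score_domain(domain):
    """Score a single domain with the local ONNX model (illustrative helper)."""
    features = np.array([[
        len(domain),
        sum(domain.lower().count(v) for v in "aeiou") / len(domain),
        entropy(domain),
    ]], dtype=np.float32)
    session = InferenceSession(MODEL_FILE)
    input_name = session.get_inputs()[0].name
    return float(session.run(None, {input_name: features})[0][0][0])

print(score_domain("www.logtrust.com"))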
Now we just need to create access to the Devo API and launch the query.
With the Devo Python SDK, we can execute queries against the Devo platform easily and securely.
# create a Devo API client
api = Client(
auth={"token": DEVO_TOKEN},
address=DEVO_API_URL,
config=ClientConfig(
response="json/simple/compact",
stream=True,
processor=SIMPLECOMPACT_TO_OBJ
)
)
response = api.query(
query=query,
dates={'from': 'now() - 1 * hour()', 'to': 'now()'}
)
for row in response:
    print(f"{row['domain']} -> {row['score']}")
You will see scores like the following, depending on the contents of the demo.ecommerce.data table.
>>>
www.bing.com -> 0.0182034969329834
www.google.com -> 0.5790193676948547
www.logcasts.com -> 0.24745863676071167
www.logtrust.com -> 0.28057998418807983
...
Note
Refer to Query API to learn more about the Devo Query API.