spot-user mailing list archives

From "Lujan Moreno, Gustavo" <>
Subject Re: Evaluating the ML for netflow, proxy, dns
Date Tue, 06 Jun 2017 14:19:25 GMT

Yes, "--ldamaxiterations 20" is the iteration parameter. You should change that 20 to something
higher: at least 100, ideally 200 or more.

You can leave the default --dupfactor (1000) for now. This is only when the user provides

The --threshold (1e-6) parameter represents what in supervised learning is called the cutoff.
Everything below that value is classified as “suspicious” and everything above as “normal”.
Let’s say that you have only 5 records with probabilities 0.1, 0.2, 0.3, 0.8 and 0.9. If the threshold
is set to 0.4, you will get back 3 suspicious records. However, because it is very difficult
to know the right threshold, and it may vary from dataset to dataset, I recommend you change it to
1 (in theory this will bring back the whole dataset) and then control the number of records you want
to see with the maxResults parameter. These two values are actually taken as parameters,
something like:  20160128 flow 1 200. In this case you are saying: set the threshold to 1
(bring everything) but show me only the 200 most suspicious records.
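
To make the interplay between the threshold and maxResults concrete, here is a minimal Python sketch of the logic described above. It is only an illustration, not the actual Spot code: the function name most_suspicious and its arguments are made up for this example, and it assumes that "below" means strictly below and that a lower probability means a more suspicious record.

```python
def most_suspicious(scores, threshold, max_results):
    """Return up to max_results scores strictly below threshold, lowest first.

    Hypothetical helper for illustration; not part of the Spot code base.
    """
    flagged = sorted(s for s in scores if s < threshold)
    return flagged[:max_results]


scores = [0.1, 0.2, 0.3, 0.8, 0.9]

# Threshold 0.4 alone: the three records below the cutoff come back.
cutoff_only = most_suspicious(scores, threshold=0.4, max_results=len(scores))

# Threshold 1 keeps every record; max_results then limits how many you see.
top_two = most_suspicious(scores, threshold=1, max_results=2)

print(cutoff_only)  # [0.1, 0.2, 0.3]
print(top_two)      # [0.1, 0.2]
```

With the threshold at 1, the only knob that matters is max_results, which is why it is easier to reason about than guessing a good cutoff for each dataset.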

--ldatopiccount (20): You are right in the explanation you gave. In my internal experiments,
proxy works well with 5 topics, and for netflow anything between 10 and 20 topics is
OK. In our case I have observed that performance degrades beyond 70 topics, but as you say
this number will depend on the variety of your network.

Alpha and beta can also be changed. However, for now I recommend you just leave the defaults.
If you are not getting good results, the number of iterations and topics should be your priority.
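
As a rough summary of the starting points in this message, the sketch below collects them per data type and prints them as flag strings. The flag names come from this thread; the dictionary layout and the as_flags helper are hypothetical, and how the flags actually reach the job depends on your setup. The values are only starting points to tune from (20 topics for flow is simply the default, which falls inside the 10-20 range).

```python
# Starting points taken from the advice in this message.
# The structure and the helper below are illustrative only.
SUGGESTED = {
    "flow": {"--ldamaxiterations": 200, "--ldatopiccount": 20, "--dupfactor": 1000},
    "proxy": {"--ldamaxiterations": 200, "--ldatopiccount": 5, "--dupfactor": 1000},
}


def as_flags(data_type):
    """Render the suggested values for one data type as a flag string."""
    return " ".join(f"{flag} {value}" for flag, value in SUGGESTED[data_type].items())


print(as_flags("proxy"))
```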



On 6/6/17, 1:55 AM, "Giacomo Bernardi" <> wrote:

>Thanks for the suggestions.
>> If I remember correctly the default is 20 but that is too low. I recommend running
>> at the very least 100 iterations.
>You mean "--ldamaxiterations 20" right?
>> There are other hyper parameters that need to be tuned like the alpha, beta and the
>> number of topics.
>What about the following? I'm leaving them at the defaults (in brackets):
>  --threshold (1e-6)
>  --dupfactor (1000)
>  --ldatopiccount (20)
>In particular, shouldn't the topic count be in some way proportional
>to the "variety" of hosts in my network? Intuitively, if LDA operates
>on a very heterogeneous network (say: a large corporate network with
>engineering labs, office users, datacenters and whatnot) then the
>characterisation of network conversations into topics would be very