From Daniel Quinlan <>
Subject Re: Bayes optimization?
Date Sat, 06 Mar 2004 20:17:50 GMT

Theo Van Dinter <> writes:

> Well, ok, but I was talking about using hash tokens in the code we
> have now.  For 3.0, we're not going to be replacing DB_File, and we're
> not going to write our own DB module

Perhaps with DB_File, and perhaps not--there are other options like SQL.
In addition, Michael has been experimenting with QDBM, which we could
easily use in this way.

> (frankly I don't think we should do that at all...)

Also, hashed tokens become closer to a requirement, for size reasons,
if we start combining multiple tokens into one feature the way CRM114
or DSPAM do.
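
To illustrate the size point, here is a rough Python sketch of
multi-token features.  It is not CRM114's or DSPAM's actual algorithm:
the adjacent-pair scheme, the helper names, and the use of CRC32 as the
hash are all assumptions made for the example.

```python
import zlib

def pair_features(tokens):
    """Join each adjacent pair of tokens into one combined feature.

    Hypothetical scheme for illustration only.
    """
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]

def hash32(feature):
    """Fixed-size 32-bit key for a feature of any length.

    CRC32 is only a stand-in for whatever hash would really be used.
    """
    return zlib.crc32(feature.encode("utf-8")) & 0xFFFFFFFF

words = ["click", "here", "to", "unsubscribe"]
pairs = pair_features(words)          # combined features get longer...
keys = [hash32(p) for p in pairs]     # ...but each key stays 32 bits

for feature, key in zip(pairs, keys):
    print(f"{feature!r:20} -> {key:#010x}")
```

The combined strings grow with every token added to a feature, while
the hashed key stays the same width no matter how many tokens go in.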

> BTW: I did a little more testing...  Took my 440k token bayes db and
> ran through it using DB_File in a while(...= each ...) loop.  Took 11.4
> seconds.  I then converted the DB to use crc64 hashed keys instead,
> but everything else exactly the same.  Then ran through the read-only
> loop from up above.  11.25 seconds.
> So if we combined the read time decrease with the CPU time increase
> from the hashing function, we end up taking an extra 0.2 seconds,

This is only a simple benchmark; real-world results could still be much
better or much worse, and you're still neglecting the major disk space
benefits of a 32-bit key.
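
As a back-of-the-envelope illustration of the disk space argument, the
Python sketch below compares the bytes needed for raw token keys with
fixed 4-byte hashed keys.  The sample tokens are made up, CRC32 is only
a stand-in hash, and the numbers ignore per-record overhead in the DB
format, so treat it as a sketch rather than a measurement.

```python
import struct
import zlib

def hashed_key(token):
    """Pack a token's CRC32 into a fixed 4-byte key (illustrative hash choice)."""
    return struct.pack(">I", zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF)

def key_bytes(tokens):
    """Return (raw_bytes, hashed_bytes) needed to store the keys alone."""
    raw = sum(len(t.encode("utf-8")) for t in tokens)
    hashed = sum(len(hashed_key(t)) for t in tokens)
    return raw, hashed

# Made-up sample tokens; a real Bayes DB would have hundreds of thousands.
sample = ["viagra", "unsubscribe", "mail.example.com", "content-type"]
raw, hashed = key_bytes(sample)
print(f"raw keys: {raw} bytes, hashed keys: {hashed} bytes")
```

Hashed keys cost exactly 4 bytes per token, so the saving depends
entirely on the average token length in the real database.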

> so it's still not worthwhile given the current code.

That seems like a premature conclusion.  If you only mean that we cannot
simply slap hashing into our current code, then I agree *that* would not
be worth it.

I think the likely lack of significant overhead shows that this idea is
still quite worthy of serious investigation.


Daniel Quinlan                     anti-spam (SpamAssassin), Linux, and open source consulting

