From Olivier Grisel <>
Subject Re: Efficient dictionary storage in memory
Date Sat, 16 Jan 2010 14:54:07 GMT
2010/1/16 Sean Owen <>:
> 351MB isn't so bad.
> I do think the next-best idea to explore is a trie, which could use a
> char->Object map data structure provided by our new collections
> module? To the extent this data is more compact when encoded in UTF-8,
> it will be *much* more compact encoded in a trie.

A more radical way to solve this dictionary memory issue would be to
use a hashed representation of the term counts, or maybe a less
radical yet more complicated-to-implement approach such as Counting
Filters (a variant of Bloom Filters).
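
To make the two ideas concrete, here is a rough sketch (not existing
Mahout code). HashedTermCounts and CountingFilter are hypothetical names,
and the hash functions are arbitrary choices made for illustration. The
first class indexes a fixed-size array by a hash of the term, so no
dictionary is kept at all; the second keeps several counters per term and
reads the smallest one back as the count estimate, which is an assumption
about how a counting filter would be used for term counts.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: term counts stored in a fixed-size array indexed by
// a hash of the term. Distinct terms may collide, so counts can be
// over-estimated but never under-estimated.
final class HashedTermCounts {
    private final int[] counts;

    HashedTermCounts(int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        this.counts = new int[numBuckets];
    }

    // 32-bit FNV-1a over the UTF-8 bytes, reduced to a bucket index.
    int bucket(String term) {
        int h = 0x811c9dc5;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return Math.floorMod(h, counts.length);
    }

    void add(String term) {
        counts[bucket(term)]++;
    }

    int count(String term) {
        return counts[bucket(term)];
    }
}

// Hypothetical sketch of a counting filter: each term touches numHashes
// counters, and the minimum of those counters is used as the estimate.
final class CountingFilter {
    private final int[] counters;
    private final int numHashes;

    CountingFilter(int numCounters, int numHashes) {
        if (numCounters <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.counters = new int[numCounters];
        this.numHashes = numHashes;
    }

    // Double hashing: the i-th index is derived from two base hashes.
    private int index(String term, int i) {
        int h1 = term.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    void add(String term) {
        for (int i = 0; i < numHashes; i++) {
            counters[index(term, i)]++;
        }
    }

    // Upper bound on the number of times the term was added.
    int estimate(String term) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numHashes; i++) {
            min = Math.min(min, counters[index(term, i)]);
        }
        return min;
    }
}
```

In both cases memory is fixed up front by the array size rather than
growing with the vocabulary; the price is that collisions can inflate
counts.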

Maybe it would be best implemented by extracting the public API of
DictionaryVectorizer as an interface TermVectorizer or just Vectorizer
and providing alternative implementations such as HashingVectorizer
and CountingFiltersVectorizer (though I haven't checked yet if they
are iso-functional even setting aside the conflict / false negative
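
A rough sketch of what such an extraction could look like. The interface
and class names are the ones proposed above, but the vectorize signature
is purely hypothetical, since the actual public API of DictionaryVectorizer
is not quoted in this thread. Only the hashing variant is shown, and it
relies on String.hashCode() for brevity.

```java
import java.util.List;

// Hypothetical interface; the method signature is an assumption made for
// illustration, not the real DictionaryVectorizer API.
interface TermVectorizer {
    // Turn a tokenized document into a fixed-length vector of term counts.
    double[] vectorize(List<String> tokens);
}

// Dictionary-free implementation: each token increments the slot its hash
// points to, so colliding terms share a slot.
final class HashingVectorizer implements TermVectorizer {
    private final int numFeatures;

    HashingVectorizer(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.numFeatures = numFeatures;
    }

    @Override
    public double[] vectorize(List<String> tokens) {
        double[] vector = new double[numFeatures];
        for (String token : tokens) {
            vector[Math.floorMod(token.hashCode(), numFeatures)] += 1.0;
        }
        return vector;
    }
}
```

Callers would depend only on TermVectorizer, so a dictionary-backed or
counting-filter-backed implementation could be swapped in behind the same
interface.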

Olivier

