mahout-dev mailing list archives

From Olivier Grisel <olivier.gri...@ensta.org>
Subject Re: Efficient dictionary storage in memory
Date Sat, 16 Jan 2010 14:54:07 GMT
2010/1/16 Sean Owen <srowen@gmail.com>:
> 351MB isn't so bad.
>
> I do think the next-best idea to explore is a trie, which could use a
> char->Object map data structure provided by our new collections
> module? To the extent this data is more compact when encoded in UTF-8,
> it will be *much* more compact encoded in a trie.

A more radical way to solve this dictionary memory issue would be to
drop the dictionary and use a hashed representation of the term counts:
http://hunch.net/~jl/projects/hash_reps/index.html
A less radical, but more complicated to implement, approach would be
Counting Filters (a variant of Bloom Filters
http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters ).

Maybe it would be best implemented by extracting the public API of
DictionaryVectorizer as an interface, TermVectorizer or just Vectorizer,
and providing alternative implementations such as HashingVectorizer
and CountingFiltersVectorizer (though I haven't checked yet whether they
are functionally equivalent, even setting aside the collision / false
positive probabilities).
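
A rough sketch of what such an interface split could look like, again
in Java. Everything here is hypothetical: the method signature, the
sparse Map<Integer, Integer> return type and the two nested classes are
assumptions for illustration, not the actual DictionaryVectorizer API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VectorizerSketch {
    /** Hypothetical common interface: tokens in, sparse (feature index -> count) vector out. */
    interface Vectorizer {
        Map<Integer, Integer> vectorize(List<String> tokens);
    }

    /** Dictionary-backed: exact indices, but memory grows with the vocabulary. */
    static class DictionaryBacked implements Vectorizer {
        private final Map<String, Integer> dictionary = new HashMap<>();

        public Map<Integer, Integer> vectorize(List<String> tokens) {
            Map<Integer, Integer> vector = new HashMap<>();
            for (String t : tokens) {
                Integer index = dictionary.get(t);
                if (index == null) {
                    index = dictionary.size(); // next free index for an unseen term
                    dictionary.put(t, index);
                }
                vector.merge(index, 1, Integer::sum);
            }
            return vector;
        }
    }

    /** Hashing-based: no dictionary, fixed feature space, colliding terms share an index. */
    static class Hashing implements Vectorizer {
        private final int numFeatures;

        Hashing(int numFeatures) {
            this.numFeatures = numFeatures;
        }

        public Map<Integer, Integer> vectorize(List<String> tokens) {
            Map<Integer, Integer> vector = new HashMap<>();
            for (String t : tokens) {
                vector.merge(Math.floorMod(t.hashCode(), numFeatures), 1, Integer::sum);
            }
            return vector;
        }
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a");
        System.out.println(new DictionaryBacked().vectorize(tokens));
        System.out.println(new Hashing(16).vectorize(tokens));
    }
}
```

One difference this makes visible: the dictionary-backed version can map
a feature index back to its term, while the hashing version cannot, so
the two are not drop-in equivalents for callers that need that lookup.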

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
