lucene-dev mailing list archives

From Tatu Saloranta <cowtownco...@yahoo.com>
Subject Re: Optimizing/minimizing memory usage of memory-based indexes
Date Sun, 12 Feb 2006 06:16:22 GMT
--- Wolfgang Hoschek <wolfgang.hoschek@mac.com> wrote:
> Hi Tatu,
> 
> I take it that simply maintaining the frequencies in
> a hashmap  
> similar to  
>
org.apache.lucene.index.memory.AnalyzerUtil.getMostFrequentTerms()
>  
> isn't sufficient for your usecases?

Initially it might be, but eventually probably not. I was also
thinking Lucene's index format might be a bit more compact than
vanilla hash maps, but I guess that depends on many factors. And I
will probably want to play with actual queries later on, based on the
frequencies.
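For reference, the HashMap approach Wolfgang mentions looks roughly like this in plain Java (a minimal sketch; the whitespace tokenizer here is just a stand-in for whatever Analyzer the index actually uses):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of plain-HashMap term counting. The whitespace split is a
// stand-in "analyzer"; a real setup would reuse the same Lucene Analyzer
// that feeds the index, so the counts match the indexed terms.
public class TermFreqs {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                Integer n = freqs.get(term);
                freqs.put(term, n == null ? 1 : n + 1);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> f = count("to be or not to be");
        // Each entry pays object-header and boxing overhead per term, which
        // is one reason Lucene's packed postings can end up more compact
        // than a vanilla HashMap for large vocabularies.
        System.out.println(f.get("to") + " " + f.get("be") + " " + f.get("or"));
        // → 2 2 1
    }
}
```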

> In the latter case, are you using  
> org.apache.lucene.store.RAMDirectory or  
> org.apache.lucene.index.memory.MemoryIndex?

I'm using RAMDirectory. Should I maybe be using MemoryIndex instead?
(I'll check it out.)

Thanks!

-+ Tatu +-

> 
> Wolfgang.
> 
> On Feb 10, 2006, at 12:29 PM, Tatu Saloranta wrote:
> 
> > I am building a simple classifier system, using Lucene
> > essentially to efficiently+incrementally calculate
> > term frequencies.
> > (due to input variations, I am currently creating a
> > separate index for each attribute, although I guess I
> > could (should?) just use a different field for each
> > attribute)
> >
> > Now, one potential problem I have is that although
> > memory usage is probably sub-linear (I just index
> > terms, don't store; vocabulary grows sub-linearly),
> > and thus actual memory used should not grow too fast,
> > the way Lucene builds and merges indexes fluctuates: I
> > assume memory usage mostly changes when merging
> > segments. I have simple diagnostics for memory usage
> > that force gc every 1000 documents processed [yes, I
> > know that System.gc() does not strictly guarantee it,
> > but in practice it is good enough], and notice usage
> > fluctuating a bit, with an overall increase, but a 10%
> > drop every 12000 documents or so, with default
> > settings.
> >
> > So... I am essentially wondering if there are good
> > techniques for tuning memory usage (minimizing index
> > structure size) adaptively, to avoid running out of
> > memory, in cases where compacting the index would
> > avoid the out-of-memory case.
> >
> > Further, are there possibilities to perhaps trade
> > reduced memory usage for slightly slower indexing? (or
> > even better, searching -- in my case, I only traverse
> > term indexes to get counts). IndexWriter.optimize()
> > probably does not really help here, does it?
> >
> > -+ Tatu +-
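The buffer-then-merge behaviour described in the quoted message can be sketched with a toy model. The parameter names maxBufferedDocs and mergeFactor echo Lucene's tuning knobs, but the merge policy below is a simplified assumption for illustration, not Lucene's actual implementation; it just shows why the number of live segments (and hence memory) sawtooths with periodic large drops:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of segment buffering and merging (simplified assumption, not
// Lucene's real merge policy): docs are buffered in RAM, flushed as a
// segment every maxBufferedDocs, and mergeFactor equal-sized segments
// are merged into one. Segment count rises steadily, then drops sharply
// whenever a cascade of merges fires -- the fluctuation Tatu observes.
public class MergeModel {
    final int maxBufferedDocs;  // docs held in RAM before a segment is flushed
    final int mergeFactor;      // equal-sized segments that trigger a merge
    final List<Integer> segments = new ArrayList<>(); // doc count per segment
    int buffered = 0;

    MergeModel(int maxBufferedDocs, int mergeFactor) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.mergeFactor = mergeFactor;
    }

    void addDoc() {
        buffered++;
        if (buffered == maxBufferedDocs) {
            segments.add(buffered);
            buffered = 0;
            maybeMerge();
        }
    }

    // Merge the newest mergeFactor segments whenever they are equal-sized;
    // repeat so merges cascade up the levels (10 tens -> 100, 10 hundreds
    // -> 1000, ...).
    void maybeMerge() {
        boolean merged = true;
        while (merged) {
            merged = false;
            int n = segments.size();
            if (n >= mergeFactor) {
                int size = segments.get(n - 1);
                boolean allEqual = true;
                for (int i = n - mergeFactor; i < n; i++) {
                    if (segments.get(i) != size) { allEqual = false; break; }
                }
                if (allEqual) {
                    int total = 0;
                    for (int i = 0; i < mergeFactor; i++) {
                        total += segments.remove(segments.size() - 1);
                    }
                    segments.add(total);
                    merged = true;
                }
            }
        }
    }

    int segmentCount() { return segments.size(); }

    public static void main(String[] args) {
        MergeModel m = new MergeModel(10, 10);
        int peak = 0;
        for (int i = 0; i < 12000; i++) {
            m.addDoc();
            peak = Math.max(peak, m.segmentCount());
        }
        // Lowering maxBufferedDocs/mergeFactor trades indexing speed for a
        // flatter profile: fewer docs sit in RAM between merges.
        System.out.println("segments=" + m.segmentCount() + " peak=" + peak);
    }
}
```

In this model the big periodic drops happen exactly when a full cascade fires (every maxBufferedDocs * mergeFactor^k documents), which is consistent with seeing a noticeable drop only every several thousand documents under default settings.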
> >



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

