lucenenet-user mailing list archives

From "Digy" <digyd...@gmail.com>
Subject RE: Big data Suggestions. (Out of Memory)
Date Tue, 25 May 2010 16:43:04 GMT
It may be related to this issue
(https://issues.apache.org/jira/browse/LUCENENET-358)

http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201005.mbox/%3CAANLkTinwf5JCjSqZmBBNCsQ_jHxQJgH4Ktehr-0UyGWF@mail.gmail.com%3E

Can you try Lucene.Net 2.9.2.2 (in trunk or the 2.9.2 tag; I updated it
yesterday)?

DIGY

-----Original Message-----
From: Josh Handel [mailto:Josh.Handel@catapultsystems.com] 
Sent: Tuesday, May 25, 2010 4:55 PM
To: lucene-net-user@lucene.apache.org
Subject: RE: Big data Suggestions. (Out of Memory)

I set the commit to every 1000 documents and changed to just one thread
writing indexes, and it still marches right up to an out-of-memory exception.

I'm also using IndexWriter.SetRAMBufferSizeMB, but it doesn't seem to make a
bit of difference either.
(Here is how I am newing up my writer:)

Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(
    dir, analyzer, new IndexWriter.MaxFieldLength(10000));
indexWriter.SetRAMBufferSizeMB(128);   // flush a segment once buffered docs reach ~128 MB

LogByteSizeMergePolicy lbsmp = new LogByteSizeMergePolicy(indexWriter);
lbsmp.SetMaxMergeMB(5);                // segments larger than 5 MB are left out of merges
lbsmp.SetMinMergeMB(10240);            // segments below 10 GB all count as the lowest merge level
lbsmp.SetMergeFactor(10);              // merge once 10 segments pile up at a level
indexWriter.SetMergePolicy(lbsmp);

ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
scheduler.SetMaxThreadCount(15);       // allow up to 15 background merge threads
indexWriter.SetMergeScheduler(scheduler);

This is some pretty tweaky stuff I am doing here (based on Lucene in Action
and what I can figure out from the API docs), so if I am doing it wrong, I am
all ears to learn the right way :-)

So what other options are there to keep these indexes from blowing up in
memory? I don't mind this taking a lot of RAM, as long as it doesn't take so
much that it crashes. Just a way to cap the upper limit of the RAM it uses
would be awesome! :-)


Thanks!
Josh
PS: I will be out of pocket most of the day, so any suggestions that come in
I won't be able to try until tomorrow morning.

-----Original Message-----
From: Digy [mailto:digydigy@gmail.com] 
Sent: Monday, May 24, 2010 1:59 PM
To: lucene-net-user@lucene.apache.org
Subject: RE: Big data Suggestions. (Out of Memory)

Try calling Commit() periodically (for example, every 10,000 docs).
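A rough sketch of what that might look like (the "records" source and the
BuildDocument helper are just placeholders):

int count = 0;
foreach (var record in records)
{
    indexWriter.AddDocument(BuildDocument(record));  // BuildDocument is a hypothetical helper
    if (++count % 10000 == 0)
    {
        indexWriter.Commit();   // flush buffered docs to the index on disk
    }
}
indexWriter.Commit();           // final commit for any remaining docs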
DIGY

-----Original Message-----
From: Josh Handel [mailto:Josh.Handel@catapultsystems.com] 
Sent: Monday, May 24, 2010 9:57 PM
To: lucene-net-user@lucene.apache.org
Subject: Big data Suggestions. (Out of Memory)

I hate to ping the list multiple times on the same day about this, but I
wanted to add something real quick.

With all those unique terms, I am now running out of memory (on my dev box)
when indexing. I am thinking this is caused by the hundreds of thousands of
unique terms (about 120,000 according to Luke in the index after it crashed
locally). Is there a way to control the memory used by Lucene's caching?

(FYI: I am using Lucene.NET 2.9.2)

Thanks
Josh Handel


-----Original Message-----
From: Josh Handel [mailto:Josh.Handel@catapultsystems.com] 
Sent: Monday, May 24, 2010 1:22 PM
To: lucene-net-user@lucene.apache.org
Subject: Big data Suggestions.

Guys,
   I am working on a Lucene index to give some backend processes access to
some post-processing-type data. The result is a document that looks something
like this:


* ProfileID (Long.ToString())
* Delimited array of FKs (int.ToString(), delimited and tokenized)
* Multiple delimited arrays of strings (each array has its own field name;
  delimited and tokenized)
* Delimited array of about 150 ints between 0 and 1600 (int.ToString(),
  delimited and tokenized)

(This is a bolt-on to a current app, so we have limited control over its data
model, and the above document is the best we could come up with to describe
our data in a way Lucene might like.)
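
Roughly, building one of these documents might look like the sketch below
(the field names, the string arrays, and the space-delimited joins are just
illustrative; assumes using Lucene.Net.Documents):

Document doc = new Document();
// Stored, and indexed un-analyzed only so updates can target it by term.
doc.Add(new Field("ProfileID", profileId.ToString(),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
// Space-delimited FK list, tokenized by the analyzer; not stored.
doc.Add(new Field("FKs", string.Join(" ", fkStrings),
                  Field.Store.NO, Field.Index.ANALYZED));
// One field per string array; only one shown here for brevity.
doc.Add(new Field("Tags", string.Join(" ", tagStrings),
                  Field.Store.NO, Field.Index.ANALYZED));
// The ~150 ints between 0 and 1600, delimited and tokenized; not stored.
doc.Add(new Field("Scores", string.Join(" ", scoreStrings),
                  Field.Store.NO, Field.Index.ANALYZED));
indexWriter.AddDocument(doc);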

We have about 130 million of these records. We don't "need" the ProfileID to
be indexed except for doing updates, and we won't be storing the array of
unique ints.

My concern is the terms database: with all those ProfileIDs in the terms
data, that's 130 million terms before we even get to what we actually care to
search (the list of FKs and the list of ints between 0 and 1600).

I was wondering if anyone had suggestions on this model, or ways to manage
the potential size of our terms list?

Thanks in advance.
Josh Handel
Senior Lead Consultant
512.328.8181 | Main
512.328.0584 | Fax
512.577.6568 | Cell
www.catapultsystems.com

CATAPULT SYSTEMS INC.
ENABLING BUSINESS THROUGH TECHNOLOGY



