lucenenet-user mailing list archives

From "Digy" <digyd...@gmail.com>
Subject RE: Big data Suggestions. (Out of Memory)
Date Mon, 24 May 2010 18:58:53 GMT
Try using "Commit" periodically (for example, at every 10,000 docs).
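Roughly like this (just a sketch, assuming the 2.9.x API; the method name and the path are placeholders for your own code):

using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch: commit every 10,000 docs so buffered state doesn't pile up in RAM.
static void IndexInBatches(string indexPath, IEnumerable<Document> docs)
{
    var dir = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath));
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
    var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

    int count = 0;
    foreach (Document doc in docs)
    {
        writer.AddDocument(doc);
        if (++count % 10000 == 0)
            writer.Commit();   // flush what is buffered to disk periodically
    }

    writer.Commit();           // commit the tail of the batch
    writer.Close();
}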
DIGY

-----Original Message-----
From: Josh Handel [mailto:Josh.Handel@catapultsystems.com] 
Sent: Monday, May 24, 2010 9:57 PM
To: lucene-net-user@lucene.apache.org
Subject: Big data Suggestions. (Out of Memory)

I hate to ping multiple times in the same day on this, but I wanted to add
something real quick.

With all those unique terms, I am now running out of memory (on my dev box)
when indexing. I am thinking this is caused by the sheer number of unique
terms (about 120,000 according to LUKE in the index after it crashed
locally). Is there a way to control the memory used by Lucene's caching?

(FYI: I am using Lucene.NET 2.9.2)
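(For reference, the only indexing-time knobs I know of look roughly like the sketch below; I am assuming SetRAMBufferSizeMB / SetMaxBufferedDocs are still the relevant IndexWriter settings in 2.9.2, and 64 MB is just an example value.)

using Lucene.Net.Index;

// Sketch: cap how much the writer buffers in RAM before flushing a segment.
static void CapWriterMemory(IndexWriter writer)
{
    writer.SetRAMBufferSizeMB(64.0);   // flush once roughly 64 MB is buffered (example value)
    writer.SetMaxBufferedDocs(10000);  // and/or flush by document count
}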

Thanks
Josh Handel


-----Original Message-----
From: Josh Handel [mailto:Josh.Handel@catapultsystems.com] 
Sent: Monday, May 24, 2010 1:22 PM
To: lucene-net-user@lucene.apache.org
Subject: Big data Suggestions.

Guys,
   I am working on a Lucene index to allow some backend processes access to
some post-processing-type data. The result is a document that looks
something like this:


*         ProfileID (Long.ToString())

*         Delimited array of FKs (int.ToString(), delimited and tokenized)

*         Multiple delimited arrays of strings (each array gets its own field
name, delimited and tokenized)

*         Delimited array of about 150 ints between 0 and 1600
(int.ToString(), delimited and tokenized)

(This is a bolt-on to an existing app, so we have limited control over its
data model; the above document is the best we could come up with to describe
our data in a way Lucene might like. A rough sketch of how I picture it
mapping to Lucene.NET fields is below.)
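To make that shape concrete, here is roughly how I picture building one of these documents (just a sketch; the field names other than ProfileID are made up, and I am assuming the 2.9.x Document/Field API):

using Lucene.Net.Documents;

// Sketch of one document: ProfileID indexed as a single un-analyzed term (kept
// around for updates), the delimited arrays as tokenized fields, nothing but the ID stored.
static Document BuildProfileDoc(long profileId, string fks, string tags, string scores)
{
    var doc = new Document();
    doc.Add(new Field("ProfileID", profileId.ToString(),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("FKs", fks,          // e.g. "12 345 6789" (delimited int.ToString() values)
                      Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("Tags", tags,        // one of the delimited string arrays
                      Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field("Scores", scores,    // ~150 ints between 0 and 1600, not stored
                      Field.Store.NO, Field.Index.ANALYZED));
    return doc;
}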

We have about 130 million of these records. We don't "need" the ProfileID to
be indexed except for doing updates, and we won't be storing the array of
unique ints.
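The update path I have in mind is just replace-by-ProfileID, roughly like this (a sketch; it assumes ProfileID was indexed as a single NOT_ANALYZED term):

using Lucene.Net.Documents;
using Lucene.Net.Index;

// Sketch: delete the old document with this ProfileID and add the new one in a single call.
static void UpdateProfile(IndexWriter writer, long profileId, Document newDoc)
{
    writer.UpdateDocument(new Term("ProfileID", profileId.ToString()), newDoc);
}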

My concern is the terms database: with all those ProfileIDs in the terms
data, that's 130 million terms before we even look at what we actually care
to search (the list of FKs and the list of ints between 0 and 1600).

I was wondering if anyone had suggestions on this model, or ways to manage
the potential size of our terms list?

Thanks in advance.
Josh Handel
Senior Lead Consultant
512.328.8181 | Main
512.328.0584 | Fax
512.577.6568 | Cell
www.catapultsystems.com

CATAPULT SYSTEMS INC.
ENABLING BUSINESS THROUGH TECHNOLOGY

