lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Indexing large files? - No answers yet...
Date Fri, 11 Sep 2009 18:02:47 GMT
On Fri, Sep 11, 2009 at 1:15 PM,  <> wrote:

> I've been testing out "paging" the document this past week.  I'm
> still working on getting a successful test and think I'm close.  The
> downside was a drastic slowdown in indexing speed, and lots of
> open files, but that was expected.

You mean a slowdown in indexing speed because you now flush after
every page, not after every document, right?  That's expected.
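
For concreteness, here's a rough, untested sketch of that paging
approach against the 2.9-era API (the field names "fileId"/"page"/
"body" and the 1M-char page size are placeholders of mine, not
anything you said you're using):

import java.io.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PagedIndexer {
  static final int PAGE_CHARS = 1 << 20;  // ~1M chars per "page" document

  public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("index")),
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(16.0);  // flush segments well before the heap fills

    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    char[] buf = new char[PAGE_CHARS];
    int page = 0, read;
    while ((read = in.read(buf, 0, buf.length)) > 0) {
      Document doc = new Document();
      // Same fileId on every page, so the pages can be re-grouped at search time
      doc.add(new Field("fileId", args[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("page", Integer.toString(page++), Field.Store.YES, Field.Index.NO));
      doc.add(new Field("body", new String(buf, 0, read), Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    in.close();
    writer.close();
  }
}

Each addDocument call buffers only one page's worth of terms, so RAM
usage stays bounded by ramBufferSizeMB plus one page, no matter how
large the file is.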

But I'm not sure why you'd see a change in the number of open files...

> I tried with small mergeFactors, maxBufferedDocs (I haven't tried 1
> though), and ramBufferSizeMB.  Watching heap usage in JConsole, I
> can see this method slowly creep toward my max heap space until
> OOM.  I can say that at least some of the document gets indexed
> before OOM.  So I performed a heap dump at OOM and saw that
> FreqProxTermsWriterPerField had by far consumed the most memory.  I
> haven't looked into that yet...

It's at least ~60 bytes per unique term, not counting the char[] to
hold the term, and FreqProxTermsWriterPerField is exactly where most
of those bytes are allocated (e.g., its PostingList class).
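
As a rough back-of-envelope (my numbers, not from your heap dump): a
single document with, say, 20 million unique terms would need on the
order of 20M x 60 bytes = ~1.2 GB for that per-term bookkeeping
alone, before counting the term text itself.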

> Let's say I page the document into ten different smaller documents
> and they are indexed successfully (I'm not quite at this point yet).
> Is there a way to select documents by id and merge them into one
> large document after they are in the index?  That was my plan to
> work around OOM and achieve the same end result as trying to index
> the large document in one shot.

You mean at search time, right?  You basically want the equivalent of
SQL's "group by".

You could make a custom Collector that does this... or look at how
Solr is iterating on field collapsing.
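
A bare-bones Collector along those lines might look like this (a
rough, untested sketch against the 2.9 Collector API; the "fileId"
field and the FieldCache usage assume you indexed one un-tokenized id
per page, which is my assumption, not something you described):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

public class GroupByFileCollector extends Collector {
  private final Map<String,Float> bestScorePerFile = new HashMap<String,Float>();
  private Scorer scorer;
  private String[] fileIds;  // per-segment fileId values, loaded from the FieldCache

  public void setScorer(Scorer scorer) { this.scorer = scorer; }

  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    fileIds = FieldCache.DEFAULT.getStrings(reader, "fileId");
  }

  public void collect(int doc) throws IOException {
    String id = fileIds[doc];
    float score = scorer.score();
    Float best = bestScorePerFile.get(id);
    if (best == null || score > best.floatValue())
      bestScorePerFile.put(id, score);  // keep only the best-scoring page per file
  }

  public boolean acceptsDocsOutOfOrder() { return true; }

  public Map<String,Float> getGroups() { return bestScorePerFile; }
}

You'd run it as searcher.search(query, new GroupByFileCollector()),
then sort getGroups() by score yourself to get one hit per original
file.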


