lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: heap memory issues when sorting by a string field
Date Fri, 11 Dec 2009 13:30:39 GMT
I've spend the last day working on a multipass order builder, where the
order is defined by a Collator and stored in an int-array. Compromising
a bit on the "minimal memory at all cost"-approach resulted in a fair
boost in speed, but it's still very slow for the first sorted search,
compared to Lucene's build-in Collator-based sorter.

One test-corpus was 2.9M documents (16GB index), sorting on title (2.1M
unique terms). The machine was a fairly new Xeon-something using SSDs.
Lucene 2.4 was used.

Memory-usage for the multipass order builder on first sort was
approximately
2.9M (#documents) * 4 bytes (int) + 
2.1M (#unique terms) * 4 bytes (int) +
50MB (size of the sliding window) +
2.9M (#documents) * 4 bytes (int)
= 12MB + 9MB + 50MB + 12MB ~= 100MB (rounded up).

The 50MB buffer required 3 passes through the terms, which took 47
seconds.
Lowering the buffer to 10MB resulted in 2 minutes build-time.
Upping to 500MB (all terms in memory) meant 31 seconds.

After that, only the order-array was remembered, which took
2.9M (#documents) * 4 bytes (int) = 12MB.

Performing subsequent sorted searches gave the following results:
"hest" (top 10 out of 1295 hits): 6-13ms
"bog" (top 10 out of 2446835 hits): 600-620ms


For comparison, Lucene's build-in sorter took only 8.5 seconds for first
sort search, with subsequent sorted search times
"hest" (top 10 out of 1295 hits): 9-18ms
"bog" (top 10 out of 2446835 hits): 2000-2900ms


Disclaimer: This was just a quick test. I have not checked properly how
the numbers change when the corpus is scaled up, how it works on
spinning drives and so on. All this is very much at the experimental
stage.

I'll continue a bit on this road, as some of our setups used nightly
batch-runs to update the index and thus fit well into the tradeoffs
outlined above. Regrettably we're also looking at 5-minute updates for a
larger index (10M documents-range), where the multipass approach is too
expensive.

- Toke

On Thu, 2009-12-10 at 11:45 +0100, Michael McCandless wrote:
> The big challenge w/ a global sort ords is handling reopen, ie, for
> apps that need to frequently reopen their index.
> 
> If you have a static index, then this approach is a good one -- you
> make a one time investment, after indexing, to compute your global
> ords, store them on disk.
> 
> Then at searching you load the ords in, and you can make a
> FieldComparator that taps into it (just has to re-base the per-segment
> docIDs).  No String values are needed...
> 
> Also, once we've done
> https://issues.apache.org/jira/browse/LUCENE-1990 (anyone out there
> wanna take a stab?  Nice self-contained utility API needed for
> Lucene....), that ord array can be very compact for cases where the
> number of unique Strings is smallish.  No reason to spend a full 4
> bytes per document...
> 
> Mike
> 
> On Wed, Dec 9, 2009 at 6:46 PM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> > Thanks for the heads-up, TCK. The Dietz & Sleator article I found at
> > http://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf
> > looks very interesting.
> >
> > String sorting in Lucene is indeed fairly expensive and we've experimented with
> > two solutions to this, none of which are golden bullets.
> >
> > 1) Store the order, not the content.
> >
> > Like the Lucene default sorter, we fetch all the terms for the sort field on first
> > sort. The docIDs are mapped to the terms and an complete sort is performed
> > (by using a custom Collator: We want to avoid using CollationKeys as they
> > take too much memory. But that's another story). When the sort is finished,
> > we have an integer-array where each entry is the relative position for a given
> > document (referenced by docID). We then de-reference the loaded terms
> > so that the memory is freed.
> >
> > Determining the order of two documents after the initial build is extremely
> > simple: order[docIDa] - order[docIDb] (or maybe it was the other way
> > around, I forget).
> >
> > Cons: Large temporarily memory overhead upon initial sort, though not as
> > much as using CollationKeys. Slower initial sort than Lucene build-in.
> > Pros: Very fast subsequent sorts, very low memory-overhead when running.
> > Joker: This approach could be used to perform a post-processing on a
> > generated index and storing the orders as a file along with the index, then
> > pushing the complete package to searchers. This would lower the memory
> > requirements for the searchers significantly.
> >
> >
> > 2) As above, but use a multipass order builder.
> >
> > For the first sort, the order must be calculated: A sliding window sorter is created
> > with a buffer of a fixed size. The window slides through all terms for the sort
field
> > and extracts the top X. The order-array is updated for the documents that has
> > one of these terms. The sliding is repeated multiple times, where terms ordered
> > before the last term of the previous iteration are ignored.
> >
> > Cons: _Very_ slow (too slow in the current implementation) order build.
> > Pros: Same as above.
> > Joker: The buffer size determines memory use vs. order build time.
> >
> >
> > The multipass approach looks promising, but requires more work to get to a
> > usable state. Right now it takes minutes to build the order-array for half a
> > million documents, with a buffer size requiring 5 iterations. If I ever get it to
> > work, I'll be sure to share it.
> >
> > Regards,
> > Toke Eskildsen
> >
> > ________________________________________
> > From: TCK [moonwatcher32329@gmail.com]
> > Sent: 09 December 2009 22:58
> > To: java-user@lucene.apache.org
> > Subject: Re: heap memory issues when sorting by a string field
> >
> > Thanks Mike for opening this jira ticket and for your patch. Explicitly
> > removing the entry from the WHM definitely does reduce the number of GC
> > cycles taken to free the huge StringIndex objects that get created when
> > doing a sort by a string field.
> >
> > But I'm still trying to figure out why it is necessary for lucene to load
> > into memory the sorted-by field of every single document in the index (which
> > is what makes the StringIndex objects so big) on the first query. Why isn't
> > it sufficient to load the field values for only the documents that match the
> > given search criteria? Perhaps by using an order-maintenance data structure
> > (Dietz&Sleator, or Bender et al) coupled with a balanced search tree,
> > instead of the simple lookup and order arrays currently used in StringIndex,
> > we can dynamically grow it as needed by successive queries rather than
> > loading everything on the first query.
> >
> > Apologies if this is a naive question... I'm a newbie to the lucene code and
> > can't wait for the new lucene book to come out:-)
> >
> > -TCK
> >
> >
> >
> >
> > On Tue, Dec 8, 2009 at 5:43 AM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> I've opened LUCENE-2135.
> >>
> >> Mike
> >>
> >> On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless
> >> <lucene@mikemccandless.com> wrote:
> >> > This is a rather disturbing implementation detail of WeakHashMap, that
> >> > it needs the one extra step (invoking one of its methods) for its weak
> >> > keys to be reclaimable.
> >> >
> >> > Maybe on IndexReader.close(), Lucene should go and evict all entries
> >> > in the FieldCache associated with that reader.  Ie, step through the
> >> > sub-readers, and if they are truly closed as well (not shared w/ other
> >> > readers), evict.  I'll open an issue.
> >> >
> >> > Even in TCK's code fragment, it's not until the final line is done
> >> > executing, that the cache key even loses all hard references, because
> >> > it's that line that assigns to sSearcher, replacing the strong
> >> > reference to the old searcher.  Inserting sSearcher = null prior to
> >> > that would drop the hard reference sooner, but because of this impl
> >> > detail of WeakHashMap, something would still have to touch it (eg, a
> >> > warmup query that hits the field cache) before it's reclaimable.
> >> >
> >> > Mike
> >> >
> >> > On Mon, Dec 7, 2009 at 7:38 PM, Tom Hill <solr-list@worldware.com>
> >> wrote:
> >> >> Hi -
> >> >>
> >> >> If I understand correctly, WeakHashMap does not free the memory for
the
> >> >> value (cached data) when the key is nulled, or even when the key is
> >> garbage
> >> >> collected.
> >> >>
> >> >> It requires one more step: a method on WeakHashMap must be called to
> >> allow
> >> >> it to release its hard reference to the cached data. It appears that
> >> most
> >> >> methods in WeakHashMap end up calling expungeStaleEntries, which will
> >> clear
> >> >> the hard reference. But you have to call some method on the map, before
> >> the
> >> >> memory is eligible for garbage collection.
> >> >>
> >> >> So it requires four stages to free the cached data. Null the key; A
GC
> >> to
> >> >> release the weak reference to the key; A call to some method on the
map;
> >> >> Then the next GC cycle should free the value.
> >> >>
> >> >> So it seems possible that you could end up with double memory usage
for
> >> a
> >> >> time. If you don't have a GC between the time that you close the old
> >> reader,
> >> >> and you start to load the field cache entry for the next reader, then
> >> the
> >> >> key may still be hanging around uncollected.
> >> >>
> >> >> At that point, it may run a GC when you allocate the new cache, but
> >> that's
> >> >> only the first GC. It can't free the cached data until after the next
> >> call
> >> >> to expungeStaleEntries, so for a while you have both caches around.
> >> >>
> >> >> This extra usage could cause things to move into tenured space. Could
> >> this
> >> >> be causing your problem?
> >> >>
> >> >> Workaround would be to cause some method to be called on the
> >> WeakHashMap.
> >> >> You don't want to call get(), since that will try to populate the cache.
> >> >> Maybe if you tried putting a small value to the cache, and doing a
GC,
> >> and
> >> >> see if your memory drops then.
> >> >>
> >> >>
> >> >> Tom
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Dec 7, 2009 at 1:48 PM, TCK <moonwatcher32329@gmail.com>
wrote:
> >> >>
> >> >>> Thanks for the response. But I'm definitely calling close() on
the old
> >> >>> reader and opening a new one (not using reopen). Also, to simplify
the
> >> >>> analysis, I did my test with a single-threaded requester to eliminate
> >> any
> >> >>> concurrency issues.
> >> >>>
> >> >>> I'm doing:
> >> >>> sSearcher.getIndexReader().close();
> >> >>> sSearcher.close(); // this actually seems to be a no-op
> >> >>> IndexReader newIndexReader = IndexReader.open(newDirectory);
> >> >>> sSearcher = new IndexSearcher(newIndexReader);
> >> >>>
> >> >>> Btw, isn't it bad practice anyway to have an unbounded cache? Are
there
> >> any
> >> >>> plans to replace the HashMaps used for the innerCaches with an
actual
> >> >>> size-bounded cache with some eviction policy (perhaps EhCache or
> >> something)
> >> >>> ?
> >> >>>
> >> >>> Thanks again,
> >> >>> TCK
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Mon, Dec 7, 2009 at 4:37 PM, Erick Erickson <
> >> erickerickson@gmail.com
> >> >>> >wrote:
> >> >>>
> >> >>> > What this sounds like is that you're not really closing your
> >> >>> > readers even though you think you are. Sorting indeed uses
up
> >> >>> > significant memory when it populates internal caches and keeps
> >> >>> > it around for later use (which is one of the reasons that
warming
> >> >>> > queries matter). But if you really do close the reader, I'm
pretty
> >> >>> > sure the memory should be GC-able.
> >> >>> >
> >> >>> > One thing that trips people up is IndexReader.reopen(). If
it
> >> >>> > returns a reader different than the original, you *must* close
the
> >> >>> > old one. If you don't, the old reader is still hanging around
and
> >> >>> > memory won't be returne.... An example from the Javadocs...
> >> >>> >
> >> >>> >  IndexReader reader = ...
> >> >>> >  ...
> >> >>> >  IndexReader new = r.reopen();
> >> >>> >  if (new != reader) {
> >> >>> >   ...     // reader was reopened
> >> >>> >   reader.close();
> >> >>> >  }
> >> >>> >  reader = new;
> >> >>> >  ...
> >> >>> >
> >> >>> >
> >> >>> > If this is irrelevant, could you post your close/open
> >> >>> >
> >> >>> > code?
> >> >>> >
> >> >>> > HTH
> >> >>> >
> >> >>> > Erick
> >> >>> >
> >> >>> >
> >> >>> > On Mon, Dec 7, 2009 at 4:27 PM, TCK <moonwatcher32329@gmail.com>
> >> wrote:
> >> >>> >
> >> >>> > > Hi,
> >> >>> > > I'm having heap memory issues when I do lucene queries
involving
> >> >>> sorting
> >> >>> > by
> >> >>> > > a string field. Such queries seem to load a lot of data
in to the
> >> heap.
> >> >>> > > Moreover lucene seems to hold on to references to this
data even
> >> after
> >> >>> > the
> >> >>> > > index reader has been closed and a full GC has been run.
Some of
> >> the
> >> >>> > > consequences of this are that in my generational heap
configuration
> >> a
> >> >>> lot
> >> >>> > > of
> >> >>> > > memory gets promoted to tenured space each time I close
the old
> >> index
> >> >>> > > reader
> >> >>> > > and after opening and querying using a new one, and the
tenured
> >> space
> >> >>> > > eventually gets fragmented causing a lot of promotion
failures
> >> >>> resulting
> >> >>> > in
> >> >>> > > jvm hangs while the jvm does stop-the-world GCs.
> >> >>> > >
> >> >>> > > Does anyone know any workarounds to avoid these memory
issues when
> >> >>> doing
> >> >>> > > such lucene queries?
> >> >>> > >
> >> >>> > > My profiling showed that even after a full GC lucene
is holding on
> >> to a
> >> >>> > lot
> >> >>> > > of references to field value data notably via the
> >> >>> > > FieldCacheImpl/ExtendedFieldCacheImpl. I noticed that
the
> >> WeakHashMap
> >> >>> > > readerCaches are using unbounded HashMaps as the innerCaches
and I
> >> used
> >> >>> > > reflection to replace these innerCaches with dummy empty
HashMaps,
> >> but
> >> >>> > > still
> >> >>> > > I'm seeing the same behavior. I wondered if anyone has
gone through
> >> >>> these
> >> >>> > > same issues before and would offer any advice.
> >> >>> > >
> >> >>> > > Thanks a lot,
> >> >>> > > TCK
> >> >>> > >
> >> >>> >
> >> >>>
> >> >>
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message