lucene-java-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: heap memory issues when sorting by a string field
Date Thu, 10 Dec 2009 12:38:20 GMT
Regarding LUCENE-1990, I've been experimenting a bit with packed
positive integer arrays. I'd love to take a look at it, but I simply do
not have the time before next year.

My approach was to pack the bits right after each other, but that
presents certain challenges. Going for the smallest possible size means
that some numbers will be divided between array entries: storing three
40-bit values in longs means that the first 24 bits of the second value
will be stored in one long, with the remaining 16 bits in another. The
problem here is performance.

I've created an implementation that avoids conditionals in the getter by
using 9 shifts, two ors, one and, and a mask, but it takes more time than
I thought: more than a second for 100 million lookups in an array of 1
million values on my desktop machine. That's simply too slow. I'll try
with 3 masks, 3 shifts and 3 ors instead, but I fear that it won't help
much.
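For the curious, a minimal sketch of such a branch-free getter (this is an illustration under assumed LSB-first packing, not the implementation discussed above; the exact shift/or counts differ):

```java
// Sketch: values packed bit-for-bit into a long[], where a value may span
// two array entries. The getter is branch-free: it always reads two
// adjacent longs and combines them with shifts, ors and a final mask.
public class PackedInts {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    public PackedInts(long[] blocks, int bitsPerValue) {
        this.blocks = blocks;
        this.bitsPerValue = bitsPerValue;
        this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
    }

    // Packs the given values LSB-first; values are assumed to fit in
    // bitsPerValue bits. One spare long is allocated at the end so get()
    // can always read blocks[block + 1] without a bounds check.
    public static PackedInts pack(long[] values, int bitsPerValue) {
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) >>> 6) + 1];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);   // bitPos / 64
            int offset = (int) (bitPos & 63);   // bitPos % 64
            blocks[block] |= values[i] << offset;
            if (offset + bitsPerValue > 64) {   // value spans two longs
                blocks[block + 1] |= values[i] >>> (64 - offset);
            }
        }
        return new PackedInts(blocks, bitsPerValue);
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long lo = blocks[block] >>> offset;
        // "<< 1 << (63 - offset)" shifts left by (64 - offset) without a
        // conditional: Java shift counts are taken mod 64, so a single
        // shift by 64 would silently be a no-op when offset == 0.
        long hi = blocks[block + 1] << 1 << (63 - offset);
        return (lo | hi) & mask;
    }

    public static void main(String[] args) {
        // Three 40-bit values, as in the example above: the second value
        // straddles the boundary between the first and second long.
        long[] values = {123456789L, 987654321098L, 1L << 39};
        PackedInts p = pack(values, 40);
        for (int i = 0; i < values.length; i++) {
            System.out.println(p.get(i) == values[i]);
        }
    }
}
```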

A hybrid approach would be to allow only 1, 2, 4, 8, 16, 32 or 64 bits
for the values, as that would ensure that there is no spanning of
array-elements. We have a Danish proverb regarding that: "A good
compromise leaves no one satisfied".
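Under that restriction the getter collapses to a single shift and mask; a minimal sketch (again an illustration, not Lucene code):

```java
// Sketch of the hybrid scheme: with 1, 2, 4, 8, 16, 32 or 64 bits per
// value, a whole number of values fits in each long, so no value ever
// spans two array elements and a lookup is one shift plus one mask.
public class AlignedPackedInts {
    private final long[] blocks;
    private final int bitsPerValue;
    private final int valuesPerBlock;
    private final long mask;

    public AlignedPackedInts(int size, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.valuesPerBlock = 64 / bitsPerValue;
        this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
        this.blocks = new long[(size + valuesPerBlock - 1) / valuesPerBlock];
    }

    public void set(int index, long value) {
        int shift = (index % valuesPerBlock) * bitsPerValue;
        blocks[index / valuesPerBlock] |= (value & mask) << shift;
    }

    public long get(int index) {
        int shift = (index % valuesPerBlock) * bitsPerValue;
        return (blocks[index / valuesPerBlock] >>> shift) & mask;
    }

    public static void main(String[] args) {
        // 4 bits per value: 16 values per long, range 0..15.
        AlignedPackedInts a = new AlignedPackedInts(10, 4);
        a.set(0, 5);
        a.set(9, 15);
        System.out.println(a.get(0));
        System.out.println(a.get(9));
    }
}
```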


...Most probably reinventing the wheel here and definitely moving
outside Lucene territory.

- Toke

On Thu, 2009-12-10 at 11:45 +0100, Michael McCandless wrote:
> The big challenge w/ global sort ords is handling reopen, ie, for
> apps that need to frequently reopen their index.
> 
> If you have a static index, then this approach is a good one -- you
> make a one-time investment, after indexing, to compute your global
> ords and store them on disk.
> 
> Then at search time you load the ords in, and you can make a
> FieldComparator that taps into them (it just has to re-base the
> per-segment docIDs).  No String values are needed...
> 
> Also, once we've done
> https://issues.apache.org/jira/browse/LUCENE-1990 (anyone out there
> wanna take a stab?  Nice self-contained utility API needed for
> Lucene....), that ord array can be very compact for cases where the
> number of unique Strings is smallish.  No reason to spend a full 4
> bytes per document...
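Mike's point about compactness can be made concrete with some back-of-the-envelope numbers (the figures below are hypothetical, not from the thread):

```java
// Rough memory math for a packed ord array: with U unique terms, each
// per-document ord needs only ceil(log2(U)) bits instead of a full 32.
public class OrdMemory {
    public static void main(String[] args) {
        int numDocs = 10_000_000;
        int uniqueTerms = 1000; // "smallish" number of unique Strings
        // ceil(log2(uniqueTerms)): position of the highest set bit of U-1, plus one.
        int bitsPerOrd = 64 - Long.numberOfLeadingZeros(uniqueTerms - 1);
        long packedBytes = ((long) numDocs * bitsPerOrd + 7) / 8;
        long intArrayBytes = (long) numDocs * 4; // a plain int[] ord array
        System.out.println(bitsPerOrd);
        System.out.println(packedBytes);
        System.out.println(intArrayBytes);
    }
}
```

For 10M documents and 1000 unique terms that is 10 bits per ord, i.e. about 12.5 MB packed versus 40 MB for a plain int[].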
> 
> Mike
> 
> On Wed, Dec 9, 2009 at 6:46 PM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> > Thanks for the heads-up, TCK. The Dietz & Sleator article I found at
> > http://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf
> > looks very interesting.
> >
> > String sorting in Lucene is indeed fairly expensive and we've experimented with
> > two solutions to this, neither of which is a silver bullet.
> >
> > 1) Store the order, not the content.
> >
> > Like the Lucene default sorter, we fetch all the terms for the sort field on the first
> > sort. The docIDs are mapped to the terms and a complete sort is performed
> > (using a custom Collator: we want to avoid CollationKeys, as they
> > take too much memory. But that's another story). When the sort is finished,
> > we have an integer array where each entry is the relative position for a given
> > document (referenced by docID). We then de-reference the loaded terms
> > so that the memory is freed.
> >
> > Determining the order of two documents after the initial build is extremely
> > simple: order[docIDa] - order[docIDb] (or maybe it was the other way
> > around, I forget).
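That comparison step can be sketched as follows (hypothetical data; the order array is assumed to already be built by the initial sort):

```java
import java.util.Arrays;
import java.util.Comparator;

// Minimal sketch of comparing documents via a precomputed order array:
// order[docID] holds the document's rank in the sort, so comparing two
// docs is just an int subtraction -- no Strings or Collator needed.
// (Subtraction cannot overflow here, since ranks are in 0..numDocs-1.)
public class OrderArrayCompare {
    public static void main(String[] args) {
        int[] order = {2, 0, 3, 1}; // doc 1 sorts first, then doc 3, ...
        Integer[] docIds = {0, 1, 2, 3};
        Comparator<Integer> byOrder = (a, b) -> order[a] - order[b];
        Arrays.sort(docIds, byOrder);
        System.out.println(Arrays.toString(docIds));
    }
}
```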
> >
> > Cons: Large temporary memory overhead upon the initial sort, though not as
> > much as using CollationKeys. Slower initial sort than Lucene's built-in one.
> > Pros: Very fast subsequent sorts, very low memory-overhead when running.
> > Joker: This approach could be used to perform a post-processing on a
> > generated index and storing the orders as a file along with the index, then
> > pushing the complete package to searchers. This would lower the memory
> > requirements for the searchers significantly.
> >
> >
> > 2) As above, but use a multipass order builder.
> >
> > For the first sort, the order must be calculated: A sliding window sorter is created
> > with a buffer of a fixed size. The window slides through all terms for the sort field
> > and extracts the top X. The order-array is updated for the documents that have
> > one of these terms. The sliding is repeated multiple times, where terms ordered
> > before the last term of the previous iteration are ignored.
> >
> > Cons: _Very_ slow (too slow in the current implementation) order build.
> > Pros: Same as above.
> > Joker: The buffer size determines memory use vs. order build time.
> >
> >
> > The multipass approach looks promising, but requires more work to get to a
> > usable state. Right now it takes minutes to build the order-array for half a
> > million documents, with a buffer size requiring 5 iterations. If I ever get it to
> > work, I'll be sure to share it.
> >
> > Regards,
> > Toke Eskildsen
> >
> > ________________________________________
> > From: TCK [moonwatcher32329@gmail.com]
> > Sent: 09 December 2009 22:58
> > To: java-user@lucene.apache.org
> > Subject: Re: heap memory issues when sorting by a string field
> >
> > Thanks Mike for opening this jira ticket and for your patch. Explicitly
> > removing the entry from the WHM definitely does reduce the number of GC
> > cycles taken to free the huge StringIndex objects that get created when
> > doing a sort by a string field.
> >
> > But I'm still trying to figure out why it is necessary for lucene to load
> > into memory the sorted-by field of every single document in the index (which
> > is what makes the StringIndex objects so big) on the first query. Why isn't
> > it sufficient to load the field values for only the documents that match the
> > given search criteria? Perhaps by using an order-maintenance data structure
> > (Dietz&Sleator, or Bender et al) coupled with a balanced search tree,
> > instead of the simple lookup and order arrays currently used in StringIndex,
> > we can dynamically grow it as needed by successive queries rather than
> > loading everything on the first query.
> >
> > Apologies if this is a naive question... I'm a newbie to the lucene code and
> > can't wait for the new lucene book to come out:-)
> >
> > -TCK
> >
> >
> >
> >
> > On Tue, Dec 8, 2009 at 5:43 AM, Michael McCandless <lucene@mikemccandless.com> wrote:
> >
> >> I've opened LUCENE-2135.
> >>
> >> Mike
> >>
> >> On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless
> >> <lucene@mikemccandless.com> wrote:
> >> > This is a rather disturbing implementation detail of WeakHashMap:
> >> > it needs one extra step (invoking one of its methods) before the
> >> > values of its weak keys become reclaimable.
> >> >
> >> > Maybe on IndexReader.close(), Lucene should go and evict all entries
> >> > in the FieldCache associated with that reader.  Ie, step through the
> >> > sub-readers, and if they are truly closed as well (not shared w/ other
> >> > readers), evict.  I'll open an issue.
> >> >
> >> > Even in TCK's code fragment, it's not until the final line is done
> >> > executing that the cache key even loses all hard references, because
> >> > it's that line that assigns to sSearcher, replacing the strong
> >> > reference to the old searcher.  Inserting sSearcher = null prior to
> >> > that would drop the hard reference sooner, but because of this impl
> >> > detail of WeakHashMap, something would still have to touch it (eg, a
> >> > warmup query that hits the field cache) before it's reclaimable.
> >> >
> >> > Mike
> >> >
> >> > On Mon, Dec 7, 2009 at 7:38 PM, Tom Hill <solr-list@worldware.com> wrote:
> >> >> Hi -
> >> >>
> >> >> If I understand correctly, WeakHashMap does not free the memory for the
> >> >> value (cached data) when the key is nulled, or even when the key is
> >> >> garbage collected.
> >> >>
> >> >> It requires one more step: a method on WeakHashMap must be called to
> >> >> allow it to release its hard reference to the cached data. It appears
> >> >> that most methods in WeakHashMap end up calling expungeStaleEntries,
> >> >> which will clear the hard reference. But you have to call some method
> >> >> on the map before the memory is eligible for garbage collection.
> >> >>
> >> >> So it requires four stages to free the cached data: null the key; a GC
> >> >> to release the weak reference to the key; a call to some method on the
> >> >> map; then the next GC cycle should free the value.
> >> >>
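Tom's four stages can be demonstrated with a small self-contained program (illustrative only -- actual behavior depends on GC timing):

```java
import java.util.Map;
import java.util.WeakHashMap;

// Demonstration of the four stages described above: the value is only
// reclaimable after (1) the key's hard reference is dropped, (2) a GC
// clears the weak reference, and (3) some method is invoked on the map,
// which internally expunges stale entries; (4) the next GC frees the value.
public class WeakHashMapStages {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, byte[]> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, new byte[1024 * 1024]);

        key = null;                   // stage 1: null the key
        for (int i = 0; i < 5; i++) { // stage 2: let a GC clear the weak ref
            System.gc();
            Thread.sleep(50);
        }
        // Stage 3: size() is one of the methods that expunges stale entries;
        // until some such call, the map still holds a hard reference to the
        // 1 MB value even though the key is gone.
        System.out.println(cache.size()); // stage 4: next GC can free the value
    }
}
```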
> >> >> So it seems possible that you could end up with double memory usage
> >> >> for a time. If you don't have a GC between the time that you close the
> >> >> old reader, and you start to load the field cache entry for the next
> >> >> reader, then the key may still be hanging around uncollected.
> >> >>
> >> >> At that point, it may run a GC when you allocate the new cache, but
> >> >> that's only the first GC. It can't free the cached data until after
> >> >> the next call to expungeStaleEntries, so for a while you have both
> >> >> caches around.
> >> >>
> >> >> This extra usage could cause things to move into tenured space. Could
> >> >> this be causing your problem?
> >> >>
> >> >> A workaround would be to cause some method to be called on the
> >> >> WeakHashMap. You don't want to call get(), since that will try to
> >> >> populate the cache. Maybe try putting a small value into the cache and
> >> >> doing a GC, and see if your memory drops then.
> >> >>
> >> >>
> >> >> Tom
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Dec 7, 2009 at 1:48 PM, TCK <moonwatcher32329@gmail.com> wrote:
> >> >>
> >> >>> Thanks for the response. But I'm definitely calling close() on the
> >> >>> old reader and opening a new one (not using reopen). Also, to
> >> >>> simplify the analysis, I did my test with a single-threaded requester
> >> >>> to eliminate any concurrency issues.
> >> >>>
> >> >>> I'm doing:
> >> >>> sSearcher.getIndexReader().close();
> >> >>> sSearcher.close(); // this actually seems to be a no-op
> >> >>> IndexReader newIndexReader = IndexReader.open(newDirectory);
> >> >>> sSearcher = new IndexSearcher(newIndexReader);
> >> >>>
> >> >>> Btw, isn't it bad practice anyway to have an unbounded cache? Are
> >> >>> there any plans to replace the HashMaps used for the innerCaches with
> >> >>> an actual size-bounded cache with some eviction policy (perhaps
> >> >>> EhCache or something)?
> >> >>>
> >> >>> Thanks again,
> >> >>> TCK
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Mon, Dec 7, 2009 at 4:37 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >> >>>
> >> >>> > What this sounds like is that you're not really closing your
> >> >>> > readers even though you think you are. Sorting indeed uses up
> >> >>> > significant memory when it populates internal caches and keeps
> >> >>> > it around for later use (which is one of the reasons that warming
> >> >>> > queries matter). But if you really do close the reader, I'm pretty
> >> >>> > sure the memory should be GC-able.
> >> >>> >
> >> >>> > One thing that trips people up is IndexReader.reopen(). If it
> >> >>> > returns a reader different than the original, you *must* close the
> >> >>> > old one. If you don't, the old reader is still hanging around and
> >> >>> > memory won't be returned. An example from the Javadocs...
> >> >>> >
> >> >>> >  IndexReader reader = ...
> >> >>> >  ...
> >> >>> >  IndexReader newReader = reader.reopen();
> >> >>> >  if (newReader != reader) {
> >> >>> >    ...     // reader was reopened
> >> >>> >    reader.close();
> >> >>> >  }
> >> >>> >  reader = newReader;
> >> >>> >  ...
> >> >>> >
> >> >>> >
> >> >>> > If this is irrelevant, could you post your close/open code?
> >> >>> >
> >> >>> > HTH
> >> >>> >
> >> >>> > Erick
> >> >>> >
> >> >>> >
> >> >>> > On Mon, Dec 7, 2009 at 4:27 PM, TCK <moonwatcher32329@gmail.com> wrote:
> >> >>> >
> >> >>> > > Hi,
> >> >>> > > I'm having heap memory issues when I do lucene queries involving
> >> >>> > > sorting by a string field. Such queries seem to load a lot of
> >> >>> > > data into the heap. Moreover lucene seems to hold on to
> >> >>> > > references to this data even after the index reader has been
> >> >>> > > closed and a full GC has been run. Some of the consequences of
> >> >>> > > this are that in my generational heap configuration a lot of
> >> >>> > > memory gets promoted to tenured space each time I close the old
> >> >>> > > index reader and open and query using a new one, and the tenured
> >> >>> > > space eventually gets fragmented, causing a lot of promotion
> >> >>> > > failures resulting in jvm hangs while the jvm does
> >> >>> > > stop-the-world GCs.
> >> >>> > >
> >> >>> > > Does anyone know any workarounds to avoid these memory issues
> >> >>> > > when doing such lucene queries?
> >> >>> > >
> >> >>> > > My profiling showed that even after a full GC lucene is holding
> >> >>> > > on to a lot of references to field value data, notably via the
> >> >>> > > FieldCacheImpl/ExtendedFieldCacheImpl. I noticed that the
> >> >>> > > WeakHashMap readerCaches are using unbounded HashMaps as the
> >> >>> > > innerCaches, and I used reflection to replace these innerCaches
> >> >>> > > with dummy empty HashMaps, but still I'm seeing the same
> >> >>> > > behavior. I wondered if anyone has gone through these same
> >> >>> > > issues before and could offer any advice.
> >> >>> > >
> >> >>> > > Thanks a lot,
> >> >>> > > TCK
> >> >>> > >
> >> >>> >
> >> >>>
> >> >>
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
> >
> 
> 




