lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: heap memory issues when sorting by a string field
Date Wed, 09 Dec 2009 22:44:12 GMT
It's not that it's "necessary" -- this is just how Lucene's sorting
has always worked ;)  But, it's just software!  You could whip up a
patch...

I'm not familiar with the order-maintenance problem & solutions
offhand, but it certainly sounds interesting.

One issue is that loading only certain values on demand (from disk)
can be costly, eg it requires multiple disk head seeks.  SSDs will
help here, though they're still likely several orders of magnitude
slower than RAM (as long as the RAM isn't swapped out).

I think Lucene could cut back on the RAM usage by representing the
values more compactly.  Making a separate String per value is a lot of
overhead...
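
For illustration only -- this isn't how FieldCache stores values today,
and the class below is made up for the example -- here's a minimal
sketch of packing all values into one shared char[] with an offset
table, instead of allocating a separate String object (header, hash
field, backing array) per value:

  import java.util.List;

  // Toy packed representation: one shared char[] plus offsets.
  class PackedValues {
    private final char[] chars;   // all values concatenated
    private final int[] offsets;  // value i lives at offsets[i]..offsets[i+1]

    PackedValues(List<String> values) {
      int total = 0;
      for (String v : values) total += v.length();
      chars = new char[total];
      offsets = new int[values.size() + 1];
      int pos = 0;
      for (int i = 0; i < values.size(); i++) {
        offsets[i] = pos;
        String v = values.get(i);
        v.getChars(0, v.length(), chars, pos);
        pos += v.length();
      }
      offsets[values.size()] = pos;
    }

    // Materialize a String only when a value is actually needed.
    String get(int i) {
      return new String(chars, offsets[i], offsets[i + 1] - offsets[i]);
    }
  }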

Mike

On Wed, Dec 9, 2009 at 4:58 PM, TCK <moonwatcher32329@gmail.com> wrote:
> Thanks Mike for opening this jira ticket and for your patch. Explicitly
> removing the entry from the WHM definitely does reduce the number of GC
> cycles taken to free the huge StringIndex objects that get created when
> doing a sort by a string field.
>
> But I'm still trying to figure out why it is necessary for lucene to load
> into memory the sorted-by field of every single document in the index (which
> is what makes the StringIndex objects so big) on the first query. Why isn't
> it sufficient to load the field values for only the documents that match the
> given search criteria? Perhaps by using an order-maintenance data structure
> (Dietz & Sleator, or Bender et al) coupled with a balanced search tree,
> instead of the simple lookup and order arrays currently used in StringIndex,
> the cache could be grown dynamically as needed by successive queries rather
> than loading everything on the first query.
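
Purely as a sketch of what "load only the matched docs" could look like
-- this isn't existing Lucene code, and loadValueFromDisk below is a
hypothetical stand-in for a per-document disk lookup:

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical lazy sort-value cache: values are fetched only for the
  // docs a query actually compares, and the cache grows across queries.
  class LazySortValues {
    private final Map<Integer, String> cache = new HashMap<Integer, String>();

    String get(int docId) {
      String v = cache.get(docId);
      if (v == null) {
        v = loadValueFromDisk(docId);  // one random access per uncached doc
        cache.put(docId, v);
      }
      return v;
    }

    int compare(int doc1, int doc2) {
      return get(doc1).compareTo(get(doc2));
    }

    // Stand-in for reading one document's field value from disk; this
    // per-doc random access is what makes the approach seek-heavy.
    private String loadValueFromDisk(int docId) {
      throw new UnsupportedOperationException("illustrative stub");
    }
  }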
>
> Apologies if this is a naive question... I'm a newbie to the lucene code and
> can't wait for the new lucene book to come out:-)
>
> -TCK
>
>
>
>
> On Tue, Dec 8, 2009 at 5:43 AM, Michael McCandless <lucene@mikemccandless.com> wrote:
>
>> I've opened LUCENE-2135.
>>
>> Mike
>>
>> On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless
>> <lucene@mikemccandless.com> wrote:
>> > This is a rather disturbing implementation detail of WeakHashMap: it
>> > needs the one extra step (invoking one of its methods) before the
>> > values tied to its garbage-collected weak keys become reclaimable.
>> >
>> > Maybe on IndexReader.close(), Lucene should go and evict all entries
>> > in the FieldCache associated with that reader.  Ie, step through the
>> > sub-readers, and if they are truly closed as well (not shared w/ other
>> > readers), evict.  I'll open an issue.
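
Roughly what such an eviction hook could look like, as a toy sketch
(the class below and its purge method are hypothetical; the real
FieldCacheImpl is structured differently):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.WeakHashMap;

  // Hypothetical reader-keyed cache with an explicit eviction hook,
  // instead of relying only on WeakHashMap noticing the key is gone.
  class EvictingFieldCache {
    private final Map<Object, Map<String, Object>> byReader =
        new WeakHashMap<Object, Map<String, Object>>();

    synchronized Map<String, Object> innerCache(Object readerKey) {
      Map<String, Object> inner = byReader.get(readerKey);
      if (inner == null) {
        inner = new HashMap<String, Object>();
        byReader.put(readerKey, inner);
      }
      return inner;
    }

    // Would be called from (the equivalent of) IndexReader.close(),
    // for the reader and each sub-reader that is no longer shared.
    synchronized void purge(Object readerKey) {
      byReader.remove(readerKey);
    }
  }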
>> >
>> > Even in TCK's code fragment, it's not until the final line is done
>> > executing that the cache key even loses all hard references, because
>> > it's that line that assigns to sSearcher, replacing the strong
>> > reference to the old searcher.  Inserting sSearcher = null prior to
>> > that would drop the hard reference sooner, but because of this impl
>> > detail of WeakHashMap, something would still have to touch the map
>> > (eg, a warmup query that hits the field cache) before it's reclaimable.
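
A minimal sketch of that reordering, using the field and variable names
from TCK's fragment quoted further down; warmupQuery() is a hypothetical
query that sorts on the string field and so touches the FieldCache:

  // Drop the old searcher's hard reference before opening the new one,
  // then run any FieldCache-hitting query so WeakHashMap gets a chance
  // to expunge the stale entry for the closed reader.
  private synchronized void swapSearcher(Directory newDirectory) throws IOException {
    IndexSearcher old = sSearcher;
    sSearcher = null;                  // hard reference dropped early
    old.getIndexReader().close();
    old.close();

    IndexReader newReader = IndexReader.open(newDirectory);
    IndexSearcher newSearcher = new IndexSearcher(newReader);
    warmupQuery(newSearcher);          // hypothetical warmup
    sSearcher = newSearcher;
  }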
>> >
>> > Mike
>> >
>> > On Mon, Dec 7, 2009 at 7:38 PM, Tom Hill <solr-list@worldware.com> wrote:
>> >> Hi -
>> >>
>> >> If I understand correctly, WeakHashMap does not free the memory for
>> >> the value (cached data) when the key is nulled, or even when the key
>> >> is garbage collected.
>> >>
>> >> It requires one more step: a method on WeakHashMap must be called to
>> >> allow it to release its hard reference to the cached data. It appears
>> >> that most methods in WeakHashMap end up calling expungeStaleEntries,
>> >> which will clear the hard reference. But you have to call some method
>> >> on the map before the memory is eligible for garbage collection.
>> >>
>> >> So it requires four stages to free the cached data: null the key; a
>> >> GC to release the weak reference to the key; a call to some method on
>> >> the map; then the next GC cycle should free the value.
>> >>
>> >> So it seems possible that you could end up with double memory usage
>> >> for a time. If you don't have a GC between the time that you close
>> >> the old reader and the time you start to load the field cache entry
>> >> for the next reader, then the key may still be hanging around
>> >> uncollected.
>> >>
>> >> At that point, it may run a GC when you allocate the new cache, but
>> >> that's only the first GC. It can't free the cached data until after
>> >> the next call to expungeStaleEntries, so for a while you have both
>> >> caches around.
>> >>
>> >> This extra usage could cause things to move into tenured space. Could
>> >> this be causing your problem?
>> >>
>> >> A workaround would be to cause some method to be called on the
>> >> WeakHashMap. You don't want to call get(), since that will try to
>> >> populate the cache. Maybe try putting a small value into the cache,
>> >> doing a GC, and seeing if your memory drops then.
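
A standalone snippet that illustrates the four stages Tom describes;
this is just java.util.WeakHashMap behavior, nothing Lucene-specific:

  import java.util.WeakHashMap;

  public class WeakHashMapDemo {
    public static void main(String[] args) throws Exception {
      WeakHashMap<Object, byte[]> cache = new WeakHashMap<Object, byte[]>();
      Object key = new Object();
      cache.put(key, new byte[16 * 1024 * 1024]);  // big "cached" value

      key = null;      // stage 1: drop the hard reference to the key
      System.gc();     // stage 2: GC clears the weak reference to the key
      Thread.sleep(100);

      // The 16 MB value is still strongly referenced by the map's internal
      // entry here; nothing has expunged the stale entry yet.

      cache.size();    // stage 3: calling a method runs expungeStaleEntries

      System.gc();     // stage 4: now the value itself can be collected
      Thread.sleep(100);
      System.out.println("entries left: " + cache.size());
    }
  }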
>> >>
>> >>
>> >> Tom
>> >>
>> >>
>> >>
>> >> On Mon, Dec 7, 2009 at 1:48 PM, TCK <moonwatcher32329@gmail.com> wrote:
>> >>
>> >>> Thanks for the response. But I'm definitely calling close() on the old
>> >>> reader and opening a new one (not using reopen). Also, to simplify the
>> >>> analysis, I did my test with a single-threaded requester to eliminate
>> >>> any concurrency issues.
>> >>>
>> >>> I'm doing:
>> >>> sSearcher.getIndexReader().close();
>> >>> sSearcher.close(); // this actually seems to be a no-op
>> >>> IndexReader newIndexReader = IndexReader.open(newDirectory);
>> >>> sSearcher = new IndexSearcher(newIndexReader);
>> >>>
>> >>> Btw, isn't it bad practice anyway to have an unbounded cache? Are
>> >>> there any plans to replace the HashMaps used for the innerCaches with
>> >>> an actual size-bounded cache with some eviction policy (perhaps
>> >>> EhCache or something)?
>> >>>
>> >>> Thanks again,
>> >>> TCK
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Dec 7, 2009 at 4:37 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >>>
>> >>> > What this sounds like is that you're not really closing your
>> >>> > readers even though you think you are. Sorting indeed uses up
>> >>> > significant memory when it populates internal caches and keeps
>> >>> > it around for later use (which is one of the reasons that warming
>> >>> > queries matter). But if you really do close the reader, I'm pretty
>> >>> > sure the memory should be GC-able.
>> >>> >
>> >>> > One thing that trips people up is IndexReader.reopen(). If it
>> >>> > returns a reader different than the original, you *must* close the
>> >>> > old one. If you don't, the old reader is still hanging around and
>> >>> > memory won't be returned... An example from the Javadocs...
>> >>> >
>> >>> >  IndexReader reader = ...
>> >>> >  ...
>> >>> >  IndexReader newReader = reader.reopen();
>> >>> >  if (newReader != reader) {
>> >>> >    ...     // reader was reopened
>> >>> >    reader.close();
>> >>> >  }
>> >>> >  reader = newReader;
>> >>> >  ...
>> >>> >
>> >>> >
>> >>> > If this is irrelevant, could you post your close/open code?
>> >>> >
>> >>> > HTH
>> >>> >
>> >>> > Erick
>> >>> >
>> >>> >
>> >>> > On Mon, Dec 7, 2009 at 4:27 PM, TCK <moonwatcher32329@gmail.com> wrote:
>> >>> >
>> >>> > > Hi,
>> >>> > > I'm having heap memory issues when I do lucene queries involving
>> >>> > > sorting by a string field. Such queries seem to load a lot of data
>> >>> > > into the heap. Moreover lucene seems to hold on to references to
>> >>> > > this data even after the index reader has been closed and a full
>> >>> > > GC has been run. Some of the consequences of this are that in my
>> >>> > > generational heap configuration a lot of memory gets promoted to
>> >>> > > tenured space each time I close the old index reader and then open
>> >>> > > and query with a new one, and the tenured space eventually gets
>> >>> > > fragmented, causing a lot of promotion failures resulting in jvm
>> >>> > > hangs while the jvm does stop-the-world GCs.
>> >>> > >
>> >>> > > Does anyone know any workarounds to avoid these memory issues when
>> >>> > > doing such lucene queries?
>> >>> > >
>> >>> > > My profiling showed that even after a full GC lucene is holding on
>> >>> > > to a lot of references to field value data, notably via the
>> >>> > > FieldCacheImpl/ExtendedFieldCacheImpl. I noticed that the
>> >>> > > WeakHashMap readerCaches are using unbounded HashMaps as the
>> >>> > > innerCaches, and I used reflection to replace these innerCaches
>> >>> > > with dummy empty HashMaps, but still I'm seeing the same behavior.
>> >>> > > I wondered if anyone has gone through these same issues before and
>> >>> > > would offer any advice.
>> >>> > >
>> >>> > > Thanks a lot,
>> >>> > > TCK
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

