lucene-java-user mailing list archives

From Trejkaz
Subject Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)
Date Mon, 24 Aug 2015 01:27:10 GMT
There is a MultiPhraseQuery we use which looks a bit like:

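    MultiPhraseQuery query = new MultiPhraseQuery();
    // "field" is a placeholder; the real field name is not shown here
    query.add(new Term("field", "first"));
    query.add(new Term[] { new Term("field", "second1"),
                           new Term("field", "second2"), ... });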

The actual number of terms in this particular case is 207087. The size
of the index itself is 21GB or so, with around 1,300,000 docs. Large
but not gigantic. I ran the test with 2GB of RAM which was certainly
enough for Lucene 3.

Although I do think that this is abusing MultiPhraseQuery and that
SpanQuery is probably a better fit, I think that back in Lucene 3,
there were problems with SpanQuery performance which resulted in
switching to this as a performance hack.

Anyway, we now get an OOME when running this query and the heap
histogram comes out sort of like this:
  Class                                             Instances (%)     Bytes (%)
  int[]                                             995,093 (5.2%)    617,539,592 (31.6%)
  byte[]                                            1,065,597 (5.6%)  434,990,616 (22.3%)
  DocIdSet[]                                        777,620 (4.1%)    149,303,040 (7.6%)
  Lucene50PostingsReader$BlockPostingsEnum          326,022 (1.7%)    67,486,554 (3.5%)
  Lucene50PostingsFormat$IntBlockTermState          621,265 (3.2%)    57,777,645 (3%)
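
As a rough sanity check on the histogram, the following sketch does the arithmetic on the int[] row. It assumes the two numeric columns are instance count and shallow bytes, each followed by its share of the total. The class and method names are made up for illustration. Under that assumption the int[] row implies a total heap of a little under 2 GB, which agrees with the 2GB given to the test, and the int[] and byte[] rows together account for 53.9% of it.

```java
final class HeapHistogramSketch {

    // Total heap implied by a single row: bytes / (percent / 100).
    static double impliedTotalBytes(long bytes, double percentOfHeap) {
        return bytes / (percentOfHeap / 100.0);
    }

    // Mean size of one instance of the class in that row.
    static double averageBytesPerInstance(long bytes, long instances) {
        return (double) bytes / instances;
    }

    public static void main(String[] args) {
        // Figures copied from the int[] row of the histogram.
        long intArrayBytes = 617_539_592L;
        long intArrayCount = 995_093L;

        System.out.printf("implied heap size: %.0f bytes%n",
                impliedTotalBytes(intArrayBytes, 31.6));
        System.out.printf("average int[] size: %.1f bytes%n",
                averageBytesPerInstance(intArrayBytes, intArrayCount));
    }
}
```

The average comes out at several hundred bytes per int[], so the cost looks like a very large number of modest buffers rather than a few huge arrays.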

I went looking for the owner of these int arrays and it turns out to
be a postings reader which is ultimately (unsurprisingly) being held
by the MultiPhraseQuery.

What I'm wondering is:
- Why the increase in memory cost?
- Is our performance hack of using MultiPhraseQuery over SpanQuery
really warranted anymore?
- Is there a better way to do this particular query?

Also, just in case this is an X-Y problem, what we're actually
implementing here is simulating a large number of integer fields
without using a large number of fields. We index the name of the
sub-field followed by the value and then use this as a proximity query
to say "find values in range X to Y with the sub-field immediately in
front". This was done because there was some conventional wisdom
saying that having a large number of fields in Lucene is problematic,
although whether this still applies is unknown.
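
To make the scheme concrete, here is a toy model of it in plain Java with no Lucene dependency. It is only a sketch under assumptions: values are indexed as plain decimal strings, the sub-field names "width" and "height" in the usage below are invented, and every class and method name is hypothetical. The point it illustrates is that a range X to Y has to be expanded into every indexed value term inside the range, and all of those terms land in the second position of the phrase.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SubFieldRangeSketch {

    // Index-time view: each (sub-field, value) pair becomes two adjacent tokens.
    static List<String> tokens(String[] subFields, int[] values) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < subFields.length; i++) {
            out.add(subFields[i]);
            out.add(Integer.toString(values[i]));
        }
        return out;
    }

    // Query-time view: keep every indexed value term that falls in [lo, hi].
    // These are the terms that would fill the second phrase position.
    static Set<String> expandRange(Set<String> indexedValueTerms, int lo, int hi) {
        Set<String> out = new LinkedHashSet<>();
        for (String term : indexedValueTerms) {
            int v = Integer.parseInt(term);
            if (v >= lo && v <= hi) {
                out.add(term);
            }
        }
        return out;
    }

    // Match when the sub-field token is immediately followed by an accepted value.
    static boolean matches(List<String> docTokens, String subField, Set<String> accepted) {
        for (int i = 0; i + 1 < docTokens.size(); i++) {
            if (docTokens.get(i).equals(subField) && accepted.contains(docTokens.get(i + 1))) {
                return true;
            }
        }
        return false;
    }
}
```

For example, a document built with `tokens(new String[] {"width", "height"}, new int[] {10, 250})` matches the range 200 to 300 on "height" but not on "width", because only "height" is immediately followed by an accepted value. With many distinct values inside a range, the expanded set grows accordingly.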

