lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Understanding performance characteristics of the new point types
Date Wed, 02 Nov 2016 19:09:11 GMT
Yeah it's best to use StringField for low-cardinality use cases.

When cardinality is low (4 unique values in your case), legacy
numerics would rewrite to a BooleanQuery, which is much more
performant for MUST clauses, vs dimensional points which will always
need to construct an up front bitset for all documents with that
value.  Using StringField instead will ensure you always get a
BooleanQuery...

Mike McCandless

http://blog.mikemccandless.com
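
The trade-off described in the reply above can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not Lucene code: the class ConjunctionSketch, its methods, and the "work" counter are hypothetical names made up for illustration, document IDs are plain sorted int arrays, and a binary search stands in for a postings iterator being advanced to a target (counted here as a single unit of work, which is a simplification). The numbers are a scaled-down version of the test in the original question: a filter that matches all but 1% of the documents, combined via MUST with a clause that selects 5 documents.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy model only: illustrates the cost shape, not Lucene's implementation. */
final class ConjunctionSketch {

    /** Builds the sorted doc IDs in [0, maxDoc) that are NOT multiples of {@code skipEvery}. */
    static int[] allBut(int maxDoc, int skipEvery) {
        int[] docs = new int[maxDoc];
        int n = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (doc % skipEvery != 0) {
                docs[n++] = doc;
            }
        }
        return Arrays.copyOf(docs, n);
    }

    /** Points-like path: materialize every matching doc into a bitset first, then probe it. */
    static int[] viaUpFrontBitSet(int[] rare, int[] common, int maxDoc, long[] work) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : common) {      // cost grows with the size of the common set
            bits.set(doc);
            work[0]++;
        }
        int[] out = new int[rare.length];
        int n = 0;
        for (int doc : rare) {        // one probe per candidate from the rare clause
            work[0]++;
            if (bits.get(doc)) {
                out[n++] = doc;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Postings-like path: the rare clause leads; the common list is only advanced to candidates. */
    static int[] viaLeapfrog(int[] rare, int[] common, long[] work) {
        int[] out = new int[rare.length];
        int n = 0;
        int from = 0;                 // never look behind the previous position
        for (int doc : rare) {
            // Binary search stands in for advancing a postings iterator to "doc".
            int idx = Arrays.binarySearch(common, from, common.length, doc);
            work[0]++;
            if (idx >= 0) {
                out[n++] = doc;
                from = idx + 1;
            } else {
                from = -idx - 1;      // insertion point: first entry greater than doc
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        int[] common = allBut(maxDoc, 100);                   // matches 99% of the docs
        int[] rare = {7, 100, 250_003, 500_000, 999_999};     // the 5-document clause

        long[] bitSetWork = {0};
        long[] leapfrogWork = {0};
        int[] a = viaUpFrontBitSet(rare, common, maxDoc, bitSetWork);
        int[] b = viaLeapfrog(rare, common, leapfrogWork);

        System.out.println("bitset hits:   " + Arrays.toString(a) + ", work = " + bitSetWork[0]);
        System.out.println("leapfrog hits: " + Arrays.toString(b) + ", work = " + leapfrogWork[0]);
    }
}
```

Both methods return the same hits, but in this model the bitset path does work proportional to the size of the large clause before the first candidate is even checked, while the leapfrog path only does work proportional to the small clause. That matches the pattern reported in the question: the points query is fast on its own, yet slow when a very selective MUST clause could otherwise drive the iteration.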


On Wed, Nov 2, 2016 at 2:43 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> Hi florian,
>
> If my understanding is correct, you are using IntPoint to index 4 different
> document types, which is overkill; why not try a classic “non-tokenized”
> keyword field (a.k.a. “legacy string”) for document types? Cardinality is
> only four for document types.
>
>
> --
>
> Fuad Efendi
>
> (416) 993-2060
>
> http://www.tokenizer.ca
> Recommender Systems
>
>
> On November 2, 2016 at 2:10:14 PM, Florian Hopf (
> mailinglists@florian-hopf.de) wrote:
>
> Hi,
>
> we are indexing different types of documents in one Lucene index. They
> have most fields in common but we need to filter some types for certain
> queries. We are using numeric values to determine the types of documents
> (1-4). Now, when querying these documents we see that the performance
> degrades the more documents of a type are in the index.
>
> Using a simple test that indexes 10 million documents I can see the
> following when filtering on everything but 100,000 documents:
>
> * When issuing the query alone the new PointRangeQuery
> (IntPoint.newExactQuery) is a lot faster than term and legacy numeric
> (in my case around 2x the speed of the others)
> * When issuing a bool query that contains a term query that selects 5
> documents together with a MUST query that selects on the numeric, the
> points are 5x slower than legacy numeric
> (LegacyNumericRangeQuery.newIntRange) and terms (TermQuery)
> * When doing the same thing with SHOULD instead of MUST for the
> additional term query, the PointRangeQuery is the fastest as well
>
> I suspect this to be related to the discussion in
> https://issues.apache.org/jira/browse/LUCENE-7254
>
> Of course there could be something wrong with the way I am measuring the
> performance, I'd be happy to share the code. But what I read in the
> ticket above seems to hint that the points are not suited for every use
> case? Is it recommended to use StringField in a case like this instead?
>
> Regards
> Florian
>
> --
> Florian Hopf
> Freelance Software Developer
>
> http://blog.florian-hopf.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


