lucene-java-user mailing list archives

From "Hasenberger, Josef" <Josef.Hasenber...@zetcom.com>
Subject Efficient way to define large Boolean Occur.FILTER clause in Lucene 6
Date Tue, 26 Jun 2018 10:02:05 GMT
Hi,

I want to filter the result of a query by Long values (for one specific field, which is actually
a DocValues field) in Lucene 6, as a replacement for Filters, which were removed in Lucene 6.

The number of allowed Long values can range from just a few up to hundreds of thousands.
What I do now is create a TermsQuery from generated Terms and add it to a BooleanQuery
as a FILTER clause, like this:

    public Query getFilteredQuery(Query query) {
        List<Term> terms = new ArrayList<>(getValueSize());
        String keyFieldName = getFieldName();
        for (Long value : getValues()) {
            // Converting directly to a BytesRef saves the conversion from UTF-16 to UTF-8.
            BytesRef valueAsBytesRef = LongToUTF8Converter.toBytesRef(value);
            Term term = new Term(keyFieldName, valueAsBytesRef);
            terms.add(term);
        }
        TermsQuery termsQuery = new TermsQuery(terms);

        return new BooleanQuery.Builder()
                .add(query, Occur.MUST)  // original query
                .add(termsQuery, Occur.FILTER) // add filter
                .build();
    }

However, I have a feeling that converting the Long values to Terms is rather inefficient
for large collections and also uses a lot of memory, since one Term object is created per value.
To reduce the conversion overhead somewhat, I created a class that converts a Long value directly
to a BytesRef instance (avoiding the detour through a UTF-16 String that is then encoded to UTF-8
again) and I pass that instance to the Term constructor.
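
For illustration, a converter of this kind could look roughly like the sketch below. It is not the actual LongToUTF8Converter class: the name LongDigits is hypothetical, and it assumes the terms are indexed as plain decimal digit strings. It returns a byte[] so that it runs without Lucene on the classpath; the result could then be wrapped with new BytesRef(byte[]).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a long-to-UTF-8 converter; assumes decimal digit terms.
public final class LongDigits {
    private LongDigits() {}

    /** Returns the decimal digits of value as ASCII (and therefore valid UTF-8) bytes. */
    public static byte[] toUtf8Bytes(long value) {
        if (value == Long.MIN_VALUE) {
            // -Long.MIN_VALUE overflows, so this single value is handled separately.
            return "-9223372036854775808".getBytes(StandardCharsets.US_ASCII);
        }
        boolean negative = value < 0;
        long remaining = negative ? -value : value;

        // First pass: count the digits (plus one slot for the sign if needed).
        int length = negative ? 1 : 0;
        long probe = remaining;
        do {
            length++;
            probe /= 10;
        } while (probe != 0);

        // Second pass: write the digits from the least significant end backwards.
        byte[] out = new byte[length];
        int pos = length;
        do {
            out[--pos] = (byte) ('0' + (remaining % 10));
            remaining /= 10;
        } while (remaining != 0);
        if (negative) {
            out[0] = '-';
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new String(toUtf8Bytes(-1203L), StandardCharsets.US_ASCII));
    }
}
```

This writes the bytes without creating an intermediate String, but it still allocates one array (and later one BytesRef and one Term) per value, which is the object creation I would like to avoid.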

I wonder whether there is a better way to pass a large number of filter criteria to a
BooleanQuery Occur.FILTER clause, one that avoids excessive object creation.
Or is there a better approach than using BooleanQuery in this case?

I would be glad if you could share your thoughts on this.

Thanks a lot,
Josef

