lucene-java-user mailing list archives

From Igor Shalyminov <>
Subject Re: Lucene in-memory index
Date Fri, 25 Oct 2013 13:58:58 GMT
What is ProxBooleanTermQuery?
I couldn't find it in the trunk or in that ticket.
And for now it's still fuzzy to me how searching/scoring works. Are there any tutorials
or talks on how Queries, Scorers, and Collectors interoperate?
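Roughly, the interplay is: a Query builds a Scorer per index segment, the Scorer iterates over matching doc IDs, and the searcher pushes each hit into a Collector. A toy model of that control flow (not Lucene's real classes, just the pattern; names and postings lists are made up):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-ins for Lucene's abstractions: a Scorer iterates matching
// doc IDs; a Collector receives each hit the searcher pulls from it.
interface Scorer { int nextDoc(); float score(); }
interface Collector { void collect(int doc, float score); }

class TermScorer implements Scorer {
    private final Iterator<Integer> postings; // doc IDs containing the term
    TermScorer(List<Integer> docs) { this.postings = docs.iterator(); }
    public int nextDoc() { return postings.hasNext() ? postings.next() : Integer.MAX_VALUE; }
    public float score() { return 1.0f; } // constant score for the sketch
}

class CountingCollector implements Collector {
    int hits = 0;
    public void collect(int doc, float score) { hits++; }
}

public class SearchFlow {
    // The "IndexSearcher" loop: pull doc IDs from the Scorer, push to Collector.
    static void search(Scorer scorer, Collector collector) {
        for (int doc = scorer.nextDoc(); doc != Integer.MAX_VALUE; doc = scorer.nextDoc())
            collector.collect(doc, scorer.score());
    }

    public static void main(String[] args) {
        CountingCollector c = new CountingCollector();
        search(new TermScorer(Arrays.asList(2, 5, 9)), c);
        System.out.println(c.hits); // prints 3
    }
}
```

In real Lucene the Query first produces a Weight, and the Collector is notified per segment, but the pull-docs/push-hits loop above is the core shape.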


23.10.2013, 19:06, "Michael McCandless" <>:
> On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
> <> wrote:
>>  Thanks for the link, I'll definitely dig into SpanQuery internals very soon.
> You could also just make a custom query.  If you start from the
> ProxBooleanTermQuery on that issue, but change it so that it rejects
> hits that didn't have terms in the right positions, then you'll likely
> have a much faster way to do your query.
>>>>   For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>>  I didn't even realize you could pass negative slop to span queries.
>>>  What does that do?  Or did you mean slop=1?
>>  I indeed use an unordered SpanNearQuery with the slop = -1 (I saw the trick on some forum).
> Wow, OK.  I have no idea what slop=-1 does...
>>  So far it works for me:)
>>>>   I wrap them into an ordered SpanNearQuery with the slop=0.
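The nesting described above can be written against the Lucene 4.x span API roughly like this (a sketch; the field name "text" and the term values are illustrative, and it assumes Lucene on the classpath):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NestedSpans {
    public static SpanQuery build() {
        // Inner queries: unordered, slop=-1, so both attribute terms
        // must sit at the same token position (the trick from the thread).
        SpanQuery aSg = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("text", "A")),
            new SpanTermQuery(new Term("text", "sg"))
        }, -1, false);
        SpanQuery nSg = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("text", "N")),
            new SpanTermQuery(new Term("text", "sg"))
        }, -1, false);
        // Outer query: ordered, slop=0 - "A,sg" immediately followed by "N,sg".
        return new SpanNearQuery(new SpanQuery[] { aSg, nSg }, 0, true);
    }
}
```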
>>>>   I see getPayload() at the top of the profiler output. I think I can emulate payload
>>>>   checking with cleverly assigned position increments (the maximum position in a
>>>>   document might then jump up to ~10^9 - I hope it won't blow the whole index up).
>>>>   If I remove payload matching and keep only position checking, will everything
>>>>   speed up, or are positions and payloads equally expensive to check?
>>>  I think it would help to avoid payloads, but I'm not sure by how much.
>>>   E.g., I see that NearSpansOrdered creates a new Set for every hit
>>>  just to hold payloads, even if payloads are not going to be used.
>>>  Really the span scorers should check Terms.hasPayloads up front ...
>>>>   My main goal is getting precise results for a query, so proximity
>>>>   boosting won't help, unfortunately.
>>>  OK.
>>>  I wonder if you can somehow identify the spans you care about at
>>>  indexing time, e.g. A,sg followed by N,sg, and add a span into the
>>>  index at that point; this would make searching much faster (it becomes
>>>  a TermQuery).  For exact matching (slop=0) you can also index
>>>  shingles.
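For illustration, this is what shingling amounts to: adjacent token pairs become single terms, so an exact phrase (slop=0) turns into a lookup of one term instead of a positional join. A plain-Java sketch (in a real Lucene analysis chain a ShingleFilter would do this):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Shingles {
    // Emit every window of `size` consecutive tokens as one joined term.
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++)
            out.add(String.join(" ", tokens.subList(i, i + size)));
        return out;
    }

    public static void main(String[] args) {
        // Two shingles: "A,sg N,sg" and "N,sg V,pl"
        System.out.println(shingles(Arrays.asList("A,sg", "N,sg", "V,pl"), 2));
    }
}
```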
>>  Thanks for the clue, I think it can be a good optimization heuristic.
>>  I actually tried a similar approach to optimize search of attributes at the same position.
>>  Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>>  * the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem".
>>  With payloads and positions assigned the right way, this would allow us to search for
>>  an arbitrary combination of these attributes _but_ requires merging multiple postings lists.
>>  * the experimental approach: sort the atomics lexicographically and index all the
>>  subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". With the user
>>  query preprocessed the same way (split - sort - join), this would allow us to answer the
>>  same queries exactly from a single posting list.
>>  This technique is actually used in our current production index based on Yandex.Server.
>>  But Yandex.Server somehow keeps the index size reasonable (within the order of
>>  magnitude of the original text size), while the Lucene index blows up totally (>10 times
>>  the original text size) and shows no search performance improvement.
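The subset scheme described above can be sketched in plain Java: sort the atomics, then emit every non-empty subset as one indexable term, so any attribute combination in a query becomes a single-term lookup after the same split-sort-join:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AttributeSubsets {
    // "S,sg,nom,fem" -> sorted atomics -> all 2^n - 1 non-empty subsets,
    // each joined back into a single comma-separated term.
    static List<String> subsetTerms(String featureSet) {
        String[] atomics = featureSet.split(",");
        Arrays.sort(atomics); // lexicographic, so query-side sorting matches
        List<String> terms = new ArrayList<>();
        for (int mask = 1; mask < (1 << atomics.length); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < atomics.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (sb.length() > 0) sb.append(',');
                    sb.append(atomics[i]);
                }
            }
            terms.add(sb.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> t = subsetTerms("S,sg,nom,fem");
        System.out.println(t.size());                   // 15 terms for 4 atomics
        System.out.println(t.contains("S,fem,nom,sg")); // true
    }
}
```

The 2^n - 1 terms per token position also make the index blow-up described above easy to see: four atomics already multiply the posting count by 15.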
> That's really odd.  I would expect the index to become much larger, but
> search performance ought to be much faster, since you run a simple
> TermQuery.
> Mike McCandless

