lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene in-memory index
Date Sat, 19 Oct 2013 10:53:01 GMT
On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov
<ishalyminov@yandex-team.ru> wrote:
> But why is it so costly?

I think because the matching is inherently complex?  But also because
it does high-cost things like allocating new List and Set for every
matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to hold
all payloads it encountered within each span. Patches welcome!

> In a regular query we walk postings and match document numbers, in a SpanQuery we match
position numbers (or position segments), what's the principal difference?
> I think it's just that #documents << #positions.

Conceptually, that's right, we just need to decode "more ints" (and
also the payloads), but then need to essentially merge-sort the
positions of N terms, and then "coalesce" them into spans, is at heart
rather costly.  Lots of hard-for-CPU-to-predict branches...

But I suspect we could get some good speedups on span queries with a
better implementation;
https://issues.apache.org/jira/browse/LUCENE-2878 is [slowly]
exploring making positions "first class" in Scorer, so you can iterate
over position + payload for each hit.

> For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.

I didn't even realize you could pass negative slop to span queries.
What does that do?  Or did you mean slop=1?

> I wrap them into an ordered SpanNearQuery with the slop=0.
>
> I see getPayload() in the profiler top. I think I can emulate payload checking with cleverly
assigned position increments (and then maximum position in a document might jump up to ~10^9
- I hope it won't blow the whole index up).
>
> If I remove payload matching and keep only position checking, will it speed up everything,
or the positions and payloads are the same?

I think it would help to avoid payloads, but I'm not sure by how much.
 E.g., I see that NearSpansOrdered creates a new Set for every hit
just to hold payloads, even if payloads are not going to be used.
Really the span scorers should check Terms.hasPayloads up front ...

> My main goal is getting the precise results for a query, so proximity boosting won't
help, unfortunately.

OK.

I wonder if you can somehow identify the spans you care about at
indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
index at that point; this would make searching much faster (it becomes
a TermQuery).  For exact matching (slop=0) you can also index
shingles.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message