lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene in-memory index
Date Wed, 23 Oct 2013 15:05:32 GMT
On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
<ishalyminov@yandex-team.ru> wrote:

> Thanks for the link, I'll definitely dig into SpanQuery internals very soon.

You could also just make a custom query.  If you start from the
ProxBooleanTermQuery on that issue and change it so that it rejects
hits whose terms are not in the right positions, you'll likely end up
with a much faster way to run your query.
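As a rough illustration of what such a query's scorer would do under the hood, here is a
minimal sketch that walks one term's positions through the postings API; a real custom query
would intersect these positions with the other terms' positions and reject non-matching docs.
This assumes Lucene 4.5-era APIs, and the helper class is purely illustrative - it is not the
ProxBooleanTermQuery from the issue:

import java.io.IOException;

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public class PositionCheck {

  /** Prints every (doc, position) where term occurs in field.
   *  A custom query's scorer would walk positions like this and
   *  reject docs whose positions don't line up with the other terms. */
  static void dumpPositions(IndexReader reader, String field, String term)
      throws IOException {
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) return;
    TermsEnum te = terms.iterator(null);
    if (!te.seekExact(new BytesRef(term))) return;
    DocsAndPositionsEnum dpe =
        te.docsAndPositions(MultiFields.getLiveDocs(reader), null);
    if (dpe == null) return;  // field was indexed without positions
    int doc;
    while ((doc = dpe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      for (int i = 0; i < dpe.freq(); i++) {
        System.out.println("doc=" + doc + " pos=" + dpe.nextPosition());
      }
    }
  }
}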

>>>  For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>
>> I didn't even realize you could pass negative slop to span queries.
>> What does that do?  Or did you mean slop=1?
>
> I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on some forum,
> maybe here: http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)

Wow, OK.  I have no idea what slop=-1 does...
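For reference, a hedged sketch of the construction described in this exchange - an unordered
SpanNearQuery with slop = -1 per attribute pair, wrapped in an ordered SpanNearQuery with
slop = 0 (as mentioned just below), using the "A,sg followed by N,sg" example that comes up
later in the thread. The field name "grammar" and the assumption that each grammar atomic is
indexed as its own token are illustrative, and as noted above the behaviour of slop = -1 is
undocumented:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class GrammarSpans {
  public static void main(String[] args) {
    // "A" and "sg" must sit at the same token position: the unordered,
    // slop = -1 trick discussed above
    SpanQuery aSg = new SpanNearQuery(
        new SpanQuery[] {
            new SpanTermQuery(new Term("grammar", "A")),
            new SpanTermQuery(new Term("grammar", "sg")) },
        -1, false);

    SpanQuery nSg = new SpanNearQuery(
        new SpanQuery[] {
            new SpanTermQuery(new Term("grammar", "N")),
            new SpanTermQuery(new Term("grammar", "sg")) },
        -1, false);

    // "A,sg" immediately followed by "N,sg": ordered, slop = 0
    SpanQuery phrase =
        new SpanNearQuery(new SpanQuery[] { aSg, nSg }, 0, true);

    System.out.println(phrase);
  }
}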

> So far it works for me:)
>
>>
>>>  I wrap them into an ordered SpanNearQuery with the slop=0.
>>>
>>>  I see getPayload() at the top of the profiler output. I think I can emulate payload
>>> checking with cleverly assigned position increments (the maximum position in a document
>>> might then jump up to ~10^9 - I hope it won't blow the whole index up).
>>>
>>>  If I remove payload matching and keep only position checking, will it speed
>>> everything up, or do positions and payloads cost the same to read?
>>
>> I think it would help to avoid payloads, but I'm not sure by how much.
>>  E.g., I see that NearSpansOrdered creates a new Set for every hit
>> just to hold payloads, even if payloads are not going to be used.
>> Really the span scorers should check Terms.hasPayloads up front ...
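One way to make the payload-free, positions-only approach concrete is to stack all the
atomics of a feature set at the same token position at index time, so the same-position check
needs nothing beyond positions. A minimal sketch, assuming the atomics arrive as a single
comma-separated token; the filter name and the comma convention are made up for illustration:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Splits a feature-set token such as "A,sg" into its atomics and puts them
 *  all at the same position (posIncr = 0 for every atomic after the first),
 *  so that same-position queries need no payloads at all. */
public final class AtomicsAtSamePositionFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private String[] pending;   // remaining atomics of the current token
  private int pendingIndex;

  public AtomicsAtSamePositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null && pendingIndex < pending.length) {
      termAtt.setEmpty().append(pending[pendingIndex++]);
      posIncAtt.setPositionIncrement(0);   // stack on the previous atomic
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pending = termAtt.toString().split(",");
    pendingIndex = 0;
    termAtt.setEmpty().append(pending[pendingIndex++]);
    // keep the original increment so the first atomic advances the position
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    pendingIndex = 0;
  }
}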
>>
>>>  My main goal is getting precise results for a query, so proximity boosting
>>> won't help, unfortunately.
>>
>> OK.
>>
>> I wonder if you can somehow identify the spans you care about at
>> indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>> index at that point; this would make searching much faster (it becomes
>> a TermQuery).  For exact matching (slop=0) you can also index
>> shingles.
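For the exact-match (slop=0) case, a hedged sketch of the shingle idea using Lucene's
ShingleFilter - here just dumping the tokens it would produce, assuming a Lucene 4.5-era
analysis chain and that each feature set is indexed as a single whitespace-separated token:

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    Reader reader = new StringReader("A,sg N,sg V,pl");
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_45, reader);
    // index 2-word shingles alongside the unigrams, so an exact
    // (slop=0) two-term phrase becomes a single TermQuery at search time
    ShingleFilter shingles = new ShingleFilter(ts, 2, 2);
    shingles.setOutputUnigrams(true);

    CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
    shingles.reset();
    while (shingles.incrementToken()) {
      System.out.println(term.toString());   // "A,sg", "A,sg N,sg", "N,sg", ...
    }
    shingles.end();
    shingles.close();
  }
}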
>
> Thanks for the clue, I think it can be a good optimization heuristic.
> I actually tried a similar approach to optimize search of attributes at the same position.
> Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>
> * the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem". With
> payloads and positions assigned the right way, this would allow us to search for an
> arbitrary combination of these attributes _but_ with multiple postings lists to merge.
> * the experimental approach: sort the atomics lexicographically and index all the subsets:
> "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". With the user query
> preprocessed the same way (split - sort - join), the same queries can be answered from
> exactly one postings list.
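A plain-Java sketch of that index-time expansion, to make the subset scheme concrete (the
helper is hypothetical, not code from either engine):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FeatureSubsets {

  /** Index-time expansion: sort the atomics of e.g. "S,sg,nom,fem" and emit
   *  every non-empty subset as one comma-joined term ("S", "fem", ...,
   *  "S,fem,nom,sg") - 2^n - 1 terms for n atomics. */
  static List<String> subsetTerms(String featureSet) {
    String[] atoms = featureSet.split(",");
    Arrays.sort(atoms);
    int n = atoms.length;
    List<String> terms = new ArrayList<String>();
    for (int mask = 1; mask < (1 << n); mask++) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < n; i++) {
        if ((mask & (1 << i)) != 0) {
          if (sb.length() > 0) sb.append(',');
          sb.append(atoms[i]);
        }
      }
      terms.add(sb.toString());
    }
    return terms;
  }

  public static void main(String[] args) {
    // prints 15 subset terms, from "S" up to "S,fem,nom,sg"
    System.out.println(subsetTerms("S,sg,nom,fem"));
  }
}

Indexing 2^n - 1 terms per feature set instead of n is also why the index grows substantially,
as described below.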
>
> This technique is actually used in our current production index, which is based on the
> Yandex.Server engine.
> But Yandex.Server somehow keeps the index size reasonable (within an order of magnitude
> of the original text size), while the Lucene index blows up completely (> 10 times the
> original text size) and shows no search performance improvement.

That's really odd.  I would expect the index to become much larger, but
searches ought to be much faster since you are running a simple
TermQuery.
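For completeness, a hedged sketch of the query side of that scheme - normalize the user's
attribute combination with the same split - sort - join step and run a single TermQuery over
it (the field name "grammar" is assumed, as above):

import java.util.Arrays;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SubsetSearch {

  /** Query-time normalization: split - sort - join, so any attribute
   *  combination maps onto exactly one indexed subset term. */
  static String normalize(String userQuery) {
    String[] atoms = userQuery.split(",");
    Arrays.sort(atoms);
    StringBuilder sb = new StringBuilder();
    for (String atom : atoms) {
      if (sb.length() > 0) sb.append(',');
      sb.append(atom);
    }
    return sb.toString();
  }

  static TopDocs search(IndexReader reader, String userQuery) throws Exception {
    IndexSearcher searcher = new IndexSearcher(reader);
    // a single TermQuery over one postings list - no positions, no merging
    return searcher.search(
        new TermQuery(new Term("grammar", normalize(userQuery))), 10);
  }
}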

Mike McCandless

http://blog.mikemccandless.com


