lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject Re: Lucene in-memory index
Date Fri, 25 Oct 2013 13:58:58 GMT
What is ProxBooleanTermQuery?
I couldn't find it in trunk or in the patch on that ticket
(https://issues.apache.org/jira/browse/LUCENE-2878).
Also, for now it's very fuzzy to me how searching and scoring work. Are there any tutorials
or talks on how Queries, Scorers, and Collectors interoperate?


-- 
Igor

23.10.2013, 19:06, "Michael McCandless" <lucene@mikemccandless.com>:
> On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
> <ishalyminov@yandex-team.ru> wrote:
>
>>  Thanks for the link, I'll definitely dig into SpanQuery internals very soon.
>
> You could also just make a custom query.  If you start from the
> ProxBooleanTermQuery on that issue, but change it so that it rejects
> hits that didn't have terms in the right positions, then you'll likely
> have a much faster way to do your query.
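The "reject hits that didn't have terms in the right positions" step can be sketched in plain Java. This is not the ProxBooleanTermQuery code from the issue and it does not use any Lucene API. It assumes the sorted positions of two terms within one document have already been read into arrays, and the class and method names (AdjacentPositions, hasAdjacentPair) are made up for illustration.

```java
final class AdjacentPositions {

    /**
     * Returns true if some position p in 'first' has p + 1 in 'second',
     * i.e. the second term directly follows the first one (ordered, slop=0).
     * Both arrays must be sorted in ascending order.
     */
    static boolean hasAdjacentPair(int[] first, int[] second) {
        int i = 0;
        int j = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;   // position the second term must occupy
            if (second[j] < wanted) {
                j++;                     // second term is too early, advance it
            } else if (second[j] > wanted) {
                i++;                     // no match for this occurrence, try the next
            } else {
                return true;             // found an adjacent pair
            }
        }
        return false;                    // one list ran out without a match
    }

    public static void main(String[] args) {
        int[] adjective = {1, 5};
        int[] noun = {3, 6};
        System.out.println(hasAdjacentPair(adjective, noun));
    }
}
```

A custom scorer would run a check like this per candidate document and skip the document when it returns false, instead of scoring it.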
>
>>>>   For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>>  I didn't even realize you could pass negative slop to span queries.
>>>  What does that do?  Or did you mean slop=1?
>>  I do use an unordered SpanNearQuery with slop = -1 (I saw it on some forum,
>>  maybe here: http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)
>
> Wow, OK.  I have no idea what slop=-1 does...
>
>>  So far it works for me:)
>>>>   I wrap them into an ordered SpanNearQuery with the slop=0.
>>>>
>>>>   I see getPayload() at the top of the profiler. I think I can emulate payload
>>>>   checking with cleverly assigned position increments (the maximum position in a
>>>>   document might then jump up to ~10^9; I hope it won't blow up the whole index).
>>>>
>>>>   If I remove payload matching and keep only position checking, will that
>>>>   speed everything up, or are positions and payloads the same in that respect?
>>>  I think it would help to avoid payloads, but I'm not sure by how much.
>>>   E.g., I see that NearSpansOrdered creates a new Set for every hit
>>>  just to hold payloads, even if payloads are not going to be used.
>>>  Really the span scorers should check Terms.hasPayloads up front ...
>>>>   My main goal is getting precise results for a query, so proximity
>>>>   boosting won't help, unfortunately.
>>>  OK.
>>>
>>>  I wonder if you can somehow identify the spans you care about at
>>>  indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>>>  index at that point; this would make searching much faster (it becomes
>>>  a TermQuery).  For exact matching (slop=0) you can also index
>>>  shingles.
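As a rough illustration of the shingle idea above: each pair of adjacent tokens is joined into one extra term at indexing time, so an exact "A,sg followed by N,sg" lookup becomes a single-term lookup. The sketch below is plain Java with made-up names (PairShingles, shingles) and an assumed "_" separator. It is not Lucene's analysis chain, although Lucene does ship a ShingleFilter for this purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PairShingles {

    /** Joins every pair of adjacent tokens into one two-token shingle. */
    static List<String> shingles(List<String> tokens, String separator) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + separator + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A,sg", "N,sg", "V,pl");
        // Each shingle would be indexed as an extra term next to the originals.
        System.out.println(shingles(tokens, "_"));
    }
}
```

At query time the two query terms would be joined with the same separator and looked up as one term.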
>>  Thanks for the clue, I think it can be a good optimization heuristic.
>>  I actually tried a similar approach to optimize searching for attributes at the
>>  same position.
>>  Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>>
>>  * the regular approach: split it into atomic grammar attributes: "S", "sg", "nom",
>>    "fem". With payloads and positions assigned the right way, this would allow us to
>>    search for an arbitrary combination of these attributes, _but_ at the cost of
>>    merging multiple postings lists.
>>  * the experimental approach: sort the atomic attributes lexicographically and index
>>    all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg".
>>    With the user query preprocessed the same way (split - sort - join), it would
>>    allow us to process the same queries within exactly one postings list.
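The split - sort - join scheme from the experimental approach can be sketched as follows. This is plain Java with hypothetical names (AttributeSubsets, normalize, allSubsets), not code from either index. Note that a feature set with n atomic attributes yields 2^n - 1 subset terms per token (15 for "S,sg,nom,fem"), which is worth keeping in mind when looking at index size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AttributeSubsets {

    /** Query side: split on commas, sort, and join back into one term. */
    static String normalize(String featureSet) {
        String[] atoms = featureSet.split(",");
        Arrays.sort(atoms);              // natural String order: "S" < "fem" < "nom" < "sg"
        return String.join(",", atoms);
    }

    /** Index side: every non-empty subset of the sorted attributes, as a joined term. */
    static List<String> allSubsets(String featureSet) {
        String[] atoms = featureSet.split(",");
        Arrays.sort(atoms);
        List<String> terms = new ArrayList<>();
        // Each bit of 'mask' decides whether the attribute at that index is included.
        for (int mask = 1; mask < (1 << atoms.length); mask++) {
            StringBuilder term = new StringBuilder();
            for (int k = 0; k < atoms.length; k++) {
                if ((mask & (1 << k)) != 0) {
                    if (term.length() > 0) {
                        term.append(',');
                    }
                    term.append(atoms[k]);
                }
            }
            terms.add(term.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> indexed = allSubsets("S,sg,nom,fem");
        System.out.println(indexed.size() + " terms: " + indexed);
        System.out.println(indexed.contains(normalize("sg,S")));
    }
}
```

Because both sides sort before joining, a query such as "sg,S" normalizes to a term that was indexed verbatim, so it can be answered from a single postings list.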
>>
>>  This technique is actually used in our current production index, which is based on
>>  the Yandex.Server engine.
>>  But Yandex.Server somehow keeps the index size reasonable (within an order of
>>  magnitude of the original text size), while the Lucene index blows up completely
>>  (>10 times the original text size) and search performance does not improve.
>
> That's really odd.  I would expect index to become much larger, but
> search performance ought to be much faster since you run simple
> TermQuery.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
