lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject Under the hood of SpanQueries
Date Wed, 03 Apr 2013 21:55:22 GMT
Hi all!

I have a ~20GB index of documents that have words with several attributes associated with
them, e.g.:

WORD: word_1 word_2 ... word_n
POS:    pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 ... lemma_n_1:lemma_n_2

Field tokens separated by ':' are ambiguous, i.e. they correspond to the same position in
the document.
An important detail of ambiguous word attributes is that the readings are aligned: e.g., pos1_1
corresponds only to lemma1_1, not to lemma1_2 or lemma1_3, so word_1 must not match a search
for pos1_1 & lemma1_3 at the same position.

I handle the positions of ambiguous tokens with the standard positionIncrement = 0, and the
correspondence between attribute readings with token payloads. Say, lemma1_1 has payload = 1
and lemma1_2 has payload = 2; likewise pos1_1 has payload = 1, pos1_2 has payload = 2, and so on.
When searching for token attributes at the same position, I use a payload filter that checks
whether the payloads of all matched tokens are the same.
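
To make the scheme concrete, here is a minimal plain-Java sketch of it. It does not use any Lucene classes: Token, tokenize and matchesAtSamePosition are made-up names for illustration only, and the nested loop stands in for what the span query plus payload filter is meant to check, not for how Lucene evaluates it.

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadAlignmentSketch {

    /** Hypothetical stand-in for an indexed token: term, absolute position, payload. */
    static final class Token {
        final String term;
        final int position;
        final int payload;

        Token(String term, int position, int payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * Splits a field value such as "a_1:a_2 b" into tokens.
     * Readings separated by ':' share one position (the equivalent of
     * positionIncrement = 0) and get payloads 1, 2, 3, ... in order.
     */
    static List<Token> tokenize(String fieldValue) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (String slot : fieldValue.trim().split("\\s+")) {
            position++; // a new whitespace-separated slot advances the position by 1
            String[] readings = slot.split(":");
            for (int i = 0; i < readings.length; i++) {
                tokens.add(new Token(readings[i], position, i + 1));
            }
        }
        return tokens;
    }

    /**
     * True if some position holds both terms with equal payloads,
     * i.e. both terms belong to the same reading of the same word.
     */
    static boolean matchesAtSamePosition(List<Token> posTokens, List<Token> lemmaTokens,
                                         String pos, String lemma) {
        for (Token p : posTokens) {
            if (!p.term.equals(pos)) {
                continue;
            }
            for (Token l : lemmaTokens) {
                if (l.term.equals(lemma)
                        && l.position == p.position
                        && l.payload == p.payload) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> pos = tokenize("pos1_1:pos1_2:pos1_3 pos2");
        List<Token> lemma = tokenize("lemma1_1:lemma1_2:lemma1_3 lemma2");
        // Same position, same payload (1 and 1): should match.
        System.out.println(matchesAtSamePosition(pos, lemma, "pos1_1", "lemma1_1"));
        // Same position, different payloads (1 and 3): must not match.
        System.out.println(matchesAtSamePosition(pos, lemma, "pos1_1", "lemma1_3"));
    }
}
```

With the sample fields above, the first check succeeds and the second is rejected, which is exactly the behaviour I need from the payload filter.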

The problem: SpanNearQueries run very slowly on that index (tens of seconds, with the majority
of indexed documents matching a common query).
I don't actually know how SpanQueries work in depth. Are they inefficient by design, or is
payload retrieval that expensive?
I'm just wondering if I'm missing something obvious that slows down the entire search.

-- 
Best regards,
Igor
