lucene-java-user mailing list archives

From Igor Shalyminov
Subject Under the hood of SpanQueries
Date Wed, 03 Apr 2013 21:55:22 GMT
Hi all!

I have a ~20GB index of documents that have words with several attributes associated with
them, e.g.:

WORD: word_1 word_2 ... word_n
POS:   pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 ... lemma_n_1:lemma_n_2

Tokens separated by ':' within a field are ambiguous readings, i.e. they occupy the same position
in the document.
An important detail is that the readings are aligned across fields: e.g., pos1_1 corresponds only
to lemma1_1, not to lemma1_2 or lemma1_3, so a search for pos1_1 & lemma1_3 at the same position
must not match word_1.

I handle the shared position of ambiguous tokens with the standard positionIncrement = 0, and the
correspondence between readings with token payloads: lemma1_1 has payload 1, lemma1_2 has
payload 2; pos1_1 has payload 1, pos1_2 has payload 2, and so on. When searching for token
attributes at the same position, I use a payload filter that checks that the payloads of all
matched tokens are equal.
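
The scheme above can be illustrated with a minimal sketch in plain Java. It does not use the Lucene API, and it is not the actual indexing or query code from this setup: the `Token` class and the `index` and `match` methods are hypothetical names introduced only to show the intended semantics (readings sharing a position, payloads numbering the readings, and a match requiring equal payloads).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PayloadAlignedMatch {

    /** Hypothetical stand-in for an indexed token: field, term, position, payload. */
    public static final class Token {
        public final String field;
        public final String term;
        public final int position;
        public final int payload;

        public Token(String field, String term, int position, int payload) {
            this.field = field;
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * Emulates indexing one field value. Whitespace separates positions;
     * ':' separates ambiguous readings, which all share one position
     * (the effect of positionIncrement = 0) and get payloads 1, 2, ...
     */
    public static List<Token> index(String field, String text) {
        List<Token> tokens = new ArrayList<>();
        String[] words = text.trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            String[] readings = words[pos].split(":");
            for (int i = 0; i < readings.length; i++) {
                tokens.add(new Token(field, readings[i], pos, i + 1));
            }
        }
        return tokens;
    }

    /**
     * Returns the positions where posTerm (field POS) and lemmaTerm
     * (field LEMMA) occur at the same position with the same payload.
     * Naive nested loop, for clarity rather than speed.
     */
    public static Set<Integer> match(List<Token> tokens, String posTerm, String lemmaTerm) {
        Set<Integer> hits = new TreeSet<>();
        for (Token a : tokens) {
            if (!a.field.equals("POS") || !a.term.equals(posTerm)) {
                continue;
            }
            for (Token b : tokens) {
                boolean sameSlot = a.position == b.position && a.payload == b.payload;
                if (sameSlot && b.field.equals("LEMMA") && b.term.equals(lemmaTerm)) {
                    hits.add(a.position);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.addAll(index("POS", "pos1_1:pos1_2:pos1_3 pos2"));
        tokens.addAll(index("LEMMA", "lemma1_1:lemma1_2:lemma1_3 lemma2"));

        // Same position, same payload: matches at position 0.
        System.out.println(match(tokens, "pos1_1", "lemma1_1")); // prints [0]
        // Same position, different payloads: must not match.
        System.out.println(match(tokens, "pos1_1", "lemma1_3")); // prints []
    }
}
```

The sketch only captures what a correct match looks like; it says nothing about how SpanNearQuery enumerates candidate spans or how payloads are read from the index, which is where the cost in question would arise.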

The problem: SpanNearQueries run very slowly on that index (tens of seconds, with the majority
of indexed documents matching a typical query).
I don't actually know how SpanQueries work in depth. Is there some inefficiency in them
by design? Or is payload retrieval that expensive?
I'm just wondering whether I'm missing something obvious that slows down the entire search.

Best regards,
