lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+exact matching
Date Tue, 16 Sep 2008 20:59:31 GMT

: The query parser expects you to assign positionIncrement=0 for synonyms
: in this manner.

correct.

: The one kludge i see is that the QueryParser expects the total positions
: found to be greater than or equal to one. It might not be intentionally
: dealing with the total position count being zero. But the situation
: where you have many synonyms is the same as having one token and it
: having many synonyms, so positionCount=0 == positionCount=1.

There has definitely been some wonkiness in various places in the code
relating to the first token not having a positionIncrement of "1" ... I
don't remember the details, and maybe it works fine even if every token in
a stream is "0", but the safe thing to do is make sure the first token has
a positionIncrement of "1" and the "synonyms" after that use an increment
of "0".

This is important not only in case the Lucene internals freak out when
the "first" token has an increment of "0", but also because you have no way
of knowing if the first token you produce is really the first token being
given to the IndexWriter (or QueryParser or what have you).
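
To make the arithmetic concrete, here is a minimal sketch in plain Java. It
is not Lucene code: PositionModel and positions() are hypothetical names,
and the rule "position = running sum of increments, minus one" is an
assumed simplification chosen so that a first increment of "1" lands on
position 0.

```java
import java.util.Arrays;

// Simplified model, NOT Lucene code: a token's position is the running sum
// of the increments seen so far, offset so that a first increment of 1
// yields position 0. Class and method names are hypothetical.
final class PositionModel {

    // Returns the absolute position of each token, given its increment.
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // nothing consumed yet
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // One token followed by two synonyms: increments 1, 0, 0.
        System.out.println(Arrays.toString(positions(new int[] {1, 0, 0})));
        // prints [0, 0, 0]

        // Every increment is 0: in this model no token ever reaches position 0.
        System.out.println(Arrays.toString(positions(new int[] {0, 0, 0})));
        // prints [-1, -1, -1]
    }
}
```

In this model both streams put all three tokens at one shared position,
but only the first stream puts them at a valid, non-negative one. Whether
real Lucene code tolerates the second case is exactly the uncertainty
described above.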

To be a well-behaved TokenStream producer you can't assume you operate in
a vacuum:

1) multiple "Field" instances with the same field name could be added to a 
document, with an Analyzer that uses your Filter but doesn't define any 
particular positionIncrementGap ... if every token you produce has an 
increment of "0" all the tokens from the second Field instance will have 
the same resulting positions as all the tokens from the first Field 
instance (i.e., they will all be considered synonyms of each other).

2) I could write an Analyzer that uses your Filter but always adds a
starting "marker token" to the front of the TokenStream and a different
ending marker token to the end of the stream (for doing creative things
with SpanNearQueries) ... if all the tokens you produce have a
positionIncrement of 0, the result would be that they would be considered
synonyms of the starting marker token.
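
Both scenarios can be sketched with the same kind of toy model. This is
plain Java, not Lucene code: StreamComposition and positionsAcross() are
hypothetical names, and the model assumes the running position simply
carries over from one stream to the next (i.e. no positionIncrementGap).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, NOT Lucene code. Names are hypothetical. The running
// position is assumed to carry over between consecutive streams, which
// corresponds to a position increment gap of zero.
final class StreamComposition {

    // Each inner array holds the increments produced by one stream
    // (one Field instance, or one marker token).
    static List<Integer> positionsAcross(int[][] streams) {
        List<Integer> result = new ArrayList<>();
        int position = -1; // nothing consumed yet
        for (int[] increments : streams) {
            for (int inc : increments) {
                position += inc;
                result.add(position);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Case 1: two Field instances with the same name.
        // First token 1, synonyms 0: the second instance gets its own position.
        System.out.println(positionsAcross(new int[][] {{1, 0, 0}, {1, 0, 0}}));
        // prints [0, 0, 0, 1, 1, 1]

        // All increments 0: both instances collapse onto one position.
        System.out.println(positionsAcross(new int[][] {{0, 0, 0}, {0, 0, 0}}));
        // prints [-1, -1, -1, -1, -1, -1]

        // Case 2: start marker, filter output, end marker.
        // All-zero filter output shares position 0 with the start marker.
        System.out.println(positionsAcross(new int[][] {{1}, {0, 0}, {1}}));
        // prints [0, 0, 0, 1]

        // Well-behaved filter output sits between the markers instead.
        System.out.println(positionsAcross(new int[][] {{1}, {1, 0}, {1}}));
        // prints [0, 1, 1, 2]
    }
}
```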

: I would think that both should lead to a BooleanQuery being constructed
: by the QueryParser. (But the synonyms generated by the ShingleFilter are
: in fact phrases so maybe it is wiser to use the MultiPhraseQuery.)

If QueryParser gives an Analyzer a chunk of text, and it produces a stream
of tokens that all exist at the same position, it produces a
BooleanQuery, if they are *all* at different positions it produces a
PhraseQuery, if *some* are at the same position, it produces a
MultiPhraseQuery ... this is fairly fundamental to how QueryParser works,
and can be relied upon.
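
The three-way rule above can be written down as a small decision function.
This is a sketch in plain Java that only mirrors the rule as stated here;
it is not QueryParser's actual code. QueryShape and shapeFor() are
hypothetical names, and the single-token and empty cases are deliberately
rejected because the rule above doesn't cover them.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the rule described above, NOT QueryParser's implementation.
// Names are hypothetical. Input is the positionIncrement of each token.
final class QueryShape {

    static String shapeFor(int[] increments) {
        if (increments.length < 2) {
            // Not covered by the rule being illustrated.
            throw new IllegalArgumentException("need at least two tokens");
        }
        // Collect the distinct positions the tokens land on.
        Set<Integer> distinct = new HashSet<>();
        int position = -1;
        for (int inc : increments) {
            position += inc;
            distinct.add(position);
        }
        if (distinct.size() == 1) {
            return "BooleanQuery";      // all tokens at the same position
        }
        if (distinct.size() == increments.length) {
            return "PhraseQuery";       // every token at its own position
        }
        return "MultiPhraseQuery";      // some, but not all, positions shared
    }

    public static void main(String[] args) {
        System.out.println(shapeFor(new int[] {1, 0, 0})); // prints BooleanQuery
        System.out.println(shapeFor(new int[] {1, 1, 1})); // prints PhraseQuery
        System.out.println(shapeFor(new int[] {1, 0, 1})); // prints MultiPhraseQuery
    }
}
```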




-Hoss

