lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Phrase Search
Date Mon, 18 Jun 2007 14:23:31 GMT
Erick Erickson wrote:
> Phrase queries won't help you here....
> Your particular issue can be addressed, but I'm not sure it's a
> reasonable long-term solution....
> If you indexed your address field as UN_TOKENIZED, and
> did NOT tokenize your query, it should give you what you want.
> What's happening is that StandardAnalyzer is indexing individual
> tokens, not phrases. So, doc 1 has the tokens
> "hiran", "magri"
> Doc 2 has the tokens
> "hiran", "magri", "sec", and "10"
> and so on...
> Searching, even for phrases, on "hiran magri" matches
> 4 docs because those two tokens appear next to each other in each of them.
> If, on the other hand, you index your address field UN_TOKENIZED,
> then doc1 has a "token" of "hiran magri", while doc 2 has a token
> of "hiran magri sec 10". Doc2 won't match a query on
> "hiran magri" etc.
> But, this may not be a good solution because searching on
> "hiran" won't match *any* document. You might have to index
> the same fields two different ways to get all the behavior you
> want.

Another good old trick is to index field values (tokenized) with 
appended special starting and ending tokens, e.g. instead of "Hiran 
Magri" use "_start_ Hiran Magri _end_". Then you can query for fields 
that are exactly equal to a phrase, while still retaining the 
possibility to search by individual terms and phrases not equal to the 
field value.
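
To make the idea concrete, here is a rough plain-Java sketch. It does not use
the Lucene API at all: tokenize() is a crude stand-in for an analyzer, and
containsPhrase() only imitates what a positional phrase match does. The class
and method names (SentinelPhraseDemo, indexTokens, phraseMatch, exactMatch) are
made up for illustration. In a real index you would also have to check that
your analyzer leaves the _start_ and _end_ tokens intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SentinelPhraseDemo {
    // Sentinel tokens from the message; any strings that never occur in real data would do.
    static final String START = "_start_";
    static final String END = "_end_";

    // Crude stand-in for an analyzer (assumption): lowercase, split on whitespace.
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Tokens as they would be indexed: the field value wrapped in the sentinels.
    static List<String> indexTokens(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        tokens.add(START);
        tokens.addAll(tokenize(fieldValue));
        tokens.add(END);
        return tokens;
    }

    // Imitates a phrase match: the phrase tokens must occur consecutively, in order.
    static boolean containsPhrase(List<String> docTokens, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(docTokens, phrase) >= 0;
    }

    // Ordinary phrase search: matches anywhere inside the field.
    static boolean phraseMatch(String fieldValue, String phrase) {
        return containsPhrase(indexTokens(fieldValue), tokenize(phrase));
    }

    // Exact-field search: the query phrase is wrapped in the same sentinels,
    // so it can only match a field whose whole value equals the phrase.
    static boolean exactMatch(String fieldValue, String phrase) {
        return containsPhrase(indexTokens(fieldValue), indexTokens(phrase));
    }

    public static void main(String[] args) {
        String doc1 = "Hiran Magri";
        String doc2 = "Hiran Magri Sec 10";

        System.out.println(phraseMatch(doc1, "hiran magri")); // true
        System.out.println(phraseMatch(doc2, "hiran magri")); // true
        System.out.println(exactMatch(doc1, "hiran magri"));  // true
        System.out.println(exactMatch(doc2, "hiran magri"));  // false
        System.out.println(phraseMatch(doc2, "hiran"));       // true, single terms still work
    }
}
```

The point of the sketch is that only one tokenized field is needed: wrapping
the query in the sentinels gives the "equals the whole field" behavior, and
leaving them off gives ordinary term and phrase search.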

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
