lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Search with wild-cards by words with forward-slash ("/")
Date Thu, 24 Sep 2009 18:09:36 GMT
First, really think about getting a copy of Luke to help you investigate
what'sactually in your index, it's invaluable. It'll also let you try
running queries
through different analyzers and seeing the results.

But I think you're a bit fuzzy on what analyzers do. Their primary purpose
is to break the incoming stream up into *tokens*. By indexing
untokenized, you're bypassing processing by the analyzers. Further,
when indexing and searching with *different* analyzers, all hell breaks
loose unless you know *exactly* what you're doing <G>...

Lucene isn't "replacing the / with a space". Instead, StandardAnalyzer
treats most non-alphanum characters as delimiters, so what's happening
on search is cat/test is being broken up into "cat" and "test" as
separate tokens. The '/' is just completely lost.

Try this (sorry, I haven't got a handy test bed at the moment, so this is
from very rusty memory). Index and search with WhitespaceAnalyzer
(watch out, this will be case sensitive but that can be fixed pretty
easily). And index things tokenized, then try your searches.


KeywordAnalyzer is deliberately stupid. It's intended for fields that should
not be altered in any way. Besides, you're searching with one analyzer
and indexing with another I think. See the caution above.

HTH
Erick

On Thu, Sep 24, 2009 at 1:49 PM, coldserenity <muntyanu.roman@gmail.com>wrote:

>
> Hello,
>
>    I've been searching the forum and found several more or less relevant
> topic listed below.
>
>
> http://www.nabble.com/Parsing-text-containing-forward-slash-and-wildcard-td13541503.html#a13541503
>
> http://www.nabble.com/Parsing-text-containing-forward-slash-and-wildcard-td13541503.html#a13541503
>
>
> http://www.nabble.com/Search-that-supports-all-valid-characters-in-a-Unix-filename-td11495983.html
>
> http://www.nabble.com/Search-that-supports-all-valid-characters-in-a-Unix-filename-td11495983.html
>
>
> http://www.nabble.com/Payloads%2C-Tokenizers%2C-and-Filters.--Oh-My!-td13806662.html
>
> http://www.nabble.com/Payloads%2C-Tokenizers%2C-and-Filters.--Oh-My!-td13806662.html
>    However they did not help me much and that's why I've decided to start a
> separate topic.
>
>    Basically what I need to do, is to be able to search '/' (forward slash)
> symbol much the same as any alpha-numeric symbol AND with using wildcards.
> E.g. if I have following records in my index
>      1 |  /cat1
>      2 |  /cat1/test
>      3 |  /cat1/tea
>      4 |  /cat2/test
>    Search by
>      /cat1* - will return records 1, 2, 3
>      /cat1/test - will return record 2
>      /cat1/t* - will return records 2, 3
>      /cat1/t - will return no records
>
>    I use Lucene with Compass. The searched property is indexed as
> "untokenized".
>    I have tried using StandardAnalyzer and KeywordAnalyzer but got the
> following in the
>
>    Results from testing
> org.apache.lucene.analysis.standard.StandardAnalyzer
>        Search string:             rootKey:/cat1/te*
>        Rewritten search string:   rootKey:/cat1/te*
>        Found results          :   0
>
>        Search string:             rootKey:/cat1/test
>        Rewritten search string:   rootKey:"cat1 test"
>        Found results          :   1
>
>        Search string:             rootKey:/cat1/te
>        Rewritten search string:   rootKey:"cat1 te"
>        Found results          :   0
>
>    Results from testing org.apache.lucene.analysis.KeywordAnalyzer
>        Search string:             rootKey:/cat1/te
>        Rewritten search string:   rootKey:/cat1/te
>        Found results          :   0
>
>        Search string:             rootKey:/cat1/te*
>        Rewritten search string:   rootKey:/cat1/te*
>        Found results          :   0
>
>    So it looks like Lucene internally replaces the '/' with whitespace when
> searching however I did not find any "replace symbol" code in the source
> code of StandardAnalyzer, neither I had much luck in finding that place
> during debugging the Lucene sources.
>
>    With that all said, I would appreciate any hint on how this task could
> be accomplished.
>
> Regards,
>  Roman
> --
> View this message in context:
> http://www.nabble.com/Search-with-wild-cards-by-words-with-forward-slash-%28%22-%22%29-tp25578017p25578017.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message