lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Problems to get suggestions from an intermediate word using AnalyzingSuggester
Date Tue, 26 Mar 2013 11:43:18 GMT
AnalyzingSuggester only matches by prefix, by design.
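
To illustrate what prefix-only matching means for the test case quoted
below, here is a minimal sketch in plain Java. It does not use Lucene: the
class and method names are hypothetical, and the FST walk is approximated by
String.startsWith. The only detail taken from the thread is the separator
label between tokens, which appears as \\U00000100 in the quoted automaton
dump.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PrefixMatchSketch {
    // Stand-in for the separator label between tokens
    // (printed as \\U00000100 in the automaton dump).
    static final char SEP = '\u0100';

    /** Joins the tokens into one analyzed path, e.g. www<SEP>google<SEP>com. */
    public static String analyzedForm(List<String> tokens) {
        return String.join(String.valueOf(SEP), tokens);
    }

    /**
     * Prefix-only lookup: an entry matches only if its analyzed form
     * starts with the analyzed form of the query.
     */
    public static List<String> lookup(List<List<String>> entries, List<String> queryTokens) {
        String query = analyzedForm(queryTokens);
        List<String> hits = new ArrayList<>();
        for (List<String> entry : entries) {
            if (analyzedForm(entry).startsWith(query)) {
                hits.add(String.join(".", entry)); // rebuild the surface form
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> entries =
                Collections.singletonList(Arrays.asList("www", "google", "com"));
        System.out.println(lookup(entries, Arrays.asList("www")));    // prints [www.google.com]
        System.out.println(lookup(entries, Arrays.asList("google"))); // prints []
    }
}
```

Under this simplification the tokenizer is doing its job: "google" is a
valid token, but it never sits at the start of the stored path, so a
prefix match cannot reach it.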

You can try AnalyzingInfixSuggester, which currently exists only as two
alternative patches on
https://issues.apache.org/jira/browse/LUCENE-4845

Please post back any feedback you have on the issue. As the issue
stands, I don't think either approach will be committed any time soon.
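
Until an infix suggester is available, one generic workaround with a
prefix-only suggester is to add an extra key for every token position, each
mapped back to the full URL. The sketch below shows the idea in plain Java.
It is an assumption for illustration only, not the approach taken by either
LUCENE-4845 patch: the names are hypothetical, and a sorted map stands in
for the suggester's prefix index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SuffixKeysSketch {
    /** One key per token position: the tokens from that position to the end. */
    public static List<String> suffixKeys(List<String> tokens) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            keys.add(String.join(" ", tokens.subList(i, tokens.size())));
        }
        return keys;
    }

    /** Maps every suffix key to the surface form (one surface per key, for brevity). */
    public static TreeMap<String, String> buildIndex(String surface, List<String> tokens) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String key : suffixKeys(tokens)) {
            index.put(key, surface);
        }
        return index;
    }

    /** Prefix lookup over the sorted keys, returning distinct surface forms. */
    public static List<String> lookup(TreeMap<String, String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // keys are sorted, so no later key can match
            }
            if (!hits.contains(e.getValue())) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> index =
                buildIndex("www.google.com", Arrays.asList("www", "google", "com"));
        System.out.println(lookup(index, "google")); // prints [www.google.com]
        System.out.println(lookup(index, "oogle"));  // prints []
    }
}
```

The trade-off is index size: an entry with n tokens produces n keys, and
only matches that begin at a token boundary are found.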

Mike McCandless

http://blog.mikemccandless.com

On Tue, Mar 26, 2013 at 3:45 AM, Andres Garcia <hgarcia@fi.upm.es> wrote:
> Hi all,
>
>
> My use case is very simple: given a string, I would like to suggest all the
> URLs that contain that string (within the limitations of the tokenizer and
> suggester). So far I have created a custom analyzer and tokenizer to parse
> URLs, and that analyzer is used to create an AnalyzingSuggester object.
> When I look up a suggestion using a prefix of a URL, it works fine.
> However, when I use a word from the middle of the URL, I don’t get any
> suggestions.
>
>
> Here is my test case. I have a single suggestion entry, “www.google.com”,
> in my TermFreq array. If I look up a suggestion for “www”, it returns the
> URL. If I look up a suggestion for “google”, the result is empty.
>
>
> My tokenizer splits the suggestion entry into the following
> (token, offset) tuples: (www,0:3),(google,4:10),(com,11:14). Please note
> that I’m getting rid of the dots.
>
>
> The automaton created for this entry is:
>
> state 0 [reject]: w -> 1
> state 1 [reject]: w -> 2
> state 2 [reject]: w -> 3
> state 3 [reject]: \\U00000100 -> 4
> state 4 [reject]: g -> 5
> state 5 [reject]: o -> 6
> state 6 [reject]: o -> 7
> state 7 [reject]: g -> 8
> state 8 [reject]: l -> 9
> state 9 [reject]: e -> 10
> state 10 [reject]: \\U00000100 -> 11
> state 11 [reject]: c -> 12
> state 12 [reject]: o -> 13
> state 13 [reject]: m -> 14
> state 14 [accept]:
>
>
> When I print the fst I get this: “wwwgooglecom”
>
>
> The automaton created for “google”:
>
> Initial state: 0
> state 0 [reject]: g -> 1
> state 1 [reject]: o -> 2
> state 2 [reject]: o -> 3
> state 3 [reject]: g -> 4
> state 4 [reject]: l -> 5
> state 5 [reject]: e -> 6
> state 6 [accept]:
>
>
> I think I have a problem with my tokenizer (I’m not an expert) and that it
> is affecting the creation of the first automaton. I really don’t know how
> to fix this. Any advice?
>
>
> best regards!


