lucene-java-user mailing list archives

From Andres Garcia <hgar...@fi.upm.es>
Subject Problems to get suggestions from an intermediate word using AnalyzingSuggester
Date Tue, 26 Mar 2013 07:45:03 GMT
Hi all,


My use case is simple: given a string, I would like to suggest all the
URLs that contain that string (within the limitations of the tokenizer
and suggester). So far I have written a custom analyzer and tokenizer to
parse URLs, and I use that analyzer to create an AnalyzingSuggester
object. When I look up a suggestion using a prefix of a URL, it works
fine. However, when I use a word from the middle of the URL, I don’t get
any suggestions.


Here is my test case. I have a single suggestion entry, “www.google.com”,
in my TermFreq array. If I look up suggestions for “www”, it returns the
URL. If I look up suggestions for “google”, the result is empty.


My tokenizer splits the suggestion entry into the following
(token, offset) tuples: (www,0:3), (google,4:10), (com,11:14). Note that
the tokenizer discards the dots.
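
A minimal plain-Java sketch of the splitting described above may make the tuples easier to check. It is not the actual Lucene tokenizer from this message: the class `UrlTokenSketch`, the `Token` holder, and the `tokenize` method are hypothetical names, and the sketch assumes that only `.` acts as a separator and that end offsets are exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not a Lucene Tokenizer.
class UrlTokenSketch {

    // Hypothetical holder for one (token,start:end) tuple; end is exclusive.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + ":" + end + ")";
        }
    }

    // Splits on '.', drops the dots, and records offsets into the input.
    static List<Token> tokenize(String url) {
        List<Token> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= url.length(); i++) {
            boolean atEnd = (i == url.length());
            if (atEnd || url.charAt(i) == '.') {
                if (i > start) { // skip empty tokens such as in "a..b"
                    tokens.add(new Token(url.substring(start, i), start, i));
                }
                start = i + 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [(www,0:3), (google,4:10), (com,11:14)]
        System.out.println(tokenize("www.google.com"));
    }
}
```

The offsets refer to positions in the original string, so the gaps between 3 and 4 and between 10 and 11 are where the dots were dropped.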


The automaton created for this entry is:

state 0 [reject]: w -> 1
state 1 [reject]: w -> 2
state 2 [reject]: w -> 3
state 3 [reject]: \\U00000100 -> 4
state 4 [reject]: g -> 5
state 5 [reject]: o -> 6
state 6 [reject]: o -> 7
state 7 [reject]: g -> 8
state 8 [reject]: l -> 9
state 9 [reject]: e -> 10
state 10 [reject]: \\U00000100 -> 11
state 11 [reject]: c -> 12
state 12 [reject]: o -> 13
state 13 [reject]: m -> 14
state 14 [accept]:


When I print the FST, I get this: “wwwgooglecom”


The automaton created for the query “google” is:

Initial state: 0
state 0 [reject]: g -> 1
state 1 [reject]: o -> 2
state 2 [reject]: o -> 3
state 3 [reject]: g -> 4
state 4 [reject]: l -> 5
state 5 [reject]: e -> 6
state 6 [accept]:
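
The two dumps can be compared with a small plain-Java sketch. It assumes that the suggester matches the analyzed query as a prefix of the analyzed entry, and that tokens are joined by the separator label shown in the dump (`\\U00000100`). The class `PrefixMatchSketch` and its methods are hypothetical; the sketch only mimics the comparison with strings and does not call Lucene.

```java
import java.util.Arrays;
import java.util.List;

// String-based illustration of prefix matching on the analyzed form.
class PrefixMatchSketch {

    // Separator label taken from the automaton dump (U+0100).
    static final char SEP = '\u0100';

    // Joins the tokens the way the dump suggests: token, SEP, token, ...
    static String analyzedForm(List<String> tokens) {
        return String.join(String.valueOf(SEP), tokens);
    }

    // Assumption: a query matches only if its analyzed form is a prefix
    // of the entry's analyzed form.
    static boolean matchesAsPrefix(List<String> entryTokens, List<String> queryTokens) {
        return analyzedForm(entryTokens).startsWith(analyzedForm(queryTokens));
    }

    public static void main(String[] args) {
        List<String> entry = Arrays.asList("www", "google", "com");
        System.out.println(matchesAsPrefix(entry, Arrays.asList("www")));    // true
        System.out.println(matchesAsPrefix(entry, Arrays.asList("google"))); // false
    }
}
```

Under that assumption, “www” is a prefix of the analyzed entry, while “google” only appears after the first separator, which would be consistent with the empty result.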


I suspect the problem is in my tokenizer (I’m not an expert) and that it
affects how the first automaton is built. I don’t know how to fix this.
Any advice?


Best regards!
