lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lance Norskog <goks...@gmail.com>
Subject Re: Solr SynonymFilter in Lucene analyzer
Date Thu, 19 Aug 2010 05:28:04 GMT
Yes, you need an analyzer that leaves successive words together as one
long term. This might be easier to do with the new CharFilter tool,
which processes text before it goes to the tokenizer.

What you are doing here is similar to Parts-Of-Speech analysis, where
text analysis software parses a sentence and labels words 'Noun',
'Verb', etc. One suite stores these labels as payloads on the terms.
This might be a better way to store your categories, rather than using
the synonym filter.

On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan
<arunrangarajan@gmail.com> wrote:
> I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter
> is the one that prevents multi-word synonyms like "New York" from getting
> mapped to the generic synonym name like CONCEPTYcity. It appears to me that
> an analyzer which recognizes that a white-space is inside a synonym like
> "New York" will be required. Do I need to implement one like this or is
> there already an analyzer I can use? Looks like I am missing something here,
> since Solr's SynonymFilter is supposed to handle this. Can someone tell me
> what is the correct way to integrate Solr's SynonymFilter within a custom
> lucene analyzer? Thanks.
>
>
> On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
> <arunrangarajan@gmail.com>wrote:
>
>> I am trying to have multi-word synonyms work in lucene using Solr's *
>> SynonymFilter*.
>>
>> I need to match synonyms at index time, since many of the synonym lists are
>> huge. Actually they are really not synonyms, but are words that belong to a
>> concept. For example, I would like to map {"New York", "Los Angeles", "New
>> Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
>> "city". While searching, the user query for the concept "city" will be
>> translated to a keyword like, say "CONCEPTcity", which is the synonym for
>> any city name.
>>
>> Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
>> all I could match for "CONCEPTcity" is single word city names like
>> "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
>> names like "New York", "Los Angeles", etc.,
>>
>> I tried using Solr's SynonymFilter in tokenStream method in a custom
>> Analyzer (that extends org.apache.lucene.analysis.
>> Analyzer - lucene ver. 2.9.3) using:
>>
>> *    public TokenStream tokenStream(String fieldName, Reader reader) {
>>         TokenStream result = new SynonymFilter(
>>                 new WhitespaceTokenizer(reader),
>>                 synonymMap);
>>         return result;
>>     }
>> *
>> where *synonymMap* is loaded with synonyms using
>>
>> *synonymMap.add(conceptTerms, listOfTokens, true, true);*
>>
>> where *conceptTerms* is of type *ArrayList<String>* of all the terms in a
>> concept and *listofTokens* is of type *List<Token>  *and contains only the
>> generic synonym identifier like *CONCEPTcity*.
>>
>> When I print synonymMap using synonymMap.toString(), I get the output like
>>
>> <{New York=<{Chicago=<{Seattle=<{New
>> Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}>
>>
>> so it looks like all the synonyms are loaded. But if I search for
>> "CONCEPTcity" then it says no matches found. I am not sure whether I have
>> loaded the synonyms correctly in the synonymMap.
>>
>> Any help will be deeply appreciated. Thanks!
>>
>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message