lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arun Rangarajan <arunrangara...@gmail.com>
Subject Re: Solr SynonymFilter in Lucene analyzer
Date Fri, 27 Aug 2010 01:00:42 GMT
Thanks, Lance. After exploring for a while, I used lucene's ShingleFilter
followed by the SynonymFilter in Lucene in Action book. Then using the type
attribute, I removed all the shingles which did not belong to any category.

On Wed, Aug 18, 2010 at 10:28 PM, Lance Norskog <goksron@gmail.com> wrote:

> Yes, you need an analyzer that leaves successive words together as one
> long term. This might be easier to do with the new CharFilter tool,
> which processes text before it goes to the tokenizer.
>
> What you are doing here is similar to Parts-Of-Speech analysis, where
> text analysis software parses a sentence and labels words 'Noun',
> 'Verb', etc. One suite stores these labels as payloads on the terms.
> This might be a better way to store your categories, rather than using
> the synonym filter.
>
> On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan
> <arunrangarajan@gmail.com> wrote:
> > I think the lucene WhitespaceAnalyzer I am using inside Solr's
> SynonymFilter
> > is the one that prevents multi-word synonyms like "New York" from getting
> > mapped to the generic synonym name like CONCEPTYcity. It appears to me
> that
> > an analyzer which recognizes that a white-space is inside a synonym like
> > "New York" will be required. Do I need to implement one like this or is
> > there already an analyzer I can use? Looks like I am missing something
> here,
> > since Solr's SynonymFilter is supposed to handle this. Can someone tell
> me
> > what is the correct way to integrate Solr's SynonymFilter within a custom
> > lucene analyzer? Thanks.
> >
> >
> > On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan
> > <arunrangarajan@gmail.com>wrote:
> >
> >> I am trying to have multi-word synonyms work in lucene using Solr's *
> >> SynonymFilter*.
> >>
> >> I need to match synonyms at index time, since many of the synonym lists
> are
> >> huge. Actually they are really not synonyms, but are words that belong
> to a
> >> concept. For example, I would like to map {"New York", "Los Angeles",
> "New
> >> Orleans", "Salt Lake City"...}, a bunch of city names, to the concept
> called
> >> "city". While searching, the user query for the concept "city" will be
> >> translated to a keyword like, say "CONCEPTcity", which is the synonym
> for
> >> any city name.
> >>
> >> Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p.
> 131),
> >> all I could match for "CONCEPTcity" is single word city names like
> >> "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
> >> names like "New York", "Los Angeles", etc.,
> >>
> >> I tried using Solr's SynonymFilter in tokenStream method in a custom
> >> Analyzer (that extends org.apache.lucene.analysis.
> >> Analyzer - lucene ver. 2.9.3) using:
> >>
> >> *    public TokenStream tokenStream(String fieldName, Reader reader) {
> >>         TokenStream result = new SynonymFilter(
> >>                 new WhitespaceTokenizer(reader),
> >>                 synonymMap);
> >>         return result;
> >>     }
> >> *
> >> where *synonymMap* is loaded with synonyms using
> >>
> >> *synonymMap.add(conceptTerms, listOfTokens, true, true);*
> >>
> >> where *conceptTerms* is of type *ArrayList<String>* of all the terms in
> a
> >> concept and *listofTokens* is of type *List<Token>  *and contains only
> the
> >> generic synonym identifier like *CONCEPTcity*.
> >>
> >> When I print synonymMap using synonymMap.toString(), I get the output
> like
> >>
> >> <{New York=<{Chicago=<{Seattle=<{New
> >> Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}>
> >>
> >> so it looks like all the synonyms are loaded. But if I search for
> >> "CONCEPTcity" then it says no matches found. I am not sure whether I
> have
> >> loaded the synonyms correctly in the synonymMap.
> >>
> >> Any help will be deeply appreciated. Thanks!
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message