lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Custom Tokenizer
Date Thu, 05 Dec 2013 18:52:45 GMT
You can also string together any of a myriad of TokenFilters, see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

I'd recommend spending some time on the Solr admin/analysis page
to understand what all the combinations do. I'd also recommend
against dealing with punctuation etc. by using wildcards. When
you use wildcards, the terms matched don't contribute to the
relevance score.

For instance, LowerCaseTokenizerFactory will tokenize all
letter sequences, lowercase them, and drop all non-letters.
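
To make that behaviour concrete, here is a minimal plain-Java sketch that
mimics it. It does not use the Lucene classes; the class and method names
are hypothetical, and it only illustrates "keep runs of letters, lowercase
them, drop everything else".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a letter-based, lowercasing tokenizer.
// This is NOT Lucene's implementation, just an illustration of the idea.
final class LetterTokenizerSketch {

    // Collects maximal runs of letters, lowercased. Every non-letter
    // (digits, hyphens, punctuation, spaces) acts as a separator and is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample text taken from the original question in this thread.
        System.out.println(tokenize(
            "1097-0215 (i.v) product-123 anti-virus, we investigated the mechanisms."));
        // prints [i, v, product, anti, virus, we, investigated, the, mechanisms]
    }
}
```

Note that the numeric identifiers disappear entirely, so a letter-only
tokenizer is probably not what you want for data like this.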

PatternReplaceFilterFactory will allow you to define with
regexes what you want to be included in your tokens etc. You
could use this in conjunction with WhitespaceTokenizerFactory
for instance.
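
As a rough illustration of that combination, the plain-Java sketch below
splits on whitespace and then applies a regex replacement to each token.
It does not call the Lucene/Solr classes. The pattern (strip ASCII
punctuation from the edges of a token) and all names are assumptions made
for the example, and dropping tokens that end up empty is a choice of this
sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: whitespace tokenization followed by a per-token
// regex replacement. Not the Lucene/Solr implementation.
final class WhitespacePlusPatternSketch {

    // Assumed pattern: ASCII punctuation at the very start or very end of a token.
    private static final Pattern EDGE_PUNCT =
        Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: break the input on runs of whitespace only.
        for (String raw : text.split("\\s+")) {
            // Step 2: clean each token; interior hyphens and periods survive.
            String cleaned = EDGE_PUNCT.matcher(raw).replaceAll("");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample text taken from the original question in this thread.
        System.out.println(analyze(
            "1097-0215 (i.v) product-123 anti-virus, we investigated the "
            + "mechanisms. 2266-73 In the present study"));
        // prints [1097-0215, i.v, product-123, anti-virus, we, investigated,
        //         the, mechanisms, 2266-73, In, the, present, study]
    }
}
```

With tokens like these, a plain search for "mechanisms" matches without
any wildcard.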

Or, as Furkan suggests, use PatternReplaceCharFilterFactory
to operate on the entire input before it's broken up by
whatever tokenizer you use. Or....
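
The char-filter idea differs only in where the regex runs: on the raw
text, before any tokenizer sees it. A minimal plain-Java sketch follows,
again without the Lucene/Solr classes. The pattern and the names are
assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of "filter characters first, tokenize second".
// Not the Lucene/Solr implementation.
final class CharFilterSketch {

    // Assumed pattern: runs of ASCII punctuation that either start right after
    // whitespace (or the start of input) or end right before whitespace (or
    // the end of input). Punctuation inside a word is left alone.
    private static final Pattern EDGE_PUNCT =
        Pattern.compile("(?<!\\S)\\p{Punct}+|\\p{Punct}+(?!\\S)");

    // Step 1: rewrite the whole input string.
    static String filterChars(String text) {
        return EDGE_PUNCT.matcher(text).replaceAll("");
    }

    // Step 2: hand the cleaned text to a whitespace split.
    static List<String> analyze(String text) {
        String cleaned = filterChars(text).trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "anti-virus, we (i.v) mechanisms.";
        System.out.println(filterChars(input)); // prints anti-virus we i.v mechanisms
        System.out.println(analyze(input));     // prints [anti-virus, we, i.v, mechanisms]
    }
}
```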

You _really_ should make the effort to define a proper
analysis chain rather than just use wildcards IMO.

Best,
Erick


On Thu, Dec 5, 2013 at 12:24 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:

> Hi;
>
> StandardAnalyzer includes these by default:
>
> StandardFilter, LowerCaseFilter and StopFilter
>
> You can consider char filters. Have you read this page:
> https://cwiki.apache.org/confluence/display/solr/CharFilterFactories
>
> Thanks;
> Furkan KAMACI
>
>
> 2013/12/5 <raghavendra.k.rao@barclays.com>
>
> > Hi,
> >
> > I have used StandardAnalyzer in my code and it is working fine. One of the
> > challenges I face is that this Analyzer by default tokenizes on some
> > special characters, such as the hyphen, in addition to the SPACE character.
> >
> > I want to tokenize only on the SPACE character. Could you please suggest
> > how I can achieve this?
> >
> > I got this example when I googled for it. What I want to use is the
> > WhitespaceTokenizer so that the data is not manipulated in any way. I
> > understand that in this case, searches such as "mechanisms" won't return
> > results because of the period (.) at the end. I want to address this by
> > introducing wildcard searches.
> >
> > Data: 1097-0215 (i.v) product-123 anti-virus, we investigated the
> > mechanisms. 2266-73 In the present study
> > Tokens generated with StandardTokenizer:
> > [1097-0215] [i.v] [product-123] [anti] [virus] [we] [investigated] [the]
> > [mechanisms] [2266-73] [In] [the] [present] [study]
> > Tokens generated with WhitespaceTokenizer:
> > [1097-0215] [(i.v)] [product-123] [anti-virus,] [we] [investigated] [the]
> > [mechanisms.] [2266-73] [In] [the] [present] [study]
> > Note: I have tried using the WhitespaceAnalyzer, which by default tokenizes
> > ONLY on the space, but my attempt at performing wildcard searches didn't
> > work as expected, whereas wildcard searches worked fine with
> > StandardAnalyzer.
> >
> > Please provide your inputs.
> >
> > Regards,
> > Raghu