lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: How to avoid underscore sign indexing problem?
Date Thu, 22 Aug 2013 03:15:44 GMT
"I thought that the StandardTokenizer always split on punctuation, "

Proving that you haven't read my book! The section on the standard tokenizer 
details the rules that the tokenizer uses (in addition to extensive 
examples.) That's what I mean by "deep dive."

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Wednesday, August 21, 2013 10:41 PM
To: solr-user@lucene.apache.org
Subject: Re: How to avoid underscore sign indexing problem?

On 8/21/2013 7:54 PM, Floyd Wu wrote:
> When I use StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>
> ST
> text         raw_bytes                           start  end  type        position
> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11   <ALPHANUM>  1
>
> How can I get this string tokenized into the two tokens "Pacific" and
> "Rim"?
> Should I set _ as a stopword?
> Please kindly help with this.
> Many thanks.

Interesting.  I thought that the StandardTokenizer always split on
punctuation, but apparently that's not the case for the underscore
character.

You can always use the WordDelimiterFilter after the StandardTokenizer.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Thanks,
Shawn 
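
For illustration, the suggestion above might look roughly like this in
schema.xml. This is a sketch under assumptions, not a configuration taken
from the thread: the field type name "text_underscore_split" is
hypothetical, and the attribute values shown are just one plausible
choice. With this chain, "Pacific_Rim" should come out as the two tokens
"pacific" and "rim".

```xml
<!-- Hypothetical field type; the name is an assumption for illustration. -->
<fieldType name="text_underscore_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer keeps "Pacific_Rim" as a single token. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WordDelimiterFilter then splits that token on intra-word
         delimiters such as the underscore. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"/>
    <!-- Lowercase after splitting, as StandardAnalyzer would. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The result can be checked on the Solr admin Analysis page, which is where
output like the token table quoted above comes from.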

