From: "Jack Krupansky"
To: solr-user@lucene.apache.org
Subject: Re: How to avoid underscore sign indexing problem?
Date: Wed, 21 Aug 2013 23:15:44 -0400

"I thought that the StandardTokenizer always split on punctuation, "

Proving that you haven't read my book! The section on the standard tokenizer details the rules that the tokenizer uses (in addition to extensive examples.) That's what I mean by "deep dive."

-- Jack Krupansky

-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, August 21, 2013 10:41 PM
To: solr-user@lucene.apache.org
Subject: Re: How to avoid underscore sign indexing problem?

On 8/21/2013 7:54 PM, Floyd Wu wrote:
> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>
> ST
> text         raw_bytes                           start  end  type  position
> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11         1
>
> How to make this string to be tokenized to these two tokens "Pacific",
> "Rim"?
> Set _ as stopword?
> Please kindly help on this.
> Many thanks.

Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character.

You can always use the WordDelimiterFilter after the StandardTokenizer.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Thanks,
Shawn
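
A rough illustration of the suggestion above, in plain Python rather than against the Lucene/Solr API. The two function names are hypothetical stand-ins: the first mimics an analyzer that keeps the underscore inside a token (Python's `\w` also counts `_` as a word character) and lowercases the result, and the second mimics only the delimiter-splitting part of a word-delimiter filter. The real filter has further options (for example splitting on case changes) that this sketch ignores.

```python
import re


def analyzer_like_tokens(text):
    """Hypothetical stand-in for the analyzer step.

    \\w+ treats '_' as part of a word, so "Pacific_Rim" stays one token,
    which is then lowercased.
    """
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def split_on_delimiters(tokens):
    """Hypothetical stand-in for the word-delimiter step.

    Splits each token on '_' and any other non-alphanumeric character,
    dropping empty pieces.
    """
    parts = []
    for tok in tokens:
        parts.extend(p for p in re.split(r"[\W_]+", tok) if p)
    return parts


tokens = analyzer_like_tokens("Pacific_Rim")
print(tokens)                       # ['pacific_rim']
print(split_on_delimiters(tokens))  # ['pacific', 'rim']
```

In an actual Solr schema the second step would be the WordDelimiterFilterFactory placed after the tokenizer in the field type's analyzer chain, as described on the wiki page linked above.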