Message-ID: <4BD9E616.7030103@princeton.edu>
Date: Thu, 29 Apr 2010 16:03:34 -0400
From: Wei Ho
To: java-user@lucene.apache.org
Subject: Re: Lucene QueryParser and Analyzer
References: <4BD9E329.5030306@princeton.edu>

No, there is no whitespace after the commas in Input1.

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2 C3C4 C5C6 C7 C8C9C10

Input1 is basically one long word, with commas and Chinese characters
one after the other. Input2 is the same string after I manually
separated it into its component terms by replacing each comma with
whitespace.

My confusion stems from the fact that I thought it should not matter,
since the analyzer should be discarding the punctuation anyway. So
shouldn't the tokenization process be the same for both Input1 and
Input2? If that is not the case, what do I need to change?

Thanks,
Wei Ho

-------- Original Message --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D.
To: java-user@lucene.apache.org
Date: 4/29/2010 3:54 PM

> Hi,
>
> Is there a whitespace after the comma?
>
> Sincerely,
> Sithu D Sudarsan
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 3:51 PM
> To: java-user@lucene.apache.org
> Subject: Lucene QueryParser and Analyzer
>
> Hello,
>
> I'm using Lucene to index and search through a collection of Chinese
> documents. However, I'm noticing some odd behavior in query
> parsing/searching.
>
> Given the two queries below:
>
> (Ci refers to Chinese character i)
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2 C3C4 C5C6 C7 C8C9C10
>
> Input1 returns absolutely nothing, while Input2 (replacing the commas
> with spaces) works as expected. I'm a bit confused why this would be
> happening - it seems that QueryParser uses the Analyzer passed to it to
> tokenize the input query string, so if the Analyzer ignores
> punctuation, it seems that Input1 and Input2 should return identical
> results.
> Is there some pre-Analyzer filtering or whatever that
> QueryParser does? I've tried this with the StandardAnalyzer,
> SmartChineseAnalyzer, and an analyzer that I implemented which
> explicitly skips over punctuation and whitespace in tokenizing the
> query string, but to no avail.
>
> -------sample code-------------
> Analyzer analyzer = new LingPipeAnalyzer();
> Searcher searcher = new IndexSearcher(directory);
> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>         SEARCH_FIELDS, analyzer);
> Query query = qParser.parse(queryLine[1]);
> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
> -----------------------------------
>
> I'm probably just doing something dumb, but any help would be greatly
> appreciated!
>
> Thanks,
> Wei Ho
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
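
For reference, the manual step described above (turning Input1 into
Input2 by replacing each comma with whitespace) could be done in code
before the string is handed to qParser.parse(). This is only a minimal
sketch of that workaround, not an explanation of why Input1 fails. The
class name QuerySeparators and the method commasToSpaces are made up for
illustration, and treating the fullwidth comma (U+FF0C) the same as the
ASCII comma is an assumption, since the thread does not say which comma
the real queries contain.

```java
// Hypothetical helper (not part of Lucene): rewrites a comma-separated
// query string into a whitespace-separated one, i.e. Input1 -> Input2.
final class QuerySeparators {

    private QuerySeparators() {
        // static utility, no instances
    }

    /**
     * Replaces each run of ASCII commas or fullwidth commas (U+FF0C,
     * an assumption) with a single space, then trims the ends.
     */
    static String commasToSpaces(String rawQuery) {
        return rawQuery.replaceAll("[,\uFF0C]+", " ").trim();
    }

    public static void main(String[] args) {
        // Same shape as Input1 from the message, with Ci as placeholders.
        String input1 = "C1C2,C3C4,C5C6,C7,C8C9C10";
        System.out.println(commasToSpaces(input1));
    }
}
```

With something like this in place, the call in the sample code would
become qParser.parse(QuerySeparators.commasToSpaces(queryLine[1])). Note
that this also rewrites commas the user typed on purpose, for example
inside a quoted phrase, so it only suits queries like the ones above.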