Message-ID: <839ba01c0711081859h2d485e8vef8d193ae9edb59d@mail.gmail.com>
Date: Fri, 9 Nov 2007 10:59:25 +0800
From: "Cedric Ho"
To: java-user@lucene.apache.org
Subject: Chinese Segmentation with Phase Query

Hi,

We are having an issue indexing Chinese documents in Lucene. Some background first:

Since CJK languages don't put spaces between words, we first have to segment each sentence into words. For example, a sentence containing the characters ABC may be segmented into AB, C or into A, BC.

The problem is that the segmentation is sometimes ambiguous: both AB, C and A, BC can be valid. In these cases we would like to put both segmentations into the index:

AB  offset (0,1)  position 0
C   offset (2,2)  position 1
A   offset (0,0)  position 0
BC  offset (1,2)  position 1

Now the problem is that when someone searches with a PhraseQuery for (AC), it finds this line ABC, because it matches A (position 0) followed by C (position 1).

Is there any way to search for an exact match using the offset information instead of the position information?

Best Regards,
Cedric
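
To make the mismatch concrete, here is a minimal plain-Java sketch. It is not Lucene code: the Token class and both helper methods are hypothetical names made up for illustration. It models the four tokens from the message above and compares a position-based adjacency test with an offset-based one, assuming inclusive end offsets as in the listing.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the problem; no Lucene classes are used here.
class OffsetPhraseDemo {

    // Hypothetical token: term text, inclusive character offsets, and position.
    static final class Token {
        final String term;
        final int start;
        final int end;
        final int position;

        Token(String term, int start, int end, int position) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.position = position;
        }
    }

    // The four tokens produced by indexing both segmentations of "ABC".
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("AB", 0, 1, 0),
            new Token("C", 2, 2, 1),
            new Token("A", 0, 0, 0),
            new Token("BC", 1, 2, 1));
    }

    // Position-based test: the second term sits one position after the first.
    static boolean matchesByPosition(List<Token> tokens, String first, String second) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) continue;
            for (Token b : tokens) {
                if (b.term.equals(second) && b.position == a.position + 1) return true;
            }
        }
        return false;
    }

    // Offset-based test: the second term starts right after the first one ends.
    static boolean matchesByOffset(List<Token> tokens, String first, String second) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) continue;
            for (Token b : tokens) {
                if (b.term.equals(second) && b.start == a.end + 1) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        System.out.println(matchesByPosition(tokens, "A", "C")); // true: the unwanted hit
        System.out.println(matchesByOffset(tokens, "A", "C"));   // false
        System.out.println(matchesByOffset(tokens, "AB", "C"));  // true
    }
}
```

The offset test rejects AC because A ends at offset 0 while C starts at offset 2, yet it still accepts AB followed by C and A followed by BC. Whether and how such a check can be expressed as a Lucene query is exactly the open question here.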