From: "Cedric Ho" <cedric.ho@gmail.com>
To: java-user@lucene.apache.org
Date: Sat, 10 Nov 2007 09:27:36 +0800
Subject: Re: Chinese Segmentation with Phrase Query

On Nov 10, 2007 2:08 AM, Steven A Rowe wrote:
> Hi Cedric,
>
> On 11/08/2007, Cedric Ho wrote:
> > Given a sentence containing the characters ABC, it may be segmented into AB, C or A, BC.
> [snip]
> > In these cases we would like to index both segmentations:
> >
> >     AB  offset (0,1)  position 0        A   offset (0,0)  position 0
> >     C   offset (2,2)  position 1        BC  offset (1,2)  position 1
> >
> > Now the problem is, when someone searches using a PhraseQuery (AC), it
> > will find this line ABC, because it matches A (position 0) and C
> > (position 1).
> >
> > Are there any ways to search for an exact match using the offset
> > information instead of the position information?
>
> Since you are writing the tokenizer (the Lucene term for the module that performs the segmentation), you yourself can substitute the beginning offset for the position. But I think that without the end offset, it won't get you what you want.
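The false positive Cedric describes can be sketched with a toy position-based phrase matcher. This is a simplified model of how an exact phrase query is evaluated over term positions; the `postings` dict and `phrase_match` helper are illustrative, not Lucene API:

```python
# Toy postings: term -> list of positions, mimicking the interleaved
# indexing of both segmentations of "ABC" described above.
postings = {
    "AB": [0],  # alternative 1
    "C":  [1],  # alternative 1
    "A":  [0],  # alternative 2
    "BC": [1],  # alternative 2
}

def phrase_match(postings, terms):
    """True if the terms occur at consecutive positions, the way a
    position-based exact phrase query is evaluated."""
    first = postings.get(terms[0], [])
    return any(
        all(start + i in postings.get(t, []) for i, t in enumerate(terms))
        for start in first
    )

# "AC" was never a segmentation of the text, yet it matches, because
# A sits at position 0 and C at position 1 across alternatives.
print(phrase_match(postings, ["A", "C"]))   # True  (false positive)
print(phrase_match(postings, ["AB", "C"]))  # True  (correct)
```

The false positive falls directly out of interleaving the two alternatives at the same position range.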
>
> For example, if your example above were indexed with beginning offsets as positions, a phrase query for "AB, C" would fail to match -- even though it should match -- because the segments' beginning offsets (0 and 2) are not contiguous.
>
> The new Payloads feature could provide the basis for storing the beginning and ending offsets required to determine contiguity when matching phrases, but you would have to write matching and scoring for this representation, and that may not be the quickest route available to you.
>
> Solution #1: Create multiple fields, one for each full alternative segmentation, and then query against all of them.
>
> Solution #2: Store the alternative segmentations in the same field, but instead of interleaving the segments' positions, as in your example, make the position ranges of the alternatives non-contiguous. Recasting your example:
>
>     Alternative #1    Alternative #2    Alternative #3
>     --------------    --------------    --------------
>     AB position 0     A  position 100   A position 200
>     C  position 1     BC position 101   B position 201
>                                         C position 202
>
> There is a problem with both of the above-described solutions: in my limited experience with Chinese segmentation, substantially less than half the text has alternative segmentations. As a result, the segments on which all alternatives agree (call them "uncontested segments") will have higher term frequencies than those segments which differ among the alternatives ("contested segments"). This means that document scores will be influenced by the variable density of the contested segments they contain.
>
> However, if you were to use my above-described Solution #1 along with a DisjunctionMaxQuery [1] as a wrapper around one query per alternative-segmentation field, the term frequency problem would no longer be an issue.
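Solution #2 can be illustrated with a toy position-based phrase matcher: spacing the alternatives 100 positions apart keeps each alternative internally contiguous, while cross-alternative phrases such as (A, C) no longer line up. The postings layout and helper are illustrative, not Lucene API:

```python
# Toy postings using the non-contiguous position ranges from the
# recast example: alternative 1 at 0.., #2 at 100.., #3 at 200..
postings = {
    "AB": [0],
    "C":  [1, 202],
    "A":  [100, 200],
    "BC": [101],
    "B":  [201],
}

def phrase_match(postings, terms):
    """True if the terms occur at consecutive positions."""
    first = postings.get(terms[0], [])
    return any(
        all(start + i in postings.get(t, []) for i, t in enumerate(terms))
        for start in first
    )

print(phrase_match(postings, ["AB", "C"]))      # True:  0, 1
print(phrase_match(postings, ["A", "BC"]))      # True:  100, 101
print(phrase_match(postings, ["A", "B", "C"]))  # True:  200, 201, 202
print(phrase_match(postings, ["A", "C"]))       # False: alternatives no longer interleave
```

Each legitimate segmentation still matches as a phrase, and the interleaving false positive from the original layout disappears.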
> From the API doc for DisjunctionMaxQuery:
>
>     A query that generates the union of documents produced by its
>     subqueries, and that scores each document with the maximum
>     score for that document as produced by any subquery, plus a
>     tie breaking increment for any additional matching subqueries.
>     This is useful when searching for a word in multiple fields
>     with different boost factors (so that the fields cannot be
>     combined equivalently into a single search field). We want
>     the primary score to be the one associated with the highest
>     boost, not the sum of the field scores (as BooleanQuery would
>     give).
>
> Unlike the use-case mentioned above, where each field is boosted differently, you probably don't have any information about the relative probability of the alternative segmentations, so you'll want to use the same boost for each sub-query.
>
> Steve
>
> [1]
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp

Hi Steve,

We have actually thought about Solution #1, and in our case sorting by score is not a very important factor either. However, this would double the index size. A full index of our documents already takes > 80G, and it's expected to grow much faster in the near future. As you've mentioned, ambiguities in segmentation are indeed rare, so we are not very willing to double the index size just for that.

Solution #2 is a good solution indeed, but our problem goes deeper. We are currently trying to replace our old search engine with Lucene, which means that whatever features the old engine has, we need to simulate in Lucene. One such feature is similar to Lucene's SpanNearQuery, which, unfortunately, doesn't work with Solution #2 if the search terms contain the term BC from Alternative #2, e.g.
Given a sentence XYZABCDEF, suppose the segmentation is:

                         Alternative #2
    XY   position 0
    Z    position 1
    AB   position 2      A    position 102
    C    position 3      BC   position 103
    DEF  position 4

A SpanNearQuery of (XY, BC, DEF) with a distance of 10 would fail to match this document.

So it seems we may have to go the more difficult route of exploring the Payload feature that you mentioned. I saw it being announced with the release of 2.2.0, but the API says "experimental" and there aren't any examples of how it can be used.

But thanks very much for your good suggestions.

Cheers,
Cedric

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
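Cedric's failing SpanNearQuery can be modelled with a toy slop check: an unordered near query matches when some choice of one position per term fits inside the slop, and with Alternative #2 parked at positions 102+, the span covering (XY, BC, DEF) is far wider than 10. The width arithmetic below is a simplification of Lucene's actual span matching, and the helper is illustrative, not Lucene API:

```python
from itertools import product

# Positions for XYZABCDEF with the non-contiguous alternative range.
postings = {
    "XY": [0], "Z": [1], "AB": [2], "C": [3], "DEF": [4],
    "A": [102], "BC": [103],
}

def span_near_match(postings, terms, slop):
    """True if some choice of one position per term spans a window
    whose width (max - min - (len(terms) - 1)) is within the slop.
    A simplified model of an unordered SpanNearQuery."""
    choices = [postings.get(t, []) for t in terms]
    return any(
        max(combo) - min(combo) - (len(terms) - 1) <= slop
        for combo in product(*choices)
    )

# XY=0, BC=103, DEF=4 -> width 103 - 0 - 2 = 101 > 10: no match,
# even though "XY ... BC ... DEF" is adjacent in the original text.
print(span_near_match(postings, ["XY", "BC", "DEF"], 10))  # False
print(span_near_match(postings, ["XY", "AB", "DEF"], 10))  # True
```

This is exactly why the non-contiguous position trick, while fine for exact phrases, breaks proximity queries that mix terms from different alternatives.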