From: "Cedric Ho" <cedric.ho@gmail.com>
To: java-user@lucene.apache.org
Date: Sat, 10 Nov 2007 09:27:36 +0800
Subject: Re: Chinese Segmentation with Phrase Query

On Nov 10, 2007 2:08 AM, Steven A Rowe wrote:
> Hi Cedric,
>
> On 11/08/2007, Cedric Ho wrote:
> > Given a sentence containing the characters ABC, it may be segmented into AB, C or A, BC.
> [snip]
> > In these cases we would like to index both segmentations:
> >
> >     AB  offset (0,1)  position 0        A   offset (0,0)  position 0
> >     C   offset (2,2)  position 1        BC  offset (1,2)  position 1
> >
> > Now the problem is, when someone searches using a PhraseQuery (AC), it
> > will find this line ABC, because it matches A (position 0) and C
> > (position 1).
> >
> > Are there any ways to search for an exact match using the offset
> > information instead of the position information?
>
> Since you are writing the tokenizer (the Lucene term for the module that performs the segmentation), you yourself can substitute the beginning offset for the position. But I think that without the end offset, it won't get you what you want.
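The false positive Cedric describes can be sketched with a toy position-based phrase matcher. This is a simplified model of how an exact phrase query is evaluated over term positions; the `postings` dict and `phrase_match` helper are illustrative, not Lucene API:

```python
# Toy postings: term -> list of positions, mimicking the interleaved
# indexing of both segmentations of "ABC" described above.
postings = {
    "AB": [0],  # alternative 1
    "C":  [1],  # alternative 1
    "A":  [0],  # alternative 2
    "BC": [1],  # alternative 2
}

def phrase_match(postings, terms):
    """True if the terms occur at consecutive positions, the way a
    position-based exact phrase query is evaluated."""
    first = postings.get(terms[0], [])
    return any(
        all(start + i in postings.get(t, []) for i, t in enumerate(terms))
        for start in first
    )

# "AC" was never a segmentation of the text, yet it matches, because
# A sits at position 0 and C at position 1 across alternatives.
print(phrase_match(postings, ["A", "C"]))   # True  (false positive)
print(phrase_match(postings, ["AB", "C"]))  # True  (correct)
```

The false positive falls directly out of interleaving the two alternatives at the same position range.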
>
> For example, if your example above were indexed with beginning offsets as positions, a phrase query for "AB, C" would fail to match -- even though it should match -- because the segments' beginning offsets (0 and 2) are not contiguous.
>
> The new Payloads feature could provide the basis for storing the beginning and ending offsets required to determine contiguity when matching phrases, but you would have to write matching and scoring for this representation, and that may not be the quickest route available to you.
>
> Solution #1: Create multiple fields, one for each full alternative segmentation, and then query against all of them.
>
> Solution #2: Store the alternative segmentations in the same field, but instead of interleaving the segments' positions, as in your example, make the position ranges of the alternatives non-contiguous. Recasting your example:
>
>     Alternative #1    Alternative #2    Alternative #3
>     --------------    --------------    --------------
>     AB position 0     A  position 100   A position 200
>     C  position 1     BC position 101   B position 201
>                                         C position 202
>
> There is a problem with both of the above-described solutions: in my limited experience with Chinese segmentation, substantially less than half the text has alternative segmentations. As a result, the segments on which all alternatives agree (call them "uncontested segments") will have higher term frequencies than those segments which differ among the alternatives ("contested segments"). This means that document scores will be influenced by the variable density of the contested segments they contain.
>
> However, if you were to use my above-described Solution #1 along with a DisjunctionMaxQuery [1] as a wrapper around one query per alternative-segmentation field, the term frequency problem would no longer be an issue.
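Solution #2 can be illustrated with a toy position-based phrase matcher: spacing the alternatives 100 positions apart keeps each alternative internally contiguous, while cross-alternative phrases such as (A, C) no longer line up. The postings layout and helper are illustrative, not Lucene API:

```python
# Toy postings using the non-contiguous position ranges from the
# recast example: alternative 1 at 0.., #2 at 100.., #3 at 200..
postings = {
    "AB": [0],
    "C":  [1, 202],
    "A":  [100, 200],
    "BC": [101],
    "B":  [201],
}

def phrase_match(postings, terms):
    """True if the terms occur at consecutive positions."""
    first = postings.get(terms[0], [])
    return any(
        all(start + i in postings.get(t, []) for i, t in enumerate(terms))
        for start in first
    )

print(phrase_match(postings, ["AB", "C"]))      # True:  0, 1
print(phrase_match(postings, ["A", "BC"]))      # True:  100, 101
print(phrase_match(postings, ["A", "B", "C"]))  # True:  200, 201, 202
print(phrase_match(postings, ["A", "C"]))       # False: alternatives no longer interleave
```

Each legitimate segmentation still matches as a phrase, and the interleaving false positive from the original layout disappears.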
> From the API doc for DisjunctionMaxQuery:
>
>     A query that generates the union of documents produced by its
>     subqueries, and that scores each document with the maximum
>     score for that document as produced by any subquery, plus a
>     tie breaking increment for any additional matching subqueries.
>     This is useful when searching for a word in multiple fields
>     with different boost factors (so that the fields cannot be
>     combined equivalently into a single search field). We want
>     the primary score to be the one associated with the highest
>     boost, not the sum of the field scores (as BooleanQuery would
>     give).
>
> Unlike the use-case mentioned above, where each field is boosted differently, you probably don't have any information about the relative probability of the alternative segmentations, so you'll want to use the same boost for each sub-query.
>
> Steve
>
> [1]
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp

Hi Steve,

We have actually thought about Solution #1, and in our case sorting by score is not a very important factor either. However, this would double the index size. A full index of our documents already takes > 80G, and it's expected to grow much faster in the near future. As you've mentioned, ambiguities in segmentation are indeed rare, so we are not very willing to double the index size just for that.

Solution #2 is a good solution indeed, but our problem goes deeper. We are currently trying to replace our old search engine with Lucene, which means that whatever features the old engine has, we need to simulate in Lucene. One such feature is similar to Lucene's SpanNearQuery, which, unfortunately, doesn't work with Solution #2 if the search terms contain the term BC from Alternative #2, e.g.
Given a sentence XYZABCDEF, suppose the segmentation is:

                         Alternative #2
    XY   position 0
    Z    position 1
    AB   position 2      A    position 102
    C    position 3      BC   position 103
    DEF  position 4

A SpanNearQuery of (XY, BC, DEF) with a distance of 10 would fail to match this document.

So it seems we may have to go the more difficult route of exploring the Payload feature that you mentioned. I saw it being announced with the release of 2.2.0, but the API says "experimental" and there aren't any examples of how it can be used.

But thanks very much for your good suggestions.

Cheers,
Cedric

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
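Cedric's failing SpanNearQuery can be modelled with a toy slop check: an unordered near query matches when some choice of one position per term fits inside the slop, and with Alternative #2 parked at positions 102+, the span covering (XY, BC, DEF) is far wider than 10. The width arithmetic below is a simplification of Lucene's actual span matching, and the helper is illustrative, not Lucene API:

```python
from itertools import product

# Positions for XYZABCDEF with the non-contiguous alternative range.
postings = {
    "XY": [0], "Z": [1], "AB": [2], "C": [3], "DEF": [4],
    "A": [102], "BC": [103],
}

def span_near_match(postings, terms, slop):
    """True if some choice of one position per term spans a window
    whose width (max - min - (len(terms) - 1)) is within the slop.
    A simplified model of an unordered SpanNearQuery."""
    choices = [postings.get(t, []) for t in terms]
    return any(
        max(combo) - min(combo) - (len(terms) - 1) <= slop
        for combo in product(*choices)
    )

# XY=0, BC=103, DEF=4 -> width 103 - 0 - 2 = 101 > 10: no match,
# even though "XY ... BC ... DEF" is adjacent in the original text.
print(span_near_match(postings, ["XY", "BC", "DEF"], 10))  # False
print(span_near_match(postings, ["XY", "AB", "DEF"], 10))  # True
```

This is exactly why the non-contiguous position trick, while fine for exact phrases, breaks proximity queries that mix terms from different alternatives.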