lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cedric Ho" <cedric...@gmail.com>
Subject Re: Chinese Segmentation with Phase Query
Date Sat, 10 Nov 2007 00:50:39 GMT
The CJKAnalyzer is too simple for our need. But thanks for suggesting anyway.

Cheers,
Cedric

On Nov 9, 2007 10:43 PM, Open Study <open.study@gmail.com> wrote:
> Hi Cedric
>
> You may try the CJKAnalyzer within the lucene sandbox. It doesn't give
> a perfect solution for Chinese word segmentation, but will solve the
> problem in your case.
>
>
> On Nov 9, 2007 10:59 AM, Cedric Ho <cedric.ho@gmail.com> wrote:
> > Hi,
> >
> > We are having an issue while indexing Chinese Documents in Lucene.
> >
> > Some background first:
> > Since CJK languages doesn't have space between words, we first have to
> > determine the words from sentences. e.g.
> >
> > a sentence containing characters ABC, it may be segmented into AB, C or A, BC.
> >
> > the problem is sometimes there can be ambiguities in how the sentence
> > should be segmented. It is possible that
> > both AB, C and A, BC are valid segmentations.
> >
> > In this cases we would like to index both segmentation into the index:
> >
> > AB offset (0,1) position 0
> > C offset (2,2) position 1
> > A offset (0,0) position 0
> > BC offset (1,2) position 1
> >
> > Now the problem is, when someone search using a PhraseQuery (AC) it
> > will find this line ABC because it match A (position 0) and C
> > (position 1).
> >
> > Are there any ways to search for exact match using the offset
> > information instead of the position information ?
> >
> > Best Regards,
> > Cedric
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message