lucene-java-user mailing list archives

From "Open Study" <open.st...@gmail.com>
Subject Re: Chinese Segmentation with Phase Query
Date Fri, 09 Nov 2007 14:43:51 GMT
Hi Cedric

You could try the CJKAnalyzer in the Lucene sandbox. It isn't a
perfect solution for Chinese word segmentation, but it should solve the
problem in your case.
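
To illustrate why this helps: CJKAnalyzer does not try to find word boundaries; it indexes CJK text as overlapping two-character tokens (bigrams). The sketch below is a simplified plain-Python simulation of that idea, not Lucene code. The function names `cjk_bigrams` and `phrase_matches` are hypothetical, and the real analyzer also handles non-CJK text, which is ignored here.

```python
def cjk_bigrams(text):
    """Split a run of CJK characters into overlapping two-character tokens.

    Returns (term, position) pairs. A single character yields one unigram.
    """
    if len(text) < 2:
        return [(text, 0)] if text else []
    return [(text[i:i + 2], i) for i in range(len(text) - 1)]


def phrase_matches(doc_text, query_text):
    """True if the query's bigrams occur at consecutive positions in the document."""
    doc = cjk_bigrams(doc_text)
    query = [term for term, _ in cjk_bigrams(query_text)]
    if not query:
        return False
    # Map each indexed term to the set of positions where it occurs.
    positions = {}
    for term, pos in doc:
        positions.setdefault(term, set()).add(pos)
    # Try every occurrence of the first query term as a starting point.
    return any(
        all(start + k in positions.get(term, ()) for k, term in enumerate(query))
        for start in positions.get(query[0], ())
    )


print(cjk_bigrams("ABC"))            # [('AB', 0), ('BC', 1)]
print(phrase_matches("ABC", "AC"))   # False
print(phrase_matches("ABC", "BC"))   # True
```

Because the query AC is analyzed into the single bigram AC, which was never indexed for the text ABC, the false match goes away. The cost is that tokens are no longer real words.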

On Nov 9, 2007 10:59 AM, Cedric Ho <cedric.ho@gmail.com> wrote:
> Hi,
>
> We are having an issue while indexing Chinese documents in Lucene.
>
> Some background first:
> Since CJK languages don't have spaces between words, we first have to
> determine the words from sentences, e.g.
>
> a sentence containing the characters ABC may be segmented into AB, C or A, BC.
>
> The problem is that sometimes there are ambiguities in how the sentence
> should be segmented. It is possible that both AB, C and A, BC are valid
> segmentations.
>
> In these cases we would like to add both segmentations to the index:
>
> AB offset (0,1) position 0
> C offset (2,2) position 1
> A offset (0,0) position 0
> BC offset (1,2) position 1
>
> Now the problem is that when someone searches using a PhraseQuery (AC), it
> will find this line ABC because it matches A (position 0) and C
> (position 1).
>
> Is there any way to search for an exact match using the offset
> information instead of the position information?
>
> Best Regards,
> Cedric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
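
The false match described in the quoted message can be reproduced with a small plain-Python simulation. This is only a sketch of the matching logic, not Lucene code: the helper names `match_by_position` and `match_by_offset` are hypothetical, and the token tuples copy the offsets and positions from the message, with end offsets treated as inclusive as written there. The second function shows what an offset-based adjacency check would look like if one were implemented.

```python
# Tokens from the example above: (term, start_offset, end_offset, position).
# End offsets are inclusive, as written in the original message.
TOKENS = [
    ("AB", 0, 1, 0),
    ("C", 2, 2, 1),
    ("A", 0, 0, 0),
    ("BC", 1, 2, 1),
]


def match_by_position(tokens, phrase):
    """Position-only phrase match: each term sits one position after the previous."""
    def search(idx, prev_pos):
        if idx == len(phrase):
            return True
        return any(
            search(idx + 1, pos)
            for term, _, _, pos in tokens
            if term == phrase[idx] and (prev_pos is None or pos == prev_pos + 1)
        )
    return search(0, None)


def match_by_offset(tokens, phrase):
    """Offset-aware match: each term starts right after the previous one ended."""
    def search(idx, prev_end):
        if idx == len(phrase):
            return True
        return any(
            search(idx + 1, end)
            for term, start, end, _ in tokens
            if term == phrase[idx] and (prev_end is None or start == prev_end + 1)
        )
    return search(0, None)


print(match_by_position(TOKENS, ["A", "C"]))  # True  (the unwanted match)
print(match_by_offset(TOKENS, ["A", "C"]))    # False
print(match_by_offset(TOKENS, ["AB", "C"]))   # True
print(match_by_offset(TOKENS, ["A", "BC"]))   # True
```

Checking character offsets rejects A followed by C because C starts at offset 2 while A ends at offset 0, yet both valid segmentations still match.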
