lucene-dev mailing list archives

From "Herman Chen" <hc...@intumit.com>
Subject Re: about bigram based word segment
Date Fri, 13 Sep 2002 16:07:06 GMT
I think there's another flaw with the bigram approach when the query
consists of 3+ characters: a query of w1w2w3 would also match text such as
w1w2w4w2w3, because its bigrams w1w2 and w2w3 both occur there even though
the characters are not contiguous.  Currently I do unigram tokenization and
build automatic phrase queries for CJK searches, which avoids that false
match, but performance could take a hit in large-scale situations.
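
To make that false match concrete, here is a minimal sketch in plain Java
(no Lucene dependency; the class name is illustrative, and ASCII letters
stand in for w1..w4):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BigramFalseMatchDemo {
    // Overlapping character bigrams: "ABDBC" -> [AB, BD, DB, BC].
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "ABC";   // stands in for w1w2w3
        String text  = "ABDBC"; // stands in for w1w2w4w2w3

        // AND semantics over bigram tokens: every query bigram must occur
        // somewhere in the text, regardless of position.
        Set<String> textBigrams = new HashSet<String>(bigrams(text));
        boolean bigramAndMatch = textBigrams.containsAll(bigrams(query));

        // Phrase semantics over unigram (single-character) tokens: the query
        // characters must occur contiguously and in order, i.e. a substring test.
        boolean unigramPhraseMatch = text.indexOf(query) >= 0;

        System.out.println("bigram AND match:     " + bigramAndMatch);     // true  (false positive)
        System.out.println("unigram phrase match: " + unigramPhraseMatch); // false (correct)
    }
}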

----- Original Message -----
From: "Che Dong" <chedong@hotmail.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Friday, September 13, 2002 9:43 AM
Subject: about bigram based word segment


> > I don't know any Asian languages, but from earlier experiments I
> > remember that bigram tokenization could sometimes hurt matching, e.g.:
> >
> > w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> > miss a search for w2.  w1 w2 w3 would work better.
> >
> If Chinese is segmented into single characters, like w1w2w3 => w1 w2 w3,
> then a search for "w1w2" and a search for "w2w1" will return the same
> results, won't they?
>
>
> With bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3", or even
> trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4", the
> character-order problem above is avoided (a plain-Java sketch of these
> segmentations follows the quoted message below).
>
> According to the statistics, the bigram-based word segmentation returned
> the best results, but the QueryParser then needs to combine the query
> tokens with "AND" by default.
>
> You can try the bigram-based word segmentation at http://search.163.com in
> the category search and the news search (the web page search is powered by
> Google).  Google's Chinese language analysis is provided by Basis Technology
> (basistech), using dictionary-based word segmentation:
> http://www.basistech.com/products/language-analysis/cma.html
>
>
>
> Che, Dong
>
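
For reference, here is the plain-Java sketch of the unigram, bigram, and
trigram character segmentations mentioned above (again no Lucene dependency;
the class name is illustrative):

import java.util.ArrayList;
import java.util.List;

public class CjkNgramDemo {
    // Overlapping character n-grams:
    //   n=1: "w1w2w3w4" -> w1 w2 w3 w4
    //   n=2: "w1w2w3w4" -> w1w2 w2w3 w3w4
    //   n=3: "w1w2w3w4" -> w1w2w3 w2w3w4
    static List<String> ngrams(String s, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中文分词"; // a four-character CJK example
        System.out.println("unigrams: " + ngrams(text, 1));
        System.out.println("bigrams:  " + ngrams(text, 2));
        System.out.println("trigrams: " + ngrams(text, 3));
    }
}

With unigram tokens, "w1w2" and "w2w1" produce the same token set {w1, w2},
so only a phrase query can tell them apart; the overlapping bigrams and
trigrams preserve local character order inside the tokens themselves.  On the
query side, the bigram tokens still have to be combined with AND by default,
otherwise any single matching bigram is enough to return a document.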


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

