lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject about bigram based word segment
Date Fri, 13 Sep 2002 01:43:11 GMT
> I don't know any Asian languages, but from earlier experiments I
> remember that bigram tokenization can sometimes hurt matching, e.g.:
> 
> w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> miss a search for w2. w1 w2 w3 would work better.
> 
If Chinese is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
then searches for "w1w2" and "w2w1" will return the same results, won't they?


Bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
avoids the character-order problem above.
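
To make the difference concrete, here is a minimal sketch in Python. It is not Lucene code: the helper name ngrams is made up for illustration, and Latin letters stand in for the Chinese characters w1, w2, w3.

```python
def ngrams(text, n):
    """Return the overlapping character n-grams of text, in order.

    Hypothetical helper for illustration only; not part of Lucene.
    """
    if not text:
        return []
    if len(text) < n:
        # Too short to form a full n-gram: keep the text as one token.
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Single-character segmentation: "ab" and "ba" give the same set of
# tokens, so a query that ignores token positions cannot tell them apart.
same_unigrams = set(ngrams("ab", 1)) == set(ngrams("ba", 1))

# Bigram segmentation keeps the order inside each token.
same_bigrams = set(ngrams("ab", 2)) == set(ngrams("ba", 2))

print(ngrams("abc", 2))   # bigrams of w1w2w3
print(ngrams("abcd", 3))  # trigrams of w1w2w3w4
print(same_unigrams, same_bigrams)
```

Note that a one-character query such as "b" yields the single token "b", which never appears among the bigram tokens. That is the missed match described in the quoted text.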

According to the statistics, bigram-based word segmentation returned the best results, but it
needs the QueryParser to combine the query terms with an "AND" relation by default.
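
Why the "AND" default matters can be sketched with a toy inverted index. This is plain Python under my own assumptions, not the Lucene QueryParser; the names bigrams, build_index, search_and and search_or are hypothetical, and Latin letters again stand in for Chinese characters.

```python
from collections import defaultdict


def bigrams(text):
    """Overlapping character bigrams; a one-character text stays as is."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or ([text] if text else [])


def build_index(docs):
    """Map each bigram token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in bigrams(text):
            index[token].add(doc_id)
    return index


def search_and(index, query):
    """Return documents containing every bigram of the query."""
    result = None
    for token in bigrams(query):
        postings = index.get(token, set())
        result = set(postings) if result is None else result & postings
    return result or set()


def search_or(index, query):
    """Return documents containing at least one bigram of the query."""
    result = set()
    for token in bigrams(query):
        result |= index.get(token, set())
    return result


docs = {1: "abcd", 2: "abxy", 3: "zbcd"}
index = build_index(docs)

print(search_and(index, "abc"))  # only the document with both "ab" and "bc"
print(search_or(index, "abc"))   # any document with "ab" or "bc"
```

With "OR", the query "abc" also matches documents that share only one of its bigrams, so the results get noisy. With "AND", only documents containing all of the query's bigrams come back.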

You can try bigram-based word segmentation at http://search.163.com in the category search and
news search (the web page search there is powered by Google).
Google's Chinese language analysis is provided by Basistech, using dictionary-based word segmentation:
http://www.basistech.com/products/language-analysis/cma.html



Che, Dong



