lucene-dev mailing list archives

From Alex Murzaku <murz...@yahoo.com>
Subject Re: fixed url and How to contribute code to lucene sandbox?
Date Thu, 12 Sep 2002 11:34:15 GMT
I don't know any Asian languages, but from earlier experiments I
remember that bigram tokenization can sometimes hurt matching, e.g.:

w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
miss a search for w2 alone, because w2 never appears as a token by
itself. Tokenizing as w1 w2 w3 would work better.
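
To make the case concrete, here is a minimal sketch in Python (not Lucene
code; the function names, the "_" padding character, and the use of "ABC"
to stand in for w1w2w3 are all assumptions made for illustration). It
compares one-character tokens with overlapping bigrams, with and without
boundary padding, and treats a match as a contiguous run of query tokens
in the document tokens:

```python
def unigrams(text):
    """One token per character: the 'w1 w2 w3' scheme."""
    return list(text)


def bigrams(text, pad=False):
    """Overlapping character pairs: the 'w1w2 w2w3' scheme.

    With pad=True the text is wrapped in '_' first, giving the
    '_w1 w1w2 w2w3 w3_' variant. The '_' marker is an assumption.
    """
    if not text:
        return []
    if pad:
        text = "_" + text + "_"
    if len(text) < 2:
        # A single character cannot form a pair; emit it unchanged.
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query_tokens, doc_tokens):
    """True if the query tokens occur as a contiguous run in the document."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = "ABC"    # stands in for w1w2w3
query = "B"    # stands in for w2

# Plain bigrams: the document has no token equal to "B".
bigram_hit = matches(bigrams(query), bigrams(doc))

# Padded bigrams: the query becomes "_B", "B_", which are also absent.
padded_hit = matches(bigrams(query, pad=True), bigrams(doc, pad=True))

# Single-character tokens: "B" is a token of the document.
unigram_hit = matches(unigrams(query), unigrams(doc))

print(bigram_hit, padded_hit, unigram_hit)
```

A two-character query such as "BC" is found under both schemes in this
sketch, so the problem is specific to queries shorter than the gram size.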

--- Doug Cutting <cutting@lucene.com> wrote:
> Che Dong wrote:
> > 2. CJK support:
> >     2.1 sigram based (no word segment, just use one character as a
> >         token): modified from StandardTokenizer.java
> >     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
> >     CJKTokenizer for Asia language (Chinese Japanese Korean) Word Segment
> >     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
> >     StandardTokenizer with sigram based CJK Support
> >
> >     2.2 bigram based word segment: modified from SimpleTokenizer to
> >         CJKTokenizer.java
> >     http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html
> 
> I think it would be great to have some support for Asian languages
> built into Lucene.  Which of these approaches do you think is best?
> I like the idea of a StandardTokenizer or SimpleTokenizer that
> automatically provides this via bigrams.  What do others think?
> 
> Doug
> 
> 
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
> 


=====
__________________________________
alex@lissus.com -- http://www.lissus.com



