From: Alex Murzaku
Date: Thu, 12 Sep 2002 04:34:15 -0700 (PDT)
To: Lucene Developers List
Subject: Re: fixed url and How to contribute code to lucene sandbox?
Message-ID: <20020912113415.27911.qmail@web11902.mail.yahoo.com>
In-Reply-To: <3D7FBF0B.5050309@lucene.com>

I don't know any Asian languages, but from earlier experiments I
remember that bigram tokenization can sometimes hurt matching, e.g.:

    w1w2w3 == tokenized as ==> w1w2 w2w3
    (or even _w1 w1w2 w2w3 w3_)

would miss a search for w2. Tokenizing as w1 w2 w3 would work better.

--- Doug Cutting wrote:
> Che Dong wrote:
> > 2. CJK support:
> > 2.1 sigram based (no word segment, just use one character as a
> > token): modified from StandardTokenizer.java
> >
> > http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
> > CJKTokenizer for Asia language(Chinese Japanese Korean) Word Segment
> >
> > http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
> > StandardTokenizer with sigram based CJK Support
> >
> > 2.2 bigram based word segment: modified from SimpleTokenizer to
> > CJKTokenizer.java
> >
> > http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html
>
> I think it would be great to have some support for Asian languages
> built into Lucene. Which of these approaches do you think is best?
> I like the idea of a StandardTokenizer or SimpleTokenizer that
> automatically provides this via bigrams. What do others think?
>
> Doug

=====
alex@lissus.com -- http://www.lissus.com
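
To make the w1w2w3 example concrete, here is a rough sketch of the mismatch. It is not the CJKTokenizer or StandardTokenizer code discussed in this thread; the class NgramSketch and its methods ngrams and matches are made up for illustration, "ABC" stands in for the three characters w1w2w3, and the lookup is assumed to be a plain exact-term match with no query-side analysis.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
final class NgramSketch {

    /** Splits text into overlapping tokens of n Java chars (n = 1 gives single characters). */
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    /** Assumed exact term lookup: the query term must equal an indexed token. */
    static boolean matches(List<String> indexedTokens, String queryTerm) {
        return indexedTokens.contains(queryTerm);
    }

    public static void main(String[] args) {
        String text = "ABC"; // stands in for w1w2w3

        List<String> bigrams = ngrams(text, 2);  // [AB, BC]
        List<String> unigrams = ngrams(text, 1); // [A, B, C]

        // Searching for the single character w2 ("B"):
        System.out.println("bigram index matches B:  " + matches(bigrams, "B"));  // false
        System.out.println("unigram index matches B: " + matches(unigrams, "B")); // true
    }
}
```

Whether this matters in practice depends on how queries are analyzed. If the query side also emits single characters or the padded _w1 / w3_ forms, the outcome may differ from this sketch.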