lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: release 1.3-final in 2003?
Date Mon, 22 Dec 2003 21:35:53 GMT
I admit not understanding sigram/CJK issues fully, but I trust Doug
does, so I'm +0.

Otis

--- Doug Cutting <cutting@lucene.com> wrote:
> +1
> 
> I'm willing to include this patch in 1.3 final.  Are there any who see
> problems with it or otherwise oppose it?
> 
> Doug
> 
> John McNally wrote:
> > I'd certainly like to see a resolution to my sigram/cjk
> > question/proposal a few days ago.  It might not be a high priority
> > issue, but I think if there is agreement it is a simple fix.
> > 
> > I'm sure I'm discussing stuff that is well known in this community, but
> > I will give some background just in case.  There are three main ways to
> > create tokens out of text: character, multi-character (n-gram), and
> > word.  Words are generally considered the best, though for CJK languages
> > using words means using a dictionary, since delimiters such as
> > whitespace are not usually used, which increases complexity quite a bit.
> > 
> > An n-gram index usually has better precision than a character-based
> > index but a much larger index size.  There is a bigram analyzer posted
> > as an enhancement in Bugzilla.
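
To make the character versus n-gram distinction concrete, here is a minimal sketch in plain Java. It is not the bigram analyzer from Bugzilla and uses no Lucene classes; the class name `NGramSketch` and the method `ngrams` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // Returns the overlapping n-grams of the text, counted in code points.
    // n = 1 yields single characters, n = 2 yields bigrams.
    static List<String> ngrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes to avoid encoding issues.
        String text = "\u65e5\u672c\u8a9e\u6587";
        System.out.println(ngrams(text, 1)); // four single-character tokens
        System.out.println(ngrams(text, 2)); // three overlapping bigrams
    }
}
```

A text of k characters gives k distinct positions either way, but the bigram vocabulary is far larger than the character vocabulary, which is where the extra index size comes from.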
> > 
> > A character-based index leads to long lists for each key, but despite
> > that inefficiency, such indexes are easy to implement and have been shown
> > to be useful for CJK; one can even use phrase matching to get word
> > matches.  There was a patch made which uses the term sigram, which I
> > interpret to mean character-based indexing.  It, however, appears flawed.
> > It treats all consecutive CJK characters as a single token, which, in
> > the case where there are no non-CJK characters in the text, is the same
> > as whole-document matching.  As this is almost the same behavior that
> > was available prior to the patch, I think I am right in thinking there
> > is a bug.
> > 
> > The patch could be small:
> > --- StandardTokenizer.jj-orig   2003-12-19 16:56:31.000000000 -0800
> > +++ StandardTokenizer.jj        2003-12-19 16:54:43.000000000 -0800
> > @@ -125,7 +125,7 @@
> >      (<LETTER>|<DIGIT>)*
> >    >
> >   
> > -| < SIGRAM: (<CJK>)+ >
> > +| < SIGRAM: (<CJK>) >
> >  | < #ALPHA: (<LETTER>)+>
> >  | < #LETTER:                                     // unicode letters
> >        [
> > 
> > 
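
The effect of dropping the `+` from the SIGRAM rule can be sketched with a small hand-written scanner in plain Java. This is an illustration under stated assumptions, not the JavaCC-generated StandardTokenizer: the class `SigramSketch` is hypothetical, and `isCjk` uses `Character.UnicodeScript` as a stand-in for the grammar's `<CJK>` character ranges, which may differ.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class SigramSketch {
    // Assumption: approximates the grammar's <CJK> class by Unicode script.
    static boolean isCjk(int cp) {
        UnicodeScript s = UnicodeScript.of(cp);
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA
            || s == UnicodeScript.KATAKANA || s == UnicodeScript.HANGUL;
    }

    // greedyCjk = true  mimics  < SIGRAM: (<CJK>)+ >  (a whole run is one token)
    // greedyCjk = false mimics  < SIGRAM: (<CJK>) >   (one token per character)
    static List<String> tokenize(String text, boolean greedyCjk) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < cps.length) {
            int end = i + 1;
            if (isCjk(cps[i])) {
                while (greedyCjk && end < cps.length && isCjk(cps[end])) end++;
                tokens.add(new String(cps, i, end - i));
            } else if (Character.isLetterOrDigit(cps[i])) {
                // Non-CJK letters and digits are grouped into one word token.
                while (end < cps.length && !isCjk(cps[end])
                        && Character.isLetterOrDigit(cps[end])) end++;
                tokens.add(new String(cps, i, end - i));
            }
            // Anything else (whitespace, punctuation) is skipped.
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene \u65e5\u672c\u8a9e search";
        System.out.println(tokenize(text, true));  // CJK run kept as one token
        System.out.println(tokenize(text, false)); // CJK run split per character
    }
}
```

With the greedy rule, a document made up only of CJK characters collapses into a single token, which is the whole-document matching described above; with the single-character rule, each character is indexed separately and phrase queries can recover word matches.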
> > I would think that removing SIGRAM and only using CJK as the token would
> > be better, but I don't have a setup to test these changes.
> > 
> > Any chance this can be addressed?
> > 
> > john mcnally
> > 
> > 
> > 
> > On Fri, 2003-12-19 at 13:31, Doug Cutting wrote:
> > 
> >>I'm thinking of making a 1.3 final release in the next few days.
> >>
> >>Any objections?
> >>
> >>Doug
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org




