lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: release 1.3-final in 2003?
Date Mon, 22 Dec 2003 19:28:30 GMT
+1

I'm willing to include this patch in 1.3 final.  Are there any who see 
problems with it or otherwise oppose it?

Doug

John McNally wrote:
> I'd certainly like to see a resolution to my sigram/cjk
> question/proposal from a few days ago.  It might not be a high-priority
> issue, but I think that if there is agreement, it is a simple fix.
> 
> I'm sure I'm discussing stuff that is well known in this community, but
> I will give some background just in case.  There are three main ways to
> create tokens out of text: character, multi-character (n-gram), and
> word.  Words are generally considered the best, though for CJK languages
> using words means using a dictionary, since delimiters such as
> whitespace are not usually used, and that increases complexity quite a bit.
> 
> An n-gram index usually has better precision than a character-based
> index but a much larger index size.  There is a bigram analyzer posted
> as an enhancement in Bugzilla.
> 
> A character-based index leads to long lists for each key, but despite
> that inefficiency, such indexes are easy to implement and have been shown
> to be useful for CJK; one can even use phrase matching to get word
> matches.  There was a patch made which uses the term sigram, which I
> interpret to mean character-based indexing.  It, however, appears flawed.
> It treats all consecutive CJK characters as a single token, which, in the
> case where there are no non-CJK characters in the text, is the same as
> whole-document matching.  As this is almost the same behavior that was
> available prior to the patch, I think I am right in thinking there is a bug.
> 
> The patch could be small:
> --- StandardTokenizer.jj-orig   2003-12-19 16:56:31.000000000 -0800
> +++ StandardTokenizer.jj        2003-12-19 16:54:43.000000000 -0800
> @@ -125,7 +125,7 @@
>      (<LETTER>|<DIGIT>)*
>    >
>   
> -| < SIGRAM: (<CJK>)+ >
> +| < SIGRAM: (<CJK>) >
>  | < #ALPHA: (<LETTER>)+>
>  | < #LETTER:                                     // unicode letters
>        [
> 
> 
> I would think that removing SIGRAM and only using CJK as the token would
> be better, but I don't have a setup to test these changes.
> 
> Any chance this can be addressed?
> 
> john mcnally
> 
> 
> 
> On Fri, 2003-12-19 at 13:31, Doug Cutting wrote:
> 
>>I'm thinking of making a 1.3 final release in the next few days.
>>
>>Any objections?
>>
>>Doug
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 
> 
> 
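For readers less familiar with the background John gives above, here is a minimal sketch of character and bigram tokenization. It is not Lucene code: the class name CjkNGramSketch and the method ngrams are made up for illustration, and word tokenization is left out because, as noted above, it needs a dictionary. The sample text is written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class CjkNGramSketch {

    /**
     * Splits text into overlapping n-grams of Unicode code points.
     * n = 1 yields one token per character, n = 2 yields bigrams.
     */
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            // Build each token from n consecutive code points.
            tokens.add(new String(codePoints, i, n));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters: U+6771 U+4EAC U+90FD U+5E81.
        String text = "\u6771\u4EAC\u90FD\u5E81";
        System.out.println(ngrams(text, 1)); // four single-character tokens
        System.out.println(ngrams(text, 2)); // three overlapping bigrams
    }
}
```

A text of k characters gives k character tokens but only a small set of distinct keys, hence the long per-key lists; bigrams give k - 1 tokens drawn from a much larger set of distinct keys, which is one way to see the index-size tradeoff described above.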


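The effect of the one-character change in the patch above can be sketched outside JavaCC as follows. This is only a rough simulation under stated assumptions: SigramSketch and sigramTokens are hypothetical names, Character.isIdeographic stands in for the grammar's <CJK> character class (the real class in StandardTokenizer.jj may cover different ranges), and non-CJK characters are simply skipped rather than handled by the grammar's other rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the SIGRAM rule; not generated from the grammar.
final class SigramSketch {

    /**
     * oneOrMore = true  mimics  SIGRAM: (<CJK>)+  (a whole CJK run is one token).
     * oneOrMore = false mimics  SIGRAM: (<CJK>)   (each CJK character is a token).
     */
    static List<String> sigramTokens(String text, boolean oneOrMore) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int cp : text.codePoints().toArray()) {
            // Assumption: isIdeographic approximates the grammar's <CJK> class.
            if (Character.isIdeographic(cp)) {
                run.appendCodePoint(cp);
                if (!oneOrMore) {
                    flush(run, tokens); // emit every character on its own
                }
            } else {
                flush(run, tokens); // a non-CJK character ends the current run
            }
        }
        flush(run, tokens);
        return tokens;
    }

    /** Emits the pending run as a token, if there is one, and resets it. */
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "\u6771\u4EAC\u90FD ok \u5E81";
        System.out.println(sigramTokens(text, true));  // runs as single tokens
        System.out.println(sigramTokens(text, false)); // one token per character
    }
}
```

With the (<CJK>)+ form, a document containing only CJK characters collapses into a single token, which is the whole-document matching John describes; with the single-character form, every CJK character becomes its own term and phrase queries can recover multi-character words.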

