+1
I'm willing to include this patch in 1.3 final. Does anyone see
problems with it or otherwise object to it?
Doug
John McNally wrote:
> I'd certainly like to see a resolution to my sigram/cjk
> question/proposal a few days ago. It might not be a high priority
> issue, but I think if there is agreement it is a simple fix.
>
> I'm sure I'm discussing stuff that is well known in this community, but
> will give some background just in case. There are three main ways to
> create tokens out of text: by character, by multi-character sequence
> (n-gram), and by word. Words are generally considered the best, though
> for CJK languages using words means using a dictionary, since delimiters
> such as whitespace are not usually used, which increases complexity
> quite a bit.
>
> An n-gram index usually has better precision than a character based
> index but a much larger index size. There is a bigram analyzer posted
> as an enhancement in bugzilla.
>
> A character-based index leads to long lists for each key, but despite
> that inefficiency it is easy to implement and has been shown to be
> useful for CJK; one can even use phrase matching to get word matches.
> There was a patch made which uses the term sigram, which I interpret to
> mean character-based indexing. It, however, appears flawed. It treats
> each run of consecutive CJK characters as a single token, which, when
> the text contains no non-CJK characters, is the same as whole-document
> matching. As this is almost the same behavior that was available prior
> to the patch, I think I am right in thinking there is a bug.
>
> The patch could be small:
> --- StandardTokenizer.jj-orig 2003-12-19 16:56:31.000000000 -0800
> +++ StandardTokenizer.jj 2003-12-19 16:54:43.000000000 -0800
> @@ -125,7 +125,7 @@
> (<LETTER>|<DIGIT>)*
> >
>
> -| < SIGRAM: (<CJK>)+ >
> +| < SIGRAM: (<CJK>) >
> | < #ALPHA: (<LETTER>)+>
> | < #LETTER: // unicode letters
> [
>
>
> I would think that removing SIGRAM and only using CJK as the token would
> be better, but I don't have a setup to test these changes.
>
> Any chance this can be addressed?
>
> john mcnally
>
>
>
> On Fri, 2003-12-19 at 13:31, Doug Cutting wrote:
>
>>I'm thinking of making a 1.3 final release in the next few days.
>>
>>Any objections?
>>
>>Doug
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>
>
>
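For anyone following along without a CJK test setup, the difference
between the two grammar rules in John's patch can be sketched outside of
Lucene. The following is only an illustration, not the StandardTokenizer
code: the class name CjkTokenSketch, its methods, and the isCjk() check
are hypothetical, and isCjk() is a rough stand-in for the grammar's <CJK>
character class (it covers only a few Unicode blocks). Non-CJK text is
simply skipped here. runTokens() mimics (<CJK>)+, charTokens() mimics
(<CJK>), and bigramTokens() shows the overlapping n-gram alternative
mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class CjkTokenSketch {
    // Assumption: approximates the grammar's <CJK> class with four Unicode blocks.
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // Like (<CJK>)+ : every maximal run of CJK characters becomes one token.
    static List<String> runTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                run.append(c);
            } else if (run.length() > 0) {
                out.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    // Like (<CJK>) : every CJK character is its own token.
    static List<String> charTokens(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                out.add(String.valueOf(c));
            }
        }
        return out;
    }

    // Overlapping bigrams inside each CJK run; a lone character stays a unigram.
    static List<String> bigramTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String run : runTokens(text)) {
            if (run.length() == 1) {
                out.add(run);
                continue;
            }
            for (int i = 0; i + 2 <= run.length(); i++) {
                out.add(run.substring(i, i + 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four ideographs, a Latin word, then two more ideographs.
        String text = "\u4e2d\u56fd\u4eba\u6c11 Lucene \u65e5\u672c";
        System.out.println("runs:    " + runTokens(text));
        System.out.println("chars:   " + charTokens(text));
        System.out.println("bigrams: " + bigramTokens(text));
    }
}
```

With the run-based rule the sample text yields only two CJK tokens, so a
query for a single character or a shorter word inside a run cannot match.
The per-character rule yields six tokens, and a phrase query over adjacent
characters can then recover word matches, at the cost of longer lists per
term.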