lucene-dev mailing list archives

From John McNally <jmcna...@collab.net>
Subject Re: release 1.3-final in 2003?
Date Sat, 20 Dec 2003 00:59:40 GMT
I'd certainly like to see a resolution to the sigram/CJK question and
proposal I posted a few days ago.  It might not be a high-priority
issue, but if there is agreement, I think it is a simple fix.

I'm sure I'm discussing stuff that is well known in this community, but
I will give some background just in case.  There are three main ways to
create tokens out of text: by character, by multi-character sequence
(n-gram), and by word.  Words are generally considered the best.  For
CJK languages, though, using words means using a dictionary, since
delimiters such as whitespace are not usually used, and that increases
complexity quite a bit.

An n-gram index usually has better precision than a character-based
index but a much larger index size.  A bigram analyzer has been posted
as an enhancement in Bugzilla.
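
To make the difference between character and n-gram tokens concrete,
here is a minimal sketch.  It is not Lucene code and not the analyzer
posted in Bugzilla; the class name CjkNgrams and the method ngrams are
hypothetical, chosen only for illustration.  It splits a string into
overlapping n-grams of code points, so n = 1 gives character tokens and
n = 2 gives bigrams.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkNgrams {
    // Hypothetical helper: overlapping n-grams over Unicode code points.
    // Returns an empty list when the text is shorter than n.
    static List<String> ngrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes to avoid source-encoding issues.
        String text = "\u6771\u4EAC\u90FD\u5E81";
        System.out.println(ngrams(text, 1)); // four single-character tokens
        System.out.println(ngrams(text, 2)); // three overlapping bigrams
    }
}
```

For a run of k characters this yields k character tokens drawn from a
small vocabulary, or k - 1 bigrams drawn from a much larger one, which
is the trade-off between long per-key lists and index size.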

A character-based index leads to long lists for each key, but despite
that inefficiency such indexes are easy to implement and have been shown
to be useful for CJK; one can even use phrase matching to get word
matches.  There was a patch that uses the term "sigram", which I
interpret to mean character-based indexing.  It appears flawed, however.
It treats each run of consecutive CJK characters as a single token,
which, when the text contains no non-CJK characters, is the same as
whole-document matching.  As this is almost the same behavior that was
available prior to the patch, I think I am right in thinking there is a
bug.
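
The behavior described above can be illustrated with java.util.regex
standing in for the JavaCC grammar.  This is only a sketch under stated
assumptions: the character class below is a simplified stand-in for the
tokenizer's <CJK> definition, not a copy of it, and the names
SigramDemo, RUN, SINGLE, and tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SigramDemo {
    // Assumption: an approximation of <CJK> using Unicode script classes.
    static final String CJK = "[\\p{IsHan}\\p{IsHiragana}\\p{IsKatakana}]";
    // Analogue of (<CJK>)+ : a whole run of CJK characters is one token.
    static final Pattern RUN = Pattern.compile(CJK + "+");
    // Analogue of (<CJK>) : each CJK character is its own token.
    static final Pattern SINGLE = Pattern.compile(CJK);

    // Collects every match of the pattern in the text, in order.
    static List<String> tokens(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Three CJK characters, an ASCII word, then one more CJK character.
        String text = "\u6771\u4EAC\u90FD ok \u5E81";
        System.out.println(tokens(RUN, text).size());    // 2 tokens: one per run
        System.out.println(tokens(SINGLE, text).size()); // 4 tokens: one per character
    }
}
```

With the run pattern, a text made up entirely of CJK characters comes
out as one token; with the single-character pattern, it comes out as
one token per character.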

The patch could be small:
--- StandardTokenizer.jj-orig   2003-12-19 16:56:31.000000000 -0800
+++ StandardTokenizer.jj        2003-12-19 16:54:43.000000000 -0800
@@ -125,7 +125,7 @@
     (<LETTER>|<DIGIT>)*
   >
  
-| < SIGRAM: (<CJK>)+ >
+| < SIGRAM: (<CJK>) >
 | < #ALPHA: (<LETTER>)+>
 | < #LETTER:                                     // unicode letters
       [


I would think that removing SIGRAM and using only CJK as the token would
be better, but I don't have a setup to test these changes.

Any chance this can be addressed?

john mcnally



On Fri, 2003-12-19 at 13:31, Doug Cutting wrote:
> I'm thinking of making a 1.3 final release in the next few days.
> 
> Any objections?
> 
> Doug
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org



