lucene-dev mailing list archives

From John McNally <jmcna...@apache.org>
Subject sigram?
Date Tue, 16 Dec 2003 18:46:40 GMT
I'm also trying to figure out what the term "sigram" is meant to
denote, because my interpretation does not agree with the
implementation in StandardTokenizer.jj.

There is another Bugzilla entry for a bigram tokenizer, which basically
slides a selection window over the text, creating many tokens of 2
characters each:  abcd is tokenized as {ab, bc, cd}.  I was therefore
expecting the sigram to tokenize abcd as {a, b, c, d}.  What the
StandardTokenizer does, though, is tokenize abcd as {abcd}.
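
To make the sliding-window reading concrete, here is a minimal Python
sketch.  It is only an illustration of what I expected, not code from
Lucene or from the Bugzilla entry; the function name char_ngrams is
made up for this example.

```python
def char_ngrams(text, n):
    """Slide a window of width n over text, one character at a time."""
    # A string of length L yields L - n + 1 windows (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("abcd", 2))  # bigram reading: ['ab', 'bc', 'cd']
print(char_ngrams("abcd", 1))  # sigram reading I expected: ['a', 'b', 'c', 'd']
```

With n = 1 every character becomes its own token, which is the
behavior I assumed "sigram" referred to.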

Note that I am using ASCII characters above, but the argument is meant
for CJK characters. In the rest of this email I'll write <CJKphrasex> to
mean a series of CJK characters, which should reduce confusion.

So if I index <CJKphrase1><CJKphrase2>, the current grammar (ST.jj)
will create a single token <CJKphrase1CJKphrase2>.  This is actually not
much different from Lucene's behavior in 1.2 without any CJK support.
The only difference appears when the text also contains other
characters, such as digits or the Latin alphabet.

The current code will tokenize <CJKphrase1>123<CJKphrase2> as
{<CJKphrase1>, 123, <CJKphrase2>}, whereas version 1.2 would still
create a single token, <CJKphrase1123CJKphrase2>.
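
The difference can be mimicked with two regular expressions.  This is a
rough Python sketch under loud assumptions: CJK is approximated by the
CJK Unified Ideographs block (U+4E00 to U+9FFF) only, all other input is
reduced to ASCII letters and digits, and the names CURRENT, V1_2, and
tokenize are invented for illustration.  It is not the JavaCC grammar
itself.

```python
import re

# Assumption: only the CJK Unified Ideographs block counts as CJK here.
CJK = "\u4e00-\u9fff"

# Sketch of the current grammar: a CJK run and an ASCII alphanumeric run
# are separate token types, so a switch between them splits the token.
CURRENT = re.compile(f"[{CJK}]+|[A-Za-z0-9]+")

# Sketch of the 1.2 behavior described above: CJK characters are treated
# like any other letter, so the whole run stays together.
V1_2 = re.compile(f"[{CJK}A-Za-z0-9]+")


def tokenize(pattern, text):
    """Return every non-overlapping match of pattern, left to right."""
    return pattern.findall(text)


sample = "中文123日本"
print(tokenize(CURRENT, sample))  # ['中文', '123', '日本']
print(tokenize(V1_2, sample))     # ['中文123日本']
```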

Is this the intended behavior, where

SIGRAM = (CJK)+

or should it be

SIGRAM = CJK

?
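
For comparison, the two candidate definitions behave as follows on a
pure CJK string.  Again this is only a Python regex sketch, with CJK
assumed to be the single range U+4E00 to U+9FFF and with the names
SIGRAM_RUN and SIGRAM_SINGLE invented for the example.

```python
import re

# Assumption: simplified CJK character class, one Unicode block only.
CJK_CLASS = "[\u4e00-\u9fff]"

SIGRAM_RUN = re.compile(CJK_CLASS + "+")  # SIGRAM = (CJK)+
SIGRAM_SINGLE = re.compile(CJK_CLASS)     # SIGRAM = CJK

text = "中文日本"
run_tokens = SIGRAM_RUN.findall(text)
single_tokens = SIGRAM_SINGLE.findall(text)

print(run_tokens)     # ['中文日本']
print(single_tokens)  # ['中', '文', '日', '本']
```

The first form yields one token per uninterrupted CJK run; the second
yields one token per character, matching the {a, b, c, d} reading.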

john mcnally


