lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: sigram?
Date Tue, 16 Dec 2003 19:40:46 GMT
John - I moderated your message into the list as you were not 
subscribed.  I rarely do this, but I made an exception for you :)  Be 
sure to subscribe with the address you're sending from.

	Erik


On Tuesday, December 16, 2003, at 01:46  PM, John McNally wrote:

> I'm also trying to figure out what this term is intended to mean,
> because my interpretation does not agree with the implementation in
> StandardTokenizer.jj.
>
> There is another Bugzilla entry for a bigram tokenizer which basically
> slides a selection window over the text creating many tokens of 2
> characters each:  abcd is tokenized as {ab, bc, cd}.  I was therefore
> expecting the sigram to tokenize abcd as {a, b, c, d}.  What the
> StandardTokenizer does though is tokenize abcd as {abcd}.
>
> Note I am using ASCII characters above, but the argument is meant for
> CJK characters. I'll switch to <CJKphrasex> in the rest of this email
> to mean a series of CJK characters, to hopefully reduce confusion.
>
> So if I index <CJKphrase1><CJKphrase2>, the current code (ST.jj) will
> create a token <CJKphrase1CJKphrase2>.  This is actually not much
> different from Lucene's behavior in 1.2 without any CJK support.  The
> slight difference occurs if you actually use some other characters,
> such as numbers or the Latin alphabet.
>
> The current code will tokenize <CJKphrase1>123<CJKphrase2> as
> {<CJKphrase1>, 123, <CJKphrase2>}, whereas version 1.2 would still
> create a single token, <CJKphrase1123CJKphrase2>.
>
> Is this the intended behavior where
>
> SIGRAM = (CJK)+
>
> or should it be
>
> SIGRAM = CJK
>
> ?
>
> john mcnally
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
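
For readers following the thread, the three tokenizations John contrasts
can be sketched in a few lines of Java. This is only an illustration of
the sliding-window idea, not the code in StandardTokenizer.jj or in the
Bugzilla bigram patch; the class name NGramSketch and the helper ngrams
are hypothetical, and the sketch works on Java char units, so it assumes
characters that fit in a single char.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Hypothetical helper: slide a window of n chars over the text and
    // emit one token per window position.
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Bigram window: abcd -> {ab, bc, cd}
        System.out.println(ngrams("abcd", 2));
        // Single-character reading (SIGRAM = CJK): abcd -> {a, b, c, d}
        System.out.println(ngrams("abcd", 1));
        // Whole-run reading (SIGRAM = (CJK)+): abcd -> {abcd}; for this
        // input that is the same as a window as long as the text.
        System.out.println(ngrams("abcd", "abcd".length()));
    }
}
```

With n = 1 the output matches the {a, b, c, d} John expected from the
name, while the whole-run case matches the {abcd} he reports seeing.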

