lucene-dev mailing list archives

From "Burton-West, Tom" <>
Subject RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906
Date Sat, 17 Dec 2011 00:32:51 GMT
Hi Robert,

Thanks for the quick and thoughtful response. 

I didn't realize these complexities and thought maybe there was an easy solution :)

We may be involved in a project that includes Tibetan text and, given our current resources
and priorities, we would stick it in the same field as the other 400+ languages.  I was hoping
that with the script attribute output by the ICUTokenizer, we could figure out something to
do script/language specific processing for Tibetan without adversely affecting anything else.
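
A rough sketch of that idea, in plain Python rather than Lucene code: tokens are modeled as hypothetical (text, script) pairs, and the function name `script_gated_bigrams` and its emit rules (bigrams within a same-script run, a lone token passed through as a unigram) are assumptions for illustration, not the behavior of any existing filter.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (text, script); a stand-in for a token plus its script attribute


def script_gated_bigrams(tokens: List[Token], bigram_scripts: Set[str]) -> List[str]:
    """Bigram adjacent tokens only when both carry the same listed script.

    Tokens in other scripts pass through unchanged and end the current run.
    """
    out: List[str] = []
    run: List[str] = []   # pending tokens that share one gated script
    run_script = None

    def flush() -> None:
        if len(run) == 1:
            out.append(run[0])  # a lone token stays a unigram
        else:
            out.extend(a + b for a, b in zip(run, run[1:]))
        run.clear()

    for text, script in tokens:
        if script in bigram_scripts:
            if script != run_script:
                flush()          # never bigram across a script change
                run_script = script
            run.append(text)
        else:
            flush()
            run_script = None
            out.append(text)
    flush()
    return out


tokens = [("hello", "Latin"), ("中", "Han"), ("文", "Han"), ("字", "Han"), ("world", "Latin")]
print(script_gated_bigrams(tokens, {"Han"}))
```

Because the gate is just a set of script names, adding "Tibetan" to it would bigram Tibetan syllables the same way while leaving every other script untouched. Note that this sketch joins the two tokens directly, which suits Han but would drop the tsheg between Tibetan syllables.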

>> I suppose to inhibit stupid bigrams you would *not* shingle across shad as well

Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan phrase separators
but downstream filters won't know that, so we couldn't have a downstream filter that avoided
bigramming across a phrase separator. On the other hand, it might be that "stupid" overlapping
bigrams don't hurt retrieval compared to treating syllables as if they were words, i.e. syllable
unigrams. (I've not been able to find much published research in English on the issue, and
many of the references are to articles in Chinese-language publications. I'm pretty much
relying on the article by Hackett and Oard.)
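
To make the trade-off concrete, here is a small sketch of the two options. It assumes a hypothetical `after_phrase_break` flag on each syllable, which is exactly the information a downstream filter would need and, per the above, does not currently receive. The syllables are placeholders, not real Tibetan.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    text: str
    # Hypothetical flag: True if a phrase separator (e.g. shad) came just
    # before this syllable in the original text.
    after_phrase_break: bool = False


def syllable_bigrams(syllables: List[Syllable], respect_breaks: bool) -> List[str]:
    """Overlapping syllable bigrams, optionally skipping pairs that span a phrase break."""
    out: List[str] = []
    for prev, cur in zip(syllables, syllables[1:]):
        if respect_breaks and cur.after_phrase_break:
            continue  # this pair would straddle the separator
        out.append(prev.text + " " + cur.text)
    return out


# Placeholder input standing for: s1 s2 <shad> s3 s4
stream = [Syllable("s1"), Syllable("s2"), Syllable("s3", after_phrase_break=True), Syllable("s4")]
print(syllable_bigrams(stream, respect_breaks=False))  # includes the cross-break pair "s2 s3"
print(syllable_bigrams(stream, respect_breaks=True))   # omits it
```

Without the flag, the only available behavior is the first one, so the open question is whether the extra cross-break bigrams cost anything in retrieval.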


Hackett, P. G., & Oard, D. W. (2000). Comparison of word-based and syllable-based retrieval
for Tibetan (poster session). In Proceedings of the Fifth International Workshop on Information
Retrieval with Asian Languages (IRAL '00) (pp. 197-198). Hong Kong, China.
doi:10.1145/355214.355242

-----Original Message-----
From: Robert Muir [] 
Sent: Friday, December 16, 2011 6:45 PM
Subject: Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom <> wrote:
> The ICUTokenizer now adds a script attribute for tokens (as do StandardTokenizer
> and a couple of others; LUCENE-2911), for example “Tibetan” or
> “Han”. If the Shingle filter had some provision to only make token n-grams
> when the script attribute matched some specified script, it would solve both
> the need to produce character bigrams for CJK (Han) and syllable bigrams
> for Tibetan. We already opened an issue to create overlapping bigrams for
> CJK (LUCENE-2906).

Not sure it totally would because there are key important differences,
and a few complications:
1. CJKTokenizer today creates bigrams in runs of "cjk" text where this is
something like: [IHK]+ (run of ideographic, hiragana, katakana). There
are different variations on this available too, like only bigram I+
and do something else with the katakana (like keep as word). Seems
like the verdict from previous studies is that there are options there
and they tend to both work well. But one thing is still for sure, I
think it would be bad here to form bigrams across what was not contiguous
text (e.g. across sentence boundaries). Finally, some CJK
normalization (such as halfwidth/fullwidth conversion) is not 1:1
replacement and so really the process here should at least be aware of
this and consider some sequences of half-width-kana as a single
2. Unlike the CJK case, where you bigram a "run", Tibetan separates
syllables with special punctuation (tsheg among other things). The
reason you have syllables as output from these tokenizers is because
of this. So this is already a fundamentally different bigram
algorithm, because it's no longer contiguous runs; instead, syllables
often had something in between, and what that something
is tells you if it's e.g. a syllable separator or something more like a
phrase separator. I suppose to inhibit stupid bigrams you would *not*
shingle across shad as well.. how to generalize that? The verdict for
this language definitely isn't out here; I've only seen some very
initial rough work on this language and we aren't totally sure this
works well on average.
3. Other "complex" languages besides these are also emitting syllables
"at best", too: Thai,Lao,Myanmar,Khmer? Shouldn't we bigram those too?
Except, one implementation (ICUTokenizer) is emitting syllables here
(what type of syllable depends upon the current implementation, too!),
and the other (StandardTokenizer) is emitting whole phrases as words.
Would be great to bigram the former (we think!), but even more
horrible to do it to the latter. I put "we think" here because there
has really been no work done here, so it's just intuition/guessing.
And to make matters worse, we have a filter in contrib
(ThaiWordFilter) that relies upon the specifics of how
StandardTokenizer screws up Thai tokenization so it can 'retokenize'.
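
For reference, the run-based bigramming described in point 1 can be sketched as below. This is an illustrative approximation, not CJKTokenizer's actual code: the `is_ihk` test uses three simplified Unicode block ranges as a stand-in for "ideographic, hiragana, katakana", and text outside those runs is simply dropped here rather than tokenized.

```python
from itertools import groupby
from typing import List


def is_ihk(ch: str) -> bool:
    """Rough [IHK] test by Unicode block (a simplification; real classes are broader)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
    )


def cjk_run_bigrams(text: str) -> List[str]:
    """Overlapping bigrams inside each contiguous [IHK]+ run, never across a gap."""
    out: List[str] = []
    for in_run, chars in groupby(text, key=is_ihk):
        if not in_run:
            continue  # punctuation, spaces, other scripts end the run
        run = "".join(chars)
        if len(run) == 1:
            out.append(run)  # a single character has no bigram
        else:
            out.extend(run[i:i + 2] for i in range(len(run) - 1))
    return out


# The ideographic full stop splits the text into two runs, so no bigram spans it.
print(cjk_run_bigrams("東京都。大阪"))
```

The Tibetan case in point 2 does not fit this shape, since the "gap" between syllables is always present and only its kind (tsheg versus shad) says whether a bigram should be formed.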

> Would it make sense to open an issue for modifying the Shingle filter to
> have configurable script-specific behavior, or is this just another use case
> for LUCENE-2906?
> If it is another use case for LUCENE-2906, then perhaps we need to change
> the summary of the issue to generalize it beyond CJK.
> Any suggestions ?
> Tom Burton-West

