lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <>
Subject [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
Date Sat, 12 Jun 2010 13:57:16 GMT


Steven Rowe commented on LUCENE-2167:

bq. NewStandardTokenizer is not quite finished; I plan on stealing Robert's Southeast Asian
(Lao, Myanmar, Khmer) syllabification routine

Curious, what is your plan here? Do you plan to somehow "jflex-#include" these into the grammar
so that they are longest-matched instead of falling through to the Complex_Context rule?

Sorry, I haven't looked at the details yet, but roughly, yes, what you said.

bq. How do you handle the cases where the grammar cannot do forward-only deterministic matching?
(At least I don't see how it could, but maybe.) E.g. the Lao cases where some backtracking
is needed... and the combining-class reordering needed for real-world text?

I was thinking of trying to make regex versions of all of these and, failing that, recognizing
chunks that need special handling and handling them outside of matching, in methods on the tokenizer.

bq. Curious what would you plan to index for Thai, words? a grammar for TCC?

You had mentioned wanting to make a Thai syllabification routine - I was thinking that either
you or I would do this.

bq. Also, some of these syllable techniques are probably not very good for search without
doing a "shingle" later... in some cases it may perform OK, as single ideographs or Tibetan
syllables do with the grammar you have. For others (Khmer, etc.) I think the shingling is likely
mandatory, since they are really only a bit better than indexing grapheme clusters.

I'm thinking of leaving shingling for later, using the conditional branching filter idea (LUCENE-2470)
based on token type.
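The branching idea can be sketched without Lucene's TokenStream machinery. Below is a minimal, self-contained Java sketch (the `Token` record and the `"<SYLLABLE>"` type name are illustrative stand-ins, not actual Lucene types): only adjacent pairs of tokens whose type marks them as syllables get bigram-shingled, and everything else passes through untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeConditionalShingler {
    // Stand-in for a Lucene token: term text plus a type attribute.
    record Token(String term, String type) {}

    /** Bigram-shingle consecutive tokens of the given type; pass all other tokens through. */
    static List<Token> shingleByType(List<Token> in, String shingleType) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Token t = in.get(i);
            out.add(t);
            // Emit a shingle only when this token and the next share the target type.
            if (t.type().equals(shingleType)
                    && i + 1 < in.size()
                    && in.get(i + 1).type().equals(shingleType)) {
                out.add(new Token(t.term() + in.get(i + 1).term(), shingleType));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("foo", "<ALPHANUM>"),
                new Token("ka", "<SYLLABLE>"),
                new Token("ra", "<SYLLABLE>"),
                new Token("bar", "<ALPHANUM>"));
        // "ka" and "ra" produce the extra shingle "kara"; "foo" and "bar" are untouched.
        System.out.println(shingleByType(tokens, "<SYLLABLE>"));
    }
}
```

The point of keying on the type attribute is that a single filter can sit in any analysis chain and leave non-syllable scripts alone, which is what makes deferring the shingling out of the tokenizer plausible.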

bq. As far as needing punctuation for shingling, a similar problem already exists. For example,
after tokenizing, some information (punctuation) has already been discarded and it's too late
to do a nice shingle. Practical cheating/workarounds exist for CJK (you could look at the
offset or something and cheat to figure out that they were adjacent), but for something like
Tibetan the type of punctuation itself is important: the tsheg is an unambiguous syllable
separator but an ambiguous word separator, while the shad or whitespace is both.

You're arguing either for in-tokenizer shingling or passing non-tokenized data out of the
tokenizer in addition to the tokens.  Hmm.

bq. Here is the paper I brought up at ehatcher's house recently when we were discussing tibetan,
that recommends this syllable bigram technique, where the shingling is dependent on the original

Interesting paper. With syllable n-grams (in Tibetan, anyway), you trade (quadrupled) index
size for freedom from word segmentation, but otherwise the two approaches work equally well.
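To make that trade-off concrete, here is a hypothetical Java sketch of overlapping syllable bigrams (the syllable strings are made-up romanizations and the "·" joiner is arbitrary): a multi-syllable query term matches a document whenever its bigrams are a subset of the document's bigrams, with no word-segmentation step at all.

```java
import java.util.ArrayList;
import java.util.List;

public class SyllableBigrams {
    /** Overlapping bigrams over an unsegmented syllable sequence. */
    static List<String> bigrams(List<String> syllables) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < syllables.size(); i++) {
            out.add(syllables.get(i) + "·" + syllables.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical document and query syllables; no word boundaries anywhere.
        List<String> doc = List.of("bod", "skad", "yig");
        List<String> query = List.of("bod", "skad");
        // The query word matches because its single bigram appears in the document.
        System.out.println(bigrams(doc).containsAll(bigrams(query)));
    }
}
```

Each syllable appears in up to two bigrams, and each bigram is roughly twice the size of a syllable, which is where the index-size inflation comes from.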

bq. One alternative for the short term would be to make a TokenFilter that hooks into the
ICUTokenizer logic but looks for Complex_Context, or similar. I definitely agree it would
be best if StandardTokenizer worked the best out of the box without doing something like this.

Yeah, I'd rather build it into the new StandardTokenizer.

bq. Finally, I think it's worth considering a lot of this as a special case of a larger problem
that affects even English. For a lot of users, punctuation such as the hyphen in English might
have some special meaning, and they might want to shingle or do something else in that case too.
It's a general problem with TokenStreams that the tokenizer often discards this information
and the filters are left with only a partial picture. Some ideas to improve it would be to
make use of properties like [:Terminal_Punctuation=Yes:] somehow, or to try to integrate Sentence

I don't understand how sentence segmentation could help here.

One other possibility is to return *everything* from the tokenizer, marking the non-tokens
with an appropriate type, similar to how the ICU tokenizer works.  This has the unfortunate
side effect of *requiring* post-tokenization filtering to discard non-tokens.
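A rough Java sketch of that pipeline shape, assuming a made-up `"<NON_TOKEN>"` type and a plain `Token` record rather than Lucene's attribute API: the tokenizer emits everything, and a mandatory downstream filter drops the non-tokens before indexing.

```java
import java.util.List;
import java.util.stream.Collectors;

public class NonTokenFilter {
    // Stand-in for a Lucene token: term text plus a type attribute.
    record Token(String term, String type) {}

    /** The mandatory post-tokenization step: discard tokens marked as non-tokens. */
    static List<Token> dropNonTokens(List<Token> in) {
        return in.stream()
                .filter(t -> !t.type().equals("<NON_TOKEN>"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The tokenizer emitted punctuation too, marked with the non-token type.
        List<Token> all = List.of(
                new Token("hello", "<ALPHANUM>"),
                new Token(",", "<NON_TOKEN>"),
                new Token("world", "<ALPHANUM>"));
        System.out.println(dropNonTokens(all).size()); // the two real tokens survive
    }
}
```

The upside is that a type-aware shingle filter upstream of this step still sees the punctuation; the downside, as noted above, is that every consumer must remember to filter, or punctuation leaks into the index.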

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>                 Key: LUCENE-2167
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch,
LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch,
LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> It would be really nice for StandardTokenizer to adhere straight to the standard as much
as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer,
as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay
with that EuropeanTokenizer, and it could be used by the european analyzers.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

