lucene-dev mailing list archives

From "Itamar Syn-Hershko (JIRA)" <>
Subject [jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
Date Tue, 09 Dec 2014 22:00:15 GMT


Itamar Syn-Hershko commented on LUCENE-6103:

OK, so I did some homework. In Swedish, the colon is used to shorten words by connecting the beginning and the end of the word, so "c:a" is in fact "cirka", which means "approximately". I guess it can be thought of like English acronyms, only apparently it's far less commonly used in Swedish (my source says "very very seldom used; old style and not used in modern Swedish at all").

Not only is it hardly used, it's apparently only valid in three-character combinations (c:a
but not c:ka).

Also, the effect it has right now is quite severe: two words with a colon between them and
no space will be output as one token, even though that is 100% not the Swedish usage, since
each of the words has more than 2 characters.
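
To make the current behavior concrete, here's a minimal sketch that just prints what StandardTokenizer emits. It's written against the trunk/5.x API where the tokenizer has a no-arg constructor; on 4.x you'd pass the Version and Reader to the constructor instead:

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ColonTokenDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("c:a word:word"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Both "c:a" and "word:word" come out as single tokens.
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}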

I'm not aiming to change the Unicode standard, that's way beyond my limited powers, but:

1. Given the above, does it really make sense to use this tokenizer in all the language-specific
analyzers as well?

I'd think that for language-specific analyzers we'd want tokenizers aimed at that language,
with limited support for others. So, in this case, the colon would always be considered a
tokenizing character.

2. Can we change the JFlex definition to at least limit the effects of this, e.g. only accept
the colon as MidLetter if the overall token length is 3, so that c:a is a valid token and
word:word is not? (A rough sketch of the intended end result is below.)
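
To illustrate what I mean by option 2, here's the same rule applied as a post-processing step instead of in the grammar: a hypothetical TokenFilter (the class name and the length check are mine, just for illustration) that keeps three-character tokens like c:a intact and splits anything longer on the colon. Offsets are left untouched to keep it short:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Hypothetical workaround, for illustration only: keep 3-character
// contractions like "c:a", split everything else on ':'.
public final class ColonSplitFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final Deque<String> pending = new ArrayDeque<>();

  public ColonSplitFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit the remaining parts of a token we split earlier.
      termAtt.setEmpty().append(pending.poll());
      posIncAtt.setPositionIncrement(1);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (term.indexOf(':') >= 0 && term.length() != 3) {
      // Not a c:a style contraction -- treat the colon as a separator.
      String[] parts = term.split(":");
      termAtt.setEmpty().append(parts[0]);
      for (int i = 1; i < parts.length; i++) {
        if (!parts[i].isEmpty()) {
          pending.add(parts[i]);
        }
      }
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

Doing it in the JFlex grammar itself would obviously be cleaner; this just shows the behavior I'm after.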

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>                 Key: LUCENE-6103
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
> StandardTokenizer (and as a result most default analyzers) will not tokenize word:word
and will preserve it as one token. This can easily be seen using Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic behind it.
> If not, I'll be happy to join in the effort of fixing this.
