lucene-dev mailing list archives

From "Itamar Syn-Hershko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word
Date Tue, 09 Dec 2014 22:00:15 GMT

[ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240133#comment-14240133 ]

Itamar Syn-Hershko commented on LUCENE-6103:
--------------------------------------------

OK, so I did some homework. In Swedish, the colon is a way to shorten the written form of a word,
so "C:a" is in fact "cirka", which means "approximately". I guess it can be thought of like English
acronyms, only apparently it's way less commonly used in Swedish (my source says "very very
seldomly used; old style and not used in modern Swedish at all").

Not only is it hardly used, it's apparently only legal in 3-letter combinations (c:a
but not c:ka).

On top of that, the effects it has are quite severe at the moment: two words with a colon between
them and no space will be output as one token, even when it's 100% not the Swedish abbreviation
case, since each word has more than 2 characters.
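
To make it concrete, here's a minimal repro against the 4.9 API (the version this issue is filed
against; the class name is mine). It prints the whole input back as a single token:

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Minimal repro: StandardTokenizer keeps "word:word" as one token
// because ':' is MidLetter in the UAX#29 word break rules.
public class ColonTokenDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tok =
        new StandardTokenizer(Version.LUCENE_4_9, new StringReader("word:word"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term);  // prints: word:word
    }
    tok.end();
    tok.close();
  }
}
{code}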

I'm not aiming to change the Unicode standard, that's way beyond my limited powers, but:

1. Given the above, does it really make sense to use this tokenizer in all language-specific
analyzers as well? e.g. https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L105

I'd think for language-specific analyzers we'd want tokenizers aimed at that language, with
limited support for others. So, in this case, the colon would always be considered a tokenizing
char.
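
As a sketch of what I mean (the class name is hypothetical, not a proposal for the actual API): a
language-specific analyzer could force the colon to always split by mapping it to a space in a
char filter, before the tokenizer ever sees it:

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: rewrites ':' to ' ' before tokenization, so the
// UAX#29 MidLetter rule for the colon never kicks in.
public final class ColonSplittingAnalyzer extends Analyzer {
  private static final NormalizeCharMap COLON_TO_SPACE;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add(":", " ");
    COLON_TO_SPACE = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new MappingCharFilter(COLON_TO_SPACE, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_4_9, reader);
    return new TokenStreamComponents(source);
  }
}
{code}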

2. Can we change the JFlex definition to at least limit the effects of this, e.g. only support
the colon as MidLetter if the overall token length == 3, so that c:a is a valid token and
word:word is not?
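
Until (or instead of) a grammar change, that rule could also be approximated after the fact with
a token filter. A rough sketch, with a made-up class name, that leaves the offsets of the split
parts uncorrected for brevity:

{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter approximating the rule above: keep 3-char tokens
// like "c:a" intact, split any longer token at its colons.
public final class ColonSplitFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Deque<String> pending = new ArrayDeque<>();

  public ColonSplitFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit the remaining parts of a previously split token.
      termAtt.setEmpty().append(pending.poll());
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (term.indexOf(':') < 0 || term.length() == 3) {
      return true;  // no colon, or a legitimate abbreviation like "c:a"
    }
    String[] parts = term.split(":");
    termAtt.setEmpty().append(parts[0]);
    for (int i = 1; i < parts.length; i++) {
      pending.add(parts[i]);
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}
{code}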

> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and as a result most default analyzers) will not tokenize word:word
> and will preserve it as one token. This can easily be seen using Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic behind it.
> If not, I'll be happy to join in the effort of fixing this.




