lucene-dev mailing list archives

From "Robin Stocker (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8181) WordDelimiterTokenFilter does not generate all tokens appropriately
Date Mon, 12 Mar 2018 01:39:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394729#comment-16394729
] 

Robin Stocker commented on LUCENE-8181:
---------------------------------------

I think this is the intended behavior of the filter at the moment. Having said that, it would
be really useful for analyzing source code to have an option to generate those additional
tokens.

Another interesting example to consider:
{code:java}
FooBar.baz_qux
{code}
In this case, being able to produce the following tokens would be _really_ useful:

{{foo}}, {{bar}}, {{baz}}, {{qux}}, {{foobar}}, {{baz_qux}}, {{foobar.baz_qux}}
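To make that concrete, here is a rough Python sketch of the two-level split that would produce those tokens (my own approximation, not Lucene's implementation; treating {{.}} as the top-level delimiter and case changes/{{_}} as the inner split is an assumption for this example): emit the inner parts, each multi-part segment, and the preserved original.

```python
import re

def desired_tokens(text):
    """Sketch of the proposed behavior, not WordDelimiterFilter itself.
    Assumption: '.' is the top-level delimiter; case changes, '_' and
    digit boundaries split each segment into parts."""
    tokens = []
    for seg in text.split('.'):
        # inner split: case changes, underscores, digit runs
        parts = re.findall(r'[A-Z][a-z]*|[a-z]+|[0-9]+', seg)
        tokens.extend(p.lower() for p in parts)
        if len(parts) > 1:
            tokens.append(seg.lower())  # whole segment, e.g. "foobar", "baz_qux"
    tokens.append(text.lower())         # preserve_original
    return tokens

print(desired_tokens("FooBar.baz_qux"))
# ['foo', 'bar', 'foobar', 'baz', 'qux', 'baz_qux', 'foobar.baz_qux']
```

The interesting part is the per-segment catenation ({{foobar}}, {{baz_qux}}), which the current catenate options don't produce, since they join maximal runs of parts across delimiters.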

> WordDelimiterTokenFilter does not generate all tokens appropriately
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8181
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2.1
>         Environment: *Steps to reproduce*:
> *1. Create index*
> {code:json}
> PUT testindex
> {
>   "settings": {
>     "index": {
>       "number_of_shards": 2,
>       "number_of_replicas": 2
>     },
>     "analysis": {
>       "filter": {
>         "wordDelimiter": {
>           "type": "word_delimiter",
>           "generate_word_parts": "true",
>           "generate_number_parts": "true",
>           "catenate_words": "false",
>           "catenate_numbers": "false",
>           "catenate_all": "false",
>           "split_on_case_change": "true",
>           "preserve_original": "true",
>           "split_on_numerics": "true",
>           "stem_english_possessive": "true"
>         }
>       },
>       "analyzer": {
>         "content_analyzer": {
>           "type": "custom",
>           "tokenizer": "whitespace",
>           "filter": [
>             "asciifolding",
>             "wordDelimiter",
>             "lowercase"
>           ]
>         }
>       }
>     }
>   }
> }
> {code}
> *2. Analyze Text*
> {code:json}
> POST testindex/_analyze
> {
>   "analyzer": "content_analyzer",
>   "text": "ElasticSearch.TestProject"
> }
> {code}
> *Following tokens are generated:*
> {code:json}
> [
>   { "token": "elasticsearch.testproject", "start_offset": 0, "end_offset": 25, "type": "word", "position": 0 },
>   { "token": "elastic",                   "start_offset": 0, "end_offset": 7,  "type": "word", "position": 0 },
>   { "token": "search",                    "start_offset": 7, "end_offset": 13, "type": "word", "position": 1 },
>   { "token": "test",                      "start_offset": 14, "end_offset": 18, "type": "word", "position": 2 },
>   { "token": "project",                   "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 }
> ]
> {code}
> *Expected Result:*
> Besides the above tokens, *elasticsearch* and *testproject* should also be generated, so that the phrase query "elasticsearch testproject" also matches.
> *Another example:*
> The text *"Super-Duper-0-AutoCoder"* with the above analyzer generates the token *autocoder*, while
the text *"Super-Duper-AutoCoder"* does NOT generate the token *autocoder*.
>            Reporter: Atul
>            Priority: Major
>
> When using the word delimiter token filter, some expected tokens are not generated.
> When I analyze the text "ElasticSearch.TestProject",
> I expect the tokens elastic, search, test, project, elasticsearch, testproject, and elasticsearch.testproject
to be generated, since split_on_case_change and split_on_numerics are on, preserve_original is true, and a whitespace
tokenizer is used.
> But actually I only see the following tokens:
> elasticsearch.testproject, elastic, search, test, project
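For reference, the reported output can be approximated with a small Python sketch (my own approximation of the configured chain, not Lucene's code): with only word parts plus the preserved original, and no catenation options on, no per-segment tokens are emitted.

```python
import re

def current_tokens(text):
    """Approximation of the configured chain: whitespace tokenizer,
    word_delimiter with generate_word_parts + preserve_original +
    split_on_case_change, then lowercase. All catenate options off."""
    # word parts: split on case changes, non-alphanumerics, digit runs
    parts = re.findall(r'[A-Z][a-z]*|[a-z]+|[0-9]+', text)
    return [text.lower()] + [p.lower() for p in parts]

print(current_tokens("ElasticSearch.TestProject"))
# ['elasticsearch.testproject', 'elastic', 'search', 'test', 'project']
```

Note that, as far as I can tell, turning on catenate_words would not help either: it joins maximal runs of parts across delimiters (the documented "hot-spot-sensor" -> "hotspotsensor" case), so it would yield "elasticsearchtestproject" rather than the per-segment tokens "elasticsearch" and "testproject".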



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

