lucene-dev mailing list archives

From "Benji Smith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6348) Incorrect results from UAX_URL_EMAIL tokenizer
Date Fri, 06 Mar 2015 23:02:38 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351092#comment-14351092 ]

Benji Smith commented on LUCENE-6348:
-------------------------------------

Gotcha. Thanks for your help!

> Incorrect results from UAX_URL_EMAIL tokenizer
> ----------------------------------------------
>
>                 Key: LUCENE-6348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6348
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>         Environment: Elasticsearch 1.3.4 on Ubuntu 14.04.2
>            Reporter: Benji Smith
>            Assignee: Steve Rowe
>
> I'm using an analyzer based on the UAX_URL_EMAIL tokenizer, with a maximum token length
> of 64 characters. I expect the analyzer to discard any text in the URL beyond those 64
> characters, but the actual results yield ordinary terms from the tail end of the URL.
> For example, 
> {code}
> curl -XGET http://localhost:9200/my_index/_analyze?analyzer=uax_url_email_analyzer -d "hey, check out http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional for some light reading."
> {code}
> The results look like this:
> {code}
> {
>     "tokens": [
>         {
>             "token": "hey",
>             "start_offset": 0,
>             "end_offset": 3,
>             "type": "<ALPHANUM>",
>             "position": 1
>         },
>         {
>             "token": "check",
>             "start_offset": 5,
>             "end_offset": 10,
>             "type": "<ALPHANUM>",
>             "position": 2
>         },
>         {
>             "token": "out",
>             "start_offset": 11,
>             "end_offset": 14,
>             "type": "<ALPHANUM>",
>             "position": 3
>         },
>         {
>             "token": "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d",
>             "start_offset": 15,
>             "end_offset": 79,
>             "type": "<URL>",
>             "position": 4
>         },
>         {
>             "token": "eath",
>             "start_offset": 79,
>             "end_offset": 83,
>             "type": "<ALPHANUM>",
>             "position": 5
>         },
>         {
>             "token": "is",
>             "start_offset": 84,
>             "end_offset": 86,
>             "type": "<ALPHANUM>",
>             "position": 6
>         },
>         {
>             "token": "optional",
>             "start_offset": 87,
>             "end_offset": 95,
>             "type": "<ALPHANUM>",
>             "position": 7
>         },
>         {
>             "token": "for",
>             "start_offset": 96,
>             "end_offset": 99,
>             "type": "<ALPHANUM>",
>             "position": 8
>         },
>         {
>             "token": "some",
>             "start_offset": 100,
>             "end_offset": 104,
>             "type": "<ALPHANUM>",
>             "position": 9
>         },
>         {
>             "token": "light",
>             "start_offset": 105,
>             "end_offset": 110,
>             "type": "<ALPHANUM>",
>             "position": 10
>         },
>         {
>             "token": "reading",
>             "start_offset": 111,
>             "end_offset": 118,
>             "type": "<ALPHANUM>",
>             "position": 11
>         }
>     ]
> }
> {code}
> The term from the extracted URL is correct, and correctly truncated at 64 characters.
> But as you can see, the analysis pipeline also creates three spurious terms
> [ "eath", "is", "optional" ] which come from the discarded portion of the URL.

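For anyone reading this in the archive, here is a minimal sketch of the behaviour described above. It is a toy model, not Lucene's or Elasticsearch's implementation: the two regular expressions, the tokenize() function and its resume_at_cut flag are all made up for illustration. It only assumes that an over-long match is cut at the maximum token length and that scanning then resumes either at the cut point (which reproduces the reported tokens) or at the end of the full match (which is what the report expected).

```python
import re

# Hypothetical, simplified patterns; the real tokenizer grammar is far richer.
URL = re.compile(r"https?://\S+")
WORD = re.compile(r"[A-Za-z0-9]+")

TEXT = ("hey, check out "
        "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional "
        "for some light reading.")


def tokenize(text, max_len=64, resume_at_cut=True):
    """Return (token, start_offset, end_offset, type) tuples.

    resume_at_cut=True  : after truncating, keep scanning from the cut point.
    resume_at_cut=False : after truncating, skip the rest of the long match.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the URL rule first, then the plain-word rule.
        for kind, pattern in (("<URL>", URL), ("<ALPHANUM>", WORD)):
            m = pattern.match(text, pos)
            if m:
                break
        else:
            pos += 1  # no rule matched here (space or punctuation): skip one char
            continue
        end = min(m.end(), pos + max_len)  # enforce the maximum token length
        tokens.append((text[pos:end], pos, end, kind))
        pos = end if resume_at_cut else m.end()
    return tokens


if __name__ == "__main__":
    for flag in (True, False):
        print("resume_at_cut =", flag)
        for token in tokenize(TEXT, resume_at_cut=flag):
            print("   ", token)
```

With resume_at_cut=True the model emits the truncated URL term at offsets 15-79 followed by "eath", "is" and "optional", matching the output in the report. With resume_at_cut=False the tail of the URL is dropped and the next term after the URL is "for".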


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


