Date: Fri, 6 Mar 2015 23:02:38 +0000 (UTC)
From: "Benji Smith (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-6348) Incorrect results from UAX_URL_EMAIL tokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351092#comment-14351092 ]

Benji Smith commented on LUCENE-6348:
-------------------------------------

Gotcha. Thanks for your help!

> Incorrect results from UAX_URL_EMAIL tokenizer
> ----------------------------------------------
>
>                 Key: LUCENE-6348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6348
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>         Environment: Elasticsearch 1.3.4 on Ubuntu 14.04.2
>            Reporter: Benji Smith
>            Assignee: Steve Rowe
>
> I'm using an analyzer based on the UAX_URL_EMAIL, with a maximum token length of 64 characters.
> I expect the analyzer to discard any text in the URL beyond those 64 characters, but the actual results yield ordinary terms from the tail-end of the URL.
> For example,
> {code}
> curl -XGET 'http://localhost:9200/my_index/_analyze?analyzer=uax_url_email_analyzer' -d "hey, check out http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional for some light reading."
> {code}
> The results look like this:
> {code}
> {
>   "tokens": [
>     {
>       "token": "hey",
>       "start_offset": 0,
>       "end_offset": 3,
>       "type": "",
>       "position": 1
>     },
>     {
>       "token": "check",
>       "start_offset": 5,
>       "end_offset": 10,
>       "type": "",
>       "position": 2
>     },
>     {
>       "token": "out",
>       "start_offset": 11,
>       "end_offset": 14,
>       "type": "",
>       "position": 3
>     },
>     {
>       "token": "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d",
>       "start_offset": 15,
>       "end_offset": 79,
>       "type": "",
>       "position": 4
>     },
>     {
>       "token": "eath",
>       "start_offset": 79,
>       "end_offset": 83,
>       "type": "",
>       "position": 5
>     },
>     {
>       "token": "is",
>       "start_offset": 84,
>       "end_offset": 86,
>       "type": "",
>       "position": 6
>     },
>     {
>       "token": "optional",
>       "start_offset": 87,
>       "end_offset": 95,
>       "type": "",
>       "position": 7
>     },
>     {
>       "token": "for",
>       "start_offset": 96,
>       "end_offset": 99,
>       "type": "",
>       "position": 8
>     },
>     {
>       "token": "some",
>       "start_offset": 100,
>       "end_offset": 104,
>       "type": "",
>       "position": 9
>     },
>     {
>       "token": "light",
>       "start_offset": 105,
>       "end_offset": 110,
>       "type": "",
>       "position": 10
>     },
>     {
>       "token": "reading",
>       "start_offset": 111,
>       "end_offset": 118,
>       "type": "",
>       "position": 11
>     }
>   ]
> }
> {code}
> The term from the extracted URL is correct, and correctly truncated at 64 characters. But as you can see, the analysis pipeline also creates three spurious terms [ "eath", "is", "optional" ] which come from the discarded portion of the URL.
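
The offsets reported above can be reproduced with a small toy model. This is only a sketch in Python, not Lucene's or Elasticsearch's actual implementation: the URL and WORD patterns, the tokenize function, and its resume_at_cut flag are all hypothetical names made up for illustration. It assumes the scanner cuts an over-long match at the maximum token length and then resumes scanning at the cut point, which would produce the extra terms; setting resume_at_cut=False models the behaviour the reporter expected, where the rest of the URL is skipped.

```python
import re

MAX_TOKEN_LENGTH = 64  # the limit mentioned in the report

# Simplified stand-ins for the tokenizer rules (assumptions, not Lucene's grammar).
URL = re.compile(r"https?://\S+")
WORD = re.compile(r"[A-Za-z0-9_]+")

TEXT = ("hey, check out http://edge.org/conversation/"
        "yuval_noah_harari-daniel_kahneman-death-is-optional"
        " for some light reading.")


def tokenize(text, max_len=MAX_TOKEN_LENGTH, resume_at_cut=True):
    """Yield (token, start_offset, end_offset) tuples.

    A match longer than max_len is cut to max_len characters. With
    resume_at_cut=True scanning continues right after the cut, so the
    tail of the match is tokenized again; with False the tail is skipped.
    """
    pos = 0
    while pos < len(text):
        match = URL.match(text, pos) or WORD.match(text, pos)
        if match is None:
            pos += 1  # skip whitespace and punctuation
            continue
        end = min(match.end(), pos + max_len)
        yield text[pos:end], pos, end
        pos = end if resume_at_cut else match.end()


if __name__ == "__main__":
    for token, start, end in tokenize(TEXT):
        print(start, end, token)
```

Under these assumptions, the truncated URL ends at offset 79 and the next match starts at that same offset, which is why the first spurious term is "eath" rather than "death".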