lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?
Date Wed, 23 Jul 2014 19:29:47 GMT
Hi Milind,

On Jul 23, 2014, at 1:49 PM, Milind <milindr@gmail.com> wrote:

> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
> expected.  Is this a bug in the analyzer or is this working as designed?
> 
> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
>    input=bwl-esl2.gbr.hp.com
>    output=[bwl-esl2.gbr.hp.com]

This is the correct tokenization of a valid domain name, with token type <URL>: the hyphen
('-') is an allowed character in DNS names.  From RFC 1035, "Domain Names - Implementation and
Specification" <http://www.ietf.org/rfc/rfc1035.txt>:

    <domain> ::= <subdomain> | " "
    <subdomain> ::= <label> | <subdomain> "." <label>
    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
    <let-dig-hyp> ::= <let-dig> | "-"
    <let-dig> ::= <letter> | <digit>

    <letter> ::= any one of the 52 alphabetic characters A through Z in
    upper case and a through z in lower case

    <digit> ::= any one of the ten digits 0 through 9

    Note that while upper and lower case letters are allowed in domain
    names, no significance is attached to the case.  That is, two names with
    the same spelling but different case are to be treated as if identical.

    The labels must follow the rules for ARPANET host names.  They must
    start with a letter, end with a letter or digit, and have as interior
    characters only letters, digits, and hyphen.  There are also some
    restrictions on the length.  Labels must be 63 characters or less.
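
As a rough illustration of the RFC rules quoted above (this is neither RFC text nor Lucene code), the <label> production and the 63-character limit can be sketched as a regular expression in Java.  The class and method names are hypothetical, and the sketch ignores the " " (root) alternative of <domain>:

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks names against the RFC 1035 <label> rule quoted above.
final class Rfc1035Label {
    // <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
    // i.e. starts with a letter, ends with a letter or digit,
    // and has only letters, digits and hyphens in between.
    private static final Pattern LABEL =
        Pattern.compile("[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?");

    static boolean isValidLabel(String label) {
        return label.length() <= 63 && LABEL.matcher(label).matches();
    }

    // <subdomain> ::= <label> | <subdomain> "." <label>
    static boolean isValidDomain(String name) {
        // The -1 limit keeps empty trailing labels, so "a.b." is rejected.
        for (String label : name.split("\\.", -1)) {
            if (!isValidLabel(label)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidDomain("bwl-esl2.gbr.hp.com")); // true
        System.out.println(isValidDomain("bwl-.gbr.hp.com"));     // false: label ends with '-'
    }
}
```

Under these rules the hyphen in "bwl-esl2" is an ordinary interior character of a single label, which is why the name is kept as one token.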

From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:

    DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
    URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}  
    […]
    {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
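
To experiment with the DomainLabel macro outside of JFlex, it can be transcribed into java.util.regex more or less directly.  This is only a sketch under stated assumptions: the class and method names are made up, the {ASCIITLD} check and the trailing-context condition (/ [^-\w]) are left out, and it is not how the tokenizer itself is implemented:

```java
import java.util.regex.Pattern;

// Hypothetical transcription of the DomainLabel macro from the grammar above.
final class DomainLabelSketch {
    // DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    private static final String DOMAIN_LABEL =
        "[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?";

    // {DomainLabel} ("." {DomainLabel})*  -- simplification: no {ASCIITLD} check.
    private static final Pattern DOTTED_LABELS =
        Pattern.compile(DOMAIN_LABEL + "(?:\\." + DOMAIN_LABEL + ")*");

    static boolean matchesDottedLabels(String host) {
        return DOTTED_LABELS.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDottedLabels("bwl-esl2.gbr.hp.com")); // true
        System.out.println(matchesDottedLabels("esl2-.gbr"));           // false
    }
}
```

Note that, as written in the grammar, DomainLabel also accepts a leading digit, which is slightly more permissive than the RFC 1035 <label> rule.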

>    input=esl2.gbr
>    output=[esl2.gb][r]

This is a bug that was fixed in Lucene 4.7; see <https://issues.apache.org/jira/browse/LUCENE-5391>.

Steve

