lucene-java-user mailing list archives

From Milind <mili...@gmail.com>
Subject Re: Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?
Date Wed, 23 Jul 2014 23:43:43 GMT
>>    input=esl2.gbr
>>    output=[esl2.gb][r]
>>
>> This is a bug, which was fixed in Lucene 4.7 - see <
https://issues.apache.org/jira/browse/LUCENE-5391>

BTW, I changed the POM dependency to 4.7.1, but I'm still seeing the same
output.  I can't go beyond 4.7 because, from 4.8 onwards, Lucene requires
Java 7 and I'm still on Java 6.  Hopefully, this will be a non-issue with
PerFieldAnalyzerWrapper.  But I just wanted to point that out.


On Wed, Jul 23, 2014 at 7:34 PM, Milind <milindr@gmail.com> wrote:

> Brilliant.  Thanks!
>
>
> On Wed, Jul 23, 2014 at 6:12 PM, Steve Rowe <sarowe@gmail.com> wrote:
>
>> See PerFieldAnalyzerWrapper, which is itself an Analyzer: <
>> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html
>> >
>>
>> Steve
>>
>> On Jul 23, 2014, at 6:00 PM, Milind <milindr@gmail.com> wrote:
>>
>> > Thanks Steve, that helped.  I had forgotten about the URL part of the
>> > Analyzer since I was using it for the email field.  I need to see if
>> > it's possible to use different analyzers for different fields.  If so,
>> > then I'll use the UAX29URLEmailAnalyzer only for the email field and
>> > StandardAnalyzer for everything else.  I'm not sure that would work,
>> > though, since I'm using the MultiFieldQueryParser, which takes a
>> > single Analyzer.
>> >
>> >
>> > On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <sarowe@gmail.com> wrote:
>> >
>> >> Hi Milind,
>> >>
>> >> On Jul 23, 2014, at 1:49 PM, Milind <milindr@gmail.com> wrote:
>> >>
>> >>> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
>> >>> expected.  Is this a bug in the analyzer or is this working as
>> >>> designed?
>> >>>
>> >>> If I use the UAX29URLEmailAnalyzer, it tokenizes the following
>> >>> strings as
>> >>>   input=bwl-esl2.gbr.hp.com
>> >>>   output=[bwl-esl2.gbr.hp.com]
>> >>
>> >> This is the correct tokenization of a valid domain name with token type
>> >> <URL>: the hyphen ('-') is an allowed character in DNS names.  From RFC
>> >> 1035, "Domain Names - Implementation and Specification" <
>> >> http://www.ietf.org/rfc/rfc1035.txt>:
>> >>
>> >>    <domain> ::= <subdomain> | " "
>> >>    <subdomain> ::= <label> | <subdomain> "." <label>
>> >>    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
>> >>    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
>> >>    <let-dig-hyp> ::= <let-dig> | "-"
>> >>    <let-dig> ::= <letter> | <digit>
>> >>
>> >>    <letter> ::= any one of the 52 alphabetic characters A through Z in
>> >>    upper case and a through z in lower case
>> >>
>> >>    <digit> ::= any one of the ten digits 0 through 9
>> >>
>> >>    Note that while upper and lower case letters are allowed in domain
>> >>    names, no significance is attached to the case.  That is, two names
>> >>    with the same spelling but different case are to be treated as if
>> >>    identical.
>> >>
>> >>    The labels must follow the rules for ARPANET host names.  They must
>> >>    start with a letter, end with a letter or digit, and have as
>> >>    interior characters only letters, digits, and hyphen.  There are
>> >>    also some restrictions on the length.  Labels must be 63 characters
>> >>    or less.
>> >>
>> >> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
>> >>
>> >>    DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
>> >>    DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
>> >>    URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
>> >>    […]
>> >>    {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
>> >>
>> >>>   input=esl2.gbr
>> >>>   output=[esl2.gb][r]
>> >>
>> >> This is a bug, which was fixed in Lucene 4.7 - see <
>> >> https://issues.apache.org/jira/browse/LUCENE-5391>
>> >>
>> >> Steve
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards
>> > Milind
>>
>>
>
>
> --
> Regards
> Milind
>



-- 
Regards
Milind
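
For readers following the thread, the DomainLabel rule quoted above from UAX29URLEmailTokenizerImpl.jflex can be tried out with plain java.util.regex. This is only a rough sketch, not Lucene code: the class name DomainLabelCheck and its two methods are made up for illustration, the pattern is a hand translation of the quoted JFlex macro, and the 63-character limit is taken from the RFC 1035 text rather than from the grammar. It does not model {ASCIITLD}, IP addresses, or the trailing-context check, so it says nothing about the esl2.gbr bug itself. It avoids Java 7 features so that it should also compile on Java 6.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class DomainLabelCheck {
    // Hand translation of the quoted JFlex macro:
    //   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    private static final Pattern DOMAIN_LABEL =
        Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    // True if s is a single label: starts and ends with a letter or digit,
    // with only letters, digits, and hyphens in between. The length check
    // is an assumption added from the RFC 1035 text (63 characters or less).
    static boolean isDomainLabel(String s) {
        return s.length() <= 63 && DOMAIN_LABEL.matcher(s).matches();
    }

    // True if every dot-separated part of host is a valid label.
    // The limit of -1 keeps empty trailing parts, so "hp.com." is rejected.
    static boolean allLabelsValid(String host) {
        String[] labels = host.split("\\.", -1);
        for (int i = 0; i < labels.length; i++) {
            if (!isDomainLabel(labels[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "bwl-esl2.gbr.hp.com", "esl2.gbr", "-esl2.gbr", "esl2-.gbr" };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + allLabelsValid(samples[i]));
        }
    }
}
```

Under this rule, the hyphen in bwl-esl2 is accepted because it sits between letters and digits, which matches the explanation given in the thread for why bwl-esl2.gbr.hp.com comes out as a single token.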
