lucene-java-user mailing list archives

From Milind <mili...@gmail.com>
Subject Re: Incorrect tokenizing in the UAX29URLEmailAnalyzer analyzer?
Date Wed, 23 Jul 2014 23:34:36 GMT
Brilliant.  Thanks!


On Wed, Jul 23, 2014 at 6:12 PM, Steve Rowe <sarowe@gmail.com> wrote:

> See PerFieldAnalyzerWrapper, which is itself an Analyzer:
> <http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html>
>
> Steve
>
> On Jul 23, 2014, at 6:00 PM, Milind <milindr@gmail.com> wrote:
>
> > Thanks Steve, that helped.  I had forgotten about the URL part of the
> > Analyzer since I was using it for the email field.  I need to see if it's
> > possible to use different analyzers for different fields.  If so, then I'll
> > use the UAX29URLEmailAnalyzer only for the email field and use
> > StandardAnalyzer for everything else.  I'm not sure that would work,
> > though, since I'm using the MultiFieldQueryParser, which takes a
> > single Analyzer.
> >
> >
> > On Wed, Jul 23, 2014 at 3:29 PM, Steve Rowe <sarowe@gmail.com> wrote:
> >
> >> Hi Milind,
> >>
> >> On Jul 23, 2014, at 1:49 PM, Milind <milindr@gmail.com> wrote:
> >>
> >>> The UAX29URLEmailAnalyzer analyzer in Lucene 4.4 is not working as I
> >>> expected.  Is this a bug in the analyzer or is this working as designed?
> >>>
> >>> If I use the UAX29URLEmailAnalyzer, it tokenizes the following strings as
> >>>   input=bwl-esl2.gbr.hp.com
> >>>   output=[bwl-esl2.gbr.hp.com]
> >>
> >> This is the correct tokenization of a valid domain name with token type
> >> <URL>: the hyphen ('-') is an allowed character in DNS names.  From RFC
> >> 1035, "Domain Names - Implementation and Specification"
> >> <http://www.ietf.org/rfc/rfc1035.txt>:
> >>
> >>    <domain> ::= <subdomain> | " "
> >>    <subdomain> ::= <label> | <subdomain> "." <label>
> >>    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
> >>    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
> >>    <let-dig-hyp> ::= <let-dig> | "-"
> >>    <let-dig> ::= <letter> | <digit>
> >>
> >>    <letter> ::= any one of the 52 alphabetic characters A through Z in
> >>    upper case and a through z in lower case
> >>
> >>    <digit> ::= any one of the ten digits 0 through 9
> >>
> >>    Note that while upper and lower case letters are allowed in domain
> >>    names, no significance is attached to the case.  That is, two names with
> >>    the same spelling but different case are to be treated as if identical.
> >>
> >>    The labels must follow the rules for ARPANET host names.  They must
> >>    start with a letter, end with a letter or digit, and have as interior
> >>    characters only letters, digits, and hyphen.  There are also some
> >>    restrictions on the length.  Labels must be 63 characters or less.
> >>
> >> From the Lucene grammar UAX29URLEmailTokenizerImpl.jflex:
> >>
> >>    DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
> >>    DomainNameStrict = {DomainLabel} ("." {DomainLabel})* {ASCIITLD}
> >>    URIhostStrict = ("[" {IPv6Address} "]") | {IPv4Address} | {DomainNameStrict}
> >>    […]
> >>    {URIhostStrict} / [^-\w] { yybegin(YYINITIAL); return URL_TYPE; }
> >>
> >>>   input=esl2.gbr
> >>>   output=[esl2.gb][r]
> >>
> >> This is a bug, which was fixed in Lucene 4.7 - see
> >> <https://issues.apache.org/jira/browse/LUCENE-5391>
> >>
> >> Steve
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
> > --
> > Regards
> > Milind
>
>


-- 
Regards
Milind
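
For anyone who wants to experiment with the DomainLabel rule quoted above without building the tokenizer, here is a minimal sketch that translates that one JFlex macro into java.util.regex. The class name DomainLabelCheck and the helper allLabelsValid are made up for illustration and are not part of Lucene. The sketch only checks the per-label character rules; it ignores the {ASCIITLD} check, the 63-character limit, and the trailing-context part of the tokenizer rule, so it is not a substitute for the real grammar.

```java
import java.util.regex.Pattern;

public class DomainLabelCheck {
    // Java translation of the quoted JFlex macro (whitespace in JFlex is not significant):
    //   DomainLabel = [A-Za-z0-9] ([-A-Za-z0-9]* [A-Za-z0-9])?
    static final Pattern DOMAIN_LABEL =
        Pattern.compile("[A-Za-z0-9](?:[-A-Za-z0-9]*[A-Za-z0-9])?");

    /**
     * Returns true if every dot-separated label of host matches DomainLabel.
     * Hypothetical helper: it does not validate the TLD or the label length.
     */
    static boolean allLabelsValid(String host) {
        // The -1 limit keeps empty labels (e.g. from "a..b") so they are rejected.
        for (String label : host.split("\\.", -1)) {
            if (!DOMAIN_LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hyphen in the interior of a label is allowed.
        System.out.println(allLabelsValid("bwl-esl2.gbr.hp.com"));
        // A label may not start or end with a hyphen.
        System.out.println(allLabelsValid("-esl2.gbr.hp.com"));
        System.out.println(allLabelsValid("esl2-.gbr"));
    }
}
```

Note that the quoted macro lets a label start with a digit, whereas the RFC 1035 grammar quoted above requires a leading letter, so this check follows the Lucene grammar rather than the RFC text.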
