Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0268810945 for ; Fri, 14 Jun 2013 02:38:27 +0000 (UTC) Received: (qmail 81932 invoked by uid 500); 14 Jun 2013 02:38:25 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 81846 invoked by uid 500); 14 Jun 2013 02:38:25 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 81838 invoked by uid 99); 14 Jun 2013 02:38:25 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Jun 2013 02:38:25 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of spragues@gmail.com designates 209.85.223.181 as permitted sender) Received: from [209.85.223.181] (HELO mail-ie0-f181.google.com) (209.85.223.181) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Jun 2013 02:38:19 +0000 Received: by mail-ie0-f181.google.com with SMTP id x12so212104ief.12 for ; Thu, 13 Jun 2013 19:37:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=aZV8CgOoqTk7rIeSyv10JVcVklOL7rEA7wrbDSZUJBY=; b=rwU6h8nmAi8GzYu+S9WabR6Z/C3ubzWSkYOWhCa29WzFjHaeowNS2Q3rOUxS0FVJzR M0cEBusiHln/x8vzykFq1mvsKduWCscDoKx6nmgj8es0c5Ki15YPHaoaDHLUc0b9TzXo GjFwmKWySvO1JMkwlbCk7IkPHQMfTmGUP0l7j74VGIXemP72hjvvEzJ+qMFoob0eAKBh nVdG9gusaTmJyWHNFI9y9nVnlzP9RuvQVmefKBwVt5z4nScYStFRHkpciUsS8YHRLCyH rOybT4wJVUpO7JMQcipHuXWFtTQT6WRhOAC/KtMH1boLq5PtBP1AW+k+xJmrnrxCsaKa 702w== X-Received: by 10.50.109.134 with SMTP id hs6mr132095igb.35.1371177478099; Thu, 13 Jun 2013 19:37:58 -0700 (PDT) MIME-Version: 1.0 Received: by 10.50.115.8 with HTTP; Thu, 13 Jun 2013 19:37:38 -0700 (PDT) In-Reply-To: References: From: Stephen Sprague Date: Thu, 13 Jun 2013 19:37:38 -0700 Message-ID: Subject: Re: Issues with Hive 'Like" and multiple '%' wildcard in a pattern To: user@hive.apache.org Content-Type: multipart/alternative; boundary=089e0122e6a28198fd04df14221d X-Virus-Checked: Checked by ClamAV on apache.org --089e0122e6a28198fd04df14221d Content-Type: text/plain; charset=ISO-8859-1 yeah. good one. its a bug alright. where the trouble starts is when a '?' follows a wildcard char. '%' in this case. Looks like hive slavishly mutates a '%' to a '.*' and since you got a '?' after that you get another valid regex looking like '.*?' ('?" being the minimally match directive) so it eats your '?' as meta-data and then things just go weird from there. in your case #2: '%?' (aka '.*?') matches '?&' in your string 'icid=main' matches 'icid=main' '%dl' matches 'moiz%dl' '%' matches nothing but thats ok so you get a true. * * i don't know of a workaround using the like operator. i've tried various number of blackslashes in front of the question mark to no avail to make it a literal question mark and not a meta question mark. maybe someone else on this list more knowledgeable on the internals can prescribe a solution. in the meantime use rlike and a real regexp. use double backslashes to escape meta characters. consider this regexp in hive: 'your_string' rlike '.*\\?icid=main.*dl.*' as equivalent to: 'your_string' like '%?icid=main%dl%' let us know how it turns out. Thanks again for finding and reporting this. thanks, Stephen. PS Note that these are only _my_ findings and as such i can certainly be wrong so please feel free to verify or disprove. I'm using hive 0.80 which is kinda old. On Thu, Jun 13, 2013 at 12:30 PM, Sunderlin, Mark < mark.sunderlin@teamaol.com> wrote: > Hive users: It took me a while to create a standalone use case, but here > we go: > > In Oracle, this SQL returns false for both cases: > > SELECT CASE > WHEN '?&icid=mainmoiz%dl' LIKE ('%?icid=main%') THEN 'T' > ELSE 'F' > END > AS case1, > CASE > WHEN '?&icid=mainmoiz%dl' LIKE ('%?icid=main%dl%') THEN 'T' > ELSE 'F' > END > AS case2 > FROM DUAL > case1 case2 > F F > > But in Hive 0.80, I get one False and one True: > > SELECT CASE > WHEN '?&icid=mainmoiz%dl' LIKE ('%?icid=main%') THEN 'T' > ELSE 'F' > END > AS case1, > CASE > WHEN '?&icid=mainmoiz%dl' LIKE ('%?icid=main%dl%') THEN 'T' > ELSE 'F' > END > AS case2 > FROM DUAL - A dummy single row, single column table we installed > case1 case2 > F T > > > > --- > Mark E. Sunderlin > Solutions Architect |AOL NETWORKS BDM > P: 703-265-6935 |C: 540-327-6222 | AIM: MESunderlin > 22000 AOL Way | Dulles, VA | 20166 > > From: Stephen Sprague [mailto:spragues@gmail.com] > Sent: Wednesday, June 12, 2013 2:22 PM > To: user@hive.apache.org > Subject: Re: Issues with Hive 'Like" and multiple '%' wildcard in a pattern > > Hi Mark, > i'm running v0.80 too and multiple '%'s work as expected for me. so. we're > gonna need a see a definitive test case from you. > show your full string and show where the like clause fails to match. > thanks, > Stephen. > PS here's my test: > > hisql>select city from junk; > +------------------+ > | city | > +------------------+ > | West Haven | > | West Haven | > | West Haven | > | West Haven | > | West Jordan | > | West Chester | > | Westminster | > | West Chester | > | West Chester | > | Westlake | > | West Chester | > | West Chester | > | West Chester | > | Weston | > | West Covina | > | West Chester | > | West Chester | > | West Hollywood | > | West Des Moines | > | West Covina | > | Westland | > | West Valley City | > | West Valley City | > | West Valley City | > | West Valley City | > | Westminster | > | West Memphis | > | West Memphis | > +------------------+ > 28 affected > > hisql>select city from junk where city like '%West%v%'; > +-------------+ > | city | > +-------------+ > | West Haven | > | West Haven | > | West Haven | > | West Haven | > | West Covina | > | West Covina | > +-------------+ > 6 affected > > hisql>select city from junk where city like '%West%v%a%'; > +-------------+ > | city | > +-------------+ > | West Covina | > | West Covina | > +-------------+ > 2 affected > > > On Wed, Jun 12, 2013 at 10:26 AM, Sunderlin, Mark < > mark.sunderlin@teamaol.com> wrote: > We are using Hive 0.80. > > --- > Mark E. Sunderlin > Solutions Architect |AOL NETWORKS BDM > P: 703-265-6935 |C: 540-327-6222 | AIM: MESunderlin > 22000 AOL Way | Dulles, VA | 20166 > > From: Stephen Sprague [mailto:spragues@gmail.com] > Sent: Wednesday, June 12, 2013 1:00 PM > To: user@hive.apache.org > Subject: Re: Issues with Hive 'Like" and multiple '%' wildcard in a pattern > > that seems pretty hard to believe. what version of hive are you using? > > On Wed, Jun 12, 2013 at 6:27 AM, Sunderlin, Mark < > mark.sunderlin@teamaol.com> wrote: > This seems to work just fine in other SQLs, but doesn't seem work in hive. > > I need to have several wild card characters in my 'like' clause as follows. > > In other SQLs, I want: .... where page_url_query like '%?icid=main%dl%' > > But in Hive that doesn't match. I have several work arounds. I can write > it as: > > where (page_url_query like '%?icid=main%') and (page_url_query like > '%dl%') > > or, I can just use something like where > length(regexp_extract(page__url_query, '.*\?cdid-main.*dl.*',0) <> 0 > > But more important is two or more '%' wild cards in a Hive like expression > a known issue? > > --- > Mark E. Sunderlin > Solutions Architect |AOL NETWORKS BDM > P: 703-265-6935 |C: 540-327-6222 | AIM: MESunderlin > 22000 AOL Way, Dulles, VA 20166 > > > --089e0122e6a28198fd04df14221d Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
yeah. good one.=A0=A0 i= ts a bug alright.=A0=A0=A0=A0 where the trouble starts is when a '?'= ;=A0 follows a wildcard char.=A0 '%' in this case.=A0 Looks like hi= ve slavishly mutates a '%' to a '.*' and since you got a &#= 39;?' after that you get another valid regex looking like '.*?'= ('?" being the minimally match directive)=A0 so it eats your '= ;?' as meta-data and then things just go weird from there.

=A0in your case #2:

=A0=A0=A0=A0=A0=A0 '%?' (aka '.*?')=A0 matches = '?&' in your string
=A0=A0=A0=A0=A0=A0 'icid=3Dmain' match= es 'icid=3Dmain'
=A0=A0= =A0=A0=A0=A0 '%dl' matches 'moiz%dl'
<= span style=3D"font-family:courier new,monospace">=A0=A0=A0=A0=A0=A0 '%&= #39; matches nothing but thats ok

so you get a true.


i don't know of = a workaround using the like operator. =A0i've tried various number of b= lackslashes in front of the question mark to no avail to make it a literal = question mark and not a meta question mark. maybe someone else on this list= more knowledgeable on the internals can prescribe a solution.

in the meantime use rlike and a real regexp. =A0use double backslashes = to escape meta characters.

consider this regexp in hive:=A0 'your_string' rlike=A0 '.*\\?= icid=3Dmain.*dl.*'=A0=A0

as equivalent to: = 9;your_string' like '%?icid=3Dmain%dl%'

let us kn= ow how it turns out.

Thanks again for finding and report= ing this.

thanks,
Stephen.
PS=A0 Note that these are only= _my_ findings and=A0 as such i can certainly be wrong so please feel free = to verify or disprove. I'm using hive 0.80 which is kinda old.




=


On Thu, Jun 13, 2013 at 12:30 PM, Sunderlin, Mark <mark.sunderlin@teamaol.com> wrote:
Hive users: It took me a while to create a s= tandalone use case, but here we go:

In Oracle, this SQL returns false for both cases:

SELECT CASE
=A0 =A0 =A0 =A0 =A0 WHEN '?&icid=3Dmainmoiz%dl' LIKE ('%?ic= id=3Dmain%') THEN 'T'
=A0 =A0 =A0 =A0 =A0 ELSE 'F'
=A0 =A0 =A0 =A0END
=A0 =A0 =A0 =A0 =A0 AS case1,
=A0 =A0 =A0 =A0CASE
=A0 =A0 =A0 =A0 =A0 WHEN '?&icid=3Dmainmoiz%dl' LIKE ('%?ic= id=3Dmain%dl%') THEN 'T'
=A0 =A0 =A0 =A0 =A0 ELSE 'F'
=A0 =A0 =A0 =A0END
=A0 =A0 =A0 =A0 =A0 AS case2
=A0 FROM DUAL
case1 case2
F =A0 =A0 =A0 F

But in Hive 0.80, I get one False and one True:

SELECT CASE
=A0 =A0 =A0 =A0 =A0 WHEN '?&icid=3Dmainmoiz%dl' LIKE ('%?ic= id=3Dmain%') THEN 'T'
=A0 =A0 =A0 =A0 =A0 ELSE 'F'
=A0 =A0 =A0 =A0END
=A0 =A0 =A0 =A0 =A0 AS case1,
=A0 =A0 =A0 =A0CASE
=A0 =A0 =A0 =A0 =A0 WHEN '?&icid=3Dmainmoiz%dl' LIKE ('%?ic= id=3Dmain%dl%') THEN 'T'
=A0 =A0 =A0 =A0 =A0 ELSE 'F'
=A0 =A0 =A0 =A0END
=A0 =A0 =A0 =A0 =A0 AS case2
FROM DUAL - A dummy single row, single column table we installed
case1 =A0 case2
F =A0 =A0 =A0 T



---
Mark E. Sunderlin
Solutions Architect |AOL NETWORKS BDM
P: 703-265-6935 =A0= =A0 |C: 540-327-6222 | AIM: MESunderlin
Sent: Wednesday, June 12, 2013 2:22 PM
To: user@h= ive.apache.org
Subject: Re: Issues with Hive 'Like" and multiple '%' wild= card in a pattern

Hi Mark,
i'm running v0.80 too and multiple '%'s work as expected for me= . so. we're gonna need a see a definitive test case from you.=A0
show your full string and show where the like clause fails to match.
thanks,
Stephen.
PS here's my test:

hisql>select city from junk;
+------------------+
| city=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 |
+------------------+
| West Haven=A0=A0=A0=A0=A0=A0 |
| West Haven=A0=A0=A0=A0=A0=A0 |
| West Haven=A0=A0=A0=A0=A0=A0 |
| West Haven=A0=A0=A0=A0=A0=A0 |
| West Jordan=A0=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| Westminster=A0=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| Westlake=A0=A0=A0=A0=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| Weston=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 |
| West Covina=A0=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| West Chester=A0=A0=A0=A0 |
| West Hollywood=A0=A0 |
| West Des Moines=A0 |
| West Covina=A0=A0=A0=A0=A0 |
| Westland=A0=A0=A0=A0=A0=A0=A0=A0 |
| West Valley City |
| West Valley City |
| West Valley City |
| West Valley City |
| Westminster=A0=A0=A0=A0=A0 |
| West Memphis=A0=A0=A0=A0 |
| West Memphis=A0=A0=A0=A0 |
+------------------+
28 affected

hisql>select city from junk where city like '%West%v%';
+-------------+
| city=A0=A0=A0=A0=A0=A0=A0 |
+-------------+
| West Haven=A0 |
| West Haven=A0 |
| West Haven=A0 |
| West Haven=A0 |
| West Covina |
| West Covina |
+-------------+
6 affected

hisql>select city from junk where city like '%West%v%a%';
+-------------+
| city=A0=A0=A0=A0=A0=A0=A0 |
+-------------+
| West Covina |
| West Covina |
+-------------+
2 affected


On Wed, Jun 12, 2013 at 10:26 AM, Sunderlin, Mark <mark.sunderlin@teamaol.com> wrote:
We are using Hive 0.80.
=A0
---
Mark E. Sunderlin
Solutions Architect |AO= L NETWORKS BDM
P: 703-265-6935=A0= =A0=A0=A0 |C: 540-327-6= 222 | AIM: MESunderlin
22000 AOL Way | Dulles, VA | 20166
=A0
From: Stephen Sprague [mailto:sprague= s@gmail.com]
Sent: Wednesday, June 12, 2013 1:00 PM
To: user@hive.apache.org
Subject: Re: Issues with Hive 'Like" and multiple '%' wild= card in a pattern
=A0
that seems pretty hard to believe. what version of hive are you using?
=A0
On Wed, Jun 12, 2013 at 6:27 AM, Sunderlin, Mark <mark.sunderlin@teamaol.com> wrote:
This seems to work just fine in other SQLs, but doesn't seem work in hi= ve.

I need to have several wild card characters in my 'like' clause as = follows.

In other SQLs, I want: =A0.... =A0where page_url_query like '%?icid=3Dm= ain%dl%'

But in Hive that doesn't match. =A0I have several work arounds. =A0I ca= n write it as:

=A0where (page_url_query like '%?icid=3Dmain%') =A0and (page_url_qu= ery like '%dl%')

or, I can just use something like where length(regexp_extract(page__url_que= ry, '.*\?cdid-main.*dl.*',0) <> 0

But more important is two or more '%' wild cards in a Hive like exp= ression a known issue?

---
Mark E. Sunderlin
Solutions Architect =A0 |AOL NETWORKS BDM
P: 703-265-6935 =A0= =A0 =A0 |C: 540-327-62= 22 | AIM: MESunderlin
22000 AOL Way, =A0Dulles, VA =A020166
=A0


--089e0122e6a28198fd04df14221d--