Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B148210383 for ; Fri, 30 Aug 2013 23:34:51 +0000 (UTC) Received: (qmail 20143 invoked by uid 500); 30 Aug 2013 23:34:50 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 20058 invoked by uid 500); 30 Aug 2013 23:34:49 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 20050 invoked by uid 99); 30 Aug 2013 23:34:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 Aug 2013 23:34:49 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of qiaoresearcher@gmail.com designates 209.85.192.171 as permitted sender) Received: from [209.85.192.171] (HELO mail-pd0-f171.google.com) (209.85.192.171) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 Aug 2013 23:34:44 +0000 Received: by mail-pd0-f171.google.com with SMTP id g10so2433481pdj.30 for ; Fri, 30 Aug 2013 16:34:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=5hcPl5WPsRc4MZ0mxmJZbwjFaD/wl/RtKU62zWmot8k=; b=taSDkGKl20RU69mfJtP9YusR+dSFI1Ge+Y/RzqN0QRT8B4ayTOxeaMYHf7NJU1kg0z s23/OeOoCleusLU90QYBs5Y3m19Ng590yEhWK8OwViIRsHNauBoawJz2Pc3XlAAJPEqh YHCHBga3Pf/8Vhn0B/o2cbMZNF9RBHcPcZrUl/x3mxBSclkTj9yCrVgNaSkGN2FLu0LA kmyTTPLHaA3axoMFEYQ40rMd//7oy3BJ7MHE3NxHgD7Y4RJOw5GIGKP9qYvEqm1YoG43 nW5rMf/P7Xmvu8KTGF6TIhtmL0C3RMcCz8J6qNE2UzRqw1EmacHlvthr+st2bww6I/o8 6fQw== MIME-Version: 1.0 X-Received: by 10.66.124.168 with SMTP id mj8mr78396pab.183.1377905664403; Fri, 30 Aug 2013 16:34:24 -0700 (PDT) Received: by 10.68.254.225 with HTTP; Fri, 30 Aug 2013 16:34:24 -0700 (PDT) Date: Fri, 30 Aug 2013 18:34:24 -0500 Message-ID: Subject: how to write hive query to solve this problem? From: qiaoresearcher To: user@hive.apache.org Content-Type: multipart/alternative; boundary=001a11344ca8a9311004e532a9c1 X-Virus-Checked: Checked by ClamAV on apache.org --001a11344ca8a9311004e532a9c1 Content-Type: text/plain; charset=ISO-8859-1 I have three tables: Table 1: record when and who visited gas station or not, this contains all the users of interest, name all the users as a set A date | user name | visited gas station? 2013-09-01 tom yes 2013-09-02 tom yes 2013-09-01 hanks yes 2013-09-03 tomy yes .... ... ... Table 2: record when and who visited Bestbuy, the user in set A appear here, but not all users of A will appear in this table, also table 2 has users does not belong to set A date | user name | visited Bestbuy? 2013-09-01 tom yes 2013-09-02 jacob yes 2013-09-01 hanks yes 2013-09-03 michael yes .... ... ... Table 3: record when and who arrives one of three destinations: CA, NY and DC, the users in table 3 has similar situation as users in table 2 regarding set A. date | user name | visited Bestbuy or not 2013-09-01 tom CA 2013-09-02 tom NY 2013-09-01 hanks DC 2013-09-03 tomy CA .... ... ... Now we want to know, within a 90 days period, what are the following numbers: (1) for any given day, for the users in table 1, how many of them has a path like: visited Gas station first, after that date, user went to Bestbuy, finally arrive CA (2) for any given day, for the users in table 1, how many of them has a path like: visited Gas station first, then did not visit Bestbuy, finally arrive CA (3) for any given day, for the users in table 1, how many of them has a path like: visited Gas station first, then went Bestbuy, finally arrive NY (4) for any given day, for the users in table 1, how many of them has a path like: visited Gas station first, then did not visit Bestbuy, finally arrive NY (5) for any given day, for the users in table 1, how many of them has a path like: visited Gas station first, then went Bestbuy, finally arrive DC (6) for any given day, for the users in table 1, how many of them has a path like: visited Gas station first, then did not visit Bestbuy, finally arrive DC It is possible in a given day, a user may visit Bestbuy many times for table 2, this will be simply regarded as a status 'visited Bestbuy'. One user will be counted once in a given day. >From the day of user visiting gas station, within next 90 days, this user has to arrive one of three final destinations in table 3, and arrives only one station. no multiple arrived stations. For example, a user tom may visit gas station on 9/1, then went to Bestbuy on 9/5, finally went to CA on 9/30, then this user tom can be counted as 1 on 9/1 for path (1). a user hanks may visit gas station on 9/1, then went to Bestbuy on 9/5 and on 9/8, finally went to CA on 9/30, then this user tom count as 1 on 9/1 for path (1). a user ruby may visit gas station on 9/1, and tables 2 does not have records to show ruby visit Bestbuy till 11/30, and ruby arrives DC in table 3 before 11/30, then ruby contribute 1 for path (6) for the day 9/1. How to write Hive query to get those numbers for the six paths? a sample output will be : 9/1: 100 for path (1), 90 for path (2), ...., etc 9/2: 60 for path (1), 80 for path (2), ...., etc 9/3: ................ Any suggestions or suggested reference/readings will be deeply appreciated. Thanks! Qiao --001a11344ca8a9311004e532a9c1 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

I have three tables:

Tabl= e 1: record when and who visited gas station or not, this contains all the = users of interest, name all the users as a set A
date =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 =A0 =A0user name =A0 =A0|= =A0 visited gas station?=A0
2013-09-01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tom =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 yes
2013-09= -02 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tom =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 yes
2013-09-01 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 hanks =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0yes
2013-09-03 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tomy =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0yes
.... =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 ... =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0...=

Table 2: record when and who visited Bestbuy, =A0= the user in set A appear here, but not all users of A will appear in this = table, also table 2 has users does not belong to set A=A0
date =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 =A0 =A0user= name =A0 =A0| =A0 visited Bestbuy? =A0=A0
2013-09-01 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tom =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 yes
2013-09-02 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 jacob =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 yes
2013-09-01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 hanks =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0yes
2013-09-03 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 michael =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 yes
.... =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ... =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0...

Table 3: record when and who arrives one of three= destinations: CA, NY and DC, the users in table 3 has similar situation as= users in table 2 regarding set A. =A0
date =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 =A0 =A0user name =A0 =A0| =A0 visited Bes= tbuy or not =A0
2013-09-01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tom =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 CA
2013-09-= 02 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tom =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 NY
2013-09-01 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 hanks =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0DC
2013-09-03 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tomy =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0CA
.... =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 ... =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0...=
=A0=A0=A0
Now we want to know, =A0within a 90 da= ys period, what are the following numbers:=A0
(1) for any given day, for the users in table 1,=A0how many of t= hem has a path like: =A0 visited Gas station first, after that date, user w= ent to Bestbuy, finally arrive CA
(2) for any given day, for= the users in table 1,=A0how many of them has a path like: =A0 visited Gas = station first, then did not visit Bestbuy, finally arrive CA

(3) for any given day, for the us= ers in table 1,=A0how many of them has a path like: =A0 visited Gas station= first, then went Bestbuy, finally=A0arrive=A0NY
(4) for any= given day, for the users in table 1,=A0how many of them has a path like: = =A0 visited Gas station first, then did not visit Bestbuy, finally=A0arrive= =A0NY

(5) for any given day, for the users in table 1,= =A0how many of them has a path like: =A0 visited Gas station first, then we= nt Bestbuy, finally=A0arrive=A0DC
(6) for any given day, for= the users in table 1,=A0how many of them has a path like: =A0 visited Gas = station first, then did not visit Bestbuy, finally=A0arrive=A0DC

It is possible in a given day, a user may visit Bestbuy= many times for table 2, this will be simply regarded as a status 'visi= ted Bestbuy'. One user will be counted once in a given day.=A0
>From the day of user visiting gas station, within next 90 days, this user h= as to arrive one of three final destinations in table 3, and arrives only o= ne station. no multiple arrived stations. =A0=A0

F= or example,=A0
a user tom may visit gas station on 9/1, then went to Bestbuy on 9/5, = finally went to CA on 9/30, then this user tom can be counted as 1 on 9/1 f= or path (1).
a user hanks may visit gas station on 9/1, then= went to Bestbuy on 9/5 and on 9/8, finally went to CA on 9/30, then this u= ser tom count as 1 on 9/1 for path (1).
a user ruby may visit gas station on 9/1, and tables 2 does not have r= ecords to show ruby visit Bestbuy till 11/30, and ruby arrives DC in table = 3 before 11/30, then ruby contribute 1 for path (6) for the day 9/1.=A0

How to write Hive query to get those numbers for the si= x paths?=A0
a sample output will be : =A0
9/1: 100 for = path (1), 90 for path (2), ...., etc =A0
9/2: 60 for path (1= ), 80 for path (2), ...., etc =A0
9/3: ................

Any suggestions o= r suggested reference/readings will be deeply appreciated.=A0
Thanks!
Qiao=A0

--001a11344ca8a9311004e532a9c1--