From: Jon Palmer
To: user@hive.apache.org
Subject: What's the right data storage/representation?
Date: Tue, 15 May 2012 12:11:53 +0000

All,

I'm a relative newcomer to Hadoop/Hive. We have a very standard setup of multiple webapp servers backed by a MySQL database. We are evaluating Hive as a high-scale solution for our relatively sophisticated reporting and analytics needs. However, it's not clear what the best practices are around storing and representing the data our application generates. It's probably best explained with an example:

We imagine a Hive deployment that imports Apache logs and MySQL data from the application db (probably via Sqoop).
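To make that concrete, the nightly Sqoop import would presumably land a current-state table in Hive roughly like this (table and column names are just placeholders, not our real schema):

  -- snapshot of the application's users table, re-imported each night
  CREATE TABLE users (
    user_id BIGINT,
    status  STRING,   -- 'Basic' or 'Premium'
    zip     STRING    -- current zip code only; no history
  );

and the daily analysis would run against that snapshot plus the Apache logs.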
We would run our analysis daily and output the results somewhere (flat files in S3 or another MySQL reporting database). We have users that have a) a status (Basic or Premium) and b) a location (a zip code). We'd like to be able to ask questions like "How many premium users did we have within ten miles of zip 02110 on Jan 3rd 2012?" Computing these numbers for all dates, across all zip codes and a number of radii, on a very large set of users seems like a pretty good use of Hadoop/Hive.

However, users can move location and change status. The application database only really cares about a user's current location and status, not the history of those fields. This presents a challenge for the analytics process. If we run the analysis every day we will naturally pick up the changes in status and location. However, if we were to try to recompute our entire analysis for all dates, we would get different results for users that moved location or changed status. The Apache logs are likely not of much use, as they are unlikely to contain member ids from which to deduce the requests that resulted in a user's change of status or location.

How is this type of problem typically solved with Hive?

I can see a few potential solutions:

1. Don't solve it. Accept that you have some artifacts in your reporting data that cannot be recovered from the source data.

2. Create status and location history tables in the application db and use those during the analytics process.

3. Log the status and location change 'events' to some other log file and use those logs in the Hive analysis (see the sketch in the P.S. below).

Are there any 'best practices' around these kinds of problems, and in particular suggestions for the simplest implementation of the extra logging and analysis required by option 3?

Thanks

Jon
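P.S. To make option 3 concrete, here is roughly what I'm picturing: the application appends one tab-separated line per status or location change, we point an external Hive table at those files, and the analysis reconstructs each user's state as of a given date. The table name, columns and S3 path below are hypothetical placeholders, and the query only sketches the point-in-time reconstruction (the ten-mile-radius part would still need a join against a zip-to-lat/long lookup table):

  -- external table over the change-event logs (schema and path are made up)
  CREATE EXTERNAL TABLE user_events (
    user_id    BIGINT,
    event_time STRING,   -- e.g. '2012-01-03 14:22:05'
    status     STRING,   -- status after the change: 'Basic' or 'Premium'
    zip        STRING    -- zip code after the change
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://our-bucket/user_events/';

  -- each user's most recent status/location on or before the date of interest
  SELECT e.user_id, e.status, e.zip
  FROM user_events e
  JOIN (
    SELECT user_id, MAX(event_time) AS max_time
    FROM user_events
    WHERE event_time <= '2012-01-03 23:59:59'
    GROUP BY user_id
  ) latest
  ON e.user_id = latest.user_id
  AND e.event_time = latest.max_time;

Is that the sort of thing people do in practice, or is there a more standard pattern?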
