From: Jon Palmer
To: user@hive.apache.org
Subject: What's the right data storage/representation?
Date: Tue, 15 May 2012 12:11:53 +0000

All,

I'm a relative newcomer to Hadoop/Hive. We have a very standard setup of multiple webapp servers backed by a MySQL database. We are evaluating Hive as a high-scale solution for our relatively sophisticated reporting and analytics needs. However, it's not clear what the best practices are around storing and representing the data our application generates. It's probably best explained with an example:

We imagine a Hive deployment that imports Apache logs and MySQL data from the application db (probably via Sqoop).
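To make that concrete, the nightly Sqoop import would presumably land a current-state table in Hive roughly like this (table and column names are just placeholders, not our real schema):

  -- snapshot of the application's users table, re-imported each night
  CREATE TABLE users (
    user_id BIGINT,
    status  STRING,   -- 'Basic' or 'Premium'
    zip     STRING    -- current zip code only; no history
  );

and the daily analysis would run against that snapshot plus the Apache logs.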
We would run our analysis daily and output the results somewhere (flat files in S3 or another MySQL reporting database). We have users that have a) a status (Basic or Premium) and b) a location (a zip code). We'd like to be able to ask questions like "How many premium users did we have within ten miles of zip 02110 on Jan 3rd 2012?" Computing these numbers for all dates, across all zip codes and a number of radii, on a very large set of users seems like a pretty good use of Hadoop/Hive.

However, users can move location and change status. The application database only really cares about a user's current location and status, not the history of those fields. This presents a challenge for the analytics process. If we run the analysis every day we will naturally pick up the changes in status and location. However, if we were to try to recompute our entire analysis for all dates, we would get different results for users that moved location or changed status. The Apache logs are likely not of much use, as they are unlikely to contain member ids from which to deduce the requests that resulted in a user's change of status or location.

How is this type of problem typically solved with Hive?

I can see a few potential solutions:

1. Don't solve it. Accept that you have some artifacts in your reporting data that cannot be recovered from the source data.

2. Create status and location history tables in the application db and use those during the analytics process.

3. Log the status and location change 'events' to some other log file and use those logs in the Hive analysis (see the sketch in the P.S. below).

Are there any 'best practices' around these kinds of problems, and in particular suggestions for the simplest implementation of the extra logging and analysis required by option 3?

Thanks

Jon
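P.S. To make option 3 concrete, here is roughly what I'm picturing: the application appends one tab-separated line per status or location change, we point an external Hive table at those files, and the analysis reconstructs each user's state as of a given date. The table name, columns and S3 path below are hypothetical placeholders, and the query only sketches the point-in-time reconstruction (the ten-mile-radius part would still need a join against a zip-to-lat/long lookup table):

  -- external table over the change-event logs (schema and path are made up)
  CREATE EXTERNAL TABLE user_events (
    user_id    BIGINT,
    event_time STRING,   -- e.g. '2012-01-03 14:22:05'
    status     STRING,   -- status after the change: 'Basic' or 'Premium'
    zip        STRING    -- zip code after the change
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://our-bucket/user_events/';

  -- each user's most recent status/location on or before the date of interest
  SELECT e.user_id, e.status, e.zip
  FROM user_events e
  JOIN (
    SELECT user_id, MAX(event_time) AS max_time
    FROM user_events
    WHERE event_time <= '2012-01-03 23:59:59'
    GROUP BY user_id
  ) latest
  ON e.user_id = latest.user_id
  AND e.event_time = latest.max_time;

Is that the sort of thing people do in practice, or is there a more standard pattern?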
