From: Michael Segel <michael_segel@hotmail.com>
Date: Thu, 13 Jun 2013 17:05:56 -0500
To: user@hadoop.apache.org
Subject: Re: Aggregating data nested into JSON documents

I think you have a misconception of HBase.

You don't need to actually have mutable data for it to be effective.
The key is that you need access to specific records and to work with a very small subset of the data, not the complete data set.
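To make that concrete, the kind of access HBase is built for is a keyed read of a single row rather than a scan of everything. Here is a rough, untested sketch against the old HTable Java client; the table name "aggregates", column family "d" and qualifier "count" are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Rough sketch of a keyed point read; table and column names are hypothetical.
public class PointLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "aggregates");
    try {
      // Fetch one pre-aggregated row by key instead of touching the whole data set.
      Get get = new Get(Bytes.toBytes(args[0]));   // e.g. "group#subgroup#job"
      Result result = table.get(get);
      byte[] count = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"));
      System.out.println(count == null ? "no row" : Bytes.toString(count));
    } finally {
      table.close();
    }
  }
}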


On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

Hi Mike,

Yes, I also have thought about HBase or Cassandra, but my data is pretty much a snapshot; it does not require updates. Most of my aggregations will also need to be computed only once and won't change over time, with the exception of some aggregations based on the last N days of data. Should I still consider HBase? I think it will probably be good for the aggregated data.

I have no idea what sequence files are, but I will take a look. My raw data is stored in the cloud, not in my Hadoop cluster.

I'll keep looking at Pig with ElephantBird.
Thanks,

-Jorge 





On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
Hi..

Have you thought about HBase?

I would suggest that if you're using Hive or Pig, you look at taking these files and putting the JSON records into a sequence file,
or a set of sequence files. (Then look at HBase to help index them.) 200KB is small.

That would be the same for either Pig or Hive.
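Packing them could be as simple as something along these lines. This is a rough, untested sketch: it assumes the .json.gz files have already been pulled down to a local directory, it uses the old SequenceFile.createWriter() signature plus commons-io, and the class name and paths are made up:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch: pack one JSON document per record into a SequenceFile,
// keyed by the original file name.
public class PackJsonIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);                     // e.g. /data/docs.seq on HDFS
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      for (File f : new File(args[0]).listFiles()) {  // local directory of *.json.gz
        InputStream in = new GZIPInputStream(new FileInputStream(f));
        try {
          String json = IOUtils.toString(in, "UTF-8");   // one document per file
          writer.append(new Text(f.getName()), new Text(json));
        } finally {
          in.close();
        }
      }
    } finally {
      writer.close();
    }
  }
}

One record per document keeps you clear of the small-files problem, and both Pig and Hive can read sequence files.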

In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice. And yes, you get each record as a row, but you can always flatten them as needed.

Hive?
I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward Capriolo could give you a better answer.
Going from memory, I don't know of a good Hive SerDe that writes JSON, just ones that read it.

IMHO Pig/ElephantBird is the best so far, but then again I may be dated and biased.

I think you're on the right track, or at least the right train of thought.

HTH

-Mike


On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

Hello,
   I'm new to Hadoop.
   I have a large quantity of JSON documents with a structure similar to what is shown below.

   {
     g    : "some-group-identifier",
     sg   : "some-subgroup-identifier",
     j    : "some-job-identifier",
     page : 23,
     ...  // other fields omitted
     important-data : [
       {
         f1 : "abc",
         f2 : "a",
         f3 : "/",
         ...
       },
       ...
       {
         f1 : "xyz",
         f2 : "q",
         f3 : "/",
         ...
       },
     ],
     ...  // other fields omitted
     other-important-data : [
       {
         x1  : "ford",
         x2  : "green",
         x3  : 35,
         map : {
           "free-field"       : "value",
           "other-free-field" : "value2"
         }
       },
       ...
       {
         x1 : "vw",
         x2 : "red",
         x3 : 54,
         ...
       },
     ]
   }
 

Each file contains a single JSON document (gzip compressed; roughly 200KB of pretty-printed JSON text per document when uncompressed).

I am interested in analyzing only the "important-data" array and the "other-important-data" array.
My source data would be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the column "map" would be a complex column; all the others would be primitives.

( g, sg, j, page, f1, f2, f3 )

( g, sg, j, page, x1, x2, x3, map )

So, for each JSON document, I would like to "create" several rows, but I would like to avoid the intermediate step of persisting (and duplicating) the "flattened" data.

To avoid persisting the flattened data, I thought I had to write my own MapReduce job in Java, but I discovered that others have had the same problem of using JSON as the source, and there are somewhat "standard" solutions.
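(For the record, the kind of Java mapper I had in mind is roughly the untested sketch below. It assumes the documents arrive as (Text docId, Text json) records, e.g. from a sequence file, uses Jackson 2 for parsing, only handles the "important-data" array, and the class name is made up.)

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Rough sketch: emit one ( g, sg, j, page, f1, f2, f3 ) row per element of
// the "important-data" array, without persisting an intermediate flat copy.
public class FlattenImportantData
    extends Mapper<Text, Text, NullWritable, Text> {

  private final ObjectMapper jackson = new ObjectMapper();
  private final Text row = new Text();

  @Override
  protected void map(Text docId, Text json, Context context)
      throws IOException, InterruptedException {
    JsonNode doc = jackson.readTree(json.toString());
    String prefix = doc.path("g").asText() + "\t"
                  + doc.path("sg").asText() + "\t"
                  + doc.path("j").asText() + "\t"
                  + doc.path("page").asText();

    for (JsonNode e : doc.path("important-data")) {   // JsonNode is Iterable
      row.set(prefix + "\t"
          + e.path("f1").asText() + "\t"
          + e.path("f2").asText() + "\t"
          + e.path("f3").asText());
      context.write(NullWritable.get(), row);
    }
    // "other-important-data" could be written to a second output
    // (e.g. via MultipleOutputs) in the same pass.
  }
}

A map-only job (zero reducers) would be enough for this flattening step.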

From reading about the SerDe approach for Hive, I get the impression that each JSON document is transformed into a single "row" of the table, with some columns being an array or a map of other nested structures.
a) Is there a way to break each JSON document into several "rows" for a Hive external table?
b) It seems there are too many JSON SerDe libraries! Is any of them considered the de facto standard?

The Pig approach using Elephant Bird also seems promising. Does anybody have pointers to more user documentation for this project, or is browsing through the examples on GitHub my only source?

Thanks












