Subject: Re: Aggregating data nested into JSON documents
From: Tecno Brain <cerebrotecnologico@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 19 Jun 2013 14:36:36 -0700

I got Pig and Hive working on a single node and I am able to run some
scripts/queries over regular text files (access log files), with one record
per line.

Now I want to process some JSON files.

As mentioned before, it seems that ElephantBird would be a good solution for
reading JSON files.

I uploaded 5 files to HDFS. Each file contains only a single JSON document.
The documents are NOT on a single line; they contain pretty-printed JSON
spanning multiple lines.

I'm trying something simple: extracting two (primitive) attributes at the top
of the document:

    {
       a : "some value",
       ...
       b : 133,
       ...
    }

So, let's start with a LOAD of a single file (a single JSON document):

    REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
    doc  = LOAD '/json-pcr/pcr-000001.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader();
    flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
    DUMP flat;

Apparently the job runs without problems, but I get no output. The output I
get includes this message:

    Input(s):
    Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"

I was expecting to get

    ( "some value", 133 )

Any idea on what I am doing wrong?
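One guess I have (not verified) is that JsonLoader parses its input one line
at a time, expecting one JSON object per line, so a pretty-printed document
spread across many lines would yield zero parsable records. If that is the
case, collapsing each document onto a single line and rerunning the same
script should work; the "-oneline" file name below is just a hypothetical
copy prepared that way:

    -- same script as above, pointed at a hypothetical copy of the document
    -- collapsed onto a single line (one JSON object per line)
    REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
    doc  = LOAD '/json-pcr/pcr-000001-oneline.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader();
    flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
    DUMP flat;

Does that sound plausible, or should the loader handle multi-line documents
as they are?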
On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel wrote:

> I think you have a misconception of HBase.
>
> You don't need to actually have mutable data for it to be effective.
> The key is that you need access to specific records and to work on a very
> small subset of the data, not the complete data set.
>
>
> On Jun 13, 2013, at 11:59 AM, Tecno Brain wrote:
>
> Hi Mike,
>
> Yes, I have also thought about HBase or Cassandra, but my data is pretty
> much a snapshot; it does not require updates. Most of my aggregations will
> also need to be computed only once and won't change over time, with the
> exception of some aggregations that are based on the last N days of data.
> Should I still consider HBase? I think it will probably be good for the
> aggregated data.
>
> I have no idea what sequence files are, but I will take a look. My raw
> data is stored in the cloud, not in my Hadoop cluster.
>
> I'll keep looking at Pig with ElephantBird.
> Thanks,
>
> -Jorge
>
>
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel wrote:
>
>> Hi..
>>
>> Have you thought about HBase?
>>
>> I would suggest that if you're using Hive or Pig, you look at taking
>> these files and putting the JSON records into a sequence file.
>> Or a set of sequence files... (Then look at HBase to help index them...)
>> 200KB is small.
>>
>> That would be the same for either Pig or Hive.
>>
>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>> nice. And yes, you get each record as a row, but you can always flatten
>> them as needed.
>>
>> Hive?
>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>> Capriolo could give you a better answer.
>> Going from memory, I don't know that there is a good SerDe that would
>> write JSON, just read it. (Hive)
>>
>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>> and biased.
>>
>> I think you're on the right track, or at least the right train of thought.
>>
>> HTH
>>
>> -Mike
>>
>>
>> On Jun 12, 2013, at 7:57 PM, Tecno Brain wrote:
>>
>> Hello,
>> I'm new to Hadoop.
>> I have a large quantity of JSON documents with a structure similar to
>> what is shown below.
>>
>> {
>>   g  : "some-group-identifier",
>>   sg : "some-subgroup-identifier",
>>   j  : "some-job-identifier",
>>   page : 23,
>>   ... // other fields omitted
>>   important-data : [
>>     {
>>       f1 : "abc",
>>       f2 : "a",
>>       f3 : "/",
>>       ...
>>     },
>>     ...
>>     {
>>       f1 : "xyz",
>>       f2 : "q",
>>       f3 : "/",
>>       ...
>>     },
>>   ],
>>   ... // other fields omitted
>>   other-important-data : [
>>     {
>>       x1 : "ford",
>>       x2 : "green",
>>       x3 : 35,
>>       map : {
>>         "free-field" : "value",
>>         "other-free-field" : "value2"
>>       }
>>     },
>>     ...
>>     {
>>       x1 : "vw",
>>       x2 : "red",
>>       x3 : 54,
>>       ...
>>     },
>>   ]
>> }
>>
>>
>> Each file contains a single JSON document (gzip compressed, roughly
>> 200KB uncompressed of pretty-printed JSON text per document).
>>
>> I am interested in analyzing only the "important-data" array and the
>> "other-important-data" array.
>> My source data would ideally be easier to analyze if it looked like a
>> couple of tables with a fixed set of columns. Only the column "map" would
>> be a complex column; all others would be primitives.
>>
>> ( g, sg, j, page, f1, f2, f3 )
>>
>> ( g, sg, j, page, x1, x2, x3, map )
>>
>> So, for each JSON document, I would like to "create" several rows, but I
>> would like to avoid the intermediate step of persisting -and duplicating-
>> the "flattened" data.
>>
>> To avoid persisting the flattened data, I thought I had to write my own
>> map-reduce job in Java, but I discovered that others have had the same
>> problem of using JSON as the source and that there are somewhat "standard"
>> solutions.
>>
>> From reading about the SerDe approach for Hive, I get the impression that
>> each JSON document is transformed into a single "row" of the table, with
>> some columns being an array or a map of other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a
>> Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Is any of them
>> considered the de-facto standard?
>>
>> The Pig approach also seems promising, using Elephant Bird. Does anybody
>> have pointers to more user documentation on this project? Or is browsing
>> through the examples on GitHub my only source?
>>
>> Thanks
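For reference, coming back to the flattening question quoted above: what I
plan to try once the basic load works is roughly the sketch below. It assumes
that JsonLoader's '-nestedLoad' option exposes the nested arrays as bags that
FLATTEN can expand into one row per element; I have not verified the exact
aliases and casts, so they may need adjusting.

    REGISTER 'bunch of JAR files from elephant-bird and its dependencies';

    -- '-nestedLoad' (assumed) keeps nested arrays/maps instead of discarding them
    docs = LOAD '/json-pcr/*.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

    -- one row per element of the important-data array,
    -- carrying the top-level fields along
    rows = FOREACH docs GENERATE
             (chararray)$0#'g'    AS g,
             (chararray)$0#'sg'   AS sg,
             (chararray)$0#'j'    AS j,
             (long)$0#'page'      AS page,
             FLATTEN($0#'important-data') AS item;

    -- project the per-item fields into the target (g, sg, j, page, f1, f2, f3) shape
    flat = FOREACH rows GENERATE
             g, sg, j, page,
             (chararray)item#'f1' AS f1,
             (chararray)item#'f2' AS f2,
             (chararray)item#'f3' AS f3;

    DUMP flat;

The "other-important-data" array would get the same treatment with x1, x2, x3
and the "map" field.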