Subject: Re: Aggregating data nested into JSON documents
From: Tecno Brain <cerebrotecnologico@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 19 Jun 2013 14:47:23 -0700

I also tried:

doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
DUMP flat;

but I got no output either.

    Input(s):
    Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"

    Output(s):
    Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
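In case it is relevant: my current guess is that JsonLoader parses one JSON object per input line, which would explain the 0 records read from a pretty-printed file. This is an untested sketch of what I plan to try next, assuming that guess is right and that the '-nestedLoad' option is available in the elephant-bird build I have (the *-oneline.json file name is hypothetical; it would be the same document collapsed onto a single line):

-- hypothetical input: the same JSON document collapsed onto one line
doc  = LOAD '/json-pcr/pcr-000001-oneline.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
DUMP flat;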
On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
> I got Pig and Hive working on a single-node setup and I am able to run some
> scripts/queries over regular text files (access log files), with a record
> per line.
>
> Now, I want to process some JSON files.
>
> As mentioned before, it seems that ElephantBird would be a good solution
> to read JSON files.
>
> I uploaded 5 files to HDFS. Each file contains only a single JSON document.
> The documents are NOT in a single line, but rather contain pretty-printed
> JSON spanning multiple lines.
>
> I'm trying something simple, extracting two (primitive) attributes at the
> top of the document:
>
>   {
>     a : "some value",
>     ...
>     b : 133,
>     ...
>   }
>
> So, let's start with a LOAD of a single file (single JSON document):
>
> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
> doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader();
> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
> DUMP flat;
>
> Apparently the job runs without problems, but I get no output. The output
> I get includes this message:
>
>   Input(s):
>   Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>
> I was expecting to get
>
>   ( "some value", 133 )
>
> Any idea on what I am doing wrong?
>
>
> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> I think you have a misconception of HBase.
>>
>> You don't need to actually have mutable data for it to be effective.
>> The key is that you need access to specific records and work on a very
>> small subset of the data, not the complete data set.
>>
>>
>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Yes, I have also thought about HBase or Cassandra, but my data is pretty
>> much a snapshot; it does not require updates. Most of my aggregations will
>> also need to be computed only once and won't change over time, with the
>> exception of some aggregations that are based on the last N days of data.
>> Should I still consider HBase? I think it will probably be good for the
>> aggregated data.
>>
>> I have no idea what sequence files are, but I will take a look. My raw
>> data is stored in the cloud, not in my Hadoop cluster.
>>
>> I'll keep looking at Pig with ElephantBird.
>> Thanks,
>>
>> -Jorge
>>
>>
>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>> Hi..
>>>
>>> Have you thought about HBase?
>>>
>>> I would suggest that if you're using Hive or Pig, you look at taking
>>> these files and putting the JSON records into a sequence file, or a set
>>> of sequence files. (Then look at HBase to help index them...) 200KB is small.
>>>
>>> That would be the same for either Pig or Hive.
>>>
>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>> nice. And yes, you get each record as a row; however, you can always
>>> flatten them as needed.
>>>
>>> Hive?
>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>>> Capriolo could give you a better answer.
>>> Going from memory, I don't know that there is a good Hive SerDe that
>>> would write JSON, just read it.
>>>
>>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>>> and biased.
>>>
>>> I think you're on the right track, or at least train of thought.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>>
>>> Hello,
>>> I'm new to Hadoop.
>>> I have a large quantity of JSON documents with a structure similar to
>>> what is shown below.
>>>
>>>   {
>>>     g    : "some-group-identifier",
>>>     sg   : "some-subgroup-identifier",
>>>     j    : "some-job-identifier",
>>>     page : 23,
>>>     ... // other fields omitted
>>>     important-data : [
>>>       {
>>>         f1 : "abc",
>>>         f2 : "a",
>>>         f3 : "/",
>>>         ...
>>>       },
>>>       ...
>>>       {
>>>         f1 : "xyz",
>>>         f2 : "q",
>>>         f3 : "/",
>>>         ...
>>>       },
>>>     ],
>>>     ... // other fields omitted
>>>     other-important-data : [
>>>       {
>>>         x1  : "ford",
>>>         x2  : "green",
>>>         x3  : 35,
>>>         map : {
>>>           "free-field"       : "value",
>>>           "other-free-field" : "value2"
>>>         }
>>>       },
>>>       ...
>>>       {
>>>         x1 : "vw",
>>>         x2 : "red",
>>>         x3 : 54,
>>>         ...
>>>       },
>>>     ]
>>>   }
>>>
>>> Each file contains a single JSON document (gzip compressed, and roughly
>>> 200KB of pretty-printed JSON text per document when uncompressed).
>>>
>>> I am interested in analyzing only the "important-data" array and the
>>> "other-important-data" array.
>>> My source data would ideally be easier to analyze if it looked like a
>>> couple of tables with a fixed set of columns. Only the column "map" would
>>> be a complex column; all others would be primitives.
>>>
>>>   ( g, sg, j, page, f1, f2, f3 )
>>>
>>>   ( g, sg, j, page, x1, x2, x3, map )
>>>
>>> So, for each JSON document, I would like to "create" several rows, but I
>>> would like to avoid the intermediate step of persisting -and duplicating-
>>> the "flattened" data.
>>>
>>> In order to avoid persisting the flattened data, I thought I had to write
>>> my own map-reduce in Java code, but discovered that others have had the
>>> same problem of using JSON as the source and there are somewhat "standard"
>>> solutions.
>>>
>>> By reading about the SerDe approach for Hive, I get the impression that
>>> each JSON document is transformed into a single "row" of the table, with
>>> some columns being arrays or maps of other nested structures.
>>> a) Is there a way to break each JSON document into several "rows" for a
>>> Hive external table?
>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>> considered the de-facto standard?
>>>
>>> The Pig approach using ElephantBird also seems promising. Does anybody
>>> have pointers to more user documentation on this project? Or is browsing
>>> through the examples on GitHub my only source?
>>>
>>> Thanks
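P.S. For reference, this is roughly the flattening I am ultimately after, written as an untested Pig sketch. I am assuming here that JsonLoader('-nestedLoad') materializes the nested JSON arrays as bags of single-map tuples (I have not verified that yet); the field names are the ones from the sample document above.

-- untested sketch: one output row per element of each document's important-data array
docs = LOAD '/json-pcr/'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);

-- assumption: json#'important-data' can be cast to a bag of single-map tuples
items = FOREACH docs GENERATE
          (chararray)json#'g'  AS g,
          (chararray)json#'sg' AS sg,
          (chararray)json#'j'  AS j,
          (long)json#'page'    AS page,
          FLATTEN((bag{tuple(map[])})json#'important-data') AS item:map[];

flat = FOREACH items GENERATE
         g, sg, j, page,
         (chararray)item#'f1' AS f1,
         (chararray)item#'f2' AS f2,
         (chararray)item#'f3' AS f3;

DUMP flat;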