From: Michael Segel <michael_segel@hotmail.com>
Date: Thu, 13 Jun 2013 17:05:56 -0500
To: user@hadoop.apache.org
Subject: Re: Aggregating data nested into JSON documents

I think you have a misconception of HBase.

You don't need to actually have mutable data for it to be effective.
The key is that you need access to specific records and to work with a very small subset of the data, not the complete data set.
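To make that concrete, the kind of access HBase is built for is a keyed read of a single row rather than a scan of everything. Here is a rough, untested sketch against the old HTable Java client; the table name "aggregates", column family "d" and qualifier "count" are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Rough sketch of a keyed point read; table and column names are hypothetical.
public class PointLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "aggregates");
    try {
      // Fetch one pre-aggregated row by key instead of touching the whole data set.
      Get get = new Get(Bytes.toBytes(args[0]));   // e.g. "group#subgroup#job"
      Result result = table.get(get);
      byte[] count = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"));
      System.out.println(count == null ? "no row" : Bytes.toString(count));
    } finally {
      table.close();
    }
  }
}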


On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

Hi Mike,

Yes, I also have thought about HBase or Cassandra, but my data is pretty much a snapshot; it does not require updates. Most of my aggregations will also need to be computed only once and won't change over time, with the exception of some aggregations based on the last N days of data. Should I still consider HBase? I think it will probably be good for the aggregated data.

I have no idea what sequence files are, but I will take a look. My raw data is stored in the cloud, not in my Hadoop cluster.

I'll keep looking at Pig with ElephantBird.
Thanks,

-Jorge 





On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
Hi..

Have you thought about HBase?

I would suggest that if you're using Hive or Pig, you look at taking these files and putting the JSON records into a sequence file,
or a set of sequence files. (Then look at HBase to help index them.) 200KB is small.

That would be the same for either Pig or Hive.
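Packing them could be as simple as something along these lines. This is a rough, untested sketch: it assumes the .json.gz files have already been pulled down to a local directory, it uses the old SequenceFile.createWriter() signature plus commons-io, and the class name and paths are made up:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch: pack one JSON document per record into a SequenceFile,
// keyed by the original file name.
public class PackJsonIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);                     // e.g. /data/docs.seq on HDFS
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      for (File f : new File(args[0]).listFiles()) {  // local directory of *.json.gz
        InputStream in = new GZIPInputStream(new FileInputStream(f));
        try {
          String json = IOUtils.toString(in, "UTF-8");   // one document per file
          writer.append(new Text(f.getName()), new Text(json));
        } finally {
          in.close();
        }
      }
    } finally {
      writer.close();
    }
  }
}

One record per document keeps you clear of the small-files problem, and both Pig and Hive can read sequence files.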

In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice. And yes, you get each record as a row, but you can always flatten them as needed.

Hive?
I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward Capriolo could give you a better answer.
Going from memory, I don't know of a good Hive SerDe that writes JSON, just ones that read it.

IMHO Pig/ElephantBird is the best so far, but then again I may be dated and biased.

I think you're on the right track, or at least the right train of thought.

HTH

-Mike


On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

Hello,
   I'm new to Hadoop.
   I have a large quantity of JSON documents with a structure similar to what is shown below.

   {
     g    : "some-group-identifier",
     sg   : "some-subgroup-identifier",
     j    : "some-job-identifier",
     page : 23,
     ...  // other fields omitted
     important-data : [
       {
         f1 : "abc",
         f2 : "a",
         f3 : "/",
         ...
       },
       ...
       {
         f1 : "xyz",
         f2 : "q",
         f3 : "/",
         ...
       },
     ],
     ...  // other fields omitted
     other-important-data : [
       {
         x1  : "ford",
         x2  : "green",
         x3  : 35,
         map : {
           "free-field"       : "value",
           "other-free-field" : "value2"
         }
       },
       ...
       {
         x1 : "vw",
         x2 : "red",
         x3 : 54,
         ...
       },
     ]
   }
 

Each file contains a single JSON document (gzip compressed; roughly 200KB of pretty-printed JSON text per document when uncompressed).

I am interested in analyzing only the "important-data" array and the "other-important-data" array.
My source data would be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the column "map" would be a complex column; all the others would be primitives.

( g, sg, j, page, f1, f2, f3 )

( g, sg, j, page, x1, x2, x3, map )

So, for each JSON document, I would like to "create" several rows, but I would like to avoid the intermediate step of persisting (and duplicating) the "flattened" data.

To avoid persisting the flattened data, I thought I had to write my own MapReduce job in Java, but I discovered that others have had the same problem of using JSON as the source, and there are somewhat "standard" solutions.
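(For the record, the kind of Java mapper I had in mind is roughly the untested sketch below. It assumes the documents arrive as (Text docId, Text json) records, e.g. from a sequence file, uses Jackson 2 for parsing, only handles the "important-data" array, and the class name is made up.)

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Rough sketch: emit one ( g, sg, j, page, f1, f2, f3 ) row per element of
// the "important-data" array, without persisting an intermediate flat copy.
public class FlattenImportantData
    extends Mapper<Text, Text, NullWritable, Text> {

  private final ObjectMapper jackson = new ObjectMapper();
  private final Text row = new Text();

  @Override
  protected void map(Text docId, Text json, Context context)
      throws IOException, InterruptedException {
    JsonNode doc = jackson.readTree(json.toString());
    String prefix = doc.path("g").asText() + "\t"
                  + doc.path("sg").asText() + "\t"
                  + doc.path("j").asText() + "\t"
                  + doc.path("page").asText();

    for (JsonNode e : doc.path("important-data")) {   // JsonNode is Iterable
      row.set(prefix + "\t"
          + e.path("f1").asText() + "\t"
          + e.path("f2").asText() + "\t"
          + e.path("f3").asText());
      context.write(NullWritable.get(), row);
    }
    // "other-important-data" could be written to a second output
    // (e.g. via MultipleOutputs) in the same pass.
  }
}

A map-only job (zero reducers) would be enough for this flattening step.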

From reading about the SerDe approach for Hive, I get the impression that each JSON document is transformed into a single "row" of the table, with some columns being an array or a map of other nested structures.
a) Is there a way to break each JSON document into several "rows" for a Hive external table?
b) It seems there are too many JSON SerDe libraries! Is any of them considered the de facto standard?

The Pig approach using Elephant Bird also seems promising. Does anybody have pointers to more user documentation for this project, or is browsing through the examples on GitHub my only source?

Thanks












