hadoop-mapreduce-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Aggregating data nested into JSON documents
Date Thu, 13 Jun 2013 22:05:56 GMT
I think you have a misconception of HBase.

You don't actually need mutable data for it to be effective.
The key is that you need access to specific records and work on a very small subset of
the data, not the complete data set.
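
For what it's worth, a minimal sketch of that access pattern with the HBase Java client; the
table name, column family, and row key below are made up, just to illustrate fetching one
specific record instead of scanning the whole data set:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FetchOneDoc {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // "docs" table and the composite row key are hypothetical
            HTable table = new HTable(conf, "docs");
            try {
                Get get = new Get(Bytes.toBytes("some-group-identifier/some-subgroup-identifier/23"));
                Result row = table.get(get);
                // e.g. the raw JSON kept in column family "d", qualifier "json"
                byte[] json = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("json"));
                System.out.println(json == null ? "not found" : Bytes.toString(json));
            } finally {
                table.close();
            }
        }
    }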


On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

> Hi Mike,
> 
> Yes, I have also thought about HBase or Cassandra, but my data is pretty much a snapshot; it
> does not require updates. Most of my aggregations will also only need to be computed once and
> won't change over time, with the exception of some aggregations based on the last N days of
> data. Should I still consider HBase? I think it will probably be a good fit for the aggregated
> data.
> 
> I have no idea what sequence files are, but I will take a look. My raw data is stored in the
> cloud, not in my Hadoop cluster.
> 
> I'll keep looking at Pig with ElephantBird. 
> Thanks,
> 
> -Jorge 
> 
> 
> 
> 
> 
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> Hi..
> 
> Have you thought about HBase? 
> 
> I would suggest that if you're using Hive or Pig, you look at taking these files and putting
> the JSON records into a sequence file, or a set of sequence files. (Then look at HBase to help
> index them.) 200KB is small.
> 
> That would be the same for either Pig or Hive.
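> 
> A minimal sketch of what that could look like with the SequenceFile writer API; the output
> path, key, and sample value below are made up, and in practice you would loop over your
> source files:
> 
>     import java.io.IOException;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
> 
>     public class PackJsonDocs {
>         public static void main(String[] args) throws IOException {
>             Configuration conf = new Configuration();
>             FileSystem fs = FileSystem.get(conf);
>             // hypothetical output path; block compression keeps many small docs compact
>             Path out = new Path("/user/jorge/json-docs.seq");
>             SequenceFile.Writer writer = SequenceFile.createWriter(
>                     fs, conf, out, Text.class, Text.class,
>                     SequenceFile.CompressionType.BLOCK);
>             try {
>                 // key = a document id of your choosing, value = the full JSON text
>                 writer.append(new Text("doc-0001"), new Text("{ \"g\": \"...\" }"));
>             } finally {
>                 writer.close();
>             }
>         }
>     }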
> 
> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice. And yes, you get
> each record as a row, but you can always flatten them as needed.
> 
> Hive? 
> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward Capriolo could give
> you a better answer.
> Going from memory, I don't know of a good Hive SerDe that writes JSON, only ones that read it.
> 
> IMHO Pig/ElephantBird is the best so far, but then again I may be dated and biased. 
> 
> I think you're on the right track or at least train of thought. 
> 
> HTH
> 
> -Mike
> 
> 
> On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
> 
>> Hello, 
>>    I'm new to Hadoop. 
>>    I have a large quantity of JSON documents with a structure similar to what is shown below.
>> 
>>    {
>>      g : "some-group-identifier",
>>      sg: "some-subgroup-identifier",
>>      j      : "some-job-identifier",
>>      page     : 23,
>>      ... // other fields omitted
>>      important-data : [
>>          {
>>            f1  : "abc",
>>            f2  : "a",
>>            f3  : "/",
>>            ...
>>          },
>>          ...
>>          {
>>            f1 : "xyz",
>>            f2  : "q",
>>            f3  : "/",
>>            ... 
>>          },
>>      ],
>>     ... // other fields omitted 
>>      other-important-data : [
>>         {
>>            x1  : "ford",
>>            x2  : "green",
>>            x3  : 35,
>>            map : {
>>                "free-field" : "value",
>>                "other-free-field" : "value2"
>>               }
>>          },
>>          ...
>>          {
>>            x1 : "vw",
>>            x2  : "red",
>>            x3  : 54,
>>            ... 
>>          },    
>>      ]
>>    }
>>  
>> 
>> Each file contains a single JSON document (gzip compressed; roughly 200KB per document of
>> pretty-printed JSON text, uncompressed).
>> 
>> I am interested in analyzing only the "important-data" array and the "other-important-data"
>> array.
>> Ideally, my source data would be easier to analyze if it looked like a couple of tables with
>> a fixed set of columns. Only the column "map" would be a complex column; all others would be
>> primitives.
>> 
>> ( g, sg, j, page, f1, f2, f3 )
>>  
>> ( g, sg, j, page, x1, x2, x3, map )
>> 
>> So, for each JSON document, I would like to "create" several rows, but I would like to avoid
>> the intermediate step of persisting (and duplicating) the "flattened" data.
>> 
>> To avoid persisting the flattened data, I thought I had to write my own map-reduce job in
>> Java, but I discovered that others have had the same problem of using JSON as the source and
>> that there are somewhat "standard" solutions.
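>> 
>> A minimal sketch of what such a hand-rolled mapper could look like, assuming the documents
>> are first packed into a sequence file of (id, json) Text pairs and using Jackson for the
>> parsing; the class name and tab-separated output layout are made up for illustration:
>> 
>>     import java.io.IOException;
>>     import com.fasterxml.jackson.databind.JsonNode;
>>     import com.fasterxml.jackson.databind.ObjectMapper;
>>     import org.apache.hadoop.io.NullWritable;
>>     import org.apache.hadoop.io.Text;
>>     import org.apache.hadoop.mapreduce.Mapper;
>> 
>>     // Reads (docId, json) pairs and emits one tab-separated row per element
>>     // of the "important-data" array, without persisting a flattened copy first.
>>     public class FlattenImportantDataMapper
>>             extends Mapper<Text, Text, Text, NullWritable> {
>> 
>>         private final ObjectMapper jackson = new ObjectMapper();
>> 
>>         @Override
>>         protected void map(Text docId, Text json, Context ctx)
>>                 throws IOException, InterruptedException {
>>             JsonNode doc = jackson.readTree(json.toString());
>>             String prefix = doc.path("g").asText() + "\t"
>>                     + doc.path("sg").asText() + "\t"
>>                     + doc.path("j").asText() + "\t"
>>                     + doc.path("page").asText();
>>             for (JsonNode item : doc.path("important-data")) {
>>                 String row = prefix + "\t"
>>                         + item.path("f1").asText() + "\t"
>>                         + item.path("f2").asText() + "\t"
>>                         + item.path("f3").asText();
>>                 ctx.write(new Text(row), NullWritable.get());
>>             }
>>             // the "other-important-data" array could be handled the same way
>>         }
>>     }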
>> 
>> By reading about the SerDe approach for Hive, I get the impression that each JSON document is
>> transformed into a single "row" of the table, with some columns being an array or a map of
>> other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Is any of them considered the de facto
>> standard?
>> 
>> The Pig approach using Elephant Bird also seems promising. Does anybody have pointers to more
>> user documentation on this project? Or is browsing through the examples on GitHub my only
>> source?
>> 
>> Thanks
>> 
> 
> 

