hadoop-hdfs-user mailing list archives

From: Tecno Brain <cerebrotecnolog...@gmail.com>
Subject: Aggregating data nested into JSON documents
Date: Thu, 13 Jun 2013 00:57:21 GMT
Hello,
   I'm new to Hadoop.
   I have a large quantity of JSON documents with a structure similar to
what is shown below.

   {
     g    : "some-group-identifier",
     sg   : "some-subgroup-identifier",
     j    : "some-job-identifier",
     page : 23,
     ... // other fields omitted
     important-data : [
         {
           f1 : "abc",
           f2 : "a",
           f3 : "/",
           ...
         },
         ...
         {
           f1 : "xyz",
           f2 : "q",
           f3 : "/",
           ...
         }
     ],
     ... // other fields omitted
     other-important-data : [
         {
           x1  : "ford",
           x2  : "green",
           x3  : 35,
           map : {
               "free-field"       : "value",
               "other-free-field" : "value2"
           }
         },
         ...
         {
           x1 : "vw",
           x2 : "red",
           x3 : 54,
           ...
         }
     ]
   }


Each file contains a single JSON document (gzip-compressed; roughly 200 KB of
pretty-printed JSON text per document, uncompressed).

I am interested in analyzing only the "important-data" and
"other-important-data" arrays.
Ideally, my source data would look like a couple of tables with a fixed set
of columns, which would make it much easier to analyze. Only the "map" column
would be complex; all the others would be primitives.

( g, sg, j, page, f1, f2, f3 )

( g, sg, j, page, x1, x2, x3, map )
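
For the sample document above, the first table would then contain rows like:

( "some-group-identifier", "some-subgroup-identifier", "some-job-identifier", 23, "abc", "a", "/" )
( "some-group-identifier", "some-subgroup-identifier", "some-job-identifier", 23, "xyz", "q", "/" )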

So, for each JSON document, I would like to "create" several rows, but I
would like to avoid the intermediate step of persisting (and duplicating) the
"flattened" data.

To avoid persisting the flattened data, I thought I would have to write my
own MapReduce job in Java, but I discovered that others have had the same
problem of using JSON as the source, and that there are somewhat "standard"
solutions.
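
For reference, this is roughly the kind of mapper I was considering writing
myself. It is only a sketch: it assumes each map() call receives one whole
JSON document as its value (e.g. via a whole-file input format, since the
documents are pretty-printed across many lines), it uses Jackson for the
parsing, and the class name and output layout are just placeholders.

    import java.io.IOException;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits one tab-separated row per element of "important-data":
    //   (g, sg, j, page, f1, f2, f3)
    public class FlattenImportantDataMapper
        extends Mapper<NullWritable, Text, Text, NullWritable> {

      private static final ObjectMapper JSON = new ObjectMapper();
      private final Text row = new Text();

      @Override
      protected void map(NullWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        JsonNode doc = JSON.readTree(value.toString());
        String prefix = doc.path("g").asText() + '\t'
            + doc.path("sg").asText() + '\t'
            + doc.path("j").asText() + '\t'
            + doc.path("page").asText();

        // One output row per element of the "important-data" array.
        for (JsonNode item : doc.path("important-data")) {
          row.set(prefix + '\t'
              + item.path("f1").asText() + '\t'
              + item.path("f2").asText() + '\t'
              + item.path("f3").asText());
          context.write(row, NullWritable.get());
        }
        // "other-important-data" would be flattened the same way,
        // written to a second output (e.g. via MultipleOutputs).
      }
    }

But hand-writing and maintaining that kind of job is exactly what I was
hoping to avoid, hence the questions below.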

By reading about the SerDe approach for Hive, I get the impression that each
JSON document is transformed into a single "row" of the table, with some
columns being arrays, maps, or other nested structures.
a) Is there a way to break each JSON document into several "rows" for a
Hive external table?
b) It seems there are too many JSON SerDe libraries! Is any of them
considered the de facto standard?

The Pig approach, using Elephant Bird, also seems promising. Does anybody
have pointers to more user documentation on this project? Or is browsing
through the examples on GitHub my only option?

Thanks
