hadoop-user mailing list archives

From "Kommareddi, Mahesh" <Mahesh.Kommare...@e-hps.com>
Subject Process exported files with sections defined in file
Date Tue, 22 Mar 2016 20:09:42 GMT

Hi All. I'm still fairly new and I have a question about how to efficiently process a file
that contains a bulk export of a database, where each group of records is defined by the
identifying record that precedes it. In other words, a row from table A is exported
to the file, then the records associated with that row's id are exported from table
B into the file. The process is repeated for each row in table A.

For example...
15 ident1(1) ident1(2) <--- defines identifying information for the successive "20 records"
20 info1(ident1)(1) info1(ident1)(2) info1(ident1)(3) <--- record for this "15 type record"
20 infoN(ident1)(1) infoN(ident1)(2) infoN(ident1)(3) <--- record for this "15 type record"
15 ident2(1) ident2(2) <--- defines new id for the next group of 20 type records
20 infoX(ident2)(1) info1X(2) info1X(3) <--- record for this "15 type record"
20 infoX+1(ident2)(1) infoX+1(ident2)(2) infoX+1(ident2)(3) <--- record for this "15 type record"

This continues until the next 15 type record appears. Each 15 type record is followed by an
arbitrary number of 20 type records, then another 15 type record follows with more 20s
belonging to the new 15 type record, ad infinitum.
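
To make the layout concrete, the grouping described above could be sketched on a single machine roughly as follows. This is only an illustration, assuming whitespace-separated fields with the record type as the first field; `iter_sections` is a hypothetical helper name, not part of Hadoop.

```python
from typing import Iterable, Iterator, List, Tuple

Section = Tuple[List[str], List[List[str]]]


def iter_sections(lines: Iterable[str]) -> Iterator[Section]:
    """Yield (header_fields, detail_rows) for each 15 type section.

    Assumption: fields are whitespace-separated and the first field
    is the record type ("15" or "20").
    """
    header = None
    rows: List[List[str]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        record_type, values = fields[0], fields[1:]
        if record_type == "15":
            # A new 15 record closes the previous section, if any.
            if header is not None:
                yield header, rows
            header, rows = values, []
        elif record_type == "20" and header is not None:
            # 20 records seen before any 15 record are ignored.
            rows.append(values)
    # Emit the final section at end of input.
    if header is not None:
        yield header, rows
```

Because it reads one line at a time and holds only the current section in memory, the same loop would work on a file object as well as on a list of strings.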

I was hoping to run various kinds of map-reduce jobs on this data. For example, I want to find
the max info value in each column for each 15 "section". Is
there any way to handle that?
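
As a rough sketch of the computation in question, written in plain Python rather than against the Hadoop API: a sequential pre-pass tags every 20 record with the id of its 15 record, a map step emits one ((section id, column), value) pair per field, and a reduce step keeps the maximum per key. It assumes that the first field after the 15 is the section id and that the info values are numeric; all function names here are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Key = Tuple[str, int]  # (section id, column index)


def tag_records(lines: Iterable[str]) -> Iterator[Tuple[str, List[float]]]:
    """Sequential pre-pass: pair each 20 record with its section id."""
    current_id: Optional[str] = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "15":
            # Assumption: the first field after the type is the id.
            current_id = fields[1] if len(fields) > 1 else None
        elif fields[0] == "20" and current_id is not None:
            # Assumption: info values parse as numbers.
            yield current_id, [float(v) for v in fields[1:]]


def map_record(section_id: str, values: List[float]) -> Iterator[Tuple[Key, float]]:
    """Map step: emit one keyed value per column."""
    for column, value in enumerate(values):
        yield (section_id, column), value


def reduce_max(pairs: Iterable[Tuple[Key, float]]) -> Dict[Key, float]:
    """Reduce step: keep the largest value seen for each key."""
    best: Dict[Key, float] = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return best


def max_per_section(lines: Iterable[str]) -> Dict[Key, float]:
    """Chain the pre-pass, map step, and reduce step."""
    pairs = (
        pair
        for section_id, values in tag_records(lines)
        for pair in map_record(section_id, values)
    )
    return reduce_max(pairs)
```

The tagging step is the part that depends on reading the file in order. Once each 20 record carries its own section id, the records no longer depend on their neighbours, so the map and reduce steps could run on any split of the data.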

I was hoping I wouldn't have to split the file myself… These files get to be 22GB each.

I thought a strategy similar to the one used for processing XML files might be useful, but I
don’t think it would apply here.

I would appreciate any help and insight.

Best Regards,
