hadoop-mapreduce-user mailing list archives

From "Kommareddi, Mahesh" <Mahesh.Kommare...@e-hps.com>
Subject Process exported files with sections defined in file
Date Tue, 22 Mar 2016 20:09:42 GMT

Hi all. I'm still fairly new to Hadoop, and I have a question about how to efficiently process a file
that is a bulk export of a database, where each exported row is followed by the detail records
that belong to it. In other words, a row from table A is written to the file, and then the records
from table B associated with that row's id are written after it. The process is repeated for each
row in table A.

For example...
15 ident1(1) ident1(2)                                        <--- defines the identifying information for the following "20" records
20 info1(ident1)(1) info1(ident1)(2) info1(ident1)(3)         <--- record belonging to this "15" record
.
.
.
20 infoN(ident1)(1) infoN(ident1)(2) infoN(ident1)(3)         <--- record belonging to this "15" record
.
.
.
15 ident2(1) ident2(2)                                        <--- defines the new id for the next group of "20" records
20 infoX(ident2)(1) infoX(ident2)(2) infoX(ident2)(3)         <--- record belonging to this "15" record
20 infoX+1(ident2)(1) infoX+1(ident2)(2) infoX+1(ident2)(3)   <--- record belonging to this "15" record
.
.
.

The "20" records continue until the next "15" record appears, which starts a new group; that "15"
record is in turn followed by an arbitrary number of "20" records belonging to it, and so on for
the rest of the file.
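
To make the layout concrete, here is a tiny sequential sketch of the grouping I have in mind
(plain Java, purely illustrative; the "15"/"20" record-type prefixes are real, the class and
method names are just made up for the example):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SectionGrouping {

    /**
     * Groups the export line by line: every "20" line belongs to the most
     * recently seen "15" line. Returns header -> its "20" lines, in file order.
     */
    public static Map<String, List<String>> group(Iterable<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.startsWith("15 ")) {
                current = new ArrayList<>();
                sections.put(line.substring(3), current);   // new section header
            } else if (line.startsWith("20 ") && current != null) {
                current.add(line);                           // detail record for that header
            }
        }
        return sections;
    }
}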

I was hoping to run various MapReduce jobs over this data. For example, I want to find the
maximum info value in each column within each "15" section. Is there a good way to handle
that?
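
To be concrete about what I mean by "max info value in each column", here is a rough, untested
helper that would compute it for the "20" lines of a single section (it assumes the info columns
are numeric and whitespace-separated, which may not hold for every export):

import java.util.Arrays;
import java.util.List;

public class SectionColumnMax {

    /**
     * Given the "20" lines that follow one "15" header, return the maximum
     * value seen in each info column (the columns after the leading "20").
     */
    public static double[] columnMax(List<String> twentyLines) {
        double[] max = null;
        for (String line : twentyLines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2 || !"20".equals(cols[0])) {
                continue;                                   // skip anything that is not a "20" record
            }
            if (max == null) {
                max = new double[cols.length - 1];
                Arrays.fill(max, Double.NEGATIVE_INFINITY);
            }
            for (int c = 1; c < cols.length && c <= max.length; c++) {
                max[c - 1] = Math.max(max[c - 1], Double.parseDouble(cols[c]));
            }
        }
        return max;                                         // null if the section had no "20" lines
    }
}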

I was hoping I wouldn't have to split the files myself… these files get to be 22 GB each.


I thought a strategy similar to the one used for processing XML files might be useful, but I'm
not sure it applies here.
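
One direction I have been wondering about, though I'm not sure it is sound: as far as I know, the
stock TextInputFormat/LineRecordReader in Hadoop 2.x honours the textinputformat.record.delimiter
property, so setting the delimiter to "\n15 " should hand each map() call one whole section (the
"15" header plus its "20" lines), somewhat like the start-tag scanning trick used for XML. Below
is a rough, untested sketch of that wiring, reusing the columnMax() helper from above; it assumes
a single section fits comfortably in memory, that "15 " at the start of a line only ever marks a
header record, and I know multi-byte custom delimiters across split boundaries have had bugs in
some Hadoop releases, so this would need checking:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SectionMaxJob {

  public static class SectionMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // With the delimiter set to "\n15 ", each value is one whole section:
      // the header line followed by its "20" lines. The very first record in
      // the file still carries its leading "15 "; later ones do not.
      List<String> lines = Arrays.asList(value.toString().split("\n"));
      String header = lines.get(0).startsWith("15 ")
          ? lines.get(0).substring(3) : lines.get(0);

      double[] max = SectionColumnMax.columnMax(lines.subList(1, lines.size()));
      if (max == null) {
        return;                                  // section had no "20" records
      }
      StringBuilder out = new StringBuilder();
      for (double m : max) {
        out.append(m).append('\t');
      }
      context.write(new Text(header), new Text(out.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Treat "\n15 " as the record separator so one section == one record.
    conf.set("textinputformat.record.delimiter", "\n15 ");

    Job job = Job.getInstance(conf, "per-section column max");
    job.setJarByClass(SectionMaxJob.class);
    job.setMapperClass(SectionMapper.class);
    job.setNumReduceTasks(0);                    // map-only: each section is self-contained
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The sketch is map-only because each section would be self-contained once it arrives as a single
record; a reducer would only be needed if a section could ever be split across records.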

I would appreciate any help and insight.

Best Regards,
Mahesh
