avro-user mailing list archives

From Pasquale Salza <pasquale.sa...@gmail.com>
Subject MapReduce and Avro split by number of records
Date Thu, 06 Feb 2014 21:53:05 GMT
Hi everybody,
I'm looking for a solution to the following problem: splitting a group of
Avro files by number of records rather than by block size, which is the
default.

For the moment, my strategy is:
- Iterate over the records of the input files;
- Create a new InputSplit each time a record limit is reached, and store in
it: the file paths, the last sync point encountered in the first file, and
an offset, i.e. the number of records after that sync point at which the
split starts;
- The record reader opens the first path and seeks to the stored sync
point. Then it skips, by iterating, the number of records given by the
offset and starts reading the split's records.
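
To make the bookkeeping concrete, here is a minimal sketch of the
split-planning step only, written in plain Python rather than against the
Hadoop or Avro APIs. It assumes the sync position and record count of every
block have already been collected for each file; the names RecordSplit and
plan_splits, and the shape of the input, are hypothetical and only serve the
illustration.

```python
from dataclasses import dataclass


@dataclass
class RecordSplit:
    """Hypothetical stand-in for the custom InputSplit described above."""
    paths: list       # files the split touches, in read order
    sync_pos: int     # sync point in paths[0] that the reader seeks to
    skip: int         # records to skip after that sync point
    length: int = 0   # number of records that belong to this split


def plan_splits(files, limit):
    """Group records into splits of at most `limit` records each.

    `files` is an assumed input: a list of (path, blocks) pairs in read
    order, where blocks is a list of (sync_pos, record_count) tuples.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    splits = []
    current = None
    for path, blocks in files:
        for sync_pos, count in blocks:
            consumed = 0  # records of this block already assigned
            while consumed < count:
                if current is None or current.length == limit:
                    # Limit reached: the next split starts inside this
                    # block, `consumed` records after its sync point.
                    current = RecordSplit([path], sync_pos, consumed)
                    splits.append(current)
                elif current.paths[-1] != path:
                    # The open split continues into a new file.
                    current.paths.append(path)
                take = min(limit - current.length, count - consumed)
                current.length += take
                consumed += take
    return splits


# Two files holding 3 + 3 and 2 records, split every 4 records.
example = [("a.avro", [(16, 3), (120, 3)]), ("b.avro", [(16, 2)])]
for split in plan_splits(example, 4):
    print(split)
```

The record reader side would then open paths[0], seek to sync_pos, discard
skip records, and read length records, moving on to the next path when the
current file is exhausted.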

I am obliged to split by number of records because the MapReduce work, in
my case, is computation-centric rather than data-centric.

Do you have any better solution?

Pasquale Salza
