avro-user mailing list archives

From Pasquale Salza <pasquale.sa...@gmail.com>
Subject Hadoop custom AvroInputFormat and split by number of records
Date Mon, 20 Jan 2014 00:14:26 GMT
I am writing about a problem I have run into while developing a custom
AvroInputFormat class. I do not have a fully clear picture of the solution
yet, so I will explain my goal in the hope of getting better help from you.

First, I need to join multiple Avro files together. To do this, I wrote a
custom implementation of FileInputFormat that works with multiple paths.

Second, I need to control the number of records in each split. For this
part I resorted to a rather dirty workaround.

In each split I store:
1. The paths of the files in which the corresponding records are stored;
2. The first usable sync point in the first file;
3. The offset, expressed as a number of records, from that sync point in
the first file.
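
A minimal sketch of a descriptor holding the three items above may make the
layout clearer. The class name RecordCountSplit and its field names are
hypothetical, not part of Avro or Hadoop. The write/readFields pair mirrors
the shape of Hadoop's Writable interface, but the sketch uses only java.io so
that it stands on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical split descriptor: file paths, first sync point, record offset. */
final class RecordCountSplit {
    private final List<String> paths = new ArrayList<>(); // files of this split, in read order
    private long firstSync;    // sync point to jump to in the first file
    private long recordOffset; // records to skip after that sync point

    RecordCountSplit() {}

    RecordCountSplit(List<String> paths, long firstSync, long recordOffset) {
        this.paths.addAll(paths);
        this.firstSync = firstSync;
        this.recordOffset = recordOffset;
    }

    /** Same shape as Writable.write: path count, paths, then the two longs. */
    void write(DataOutput out) throws IOException {
        out.writeInt(paths.size());
        for (String p : paths) out.writeUTF(p);
        out.writeLong(firstSync);
        out.writeLong(recordOffset);
    }

    /** Same shape as Writable.readFields: reads back what write produced. */
    void readFields(DataInput in) throws IOException {
        paths.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) paths.add(in.readUTF());
        firstSync = in.readLong();
        recordOffset = in.readLong();
    }

    /** Convenience helper for the round trip below. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            write(new DataOutputStream(bos));
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static RecordCountSplit fromBytes(byte[] data) {
        try {
            RecordCountSplit s = new RecordCountSplit();
            s.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return s;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<String> getPaths() { return paths; }
    long getFirstSync() { return firstSync; }
    long getRecordOffset() { return recordOffset; }
}
```

A split built as new RecordCountSplit(paths, 128L, 7L) should come back
unchanged from RecordCountSplit.fromBytes(split.toBytes()).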

The InputFormat:
1. Uses SeekableInput, ReflectData, DatumReader and DataFileReader to
iterate over all the records in all the files;
2. Builds the splits, storing the information needed.
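
The split-building step can be sketched without any Avro calls, assuming the
pass over the files has already produced, for each block, its file, its sync
position and its record count. The names SplitPlanner, Block, SplitStart and
plan are hypothetical, and how the block list is gathered is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlanner {
    /** One block as seen by the planning pass (assumed input, not an Avro type). */
    static final class Block {
        final int fileIndex;
        final long syncPosition;
        final long recordCount;
        Block(int fileIndex, long syncPosition, long recordCount) {
            this.fileIndex = fileIndex;
            this.syncPosition = syncPosition;
            this.recordCount = recordCount;
        }
    }

    /** Where a split starts: file, sync point, records to skip, records to read. */
    static final class SplitStart {
        final int fileIndex;
        final long syncPosition;
        final long recordOffset;
        final long length;
        SplitStart(int fileIndex, long syncPosition, long recordOffset, long length) {
            this.fileIndex = fileIndex;
            this.syncPosition = syncPosition;
            this.recordOffset = recordOffset;
            this.length = length;
        }
    }

    /** Cuts the concatenated record stream into splits of recordsPerSplit records. */
    static List<SplitStart> plan(List<Block> blocks, long recordsPerSplit) {
        if (recordsPerSplit <= 0) {
            throw new IllegalArgumentException("recordsPerSplit must be positive");
        }
        long total = 0;
        for (Block b : blocks) total += b.recordCount;

        List<SplitStart> splits = new ArrayList<>();
        long nextStart = 0; // global index of the first record of the next split
        long seen = 0;      // records contained in the blocks before the current one
        for (Block b : blocks) {
            // A single block may contain the start of several splits.
            while (nextStart < seen + b.recordCount) {
                long length = Math.min(recordsPerSplit, total - nextStart);
                splits.add(new SplitStart(b.fileIndex, b.syncPosition, nextStart - seen, length));
                nextStart += recordsPerSplit;
            }
            seen += b.recordCount;
        }
        return splits;
    }
}
```

With blocks of 4, 4 and 3 records (the last one in a second file) and 5
records per split, the second split starts at the second block of the first
file with an offset of 1 record, and the last split holds the single
remaining record.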

The RecordReader then:
1. Opens the first file;
2. Syncs to the sync point;
3. Iterates until the offset is reached, again using SeekableInput,
ReflectData, DatumReader and DataFileReader;
4. Starts reading the records one by one to produce the output.
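
Steps 3 and 4 amount to a skip-then-read loop, sketched below. A plain
java.util.Iterator stands in for each opened file, with the first one assumed
to be already positioned at its sync point. The class name SkippingReader and
the limit parameter (records per split) are hypothetical. Unlike the steps
above, this sketch lets the skip continue into the next file if the first one
runs out.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Skips `offset` records, then yields at most `limit` records across the files. */
final class SkippingReader<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> files; // one iterator per file, in read order
    private Iterator<T> current;               // file currently being read
    private long toSkip;                       // records still to discard
    private long remaining;                    // records still to emit

    SkippingReader(List<Iterator<T>> files, long offset, long limit) {
        this.files = files.iterator();
        this.current = this.files.hasNext() ? this.files.next() : Collections.<T>emptyIterator();
        this.toSkip = offset;
        this.remaining = limit;
    }

    /** Moves on to the next file when the current one is exhausted. */
    private boolean advance() {
        while (!current.hasNext()) {
            if (!files.hasNext()) return false;
            current = files.next();
        }
        return true;
    }

    @Override
    public boolean hasNext() {
        // Step 3: discard the records that belong to the previous split.
        while (toSkip > 0 && advance()) {
            current.next();
            toSkip--;
        }
        // Step 4: a record is available only if the split is not yet full.
        return remaining > 0 && advance();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        remaining--;
        return current.next();
    }
}
```

The cost of the skip is linear in the offset, since every skipped record is
still consumed from the underlying iterator.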

The biggest bottleneck is that I can only use a sync point to jump directly
to a position in a file; it is not possible to "seek" to an individual
record to make this faster. Do you have any advice on this?
