apex-dev mailing list archives

From Yogi Devendra <devendra.vyavah...@gmail.com>
Subject Reading large HDFS files record by record
Date Thu, 28 Apr 2016 10:59:43 GMT

My use case involves reading from HDFS and emitting each record as a separate
tuple. A record can be either a fixed-length record or a separator-based record
(for example, newline-delimited). The expected output is a byte[] for each record.

I am planning to solve this as follows:
- A new operator that extends BlockReader.
- It will have a configuration option to select the mode for FIXED_LENGTH,
- It will use the appropriate ReaderContext based on the mode.
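
To make the two record modes concrete, here is a minimal standalone sketch in
plain Java. It does not use the Apex/Malhar API: the class name RecordSplitter,
the method readRecords, and the mode name DELIMITED are hypothetical names chosen
for illustration (only FIXED_LENGTH appears above). It also reads from a single
InputStream, so it does not address records that span block boundaries, which a
block-based reader would have to handle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordSplitter {

    // Hypothetical mode names; only FIXED_LENGTH is mentioned in the proposal.
    public enum Mode { FIXED_LENGTH, DELIMITED }

    /**
     * Splits the stream into records and returns each record as a byte[].
     * recordLength is used only in FIXED_LENGTH mode, delimiter only in DELIMITED mode.
     */
    public static List<byte[]> readRecords(InputStream in, Mode mode,
                                           int recordLength, byte delimiter) throws IOException {
        List<byte[]> records = new ArrayList<>();
        if (mode == Mode.FIXED_LENGTH) {
            if (recordLength <= 0) {
                throw new IllegalArgumentException("recordLength must be positive");
            }
            byte[] buf = new byte[recordLength];
            int filled = 0;
            int n;
            // read() may return fewer bytes than requested, so keep filling the buffer.
            while ((n = in.read(buf, filled, recordLength - filled)) != -1) {
                filled += n;
                if (filled == recordLength) {
                    records.add(buf.clone());
                    filled = 0;
                }
            }
            // Assumption: a trailing partial record is emitted as a shorter array.
            if (filled > 0) {
                records.add(Arrays.copyOf(buf, filled));
            }
        } else {
            ByteArrayOutputStream current = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                if ((byte) b == delimiter) {
                    // The delimiter itself is not part of the emitted record.
                    records.add(current.toByteArray());
                    current.reset();
                } else {
                    current.write(b);
                }
            }
            // Emit the last record if the input does not end with a delimiter.
            if (current.size() > 0) {
                records.add(current.toByteArray());
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ab\ncd\nef".getBytes(StandardCharsets.US_ASCII);
        List<byte[]> lines = readRecords(new ByteArrayInputStream(data),
                                         Mode.DELIMITED, 0, (byte) '\n');
        for (byte[] record : lines) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
    }
}
```

In an actual operator, each byte[] would presumably be emitted on the output
port instead of being collected into a list; the list is used here only to keep
the sketch self-contained.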

The reason for having a separate operator instead of reusing BlockReader is that
its output port signature differs from BlockReader's. The new operator can be used
in conjunction with FileSplitter.

Any feedback?

~ Yogi
