apex-dev mailing list archives

From Priyanka Gugale <priya...@datatorrent.com>
Subject HDFS File Reader Module
Date Tue, 16 Feb 2016 09:01:19 GMT

A common use case is reading large files on HDFS in parallel, i.e. many
reader threads read different parts of the same file at once. We can achieve
this on top of Apex using the following Malhar operators:

1. AbstractFileSplitter
2. AbstractBlockReader

Here the FileSplitter uses the file metadata to create small reader tasks,
each covering one part of the file. The BlockReaders then run those tasks in
parallel to read the file.
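
To make the split-then-read pattern concrete, here is a minimal sketch in
plain Java. It is not the Malhar API: the class ParallelBlockReadSketch, the
Block type, and the methods split, readBlock and readInParallel are
hypothetical names chosen for illustration. It also assumes a local file read
through java.nio rather than HDFS, a fixed block size, and a file small enough
to fit in a single byte array.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: plain JDK code, not the Malhar operators.
final class ParallelBlockReadSketch {

    /** One reader task: a byte range of the file (hypothetical type). */
    static final class Block {
        final long offset;
        final int length;

        Block(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** "Splitter" role: turn file metadata (its length) into fixed-size reader tasks. */
    static List<Block> split(long fileLength, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            int length = (int) Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(offset, length));
        }
        return blocks;
    }

    /** "Block reader" role: read exactly one block with positional reads. */
    static byte[] readBlock(FileChannel channel, Block block) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(block.length);
        long position = block.offset;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position); // does not move the channel's own position
            if (n < 0) {
                throw new IOException("Unexpected end of file at offset " + position);
            }
            position += n;
        }
        return buffer.array();
    }

    /** Runs one read task per block on a fixed pool and reassembles the file in order. */
    static byte[] readInParallel(Path file, int blockSize, int readers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            List<Block> blocks = split(channel.size(), blockSize);
            List<Future<byte[]>> pending = new ArrayList<>();
            for (Block block : blocks) {
                pending.add(pool.submit(() -> readBlock(channel, block)));
            }
            // Assumption: the whole file fits in one array (under 2 GB).
            byte[] result = new byte[(int) channel.size()];
            for (int i = 0; i < blocks.size(); i++) {
                byte[] data = pending.get(i).get();
                System.arraycopy(data, 0, result, (int) blocks.get(i).offset, data.length);
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, ParallelBlockReadSketch.readInParallel(path, 1024, 4) splits the
file into 1 KB blocks and reads them with four threads. In the Apex case the
splitter and the readers would be separate operators, with the reader
partitioned to get the parallelism.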

As these operators are generally used together to read files, I propose we
create a module for this, called HDFSFileReader.

Please share your suggestions on this proposal.

