apex-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2008) Create hdfs file input module
Date Wed, 09 Mar 2016 08:28:40 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186748#comment-15186748 ]

ASF GitHub Bot commented on APEXMALHAR-2008:

GitHub user DT-Priyanka opened a pull request:


    APEXMALHAR-2008: Create HDFS File Reader module

    Adds an HDFS file reader module.
    1. The module reads a file or a list of files (a directory is also accepted) and emits the file
    2. The module can be configured to emit blocks in order or out of order.
    3. The module reads file blocks in parallel. The number of parallel readers is configurable;
       if it is not configured, the module increases or decreases the number of readers
       dynamically according to the input data rate.
    Also updates FileSplitterInput with some improvements:
    1. Tracks the last file reference times of each folder separately, to avoid duplicates
       (duplicates could arise when multiple files or subdirectories share the same relative path).
    2. Small code improvements.
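
For readers unfamiliar with the idea, here is a minimal plain-Java sketch of what items 2 and 3 of the module description mean: a file is cut into fixed-size blocks, a configurable number of reader threads load the blocks in parallel, and the blocks are emitted either in their original order or in completion order. It is not the module's code and does not use the Apex or Malhar API. The names `BlockReaderSketch`, `Block`, and `readBlocks` are hypothetical, an in-memory byte array stands in for an HDFS file, and the dynamic scaling of readers is not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical names throughout; this is an illustration, not the Malhar module.
class BlockReaderSketch {

  /** One block of a file: its sequence number and its bytes. */
  static final class Block {
    final int index;
    final byte[] data;

    Block(int index, byte[] data) {
      this.index = index;
      this.data = data;
    }
  }

  /**
   * Splits data into blocks of blockSize bytes, reads them with a fixed pool
   * of reader threads, and returns the blocks in the order they are emitted.
   */
  static List<Block> readBlocks(byte[] data, int blockSize, int readers, boolean inOrder)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(readers);
    try {
      CompletionService<Block> done = new ExecutorCompletionService<>(pool);
      List<Future<Block>> submitted = new ArrayList<>();
      int index = 0;
      for (int offset = 0; offset < data.length; offset += blockSize) {
        final int i = index++;
        final int from = offset;
        final int to = Math.min(offset + blockSize, data.length);
        // Each task copies one block; a real reader would read from HDFS here.
        submitted.add(done.submit(() -> new Block(i, Arrays.copyOfRange(data, from, to))));
      }
      List<Block> emitted = new ArrayList<>();
      for (Future<Block> f : submitted) {
        // In order: wait for each block in submission order.
        // Out of order: take whichever block finishes next.
        emitted.add(inOrder ? f.get() : done.take().get());
      }
      return emitted;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] data = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
    for (Block b : readBlocks(data, 8, 3, true)) {
      System.out.println(b.index + ": " + new String(b.data, StandardCharsets.US_ASCII));
    }
  }
}
```

With `inOrder` set to `true` the blocks always come back as 0, 1, 2. With `false` every block is still returned exactly once, but the sequence depends on which reader finishes first.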

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DT-Priyanka/incubator-apex-malhar APEXMALHAR-2008-hdfs-input-module

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #207

commit 8ffb34abe48f525d401c3932d79ada6c71214e88
Author: Priyanka Gugale <priyanka@datatorrent.com>
Date:   2016-03-08T08:42:13Z

    APEXMALHAR-2008: Create HDFS File Reader module


> Create hdfs file input module 
> ------------------------------
>                 Key: APEXMALHAR-2008
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2008
>             Project: Apache Apex Malhar
>          Issue Type: Task
>            Reporter: Priyanka Gugale
>            Assignee: Priyanka Gugale
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
> To read HDFS files in parallel using Apex, we normally use the FileSplitter and FileReader
> operators. It would be a good idea to combine those operators into a single module. Having a
> module will give us a readily usable set of operators to read HDFS files.

This message was sent by Atlassian JIRA
