apex-dev mailing list archives

From "Matt Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.
Date Wed, 02 Nov 2016 18:16:58 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629893#comment-15629893
] 

Matt Zhang commented on APEXMALHAR-2274:
----------------------------------------

The scanner in FileSplitterInput is more complicated: it retrieves the file status and supports
regex matching. In our case we only need a lightweight process to get the paths of all the files
in the directory, so from a performance standpoint it's better to use a dedicated lightweight scanner.

> AbstractFileInputOperator gets killed when there are a large number of files.
> -----------------------------------------------------------------------------
>
>                 Key: APEXMALHAR-2274
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2274
>             Project: Apache Apex Malhar
>          Issue Type: Bug
>            Reporter: Munagala V. Ramanath
>            Assignee: Matt Zhang
>
> When there are a large number of files in the monitored directory, the call to DirectoryScanner.scan()
> can take a long time since it calls FileSystem.listStatus(), which returns the entire list.
> Meanwhile, the AppMaster deems this operator hung and restarts it, which again results in the
> same problem.
> It should use FileSystem.listStatusIterator() [in Hadoop 2.7.X] or FileSystem.listFiles()
> [in 2.6.X] or other similar calls that return
> a remote iterator to limit the number of files processed in a single call.
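The fix suggested above is to pull directory entries incrementally instead of materializing the whole listing. As a rough illustration of that bounded-scan pattern (not the actual Malhar patch): Hadoop's FileSystem.listStatusIterator()/listFiles() return a RemoteIterator, and here java.nio's lazy DirectoryStream stands in for it so the sketch is runnable without a Hadoop cluster; the class name BoundedScanner and the batch-size parameter are illustrative choices, not names from the codebase.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Bounded incremental directory scan: pull at most `limit` entries per
// call instead of listing everything at once, so each scan invocation
// stays short and the operator never looks hung to the AppMaster.
public class BoundedScanner implements AutoCloseable {
    private final DirectoryStream<Path> stream;
    private final Iterator<Path> it;

    public BoundedScanner(Path dir) throws IOException {
        // Lazy listing; entries are fetched as the iterator advances.
        this.stream = Files.newDirectoryStream(dir);
        this.it = stream.iterator();
    }

    // Return up to `limit` paths; call repeatedly (e.g. once per
    // operator window) until an empty batch signals the scan is done.
    public List<Path> nextBatch(int limit) {
        List<Path> batch = new ArrayList<>();
        while (batch.size() < limit && it.hasNext()) {
            batch.add(it.next());
        }
        return batch;
    }

    @Override
    public void close() throws IOException {
        stream.close();
    }
}
```

With a real Hadoop FileSystem the same loop would advance a RemoteIterator<FileStatus> instead of a java.util.Iterator, but the control flow is identical.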



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
