apex-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2184) Add documentation for FileSystem Input Operator
Date Fri, 12 Aug 2016 15:36:22 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419013#comment-15419013 ]

ASF GitHub Bot commented on APEXMALHAR-2184:

Github user amberarrow commented on a diff in the pull request:

    --- Diff: docs/operators/fsInputOperator.md ---
    @@ -0,0 +1,101 @@
    +File Input Operator
    +## Operator Objective
    +This operator scans a directory for files. Files are then read and split into tuples,
which are emitted. The default implementation scans a single directory. The operator is fault
tolerant. It tracks previously read files and current offset as part of checkpoint state.
In case of failure the operator will skip files that were already processed and fast forward
to the offset of the current file. Supports partitioning and changes to number of partitions.
The directory scanner is responsible to only accept the files that belong to a partition.
    +File Input Operator is **idempotent**, **fault-tolerant** and **partitionable**.
    +## Operator Usecase
    +1. Read all files of a directory and then keep scanning it for newly added files.
    +## Operator Information
    +1. Operator location: ***malhar-library***
    +2. Available since: ***1.0.2***
    +3. Operator state: ***Stable***
    +3. Java Packages:
    +    * Operator: ***[com.datatorrent.lib.io.fs.AbstractFileInputOperator](https://www.datatorrent.com/docs/apidocs/com/datatorrent/lib/io/fs/AbstractFileInputOperator.html)***
    +### AbstractFileInputOperator
    +This is the abstract implementation that serves as base class for scanning a directory
for files and read the files one by one. This class doesn’t have any ports.
    --- End diff --
    Add an overview here of what the operator does and which parts must be implemented
by concrete subclasses, and show code fragments, for example from the LineByLine operator. Describe
what the DirectoryScanner does. Some obvious questions that will come up in the reader's mind
should be answered, for example:
    1. What happens if a file that has already been processed has new data appended to it?
    2. What happens if a processed file is deleted?
    3. What happens if a separate actor is in the process of writing a new file in the monitored
directory? How will this operator know when the file is ready to be read?
    4. What if the number of files in the scanned directory grows over time to be very large?
What impact, if any, does that have on this operator?
    5. Can this operator be used to monitor multiple directories? (Point people to the fileIO-multidir
    6. This class already implements the Partitioner interface; can a custom partitioner be
set on this operator? More generally, since we say this operator supports dynamic
partitioning, it would be useful to describe (with a code fragment if possible) how to trigger it.
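
To give a sense of the kind of fragment that would help, here is a minimal plain-Java sketch of
two behaviours the quoted text describes: skipping already processed files and fast forwarding
to the saved offset after a failure, and a scanner that accepts only the files belonging to its
partition. It does not use the Malhar API. The class, field and method names are all hypothetical,
and the hash-modulo partitioning rule is an assumption rather than a statement of what
DirectoryScanner actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: none of these names come from Malhar. */
final class FileInputSketch {

  /** Hypothetical checkpoint state: fully read files plus the position in the file in progress. */
  static final class Checkpoint {
    final Set<String> processedFiles = new HashSet<>();
    String currentFile; // null when no file is in progress
    int offset;         // number of lines already emitted from currentFile
  }

  /** Assumed rule: a partition accepts a path only if the path's hash maps to its index. */
  static boolean acceptFile(String path, int partitionIndex, int partitionCount) {
    return Math.floorMod(path.hashCode(), partitionCount) == partitionIndex;
  }

  /** Returns the lines of the given file that still have to be emitted after restoring cp. */
  static List<String> remainingLines(String file, List<String> lines, Checkpoint cp) {
    if (cp.processedFiles.contains(file)) {
      return new ArrayList<>(); // already processed before the failure: skip entirely
    }
    // Fast forward only for the file that was being read when the checkpoint was taken.
    int start = file.equals(cp.currentFile) ? Math.min(cp.offset, lines.size()) : 0;
    return new ArrayList<>(lines.subList(start, lines.size()));
  }

  public static void main(String[] args) {
    Checkpoint cp = new Checkpoint();
    cp.processedFiles.add("a.txt");
    cp.currentFile = "b.txt";
    cp.offset = 2;
    List<String> lines = Arrays.asList("l1", "l2", "l3");
    System.out.println(remainingLines("a.txt", lines, cp)); // []
    System.out.println(remainingLines("b.txt", lines, cp)); // [l3]
    System.out.println(remainingLines("c.txt", lines, cp)); // [l1, l2, l3]
  }
}
```

A fragment taken from the real operator would be preferable in the final document; the sketch
only shows the level of detail that would answer the questions above.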

> Add documentation for FileSystem Input Operator
> -----------------------------------------------
>                 Key: APEXMALHAR-2184
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2184
>             Project: Apache Apex Malhar
>          Issue Type: Documentation
>            Reporter: Priyanka Gugale
>            Assignee: Priyanka Gugale
>            Priority: Minor

This message was sent by Atlassian JIRA
