nifi-issues mailing list archives

From "Alessandro D'Armiento (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NIFI-6462) ListHDFS should be triggerable
Date Mon, 22 Jul 2019 12:44:00 GMT

     [ https://issues.apache.org/jira/browse/NIFI-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro D'Armiento updated NIFI-6462:
----------------------------------------
    Description: 
h2. Current Situation

ListHDFS is designed to be (only) the entry point of a data integration pipeline, and therefore
can only be run on a cron or timer-driven schedule.
h2. Improvement Proposal

ListHDFS should be usable as part of a pipeline even when it is not the entry point.
To achieve this:
 * It has to be triggerable by an incoming flowfile
 * The trigger flowfile should be able to carry the listing directory as an attribute
 * Some logic, such as "skip the last file in the listing directory", should be made optional
 ** If you trigger ListHDFS knowing that the job which writes to the listing directory
has finished, it is pointless to hold a file back for the next execution
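
The proposal above can be sketched outside NiFi as follows. This is only an illustration in Python, not the processor's code: the attribute name listing.directory, the FileEntry type, and both function names are hypothetical, and "the last file" is interpreted here as every entry carrying the newest modification time, which is an assumption.

```python
from typing import Dict, List, NamedTuple


class FileEntry(NamedTuple):
    path: str
    modified: int  # modification time, e.g. epoch milliseconds


def resolve_directory(attributes: Dict[str, str], default_dir: str,
                      attr_name: str = "listing.directory") -> str:
    """Use the directory carried by the trigger flowfile, else the configured default.

    attr_name is a hypothetical attribute name chosen for this sketch.
    """
    return attributes.get(attr_name) or default_dir


def select_files(entries: List[FileEntry], skip_latest: bool) -> List[FileEntry]:
    """Return the entries to emit, oldest first.

    With skip_latest=True, entries with the newest modification time are held
    back (they may still be written). With skip_latest=False, everything is emitted.
    """
    if not entries or not skip_latest:
        return sorted(entries, key=lambda e: e.modified)
    newest = max(e.modified for e in entries)
    return sorted((e for e in entries if e.modified < newest),
                  key=lambda e: e.modified)
```

A triggered run that knows the upstream job has finished would call select_files(entries, skip_latest=False), while a scheduled run would keep skip_latest=True.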

  was:
h2. Current Situation

ListHDFS is designed to be (only) the entry point of a data integration pipeline, and therefore
can only be run on a cron or timer-driven schedule.
h2. Improvement Proposal

ListHDFS should be usable as part of a pipeline even when it is not the entry point.
To achieve this:
 * It has to be triggerable by an incoming flowfile
 * The trigger flowfile should be able to carry the listing directory as an attribute
 * Some logic, such as "skip the last file in the listing directory", should be made optional
 * Since the processor will work with 1:N semantics (1 input trigger flowfile, N output flowfiles),
it would be nice to support fragmentation attributes (for example, for subsequent merge operations)
 ** It would also be useful to support different fragmentation strategies, in order to cover
multiple use cases. For example, it should be possible to select:
 *** A "one for all" fragmentation strategy, which creates a single fragmentation group.
All flowfiles then have the same fragment.identifier, the same fragment.count, equal
to the total number N of listed files, and fragment.index ∈ [0, N).
 *** A "per subdir" fragmentation strategy, which creates a separate fragmentation group
for each scanned subdirectory of the given path. The flowfiles of each subdirectory then share
a fragment.identifier specific to that subdirectory, fragment.count is equal
to the number Ni of files in the i-th directory, and fragment.index ∈ [0, Ni).
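
The two fragmentation strategies can be illustrated with a minimal Python sketch. It is not NiFi code: the function name, the strategy labels one_for_all and per_subdir, and the use of a random UUID as fragment.identifier are assumptions made for the example. Only the three fragment.* attribute names come from the text above.

```python
import posixpath
import uuid
from collections import defaultdict
from typing import Dict, List


def fragment_attributes(paths: List[str], strategy: str = "one_for_all") -> List[Dict[str, str]]:
    """Assign fragment.* attributes to listed files (hypothetical helper).

    one_for_all: a single group holding all N files.
    per_subdir:  one group per parent directory, each holding Ni files.
    """
    if strategy not in ("one_for_all", "per_subdir"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Group paths; dicts keep insertion order, so groups appear in listing order.
    groups = defaultdict(list)
    for path in paths:
        key = "" if strategy == "one_for_all" else posixpath.dirname(path)
        groups[key].append(path)

    result = []
    for members in groups.values():
        identifier = str(uuid.uuid4())  # shared by every flowfile of the group
        for index, path in enumerate(members):  # index runs over [0, len(members))
            result.append({
                "path": path,
                "fragment.identifier": identifier,
                "fragment.index": str(index),
                "fragment.count": str(len(members)),
            })
    return result
```

For the paths /in/a/1, /in/a/2 and /in/b/3, the one_for_all strategy yields one group with fragment.count 3, while per_subdir yields two groups with counts 2 and 1.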


> ListHDFS should be triggerable
> ------------------------------
>
>                 Key: NIFI-6462
>                 URL: https://issues.apache.org/jira/browse/NIFI-6462
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.9.2
>            Reporter: Alessandro D'Armiento
>            Priority: Minor
>
> h2. Current Situation
> ListHDFS is designed to be (only) the entry point of a data integration pipeline, and therefore can only be run on a cron or timer-driven schedule.
> h2. Improvement Proposal
> ListHDFS should be usable as part of a pipeline even when it is not the entry point. To achieve this:
>  * It has to be triggerable by an incoming flowfile
>  * The trigger flowfile should be able to carry the listing directory as an attribute
>  * Some logic, such as "skip the last file in the listing directory", should be made optional
>  ** If you trigger ListHDFS knowing that the job which writes to the listing directory has finished, it is pointless to hold a file back for the next execution



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
