beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3030) watchForNewFiles() can emit a file multiple times if it's growing
Date Wed, 29 Nov 2017 01:13:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269863#comment-16269863 ]

Eugene Kirpichov commented on BEAM-3030:
----------------------------------------

Fix in https://github.com/apache/beam/pull/4190

> watchForNewFiles() can emit a file multiple times if it's growing
> -----------------------------------------------------------------
>
>                 Key: BEAM-3030
>                 URL: https://issues.apache.org/jira/browse/BEAM-3030
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>             Fix For: 2.3.0
>
>
> TextIO and AvroIO watchForNewFiles(), as well as FileIO.match().continuously(), use the Watch
> transform under the hood and watch the set of Metadata matching a filepattern.
> Two Metadata objects with the same filename but different sizes are not considered equal, so
> if these transforms observe the same file multiple times with different sizes, they'll read
> the file multiple times.
> This is likely not yet a problem for production users: these features require
> SDF, which is supported only in the Dataflow runner, and users of the Dataflow runner are likely to
> use only files on GCS, which doesn't support appends. However, this still needs to be fixed.
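
The duplicate-emission behaviour described above can be illustrated with a small, self-contained sketch. It does not use Beam: FileMeta is a hypothetical stand-in for Metadata, and emitNew is a hypothetical helper that mimics "emit anything not seen before". Keying the seen-set by filename is shown only as one possible remedy and is not necessarily what the linked pull request does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

// Illustrative sketch only; none of these names come from the Beam SDK.
class WatchDedupSketch {

  // Hypothetical stand-in for Metadata: equality covers both name and size.
  static final class FileMeta {
    final String name;
    final long sizeBytes;

    FileMeta(String name, long sizeBytes) {
      this.name = name;
      this.sizeBytes = sizeBytes;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof FileMeta)) {
        return false;
      }
      FileMeta other = (FileMeta) o;
      return name.equals(other.name) && sizeBytes == other.sizeBytes;
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, sizeBytes);
    }
  }

  // Emits every observation whose dedup key has not been seen in an earlier poll.
  static <K> List<FileMeta> emitNew(List<FileMeta> observations, Function<FileMeta, K> keyFn) {
    Set<K> seen = new HashSet<>();
    List<FileMeta> emitted = new ArrayList<>();
    for (FileMeta m : observations) {
      if (seen.add(keyFn.apply(m))) {
        emitted.add(m);
      }
    }
    return emitted;
  }

  public static void main(String[] args) {
    // The same growing file is observed twice with different sizes.
    List<FileMeta> polls = Arrays.asList(
        new FileMeta("logs/a.txt", 100),
        new FileMeta("logs/a.txt", 250),
        new FileMeta("logs/b.txt", 40));

    // Keyed by the whole object, the growing file is emitted twice.
    System.out.println("by metadata: " + emitNew(polls, m -> m).size());
    // Keyed by filename only, each file is emitted once.
    System.out.println("by filename: " + emitNew(polls, m -> m.name).size());
  }
}
```

With whole-object equality the growing file logs/a.txt is emitted for each distinct size, whereas keying by filename emits it once, at the cost of ignoring later growth.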



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
