beam-commits mailing list archives

From "Tim Robertson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
Date Wed, 29 Aug 2018 12:21:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596233#comment-16596233 ]

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:20 PM:
---------------------------------------------------------------

The changes (yet to be merged) to rename() in BEAM-4861 now create directories if missing,
but also surface an exception if the underlying operation reports that it did not complete.

This means rename() will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
to hdfs://ha-nn/tmp/es-2012.txt-00000-of-00045. No further information provided by underlying
filesystem.
	at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
	at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
	at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
	at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to Hadoop operations
failing if the output already exists, so for me it is correct to fail if the output exists.
I'd rather be forced to delete manually than be able to accidentally overwrite TBs of data.
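
To make the two behaviours concrete, here is a minimal sketch using plain java.nio on a local filesystem. It is only a stand-in for the semantics being discussed, not Beam's FileSystems.rename() or the Hadoop API; the class and method names (RenameSemantics, strictRename, overwritingRename) are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration class; not part of Beam.
public class RenameSemantics {

  /** Moves src to dst and fails if dst already exists (no REPLACE_EXISTING option). */
  static void strictRename(Path src, Path dst) throws IOException {
    Files.move(src, dst);
  }

  /** Moves src to dst, silently replacing dst if it is already present. */
  static void overwritingRename(Path src, Path dst) throws IOException {
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("rename-demo");
    Path temp = Files.write(dir.resolve("temp-shard"), "new".getBytes());
    Path out = Files.write(dir.resolve("out-00000-of-00001"), "old".getBytes());

    try {
      // The target exists, so the strict variant refuses and leaves both files untouched.
      strictRename(temp, out);
    } catch (FileAlreadyExistsException e) {
      System.out.println("strict rename refused to overwrite: " + e.getFile());
    }

    // The overwriting variant replaces the existing output without any warning.
    overwritingRename(temp, out);
    System.out.println("output now contains: " + new String(Files.readAllBytes(out)));
  }
}
```

The strict variant corresponds to the fail-if-exists behaviour argued for above; the overwriting variant mirrors what the copy()-based implementation effectively did.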


was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now create directories if missing,
but also surface an exception if the underlying operation reports that it did not complete.

This means rename() will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
to hdfs://ha-nn/tmp/es-2012.txt-00000-of-00045. No further information provided by underlying
filesystem.
	at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
	at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
	at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
	at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to Hadoop operations
failing if the output already exists so for me it sounds wrong - I'd rather be forced to delete
manually than accidentally be able to overwrite TBs of data.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement move by copy+delete.
> It would be better to use a rename(), which can be much more efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this for the
> case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaKnuPZD_qdh_QDm9VXLLsZw@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
