hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3107) HDFS truncate
Date Wed, 01 Oct 2014 20:54:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155499#comment-14155499 ]

Colin Patrick McCabe commented on HDFS-3107:
--------------------------------------------

bq. \[The boolean\] is an optimization for the case when truncate happens on the block boundary.
Clients will save one RPC call in this particular case. From NameNode perspective returning
the boolean does not require any extra processing.

Fair enough.  Can we change it to return an enum of { {{TRUNCATE_IN_PROGRESS}}, {{TRUNCATE_COMPLETED}}
}?  A lot of developers just don't read documentation and will assume "it returned false,
that means it failed."
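
A minimal sketch of what such a return type might look like. The enum name {{TruncateResult}}, the helper methods, and the mapping from the existing boolean (true meaning "already complete") are assumptions for illustration, not part of any attached patch:

```java
// Hypothetical sketch only; this is not the committed HDFS API.
enum TruncateResult {
    // Truncate landed on a block boundary, so nothing is left to finish.
    TRUNCATE_COMPLETED,
    // Truncate was accepted but is still being finished; the call did NOT fail.
    TRUNCATE_IN_PROGRESS;

    // Assumption: in the boolean version, true means "already complete".
    static TruncateResult fromBoolean(boolean completed) {
        return completed ? TRUNCATE_COMPLETED : TRUNCATE_IN_PROGRESS;
    }

    // True when the caller should not yet treat the truncate as finished.
    boolean mustWait() {
        return this == TRUNCATE_IN_PROGRESS;
    }
}
```

With named constants, a caller writing {{if (result == TruncateResult.TRUNCATE_IN_PROGRESS)}} is much less likely to mistake the in-progress case for an error than one writing {{if (!result)}}.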

bq. \[concurrent readers discussion\]

It's true that concurrent unlink has the same problem with making readers get mysterious IOExceptions.
I don't consider this a good behavior to copy, though (it's more like a bug).  Perhaps in
a follow-up change, we could add some facility for clients to ask the NN for information about
whether the file has been unlinked or truncated after an unrecoverable IOException happened
during a read?  That's probably better as a follow-up, though.
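
As a rough illustration of the kind of follow-up facility meant here, the sketch below classifies a read failure from two pieces of information a client could plausibly obtain: the file length it saw when it opened the file, and the length the NN reports now (empty if the file no longer exists). Everything in it is hypothetical, including the {{ReadFailureCause}} name and the {{diagnose}} method; no such call exists in HDFS today, and a length comparison alone is only a heuristic (a truncate followed by an append could hide the truncate).

```java
// Hypothetical sketch only; no such facility exists in HDFS.
enum ReadFailureCause {
    FILE_UNLINKED,   // the NN no longer knows the file
    FILE_TRUNCATED,  // the file is now shorter than when the reader opened it
    UNKNOWN;         // neither check explains the failure

    // lengthAtOpen:  file length the reader recorded at open time.
    // currentLength: length reported by the NN now, or empty if the file is gone.
    static ReadFailureCause diagnose(long lengthAtOpen,
                                     java.util.OptionalLong currentLength) {
        if (!currentLength.isPresent()) {
            return FILE_UNLINKED;
        }
        if (currentLength.getAsLong() < lengthAtOpen) {
            return FILE_TRUNCATED;
        }
        return UNKNOWN;
    }
}
```

A reader could call something like this from its IOException handler and report "file was truncated" or "file was deleted" instead of surfacing an opaque read error.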

bq. as HDFS-7056 indicated, it will take more time to come up with design and implementation
of the complementary functionality that would extend truncate to snapshotted files

Clearly, we're still at the stage where the design isn't complete.  This is exactly what feature
branches are for.

bq. in its current form, this is an extremely useful self-contained feature that allows various
vendors of solutions running on Hadoop to build products that have a much easier time running
on HDFS.... we all know that features sitting in a branch don't get exposed to commercial
distributions and workloads as much as the ones hitting trunk do. This is, of course, a totally
right approach to features that are half-baked or not self-contained, but it feels that in
this particular case committing the patch would benefit us all by giving customers access
to the self-contained feature AND start receiving feedback for the more extended functionality
much earlier.

Most users of commercial distros are using snapshots, so even if we pulled this into 2.6 and
then into the commercial releases based on it, the feature still wouldn't get tested.  And
anyway it's not going to make 2.6, so let's not get an artificial sense of urgency here. 
We have time to get the design right and test it.

> HDFS truncate
> -------------
>
>                 Key: HDFS-3107
>                 URL: https://issues.apache.org/jira/browse/HDFS-3107
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Lei Chang
>            Assignee: Plamen Jeliazkov
>         Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch,
HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf,
HDFS_truncate_semantics_Mar21.pdf, editsStored
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the underlying storage
when a transaction is aborted. Currently HDFS does not support truncate (a standard POSIX
operation), which is the reverse of append. This forces upper-layer applications to use
ugly workarounds (such as keeping track of the discarded byte range per file in a separate
metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome
this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
