hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously
Date Thu, 08 Feb 2018 09:34:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356722#comment-16356722 ]

Wei-Chiu Chuang commented on HDFS-13117:

{quote}at least the time between the last block of the first replication and the last block of the last replication can be saved.{quote}
Maybe. But the latency is less than 1 ms. Note that data are written in 512-byte chunks, not in blocks, so you basically save almost nothing.
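To put a rough number on the "less than 1 ms" claim, here is a back-of-envelope calculation (my own figures, not from the thread): the time for a final 512-byte chunk to cross one extra pipeline hop, assuming a 1 Gbit/s link and ignoring per-packet overhead.

```java
// Back-of-envelope sketch: transfer time of the last 512-byte chunk over one
// extra pipeline hop. The 1 Gbit/s link speed is an assumption for illustration.
public class LastChunkLatency {
    // Transfer time in microseconds for `bytes` at `gbps` gigabits per second.
    static double transferMicros(int bytes, double gbps) {
        return bytes * 8 / (gbps * 1e9) * 1e6;
    }

    public static void main(String[] args) {
        double micros = transferMicros(512, 1.0); // 512-byte chunk, 1 Gbit/s
        // About 4 microseconds, i.e. orders of magnitude below 1 ms.
        System.out.printf("last chunk: %.3f us%n", micros);
    }
}
```

Even multiplied across a few pipeline hops, the serialization delay of the final chunk stays in the microsecond range, which is why skipping it buys essentially nothing.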


If you really want to test the performance of your approach, try creating a file with replication=1, and then use FileSystem.setReplication() to make it 3-replica. The NameNode will then schedule the extra replication asynchronously. I don't think you'll notice much difference compared to writing a file with replication=3.
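The suggested experiment can be sketched with the standard Hadoop FileSystem API roughly as below. This is not runnable standalone: it needs a live HDFS cluster on the classpath and in the configuration, and the path and payload size are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the experiment: write with replication=1 (blocking only on one
// replica), then raise the replication factor and let the NameNode schedule
// the extra replicas in the background.
public class AsyncReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/async-replication-demo"); // illustrative path

        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(file, (short) 1)) {
            out.write(new byte[128 * 1024 * 1024]); // 128 MB test payload
        }
        System.out.printf("write with replication=1: %d ms%n",
                (System.nanoTime() - start) / 1_000_000);

        // Returns quickly; re-replication to 3 happens asynchronously.
        fs.setReplication(file, (short) 3);
    }
}
```

Timing this against a plain `fs.create(file, (short) 3)` write of the same payload would show how much, if anything, the asynchronous approach saves.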

> Proposal to support writing replications to HDFS asynchronously
> ---------------------------------------------------------------
>                 Key: HDFS-13117
>                 URL: https://issues.apache.org/jira/browse/HDFS-13117
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: xuchuanyin
>            Priority: Major
> My initial question was as below:
> ```
> I've learned that when we write data to HDFS using an interface provided by HDFS such as FileSystem.create, our client blocks until all the blocks and their replications are done. This causes an efficiency problem if we use HDFS as our final data storage. Many of my colleagues write the data to local disk in the main thread and copy it to HDFS in another thread. Obviously, this increases the disk I/O.
> So, is there a way to optimize this usage? I don't want to increase the disk I/O, nor do I want to be blocked during the writing of extra replications.
> How about writing to HDFS with only one replication in the main thread and setting the actual replication factor in another thread? Or is there a better way to do this?
> ```
> So my proposal here is to support writing extra replications to HDFS asynchronously. Users can set a minimum replication factor as the acceptable number of replications (less than the default or expected replication factor). When writing to HDFS, the user will only be blocked until the minimum replication factor has been satisfied, and HDFS will complete the extra replications in the background. Since HDFS periodically checks the integrity of all replications, we can also leave this work to HDFS itself.
> There are two ways to provide the interfaces:
> 1. Create a series of interfaces by adding an `acceptableReplication` parameter to the current interfaces, as below:
> ```
> Before:
> FSDataOutputStream create(Path f,
>   boolean overwrite,
>   int bufferSize,
>   short replication,
>   long blockSize
> ) throws IOException
> After:
> FSDataOutputStream create(Path f,
>   boolean overwrite,
>   int bufferSize,
>   short replication,
>   short acceptableReplication, // minimum number of replication to finish before return
>   long blockSize
> ) throws IOException
> ```
> 2. Add `acceptableReplication` and `asynchronous` options to the runtime (or default) configuration, so users will not have to change any interface and will still benefit from this feature.
> What do you think about this?

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
