hadoop-hdfs-issues mailing list archives

From "xuchuanyin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously
Date Wed, 07 Feb 2018 01:51:00 GMT
xuchuanyin created HDFS-13117:
---------------------------------

             Summary: Proposal to support writing replications to HDFS asynchronously
                 Key: HDFS-13117
                 URL: https://issues.apache.org/jira/browse/HDFS-13117
             Project: Hadoop HDFS
          Issue Type: New Feature
            Reporter: xuchuanyin


My initial question was as below:

```

I've learned that when we write data to HDFS using an interface provided by HDFS such as
'FileSystem.create', the client blocks until all the blocks and their replicas have been
written. This causes an efficiency problem if we use HDFS as our final data storage. Many
of my colleagues therefore write the data to local disk in the main thread and copy it to
HDFS in another thread. Obviously, this increases the disk I/O.

So, is there a way to optimize this usage? I don't want to increase the disk I/O, nor do
I want to be blocked while the extra replicas are written.

How about writing to HDFS with only one replica specified in the main thread and setting
the actual replication factor in another thread? Or is there a better way to do this?

```
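
For reference, a minimal sketch of the workaround described above, assuming a standard Hadoop
client on the classpath; the path, payload, and replication factors are illustrative only:

```
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteThenReplicate {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/tmp/example.dat"); // illustrative path

    // Write with a single replica so the main thread only waits for one pipeline.
    try (OutputStream out = fs.create(path, true, 4096, (short) 1, 128 * 1024 * 1024L)) {
      out.write("some data".getBytes(StandardCharsets.UTF_8));
    }

    // Raise the replication factor from another thread; the NameNode then
    // schedules the extra replicas in the background.
    new Thread(() -> {
      try {
        fs.setReplication(path, (short) 3);
      } catch (Exception e) {
        e.printStackTrace();
      }
    }).start();
  }
}
```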

 

So my proposal here is to support writing the extra replicas to HDFS asynchronously. The user
can set a minimum acceptable replication factor (lower than the default or expected replication
factor). When writing to HDFS, the user is blocked only until the minimum number of replicas has
been written, and HDFS completes the extra replicas in the background. Since HDFS periodically
checks the integrity of all replicas anyway, we can leave this work to HDFS itself.
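
If the write returns after only the minimum replicas are in place, a client that later needs to
confirm full replication could poll the block locations. A rough sketch using existing
FileSystem APIs (the path and expected factor are illustrative):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  /** Returns true once every block of the file has at least 'expected' replicas. */
  static boolean fullyReplicated(FileSystem fs, Path path, short expected) throws Exception {
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      if (block.getHosts().length < expected) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(fullyReplicated(fs, new Path("/tmp/example.dat"), (short) 3));
  }
}
```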

 

There are two ways to provide the interface:

1. Create a series of interfaces by adding an `acceptableReplication` parameter to the current
ones, as below (a usage sketch follows the signatures):

```
Before:

FSDataOutputStream create(Path f,
    boolean overwrite,
    int bufferSize,
    short replication,
    long blockSize
) throws IOException

After:

FSDataOutputStream create(Path f,
    boolean overwrite,
    int bufferSize,
    short replication,
    short acceptableReplication, // minimum number of replicas to finish before returning
    long blockSize
) throws IOException
```
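
For illustration, here is how a client might call the proposed overload. The
`acceptableReplication` parameter does not exist in current HDFS; this is only a sketch of the
intended semantics:

```
// Hypothetical usage of the proposed overload; 'acceptableReplication' is part of
// this proposal and not a parameter in current HDFS.
FileSystem fs = FileSystem.get(new Configuration());
Path path = new Path("/tmp/example.dat"); // illustrative path

try (FSDataOutputStream out = fs.create(
    path,
    true,                   // overwrite
    4096,                   // bufferSize
    (short) 3,              // replication: expected final number of replicas
    (short) 1,              // acceptableReplication: unblock once this many replicas are written
    128 * 1024 * 1024L)) {  // blockSize
  out.writeBytes("some data");
}
// close() returns once one replica is durable; HDFS completes the remaining
// two replicas in the background.
```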

 

2. Add `acceptableReplication` and `asynchronous` settings to the runtime (or default)
configuration, so that users do not have to change any interface and still benefit from this
feature (a configuration sketch follows).
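
As a sketch of option 2, the client configuration might look something like the following. The
property names are made up for illustration and are not existing HDFS keys:

```
// Hypothetical configuration keys for option 2 -- the names are illustrative
// only and do not exist in current HDFS.
Configuration conf = new Configuration();
conf.setBoolean("dfs.client.write.asynchronous", true);
conf.setInt("dfs.client.write.acceptable-replication", 1);

// Existing create() calls remain unchanged; the client would unblock once the
// acceptable number of replicas is written and HDFS would finish the rest.
FileSystem fs = FileSystem.get(conf);
```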

 

What do you think about this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


