hadoop-common-issues mailing list archives

From "Duo Xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-12403) Enable multiple writes in flight for HBase WAL writing
Date Thu, 10 Sep 2015 20:06:46 GMT
Duo Xu created HADOOP-12403:

             Summary: Enable multiple writes in flight for HBase WAL writing
                 Key: HADOOP-12403
                 URL: https://issues.apache.org/jira/browse/HADOOP-12403
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools
            Reporter: Duo Xu
            Assignee: Duo Xu

Azure HDI HBase clusters use Azure blob storage as their file system. We found that the
bottleneck is writing to the write-ahead log (WAL). The latest HBase WAL write model (HBASE-8755)
uses multiple AsyncSyncer threads to sync data to HDFS. However, our WASB driver is still
based on a single-threaded model, so when the sync threads call into the WASB layer, only
one thread at a time is allowed to send data to Azure storage. This JIRA introduces a
new write model in the WASB layer that allows multiple writes in parallel.
1. Since we use page blobs for the WAL, this will leave "holes" in the page blob, because every
write starts on a new page. We use the first two bytes of every page to record the actual data
size of that page.
2. When reading the WAL, we need to know its actual size. This should be the sum of
the numbers stored in the first two bytes of every page. However, looping over every page
to compute the size would be very slow, so during writing the writer threads keep updating
a blob metadata entry called "total_data_uploaded".
3. Although we allow multiple writes in flight, we need to make sure that the sync threads which
call into the WASB layer return in order. Reading the HBase source code (FSHLog.java), we find that
every sync request is associated with a transaction id. If the sync succeeds, all the transactions
before this transaction id are assumed to be in Azure Storage. We use a queue to store the
sync requests and make sure they return to the HBase layer in order.
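
The three points above can be illustrated with a minimal Java sketch. It is not the
WASB implementation: the class and method names (WalPageSketch, encode, dataSize,
OrderedAcks) are hypothetical, the big-endian byte order of the two-byte length is an
assumption, and transaction ids are assumed to be consecutive. The 512-byte page size
is the page size of Azure page blobs. dataSize shows the slow page-by-page scan that a
running total such as "total_data_uploaded" is meant to avoid.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class WalPageSketch {
    static final int PAGE_SIZE = 512;                  // Azure page blobs use 512-byte pages
    static final int HEADER_SIZE = 2;                  // first two bytes hold the data length
    static final int MAX_DATA = PAGE_SIZE - HEADER_SIZE;

    /** Packs one write into whole pages; every page starts with its data length. */
    static byte[] encode(byte[] data) {
        int pages = Math.max(1, (data.length + MAX_DATA - 1) / MAX_DATA);
        ByteBuffer out = ByteBuffer.allocate(pages * PAGE_SIZE);
        for (int p = 0; p < pages; p++) {
            int off = p * MAX_DATA;
            int len = Math.min(MAX_DATA, data.length - off);
            out.position(p * PAGE_SIZE);
            out.putShort((short) len);                 // assumption: big-endian length
            out.put(data, off, len);                   // rest of the page stays zero (the "hole")
        }
        return out.array();
    }

    /** Slow path: sums the two-byte headers of every page in the blob. */
    static long dataSize(byte[] blob) {
        ByteBuffer in = ByteBuffer.wrap(blob);
        long total = 0;
        for (int pos = 0; pos + PAGE_SIZE <= blob.length; pos += PAGE_SIZE) {
            total += in.getShort(pos) & 0xFFFF;        // read the length as unsigned
        }
        return total;
    }

    /** Releases sync acknowledgements strictly in transaction-id order. */
    static class OrderedAcks {
        private final TreeSet<Long> done = new TreeSet<>();
        private long nextToAck;                        // assumption: ids are consecutive

        OrderedAcks(long firstTxid) {
            nextToAck = firstTxid;
        }

        /** Records a finished upload and returns the ids that may now be acknowledged. */
        synchronized List<Long> complete(long txid) {
            done.add(txid);
            List<Long> ready = new ArrayList<>();
            while (done.remove(nextToAck)) {           // stop at the first id still in flight
                ready.add(nextToAck++);
            }
            return ready;
        }
    }

    public static void main(String[] args) {
        byte[] first = encode(new byte[700]);          // two pages: 510 + 190 data bytes
        byte[] second = encode(new byte[100]);         // one page, starts on a new page
        byte[] blob = new byte[first.length + second.length];
        System.arraycopy(first, 0, blob, 0, first.length);
        System.arraycopy(second, 0, blob, first.length, second.length);
        System.out.println(dataSize(blob));            // prints 800

        OrderedAcks acks = new OrderedAcks(1);
        System.out.println(acks.complete(2));          // prints [] because id 1 is still in flight
        System.out.println(acks.complete(1));          // prints [1, 2]
    }
}
```

In this sketch the blob occupies 1536 bytes while holding only 800 bytes of WAL data,
which is why the reader needs the per-page lengths or a stored total. The upload for
transaction 2 finishes first, but its acknowledgement is held back until transaction 1
has also completed.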

This message was sent by Atlassian JIRA
