hbase-user mailing list archives

From Aaron Beppu <abe...@siftscience.com>
Subject Efficient use of buffered writes in a post-HTablePool world?
Date Thu, 18 Dec 2014 02:44:45 GMT
Hi All,

TL;DR: in the absence of HTablePool, if HTable instances are short-lived,
how should clients use buffered writes?

I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
(CDH5.2). One issue I’m confused by is how to effectively use buffered
writes now that HTablePool has been deprecated[1].

In our 0.94 code, a pathway could get a table from the pool, configure it
with table.setAutoFlush(false); and write Puts to it. Those writes would
then go to the table instance’s writeBuffer, and those writes would only be
flushed when the buffer was full, or when we were ready to close out the
pool. We were intentionally choosing to have fewer, larger writes from the
client to the cluster, and we knew we were giving up a degree of safety in
exchange (i.e. if the client dies after it’s accepted a write but before
the flush for that write occurs, the data is lost). This seems to be
generally considered a reasonable choice (cf. the HBase Book [2], section 14.8.4).

However, in the 0.98 world, without HTablePool, the endorsed pattern [3]
seems to be to create a new HTable via
table = stashedHConnection.getTable(tableName, myExecutorService). But even
if we do table.setAutoFlush(false), because that table instance is
short-lived, its buffer never gets full. We’ll create a table instance,
write a put to it, try to close the table, and the close call will trigger
a (synchronous) flush. Thus, not having HTablePool seems like it would
cause us to have many more small writes from the client to the cluster, and
basically wipe out the advantage of turning off autoflush.

More concretely:

// Given these two helpers ...

private HTableInterface getAutoFlushTable(String tableName) throws IOException {
  // (autoflush is true by default)
  return storedConnection.getTable(tableName, executorService);
}

private HTableInterface getBufferedTable(String tableName) throws IOException {
  HTableInterface table = getAutoFlushTable(tableName);
  table.setAutoFlush(false);
  return table;
}

// It's my contention that these two methods would behave almost identically,
// except the first will hit a synchronous flush during the put call, and
// the second will flush during the (hidden) close call on table.

private void writeAutoFlushed(Put somePut) throws IOException {
  try (HTableInterface table = getAutoFlushTable(tableName)) {
    table.put(somePut); // will do synchronous flush
  }
}

private void writeBuffered(Put somePut) throws IOException {
  try (HTableInterface table = getBufferedTable(tableName)) {
    table.put(somePut);
  } // auto-close will trigger synchronous flush
}

It seems like the only way to avoid this is to have long-lived HTable
instances, which get reused for multiple writes. However, since the actual
writes are driven from highly concurrent code, and since HTable is not
threadsafe, this would involve having a number of HTable instances, and a
control mechanism for leasing them out to individual threads safely. Except
at this point it seems like we will have recreated HTablePool, which
suggests that we’re doing something deeply wrong.
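
To make that concrete, here is a minimal sketch of the kind of leasing
mechanism I mean. It deliberately avoids the HBase API: BufferedSink is a
hypothetical stand-in for a non-threadsafe HTable with autoflush off, and
SinkLeaser is an invented name for the control mechanism. Neither exists in
HBase, and the "flush" here only counts batches instead of making an RPC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a non-threadsafe HTable with autoflush off.
// It makes no HBase calls; it only counts how many flushes would happen.
class BufferedSink {
  private final List<String> buffer = new ArrayList<>();
  private final int capacity;
  private int flushCount = 0;

  BufferedSink(int capacity) {
    this.capacity = capacity;
  }

  void put(String row) {
    buffer.add(row);
    if (buffer.size() >= capacity) {
      flush(); // buffer full: one large write instead of many small ones
    }
  }

  void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    buffer.clear();
    flushCount++;
  }

  int getFlushCount() {
    return flushCount;
  }
}

// Hypothetical leasing mechanism: a fixed set of long-lived sinks, each
// held by at most one thread at a time.
class SinkLeaser {
  private final BlockingQueue<BufferedSink> idle;
  private final List<BufferedSink> all = new ArrayList<>();

  SinkLeaser(int instances, int bufferCapacity) {
    idle = new ArrayBlockingQueue<>(instances);
    for (int i = 0; i < instances; i++) {
      BufferedSink sink = new BufferedSink(bufferCapacity);
      all.add(sink);
      idle.add(sink);
    }
  }

  // Lease a sink, buffer one write, and return the sink WITHOUT flushing.
  void write(String row) {
    BufferedSink sink;
    try {
      sink = idle.take(); // blocks until some sink is free
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a sink", e);
    }
    try {
      sink.put(row);
    } finally {
      idle.add(sink); // cannot overflow: queue capacity equals sink count
    }
  }

  // Final flush at shutdown. Callers must ensure no writers are still active.
  int closeAndCountFlushes() {
    int total = 0;
    for (BufferedSink sink : all) {
      sink.flush();
      total += sink.getFlushCount();
    }
    return total;
  }
}
```

With one sink and a buffer capacity of 10, 25 calls to write followed by
closeAndCountFlushes amount to 3 flushes rather than 25, which is the
batching we want. Structurally, though, this is a table pool again.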

What am I missing here? Since the HTableInterface.setAutoFlush method still
exists, it must be anticipated that users will still want to buffer writes.
What’s the recommended way to actually buffer a meaningful number of
writes, from a multithreaded context, that doesn’t just amount to creating
a table pool?

Thanks in advance,
Aaron

[1] https://issues.apache.org/jira/browse/HBASE-6580
[2] http://hbase.apache.org/book/perf.writing.html
[3] https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
