hbase-user mailing list archives

From Nick Dimiduk <ndimi...@apache.org>
Subject Re: Efficient use of buffered writes in a post-HTablePool world?
Date Fri, 19 Dec 2014 17:00:30 GMT
Hi Aaron,

Your analysis is spot on, and I do not believe this is by design. I see that
the write buffer is owned by the table, whereas I would have expected a
buffer per table, all managed by the connection. I suggest you raise a
blocker ticket against the 1.0.0 release that's just around the corner, to
give this the attention it needs. If you're not into JIRA, let me know and I
can raise one on your behalf.

cc Lars, Enis.

Nice work Aaron.
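
For what it's worth, the shape I'd expect is roughly the following. This is
a hypothetical sketch in plain Java, not HBase API (SharedWriteBuffer and
Put here are illustrative names): a single connection-owned buffer that
short-lived table handles write into, so that closing a handle does not
force a flush.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for an HBase Put; not the real class.
class Put {
    final String row;
    Put(String row) { this.row = row; }
}

// Hypothetical connection-scoped buffer shared by lightweight table
// handles. Flushes happen when the size threshold is reached, or when
// the connection itself is closed -- not when an individual handle is.
class SharedWriteBuffer {
    private final List<Put> buffer = new ArrayList<>();
    private final int flushSize;
    int flushCount = 0; // number of actual batched writes to the cluster

    SharedWriteBuffer(int flushSize) { this.flushSize = flushSize; }

    // Called by any table handle; only flushes at the threshold.
    synchronized void add(Put p) {
        buffer.add(p);
        if (buffer.size() >= flushSize) flush();
    }

    synchronized void flush() {
        if (buffer.isEmpty()) return;
        // ... here, send the buffered Puts to the cluster in one batch ...
        buffer.clear();
        flushCount++;
    }
}

public class Demo {
    public static void main(String[] args) {
        SharedWriteBuffer buf = new SharedWriteBuffer(10);
        // Simulate 25 short-lived "table" usages, each writing one Put.
        for (int i = 0; i < 25; i++) buf.add(new Put("row" + i));
        buf.flush(); // closing the connection flushes the remainder
        System.out.println(buf.flushCount); // 3 batches instead of 25 RPCs
    }
}
```

With a buffer scoped this way, the handle lifetime and the batching policy
are decoupled, which is exactly what the old HTablePool usage gave you.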

On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <abeppu@siftscience.com> wrote:
> Hi All,
> TL;DR: in the absence of HTablePool, if HTable instances are short-lived,
> how should clients use buffered writes?
> I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> (CDH5.2). One issue I’m confused by is how to effectively use buffered
> writes now that HTablePool has been deprecated[1].
> In our 0.94 code, a pathway could get a table from the pool, configure it
> with table.setAutoFlush(false); and write Puts to it. Those writes would
> then go to the table instance’s writeBuffer, and those writes would only be
> flushed when the buffer was full, or when we were ready to close out the
> pool. We were intentionally choosing to have fewer, larger writes from the
> client to the cluster, and we knew we were giving up a degree of safety in
> exchange (i.e. if the client dies after it’s accepted a write but before
> the flush for that write occurs, the data is lost). This seems to be
> generally considered a reasonable choice (cf. the HBase Book [2], § 14.8.4).
> However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> seems to be to create a new HTable via table =
> stashedHConnection.getTable(tableName, myExecutorService). However, even if
> we do table.setAutoFlush(false), because that table instance is
> short-lived, its buffer never gets full. We’ll create a table instance,
> write a put to it, try to close the table, and the close call will trigger
> a (synchronous) flush. Thus, not having HTablePool seems like it would
> cause us to have many more small writes from the client to the cluster, and
> basically wipe out the advantage of turning off autoflush.
> More concretely:
>
> // Given these two helpers ...
> private HTableInterface getAutoFlushTable(String tableName) throws IOException {
>   // (autoflush is true by default)
>   return storedConnection.getTable(tableName, executorService);
> }
>
> private HTableInterface getBufferedTable(String tableName) throws IOException {
>   HTableInterface table = getAutoFlushTable(tableName);
>   table.setAutoFlush(false);
>   return table;
> }
>
> // ... it's my contention that these two methods would behave almost
> // identically, except the first will hit a synchronous flush during the
> // put call, and the second will flush during the (hidden) close call on
> // the table.
> private void writeAutoFlushed(Put somePut) throws IOException {
>   try (HTableInterface table = getAutoFlushTable(tableName)) {
>     table.put(somePut); // will do synchronous flush
>   }
> }
>
> private void writeBuffered(Put somePut) throws IOException {
>   try (HTableInterface table = getBufferedTable(tableName)) {
>     table.put(somePut);
>   } // auto-close will trigger synchronous flush
> }
> It seems like the only way to avoid this is to have long-lived HTable
> instances, which get reused for multiple writes. However, since the actual
> writes are driven from highly concurrent code, and since HTable is not
> threadsafe, this would involve having a number of HTable instances, and a
> control mechanism for leasing them out to individual threads safely. Except
> at this point it seems like we will have recreated HTablePool, which
> suggests that we’re doing something deeply wrong.
> What am I missing here? Since the HTableInterface.setAutoFlush method still
> exists, it must be anticipated that users will still want to buffer writes.
> What’s the recommended way to actually buffer a meaningful number of
> writes, from a multithreaded context, that doesn’t just amount to creating
> a table pool?
> Thanks in advance,
> Aaron
> [1] https://issues.apache.org/jira/browse/HBASE-6580
> [2] http://hbase.apache.org/book/perf.writing.html
> [3]
> https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
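The leasing mechanism described above (a fixed set of non-thread-safe table
instances, checked out by one thread at a time) amounts to something like
the following. Again a hypothetical sketch in plain Java, not HBase API;
"Handle" stands in for HTable:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative stand-in for a non-thread-safe HTable instance.
class Handle {
    final int id;
    Handle(int id) { this.id = id; }
}

// A minimal lease-out pool: handles live in a BlockingQueue, so each
// handle is held by at most one thread at a time, and lease() blocks
// when all handles are in use. This is, in effect, HTablePool rebuilt.
class HandlePool {
    private final BlockingQueue<Handle> free;

    HandlePool(int size) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) free.add(new Handle(i));
    }

    // Blocks until a handle is available.
    Handle lease() throws InterruptedException { return free.take(); }

    // Must be called (e.g. in a finally block) when the thread is done.
    void release(Handle h) { free.add(h); }
}
```

That this is straightforward to write does not make it the right answer; as
the question above notes, having to rebuild it suggests the client API
should be providing the buffering at the connection level instead.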
