From: Andrew Purtell
Date: Fri, 19 Dec 2014 10:31:20 -0800
Subject: Re: Efficient use of buffered writes in a post-HTablePool world?
To: "user@hbase.apache.org"
Cc: lars hofhansl, Enis Söztutar

I believe HTableMultiplexer[1] is meant to stand in for HTablePool for buffered writing.
FWIW, I've not used it.

1: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html

On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk wrote:
>
> Hi Aaron,
>
> Your analysis is spot on and I do not believe this is by design. I see the
> write buffer is owned by the table, while I would have expected there to be
> a buffer per table all managed by the connection. I suggest you raise a
> blocker ticket vs the 1.0.0 release that's just around the corner to give
> this the attention it needs. Let me know if you're not into JIRA, I can
> raise one on your behalf.
>
> cc Lars, Enis.
>
> Nice work Aaron.
> -n
>
> On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu wrote:
> >
> > Hi All,
> >
> > TLDR; in the absence of HTablePool, if HTable instances are short-lived,
> > how should clients use buffered writes?
> >
> > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> > (CDH5.2). One issue I’m confused by is how to effectively use buffered
> > writes now that HTablePool has been deprecated[1].
> >
> > In our 0.94 code, a pathway could get a table from the pool, configure it
> > with table.setAutoFlush(false); and write Puts to it. Those writes would
> > then go to the table instance’s writeBuffer, and those writes would only be
> > flushed when the buffer was full, or when we were ready to close out the
> > pool. We were intentionally choosing to have fewer, larger writes from the
> > client to the cluster, and we knew we were giving up a degree of safety in
> > exchange (i.e. if the client dies after it’s accepted a write but before
> > the flush for that write occurs, the data is lost).
> > This seems to be generally considered a reasonable choice (cf. the HBase
> > Book [2], section 14.8.4).
> >
> > However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> > seems to be to create a new HTable via table =
> > stashedHConnection.getTable(tableName, myExecutorService). However, even if
> > we do table.setAutoFlush(false), because that table instance is
> > short-lived, its buffer never gets full. We’ll create a table instance,
> > write a put to it, try to close the table, and the close call will trigger
> > a (synchronous) flush. Thus, not having HTablePool seems like it would
> > cause us to have many more small writes from the client to the cluster, and
> > basically wipe out the advantage of turning off autoflush.
> >
> > More concretely:
> >
> > // Given these two helpers ...
> >
> > private HTableInterface getAutoFlushTable(String tableName) throws IOException {
> >   // (autoflush is true by default)
> >   return storedConnection.getTable(tableName, executorService);
> > }
> >
> > private HTableInterface getBufferedTable(String tableName) throws IOException {
> >   HTableInterface table = getAutoFlushTable(tableName);
> >   table.setAutoFlush(false);
> >   return table;
> > }
> >
> > // it's my contention that these two methods would behave almost identically,
> > // except the first will hit a synchronous flush during the put call, and the
> > // second will flush during the (hidden) close call on table.
> >
> > private void writeAutoFlushed(Put somePut) throws IOException {
> >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> >     table.put(somePut); // will do synchronous flush
> >   }
> > }
> >
> > private void writeBuffered(Put somePut) throws IOException {
> >   try (HTableInterface table = getBufferedTable(tableName)) {
> >     table.put(somePut);
> >   } // auto-close will trigger synchronous flush
> > }
> >
> > It seems like the only way to avoid this is to have long-lived HTable
> > instances, which get reused for multiple writes. However, since the actual
> > writes are driven from highly concurrent code, and since HTable is not
> > threadsafe, this would involve having a number of HTable instances, and a
> > control mechanism for leasing them out to individual threads safely. Except
> > at this point it seems like we will have recreated HTablePool, which
> > suggests that we’re doing something deeply wrong.
> >
> > What am I missing here? Since the HTableInterface.setAutoFlush method still
> > exists, it must be anticipated that users will still want to buffer writes.
> > What’s the recommended way to actually buffer a meaningful number of
> > writes, from a multithreaded context, that doesn’t just amount to creating
> > a table pool?
> >
> > Thanks in advance,
> > Aaron
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > [2] http://hbase.apache.org/book/perf.writing.html
> > [3] https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
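
As an illustration of the question raised in the quoted message (how to accumulate a meaningful number of writes when many threads write concurrently), here is a minimal sketch in plain Java. It deliberately uses no HBase classes: SharedWriteBuffer, flushThreshold, and sink are hypothetical names, and the sink is only a stand-in for whatever would send one batch of Puts to the cluster. The idea is a single long-lived buffer shared by all writer threads, so that short-lived table handles do not force one flush per write. It is a sketch under those assumptions, not the HBase client's own mechanism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper, NOT part of the HBase client API: one long-lived,
 * thread-safe buffer that all writer threads share.
 */
final class SharedWriteBuffer<T> implements AutoCloseable {
    private final int flushThreshold;
    // Stand-in for "send one batched write to the cluster". It is invoked
    // outside the lock, so it must tolerate concurrent calls.
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    SharedWriteBuffer(int flushThreshold, Consumer<List<T>> sink) {
        if (flushThreshold < 1) {
            throw new IllegalArgumentException("flushThreshold must be >= 1");
        }
        this.flushThreshold = flushThreshold;
        this.sink = sink;
    }

    /** Buffers one write; hands a batch to the sink only once the threshold is reached. */
    void add(T write) {
        List<T> batch = null;
        synchronized (this) {
            pending.add(write);
            if (pending.size() >= flushThreshold) {
                batch = pending;              // take ownership of the full batch
                pending = new ArrayList<>();  // later writers fill a fresh list
            }
        }
        if (batch != null) {
            sink.accept(batch); // slow I/O happens without holding the lock
        }
    }

    /** Sends whatever is currently buffered, if anything. */
    void flush() {
        List<T> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return;
            }
            batch = pending;
            pending = new ArrayList<>();
        }
        sink.accept(batch);
    }

    /** Closing flushes the remainder, mirroring the flush-on-close behaviour described above. */
    @Override
    public void close() {
        flush();
    }
}
```

With a threshold of 3, three calls to add produce a single call to the sink, and close sends any remainder. The durability trade-off described in the thread still applies: anything buffered but not yet flushed is lost if the client dies.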