Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CANZa=GvbC8rxJwWqN0h8gNFw45yKB_N_zCunk+g1z2-Pqfbakw@mail.gmail.com>
References: 
 <CAPdJLkEzmUQZ_kvD=8mrxi4V=hCmUp3g9MUZsddD+mon+AvNtg@mail.gmail.com>
 <CANZa=GvbC8rxJwWqN0h8gNFw45yKB_N_zCunk+g1z2-Pqfbakw@mail.gmail.com>
From: Andrew Purtell <apurtell@apache.org>
Date: Fri, 19 Dec 2014 10:31:20 -0800
Message-ID: 
 <CA+RK=_Caf3_DAA4w1cqe4n5v5UQip9hKVsk6CDcEe38WzcTpCw@mail.gmail.com>
Subject: Re: Efficient use of buffered writes in a post-HTablePool world?
To: "user@hbase.apache.org" <user@hbase.apache.org>
Cc: lars hofhansl <larsh@apache.org>,
 Enis Söztutar <enis@apache.org>
Content-Type: multipart/alternative; boundary=001a11c3c832b468b7050a95ec9d

--001a11c3c832b468b7050a95ec9d
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I believe HTableMultiplexer[1] is meant to stand in for HTablePool for
buffered writing. FWIW, I've not used it.

1:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
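
To illustrate the general idea of a write buffer that is shared across
threads and outlives any single caller, here is a minimal plain-Java sketch.
It is not HTableMultiplexer and does not use the HBase API at all:
SharedWriteBuffer, flushThreshold, and the sink callback are hypothetical
names made up for this example, and the sink merely stands in for whatever
would send one batched write to the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical thread-safe buffer that batches writes from many threads. */
final class SharedWriteBuffer<T> {
    private final int flushThreshold;
    // Assumption: stands in for one batched round trip to the cluster.
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    SharedWriteBuffer(int flushThreshold, Consumer<List<T>> sink) {
        if (flushThreshold < 1) {
            throw new IllegalArgumentException("flushThreshold must be >= 1");
        }
        this.flushThreshold = flushThreshold;
        this.sink = sink;
    }

    /** Queue one write; hand a batch to the sink only once the threshold is reached. */
    void add(T write) {
        List<T> batch = null;
        synchronized (this) {
            pending.add(write);
            if (pending.size() >= flushThreshold) {
                batch = pending;
                pending = new ArrayList<>();
            }
        }
        if (batch != null) {
            sink.accept(batch); // send outside the lock so other writers are not blocked
        }
    }

    /** Flush whatever is still buffered, for example at shutdown. */
    void flush() {
        List<T> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return;
            }
            batch = pending;
            pending = new ArrayList<>();
        }
        sink.accept(batch);
    }
}
```

Many threads can call add() on one long-lived instance, so batches fill up
even when no individual caller lives long. The trade-off is the same one
described in the quoted message for setAutoFlush(false): anything still
buffered when the client dies is lost.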


On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <ndimiduk@apache.org> wrote:
>
> Hi Aaron,
>
> Your analysis is spot on and I do not believe this is by design. I see the
> write buffer is owned by the table, while I would have expected there to be
> a buffer per table all managed by the connection. I suggest you raise a
> blocker ticket vs the 1.0.0 release that's just around the corner to give
> this the attention it needs. Let me know if you're not into JIRA, I can
> raise one on your behalf.
>
> cc Lars, Enis.
>
> Nice work Aaron.
> -n
>
> On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <abeppu@siftscience.com>
> wrote:
> >
> > Hi All,
> >
> > TLDR; in the absence of HTablePool, if HTable instances are short-lived,
> > how should clients use buffered writes?
> >
> > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> > (CDH5.2). One issue I’m confused by is how to effectively use buffered
> > writes now that HTablePool has been deprecated[1].
> >
> > In our 0.94 code, a pathway could get a table from the pool, configure it
> > with table.setAutoFlush(false); and write Puts to it. Those writes would
> > then go to the table instance’s writeBuffer, and those writes would only be
> > flushed when the buffer was full, or when we were ready to close out the
> > pool. We were intentionally choosing to have fewer, larger writes from the
> > client to the cluster, and we knew we were giving up a degree of safety in
> > exchange (i.e. if the client dies after it’s accepted a write but before
> > the flush for that write occurs, the data is lost). This seems to be
> > generally considered a reasonable choice (cf. the HBase Book [2], section
> > 14.8.4).
> >
> > However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> > seems to be to create a new HTable via table =
> > stashedHConnection.getTable(tableName, myExecutorService). However, even if
> > we do table.setAutoFlush(false), because that table instance is
> > short-lived, its buffer never gets full. We’ll create a table instance,
> > write a put to it, try to close the table, and the close call will trigger
> > a (synchronous) flush. Thus, not having HTablePool seems like it would
> > cause us to have many more small writes from the client to the cluster, and
> > basically wipe out the advantage of turning off autoflush.
> >
> > More concretely :
> >
> > // Given these two helpers ...
> >
> > private HTableInterface getAutoFlushTable(String tableName) throws IOException {
> >   // (autoflush is true by default)
> >   return storedConnection.getTable(tableName, executorService);
> > }
> >
> > private HTableInterface getBufferedTable(String tableName) throws IOException {
> >   HTableInterface table = getAutoFlushTable(tableName);
> >   table.setAutoFlush(false);
> >   return table;
> > }
> >
> > // it's my contention that these two methods would behave almost identically,
> > // except the first will hit a synchronous flush during the put call,
> > // and the second will flush during the (hidden) close call on table.
> >
> > private void writeAutoFlushed(Put somePut) throws IOException {
> >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> >     table.put(somePut); // will do synchronous flush
> >   }
> > }
> >
> > private void writeBuffered(Put somePut) throws IOException {
> >   try (HTableInterface table = getBufferedTable(tableName)) {
> >     table.put(somePut);
> >   } // auto-close will trigger synchronous flush
> > }
> >
> > It seems like the only way to avoid this is to have long-lived HTable
> > instances, which get reused for multiple writes. However, since the actual
> > writes are driven from highly concurrent code, and since HTable is not
> > threadsafe, this would involve having a number of HTable instances, and a
> > control mechanism for leasing them out to individual threads safely. Except
> > at this point it seems like we will have recreated HTablePool, which
> > suggests that we’re doing something deeply wrong.
> >
> > What am I missing here? Since the HTableInterface.setAutoFlush method still
> > exists, it must be anticipated that users will still want to buffer writes.
> > What’s the recommended way to actually buffer a meaningful number of
> > writes, from a multithreaded context, that doesn’t just amount to creating
> > a table pool?
> >
> > Thanks in advance,
> > Aaron
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > [2] http://hbase.apache.org/book/perf.writing.html
> > [3]
> > https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

--001a11c3c832b468b7050a95ec9d--