hbase-issues mailing list archives

From "Carter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12728) buffered writes substantially less useful after removal of HTablePool
Date Fri, 02 Jan 2015 18:26:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263110#comment-14263110 ]

Carter commented on HBASE-12728:
--------------------------------

Okay, here's another pass, scratching out the HTableMultiplexer idea.  Instead we'll create
a new class called {{AsyncPutter}}. (Not a huge fan of the name, so if you have a better one,
please share.)

First off, here are our basic requirements in this refactor:
# Handle the M/R case where a user wants to batch and flush in a single thread
# Handle the case Aaron described where we batch across multiple threads
# Provide a way to do this through the new Table interface for convenience
# Buffering/batching limits based on size in bytes, not queue length
# Move towards [~lhofhansl]'s suggestion of "HTable as cheap proxies to tables only"
# While durability can't be guaranteed in case of a crash, avoid losing data otherwise.

So here are our classes:

{code:java}
// BufferedTable is lightweight and single-threaded.  Many of them can share a single AsyncPutter.
public class BufferedTable implements Table {
    public BufferedTable(Table t, AsyncPutter ap);
    public void flush();
}

// Thread-safe handler of puts for one or more BufferedTable instances.
public class AsyncPutter implements Closeable {
    public AsyncPutter(Connection c, ExecutorService pool, ExceptionListener e, PutBuffer pb);
    public synchronized void add(Put put);  // Synchronization adds nanoseconds in the single-threaded case.  No biggie.
    public synchronized void flush();
    public synchronized void close();
}

// Simple single-threaded data holder.
public class PutBuffer {
    public PutBuffer(long maxBufferSize);  // In bytes.  This makes more sense than queue length for memory management.
    // maxBufferSize = totalBufferMem / numberOfExecutorPoolThreads
    public void add(Put p);
    public boolean isBatchAvailable();
    public List<Put> removeBatch();
}

// To make sure exceptions don't get swallowed.
public interface ExceptionListener {
    void onException(RetriesExhaustedWithDetailsException e);
}
{code}
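To make the size-based batching idea in {{PutBuffer}} concrete, here's a minimal self-contained sketch of that logic.  Since these classes are only proposed here, plain strings stand in for {{Put}} and {{length()}} stands in for a real heap-size estimate:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of PutBuffer's size-based batching (requirement #4).
// Strings stand in for Put; length() stands in for Put#heapSize().
public class PutBufferSketch {
    private final long maxBufferSize;   // in bytes
    private long currentSize = 0;
    private List<String> buffer = new ArrayList<>();

    public PutBufferSketch(long maxBufferSize) {
        this.maxBufferSize = maxBufferSize;
    }

    public void add(String put) {
        buffer.add(put);
        currentSize += put.length();    // a real impl would estimate heap size
    }

    // A batch is ready once the accumulated bytes cross the threshold.
    public boolean isBatchAvailable() {
        return currentSize >= maxBufferSize;
    }

    // Hands the whole accumulated batch to the caller and resets the buffer.
    public List<String> removeBatch() {
        List<String> batch = buffer;
        buffer = new ArrayList<>();
        currentSize = 0;
        return batch;
    }
}
```

The point of tracking bytes rather than queue length is that total client memory is bounded by maxBufferSize times the number of executor threads, regardless of how large individual puts are.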

We also proposed a {{BufferedConnection}} factory, simply to make it easier to switch between
Table and BufferedTable implementations without much refactoring.  When used, it would own the
AsyncPutter.  Pros/cons for this idea?  It's not essential.

Asynchronous exception handling takes place through an {{ExceptionListener}} observer provided
by the user.  This means that exceptions are not thrown for simple put failures; they are
passed to the listener.  The reasoning here is that the current behavior is non-deterministic:

{code:java}
table.put(put1);  // This put causes an exception
table.put(put2);  // But we don't see the exception until we get here ...
table.put(put3);  // ... or maybe(?) here.  put3 succeeded, but I got an exception thrown.  That's counter-intuitive.
{code}

An ExceptionListener is a pretty standard pattern for asynchronous error handling.  M/R or
other cases might rely on an exception being thrown synchronously to roll back appropriately,
but it's easy enough to mimic that behavior with the listener approach.
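For example, here's a minimal sketch of how a listener could recover throw-on-failure semantics.  This is hypothetical glue code, not part of the proposal; plain {{Exception}} stands in for {{RetriesExhaustedWithDetailsException}}:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a listener that collects async failures so a caller
// (e.g. an M/R job) can rethrow them synchronously at a flush point.
// Plain Exception stands in for RetriesExhaustedWithDetailsException.
public class CollectingListener {
    private final List<Exception> failures = new ArrayList<>();

    // Called by the putter's flusher thread on each failed batch.
    public synchronized void onException(Exception e) {
        failures.add(e);
    }

    // Called by the writer after flush() to recover throw-on-failure semantics.
    public synchronized void rethrowIfFailed() throws IOException {
        if (!failures.isEmpty()) {
            IOException io = new IOException(failures.size() + " put batch(es) failed");
            for (Exception e : failures) {
                io.addSuppressed(e);
            }
            failures.clear();
            throw io;
        }
    }
}
```

The exception still surfaces at a deterministic point (the flush), rather than on whichever later put happens to trigger it.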

{{BufferedTable#close}} does not flush since we need to support batching across multiple threads.
 {{AsyncPutter#close}} does flush.  (Will JavaDoc this.)  If we decide to provide a BufferedConnection,
then closing that would also flush, since it owns the AsyncPutter.

Do we need a timeout-based flush?  I don't see one in the current HTable implementation, but
if it's important we could add it to the AsyncPutter.  It seems like a good way to limit how
many mutations sit unflushed in a big buffer during slow periods of writes.
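If we went that route, the timeout-based flush could be as simple as a scheduled task.  A hypothetical sketch, with a stand-in interface for {{AsyncPutter#flush}} since the class is only proposed here:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a periodic flush so puts don't sit in a large buffer
// during slow write periods.  Flushable is a stand-in for AsyncPutter#flush.
public class PeriodicFlusher implements AutoCloseable {
    public interface Flushable { void flush(); }

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public PeriodicFlusher(Flushable putter, long intervalMs) {
        // flush() is synchronized in the proposal, so the timer thread can
        // safely race with writer threads calling add()/flush().
        timer.scheduleAtFixedRate(putter::flush, intervalMs, intervalMs,
                TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        timer.shutdown();
    }
}
```

Since flush is already synchronized, the timer thread needs no extra coordination with writers; the worst case is a redundant flush of an empty buffer.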


> buffered writes substantially less useful after removal of HTablePool
> ---------------------------------------------------------------------
>
>                 Key: HBASE-12728
>                 URL: https://issues.apache.org/jira/browse/HBASE-12728
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>    Affects Versions: 0.98.0
>            Reporter: Aaron Beppu
>
> In previous versions of HBase, when use of HTablePool was encouraged, HTable instances
were long-lived in that pool, and for that reason, if autoFlush was set to false, the table
instance could accumulate a full buffer of writes before a flush was triggered. Writes from
the client to the cluster could then be substantially larger and less frequent than without
buffering.
> However, when HTablePool was deprecated, the primary justification seems to have been
that creating HTable instances is cheap, so long as the connection and executor service being
passed to it are pre-provided. A use pattern was encouraged where users should create a new
HTable instance for every operation, using an existing connection and executor service, and
then close the table. In this pattern, buffered writes are substantially less useful; writes
are as small and as frequent as they would have been with autoflush=true, except the synchronous
write is moved from the operation itself to the table close call which immediately follows.
> More concretely :
> ```
> // Given these two helpers ...
> private HTableInterface getAutoFlushTable(String tableName) throws IOException {
>   // (autoflush is true by default)
>   return storedConnection.getTable(tableName, executorService);
> }
> private HTableInterface getBufferedTable(String tableName) throws IOException {
>   HTableInterface table = getAutoFlushTable(tableName);
>   table.setAutoFlush(false);
>   return table;
> }
> // it's my contention that these two methods would behave almost identically,
> // except the first will hit a synchronous flush during the put call, and the
> // second will flush during the (hidden) close call on table.
> private void writeAutoFlushed(Put somePut) throws IOException {
>   try (HTableInterface table = getAutoFlushTable(tableName)) {
>     table.put(somePut); // will do synchronous flush
>   }
> }
> private void writeBuffered(Put somePut) throws IOException {
>   try (HTableInterface table = getBufferedTable(tableName)) {
>     table.put(somePut);
>   } // auto-close will trigger synchronous flush
> }
> ```
> For buffered writes to actually provide a performance benefit to users, one of two things
must happen:
> - The writeBuffer itself shouldn't live, flush and die with the lifecycle of its HTable instance.
If the writeBuffer were managed elsewhere and had a long lifespan, this could cease to be
an issue. However, if the same writeBuffer is appended to by multiple tables, then some additional
concurrency control will be needed around it.
> - Alternatively, there should be some pattern for having long-lived HTable instances.
However, since HTable is not thread-safe, we'd need multiple instances, and a mechanism for
leasing them out safely -- which sure sounds a lot like the old HTablePool to me.
> See discussion on mailing list here : http://mail-archives.apache.org/mod_mbox/hbase-user/201412.mbox/%3CCAPdJLkEzmUQZ_kvD%3D8mrxi4V%3DhCmUp3g9MUZsddD%2Bmon%2BAvNtg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
