accumulo-user mailing list archives

From "Slater, David M." <David.Sla...@jhuapl.edu>
Subject RE: BatchWriter performance on 1.4
Date Fri, 20 Sep 2013 16:47:47 GMT
I was using flush() after sending a bunch of mutations to the batchwriters to limit their latency.
I thought it would normally flush the buffer to ensure that the maxLatency is not violated.
If the maxLatency is quite large, how do I ensure that it doesn't wait a long time before
writing?

If the returned batch writers are all thread safe, then I'm still going to have the bottleneck
of their synchronized addMutations method, correct?

I'm looking for "org.apache.accumulo.client.impl" in log4j.properties, generic_logger.xml,
and the other config files, but can't locate it. Do I need to create a new entry for it there?

Thanks,
David

From: Keith Turner [mailto:keith@deenlo.com]
Sent: Thursday, September 19, 2013 7:01 PM
To: user@accumulo.apache.org
Subject: Re: BatchWriter performance on 1.4

On Thu, Sep 19, 2013 at 5:08 PM, Slater, David M. <David.Slater@jhuapl.edu> wrote:
Thanks Keith, I'm looking at it now. It appears like what I would want. As for the proper
usage...

Would I create one using the Connector,
then .getBatchWriter() for each of the tables I'm interested in,
add data to each of BatchWriters returned,

yes.

and then call flush() when I want all of that to get written?

Why are you calling flush()? Doing this frequently will increase RPC overhead and lower
throughput.


Would the individual batch writers spawned by the multiTableBatchWriter still have synchronized
addMutations() methods so I would have to worry about blocking still, or would that all happen
at the flush() method?

The returned batch writers are thread safe. They all add to the same queue/buffer in a synchronized
manner.   Calling flush() on any of the batch writers returned from getBatchWriter() will
block the others.
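
The behaviour described above can be pictured with a toy model. This is only a sketch under stated assumptions, not Accumulo's implementation: SharedBufferModel and TableHandle are hypothetical names, mutations are stood in for by plain strings, and flush() here just drains the buffer while holding the lock rather than sending anything to a tablet server. It shows why handles for different tables still contend with one another: they all synchronize on the same object.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: one buffer and one lock shared by every per-table handle. */
class SharedBufferModel {
    private final List<String> pending = new ArrayList<>(); // the single shared queue
    private int sent = 0;

    /** Hypothetical per-table handle; stands in for a writer from getBatchWriter(). */
    final class TableHandle {
        private final String table;

        TableHandle(String table) {
            this.table = table;
        }

        void addMutations(Iterable<String> rows) {
            // Every handle locks the same outer object, so adds to different
            // tables are serialized against each other and against flush().
            synchronized (SharedBufferModel.this) {
                for (String row : rows) {
                    pending.add(table + ":" + row);
                }
            }
        }

        void flush() {
            // Flushing through any handle drains the whole shared buffer.
            SharedBufferModel.this.flush();
        }
    }

    TableHandle handleFor(String table) {
        return new TableHandle(table);
    }

    synchronized void flush() {
        // While this lock is held, addMutations() on any handle has to wait.
        sent += pending.size();
        pending.clear();
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized int sentCount() {
        return sent;
    }
}
```

In this model, spreading the addMutations() calls over more threads would not remove the contention, because the lock is shared; the cost that matters is how long each caller holds it.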

If you set the log4j log level to TRACE for org.apache.accumulo.client.impl you can
see output like the following.  Binning is the process of taking each mutation and deciding
which tablet and tablet server it goes to.

  2013-09-19 18:43:37,261 [impl.ThriftTransportPool] TRACE: Using existing connection to 127.0.0.1:9997
  2013-09-19 18:43:37,393 [impl.TabletLocatorImpl] TRACE: tid=12 oid=13  Binning 80909 mutations for table 3
  2013-09-19 18:43:37,402 [impl.TabletLocatorImpl] TRACE: tid=12 oid=13  Binned 80909 mutations for table 3 to 1 tservers in 0.009 secs
  2013-09-19 18:43:37,402 [impl.TabletServerBatchWriter] TRACE: Started sending 80,909 mutations to 1 tablet servers
  2013-09-19 18:43:37,656 [impl.ThriftTransportPool] TRACE: Returned connection 127.0.0.1:9997 (120000) ioCount : 1459116
  2013-09-19 18:43:37,657 [impl.TabletServerBatchWriter] TRACE: sent 80,909 mutations to 127.0.0.1:9997 in 0.40 secs (204,832.91 mutations/sec) with 0 failures
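
On where to set that level: the classes named in the trace output (ThriftTransportPool, TabletLocatorImpl, TabletServerBatchWriter) live under org.apache.accumulo.core.client.impl in the 1.4 source tree, so the logger name most likely needs the "core" segment. There is normally no existing entry to find; one has to be added. A minimal sketch, assuming the ingest client picks up a log4j 1.x properties file from its own classpath (generic_logger.xml configures the Accumulo server processes, not the client program):

```properties
# Assumption: this goes in the log4j.properties that the client application loads.
# Raises only the Accumulo client implementation classes to TRACE.
log4j.logger.org.apache.accumulo.core.client.impl=TRACE
```

If an appender in that file sets its own threshold above TRACE, that threshold would also have to be lowered for these lines to show up.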

When you close the batch writer, it will log some summary stats like the following.


  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: TABLET SERVER BATCH WRITER STATISTICS
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Added                :  1,000,000 mutations
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Sent                 :  1,000,000 mutations
  2013-09-19 18:43:39,149 [impl.TabletServerBatchWriter] TRACE: Resent percentage   :       0.00%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Overall time         :       5.94 secs
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Overall send rate    : 168,406.87 mutations/sec
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Send efficiency      :      86.91%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: BACKGROUND WRITER PROCESS STATISTICS
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Total send time      :       5.16 secs  86.91%
  2013-09-19 18:43:39,150 [impl.TabletServerBatchWriter] TRACE: Average send rate    : 193,760.90 mutations/sec
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: Total bin time       :       0.46 secs   7.81%
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: Average bin rate     : 2,155,172.41 mutations/sec
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: tservers per batch   :       1.00 avg       1 min      1 max
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: tablets per batch    :       1.00 avg       1 min      1 max
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE:
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: SYSTEM STATISTICS
  2013-09-19 18:43:39,151 [impl.TabletServerBatchWriter] TRACE: JVM GC Time          :       0.53 secs
  2013-09-19 18:43:39,152 [impl.TabletServerBatchWriter] TRACE: JVM Compile Time     :       1.60 secs
  2013-09-19 18:43:39,152 [impl.TabletServerBatchWriter] TRACE: System load average : initial=  0.22 final=  0.20
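
For anyone comparing their own numbers, the figures in the summary above appear to relate to each other in a simple way: each rate looks like the mutation count divided by the corresponding time, and each percentage looks like that time as a share of the overall time. That reading is inferred from the logged values, not taken from the Accumulo source, and the class name BatchWriterStatsCheck is made up for this sketch. Recomputing from the rounded durations gives values close to, but not exactly, the logged ones, presumably because the log prints the times to two decimals.

```java
/** Recomputes the summary figures from the rounded values shown in the log. */
class BatchWriterStatsCheck {

    /** Mutations per second for a given count and elapsed time. */
    static double rate(long mutations, double seconds) {
        return mutations / seconds;
    }

    /** One duration as a percentage of another. */
    static double percent(double part, double whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        // Values copied from the trace output above.
        long sent = 1_000_000L;
        double overallSecs = 5.94;
        double sendSecs = 5.16;
        double binSecs = 0.46;

        System.out.printf("overall send rate : %.2f mutations/sec%n", rate(sent, overallSecs));
        System.out.printf("average send rate : %.2f mutations/sec%n", rate(sent, sendSecs));
        System.out.printf("average bin rate  : %.2f mutations/sec%n", rate(sent, binSecs));
        System.out.printf("send share        : %.2f%%%n", percent(sendSecs, overallSecs));
        System.out.printf("bin share         : %.2f%%%n", percent(binSecs, overallSecs));
    }
}
```

Read this way, a low send share together with a high bin share would point at binning rather than the network as the place where time goes.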

What do these numbers look like for you?

Keith


From: Keith Turner [mailto:keith@deenlo.com]
Sent: Thursday, September 19, 2013 12:39 PM
To: user@accumulo.apache.org
Subject: Re: BatchWriter performance on 1.4

Are you aware of the multi table batch writer?  I am not sure if it would be useful, but wanted
to make sure you knew about it.   It will use the same thread pool to process mutations for
multiple tables.  Also it will batch mutations for multiple tablets into the same rpc calls.

On Wed, Sep 18, 2013 at 5:07 PM, Slater, David M. <David.Slater@jhuapl.edu> wrote:
Hi, I'm running a single-threaded ingestion program that takes data from an input source,
parses it into mutations, and then writes those mutations (sequentially) to four different
BatchWriters (all on different tables). Most of the time (95%) is spent adding mutations,
e.g. in batchWriter.addMutations(mutations). I am wondering how to reduce the time these
calls take.

1) For the method batchWriter.addMutations(Iterable<Mutation>), does it matter for performance
whether the mutations returned by the iterator are sorted in lexicographic order?

2) If the Iterable<Mutation> that I pass to the BatchWriter is very large, will I need
to wait for a number of Batches to be written and flushed before it will finish iterating,
or does it transfer the elements of the Iterable to a different intermediate list?

3) If that is the case, would it then make sense to spawn off short threads for each time
I make use of addMutations?

At a high level, my code looks like this:

BatchWriter bw1 = connector.createBatchWriter(...);
BatchWriter bw2 = ...
...
while (true) {
    String[] data = input.getData();
    List<Mutation> mutations1 = parseData1(data);
    List<Mutation> mutations2 = parseData2(data);
    ...
    bw1.addMutations(mutations1);
    bw2.addMutations(mutations2);
    ...
}

Thanks,
David


