accumulo-user mailing list archives

From "Slater, David M." <David.Sla...@jhuapl.edu>
Subject RE: BatchWriter performance on 1.4
Date Thu, 19 Sep 2013 14:53:36 GMT
Hi David,

I've looked at generating rfiles directly, but I know that adds latency to the process, so
I wanted to make sure I had found the upper bound for direct mutations before exploring that.

The tables are pre-split, and all tservers are engaged in ingest (though the application
that does the parsing and runs the BatchWriters is on the namenode, which is not a tserver).
There are some compactions happening during ingest, but not many.

The reason I'm running them in the same ingest process is that they use the same data and
their mutations reuse a lot of that data. However, it would be nice to have a different thread
handle the ingest for each BatchWriter, so I might try that out.
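
A minimal sketch of that one-thread-per-BatchWriter idea, using only the JDK. Everything named here is hypothetical: PerWriterIngest is an illustrative class, each Consumer<List<String>> stands in for one BatchWriter's addMutations call, and the Strings stand in for Mutations. It does not touch the Accumulo API. The parsing thread hands each batch to a dedicated single-threaded executor and moves on without waiting for the write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: one worker thread per sink, where a "sink" plays the
// role of a single BatchWriter (assumption: sinks.get(i).accept(batch) would
// wrap something like bw.addMutations(batch) in the real program).
final class PerWriterIngest {
    private final List<ExecutorService> workers = new ArrayList<>();
    private final List<Consumer<List<String>>> sinks;

    PerWriterIngest(List<Consumer<List<String>>> sinks) {
        this.sinks = new ArrayList<>(sinks);
        for (int i = 0; i < this.sinks.size(); i++) {
            // A single-threaded executor keeps each sink's batches in submission order.
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Queue one batch for sink i and return immediately to the caller.
    Future<?> submit(int i, List<String> batch) {
        Consumer<List<String>> sink = sinks.get(i);
        return workers.get(i).submit(() -> sink.accept(batch));
    }

    // Stop accepting work and wait (up to a minute per worker) for queued batches to drain.
    void close() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        try {
            for (ExecutorService w : workers) {
                w.awaitTermination(1, TimeUnit.MINUTES);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that the executor queues in this sketch are unbounded, so if parsing outpaces the writers the pending batches accumulate in memory; a bounded queue would be the obvious refinement.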

From: David Medinets [mailto:david.medinets@gmail.com]
Sent: Wednesday, September 18, 2013 10:41 PM
To: accumulo-user
Subject: Re: BatchWriter performance on 1.4

Have you looked at generating rfiles instead of writing mutations directly to Accumulo?
Are the four target tables pre-split?
Are all tservers engaged in the ingest process?
Do you see a lot of compactions while the ingest is happening?
Any reason not to run four ingest processes with one batchwriter each instead of one ingest
with four batchwriters?

On Wed, Sep 18, 2013 at 5:07 PM, Slater, David M. <David.Slater@jhuapl.edu> wrote:
Hi, I'm running a single-threaded ingestion program that takes data from an input source,
parses it into mutations, and then writes those mutations (sequentially) to four different
BatchWriters (all on different tables). Most of the time (95%) is spent adding mutations,
e.g. batchWriter.addMutations(mutations). I am wondering how to reduce the time taken by these
calls.

1) For the method batchWriter.addMutations(Iterable<Mutation>), does it matter for performance
whether the mutations returned by the iterator are sorted in lexicographic order?

2) If the Iterable<Mutation> that I pass to the BatchWriter is very large, will I need
to wait for a number of Batches to be written and flushed before it will finish iterating,
or does it transfer the elements of the Iterable to a different intermediate list?

3) If that is the case, would it then make sense to spawn off short threads for each time
I make use of addMutations?

At a high level, my code looks like this:

BatchWriter bw1 = connector.createBatchWriter(...)
BatchWriter bw2 = ...
...
while(true) {
    String[] data = input.getData();
    List<Mutation> mutations1 = parseData1(data);
    List<Mutation> mutations2 = parseData2(data);
    ...
    bw1.addMutations(mutations1);
    bw2.addMutations(mutations2);
    ...
}

Thanks,
David

