accumulo-user mailing list archives

From John Vines <>
Subject Re: Mapreduce, Indexing and Logging
Date Sat, 02 Mar 2013 20:49:38 GMT
1. This is quite variable. It depends on your hardware specs, primarily CPU
and disk throughput. It also depends on how your system is configured for
these resources and your typical mutation size. How your mutations are
distributed is another factor.
2. Under the hood, the output format uses a BatchWriter. Once a flush
returns from the BatchWriter, the data is guaranteed to be available for
reads. Unless you call flush explicitly, the BatchWriter flushes whenever
half of its capacity is full, or after it has been idle for a short period
(I want to say 3 seconds, but I could be mistaken).
3. If the two mutations don't intersect at all, then there's no issue. If
they write to identical columns, then whichever one has the newest timestamp
will come up first. If the timestamps are identical, either because you set
them explicitly or because the mutations arrive at the same time, the
outcome is non-deterministic.
4. I'm going to defer this question to someone else.
5. Ideally each datanode should also run a tserver and a tasktracker. This
helps ensure data locality, so you avoid crossing network boundaries and
the overhead that comes with it.
5. I don't see why not. There are a few log4j statements in the Accumulo
client as well, so using log4j in your job would actually make it easier
for you to deal with those too.
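
To make point 3 concrete, here is a minimal sketch in plain Java of
newest-timestamp-wins resolution. It is a toy model, not the Accumulo API:
the names VersionSketch, put, read and hasTimestampTie are hypothetical, and
the way the toy breaks ties (first write wins) is an arbitrary choice made
only for illustration. In a real table you should not rely on any particular
tie-breaking order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT the Accumulo API. All names here are hypothetical.
final class VersionSketch {

    // One written version of a cell: a timestamp plus its value.
    static final class Version {
        final long timestamp;
        final String value;

        Version(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Cell key ("row/column") -> every version written so far.
    private final Map<String, List<Version>> cells = new HashMap<>();

    // Record a write to one column of one row.
    void put(String row, String column, long timestamp, String value) {
        cells.computeIfAbsent(row + "/" + column, k -> new ArrayList<>())
             .add(new Version(timestamp, value));
    }

    // Return the value with the newest timestamp, or null if the cell is absent.
    // On a tie this toy returns the first version written; that is an
    // assumption of the sketch, not a guarantee of any real system.
    String read(String row, String column) {
        List<Version> versions = cells.get(row + "/" + column);
        if (versions == null) {
            return null;
        }
        Version newest = versions.get(0);
        for (Version v : versions) {
            if (v.timestamp > newest.timestamp) {
                newest = v;
            }
        }
        return newest.value;
    }

    // True when two or more versions share the newest timestamp, which is
    // the case where the outcome of a read should be treated as undefined.
    boolean hasTimestampTie(String row, String column) {
        List<Version> versions = cells.get(row + "/" + column);
        if (versions == null) {
            return false;
        }
        long max = Long.MIN_VALUE;
        int count = 0;
        for (Version v : versions) {
            if (v.timestamp > max) {
                max = v.timestamp;
                count = 1;
            } else if (v.timestamp == max) {
                count++;
            }
        }
        return count > 1;
    }

    public static void main(String[] args) {
        VersionSketch table = new VersionSketch();

        // Different columns of the same row: the writes do not intersect.
        table.put("row1", "cf:a", 100L, "from-mapper-1");
        table.put("row1", "cf:b", 100L, "from-mapper-2");

        // Same column, different timestamps: the newest timestamp wins.
        table.put("row1", "cf:c", 100L, "older");
        table.put("row1", "cf:c", 200L, "newer");
        System.out.println(table.read("row1", "cf:c"));

        // Same column, same timestamp: flag the tie instead of trusting the read.
        table.put("row1", "cf:d", 300L, "writer-1");
        table.put("row1", "cf:d", 300L, "writer-2");
        System.out.println(table.hasTimestampTie("row1", "cf:d"));
    }
}
```

If your mappers or reducers may write the same column of the same row, the
practical takeaway is to make sure their timestamps cannot collide, or to
design the job so that a collision does not matter.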


On Sat, Mar 2, 2013 at 3:11 PM, Aji Janis <> wrote:

> Hello,
>  I am investigating how well accumulo will handle mapreduce jobs. I am
> interested in hearing about any known issues from anyone running mapreduce
> with accumulo as their source and sink. Specifically, I want to hear your
> thoughts about the following:
> Assume cluster has 50 nodes.
> Accumulo is running on three nodes
> Solr is on three nodes
> 1. how many concurrent mutations can accumulo handle - more details on how
> this works would be extremely helpful
> 2. is there a delay between when map reduce writes data to table vs. when
> the data is available for read.
> 3. how are concurrent mutations to the same row handled  (say from
> different mappers/reducers) since accumulo isn't transactional
> 4. I am trying to solr index some accumulo data --- are there any known
> issues on accumulo end? solr end? how does one vs. multiple shards affect
> the MR job?
> 5. should I have more accumulo/ solr nodes (ie an instance on each node in
> cluster? is that necessary? workarounds?)
> 5. Normally I have log4j statements all over the java job. Can I still use
> them with map reduce?
> I apologize if any of these questions do not belong on this mailing list
> (and please point me to where I can ask them, if possible). I am trying to
> gather a lot of information to decide if this is a good approach for me and
> the level of effort needed so I realize these are a lot of questions. I
> very much appreciate any and all feedback. Thank you for your time in
> advance!
