accumulo-user mailing list archives

From Aji Janis <>
Subject Re: Map Reduce on accumulo
Date Tue, 04 Dec 2012 23:55:42 GMT
Thank you, John, for your response. I do have a few follow-up questions.
Let me use a better example. Let's say my table and tablet server
distributions are as follows:


rowA | f1 | q1 | v1
rowA | f2 | q2 | v2
rowA | f3 | q3 | v3

rowB | f1 | q1 | v1
rowB | f1 | q2 | v2

rowC | f1 | q1 | v1

rowD | f1 | q1 | v1
rowD | f1 | q2 | v2

rowE | f1 | q1 | v1


TabletServer1: Tablet A: rowA, rowC
TabletServer2: Tablet B: rowB
TabletServer2: Tablet C: rowD


In this example, suppose I have a MapReduce job that reads from the table
above using org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
and writes to a second table, MyTable2, using
org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.

Let's not focus on what the MapReduce job itself does. From
your explanation below, it sounds like if autosplitting is not overridden,
then we get three mappers total. Is that right?

Further, am I correct in assuming that a single mapper will NOT get data
from multiple tablets?

I am also confused about what the order of input to the mapper will be.
Would the mapper at Tablet A get
-all data from rowA before it gets all data from rowC, or
-all data from rowC before it gets all data from rowA, or
-something interleaved, like:
   rowA | f1 | q1 | v1
   rowA | f2 | q2 | v2
   rowC | f1 | q1 | v1
   rowA | f3 | q3 | v3

I know this is a lot of questions, but I would really like to get a good
understanding of the architecture. Thank you!
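To make the ordering question concrete, here is a minimal sketch of what Tablet A's entries would look like if they come back in sorted key order (row, then column family, then column qualifier); the class name and data layout are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScanOrderSketch {

    // Tablet A's entries from the example above, listed out of order on purpose.
    // Each entry is {row, family, qualifier, value}.
    static List<String[]> sortedEntries() {
        List<String[]> entries = new ArrayList<>(Arrays.asList(
                new String[]{"rowC", "f1", "q1", "v1"},
                new String[]{"rowA", "f3", "q3", "v3"},
                new String[]{"rowA", "f1", "q1", "v1"},
                new String[]{"rowA", "f2", "q2", "v2"}));
        // Sort by (row, family, qualifier), mirroring Accumulo's key sort order.
        entries.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            if (c == 0) c = a[1].compareTo(b[1]);
            if (c == 0) c = a[2].compareTo(b[2]);
            return c;
        });
        return entries;
    }

    public static void main(String[] args) {
        for (String[] e : sortedEntries()) {
            System.out.println(String.join(" | ", e));
        }
        // Prints all three rowA entries first, then the rowC entry.
    }
}
```

If that sorted-order assumption holds, the mapper would see all of rowA's entries before any of rowC's, never interleaved.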

On Tue, Dec 4, 2012 at 5:45 PM, John Vines <> wrote:

> A tablet consists of an in-memory portion and zero to many files in
> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
> boost from the natural locality you get when you write data to HDFS, but if a
> tablet migrates, that locality can be lost until the data is compacted
> (rewritten). Locality may also be retained through HDFS data replication, but
> Accumulo does not go to extraordinary lengths to preserve locality, since the
> data will eventually be rewritten and locality restored.
> As for your example, if all data for a given row is inserted at the same
> time, then it is guaranteed to be in the same file. There is no atomicity
> guarantee regarding HDFS blocks, though, so depending on the block size and
> the amount of data in the file (and its distribution), it is possible for
> a few entries to span block boundaries even though they are adjacent.
> Using the input format, unless you override the autosplitting, you will
> get one mapper per tablet. If you disable auto-splitting, then you get
> one mapper per range you specify.
> Hope this helps; let me know if you have other questions or need
> clarification.
> John
> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <> wrote:
>> NOTE: I am fairly sure this hasn't been asked here yet; my apologies
>> if it has already been asked, in which case please forward me a link to
>> the answers. Thank you.
>> If my environment setup is as follows:
>> -a 64MB HDFS block size
>> -5 tablet servers
>> -10 tablets of size 1GB each per tablet server
>> If I have a table like below:
>> rowA | f1 | q1 | v1
>> rowA | f1 | q2 | v2
>> rowB | f1 | q1 | v3
>> rowC | f1 | q1 | v4
>> rowC | f2 | q1 | v5
>> rowC | f3 | q3 | v6
>> From the little documentation available, I know all data for rowA will go
>> to one tablet, which may or may not contain data about other rows, i.e.,
>> it is all or nothing per row. So my questions are:
>> How are tablets mapped to a DataNode or HDFS block? Obviously, one tablet
>> is split into multiple HDFS blocks (16 in this case, at 1GB per tablet and
>> 64MB per block), so would they be stored on the same or different
>> DataNode(s), or does it not matter?
>> In the example above, would all data about rowC (or rowA or rowB) go onto
>> the same HDFS block or onto different HDFS blocks?
>> When executing a MapReduce job, how many mappers would I get? (One per
>> HDFS block? Per tablet? Per server?)
>> Thank you in advance for any and all suggestions.
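John's point about autosplitting can be sketched as a split-counting rule: with auto-adjust on, each supplied range is effectively clipped against tablet boundaries, yielding one mapper per overlapping tablet; with auto-adjust off, each range becomes exactly one mapper. The class name, the inclusive-range representation, and the tablet boundaries below are made up for illustration, not the actual AccumuloInputFormat internals:

```java
import java.util.Arrays;
import java.util.List;

public class SplitCountSketch {

    // A range is {startRow, endRow}, inclusive on both ends for simplicity.
    static boolean overlaps(String[] a, String[] b) {
        return a[0].compareTo(b[1]) <= 0 && b[0].compareTo(a[1]) <= 0;
    }

    static int mapperCount(List<String[]> tablets, List<String[]> ranges,
                           boolean autoAdjust) {
        if (!autoAdjust) {
            return ranges.size(); // one mapper per supplied range
        }
        int splits = 0;
        for (String[] t : tablets) {
            for (String[] r : ranges) {
                if (overlaps(t, r)) { splits++; break; } // count each tablet once
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three tablets with consistent, sorted boundaries (invented for the sketch).
        List<String[]> tablets = Arrays.asList(
                new String[]{"rowA", "rowB"},
                new String[]{"rowC", "rowD"},
                new String[]{"rowE", "rowZ"});
        // One range spanning the whole table.
        List<String[]> fullScan = Arrays.asList(new String[]{"rowA", "rowZ"});
        System.out.println(mapperCount(tablets, fullScan, true));  // 3: one per tablet
        System.out.println(mapperCount(tablets, fullScan, false)); // 1: one per range
    }
}
```

Under this model, a full-table scan over three tablets gives three mappers, matching the "three mappers total" reading of John's explanation, and each mapper's data stays within a single tablet.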
