accumulo-user mailing list archives

From John Vines <>
Subject Re: Map Reduce on accumulo
Date Tue, 04 Dec 2012 22:45:18 GMT
A tablet consists of an in-memory portion and zero or more files in HDFS.
Each file may span one or many HDFS blocks. Accumulo gets a performance boost
from the natural locality you get when writing data to HDFS, but if a
tablet migrates, that locality can be lost until the data is compacted
(rewritten). Some locality may survive a migration thanks to HDFS block
replication, but Accumulo makes no special effort to preserve it, since the
data will eventually be rewritten and locality restored.
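To make the locality point concrete, here is a back-of-envelope sketch using the numbers from the question below (64MB blocks, 1GB tablets, 5 servers) and an assumed HDFS replication factor of 3. The independence assumption is a simplification; real HDFS replica placement is rack-aware, so treat this as illustrative only:

```python
# Back-of-envelope sketch (illustrative model, not an Accumulo API):
# how many HDFS blocks a tablet's files occupy, and the rough chance a
# migrated tablet still finds a block replica on its new host server.

BLOCK_MB = 64          # HDFS block size from the question below
TABLET_MB = 1024       # 1 GB tablet
REPLICATION = 3        # assumed HDFS replication factor
DATANODES = 5          # 5 tablet servers co-located with datanodes

# Full blocks occupied by one tablet's worth of data (ignoring partial blocks)
blocks_per_tablet = TABLET_MB // BLOCK_MB

# If replicas were placed independently of tablet assignment, the chance
# that any given block has a replica on the tablet's new host is roughly:
p_local = REPLICATION / DATANODES

print(blocks_per_tablet, p_local)  # 16 0.6
```

Even under this crude model, most blocks of a migrated tablet would still have a nearby replica on a small cluster, which is part of why chasing locality aggressively is not worth the effort before the next compaction.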

As for your example, if all data for a given row is inserted at the same
time, then it is guaranteed to be in the same file. There is no alignment
guarantee with respect to HDFS blocks, though, so depending on the block size
and the amount of data in the file (and its distribution), it is possible for
a few entries to span an HDFS block boundary even though they are adjacent in
the file.
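The point about entries straddling block boundaries can be sketched as follows. An Accumulo file is written as one byte stream, and HDFS chops that stream into fixed-size blocks with no regard for key/value boundaries; the offsets below are made up purely for illustration:

```python
# Illustrative sketch: HDFS splits a file's byte stream into fixed-size
# blocks, so a serialized key/value pair near a boundary can straddle
# two blocks. Offsets here are invented for the example.

BLOCK = 64 * 1024 * 1024  # 64 MB HDFS block


def blocks_touched(offset, length, block=BLOCK):
    """Return the HDFS block indices a record at [offset, offset+length) occupies."""
    first = offset // block
    last = (offset + length - 1) // block
    return list(range(first, last + 1))


# A 2 KB entry wholly inside the first block:
print(blocks_touched(10_000, 2_048))         # [0]
# A 2 KB entry straddling the first block boundary:
print(blocks_touched(BLOCK - 1_000, 2_048))  # [0, 1]
```

So "same file" does not imply "same block": a handful of entries in any large file will inevitably sit across a 64MB boundary.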

Using the input format, unless you override the automatic adjustment of
ranges, you will get one mapper per tablet. If you disable auto-adjustment,
then you get one mapper per range you specify.
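The mapper-count rule above can be sketched as a simplified model (this is not the real AccumuloInputFormat implementation; tablets and ranges are modeled as half-open row intervals):

```python
# Simplified model of how ranges become map tasks (not the real
# AccumuloInputFormat code). Tablets and ranges are half-open intervals.

def input_splits(ranges, tablets, auto_adjust=True):
    """Return the number of map tasks for the given query ranges.

    auto_adjust=True  -> ranges are clipped to tablet boundaries,
                         yielding one split per overlapped tablet.
    auto_adjust=False -> one split per supplied range, used as-is.
    """
    if not auto_adjust:
        return len(ranges)
    splits = set()
    for r_start, r_end in ranges:
        for t_start, t_end in tablets:
            if r_start < t_end and t_start < r_end:  # range overlaps tablet
                splits.add((max(r_start, t_start), min(r_end, t_end)))
    return len(splits)


tablets = [(0, 10), (10, 20), (20, 30)]
# One range covering three tablets:
print(input_splits([(5, 25)], tablets, auto_adjust=True))   # 3 mappers
print(input_splits([(5, 25)], tablets, auto_adjust=False))  # 1 mapper
```

With adjustment on, a single wide range fans out to one mapper per tablet it touches; with it off, you control the task count directly by how many ranges you hand in.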

Hope this helps. Let me know if you have other questions.


On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <> wrote:

> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies
> if it was already asked, in which case please forward me a link to the
> answers. Thank you.
> If my environment set up is as follows:
> -64MB HDFS block
> -5 tablet servers
> -10 tablets of size 1GB each per tablet server
> If I have a table like below:
> rowA | f1 | q1 | v1
> rowA | f1 | q2 | v2
> rowB | f1 | q1 | v3
> rowC | f1 | q1 | v4
> rowC | f2 | q1 | v5
> rowC | f3 | q3 | v6
> From the little documentation, I know all data about rowA will go to one
> tablet, which may or may not contain data about other rows, i.e. it's all
> or none. So my questions are:
> How are the tablets mapped to a Datanode or HDFS block? Obviously, One
> tablet is split into multiple HDFS blocks (8 in this case) so would they be
> stored on the same or different datanode(s) or does it not matter?
> In the example above, would all data about RowC (or A or B) go onto the
> same HDFS block or different HDFS blocks?
> When executing a map reduce job how many mappers would I get? (one per
> hdfs block? or per tablet? or per server?)
> Thank you in advance for any and all suggestions.
