accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Vines <vi...@apache.org>
Subject Re: Map Reduce on accumulo
Date Wed, 05 Dec 2012 01:36:21 GMT
Your first two presumptions are correct. You will get 3 mappers and each
mapper will have data for only one tablet.

Each mapper will function exactly as a scanner for the range of the tablet,
so you will get things in lexicographical order. So the mapper for tablet A
will get all items for rowA in order before getting items for rowB.

John


On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <aji1705@gmail.com> wrote:

> Thank you John for your response. I do have a few followup questions.
> Let me use a better example. Lets say my table and tabletserver
> distributions are as follows:
>
> ---------------------------------------------
> MyTable:
>
> rowA | f1 | q1 | v1
> rowA | f2 | q2 | v2
> rowA | f3 | q3 | v3
>
> rowB | f1 | q1 | v1
> rowB | f1 | q2 | v2
>
> rowC | f1 | q1 | v1
>
> rowD | f1 | q1 | v1
> rowD | f1 | q2 | v2
>
> rowE | f1 | q1 | v1
>
> ---------------------------------------------
>
> TabletServer1: Tablet A: rowA, rowC
> TabletServer2: Tablet B: rowB
> TabletServer2: Tablet C: rowD
>
> --------------------------------------------
>
> In this example, if I have a map reduce job that reads from the table
> above and writes to MyTable2 table using
> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat *
> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*.
>
> Lets not focus on what the map reduce job itself is. From
> your explanation below sounds like if autosplitting is not overriden then
> we get *three mappers* total. Is that right?
>
> Further, I will be right in assuming that a mapper will NOT get data from
> multiple tablets. Correct?
>
> I am also very confused on what the *order of input to the mapper* will
> be. Would mapper_at_tabletA get
> -all data from rowA before it gets all data from rowC or
> -all data from rowC before it gets all data from rowA or
> -something like:
>    rowA | f1 | q1 | v1
>    rowA | f2 | q2 | v2
>    rowC | f1 | q1 | v1
>    rowA | f3 | q3 | v3
>
> I know these are a lot of question but I really like to get a good
> understanding of the architecture. Thank you!
> Aji
>
>
>
> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <vines@apache.org> wrote:
>
>> A tablet consists of both an in memory portion and 0 to many files in
>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
>> boost to the natural locality you get when you write data to HDFS, but if a
>> tablet migrates that locality could be lost until data is compacted
>> (rewritten). Locality could be retained due to data replication, but
>> Accumulo does not make extraordinary effort to attempt to get a little bit
>> of locality, as data will eventually be rewritten and locality restored.
>>
>> As for your example, if all data for a given row is inserted at the same
>> time, then it is guaranteed to be in the same file. There is no atomicity
>> guarantee regarding HDFS blocks though, so depending on the block size and
>> the amount of data in the file (and it's distribution), it is possible for
>> a few entries to span files even though they are adjacent.
>>
>> Using the input format, unless you override the autosplitting in it, you
>> will get 1 mapper per tablet. If you disable auto-splitting, then you get
>> one mapper per range you specify.
>>
>> Hope this helps, let me know if you have other questions or need
>> clarification.
>>
>> John
>>
>>
>>
>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <aji1705@gmail.com> wrote:
>>
>>> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies
>>> if it was already asked in which case please forward me a link to the
>>> answers.Thank you.
>>>
>>> If my environment set up is as follows:
>>> -64MB HDFS block
>>> -5 tablet servers
>>> -10 tablets of size 1GB each per tablet server
>>>
>>> If I have a table like below:
>>> rowA | f1 | q1 | v1
>>> rowA | f1 | q2 | v2
>>>
>>> rowB | f1 | q1 | v3
>>>
>>> rowC | f1 | q1 | v4
>>> rowC | f2 | q1 | v5
>>> rowC | f3 | q3 | v6
>>>
>>> From the little documentation, I know all data about rowA will go one
>>> tablet which may or may not contain data about other rows ie its all or
>>> none. So my questions are:
>>>
>>> How are the tablets mapped to a Datanode or HDFS block? Obviously, One
>>> tablet is split into multiple HDFS blocks (8 in this case) so would they be
>>> stored on the same or different datanode(s) or does it not matter?
>>>
>>> In the example above, would all data about RowC (or A or B) go onto the
>>> same HDFS block or different HDFS blocks?
>>>
>>> When executing a map reduce job how many mappers would I get? (one per
>>> hdfs block? or per tablet? or per server?)
>>>
>>> Thank you in advance for any and all suggestions.
>>>
>>
>>
>

Mime
View raw message