accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aji Janis <aji1...@gmail.com>
Subject Re: Map Reduce on accumulo
Date Fri, 07 Dec 2012 14:56:58 GMT
Thank you!

On Fri, Dec 7, 2012 at 9:51 AM, Billie Rinaldi <billie@apache.org> wrote:

> On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <aji1705@gmail.com> wrote:
>
>> Thank you for the clarification. You mentioned that "Using the input
>> format, unless you override the autosplitting in it, you will get 1 mapper
>> per tablet." in your initial response. Again, pardon me for the newbie
>> question, but how do I find out if autosplitting is overriden or not?
>>
>
> You override autosplitting with the command
> AccumuloInputFormat.disableAutoAdjustRanges(Configuration).  So if you
> haven't done that, it will fit mappers to tablets.
>
> Billie
>
>
>
>>
>> Aji
>>
>>
>>
>> On Tue, Dec 4, 2012 at 8:36 PM, John Vines <vines@apache.org> wrote:
>>
>>> Your first two presumptions are correct. You will get 3 mappers and each
>>> mapper will have data for only one tablet.
>>>
>>> Each mapper will function exactly as a scanner for the range of the
>>> tablet, so you will get things in lexicographical order. So the mapper for
>>> tablet A will get all items for rowA in order before getting items for rowB.
>>>
>>> John
>>>
>>>
>>>
>>> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <aji1705@gmail.com> wrote:
>>>
>>>> Thank you John for your response. I do have a few followup questions.
>>>> Let me use a better example. Lets say my table and tabletserver
>>>> distributions are as follows:
>>>>
>>>> ---------------------------------------------
>>>> MyTable:
>>>>
>>>> rowA | f1 | q1 | v1
>>>> rowA | f2 | q2 | v2
>>>> rowA | f3 | q3 | v3
>>>>
>>>> rowB | f1 | q1 | v1
>>>> rowB | f1 | q2 | v2
>>>>
>>>> rowC | f1 | q1 | v1
>>>>
>>>> rowD | f1 | q1 | v1
>>>> rowD | f1 | q2 | v2
>>>>
>>>> rowE | f1 | q1 | v1
>>>>
>>>> ---------------------------------------------
>>>>
>>>> TabletServer1: Tablet A: rowA, rowC
>>>> TabletServer2: Tablet B: rowB
>>>> TabletServer2: Tablet C: rowD
>>>>
>>>> --------------------------------------------
>>>>
>>>> In this example, if I have a map reduce job that reads from the table
>>>> above and writes to MyTable2 table using
>>>> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat *
>>>> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*.
>>>>
>>>> Lets not focus on what the map reduce job itself is. From
>>>> your explanation below sounds like if autosplitting is not overriden then
>>>> we get *three mappers* total. Is that right?
>>>>
>>>> Further, I will be right in assuming that a mapper will NOT get data
>>>> from multiple tablets. Correct?
>>>>
>>>> I am also very confused on what the *order of input to the mapper*will be.
Would mapper_at_tabletA get
>>>> -all data from rowA before it gets all data from rowC or
>>>> -all data from rowC before it gets all data from rowA or
>>>> -something like:
>>>>    rowA | f1 | q1 | v1
>>>>    rowA | f2 | q2 | v2
>>>>    rowC | f1 | q1 | v1
>>>>    rowA | f3 | q3 | v3
>>>>
>>>> I know these are a lot of question but I really like to get a good
>>>> understanding of the architecture. Thank you!
>>>> Aji
>>>>
>>>>
>>>>
>>>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <vines@apache.org> wrote:
>>>>
>>>>> A tablet consists of both an in memory portion and 0 to many files in
>>>>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
>>>>> boost to the natural locality you get when you write data to HDFS, but
if a
>>>>> tablet migrates that locality could be lost until data is compacted
>>>>> (rewritten). Locality could be retained due to data replication, but
>>>>> Accumulo does not make extraordinary effort to attempt to get a little
bit
>>>>> of locality, as data will eventually be rewritten and locality restored.
>>>>>
>>>>> As for your example, if all data for a given row is inserted at the
>>>>> same time, then it is guaranteed to be in the same file. There is no
>>>>> atomicity guarantee regarding HDFS blocks though, so depending on the
block
>>>>> size and the amount of data in the file (and it's distribution), it is
>>>>> possible for a few entries to span files even though they are adjacent.
>>>>>
>>>>> Using the input format, unless you override the autosplitting in it,
>>>>> you will get 1 mapper per tablet. If you disable auto-splitting, then
you
>>>>> get one mapper per range you specify.
>>>>>
>>>>> Hope this helps, let me know if you have other questions or need
>>>>> clarification.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <aji1705@gmail.com> wrote:
>>>>>
>>>>>> NOTE: I am fairly sure this hasn't been asked on here yet - my
>>>>>> apologies if it was already asked in which case please forward me
a link to
>>>>>> the answers.Thank you.
>>>>>>
>>>>>> If my environment set up is as follows:
>>>>>> -64MB HDFS block
>>>>>> -5 tablet servers
>>>>>> -10 tablets of size 1GB each per tablet server
>>>>>>
>>>>>> If I have a table like below:
>>>>>> rowA | f1 | q1 | v1
>>>>>> rowA | f1 | q2 | v2
>>>>>>
>>>>>> rowB | f1 | q1 | v3
>>>>>>
>>>>>> rowC | f1 | q1 | v4
>>>>>> rowC | f2 | q1 | v5
>>>>>> rowC | f3 | q3 | v6
>>>>>>
>>>>>> From the little documentation, I know all data about rowA will go
one
>>>>>> tablet which may or may not contain data about other rows ie its
all or
>>>>>> none. So my questions are:
>>>>>>
>>>>>> How are the tablets mapped to a Datanode or HDFS block? Obviously,
>>>>>> One tablet is split into multiple HDFS blocks (8 in this case) so
would
>>>>>> they be stored on the same or different datanode(s) or does it not
matter?
>>>>>>
>>>>>> In the example above, would all data about RowC (or A or B) go onto
>>>>>> the same HDFS block or different HDFS blocks?
>>>>>>
>>>>>> When executing a map reduce job how many mappers would I get? (one
>>>>>> per hdfs block? or per tablet? or per server?)
>>>>>>
>>>>>> Thank you in advance for any and all suggestions.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message