hbase-user mailing list archives

From Mathias De Maré <mathias.dem...@gmail.com>
Subject Re: Why the map input records are equal to the map output records
Date Wed, 12 Aug 2009 14:06:11 GMT
Please send your reply to the hbase mailing list as well.

On Wed, Aug 12, 2009 at 3:37 PM, Xine Jar <xinejar22@googlemail.com> wrote:

> Your proposed architecture is clear. You try to split the calculation load
> between the mapper and the reducer, and even create another job for the
> final calculation.
>
> But if I keep reading in the mapper the way I am doing it now, this means
> that in order to sort all the temperature values, I still need to read N*N
> records in total, where N is the number of records in the table. Right? So
> my initial problem persists! Isn't it possible to read the table only
> once?


Yes, that's possible, and that's what I proposed.
The map function gets called once for each record.
Your map function could be changed to something simple, like this (pseudocode):

map(key, value) {
    // 'value' is a single row; emit its temperature as the output key
    output.collect(value.temperature, value);
}
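
In Java, a rough version of that mapper could look like the sketch below. I'm
assuming the same old-style TableMap API your code below already uses (in HBase
0.20 TableMap is an interface, in 0.19 it's an abstract class you extend
directly), and the class name is just illustrative; the column names come from
your snapshot. You would wire it up with TableMapReduceUtil.initTableMapJob(...)
so that HBase feeds the table to the mapper one row at a time:

import java.io.IOException;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: emits (temperature, reading) for each row of the table.
public class TemperatureMap extends MapReduceBase
    implements TableMap<Text, Text> {

  public void map(ImmutableBytesWritable key, RowResult value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // The framework hands this method exactly one row per call,
    // so there is no scanner here and no extra pass over the table.
    String type = new String(value.get(Bytes.toBytes("cf:Type")).getValue());
    if (type.equals("temperature")) {
      String reading =
          new String(value.get(Bytes.toBytes("cf:Value")).getValue()).trim();
      // Key by the temperature so equal readings meet in the same reduce.
      output.collect(new Text(reading), new Text(reading));
    }
  }
}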

Then a reduce function (the key is the temperature you passed along in the map
function):

reduce(key, List values) {
    calculate_average(values);
}
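
Again in rough Java, and again only a sketch: this version doesn't compute the
average directly, it writes one "sum count" pair per temperature so the second
job has partial results to combine (if you'd rather finish the per-temperature
average here, just divide before collecting):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: one "sum count" output line per distinct temperature.
public class TemperatureReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      sum += Double.parseDouble(values.next().toString().trim());
      count++;
    }
    // Partial result; the second job folds these into the final answer.
    output.collect(key, new Text(sum + " " + count));
  }
}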

And then a second job, like I mentioned.
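
For that second job, something along these lines could work. I'm assuming the
first job wrote plain text lines of the form "temperature <tab> sum count", and
I re-key everything onto one constant key so that a single reducer sees all the
partials; that re-keying is my own addition, since a strictly pass-through
mapper would only work if the first job already wrote a single shared key:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: re-key every partial result under one constant key.
public class PassThroughMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // 'line' is one "temperature <tab> sum count" row from the first job.
    output.collect(new Text("all"), line);
  }
}

// Sketch only: fold all the partial sums and counts into one average.
public class AverageReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      // The last two whitespace-separated columns are "sum" and "count".
      String[] parts = values.next().toString().trim().split("\\s+");
      sum += Double.parseDouble(parts[parts.length - 2]);
      count += Long.parseLong(parts[parts.length - 1]);
    }
    output.collect(new Text("average"), new Text(Double.toString(sum / count)));
  }
}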

Mathias


>
>
> For example, if I were using a text file as input instead of an HBase
> table, the standard way to do it is to copy the file into a string,
> tokenize it, GO THROUGH IT ONCE, and filter out whatever I want!
>
> Is a similar approach, going through the table only once, not possible
> with an HBase table?!
>
>
>
> Regards,
> CJ
>
>
> 2009/8/12 Mathias De Maré <mathias.demare@gmail.com>
>
>> Well, I think you could probably take care of the issue by using a
>> somewhat different architecture.
>>
>> If I understand correctly, you take all of the values with the same
>> temperature together. This is in fact a Reduce operation.
>>
>> You could structure it as follows:
>> - Read in like you do now, but make your Map simpler. For each map call (so
>> for each record), write out the temperature as the key and the record as
>> the value.
>> - Each reducer will then have a list of records, each with the same
>> temperature. You can sum the entries in the list and write everything out.
>> Then you will have one combined result per temperature.
>>
>> You could then start a second job that has a pass-through Mapper, and then
>> do your final calculation in the Reducer.
>>
>> Does it sound like I'm making sense to some degree? :-)
>>
>> Mathias
>>
>>
>>> On Wed, Aug 12, 2009 at 2:38 PM, Xine Jar <xinejar22@googlemail.com> wrote:
>>
>>> Aha! I understand!
>>> So basically this is the reason why I am getting 100 written Map output
>>> records: the mapper is calling the collect() of the OutputCollector 100
>>> times, i.e. once per record in the table.
>>>
>>> In this case I assume I have to pass the HBase table itself, instead of
>>> the records, as the input to the mapper, right? Is there such a Java
>>> example you could point out for me?
>>>
>>> Regards,
>>> CJ
>>>
>>>
>>> 2009/8/12 Mathias De Maré <mathias.demare@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> On Tue, Aug 11, 2009 at 6:27 PM, Xine Jar <xinejar22@googlemail.com> wrote:
>>>>>
>>>>> A snapshot of the Mapper:
>>>>>
>>>>> public void map(ImmutableBytesWritable key, RowResult value,
>>>>>     OutputCollector<Text, Text> output, Reporter reporter)
>>>>>     throws IOException {
>>>>>   double numberreadings = 0;
>>>>>   double sumreadings = 0;
>>>>>
>>>>>   if (table == null)
>>>>>     throw new IOException("table is null");
>>>>>
>>>>>   // set up a scanner
>>>>>   Scanner scanner = table.getScanner(new String[] {"cf:Value",
>>>>>       "cf:Type", "cf:TimeStamp", "cf:Latitude", "cf:Longitude",
>>>>>       "cf:SensorNode"});
>>>>>   RowResult rowresult = scanner.next();
>>>>>
>>>>>   // scan the table, filter out the temperature values, and count them
>>>>>   while (rowresult != null) {
>>>>>     String stringtype =
>>>>>         new String(rowresult.get(Bytes.toBytes("cf:Type")).getValue());
>>>>>
>>>>>     if (stringtype.equals("temperature")) {
>>>>>       // sum the reading values
>>>>>       String stringval =
>>>>>           new String(rowresult.get(Bytes.toBytes("cf:Value")).getValue());
>>>>>       double doubleval = Double.parseDouble(stringval.trim());
>>>>>       sumreadings = sumreadings + doubleval;
>>>>>
>>>>>       // count the number of readings
>>>>>       numberreadings = numberreadings + 1;
>>>>>     }
>>>>>     rowresult = scanner.next();
>>>>>   }
>>>>>
>>>>>   scanner.close();
>>>>>
>>>>>   // send the sum of the values as well as their number
>>>>>   String strsumreadings = Double.toString(sumreadings);
>>>>>   String strnumberreadings = Double.toString(numberreadings);
>>>>>   String strmapoutvalue = strsumreadings + " " + strnumberreadings;
>>>>>
>>>>>   mapoutputvalue.set(strmapoutvalue);
>>>>>   output.collect(mapoutputkey, mapoutputvalue);
>>>>> }
>>>>>
>>>>>
>>>>> Questions:
>>>>> 1. For 100 records, I noticed that I have 1 map task and 1 reduce task,
>>>>> and the job finishes after 12 seconds. When I extend the number of
>>>>> records in the HTable to 10,000, I still have 1 map and 1 reduce task,
>>>>> but the job finishes after 1 hour!
>>>>> The mapper is incredibly slow; what is so heavy in my code?
>>>>>
>>>>
>>>> From your code, it looks like you are using the HBase records as input
>>>> for the mapper. Then, for each record, you go through the entire table
>>>> again, so you do N scans of the HBase table, and read in total N*N records.
>>>> That's what's heavy in your code.
>>>>
>>>> Mathias
>>>>
>>>>
>>>
>>
>
