hbase-user mailing list archives

From Xine Jar <xineja...@googlemail.com>
Subject Re: Why the map input records are equal to the map output records
Date Wed, 12 Aug 2009 16:57:47 GMT
For my own information, is there a way to verify that the job did not read the
table several times? Should the number of map output records be equal to the
number of records in the table, or not necessarily?

Thank you,


2009/8/12 Mathias De Maré <mathias.demare@gmail.com>

> Please send your reply to the hbase mailing list as well.
>
> On Wed, Aug 12, 2009 at 3:37 PM, Xine Jar <xinejar22@googlemail.com> wrote:
>
>> Your proposed architecture is clear. You split the calculation load between
>> the mapper and the reducer, and even create another job for the final
>> calculation.
>>
>> But if I keep reading the table in the mapper the way I am doing it now,
>> then in order to sort all the temperature values I still read N*N records,
>> where N is the number of records in the table. Right? So my initial problem
>> persists! Isn't it possible to read the table only once?
>
>
> Yes, that's possible, and that's what I proposed.
> The map function gets called for each record.
> Your map function could be changed to something simple, like (pseudocode):
>
> map(key, value) {
>     // the temperature comes from the row value, not the row key
>     output.collect(value.temperature, value);
> }
>
> Then a reduce function (the key is the temperature you passed in the map
> function):
>
> reduce(key, List values) {
>     calculate_average();
> }
>
> And then a second job, like I mentioned.
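>
> To make the pseudocode concrete, here is a minimal sketch in Java,
> assuming the same old org.apache.hadoop.hbase.mapred API as the mapper
> quoted below. The class names and the "partial" output key are
> illustrative, and the two classes go in separate files:
>
> import java.io.IOException;
> import java.util.Iterator;
>
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.io.RowResult;
> import org.apache.hadoop.hbase.mapred.TableMap;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reducer;
> import org.apache.hadoop.mapred.Reporter;
>
> public class TemperatureMap extends MapReduceBase
>     implements TableMap<Text, Text> {
>
>   // Called once per row by the framework: no Scanner here, so the
>   // table is read exactly once per job.
>   public void map(ImmutableBytesWritable key, RowResult value,
>       OutputCollector<Text, Text> output, Reporter reporter)
>       throws IOException {
>     String type =
>         Bytes.toString(value.get(Bytes.toBytes("cf:Type")).getValue());
>     if (type.equals("temperature")) {
>       String reading =
>           Bytes.toString(value.get(Bytes.toBytes("cf:Value")).getValue());
>       // The temperature becomes the key, so equal readings end up
>       // in the same reduce call.
>       output.collect(new Text(reading.trim()),
>           new Text(Bytes.toString(key.get())));
>     }
>   }
> }
>
> public class TemperatureReduce extends MapReduceBase
>     implements Reducer<Text, Text, Text, Text> {
>
>   // key is one temperature; values are all records sharing it.
>   public void reduce(Text key, Iterator<Text> values,
>       OutputCollector<Text, Text> output, Reporter reporter)
>       throws IOException {
>     long count = 0;
>     while (values.hasNext()) {
>       values.next();
>       count++;
>     }
>     // Emit "temperature count" under one fixed key so the second
>     // job sees all partial results together.
>     output.collect(new Text("partial"),
>         new Text(key.toString() + " " + count));
>   }
> }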
>
> Mathias
>
>
>>
>>
>> For example, if I were using a text file as input instead of an HBase
>> table, the standard way to do it would be to copy the file into a string,
>> tokenize it, go through it ONCE, and filter out whatever I want!
>>
>> Is a similar approach, going through the table only once, not possible
>> with an HBase table?
>>
>>
>>
>> Regards,
>> CJ
>>
>>
>> 2009/8/12 Mathias De Maré <mathias.demare@gmail.com>
>>
>>> Well, I think you could probably take care of the issue by using a
>>> somewhat different architecture.
>>>
>>> If I understand correctly, you take all of the values with the same
>>> temperature together. That is in fact a Reduce operation.
>>>
>>> You could structure it as follows:
>>> - Read in like you do now, but make your Map simpler. For each map call
>>> (so for each record), write out the temperature as the key and the
>>> record as the value.
>>> - Each reducer will then have a list of records, all with the same
>>> temperature. You can sum the entries in the list and write everything
>>> out, giving you one combined result per temperature.
>>>
>>> You could then start a second job that has a pass-through Mapper, and
>>> do your final calculation in its Reducer; a sketch follows.
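>>>
>>> For the pass-through Mapper, org.apache.hadoop.mapred.lib.IdentityMapper
>>> should work as-is. As a sketch of the final calculation, assuming the
>>> first job emitted "temperature count" pairs under a single key and the
>>> goal is the overall average (the exact record layout is up to you):
>>>
>>> import java.io.IOException;
>>> import java.util.Iterator;
>>>
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.MapReduceBase;
>>> import org.apache.hadoop.mapred.OutputCollector;
>>> import org.apache.hadoop.mapred.Reducer;
>>> import org.apache.hadoop.mapred.Reporter;
>>>
>>> public class AverageReduce extends MapReduceBase
>>>     implements Reducer<Text, Text, Text, Text> {
>>>
>>>   public void reduce(Text key, Iterator<Text> values,
>>>       OutputCollector<Text, Text> output, Reporter reporter)
>>>       throws IOException {
>>>     double sum = 0;
>>>     long count = 0;
>>>     // Each value is one "temperature count" pair from the first job.
>>>     while (values.hasNext()) {
>>>       String[] parts = values.next().toString().split(" ");
>>>       double temperature = Double.parseDouble(parts[0]);
>>>       long n = Long.parseLong(parts[1]);
>>>       sum += temperature * n;
>>>       count += n;
>>>     }
>>>     output.collect(new Text("average"),
>>>         new Text(Double.toString(sum / count)));
>>>   }
>>> }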
>>>
>>> Does it sound like I'm making sense to some degree? :-)
>>>
>>> Mathias
>>>
>>>
>>> On Wed, Aug 12, 2009 at 2:38 PM, Xine Jar <xinejar22@googlemail.com> wrote:
>>>
>>>> Aha, I understand!
>>>> So basically this is the reason why I am getting 100 written map output
>>>> records: the mapper calls the OutputCollector's collect() 100 times, once
>>>> per record in the table.
>>>>
>>>> In this case I assume I have to pass the HBase table itself, instead of
>>>> the records, as the input to the mapper, right? Is there a Java example
>>>> you could point me to?
>>>>
>>>> Regards,
>>>> CJ
>>>>
>>>>
>>>> 2009/8/12 Mathias De Maré <mathias.demare@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> On Tue, Aug 11, 2009 at 6:27 PM, Xine Jar <xinejar22@googlemail.com> wrote:
>>>>>>
>>>>>> A snapshot of the Mapper:
>>>>>>
>>>>>> public void map(ImmutableBytesWritable key, RowResult value,
>>>>>>     OutputCollector<Text, Text> output, Reporter reporter)
>>>>>>     throws IOException {
>>>>>>   double numberreadings = 0;
>>>>>>   double sumreadings = 0;
>>>>>>
>>>>>>   if (table == null)
>>>>>>     throw new IOException("table is null");
>>>>>>
>>>>>>   // set up a scanner
>>>>>>   Scanner scanner = table.getScanner(new String[] {"cf:Value",
>>>>>>       "cf:Type", "cf:TimeStamp", "cf:Latitude", "cf:Longitude",
>>>>>>       "cf:SensorNode"});
>>>>>>   RowResult rowresult = scanner.next();
>>>>>>
>>>>>>   // scan the table, filter out the values, and count them
>>>>>>   while (rowresult != null) {
>>>>>>     String stringtype = new String(
>>>>>>         rowresult.get(Bytes.toBytes("cf:Type")).getValue());
>>>>>>
>>>>>>     if (stringtype.equals("temperature")) {
>>>>>>       // sum the reading values
>>>>>>       String stringval = new String(
>>>>>>           rowresult.get(Bytes.toBytes("cf:Value")).getValue());
>>>>>>       double doubleval = Double.parseDouble(stringval.trim());
>>>>>>       sumreadings = sumreadings + doubleval;
>>>>>>
>>>>>>       // count the readings
>>>>>>       numberreadings = numberreadings + 1;
>>>>>>     }
>>>>>>     rowresult = scanner.next();
>>>>>>   }
>>>>>>
>>>>>>   scanner.close();
>>>>>>
>>>>>>   // emit the sum of the values together with their count
>>>>>>   String strsumreadings = Double.toString(sumreadings);
>>>>>>   String strnumberreadings = Double.toString(numberreadings);
>>>>>>   String strmapoutvalue = strsumreadings + " " + strnumberreadings;
>>>>>>
>>>>>>   mapoutputvalue.set(strmapoutvalue);
>>>>>>   output.collect(mapoutputkey, mapoutputvalue);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Questions:
>>>>>> 1. For 100 records, I noticed that I have 1 map task and 1 reduce task,
>>>>>> and the job finishes after 12 seconds. When I extend the number of
>>>>>> records in the htable to 10,000, I still have 1 map task and 1 reduce
>>>>>> task, and the job finishes after 1 hour!
>>>>>> The mapper is incredibly slow; what is so heavy in my code?
>>>>>>
>>>>>
>>>>> From your code, it looks like you are using the HBase records as input
>>>>> for the mapper. Then, for each record, you go through the entire table
>>>>> again, so you do N scans of the HBase table and read N*N records in total.
>>>>> That's what's heavy in your code.
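>>>>>
>>>>> A sketch of the fix, with assumptions: let the table input format do
>>>>> the single scan by configuring the job with initTableMapJob (that is
>>>>> TableMapReduceUtil.initTableMapJob in 0.20; older releases used
>>>>> TableMap.initJob), and drop the Scanner from the mapper so each map()
>>>>> call only handles the row it is given. The table and class names here
>>>>> are illustrative:
>>>>>
>>>>> JobConf job = new JobConf(AverageTemperature.class);
>>>>> // One scan over the table; map() is then called once per row.
>>>>> TableMapReduceUtil.initTableMapJob("sensorTable",
>>>>>     "cf:Type cf:Value",        // space-separated columns to fetch
>>>>>     TemperatureMap.class, Text.class, Text.class, job);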
>>>>>
>>>>> Mathias
>>>>>
>>>>>
>>>>
>>>
>>
>
