hbase-user mailing list archives

From Marcos Ortiz <mlor...@uci.cu>
Subject Re: Efficient way to read a large number of files in S3 and upload their content to HBase
Date Thu, 24 May 2012 20:53:35 GMT


On 05/24/2012 04:47 PM, Amandeep Khurana wrote:
> Thanks for that description. I'm not entirely sure why you want to use
> HBase here. You've got logs coming in that you want to process in batch
> to do calculations on. This can be done by running MR jobs on the flat
> files themselves. You could use Java MR, Hive or Pig to accomplish this.
> Why do you want HBase here?
The main reason to use HBase is the quantity of rows involved in the
process. It could provide an efficient and "quick" way to store all of this.
Hive could be an option too.

I will discuss all this again with the dev team.
Thanks a lot for your answers.
>
> -ak
>
> On Thursday, May 24, 2012 at 12:52 PM, Marcos Ortiz wrote:
>
>>
>>
>> On 05/24/2012 03:21 PM, Amandeep Khurana wrote:
>>> Marcos
>>>
>>> Can you elaborate on your use case a little bit? What is the nature of
>>> the data in S3, and why do you want to use HBase? Why do you want to
>>> combine HFiles and upload them back to S3? It'll help us answer your
>>> questions better.
>>>
>>> Amandeep
>> OK, let me explain more.
>> We are working on an ad optimization platform on top of Hadoop and HBase.
>> Another team in my organization creates a log file for every user click
>> and stores it in S3. I suggested to them that a better approach is to
>> store this "workflow" log in HBase instead of S3, because that way we
>> can drop the extra step of reading the file content from S3, building
>> the HFile and uploading it to HBase.
>>
>> The content of the file in S3 is the basic information for the operation:
>> - Source URL
>> - User Id
>> - User agent of the user
>> - Campaign id
>> and more fields.
>>
>> We then want to run MapReduce jobs on top of HBase to produce
>> calculations and reports from this data.
>>
>> We are evaluating HBase because our current solution is built on
>> PostgreSQL. The main issue is that when a campaign is launched on the
>> platform, the INSERTs and UPDATEs sent to PostgreSQL can rise within a
>> short time from 1 to 100 clicks per second. We did some preliminary
>> tests, and in two days the table where we store the "workflow" log
>> grew rapidly to 350,000 tuples, so this could become a problem.
>> For that reason, we want to migrate this to HBase.
>>
>> However, I think that generating a file in S3 and then uploading it
>> to HBase is not the best way to do this, because we could instead
>> create the workflow log entry for every user, build a Put for it and
>> write it to HBase directly. To avoid the locks, I am evaluating the
>> asynchronous API released by StumbleUpon. [1]
>>
>> What do you think about this?
>>
>> [1] https://github.com/stumbleupon/asynchbase
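
To make the "build a Put for every click" idea quoted above a bit more concrete, here is a minimal sketch in plain Python. It does not use the HBase client or asynchbase at all; it only shows how one click record could be turned into a row key plus a set of cells, which is the information a Put needs. The tab-separated layout, the timestamp field, the row key scheme and the column family name "d" are all assumptions made for illustration and do not come from the thread.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Click:
    source_url: str
    user_id: str
    user_agent: str
    campaign_id: str
    timestamp_ms: int  # assumed field; the thread does not say the log carries one


def parse_click(line: str) -> Click:
    """Parse one tab-separated log line (hypothetical format, not the real one)."""
    source_url, user_id, user_agent, campaign_id, ts = line.rstrip("\n").split("\t")
    return Click(source_url, user_id, user_agent, campaign_id, int(ts))


def to_row(click: Click) -> Tuple[bytes, Dict[bytes, bytes]]:
    """Return (row_key, cells) for one click.

    The row key groups clicks by campaign, then user, then time. This is
    an illustrative choice only; a real key design depends on the queries.
    """
    row_key = f"{click.campaign_id}|{click.user_id}|{click.timestamp_ms:013d}".encode()
    cells = {
        b"d:url": click.source_url.encode(),  # "d" is a hypothetical column family
        b"d:ua": click.user_agent.encode(),
    }
    return row_key, cells


if __name__ == "__main__":
    line = "http://example.com/a\tu42\tMozilla/5.0\tc7\t1337892815000\n"
    key, cells = to_row(parse_click(line))
    print(key, cells)
```

Whichever client ends up doing the write, the same (row key, cells) pair would be what gets handed to it, so this step can be tested without a running cluster.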
>>
>>>
>>>
>>> On May 24, 2012, at 12:19 PM, Marcos Ortiz <mlortiz@uci.cu> wrote:
>>>
>>>> Thanks a lot for your answer, Amandeep.
>>>>
>>>> On 05/24/2012 02:55 PM, Amandeep Khurana wrote:
>>>>> Marcos,
>>>>>
>>>>> You could do a distcp from S3 to HDFS and then do a bulk import
>>>>> into HBase.
>>>> The number of files is very large, so we want to combine some of them
>>>> and then build the HFile to upload to HBase.
>>>> Is there any example of a custom FileMerger for this?
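
As a rough idea of what such a file merger could look like, here is a minimal sketch in plain Python that works on a local directory. It groups small files into batches up to a size limit and concatenates each batch into one larger file. The function names, the `merged-NNNNN.log` naming and the size limit are hypothetical, and it assumes the files were already copied out of S3 and hold newline-delimited records; it is not the FileMerger from any library.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def plan_batches(paths: Iterable[Path], max_batch_bytes: int) -> Iterator[List[Path]]:
    """Group files into batches whose combined size does not exceed the limit.

    A single file larger than the limit still gets a batch of its own.
    """
    batch: List[Path] = []
    batch_bytes = 0
    for path in paths:
        size = path.stat().st_size
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        yield batch


def merge_batch(batch: List[Path], out_path: Path) -> None:
    """Concatenate the files of one batch, ending every file with a newline."""
    with out_path.open("wb") as out:
        for path in batch:
            data = path.read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")


def merge_small_files(src_dir: Path, dst_dir: Path, max_batch_bytes: int) -> List[Path]:
    """Merge all files in src_dir into larger files in dst_dir."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    merged: List[Path] = []
    files = sorted(p for p in src_dir.iterdir() if p.is_file())
    for i, batch in enumerate(plan_batches(files, max_batch_bytes)):
        out_path = dst_dir / f"merged-{i:05d}.log"
        merge_batch(batch, out_path)
        merged.append(out_path)
    return merged
```

Fewer, larger input files generally suit MapReduce better than millions of tiny ones, which is the motivation for merging before the HFiles are built.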
>>>>> Are you running HBase on EC2 or on your own hardware?
>>>> We have set up a small HBase cluster on our own hardware, but we want
>>>> to build another cluster on top of Amazon EC2. This could be very good
>>>> for the integration between S3 and the HBase cluster.
>>>>
>>>> Regards
>>>>> -Amandeep
>>>>>
>>>>>
>>>>> On Thursday, May 24, 2012 at 11:52 AM, Marcos Ortiz wrote:
>>>>>
>>>>>> Regards to everyone on the list.
>>>>>> We are using Amazon S3 to store millions of files in a certain
>>>>>> format, and we want to read the content of these files and then
>>>>>> upload it to an HBase cluster.
>>>>>> Has anyone done this?
>>>>>> Can you recommend an efficient way to do it?
>>>>>>
>>>>>> Best wishes.
>>>>>>
>>>>>> --
>>>>>> Marcos Luis Ortíz Valmaseda
>>>>>> Data Engineer & Sr. System Administrator at UCI
>>>>>> http://marcosluis2186.posterous.com
>>>>>> http://www.linkedin.com/in/marcosluis2186
>>>>>> Twitter: @marcosluis2186
>>>>>>
>>>>>>
>>>>>> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS
>>>>>> CIENCIAS INFORMATICAS...
>>>>>> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>>>>>>
>>>>>> http://www.uci.cu
>>>>>> http://www.facebook.com/universidad.uci
>>>>>> http://www.flickr.com/photos/universidad_uci
>>
>

-- 
Marcos Luis Ortíz Valmaseda
  Data Engineer & Sr. System Administrator at UCI
  http://marcosluis2186.posterous.com
  http://www.linkedin.com/in/marcosluis2186
  Twitter: @marcosluis2186


