hbase-user mailing list archives

From Krishna Kalyan <krishnakaly...@gmail.com>
Subject Re: Pig HBase integration
Date Mon, 29 Sep 2014 06:26:50 GMT
Thank you so much Serega.

Regards,
Krishna

On Sun, Sep 28, 2014 at 11:01 PM, Serega Sheypak <serega.sheypak@gmail.com>
wrote:

>
> https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
> I'm not sure exactly how Pig's HBaseStorage works. I suppose it would read
> all the data and then join it as a usual dataset, so you should expect
> serious HBase performance degradation during the read: a key-by-key read of
> the whole table.
> 1. So do the join in Pig.
> 2. First you load the data from the HBase table, then you operate on it. I
> don't see a case where you can use an HBase table directly in a join.
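> A minimal sketch of what that looks like in Pig Latin. The table name,
> column family, columns, and file paths below are illustrative assumptions,
> not taken from the thread:
>
> ```pig
> -- Load the location table from HBase; -loadKey exposes the rowkey
> -- (locationId) as the first field. Columns under cf: are assumed.
> locations = LOAD 'hbase://location'
>     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>         'cf:name cf:country', '-loadKey true')
>     AS (locationId:chararray, name:chararray, country:chararray);
>
> -- Load the large weblog dataset from HDFS (tab-separated, assumed schema).
> weblogs = LOAD '/data/weblogs' USING PigStorage('\t')
>     AS (locationId:chararray, url:chararray, ts:long);
>
> -- Plain join: by this point HBaseStorage has already materialized the
> -- HBase table as an ordinary dataset, so this is a regular shuffle join.
> joined = JOIN weblogs BY locationId, locations BY locationId;
> ```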
>
>
> 2014-09-28 17:02 GMT+04:00 Krishna Kalyan <krishnakalyan3@gmail.com>:
>
>>
>> We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
>> per record) and weblog (2-3 TB, approx 50 columns per record). We need to
>> join the data sets on locationId, which is present in both.
>>
>> We have 2 options:
>> 1. Keep both data sets in HDFS and JOIN them on locationId, maybe using
>> Pig.
>> 2. Since the JOIN will be on locationId, which is the primary key of the
>> location data set, store the location data set in HBase with locationId as
>> the rowkey, and then use a Pig query (via HBaseStorage) to join the weblog
>> data set with the location table.
>>
>> The reason to consider this idea is that reading data by key is fast in
>> HBase. However, we are not sure whether, in a JOIN of the 2 data sets, Pig
>> will internally fetch individual location records by key, or read through
>> the entire (or part of the) location table and then do the join. Based on
>> this we can make the choice.
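>> For reference, when the smaller relation fits in task memory, Pig also
>> offers a fragment-replicated join, which keeps everything in HDFS and
>> avoids both HBase reads and a reduce-side shuffle. A sketch, with assumed
>> paths and schemas:
>>
>> ```pig
>> weblogs   = LOAD '/data/weblogs'  USING PigStorage('\t')
>>     AS (locationId:chararray, url:chararray, ts:long);
>> locations = LOAD '/data/location' USING PigStorage('\t')
>>     AS (locationId:chararray, city:chararray);
>>
>> -- 'replicated' ships the small relation (listed last) to every map task,
>> -- so the join runs map-side with no reduce phase. The small relation must
>> -- fit in memory, so project/filter it down first if needed.
>> joined = JOIN weblogs BY locationId, locations BY locationId
>>     USING 'replicated';
>> ```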
>>
>> We are free to use HDFS or HBase for any input or output data set, please
>> advise which option can provide us better performance. Also if required,
>> please point us to some good article on this.
>>
>>
>> On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <serega.sheypak@gmail.com
>> > wrote:
>>
>>> store location to hdfs
>>> store weblog to hdfs
>>> join them
>>> use HBase bulk load tool to load join result to hbase.
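>>> The steps above can be sketched in Pig Latin, with the final bulk load
>>> done by the standard HBase tools. Paths, schemas, column mappings, and
>>> the table name are illustrative assumptions:
>>>
>>> ```pig
>>> -- Steps 1-2: both datasets sit in HDFS.
>>> locations = LOAD '/data/location' USING PigStorage('\t')
>>>     AS (locationId:chararray, city:chararray);
>>> weblogs   = LOAD '/data/weblogs'  USING PigStorage('\t')
>>>     AS (locationId:chararray, url:chararray, ts:long);
>>>
>>> -- Step 3: join on the shared key and keep only the fields to load.
>>> joined = JOIN weblogs BY locationId, locations BY locationId;
>>> result = FOREACH joined GENERATE weblogs::locationId, url, ts, city;
>>> STORE result INTO '/data/joined' USING PigStorage('\t');
>>>
>>> -- Step 4 (outside Pig): generate HFiles from the TSV output and hand
>>> -- them to the region servers, e.g.:
>>> --   hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>> --     -Dimporttsv.columns=HBASE_ROW_KEY,cf:url,cf:ts,cf:city \
>>> --     -Dimporttsv.bulk.output=/tmp/hfiles mytable /data/joined
>>> --   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
>>> --     /tmp/hfiles mytable
>>> ```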
>>>
>>> What's the reason to keep the location dataset in HBase and the weblogs
>>> in HDFS?
>>>
>>> You can expect a data-load performance improvement: for me it takes a few
>>> minutes to bulk load 500,000,000 records into a 10-node HBase cluster
>>> with a pre-split table.
>>>
>>> 2014-09-28 16:04 GMT+04:00 Krishna Kalyan <krishnakalyan3@gmail.com>:
>>>
>>>> Thanks Serega,
>>>>
>>>> Our use case details:
>>>> We have a location table stored in HBase with locationID as the
>>>> rowkey / join key.
>>>> We intend to join this table with a transactional weblog file in HDFS
>>>> (expected size around 2 TB).
>>>> The join query will be issued from Pig.
>>>> Can we expect a performance improvement compared with a plain MapReduce
>>>> approach?
>>>>
>>>> Regards,
>>>> Krishna
>>>>
>>>> On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <
>>>> serega.sheypak@gmail.com> wrote:
>>>>
>>>>> Depends on the dataset sizes and the HBase workload. The best way is
>>>>> to do the join in Pig, store the result, and then use the HBase bulk
>>>>> load tool.
>>>>> That's a general recommendation; I have no idea about your task details.
>>>>>
>>>>> 2014-09-27 7:32 GMT+04:00 Krishna Kalyan <krishnakalyan3@gmail.com>:
>>>>>
>>>>> > Hi,
>>>>> > We have a use case that involves ETL on data coming from several
>>>>> different
>>>>> > sources using pig.
>>>>> > We plan to store the final output table in HBase.
>>>>> > What will be the performance impact if we do a join with an external
>>>>> > CSV table using Pig?
>>>>> >
>>>>> > Regards,
>>>>> > Krishna
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>
