hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Looking for a better design
Date Fri, 25 Dec 2009 07:47:52 GMT
If you have the date of the crawl stored in the table, you could set a
filter on the Scan object to only scan the rows for a certain day.
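
To illustrate the idea, here is a minimal sketch in plain Java. It is not the HBase client API: a TreeMap stands in for a table sorted by row key, and the PageRow class, the crawlDate field, and the scanDay helper are all hypothetical names made up for this example. In HBase itself the equivalent would be a filter set on the Scan (for instance a SingleColumnValueFilter on the date column), assuming the crawl date is stored as a column value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a table sorted by row key; NOT the HBase client API.
class CrawlDayFilterSketch {

    // Hypothetical row: page content plus the date it was crawled (yyyy-MM-dd).
    static final class PageRow {
        final String crawlDate;
        final String content;

        PageRow(String crawlDate, String content) {
            this.crawlDate = crawlDate;
            this.content = content;
        }
    }

    // Walks every row in key order and keeps the keys whose crawl date matches,
    // which is roughly what a column-value filter attached to a scan does.
    static List<String> scanDay(TreeMap<String, PageRow> table, String day) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, PageRow> e : table.entrySet()) {
            if (day.equals(e.getValue().crawlDate)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeMap<String, PageRow> whole = new TreeMap<>();
        whole.put("com.example/a", new PageRow("2009-12-23", "old page"));
        whole.put("com.example/b", new PageRow("2009-12-24", "new page"));
        whole.put("org.example/c", new PageRow("2009-12-24", "new page"));
        // Prints the row keys crawled on the requested day.
        System.out.println(scanDay(whole, "2009-12-24"));
    }
}
```

Note that the loop still visits every row and only the result set shrinks. A filter on a scan behaves similarly on the server side, so this avoids the copy in step 3 but not the read over the whole table.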

Also just to be sure, are you using MapReduce to process the tables?

J-D

2009/12/24 Xin Jing <xinjing@beyondfun.net>:
> The reason for saving the new data into a temp table is that we provide the
> processed data in an incremental manner, delivering new data every day. But we
> may process the whole data set again some day on demand. If we save the new
> data into the whole table, it is hard for us to tell which pages are new. We
> can, of course, use a flag to mark the status of the data. But I am afraid
> that scanning some of the data out of a big database may hurt performance.
>
> Thanks
> - Xin
> ________________________________________
> From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> Sent: December 25, 2009 3:39 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Looking for a better design
>
> What's the reason for first importing into a temp table and not
> directly into the whole table?
>
> Also to improve performance I recommend reading
> http://wiki.apache.org/hadoop/PerformanceTuning
>
> J-D
>
> 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
>> Hi All,
>>
>> We are processing a large number of web pages, crawling about 2 million pages
>> from the internet every day. After processing the new data, we save all of it.
>>
>> Our current design is:
>> 1. create a temp table and a whole table with exactly the same structure
>> 2. import the new data into the temp table, and process it
>> 3. dump all the data from the temp table into the whole table
>> 4. clean the temp table
>>
>> It works, but the performance is not good: step 3 takes a loooong time. We use
>> MapReduce to transfer the data from the temp table into the whole table, but
>> it is too slow. We think there might be something wrong in our design, so I am
>> looking for a better design for this task, or some hints on the processing.
>>
>> Thanks
>> - Xin
>>
>
>
