hbase-user mailing list archives

From Aram Mkhitaryan <aram.mkhitar...@googlemail.com>
Subject Re: Looking for a better design
Date Sat, 26 Dec 2009 15:50:59 GMT
Hi All,

I'm new to Hadoop and HBase, and I would be grateful if you could explain
how to set up a job so that its data is read from HBase tables in
map-reduce tasks, and how to tell the system to run more than one task.
Is there an article that covers this kind of thing?

Thank you very much,
Merry Christmas and Happy New Year,
Aram
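
For reference, a minimal sketch of reading an HBase table from a MapReduce job,
assuming the org.apache.hadoop.hbase.mapreduce API (TableMapReduceUtil); the
table name "webpages" is a placeholder, and exact class names can differ a
little between HBase versions. TableInputFormat creates one map task per region
of the table, which is how a job ends up with more than one task.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadPagesJob {

  // TableInputFormat hands each map task the rows of one region; map() is
  // called once per row with the row key and that row's cells.
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx) {
      ctx.getCounter("pages", "rows").increment(1);  // just count what we see
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "read pages");
    job.setJarByClass(ReadPagesJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows from the region server in batches
    scan.setCacheBlocks(false);  // don't fill the block cache from a full scan

    // One map task is created per region of "webpages", so a table that has
    // been split into many regions is scanned by many mappers in parallel.
    TableMapReduceUtil.initTableMapperJob(
        "webpages", scan, CountMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    job.setNumReduceTasks(0);                          // map-only job
    job.setOutputFormatClass(NullOutputFormat.class);  // nothing is written out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the number of map tasks follows the number of regions, a freshly
created table with a single region is read by a single mapper, no matter how
large the cluster is.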



2009/12/25 Eason.Lee <leongfans@gmail.com>:
> I think he means
> http://jobtracker_ip:50030/jobtracker.jsp
>
> 2009/12/25 Xin Jing <xinjing@beyondfun.net>
>
>> Good point, we will check the number of map-reduce tasks for the performance issue.
>>
>> Could you point me to somewhere I can learn how to use the job tracker?
>>
>> Thanks
>> - Xin
>> ________________________________________
>> From: Jeff Zhang [zjffdu@gmail.com]
>> Sent: December 25, 2009 4:32 PM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Looking for a better design
>>
>> You can look at the job tracker web UI to see the number of mappers. And how
>> many nodes are in your cluster? I do not think it should take several hours
>> to transfer 2 million pages; I suspect you have only one mapper processing
>> all 2 million pages.
>>
>>
>> Jeff Zhang
>>
>>
>> 2009/12/25 Xin Jing <xinjing@beyondfun.net>
>>
>> > I am not quite sure how many mapper tasks the map-reduce job had. We are
>> > using the default partition function, with the URL as the row key, and the
>> > default mapper setup. It takes several hours to finish the job; we have
>> > only run it once, found the performance issue, and are asking whether there
>> > is a better solution. We will get more measurements later...
>> >
>> > Thanks
>> > - Xin
>> >
>> > _______________________________________
>> > From: Jeff Zhang [zjffdu@gmail.com]
>> > Sent: December 25, 2009 3:59 PM
>> > To: hbase-user@hadoop.apache.org
>> > Subject: Re: Looking for a better design
>> >
>> > Hi Xin,
>> >
>> > How many mapper tasks do you get when you transfer the 2 million web pages?
>> > And how long does the job take?
>> >
>> >
>> > Jeff Zhang
>> >
>> >
>> > 2009/12/24 Xin Jing <xinjing@beyondfun.net>
>> >
>> > > Yes, we have the date of the crawled data, and we can use a filter to
>> > > select just the pages from a specific day. But since the date is not the
>> > > row key, applying the filter means scanning the whole table. The
>> > > performance should be worse than saving the new data into a temp table,
>> > > right?
>> > >
>> > > We are using map-reduce to transfer the processed data from the temp table
>> > > into the whole table. The map-reduce job is simple: it selects the data in
>> > > the map phase and imports it into the whole table in the reduce phase.
>> > > Since the table definition of the temp table and the whole table is
>> > > exactly the same, I am wondering if there is a trick to switch the data in
>> > > the temp table into the whole table directly, just like partitioned tables
>> > > in the database world.
>> > >
>> > > Thanks
>> > > - Xin
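
For what it's worth, a sketch of a map-only copy between two tables with the
same layout (the names "pages_tmp" and "pages" are placeholders, and this
assumes the older KeyValue-based HBase client API): writing the Puts directly
from the map, with zero reduce tasks, skips the shuffle/sort that a
map-plus-reduce copy pays for, which is often a large share of the runtime of a
job like this.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      Put put = new Put(rowKey.get());
      for (KeyValue kv : row.raw()) {   // copy every cell of the row as-is
        put.add(kv);
      }
      ctx.write(rowKey, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "copy pages_tmp to pages");
    job.setJarByClass(CopyTempToWhole.class);

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob(
        "pages_tmp", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    // A null reducer plus zero reduce tasks sends the Puts from the map
    // straight to TableOutputFormat, which writes them into the target table.
    TableMapReduceUtil.initTableReducerJob("pages", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper then streams its share of the temp table's regions into the main
table in parallel, instead of funneling everything through a reduce step.
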
>> > > ________________________________________
>> > > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel
>> > > Cryans [jdcryans@apache.org]
>> > > Sent: December 25, 2009 3:47 PM
>> > > To: hbase-user@hadoop.apache.org
>> > > Subject: Re: Looking for a better design
>> > >
>> > > If you have the date of the crawl stored in the table, you could set a
>> > > filter on the Scan object to only scan the rows for a certain day.
>> > >
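A minimal sketch of such a Scan, in case it helps; the column family "meta"
and qualifier "crawl_date" are placeholders for wherever the crawl date is
actually stored, and this uses the older CompareOp-style filter API.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class DayScan {
  // Builds a Scan restricted to rows whose stored crawl date equals "day".
  public static Scan forDay(String day) {
    SingleColumnValueFilter onDay = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes(day));
    onDay.setFilterIfMissing(true);   // drop rows that have no crawl_date cell

    Scan scan = new Scan();
    scan.setCaching(500);             // fetch rows in batches during the scan
    scan.setFilter(onDay);
    return scan;
  }
}

The filter is evaluated server-side, so it saves network traffic, but every
row of the table is still read; it trades the temp-table copy for a full-table
scan rather than eliminating the work.
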
>> > > Also just to be sure, are you using MapReduce to process the tables?
>> > >
>> > > J-D
>> > >
>> > > 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
>> > > > The reason for saving the new data into a temp table is that we provide
>> > > > the processed data incrementally, delivering new data every day, but we
>> > > > may process the whole data set again some day on demand. If we saved the
>> > > > new data into the whole table, it would be hard for us to tell which
>> > > > pages are new. We could, of course, use a flag to mark the status of the
>> > > > data, but I am afraid performance would suffer when scanning for a small
>> > > > part of the data in a big table.
>> > > >
>> > > > Thanks
>> > > > - Xin
>> > > > ________________________________________
>> > > > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel
>> > > > Cryans [jdcryans@apache.org]
>> > > > Sent: December 25, 2009 3:39 PM
>> > > > To: hbase-user@hadoop.apache.org
>> > > > Subject: Re: Looking for a better design
>> > > >
>> > > > What's the reason for first importing into a temp table and not
>> > > > directly into the whole table?
>> > > >
>> > > > Also to improve performance I recommend reading
>> > > > http://wiki.apache.org/hadoop/PerformanceTuning
>> > > >
>> > > > J-D
>> > > >
>> > > > 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
>> > > >> Hi All,
>> > > >>
>> > > >> We are processing a large number of web pages, crawling about 2 million
>> > > >> pages from the internet every day. After processing the new data, we
>> > > >> save it all.
>> > > >>
>> > > >> Our current design is:
>> > > >> 1. create a temp table and a whole table; the table structures are
>> > > >>    exactly the same
>> > > >> 2. import the new data into the temp table, and process it
>> > > >> 3. dump all the data from the temp table into the whole table
>> > > >> 4. clean the temp table
>> > > >>
>> > > >> It works, but the performance is not good; step 3 takes a very long
>> > > >> time. We use map-reduce to transfer the data from the temp table into
>> > > >> the whole table, but it is too slow. We think there might be something
>> > > >> wrong with our design, so I am looking for a better design for this
>> > > >> task, or some hints on the processing.
>> > > >>
>> > > >> Thanks
>> > > >> - Xin
>> > > >>
>> > > >
>> > > >
>> > >
>> > >
>> >
>>
>
