hbase-user mailing list archives

From "Eason.Lee" <leongf...@gmail.com>
Subject Re: Looking for a better design
Date Fri, 25 Dec 2009 09:41:16 GMT
I think he means
http://jobtracker_ip:50030/jobtracker.jsp

2009/12/25 Xin Jing <xinjing@beyondfun.net>

> Good point, we will check the number of map-reduce tasks for the performance issue.
>
> Could you point me to a location where I can learn how to use the job tracker?
>
> Thanks
> - Xin
> ________________________________________
> From: Jeff Zhang [zjffdu@gmail.com]
> Sent: December 25, 2009 4:32 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Looking for a better design
>
> You can look at the job tracker web UI to see how many mappers you get. And
> how many nodes are in your cluster? I do not think it should take several
> hours to transfer 2 million pages; I suspect you have only one mapper
> processing all 2 million pages.
>
>
> Jeff Zhang
>
>
> 2009/12/25 Xin Jing <xinjing@beyondfun.net>
>
> > I am not quite sure how many mapper tasks run during the map-reduce job.
> > We are using the default partition function, with the URL as the row key,
> > and the mappers are configured in the default manner. It takes several
> > hours to finish the job; we have only run it once, found the performance
> > issue, and are now asking whether there is a better solution. We will
> > gather more experimental numbers later...
> >
> > Thanks
> > - Xin
> >
> > _______________________________________
> > From: Jeff Zhang [zjffdu@gmail.com]
> > Sent: December 25, 2009 3:59 PM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Looking for a better design
> >
> > Hi Xin,
> >
> > How many mapper tasks do you get when you transfer the 2 million web
> > pages? And how long does the job take?
> >
> >
> > Jeff Zhang
> >
> >
> > 2009/12/24 Xin Jing <xinjing@beyondfun.net>
> >
> > > Yes, we store the date of the crawled data, and we can use a filter to
> > > select just the rows for a specific day. But since the date is not the
> > > row key, applying the filter means scanning the whole table. The
> > > performance should be worse than saving the new data into a temp table,
> > > right?
> > >
> > > We are using map-reduce to transfer the processed data from the temp
> > > table into the whole table. The map-reduce job is simple: it selects the
> > > data in the map phase and imports it into the whole table in the reduce
> > > phase. Since the table definitions of the temp table and the whole table
> > > are exactly the same, I am wondering if there is a trick to switch the
> > > data in the temp table into the whole table directly, just like table
> > > partition switching in the database world.
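A minimal, map-only sketch of that dump step, using the HBase TableMapReduceUtil helpers. The table names (temp_pages, pages) and the scanner caching value are illustrative assumptions, not details from this thread:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class CopyTempToWhole {

      // Re-emit every cell of a temp-table row as a Put aimed at the whole table.
      static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          Put put = new Put(row.get());
          for (KeyValue kv : value.raw()) {
            put.add(kv);
          }
          context.write(row, put);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "copy temp_pages -> pages");
        job.setJarByClass(CopyTempToWhole.class);

        Scan scan = new Scan();
        scan.setCaching(500); // fetch more rows per RPC while scanning the temp table

        TableMapReduceUtil.initTableMapperJob("temp_pages", scan, CopyMapper.class,
            ImmutableBytesWritable.class, Put.class, job);
        // Write straight into the whole table; a plain copy needs no reduce phase.
        TableMapReduceUtil.initTableReducerJob("pages", null, job);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the table input format produces one map task per region, so if the temp table still lives in a single region the job will get only one mapper no matter how the job is written.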
> > >
> > > Thanks
> > > - Xin
> > > ________________________________________
> > > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel
> > > Cryans [jdcryans@apache.org]
> > > Sent: December 25, 2009 3:47 PM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: Re: Looking for a better design
> > >
> > > If you have the date of the crawl stored in the table, you could set a
> > > filter on the Scan object to only scan the rows for a certain day.
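A minimal sketch of such a filtered Scan, assuming the crawl date is stored as a plain string in an illustrative meta:crawl_date column (family, qualifier, and date format are assumptions, not part of the original schema):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DailyScan {
      /** Build a Scan restricted to rows whose meta:crawl_date equals the given day. */
      public static Scan forDay(String day) {
        Scan scan = new Scan();
        scan.setCaching(500); // fetch more rows per RPC
        scan.setFilter(new SingleColumnValueFilter(
            Bytes.toBytes("meta"),        // assumed column family
            Bytes.toBytes("crawl_date"),  // assumed qualifier holding the crawl date
            CompareOp.EQUAL,
            Bytes.toBytes(day)));         // e.g. "2009-12-25"
        return scan;
      }
    }

The filter runs on the region servers, so it mainly saves network traffic; as Xin points out above, every row of the table is still read because the date is not part of the row key.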
> > >
> > > Also just to be sure, are you using MapReduce to process the tables?
> > >
> > > J-D
> > >
> > > 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
> > > > The reason for saving the new data into a temp table is that we
> > > > provide the processed data incrementally, delivering new data every
> > > > day, but we may reprocess the whole data set again some day on demand.
> > > > If we save the new data directly into the whole table, it is hard for
> > > > us to tell which pages are new. We could, of course, use a flag to mark
> > > > the status of the data, but I am afraid performance would suffer when
> > > > scanning for that data in such a big table.
> > > >
> > > > Thanks
> > > > - Xin
> > > > ________________________________________
> > > > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel
> > > > Cryans [jdcryans@apache.org]
> > > > Sent: December 25, 2009 3:39 PM
> > > > To: hbase-user@hadoop.apache.org
> > > > Subject: Re: Looking for a better design
> > > >
> > > > What's the reason for first importing into a temp table and not
> > > > directly into the whole table?
> > > >
> > > > Also to improve performance I recommend reading
> > > > http://wiki.apache.org/hadoop/PerformanceTuning
> > > >
> > > > J-D
> > > >
> > > > 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
> > > >> Hi All,
> > > >>
> > > >> We are processing a large number of web pages, crawling about 2
> > > >> million pages from the internet every day. After processing the new
> > > >> data, we save it all.
> > > >>
> > > >> Our current design is:
> > > >> 1. create a temp table and a whole table with exactly the same structure
> > > >> 2. import the new data into the temp table and process it
> > > >> 3. dump all the data from the temp table into the whole table
> > > >> 4. clean the temp table
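A minimal sketch of step 1, assuming illustrative table and column-family names (temp_pages, pages, content, meta) and the classic HBaseAdmin client API:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTables {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        // Both tables get identical column families, mirroring the temp/whole design.
        for (String name : new String[] { "temp_pages", "pages" }) {
          HTableDescriptor desc = new HTableDescriptor(name);
          desc.addFamily(new HColumnDescriptor("content")); // crawled page body
          desc.addFamily(new HColumnDescriptor("meta"));    // e.g. crawl date
          if (!admin.tableExists(name)) {
            admin.createTable(desc);
          }
        }
      }
    }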
> > > >>
> > > >> It works, but the performance is not good: step 3 takes a very long
> > > >> time. We use map-reduce to transfer the data from the temp table into
> > > >> the whole table, but it is too slow. We think there might be something
> > > >> wrong with our design, so I am looking for a better design for this
> > > >> task, or some hints on the processing.
> > > >>
> > > >> Thanks
> > > >> - Xin
> > > >>
> > > >
> > > >
> > >
> > >
> >
>
