hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Looking for a better design
Date Sat, 26 Dec 2009 17:05:06 GMT
Does this help:

http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
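
For a rough idea, here is an untested sketch against the 0.20 mapreduce API
(the table names temp_table/whole_table and the class names are made up).
TableInputFormat creates one map task per region of the source table, so you
get more than one mapper as soon as the table has split into several regions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  // Map phase: reads each row of the source table and re-emits it by row key.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    public void map(ImmutableBytesWritable row, Result columns, Context context)
        throws IOException, InterruptedException {
      context.write(row, columns);
    }
  }

  // Reduce phase: turns each row back into a Put against the output table.
  static class CopyReducer
      extends TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
    public void reduce(ImmutableBytesWritable row, Iterable<Result> results,
        Context context) throws IOException, InterruptedException {
      for (Result result : results) {
        Put put = new Put(row.get());
        for (KeyValue kv : result.raw()) {
          put.add(kv);                      // copy every cell as-is
        }
        context.write(row, put);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "copy temp_table into whole_table");
    job.setJarByClass(CopyTempToWhole.class);

    Scan scan = new Scan();                 // full scan of the source table
    scan.setCaching(500);                   // fetch rows in batches while scanning

    // One map task is created per region of temp_table.
    TableMapReduceUtil.initTableMapperJob("temp_table", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    // Reducer output goes to whole_table through TableOutputFormat.
    TableMapReduceUtil.initTableReducerJob("whole_table", CopyReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}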

St.Ack

On Sat, Dec 26, 2009 at 7:50 AM, Aram Mkhitaryan <aram.mkhitaryan@googlemail.com> wrote:

> Hi All,
>
> I'm new to the Hadoop stack and I would be grateful if you could explain
> how to set things up so that your data is read from HBase tables in
> map-reduce tasks, and moreover how to tell the system to run more than
> one task.
> Is there an article that covers that kind of thing?
>
> Thank you very much,
> Merry Christmas and happy new year,
> Aram
>
>
>
> 2009/12/25 Eason.Lee <leongfans@gmail.com>:
> > I think he means
> > http://jobtracker_ip:50030/jobtracker.jsp
> >
> > 2009/12/25 Xin Jing <xinjing@beyondfun.net>
> >
> >> Good point, we will check the number of map-reduce tasks for the
> >> performance issue.
> >>
> >> Could you point me to a location where I can learn how to use the job
> >> tracker?
> >>
> >> Thanks
> >> - Xin
> >> ________________________________________
> >> From: Jeff Zhang [zjffdu@gmail.com]
> >> Sent: December 25, 2009 4:32 PM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Re: Looking for a better design
> >>
> >> You can look at the job tracker web UI to get the number of mappers in
> >> your job. And how many nodes are in your cluster? I do not think it
> >> should take several hours to transfer 2 million pages; I doubt you have
> >> only one mapper processing all 2 million pages.
> >>
> >>
> >> Jeff Zhang
> >>
> >>
> >> 2009/12/25 Xin Jing <xinjing@beyondfun.net>
> >>
> >> > I am not quite sure how many mapper tasks the map-reduce job has. We are
> >> > using the default partition function, with the URL as the row key. The
> >> > mapper setup is the default. It takes several hours to finish the job;
> >> > we have run it only once, found the performance issue, and are now asking
> >> > whether there is a better solution. We will get more experimental numbers
> >> > later...
> >> >
> >> > Thanks
> >> > - Xin
> >> >
> >> > _______________________________________
> >> > From: Jeff Zhang [zjffdu@gmail.com]
> >> > Sent: December 25, 2009 3:59 PM
> >> > To: hbase-user@hadoop.apache.org
> >> > Subject: Re: Looking for a better design
> >> >
> >> > Hi Xin,
> >> >
> >> > How many mapper tasks do you get when you transfer the 2 million web
> >> > pages? And how long does the job take?
> >> >
> >> >
> >> > Jeff Zhang
> >> >
> >> >
> >> > 2009/12/24 Xin Jing <xinjing@beyondfun.net>
> >> >
> >> > > Yes, we have the date of the crawled data, and we can use a filter to
> >> > > select just the rows for a specific day. But the date is not the row key,
> >> > > so applying the filter means scanning the whole table. The performance
> >> > > should be worse than saving the new data into a temp table, right?
> >> > >
> >> > > We are using map-reduce to transfer the processed data from the temp
> >> > > table into the whole table. The map-reduce job is simple: it selects the
> >> > > data in the map phase and imports it into the whole table in the reduce
> >> > > phase. Since the table definition of the temp table and the whole table
> >> > > is exactly the same, I am wondering if there is a trick to switch the
> >> > > data in the temp table into the whole table directly, just like
> >> > > partition switching in the database world.
> >> > >
> >> > > Thanks
> >> > > - Xin
> >> > > ________________________________________
> >> > > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> >> > > Sent: December 25, 2009 3:47 PM
> >> > > To: hbase-user@hadoop.apache.org
> >> > > Subject: Re: Looking for a better design
> >> > >
> >> > > If you have the date of the crawl stored in the table, you could set a
> >> > > filter on the Scan object to only scan the rows for a certain day.
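> >> > >
> >> > > Roughly (untested, and assuming a hypothetical info:date column that
> >> > > stores the crawl day as a string such as "2009-12-24"):
> >> > >
> >> > > // imports: org.apache.hadoop.hbase.client.Scan,
> >> > > //          org.apache.hadoop.hbase.filter.SingleColumnValueFilter,
> >> > > //          org.apache.hadoop.hbase.filter.CompareFilter,
> >> > > //          org.apache.hadoop.hbase.util.Bytes
> >> > > Scan scan = new Scan();
> >> > > // select rows whose (hypothetical) info:date column equals the wanted day
> >> > > scan.setFilter(new SingleColumnValueFilter(
> >> > >     Bytes.toBytes("info"), Bytes.toBytes("date"),
> >> > >     CompareFilter.CompareOp.EQUAL, Bytes.toBytes("2009-12-24")));
> >> > > // then hand this Scan to TableMapReduceUtil.initTableMapperJob(...)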
> >> > >
> >> > > Also just to be sure, are you using MapReduce to process the tables?
> >> > >
> >> > > J-D
> >> > >
> >> > > 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
> >> > > > The reason to save the new data into a temp table is that we provide
> >> > > > the processed data incrementally, delivering new data every day. But
> >> > > > we may process the whole data set again some day on demand. If we save
> >> > > > the new data into the whole table, it is hard for us to tell which
> >> > > > pages are new. We can, of course, use a flag to mark the status of the
> >> > > > data, but I am afraid performance would suffer when scanning for that
> >> > > > data in a big table.
> >> > > >
> >> > > > Thanks
> >> > > > - Xin
> >> > > > ________________________________________
> >> > > > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> >> > > > Sent: December 25, 2009 3:39 PM
> >> > > > To: hbase-user@hadoop.apache.org
> >> > > > Subject: Re: Looking for a better design
> >> > > >
> >> > > > What's the reason for first importing into a temp table and not
> >> > > > directly into the whole table?
> >> > > >
> >> > > > Also to improve performance I recommend reading
> >> > > > http://wiki.apache.org/hadoop/PerformanceTuning
> >> > > >
> >> > > > J-D
> >> > > >
> >> > > > 2009/12/24 Xin Jing <xinjing@beyondfun.net>:
> >> > > >> Hi All,
> >> > > >>
> >> > > >> We are processing a large number of web pages, crawling about 2
> >> > > >> million pages from the internet every day. After processing the new
> >> > > >> data, we save it all.
> >> > > >>
> >> > > >> Our current design is:
> >> > > >> 1. create a temp table and a whole table; the table structures are exactly the same
> >> > > >> 2. import the new data into the temp table, and process it there
> >> > > >> 3. dump all the data from the temp table into the whole table
> >> > > >> 4. clean the temp table
> >> > > >>
> >> > > >> It works, but the performance is not good: step 3 takes a loooong
> >> > > >> time. We use map-reduce to transfer the data from the temp table into
> >> > > >> the whole table, but it is too slow. We think there might be something
> >> > > >> wrong with our design, so I am looking for a better design for this
> >> > > >> task, or some hints on the processing.
> >> > > >>
> >> > > >> Thanks
> >> > > >> - Xin
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> > >
> >> >
> >>
> >
>
