From: Jeff Zhang <zjffdu@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Thu, 24 Dec 2009 23:59:11 -0800
Subject: Re: Looking for a better design

Hi Xin,

How many mapper tasks do you get when you transfer the 2 million web pages?
And what is the job time?

Jeff Zhang


2009/12/24 Xin Jing:
> Yes, we have the date of the crawled data, and we can use a filter to just
> select those on a specific day. But it is not the row key, so applying the
> filter means scanning the whole table. The performance should be worse than
> saving the new data into a temp table, right?
>
> We are using MapReduce to transfer the processed data in the temp table
> into the whole table. The job is simple: it selects the data in the map
> phase and imports it into the whole table in the reduce phase. Since the
> table definitions of the temp table and the whole table are exactly the
> same, I am wondering if there is a trick to switch the data in the temp
> table into the whole table directly, just like switching partitions of a
> partitioned table in the database world.
>
> Thanks
> - Xin
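A straight table-to-table copy like the one Xin describes can be sketched
against the classic HBase MapReduce client API. This is a minimal map-only
variant, not necessarily their exact job: a reduce phase is not strictly
needed for a plain copy, and skipping it avoids shuffling all 2 million rows.
The table names "temp_pages" and "whole_pages" are placeholders, since the
real schema never appears in the thread.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.mapreduce.Job;

  // Copies every row of "temp_pages" into "whole_pages" (placeholder names).
  public class CopyTempToWhole {

    static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // Re-wrap every cell of the source row in a Put for the target table.
        Put put = new Put(row.get());
        for (KeyValue kv : value.raw()) {
          put.add(kv);
        }
        context.write(row, put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "copy temp table into whole table");
      job.setJarByClass(CopyTempToWhole.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // fetch rows in batches to cut RPC round trips
      scan.setCacheBlocks(false);  // don't pollute the block cache during the copy

      TableMapReduceUtil.initTableMapperJob("temp_pages", scan, CopyMapper.class,
          ImmutableBytesWritable.class, Put.class, job);
      // Map-only: Puts go straight to the target table via TableOutputFormat.
      TableMapReduceUtil.initTableReducerJob("whole_pages", null, job);
      job.setNumReduceTasks(0);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

If the job described in the thread really does select in the map phase and
write in the reduce phase, the shuffle and sort between the two is a likely
place for the time to go; a map-only copy sidesteps it entirely.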
> ________________________________________
> From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> Sent: December 25, 2009, 3:47 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Looking for a better design
>
> If you have the date of the crawl stored in the table, you could set a
> filter on the Scan object to only scan the rows for a certain day.
>
> Also, just to be sure, are you using MapReduce to process the tables?
>
> J-D
>
> 2009/12/24 Xin Jing :
> > The reason to save the new data into a temp table is that we provide the
> > processed data in an incremental manner, delivering new data every day.
> > But we may reprocess the whole data set some day on demand. If we saved
> > the new data into the whole table, it would be hard for us to tell which
> > pages are new. We could, of course, use a flag to mark the status of the
> > data, but I am afraid performance would suffer when scanning for that
> > flag in a big table.
> >
> > Thanks
> > - Xin
> > ________________________________________
> > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> > Sent: December 25, 2009, 3:39 PM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Looking for a better design
> >
> > What's the reason for first importing into a temp table and not
> > directly into the whole table?
> >
> > Also, to improve performance I recommend reading
> > http://wiki.apache.org/hadoop/PerformanceTuning
> >
> > J-D
> >
> > 2009/12/24 Xin Jing :
> >> Hi All,
> >>
> >> We are processing a large number of web pages, crawling about 2 million
> >> pages from the internet every day. After processing the new data, we
> >> save it all.
> >>
> >> Our current design is:
> >> 1. Create a temp table and a whole table; the table structures are
> >> exactly the same.
> >> 2. Import the new data into the temp table and process it.
> >> 3. Dump all the data from the temp table into the whole table.
> >> 4. Clean the temp table.
> >>
> >> It works, but the performance is not good: step 3 takes a loooong time.
> >> We use MapReduce to transfer the data from the temp table into the
> >> whole table, but it is too slow. We think there might be something
> >> wrong with our design, so I am looking for a better design for this
> >> task, or some hints on the processing.
> >>
> >> Thanks
> >> - Xin
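For reference, here is a sketch of the Scan filter J-D suggests above, using
the stock SingleColumnValueFilter. The "meta" family and "crawl_date"
qualifier are assumed names for illustration, not taken from the thread.

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DailyScan {
    // Builds a Scan restricted to pages whose crawl_date cell equals the day.
    public static Scan forDay(String day) {
      SingleColumnValueFilter filter = new SingleColumnValueFilter(
          Bytes.toBytes("meta"),         // assumed column family
          Bytes.toBytes("crawl_date"),   // assumed qualifier, e.g. "2009-12-24"
          CompareOp.EQUAL,
          Bytes.toBytes(day));
      filter.setFilterIfMissing(true);   // skip rows with no crawl_date cell
      Scan scan = new Scan();
      scan.setFilter(filter);
      scan.setCaching(500);              // fetch rows in batches per RPC
      return scan;
    }
  }

As Xin points out, the date is not part of the row key, so the filter is
still evaluated against every row server-side; it saves network transfer
rather than disk I/O. If each cell's HBase timestamp already reflects the
crawl time, scan.setTimeRange(dayStart, dayEnd) would give the same
day-selection without the extra column.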