From: Jeff Zhang <zjffdu@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Thu, 24 Dec 2009 23:59:11 -0800
Subject: Re: Looking for a better design

Hi Xin,

How many mapper tasks do you get when you transfer the 2 million web pages?
And what is the job time?

Jeff Zhang


2009/12/24 Xin Jing:
> Yes, we have the date of the crawled data, and we can use a filter to just
> select those on a specific day. But it is not the row key, so applying the
> filter means scanning the whole table. The performance should be worse than
> saving the new data into a temp table, right?
>
> We are using MapReduce to transfer the processed data in the temp table
> into the whole table. The job is simple: it selects the data in the map
> phase and imports it into the whole table in the reduce phase. Since the
> table definitions of the temp table and the whole table are exactly the
> same, I am wondering if there is a trick to switch the data in the temp
> table into the whole table directly, just like switching partitions of a
> partitioned table in the database world.
>
> Thanks
> - Xin
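A straight table-to-table copy like the one Xin describes can be sketched
against the classic HBase MapReduce client API. This is a minimal map-only
variant, not necessarily their exact job: a reduce phase is not strictly
needed for a plain copy, and skipping it avoids shuffling all 2 million rows.
The table names "temp_pages" and "whole_pages" are placeholders, since the
real schema never appears in the thread.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.mapreduce.Job;

  // Copies every row of "temp_pages" into "whole_pages" (placeholder names).
  public class CopyTempToWhole {

    static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // Re-wrap every cell of the source row in a Put for the target table.
        Put put = new Put(row.get());
        for (KeyValue kv : value.raw()) {
          put.add(kv);
        }
        context.write(row, put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "copy temp table into whole table");
      job.setJarByClass(CopyTempToWhole.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // fetch rows in batches to cut RPC round trips
      scan.setCacheBlocks(false);  // don't pollute the block cache during the copy

      TableMapReduceUtil.initTableMapperJob("temp_pages", scan, CopyMapper.class,
          ImmutableBytesWritable.class, Put.class, job);
      // Map-only: Puts go straight to the target table via TableOutputFormat.
      TableMapReduceUtil.initTableReducerJob("whole_pages", null, job);
      job.setNumReduceTasks(0);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

If the job described in the thread really does select in the map phase and
write in the reduce phase, the shuffle and sort between the two is a likely
place for the time to go; a map-only copy sidesteps it entirely.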
> ________________________________________
> From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> Sent: December 25, 2009, 3:47 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Looking for a better design
>
> If you have the date of the crawl stored in the table, you could set a
> filter on the Scan object to only scan the rows for a certain day.
>
> Also, just to be sure, are you using MapReduce to process the tables?
>
> J-D
>
> 2009/12/24 Xin Jing :
> > The reason to save the new data into a temp table is that we provide the
> > processed data in an incremental manner, delivering new data every day.
> > But we may reprocess the whole data set some day on demand. If we saved
> > the new data into the whole table, it would be hard for us to tell which
> > pages are new. We could, of course, use a flag to mark the status of the
> > data, but I am afraid performance would suffer when scanning for that
> > flag in a big table.
> >
> > Thanks
> > - Xin
> > ________________________________________
> > From: jdcryans@gmail.com [jdcryans@gmail.com] on behalf of Jean-Daniel Cryans [jdcryans@apache.org]
> > Sent: December 25, 2009, 3:39 PM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Looking for a better design
> >
> > What's the reason for first importing into a temp table and not
> > directly into the whole table?
> >
> > Also, to improve performance I recommend reading
> > http://wiki.apache.org/hadoop/PerformanceTuning
> >
> > J-D
> >
> > 2009/12/24 Xin Jing :
> >> Hi All,
> >>
> >> We are processing a large number of web pages, crawling about 2 million
> >> pages from the internet every day. After processing the new data, we
> >> save it all.
> >>
> >> Our current design is:
> >> 1. Create a temp table and a whole table; the table structures are
> >> exactly the same.
> >> 2. Import the new data into the temp table and process it.
> >> 3. Dump all the data from the temp table into the whole table.
> >> 4. Clean the temp table.
> >>
> >> It works, but the performance is not good: step 3 takes a loooong time.
> >> We use MapReduce to transfer the data from the temp table into the
> >> whole table, but it is too slow. We think there might be something
> >> wrong with our design, so I am looking for a better design for this
> >> task, or some hints on the processing.
> >>
> >> Thanks
> >> - Xin
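For reference, here is a sketch of the Scan filter J-D suggests above, using
the stock SingleColumnValueFilter. The "meta" family and "crawl_date"
qualifier are assumed names for illustration, not taken from the thread.

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DailyScan {
    // Builds a Scan restricted to pages whose crawl_date cell equals the day.
    public static Scan forDay(String day) {
      SingleColumnValueFilter filter = new SingleColumnValueFilter(
          Bytes.toBytes("meta"),         // assumed column family
          Bytes.toBytes("crawl_date"),   // assumed qualifier, e.g. "2009-12-24"
          CompareOp.EQUAL,
          Bytes.toBytes(day));
      filter.setFilterIfMissing(true);   // skip rows with no crawl_date cell
      Scan scan = new Scan();
      scan.setFilter(filter);
      scan.setCaching(500);              // fetch rows in batches per RPC
      return scan;
    }
  }

As Xin points out, the date is not part of the row key, so the filter is
still evaluated against every row server-side; it saves network transfer
rather than disk I/O. If each cell's HBase timestamp already reflects the
crawl time, scan.setTimeRange(dayStart, dayEnd) would give the same
day-selection without the extra column.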