From: stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Subject: Re: HBase schema for crawling
Date: Sun, 5 Jul 2009 14:26:23 -0700

On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 wrote:
>
> Hi All,
>
> I am developing a schema that will be used for crawling.

Out of interest, what crawler are you using?

> Now, here's the dilemma I have... When I create a MapReduce job to go
> through each row in the above, I want to schedule the url to be recrawled
> again at some date in the future. For example,
>
> // Simple pseudocode
> Map( row, rowResult )
> {
>     BatchUpdate update = new BatchUpdate( row.get() );
>     update.put( "contents:content", downloadPage( pageUrl ) );
>     update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> }

So you want to write a new row with a nextFetchDate prefix?

FYI, have you seen
http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
? (You might also find http://sourceforge.net/projects/publicsuffix/ useful.)

> 1) Does HBase allow you to update the key for a row? Are HBase row keys
> immutable?

Yes.
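The date-prefixed key the pseudocode above reaches for can be sketched in plain Java. This is only an illustration: reverseDomain and the "date:reversed-domain" layout are assumptions of the sketch, in the spirit of (but not identical to) HBase's Keying.createKey, which stores hosts with their labels reversed so rows for one domain sort together:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative row-key builder for the crawl schema discussed above. */
public class CrawlKey {

    /** Reverse the host labels: www.example.com -> com.example.www. */
    static String reverseDomain(String host) {
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    /** Prefix with the next-fetch date so a date-range scan finds due URLs. */
    static String rowKey(String nextFetchDate, String host) {
        return nextFetchDate + ":" + reverseDomain(host);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("20090801", "www.example.com"));
        // prints 20090801:com.example.www
    }
}
```

Because row keys sort lexicographically, the yyyyMMdd prefix sorts first, so a scan bounded by two date prefixes picks up exactly the URLs due for a recrawl.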
If you 'update' a row key, changing it, you will create a new row.

> 2) If I can't update a key, what's the easiest way to copy a row and assign
> it a different key?

Get all of the row and then put it all with the new key. (Billy Pearson's
suggestion would be the way to go, I'd suggest: keep a column with a
timestamp in it, or use hbase versions -- in TRUNK you can ask for data
within a timerange. Running a scanner asking for rows > some timestamp
should be fast.)

> 3) What are the implications for updating/deleting from a table that you
> are currently scanning as part of the MapReduce job?

Scanners return the state of the row at the time they trip over it.

> It seems to me that I may want to do a map and a reduce. During the map
> phase I would record the rows that I fetched, while in the reduce phase I
> would take those rows, re-add them with the nextFetchDate, and then
> remove the old row.

Do you have to remove the old data? You could let it age out, or be removed
once the number of versions of a page exceeds the configured maximum.

> I would probably want to do this process in phases (e.g. scan only 5,000
> rows at a time) so that if my Mapper died for any particular reason I could
> address the issue and in the worst case only have lost the work that I had
> done on 5,000 rows.

You could keep an already-seen list in another hbase table and just rerun
the whole job if the first job failed. Check the already-seen table before
crawling a page to see whether you'd crawled it recently.

St.Ack
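The "get all of the row, put it under the new key, drop the old row" move described above can be sketched without a cluster. Here a TreeMap stands in for an HBase table (row key -> column -> value), and the key names are made up for the example; with the real client you would read the row, write its cells under the new key, and then delete the old row:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of moving a row to a new key; TreeMap stands in for the table. */
public class RowMove {

    static void moveRow(TreeMap<String, Map<String, byte[]>> table,
                        String oldKey, String newKey) {
        Map<String, byte[]> row = table.get(oldKey);  // 1. get all of the row
        if (row == null) {
            return;                                   // nothing to move
        }
        table.put(newKey, new HashMap<>(row));        // 2. put it under the new key
        table.remove(oldKey);                         // 3. only then drop the old row
    }

    public static void main(String[] args) {
        TreeMap<String, Map<String, byte[]>> table = new TreeMap<>();
        Map<String, byte[]> row = new HashMap<>();
        row.put("contents:content", "<html>...</html>".getBytes());
        table.put("20090705:com.example.www", row);

        moveRow(table, "20090705:com.example.www", "20090801:com.example.www");
        System.out.println(table.firstKey());
        // prints 20090801:com.example.www
    }
}
```

Copying before deleting means a failure mid-move leaves a duplicate row (harmless to a crawler, and cleanable on the next pass) rather than a lost one.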