From: stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Subject: Re: HBase schema for crawling
Date: Sun, 5 Jul 2009 14:26:23 -0700

On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 wrote:
>
> Hi All,
>
> I am developing a schema that will be used for crawling.

Out of interest, what crawler are you using?

> Now, here's the dilemma I have... When I create a MapReduce job to go
> through each row in the above, I want to schedule the url to be recrawled
> again at some date in the future. For example,
>
> // Simple pseudocode
> Map( row, rowResult )
> {
>     BatchUpdate update = new BatchUpdate( row.get() );
>     update.put( "contents:content", downloadPage( pageUrl ) );
>     update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> }

So you want to write a new row with a nextFetchDate prefix?

FYI, have you seen
http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
? (You might also find http://sourceforge.net/projects/publicsuffix/ useful.)

> 1) Does HBase allow you to update the key for a row? Are HBase row keys
> immutable?

Yes.
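The date-prefixed key the pseudocode above reaches for can be sketched in plain Java. This is only an illustration: reverseDomain and the "date:reversed-domain" layout are assumptions of the sketch, in the spirit of (but not identical to) HBase's Keying.createKey, which stores hosts with their labels reversed so rows for one domain sort together:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative row-key builder for the crawl schema discussed above. */
public class CrawlKey {

    /** Reverse the host labels: www.example.com -> com.example.www. */
    static String reverseDomain(String host) {
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    /** Prefix with the next-fetch date so a date-range scan finds due URLs. */
    static String rowKey(String nextFetchDate, String host) {
        return nextFetchDate + ":" + reverseDomain(host);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("20090801", "www.example.com"));
        // prints 20090801:com.example.www
    }
}
```

Because row keys sort lexicographically, the yyyyMMdd prefix sorts first, so a scan bounded by two date prefixes picks up exactly the URLs due for a recrawl.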
If you 'update' a row key, changing it, you will create a new row.

> 2) If I can't update a key, what's the easiest way to copy a row and assign
> it a different key?

Get all of the row and then put it all with the new key. (Billy Pearson's
suggestion would be the way to go, I'd suggest: keep a column with a
timestamp in it, or use hbase versions -- in TRUNK you can ask for data
within a timerange. Running a scanner asking for rows > some timestamp
should be fast.)

> 3) What are the implications for updating/deleting from a table that you
> are currently scanning as part of the MapReduce job?

Scanners return the state of the row at the time they trip over it.

> It seems to me that I may want to do a map and a reduce. During the map
> phase I would record the rows that I fetched, while in the reduce phase I
> would take those rows, re-add them with the nextFetchDate, and then
> remove the old row.

Do you have to remove the old data? You could let it age out, or be removed
once the number of versions of a page exceeds the configured maximum.

> I would probably want to do this process in phases (e.g. scan only 5,000
> rows at a time) so that if my Mapper died for any particular reason I could
> address the issue and in the worst case only have lost the work that I had
> done on 5,000 rows.

You could keep an already-seen list in another hbase table and just rerun
the whole job if the first job failed. Check the already-seen table before
crawling a page to see whether you'd crawled it recently.

St.Ack
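The "get all of the row, put it under the new key, drop the old row" move described above can be sketched without a cluster. Here a TreeMap stands in for an HBase table (row key -> column -> value), and the key names are made up for the example; with the real client you would read the row, write its cells under the new key, and then delete the old row:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of moving a row to a new key; TreeMap stands in for the table. */
public class RowMove {

    static void moveRow(TreeMap<String, Map<String, byte[]>> table,
                        String oldKey, String newKey) {
        Map<String, byte[]> row = table.get(oldKey);  // 1. get all of the row
        if (row == null) {
            return;                                   // nothing to move
        }
        table.put(newKey, new HashMap<>(row));        // 2. put it under the new key
        table.remove(oldKey);                         // 3. only then drop the old row
    }

    public static void main(String[] args) {
        TreeMap<String, Map<String, byte[]>> table = new TreeMap<>();
        Map<String, byte[]> row = new HashMap<>();
        row.put("contents:content", "<html>...</html>".getBytes());
        table.put("20090705:com.example.www", row);

        moveRow(table, "20090705:com.example.www", "20090801:com.example.www");
        System.out.println(table.firstKey());
        // prints 20090801:com.example.www
    }
}
```

Copying before deleting means a failure mid-move leaves a duplicate row (harmless to a crawler, and cleanable on the next pass) rather than a lost one.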