hbase-user mailing list archives

From "Billy Pearson" <sa...@pearsonwholesale.com>
Subject Re: HBase schema for crawling
Date Sun, 05 Jul 2009 04:05:52 GMT
I have stored the spider time in a column like stime: to keep from having
to fetch the page content in the map of the row just for the timestamp;
then I just scan over that one column to get the last spider time, etc.
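
Roughly like this (a minimal sketch against the 0.19-era client API; the
table and column names are just the examples from this thread, and it
assumes stime: was written with Bytes.toBytes(long)):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class LastSpiderTimes {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "webcrawl");
        // Ask for only the stime: column so the scan never drags the
        // page content across the wire just to read a timestamp.
        Scanner scanner = table.getScanner(new String[] { "stime:" });
        for (RowResult row : scanner) {
            Cell cell = row.get(Bytes.toBytes("stime:"));
            long lastSpiderTime = Bytes.toLong(cell.getValue());
            System.out.println(Bytes.toString(row.getRow()) + " " + lastSpiderTime);
        }
        scanner.close();
    }
}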

In my setup I did not spider from the MapReduce job; I built a spider
list and then ran the spider in a different language that I know better
than Java, so I have no experience with that part.


"maxjar10" <jcuzens@gmail.com> wrote in 
message news:24339168.post@talk.nabble.com...
>
> Hi All,
>
> I am developing a schema that will be used for crawling. All of the
> examples that I have seen to date use a webcrawl table that looks like
> the one below:
>
> Table: webcrawl
> rowkey           lastFetchDate:        content:
> com.yahoo.www    timestamp             somedownloadedpage
>
> I understand wanting to use the rowkey in reverse domain order so that
> it's easy to recrawl all of a specific site, including its subdomains.
> However, it seems inefficient to scan through a large table checking
> "lastFetchDate" to decide which pages to refetch.
>
> In my case I'm not as concerned with recrawling a particular domain as I
> am with efficiently locating the URLs that I need to recrawl because I
> haven't crawled them in a while.
>
> rowkey                      contents:
> 20090630;com.google.www     somedownloadedgooglepage
> 20090701;com.yahoo.www      somedownloadedyahoopage
>
> This would allow you to quickly get to the content needed to recrawl,
> and to do it by date so that you ensure you recrawl the most stale item
> first.
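>
> The key itself would be built something like this (just a sketch;
> reverseDomain and makeRowKey are made-up helpers, not anything in
> HBase):
>
> import java.net.URL;
> import java.text.SimpleDateFormat;
> import java.util.Arrays;
> import java.util.Collections;
> import java.util.Date;
> import java.util.List;
>
> public class RowKeys {
>     // Reverse "www.yahoo.com" into "com.yahoo.www".
>     static String reverseDomain(String host) {
>         List<String> parts = Arrays.asList(host.split("\\."));
>         Collections.reverse(parts);
>         StringBuilder sb = new StringBuilder();
>         for (String part : parts) {
>             if (sb.length() > 0) sb.append('.');
>             sb.append(part);
>         }
>         return sb.toString();
>     }
>
>     // e.g. "20090701;com.yahoo.www" for a fetch due on 2009-07-01.
>     static String makeRowKey(String url, Date nextFetch) throws Exception {
>         String date = new SimpleDateFormat("yyyyMMdd").format(nextFetch);
>         return date + ";" + reverseDomain(new URL(url).getHost());
>     }
> }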
>
> Now, here's the dilemma I have... When I create a MapReduce job to go
> through each row in the above, I want to schedule the URL to be
> recrawled again at some date in the future. For example,
>
> // Simple pseudocode
> Map( row, rowResult )
> {
>      BatchUpdate update = new BatchUpdate( row.get() );
>      update.put( "contents:content", downloadPage( pageUrl ) );
>      update.updateKey( nextFetchDate + ";" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> }
>
> 1) Does HBase allow you to update the key for a row? Are HBase row keys
> immutable?
>
> 2) If I can't update a key, what's the easiest way to copy a row and
> assign it a different key?
>
> 3) What are the implications of updating/deleting from a table that you
> are currently scanning as part of the MapReduce job?
>
> It seems to me that I may want to do both a map and a reduce: during
> the map phase I would record the rows that I fetched, and in the reduce
> phase I would take those rows, re-add them with the nextFetchDate, and
> then remove the old row.
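>
> In the reduce that would look roughly like this (a sketch against the
> same 0.19-era client API, assuming the answer to #1 is that a key can't
> be changed in place; table, pageBytes and oldKey are assumed to be in
> scope, and makeRowKey is the helper sketched above):
>
> String newKey = makeRowKey(pageUrl, nextFetchDate); // e.g. "20090715;com.yahoo.www"
> BatchUpdate update = new BatchUpdate(newKey);
> update.put("contents:content", pageBytes);          // re-add under the new key
> table.commit(update);
> table.deleteAll(Bytes.toBytes(oldKey));             // then drop the old row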
>
> I would probably want to do this process in phases (e.g. scan only
> 5,000 rows at a time) so that if my Mapper died for any particular
> reason I could address the issue and, in the worst case, only lose the
> work that I had done on those 5,000 rows.
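>
> Something like this checkpointed scan is what I have in mind (sketch;
> loadCheckpoint/saveCheckpoint are made-up helpers, and I'm assuming the
> getScanner overload that takes a start row):
>
> String startRow = loadCheckpoint();  // "" on the very first run
> Scanner scanner = table.getScanner(new String[] { "contents:" }, startRow);
> int count = 0;
> String lastRow = startRow;
> for (RowResult row : scanner) {
>     // refetch the page for this row here
>     lastRow = Bytes.toString(row.getRow());
>     if (++count >= 5000) break;      // stop after one 5,000-row batch
> }
> scanner.close();
> saveCheckpoint(lastRow);             // a crash costs at most one batch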
>
> Thanks!
>
> -- 
> View this message in context: 
> http://www.nabble.com/HBase-schema-for-crawling-tp24339168p24339168.html
> Sent from the HBase User mailing list archive at Nabble.com.