hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: is this rowkey schema feasible?
Date Thu, 09 Jan 2014 18:02:38 GMT
On Thu, Jan 9, 2014 at 2:42 AM, Li Li <fancyerii@gmail.com> wrote:

> hi all,
>     I want to use hbase to store all urls for a distributed crawler.
> there is a central scheduler to schedule all unCrawled urls by
> priority.

Are you building from scratch?  If so, have you looked at Nutch?

> Following is my design of rowkey and common data access
> pattern, is there any better rowkey design for my usecase?
>     the row key is: reverse_host--status--priority--MD5(path). Some
> examples:
>     com.google.www/-0-10-MD5(path1)
>     com.google.www/-0-9-MD5(path2)
>     ...
>     com.google.www/-1-10-MD5(path3)
>     status 0 means not crawled and 1 means crawled

Is this your total schema?  Where are the crawl history, the crawled
content, the rate of change of the content, the recrawl schedule, etc. kept?
Is this data elsewhere in other tables, and this table is just a frontier
'queue': crawl URLs are added to the head and, when a page is crawled, you
add a new row with status set to 1?  How are you going to request that a
page be recrawled in such a scheme?

How do the distributed crawlers divvy up the work?

Generally you do not want to keep state in the key itself.

Using an hbase table as a queue is usually not a good idea, especially when
there is a lot of churn, as there will be in a distributed crawler.

You could keep the 'crawl status' in a separate column family with nothing
but this attribute in it, so your crawlers can scan fast and update this one
attribute only after the page is pulled.  Or you might want to use something
else altogether for the per-crawler list of URLs to crawl, since it is a
small dataset and you need to go real fast against it.
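
To make the first option concrete, here is a rough sketch, assuming a table
named "urls" with a tiny status-only family "s" alongside the bulky content
families; the table, family and qualifier names (and the example keys) are
made up for illustration, not a prescription:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CrawlStatusSketch {
    // Tiny column family holding nothing but the crawl status, so scans that
    // add only this family never touch the bulky content/metadata families.
    static final byte[] STATUS_FAM  = Bytes.toBytes("s");
    static final byte[] STATUS_QUAL = Bytes.toBytes("st");
    static final byte[] UNCRAWLED   = Bytes.toBytes(0);
    static final byte[] CRAWLED     = Bytes.toBytes(1);

    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "urls");   // assumed table name

      // Status lives in a cell, not in the key, so the key can be just
      // reverse_host + MD5(path) and never needs to be rewritten.
      Scan scan = new Scan(Bytes.toBytes("com.google.www/"),
                           Bytes.toBytes("com.google.www0")); // '0' sorts right after '/'
      scan.addFamily(STATUS_FAM);   // scan only the small status family
      ResultScanner scanner = table.getScanner(scan);
      for (Result r : scanner) {
        if (Bytes.equals(UNCRAWLED, r.getValue(STATUS_FAM, STATUS_QUAL))) {
          // hand this row's URL to a crawler ...
        }
      }
      scanner.close();

      // After the page is pulled, flip the one attribute in place; no
      // delete + re-insert of the row under a different key.
      Put done = new Put(Bytes.toBytes("com.google.www/MD5(path1)"));
      done.add(STATUS_FAM, STATUS_QUAL, CRAWLED);
      table.put(done);
      table.close();
    }
  }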

>     my scheduler:
>     int batchSize=10000;
>     Map<String,Integer> hostCount=calcHostPriority(batchSize);
>     List<String> toBeCrawledUrls=..
>     for(Map.Entry<String,Integer> entry:hostCount.entrySet()){
>          //select top N priority uncrawled urls for this host
>         byte[] startRow = Bytes.toBytes(reverse(entry.getKey())+"/-0");
>         byte[] stopRow = Bytes.toBytes(reverse(entry.getKey())+"/-1");
>          Scan s = new Scan(startRow, stopRow);
>          s.setMaxResultSize(entry.getValue());
>          for(String url:scanResult){
>               toBeCrawledUrls.add(url);
>          }
>     }
>     //update after crawling
>     for(String url:crawledUrls){
>          delete url //com.google.www/-0-10-MD5(path)
>          put url //com.google.www/-1-10-MD5(path)

Each delete adds a new entry (a tombstone); the old cell is not actually
removed until a major compaction, so this delete-then-put pattern generates a
lot of extra writes and churn.

How do you intend to make it so two distributed crawlers do not pull the
same URL to fetch?  (Use checkAndPut and set a 'currently-assigned-to-a-crawler'
marker?)
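
For example, a rough sketch of such a claim with checkAndPut, assuming an
open HTable handle named table; the "s" family, "claim" qualifier and
crawler id are made-up names:

  // Claim a URL atomically so only one crawler wins the row.
  byte[] row       = Bytes.toBytes("com.google.www/MD5(path1)");
  byte[] family    = Bytes.toBytes("s");
  byte[] claimQual = Bytes.toBytes("claim");

  Put claim = new Put(row);
  claim.add(family, claimQual, Bytes.toBytes("crawler-7"));

  // Passing null as the expected value means "only put if the claim cell
  // does not exist yet"; a second crawler racing on the same row gets false.
  boolean won = table.checkAndPut(row, family, claimQual, null, claim);
  if (won) {
    // this crawler owns the URL; go fetch it
  }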

>     }
>     //check url exists
>     any better method than this?
>      assuming only 1-10 priority
>    try get:
>         com.google.www/-0-10-MD5(path)
>         com.google.www/-1-10-MD5(path)
>         com.google.www/-0-9-MD5(path)
>         ....
>         com.google.www/-1-1-MD5(path)
>     if any exists, then true
>     else false

You don't want to 'get', you want to 'scan'.
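
One way to read that, as a sketch only (the filter choice is my assumption,
not something spelled out here): bound a single scan to the host's keyspace
and filter on the MD5 suffix, instead of issuing one Get per status/priority
combination:

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.CompareFilter;
  import org.apache.hadoop.hbase.filter.FilterList;
  import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
  import org.apache.hadoop.hbase.filter.RowFilter;
  import org.apache.hadoop.hbase.filter.SubstringComparator;
  import org.apache.hadoop.hbase.util.Bytes;

  public class UrlExistsSketch {
    // True if any row for this reversed host carries the given MD5(path),
    // whatever its status and priority happen to be.
    static boolean exists(HTable table, String reverseHost, String md5)
        throws IOException {
      Scan scan = new Scan(Bytes.toBytes(reverseHost + "/-"),
                           Bytes.toBytes(reverseHost + "/.")); // '.' sorts right after '-'
      FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
      filters.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                                      new SubstringComparator(md5)));
      filters.addFilter(new FirstKeyOnlyFilter());  // only existence matters
      scan.setFilter(filters);
      ResultScanner scanner = table.getScanner(scan);
      try {
        return scanner.next() != null;
      } finally {
        scanner.close();
      }
    }
  }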

