hbase-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject is this rowkey schema feasible?
Date Thu, 09 Jan 2014 10:42:12 GMT
Hi all,
    I want to use HBase to store all URLs for a distributed crawler.
A central scheduler schedules all uncrawled URLs by priority. Below are
my rowkey design and common data access patterns. Is there a better
rowkey design for my use case?

    The row key is: reverse_host/-status-priority-MD5(path). Some examples:
    com.google.www/-0-10-MD5(path1)
    com.google.www/-0-9-MD5(path2)
    ...
    com.google.www/-1-10-MD5(path3)
    Status 0 means not crawled and 1 means crawled.
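
A minimal sketch of how such a key could be assembled, using only the JDK. The class and method names (RowKeySketch, reverseHost, md5Hex, rowKey) are hypothetical, and encoding the MD5 as lowercase hex is an assumption; the layout itself follows the examples above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RowKeySketch {
    // "www.google.com" -> "com.google.www"
    static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    // Lowercase hex MD5 of the URL path (32 characters); hex is an assumption.
    static String md5Hex(String path) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(path.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // reverse_host + "/-" + status + "-" + priority + "-" + MD5(path)
    static String rowKey(String host, int status, int priority, String path) {
        return reverseHost(host) + "/-" + status + "-" + priority + "-" + md5Hex(path);
    }
}
```

For example, RowKeySketch.rowKey("www.google.com", 0, 10, "/search") yields a key starting with "com.google.www/-0-10-".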
    My scheduler:
    int batchSize = 10000;
    Map<String, Integer> hostCount = calcHostPriority(batchSize);
    List<String> toBeCrawledUrls=..
    for (Map.Entry<String, Integer> entry : hostCount.entrySet()) {
        // select top N priority uncrawled urls for this host
        byte[] startRow = Bytes.toBytes(reverse(entry.getKey()) + "/-0");
        byte[] stopRow = Bytes.toBytes(reverse(entry.getKey()) + "/-1");
        Scan s = new Scan(startRow, stopRow);
        // setMaxResultSize() limits bytes, not rows; PageFilter limits rows
        // per region server, so the loop below also stops at the limit
        s.setFilter(new PageFilter(entry.getValue()));
        int taken = 0;
        for (String url : scanResult) {
            if (taken++ >= entry.getValue()) break;
            toBeCrawledUrls.add(url);
        }
    }
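
To illustrate what the start/stop range selects, here is a small sketch that uses a sorted in-memory set as a stand-in for the table. It is not HBase code: ScanRangeSketch and its methods are hypothetical, the short suffixes "aaa", "bbb", ... stand in for MD5 values, and Java string order is assumed to match HBase's byte order for these ASCII keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class ScanRangeSketch {
    // Returns up to `limit` keys in [startRow, stopRow), in sorted order,
    // mimicking a range scan with a client-side row limit.
    static List<String> scan(TreeSet<String> rows, String startRow, String stopRow, int limit) {
        List<String> out = new ArrayList<>();
        for (String key : rows.subSet(startRow, true, stopRow, false)) {
            if (out.size() >= limit) break;
            out.add(key);
        }
        return out;
    }

    // Sample rows for two hosts; suffixes are placeholders for MD5(path).
    static TreeSet<String> sampleRows() {
        return new TreeSet<>(Arrays.asList(
            "com.google.www/-0-10-aaa",
            "com.google.www/-0-9-bbb",
            "com.google.www/-0-1-ccc",
            "com.google.www/-1-10-ddd",
            "com.example.www/-0-5-eee"));
    }

    // Uncrawled rows of one host: ["<host>/-0", "<host>/-1").
    static List<String> uncrawled(String reversedHost, int limit) {
        return scan(sampleRows(), reversedHost + "/-0", reversedHost + "/-1", limit);
    }
}
```

In this stand-in, the range picks up only the status-0 rows of the given host. Rows come back in key order, and with plain decimal priorities that order is 1, 10, 9 for the sample data, so it may be worth checking whether this matches the intended "top N by priority" order.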

    // update after crawling: move each url from status 0 to status 1
    for (String url : crawledUrls) {
        delete url  // old row: com.google.www/-0-10-MD5(path)
        put url     // new row: com.google.www/-1-10-MD5(path)
    }
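
Because the status is part of the key, marking a URL as crawled means writing a new row and deleting the old one. A minimal sketch of deriving the new key from the old one follows; StatusFlipSketch and markCrawled are hypothetical names, and the key is parsed from the right on the assumption that the MD5 part contains no '-' (hosts may contain one).

```java
class StatusFlipSketch {
    // Given "<reverse_host>/-0-<priority>-<md5>", return the same key with status 1.
    static String markCrawled(String uncrawledKey) {
        int md5Sep = uncrawledKey.lastIndexOf('-');               // '-' before the MD5
        int prioSep = uncrawledKey.lastIndexOf('-', md5Sep - 1);  // '-' before the priority
        int statusIdx = prioSep - 1;                              // the status character
        if (prioSep < 3
                || uncrawledKey.charAt(statusIdx) != '0'
                || !uncrawledKey.startsWith("/-", statusIdx - 2)) {
            throw new IllegalArgumentException("not an uncrawled key: " + uncrawledKey);
        }
        return uncrawledKey.substring(0, statusIdx) + '1' + uncrawledKey.substring(prioSep);
    }
}
```

The old key would be the target of the delete and the returned key the target of the put.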

    // check whether a url exists
    Is there any better method than this?
    Assuming priorities are only 1-10, try a get on each of:
        com.google.www/-0-10-MD5(path)
        com.google.www/-1-10-MD5(path)
        com.google.www/-0-9-MD5(path)
        ....
        com.google.www/-1-1-MD5(path)
    If any exists, then true;
    else false.
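
A sketch of the candidate keys this check has to try, to make the cost concrete: with 2 statuses and 10 priorities it is up to 20 gets per URL. ExistsCandidatesSketch and candidateKeys are hypothetical names, and the md5Hex argument is assumed to be the already computed MD5(path).

```java
import java.util.ArrayList;
import java.util.List;

class ExistsCandidatesSketch {
    // Every key the url could be stored under, highest priority first,
    // status 0 before status 1, in the same order as the list above.
    static List<String> candidateKeys(String reversedHost, String md5Hex, int maxPriority) {
        List<String> keys = new ArrayList<>();
        for (int priority = maxPriority; priority >= 1; priority--) {
            for (int status = 0; status <= 1; status++) {
                keys.add(reversedHost + "/-" + status + "-" + priority + "-" + md5Hex);
            }
        }
        return keys;
    }
}
```

The url exists if a get on any of the returned keys finds a row.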
