hbase-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject use hbase as distributed crawl's scheduler
Date Fri, 03 Jan 2014 06:12:39 GMT
Hi all,
     I want to use HBase to store all URLs (crawled or not crawled).
Each URL will have a column named priority that represents the
priority of the URL. I want to get the top N URLs ordered by priority (if
the priority is the same, the URL with the earlier timestamp is preferred).
     With something like MySQL, my client application might look like:
     while true:
         select url from url_db where status='not_crawled'
             order by priority, addedTime limit 1000;
         do something with these urls;
         extract more urls and insert them into url_db;
     How should I design the HBase schema for this application? Is HBase
suitable for this use case?
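
     To make the ordering requirement concrete, here is a minimal sketch of
one possible row key layout. It is only an illustration, not a tested HBase
design. It relies on the fact that HBase keeps rows sorted by the raw bytes
of the row key, so a key built as priority + addedTime + URL hash would come
back from a scan in the same order as the SQL query above. The names
make_row_key and top_n are hypothetical, and the sketch assumes that a smaller
priority number means "crawl sooner" (matching the ascending ORDER BY), that
priority fits in an unsigned 32-bit integer, and that only not-crawled URLs
are kept under these keys.

```python
import hashlib
import struct


def make_row_key(priority: int, added_time_ms: int, url: str) -> bytes:
    """Hypothetical composite key: priority | addedTime | short URL hash."""
    # Big-endian unsigned integers sort the same way as bytes and as numbers,
    # so byte order of the key follows (priority, addedTime).
    prefix = struct.pack(">IQ", priority, added_time_ms)
    # A short hash keeps keys distinct when two URLs share priority and time.
    suffix = hashlib.md5(url.encode("utf-8")).digest()[:8]
    return prefix + suffix


def top_n(keys, n):
    """In-memory stand-in for a limited scan from the start of the table."""
    return sorted(keys)[:n]


if __name__ == "__main__":
    keys = [
        make_row_key(2, 100, "http://example.com/a"),
        make_row_key(1, 200, "http://example.com/b"),
        make_row_key(1, 100, "http://example.com/c"),
    ]
    for key in top_n(keys, 2):
        print(key.hex())
```

     One trade-off of a layout like this: every client reads from the start
of the key range, so the reads concentrate on whichever region holds the
smallest keys.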
     I found this article,
http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/,
in which they use Redis to store URLs. As I understand it, HBase is modeled
on Bigtable, and Google uses Bigtable to store web pages, so for a huge
number of URLs I would prefer a distributed system like HBase.
