hbase-user mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: use hbase as distributed crawl's scheduler
Date Fri, 03 Jan 2014 06:17:22 GMT

Have a look at http://nutch.apache.org. Version 2.x uses HBase under the hood.

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancyerii@gmail.com> wrote:

> hi all,
>      I want to use hbase to store all urls (crawled or not). Each url
> will have a column named priority, which represents the priority of the
> url. I want to get the top N urls ordered by priority (if priorities are
> equal, the url with the earlier timestamp is preferred).
>      Using something like mysql, my client application would look like:
>      while true:
>          select url from url_db where status='not_crawled'
>              order by priority, addedTime limit 1000;
>          do something with these urls;
>          extract more urls and insert them into url_db;
>      How should I design hbase schema for this application? Is hbase
> suitable for me?
>      I found that in this article
> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> they use redis to store urls. Since hbase originated from bigtable, and
> google uses bigtable to store webpages, for a huge number of urls I
> would prefer a distributed system like hbase.
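
A common pattern for the kind of top-N scan described above (this is a
sketch of that general pattern, not something prescribed in the thread)
is to encode the ordering into the HBase row key itself: HBase returns
rows in ascending byte order, so inverting the priority and zero-padding
both numbers to fixed widths makes a plain scan yield the
highest-priority, earliest-added urls first. All names and widths below
are illustrative assumptions:

```python
def make_row_key(priority: int, added_ms: int, url: str) -> bytes:
    """Build a row key that sorts by (priority desc, added_ms asc).

    Hypothetical layout: <inverted priority, 10 digits>:<timestamp,
    13 digits>:<url>. Inverting the priority (assumed < 2**31) makes
    higher-priority urls sort to smaller byte strings, so an HBase
    scan starting at the beginning of the table returns them first.
    """
    inv_priority = 2**31 - 1 - priority
    return f"{inv_priority:010d}:{added_ms:013d}:".encode() + url.encode()
```

With keys shaped like this, "select the top 1000 not-crawled urls"
becomes a single Scan with a row limit over the table (or over a
`not_crawled` table/column family), with no server-side sorting needed;
the trade-off is that a url's key must be deleted and re-inserted
whenever its priority changes.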
