hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: use hbase as distributed crawl's scheduler
Date Fri, 03 Jan 2014 07:15:53 GMT
Hi LiLi,
Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
skin on top of HBase. You can model your schema and issue your queries just
like you would with MySQL. Something like this:

// Create table that optimizes for your most common query
// (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
// rows ordered)
CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL,
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));

int lastStatus = 0;
int lastPriority = 0;
Date lastAddedTime = new Date(0);   // java.sql.Date
String lastUrl = "";

while (true) {
    // Use a row value constructor to page through results in batches of 1000
    String query =
        "SELECT * FROM url_db " +
        "WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
        "ORDER BY status, priority, added_time, url " +
        "LIMIT 1000";
    PreparedStatement stmt = connection.prepareStatement(query);

    // Bind parameters
    stmt.setInt(1, lastStatus);
    stmt.setInt(2, lastPriority);
    stmt.setDate(3, lastAddedTime);
    stmt.setString(4, lastUrl);
    ResultSet resultSet = stmt.executeQuery();

    while (resultSet.next()) {
        // Remember the last row processed so the next batch starts after it
        lastStatus = resultSet.getInt(1);
        lastPriority = resultSet.getInt(2);
        lastAddedTime = resultSet.getDate(3);
        lastUrl = resultSet.getString(4);

        doSomethingWithUrls();

        // Mark the url as crawled (status = 1)
        PreparedStatement upsert = connection.prepareStatement(
            "UPSERT INTO url_db(status, priority, added_time, url) " +
            "VALUES (1, ?, CURRENT_DATE(), ?)");
        upsert.setInt(1, lastPriority);
        upsert.setString(2, lastUrl);
        upsert.executeUpdate();
    }
    connection.commit();
}
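To make the paging logic above concrete, here is a small self-contained sketch (pure in-memory Java, no Phoenix or HBase involved; the Row class and sample urls are invented for illustration) of how resuming strictly after the last key seen gives the same batches as the row value constructor query:

```java
import java.util.*;

// In-memory model of the url_db key: rows sort by
// (status, priority, addedTime, url), just like the table's PRIMARY KEY,
// and each batch resumes strictly after the last key seen.
public class KeysetPagingSketch {

    // One row of the hypothetical url_db table.
    static final class Row implements Comparable<Row> {
        final int status, priority;
        final long addedTime;
        final String url;
        Row(int status, int priority, long addedTime, String url) {
            this.status = status; this.priority = priority;
            this.addedTime = addedTime; this.url = url;
        }
        public int compareTo(Row o) {
            int c = Integer.compare(status, o.status);
            if (c == 0) c = Integer.compare(priority, o.priority);
            if (c == 0) c = Long.compare(addedTime, o.addedTime);
            if (c == 0) c = url.compareTo(o.url);
            return c;
        }
    }

    // Return up to `limit` rows with status == 0 sorting strictly after `last`,
    // mirroring "(status, priority, added_time, url) > (?, ?, ?, ?)".
    static List<Row> nextBatch(TreeSet<Row> table, Row last, int limit) {
        List<Row> batch = new ArrayList<>();
        for (Row r : table.tailSet(last, false)) {
            if (r.status != 0) break;   // sorted by status first, so we can stop
            batch.add(r);
            if (batch.size() == limit) break;
        }
        return batch;
    }

    public static void main(String[] args) {
        TreeSet<Row> table = new TreeSet<>();
        table.add(new Row(0, 1, 10, "http://a"));
        table.add(new Row(0, 1, 20, "http://b"));
        table.add(new Row(0, 2, 5,  "http://c"));
        table.add(new Row(1, 1, 1,  "http://done"));   // already crawled

        Row cursor = new Row(0, 0, 0, "");              // "before everything" sentinel
        List<Row> page1 = nextBatch(table, cursor, 2);
        List<Row> page2 = nextBatch(table, page1.get(page1.size() - 1), 2);
        System.out.println(page1.get(0).url + " " + page1.get(1).url
                + " " + page2.get(0).url);   // prints http://a http://b http://c
    }
}
```

Note that the second batch picks up exactly where the first left off without re-reading earlier rows, which is why this pattern stays cheap no matter how deep you page.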

If you need to efficiently query on url, add a secondary index like this:

CREATE INDEX url_index ON url_db (url);
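For intuition on what that index buys you, here is a toy in-memory sketch (the pipe-delimited key layout and helper names are invented for illustration; Phoenix maintains the real index table automatically): because url is the last component of the primary key, finding a row by url alone would otherwise mean scanning every key, whereas the index maps url straight back to the full key:

```java
import java.util.*;

// Toy model: primary keys are "status|priority|time|url" strings, kept
// sorted the way Phoenix sorts url_db. The "secondary index" is just a
// url -> primary-key map.
public class UrlIndexSketch {

    // Build the index by scanning the table once.
    static Map<String, String> buildIndex(TreeMap<String, String> table) {
        Map<String, String> idx = new HashMap<>();
        for (String key : table.keySet())
            idx.put(key.substring(key.lastIndexOf('|') + 1), key);
        return idx;
    }

    // Point lookup via the index instead of a full table scan.
    static String lookupByUrl(TreeMap<String, String> table,
                              Map<String, String> idx, String url) {
        String key = idx.get(url);
        return key == null ? null : table.get(key);
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("0|1|10|http://a", "row-a");
        table.put("0|2|20|http://b", "row-b");
        Map<String, String> idx = buildIndex(table);
        System.out.println(lookupByUrl(table, idx, "http://b"));   // prints row-b
    }
}
```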

Please let me know if you have questions.

Thanks,
James




On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancyerii@gmail.com> wrote:

> thank you. But I can't use nutch. could you tell me how hbase is used
> in nutch? or is hbase only used to store webpages?
>
> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> <otis.gospodnetic@gmail.com> wrote:
> > Hi,
> >
> > Have a look at http://nutch.apache.org .  Version 2.x uses HBase under
> > the hood.
> >
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancyerii@gmail.com> wrote:
> >
> >> hi all,
> >>      I want to use hbase to store all urls (crawled or not crawled).
> >> Each url will have a column named priority which represents the
> >> priority of the url. I want to get the top N urls ordered by priority
> >> (if priority is the same, then the url whose timestamp is earlier is
> >> preferred).
> >>      Using something like mysql, my client application might look like:
> >>      while true:
> >>          select url from url_db where status='not_crawled'
> >>          order by priority, addedTime limit 1000;
> >>          do something with these urls;
> >>          extract more urls and insert them into url_db;
> >>      How should I design an hbase schema for this application? Is
> >> hbase suitable for me?
> >>      I found in this article
> >> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/,
> >> they use redis to store urls. I think hbase originated from bigtable,
> >> and google uses bigtable to store webpages, so for a huge number of
> >> urls, I prefer a distributed system like hbase.
> >>
>
