Date: Thu, 2 Jan 2014 22:23:43 -0800
Subject: Re: use hbase as distributed crawl's scheduler
From: James Taylor
To: "user@hbase.apache.org"

Otis,
I didn't realize Nutch uses HBase underneath. It might be interesting if you
serialized the data in a Phoenix-compliant manner, as you could then run SQL
queries directly on top of it.

Thanks,
James

On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> Have a look at http://nutch.apache.org . Version 2.x uses HBase under the
> hood.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Jan 3, 2014 at 1:12 AM, Li Li wrote:
>
> > Hi all,
> > I want to use HBase to store all URLs (crawled or not crawled).
> > Each URL will have a column named priority, which represents the
> > priority of the URL. I want to get the top N URLs ordered by priority
> > (if the priority is the same, the URL with the earlier timestamp is
> > preferred).
> > With something like MySQL, my client application might look like:
> > while true:
> >     select url from url_db where status='not_crawled'
> >         order by priority, addedTime limit 1000;
> >     do something with these urls;
> >     extract more urls and insert them into url_db;
> > How should I design the HBase schema for this application? Is HBase
> > suitable for me?
> > I found in this article
> > http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> > that they use Redis to store URLs. I think HBase originated from
> > Bigtable, and Google uses Bigtable to store web pages, so for a huge
> > number of URLs I prefer a distributed system like HBase.
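
For the schema question quoted above, one possible approach (a sketch only, not something proposed in this thread) is to put the sort order into the row key. HBase keeps rows sorted by the bytes of the row key, and a scan returns them in that order. The Python below only illustrates the key encoding: make_row_key, parse_row_key, and top_n are hypothetical names, the key layout is an assumption, and sorted() stands in for an HBase scan with a row limit. It also assumes non-negative priorities where a lower number means "crawl sooner".

```python
import struct

# Assumed key layout (hypothetical, not from the thread):
#   4-byte unsigned priority | 8-byte unsigned addedTime in ms | URL bytes
# Big-endian unsigned integers sort the same way as raw bytes and as numbers,
# so byte order of the key matches "order by priority, addedTime".
PREFIX_FORMAT = ">IQ"
PREFIX_SIZE = struct.calcsize(PREFIX_FORMAT)  # 12 bytes


def make_row_key(priority: int, added_time_ms: int, url: str) -> bytes:
    """Build a composite row key whose byte order follows (priority, addedTime)."""
    return struct.pack(PREFIX_FORMAT, priority, added_time_ms) + url.encode("utf-8")


def parse_row_key(key: bytes):
    """Split a composite row key back into (priority, added_time_ms, url)."""
    priority, added_time_ms = struct.unpack(PREFIX_FORMAT, key[:PREFIX_SIZE])
    return priority, added_time_ms, key[PREFIX_SIZE:].decode("utf-8")


def top_n(row_keys, n):
    """Stand-in for a limited scan: the first n keys in byte order."""
    return [parse_row_key(k) for k in sorted(row_keys)[:n]]


if __name__ == "__main__":
    keys = [
        make_row_key(2, 100, "http://b.example/"),
        make_row_key(1, 200, "http://a.example/"),
        make_row_key(1, 50, "http://c.example/"),
    ]
    for row in top_n(keys, 2):
        print(row)
```

In a design like this, a URL that has been crawled would presumably be deleted from this table or moved under a different key prefix, so that it no longer shows up at the front of the scan. That part is left out of the sketch.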