From: Jean-Marc Spaggiari
Date: Fri, 3 Jan 2014 07:19:02 -0500
Subject: Re: use hbase as distributed crawl's scheduler
To: user@hbase.apache.org

Interesting. This is exactly what I'm doing ;)

I'm using 3 tables to achieve this: one table with the URLs already crawled
(80 million), one with the URLs to crawl (2 billion), and one with the URLs
being processed. I'm not running any SQL requests against my dataset, but I
have MR jobs doing many different things. I have many other tables to help
with the work on the URLs.

I'm "salting" the keys using the URL hash so I can find them back very
quickly. There can be some collisions, so I also store the URL itself in the
key. Very small scans returning 1 or sometimes 2 rows therefore allow me to
quickly find a row knowing the URL. I also have secondary index tables that
store the CRCs of the pages, to identify duplicate pages based on this
value. And so on ;) I have been working on that for 2 years now.

I might have been able to use Nutch and others, but my goal was to learn and
do that with a distributed client on a single dataset...

Enjoy.

JM
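A minimal sketch of the hash-salted key scheme described above, using the
plain HBase client API. This is not JM's actual code: the 4-byte salt width,
the use of String.hashCode(), and the class and method names are all
illustrative assumptions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedUrlKeys {

    // Row key = 4-byte hash of the URL followed by the URL itself. The hash
    // prefix spreads keys evenly across regions; keeping the full URL in the
    // key disambiguates hash collisions.
    static byte[] rowKey(String url) {
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[4 + urlBytes.length];
        Bytes.putInt(key, 0, url.hashCode());
        System.arraycopy(urlBytes, 0, key, 4, urlBytes.length);
        return key;
    }

    // Look up a URL by scanning its 4-byte hash prefix. The scan returns one
    // row, or occasionally two on a collision, so it behaves like a point
    // lookup.
    static boolean isKnown(Table table, String url) throws IOException {
        byte[] salt = new byte[4];
        Bytes.putInt(salt, 0, url.hashCode());
        Scan scan = new Scan().setRowPrefixFilter(salt);
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                byte[] row = r.getRow();
                String stored =
                    new String(row, 4, row.length - 4, StandardCharsets.UTF_8);
                if (stored.equals(url)) {
                    return true;
                }
            }
        }
        return false;
    }
}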
2014/1/3 James Taylor:

> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. To do that,
> just add "SALT_BUCKETS=<n>" to your CREATE TABLE statement, where <n> is
> the number of machines in your cluster. You can read more about salting
> here: http://phoenix.incubator.apache.org/salted.html
>
> On Thu, Jan 2, 2014 at 11:36 PM, Li Li wrote:
>
> > thank you. it's great.
> >
> > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor wrote:
> >
> > > Hi LiLi,
> > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
> > > SQL skin on top of HBase. You can model your schema and issue your
> > > queries just like you would with MySQL. Something like this:
> > >
> > > // Create a table that optimizes for your most common query
> > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
> > > // your rows ordered)
> > > CREATE TABLE url_db (
> > >     status TINYINT,
> > >     priority INTEGER NOT NULL,
> > >     added_time DATE,
> > >     url VARCHAR NOT NULL,
> > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> > >
> > > int lastStatus = 0;
> > > int lastPriority = 0;
> > > Date lastAddedTime = new Date(0);
> > > String lastUrl = "";
> > >
> > > while (true) {
> > >     // Use a row value constructor to page through results in batches
> > >     // of 1000
> > >     String query =
> > >         "SELECT * FROM url_db " +
> > >         "WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
> > >         "ORDER BY status, priority, added_time, url " +
> > >         "LIMIT 1000";
> > >     PreparedStatement stmt = connection.prepareStatement(query);
> > >
> > >     // Bind parameters to resume after the last row processed
> > >     stmt.setInt(1, lastStatus);
> > >     stmt.setInt(2, lastPriority);
> > >     stmt.setDate(3, lastAddedTime);
> > >     stmt.setString(4, lastUrl);
> > >     ResultSet resultSet = stmt.executeQuery();
> > >
> > >     while (resultSet.next()) {
> > >         // Remember the last row processed so that you can start after
> > >         // it for the next batch
> > >         lastStatus = resultSet.getInt(1);
> > >         lastPriority = resultSet.getInt(2);
> > >         lastAddedTime = resultSet.getDate(3);
> > >         lastUrl = resultSet.getString(4);
> > >
> > >         doSomethingWithUrls();
> > >
> > >         // Mark the URL as crawled (status = 1)
> > >         PreparedStatement upsert = connection.prepareStatement(
> > >             "UPSERT INTO url_db(status, priority, added_time, url) " +
> > >             "VALUES (1, ?, CURRENT_DATE(), ?)");
> > >         upsert.setInt(1, lastPriority);
> > >         upsert.setString(2, lastUrl);
> > >         upsert.executeUpdate();
> > >     }
> > >     connection.commit();
> > > }
> > >
> > > If you need to efficiently query on url, add a secondary index like
> > > this:
> > >
> > > CREATE INDEX url_index ON url_db (url);
> > >
> > > Please let me know if you have questions.
> > >
> > > Thanks,
> > > James
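To make the salting suggestion concrete: applied to the CREATE TABLE from
earlier in the thread, the option goes after the column list, roughly like
this (the bucket count of 8 is a placeholder, not a recommendation; Phoenix
accepts values up to 256):

CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL,
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
    SALT_BUCKETS=8;

Phoenix prepends a bucket byte to every row key and merge-sorts across the
buckets at query time, so the paging query above keeps working unchanged.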
> > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li wrote:
> > >
> > > > thank you. But I can't use nutch. could you tell me how hbase is
> > > > used in nutch? or is hbase only used to store webpages?
> > > >
> > > > On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Have a look at http://nutch.apache.org. Version 2.x uses HBase
> > > > > under the hood.
> > > > >
> > > > > Otis
> > > > > --
> > > > > Performance Monitoring * Log Analytics * Search Analytics
> > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > >
> > > > > On Fri, Jan 3, 2014 at 1:12 AM, Li Li wrote:
> > > > >
> > > > > > hi all,
> > > > > >     I want to use hbase to store all urls (crawled or not
> > > > > > crawled). Each url will have a column named priority which
> > > > > > represents the priority of the url. I want to get the top N
> > > > > > urls ordered by priority (if the priority is the same, the url
> > > > > > whose timestamp is earlier is preferred). Using something like
> > > > > > mysql, my client application would look like:
> > > > > >
> > > > > >     while true:
> > > > > >         select url from url_db where status='not_crawled'
> > > > > >             order by priority, addedTime limit 1000;
> > > > > >         do something with these urls;
> > > > > >         extract more urls and insert them into url_db;
> > > > > >
> > > > > >     How should I design the hbase schema for this application?
> > > > > > Is hbase suitable for me?
> > > > > >     I found in this article
> > > > > > http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> > > > > > that they use redis to store urls. I think hbase originated
> > > > > > from bigtable, and google uses bigtable to store webpages, so
> > > > > > for a huge number of urls I prefer a distributed system like
> > > > > > hbase.
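For reference, the same "top N uncrawled URLs" read also works against a
plain HBase table if the row key is laid out in the order James suggests for
the primary key: status, then priority, then added time, then url. A sketch
under assumed conventions that are not from the thread: a 1-byte status,
big-endian non-negative priority and timestamp so byte order matches numeric
order, and an HBase 2.x-style Scan.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class CrawlScheduler {

    // Fetch up to `limit` row keys with status 0 (not yet crawled). Because
    // the key sorts by (status, priority, added_time, url), a forward scan
    // from the 0x00 status byte up to (and excluding) 0x01 returns rows
    // already in priority-then-time order; no client-side sort is needed.
    static List<byte[]> nextBatch(Table table, int limit) throws IOException {
        Scan scan = new Scan()
            .withStartRow(new byte[] { 0 })  // first possible status-0 key
            .withStopRow(new byte[] { 1 })   // stop before status 1
            .setCaching(limit);
        List<byte[]> keys = new ArrayList<>();
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                keys.add(r.getRow());
                if (keys.size() >= limit) {
                    break;
                }
            }
        }
        return keys;
    }
}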