Subject: Re: use hbase as distributed crawl's scheduler
From: James Taylor <jtaylor@salesforce.com>
To: user@hbase.apache.org
Date: Fri, 3 Jan 2014 11:41:54 -0800

On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika wrote:

> Couple of notes:
> 1. When updating the status you essentially add a new row key into HBase; I
> would give that up altogether. The essential requirement seems to be
> retrieving a list of urls in a certain order.

Not sure about this, but it seemed to me that setting the status field is what
pushes the urls that have already been processed to the end of the sort order.

> 2. Wouldn't salting ruin the required sort order (priority, date added)?

No. Phoenix still returns rows in row key order even when they're salted. We
run a parallel scan per salt bucket and do a merge sort on the client, so the
cost of this is pretty low (and we provide a way of turning it off if your use
case doesn't need it).

Two years, JM? Now you're really going to have to start using Phoenix :-)
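To make the salting point above concrete, here is a minimal sketch, not taken
from the thread itself, of creating a salted table through the Phoenix JDBC
driver and scanning it back in key order. The jdbc:phoenix:localhost connection
URL, the bucket count of 4, and the NOT NULL constraints on every key column
are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SaltedScanSketch {
    public static void main(String[] args) throws Exception {
        // Assumed connection URL; point it at your ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // SALT_BUCKETS spreads the monotonically increasing added_time
            // across four region servers instead of hotspotting a single one.
            // NOT NULL on every key column is an assumption added here so each
            // row always has a complete row key.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS url_db (" +
                "  status TINYINT NOT NULL," +
                "  priority INTEGER NOT NULL," +
                "  added_time DATE NOT NULL," +
                "  url VARCHAR NOT NULL," +
                "  CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))" +
                " SALT_BUCKETS=4");
            // Rows still come back in PRIMARY KEY order: Phoenix scans each
            // salt bucket in parallel and merge sorts the results client side.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT status, priority, added_time, url FROM url_db " +
                    "ORDER BY status, priority, added_time, url LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(2) + " " + rs.getString(4));
                }
            }
        }
    }
}

The only schema-level difference from an unsalted table is the SALT_BUCKETS
property; queries and upserts are written exactly the same way.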
> On Friday, January 3, 2014, James Taylor wrote:
> > Sure, no problem. One addition: depending on the cardinality of your
> > priority column, you may want to salt your table to prevent hotspotting,
> > since you'll have a monotonically increasing date in the key. To do that,
> > just add "SALT_BUCKETS=<n>" on to your query, where <n> is the number of
> > machines in your cluster. You can read more about salting here:
> > http://phoenix.incubator.apache.org/salted.html
> >
> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li wrote:
> > > thank you. it's great.
> > >
> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor wrote:
> > > > Hi LiLi,
> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
> > > > SQL skin on top of HBase. You can model your schema and issue your
> > > > queries just like you would with MySQL. Something like this:
> > > >
> > > > // Create a table that optimizes for your most common query
> > > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
> > > > // your rows ordered)
> > > > CREATE TABLE url_db (
> > > >     status TINYINT,
> > > >     priority INTEGER NOT NULL,
> > > >     added_time DATE,
> > > >     url VARCHAR NOT NULL
> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> > > >
> > > > int lastStatus = 0;
> > > > int lastPriority = 0;
> > > > Date lastAddedTime = new Date(0);
> > > > String lastUrl = "";
> > > >
> > > > while (true) {
> > > >     // Use a row value constructor to page through results in batches
> > > >     // of 1000
> > > >     String query = "
> > > >         SELECT * FROM url_db
> > > >         WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)
> > > >         ORDER BY status, priority, added_time, url
> > > >         LIMIT 1000";
> > > >     PreparedStatement stmt = connection.prepareStatement(query);
> > > >
> > > >     // Bind parameters
> > > >     stmt.setInt(1, lastStatus);
> > > >     stmt.setInt(2, lastPriority);
> > > >     stmt.setDate(3, lastAddedTime);
> > > >     stmt.setString(4, lastUrl);
> > > >     ResultSet resultSet = stmt.executeQuery();
> > > >
> > > >     while (resultSet.next()) {
> > > >         // Remember the last row processed so that you can start after
> > > >         // that for the next batch
> > > >         lastStatus = resultSet.getInt(1);
> > > >         lastPriority = resultSet.getInt(2);
> > > >         lastAddedTime = resultSet.getDate(3);
> > > >         lastUrl = resultSet.getString(4);
> > > >
> > > >         doSomethingWithUrls();
> > > >
> > > >         UPSERT INTO url_db(status, priority, added_time, url)
> > > >             VALUES (1, ?, CURRENT_DATE(), ?);
> > > >     }
> > > > }
> > > >
> > > > If you need to efficiently query on url, add a secondary index like
> > > > this:
> > > >
> > > > CREATE INDEX url_index ON url_db (url);
> > > >
> > > > Please let me know if you have questions.
> > > >
> > > > Thanks,
> > > > James
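To make the quoted example runnable end to end, here is a self-contained
sketch of that paging loop as plain JDBC code against Phoenix. It assumes a
url_db table shaped like the one above; the connection URL, the class name,
and the doSomethingWithUrl placeholder are illustrative, not part of the
original thread.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CrawlSchedulerSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        conn.setAutoCommit(true); // commit each UPSERT as it executes

        // Pages through unprocessed urls (status = 0) in key order, 1000 at a
        // time, using a row value constructor to restart just after the last
        // row seen in the previous batch.
        PreparedStatement query = conn.prepareStatement(
            "SELECT status, priority, added_time, url FROM url_db " +
            "WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
            "ORDER BY status, priority, added_time, url LIMIT 1000");
        // Marks a url as processed. As noted earlier in this message, this
        // writes a new row under a new key; the status = 0 row is not
        // modified and would need to be deleted separately.
        PreparedStatement markDone = conn.prepareStatement(
            "UPSERT INTO url_db (status, priority, added_time, url) " +
            "VALUES (1, ?, CURRENT_DATE(), ?)");

        int lastStatus = 0;
        int lastPriority = 0;
        Date lastAddedTime = new Date(0);
        String lastUrl = "";

        while (true) {
            query.setInt(1, lastStatus);
            query.setInt(2, lastPriority);
            query.setDate(3, lastAddedTime);
            query.setString(4, lastUrl);

            int rows = 0;
            try (ResultSet rs = query.executeQuery()) {
                while (rs.next()) {
                    rows++;
                    lastStatus = rs.getInt(1);
                    lastPriority = rs.getInt(2);
                    lastAddedTime = rs.getDate(3);
                    lastUrl = rs.getString(4);

                    doSomethingWithUrl(lastUrl);

                    markDone.setInt(1, lastPriority);
                    markDone.setString(2, lastUrl);
                    markDone.executeUpdate();
                }
            }
            if (rows == 0) {
                break; // nothing left to crawl in this pass
            }
        }
        conn.close();
    }

    // Illustrative placeholder for the real fetch/parse work.
    private static void doSomethingWithUrl(String url) {
        System.out.println("crawling " + url);
    }
}

One caveat carried over from the first reply in this message: because status
is part of the row key, "updating" it really inserts a second row, so a real
scheduler would also delete (or periodically clean up) the old status = 0 rows.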
> > > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li wrote:
> > > >> thank you. But I can't use nutch. could you tell me how hbase is used
> > > >> in nutch? or is hbase only used to store webpages?
> > > >>
> > > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic wrote:
> > > >> > Hi,
> > > >> >
> > > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase
> > > >> > under the hood.
> > > >> >
> > > >> > Otis
> > > >> > --
> > > >> > Performance Monitoring * Log Analytics * Search Analytics
> > > >> > Solr & Elasticsearch Support * http://sematext.com/
> > > >> >
> > > >> >
> > > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <