Subject: Re: use hbase as distributed crawl's scheduler
From: Li Li
To: user@hbase.apache.org
Date: Mon, 13 Jan 2014 15:25:21 +0800

I am interested in your solution. Here is the detailed architecture of my
distributed vertical crawler: http://www.flickr.com/photos/114261973@N07/
I hope you can give me some advice.

1. goal
I want to implement a distributed vertical (topical) crawler. It will only
store webpages about a certain topic; I will have a classifier to decide
this. I estimate the number of webpages to be stored at tens of millions
(maybe hundreds of millions as time goes on).
A vertical crawler should crawl the pages most likely related to my target
topics, so I need a frontier that can dispatch tasks by priority. For now the
priority scheme is simple, but we hope it can support more complicated
priority algorithms later.

1.1 host priority
We should crawl many hosts at the same time rather than a single host.
Initially every host should be crawled equally, but over time we can
recalculate a host's priority dynamically, e.g. we can control the crawl
speed for a host based on its crawl history (some sites will ban our crawler
if we open too many concurrent connections), or adjust a host's priority by
whether it is relevant to our topic (we can calculate the relevance of its
crawled pages).
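As a rough illustration of 1.1 (not part of the original design), a host's
priority could combine crawl-history politeness with topic relevance along
these lines; the fields and the 0.7/0.3 weights are placeholders, not tuned
values:

    // Hypothetical per-host state kept in the Url DB; field names are made up.
    public class HostPriority {
        long recentErrors;        // timeouts / 403s / bans seen in the last crawl window
        double avgTopicRelevance; // 0..1, average classifier score of this host's pages

        // Prefer topically relevant hosts, back off from hosts that push back.
        double priority() {
            double politeness = 1.0 / (1.0 + recentErrors);
            return 0.7 * avgTopicRelevance + 0.3 * politeness;
        }
    }

The scheduler in 4.2 below could then give each host a share of a batch in
proportion to this score.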
1.2 enqueue time
Webpages enqueued earlier should get higher priority.

1.3 depth
Webpages with smaller depth should get higher priority (something like a BFS
traversal).

1.4 other page priorities
e.g. PageRank, list page vs. detail page, ...

2. architecture
See the picture: http://www.flickr.com/photos/114261973@N07/

2.1 Seed Discover
Use google or other websites to find some seed urls.

2.2 Url DB
A distributed DB to store all metadata about urls (this is the most
hbase-related part).

2.3 Task Scheduler
As described before, the task scheduler selects the top N priority webpages
and dispatches them to the fetcher cluster.

2.4 Message Queues
We use ActiveMQ to decouple the modules and to balance load.

2.5 Fetchers
Download webpages.

2.6 WebPageDB
Stores the crawled webpages and the metadata extracted from them (title,
content, pub_time, author, etc.). We are considering hbase for this too.

2.7 Extractors
Use the classifier to judge whether a page is related to our topics, extract
metadata from it, and store the results in WebPageDB.

3. main challenges

3.1 Url DB
As described before, this store (maybe hbase) should support sophisticated
priority algorithms. We also use it to avoid crawling a webpage more than
once.

3.2 task scheduler
How to achieve the goals above.

4. current solution

4.1 Use hbase (maybe together with phoenix) to store the urls. We have not
done the schema design yet and hope to get some advice here.

4.2 scheduler algorithm (a rough sketch; one possible implementation of
selectTopNUrlsFromHost is sketched after Jean-Marc's reply below):

    int batchSize = 10000;
    // dispatch batchSize tasks across hosts according to host priorities
    Map<String, Integer> hostCount = ...;
    // select the top-priority urls from each host
    List<String> toBeCrawledUrls = new ArrayList<>(batchSize);
    for (Map.Entry<String, Integer> entry : hostCount.entrySet()) {
        // select the top N priority urls from a given host
        List<String> urls = selectTopNUrlsFromHost(entry.getKey(), entry.getValue());
        toBeCrawledUrls.addAll(urls);
    }
    // dispatch these urls to the message queue
    // monitor the message queue status
    // when the queue is fully (or 3/4) consumed, go back to the top and
    // dispatch another batch

5. using map-reduce or hbase?
We discussed the possible usage of map-reduce versus hbase alone. If the
scheduling algorithm were very complicated and had to consider many factors,
maybe we should use map-reduce. But for now our algorithm is simple, and an
hbase coprocessor (or phoenix) can be thought of as a simple online
map-reduce: we can use a coprocessor to implement simple aggregation
functions, or use phoenix sql like SELECT ... COUNT(*) ... GROUP BY ...
HAVING ....

On Fri, Jan 3, 2014 at 8:19 PM, Jean-Marc Spaggiari wrote:
> Interesting. This is exactly what I'm doing ;)
>
> I'm using 3 tables to achieve this.
>
> One table with the URLs already crawled (80 million), one table with the
> URLs to crawl (2 billion) and one table with the URLs being processed. I'm
> not running any SQL requests against my dataset, but I have MR jobs doing
> many different things. I have many other tables to help with the work on
> the URLs.
>
> I'm "salting" the keys using the URL hash so I can find them back very
> quickly. There can be some collisions, so I also store the URL itself in
> the key. So very small scans returning 1 or sometimes 2 rows allow me to
> quickly find a row knowing the URL.
>
> I also have secondary index tables to store the CRCs of the pages to
> identify duplicate pages based on this value.
>
> And so on ;) Working on that for 2 years now. I might have been able to
> use Nutch and others, but my goal was to learn and do that with a
> distributed client on a single dataset...
>
> Enjoy.
>
> JM
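A minimal sketch of the salted-key idea Jean-Marc describes, assuming a
4-byte hash prefix followed by the full URL (the exact hash and layout here
are guesses for illustration, not his actual schema):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UrlKeys {
        // Row key = 4-byte hash prefix + the URL itself. The prefix spreads writes
        // across regions, and keeping the URL in the key means a lookup by URL is a
        // very short scan: 1 row, occasionally 2 on a hash collision.
        public static byte[] rowKey(String url) {
            byte[] prefix = Bytes.toBytes(url.hashCode()); // placeholder; an MD5 prefix would work too
            return Bytes.add(prefix, url.getBytes(StandardCharsets.UTF_8));
        }
    }

A lookup then scans from rowKey(url) and keeps the first row whose stored URL
matches the one requested.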
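Going back to 4.2 above, here is a minimal sketch of what
selectTopNUrlsFromHost could look like against a plain hbase table, assuming
a row key of "<host>/<encoded priority>/<encoded addedTime>/..." where the
encoding makes lexicographic order match the desired crawl order, with the
url stored redundantly in a column so it does not have to be parsed back out
of the key (the key layout, column names and table handle are illustrative
assumptions, not a settled schema):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HostScans {
        private static final byte[] FAMILY = Bytes.toBytes("f");    // assumed column family
        private static final byte[] URL_COL = Bytes.toBytes("url"); // url stored as a value too

        static List<String> selectTopNUrlsFromHost(Table urlTable, String host, int n)
                throws IOException {
            byte[] prefix = Bytes.toBytes(host + "/");
            Scan scan = new Scan();
            scan.setStartRow(prefix);
            scan.setFilter(new PrefixFilter(prefix)); // stay inside this host's key range
            scan.setCaching(Math.min(n, 1000));
            scan.addColumn(FAMILY, URL_COL);
            List<String> urls = new ArrayList<>(n);
            try (ResultScanner scanner = urlTable.getScanner(scan)) {
                for (Result r : scanner) {
                    urls.add(Bytes.toString(r.getValue(FAMILY, URL_COL)));
                    if (urls.size() >= n) {
                        break;
                    }
                }
            }
            return urls;
        }
    }

The scheduler loop in 4.2 would call this once per host with that host's
share of the batch, then mark the dispatched rows (for example by rewriting
them with a "being processed" status) so the next batch does not pick them up
again.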
>
>
> 2014/1/3 James Taylor
>
>> Sure, no problem. One addition: depending on the cardinality of your
>> priority column, you may want to salt your table to prevent hotspotting,
>> since you'll have a monotonically increasing date in the key. To do that,
>> just add "SALT_BUCKETS=<n>" on to your query, where <n> is the number of
>> machines in your cluster. You can read more about salting here:
>> http://phoenix.incubator.apache.org/salted.html
>>
>>
>> On Thu, Jan 2, 2014 at 11:36 PM, Li Li wrote:
>>
>> > thank you. it's great.
>> >
>> > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor wrote:
>> > > Hi LiLi,
>> > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
>> > > SQL skin on top of HBase. You can model your schema and issue your
>> > > queries just like you would with MySQL. Something like this:
>> > >
>> > > // Create table that optimizes for your most common query
>> > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>> > > // your rows ordered)
>> > > CREATE TABLE url_db (
>> > >     status TINYINT,
>> > >     priority INTEGER NOT NULL,
>> > >     added_time DATE,
>> > >     url VARCHAR NOT NULL
>> > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>> > >
>> > > int lastStatus = 0;
>> > > int lastPriority = 0;
>> > > Date lastAddedTime = new Date(0);
>> > > String lastUrl = "";
>> > >
>> > > while (true) {
>> > >     // Use row value constructor to page through results in batches of 1000
>> > >     String query = "
>> > >         SELECT * FROM url_db
>> > >         WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)
>> > >         ORDER BY status, priority, added_time, url
>> > >         LIMIT 1000"
>> > >     PreparedStatement stmt = connection.prepareStatement(query);
>> > >
>> > >     // Bind parameters
>> > >     stmt.setInt(1, lastStatus);
>> > >     stmt.setInt(2, lastPriority);
>> > >     stmt.setDate(3, lastAddedTime);
>> > >     stmt.setString(4, lastUrl);
>> > >     ResultSet resultSet = stmt.executeQuery();
>> > >
>> > >     while (resultSet.next()) {
>> > >         // Remember last row processed so that you can start after that
>> > >         // for next batch
>> > >         lastStatus = resultSet.getInt(1);
>> > >         lastPriority = resultSet.getInt(2);
>> > >         lastAddedTime = resultSet.getDate(3);
>> > >         lastUrl = resultSet.getString(4);
>> > >
>> > >         doSomethingWithUrls();
>> > >
>> > >         UPSERT INTO url_db(status, priority, added_time, url)
>> > >         VALUES (1, ?, CURRENT_DATE(), ?);
>> > >
>> > >     }
>> > > }
>> > >
>> > > If you need to efficiently query on url, add a secondary index like this:
>> > >
>> > > CREATE INDEX url_index ON url_db (url);
>> > >
>> > > Please let me know if you have questions.
>> > >
>> > > Thanks,
>> > > James
>> > >
>> > >
>> > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li wrote:
>> > >
>> > >> thank you. But I can't use nutch. could you tell me how hbase is used
>> > >> in nutch? or is hbase only used to store webpages?
>> > >>
>> > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic wrote:
>> > >> > Hi,
>> > >> >
>> > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase
>> > >> > under the hood.
>> > >> >
>> > >> > Otis
>> > >> > --
>> > >> > Performance Monitoring * Log Analytics * Search Analytics
>> > >> > Solr & Elasticsearch Support * http://sematext.com/
>> > >> >
>> > >> >
>> > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li wrote:
>> > >> >
>> > >> >> hi all,
>> > >> >> I want to use hbase to store all urls (crawled or not crawled).
>> > >> >> And each url will have a column named priority which represents
>> > >> >> the priority of the url. I want to get the top N urls ordered by
>> > >> >> priority (if the priority is the same, the url whose timestamp is
>> > >> >> earlier is preferred).
>> > >> >> Using something like mysql, my client application might look like:
>> > >> >>     while true:
>> > >> >>         select url from url_db where status='not_crawled'
>> > >> >>             order by priority, addedTime limit 1000;
>> > >> >>         do something with these urls;
>> > >> >>         extract more urls and insert them into url_db;
>> > >> >> How should I design an hbase schema for this application? Is hbase
>> > >> >> suitable for me?
>> > >> >> I found in this article
>> > >> >> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>> > >> >> that they use redis to store urls. I think hbase originated from
>> > >> >> bigtable, and google uses bigtable to store webpages, so for a huge
>> > >> >> number of urls I prefer a distributed system like hbase.
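For comparison with the Phoenix example above, a minimal raw-hbase sketch of
the "fetch the next batch of not-crawled urls" loop from the original
question, assuming a row key of status + priority + addedTime + url encoded
with fixed widths so that byte order matches crawl order (the layout and
constants are illustrative assumptions, not something from the thread):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    public class FrontierScan {
        // Assumed key layout: 1-byte status | fixed-width priority | fixed-width
        // addedTime | url. Rows with status = 0 ("not crawled") sort first, already
        // in priority order, so the head of the table is always the next batch.
        static List<byte[]> nextBatch(Table urlTable, int batchSize) throws IOException {
            Scan scan = new Scan();
            scan.setStartRow(new byte[] { 0 }); // begin at status = 0
            scan.setStopRow(new byte[] { 1 });  // stop before status = 1 (crawled)
            scan.setCaching(Math.min(batchSize, 1000));
            List<byte[]> rowKeys = new ArrayList<>(batchSize);
            try (ResultScanner scanner = urlTable.getScanner(scan)) {
                for (Result r : scanner) {
                    rowKeys.add(r.getRow());
                    if (rowKeys.size() >= batchSize) {
                        break;
                    }
                }
            }
            return rowKeys;
        }
    }

After dispatching, each returned row would be deleted and rewritten with
status = 1 so it drops out of this scan range, which is roughly what the
UPSERT in the Phoenix example above does.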