From: Thomas Koch
Reply-To: thomas@koch.ro
To: Mahadev Konar
Cc: zookeeper-user@hadoop.apache.org, common-user@hadoop.apache.org, gearman@googlegroups.com
Subject: Re: feed queue fetcher with hadoop/zookeeper/gearman?
Date: Mon, 12 Apr 2010 19:49:01 +0200
Message-Id: <201004121949.01427.thomas@koch.ro>

Mahadev Konar:
> Hi Thomas,
> There are a couple of projects inside Yahoo!
that use ZooKeeper as an
> event manager for feed processing.
>
> I am a little bit unclear on your example below. As I understand it:
>
> 1. There are 1 million feeds that will be stored in HBase.
> 2. A map reduce job will be run on these feeds to find out which feeds
>    need to be fetched.
> 3. This will create queues in ZooKeeper to fetch the feeds.
> 4. Workers will pull items from this queue and process the feeds.
>
> Did I understand it correctly? Also, if the above is the case, how many
> queue items would you anticipate accumulating every hour?

Yes. That's exactly what I'm thinking about. Currently one node processes
about 20,000 feeds an hour and we have 5 feed-fetch nodes. This would mean
~100,000 queue items/hour. Each queue item should carry some metadata, most
importantly the feed items that are already known to the system, so that
only new items get processed.

Thomas Koch, http://www.koch.ro
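To make steps 3 and 4 above concrete: the usual ZooKeeper queue recipe has producers create PERSISTENT_SEQUENTIAL children under a queue znode, while workers claim the lowest-numbered child and delete it. The following is a minimal in-memory Python model of that recipe, not real ZooKeeper client code (a real deployment would use a ZooKeeper client library against a running ensemble, where the delete of a claimed node is what makes the hand-off atomic); names like `SeqQueue` and `/feedqueue` are illustrative.

```python
import threading

class SeqQueue:
    """In-memory model of the ZooKeeper queue recipe: producers create
    sequential children under a queue node (here /feedqueue); a worker
    takes the lowest-numbered child and deletes it.  In real ZooKeeper,
    a successful delete is what atomically claims the item for one worker."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for ZooKeeper's ordering guarantees
        self._seq = 0
        self._nodes = {}                # child name -> payload, like znode data

    def put(self, payload):
        """Producer side: create a sequential child, ZK-style zero-padded name."""
        with self._lock:
            name = "item-%010d" % self._seq
            self._seq += 1
            self._nodes[name] = payload
            return name

    def take(self):
        """Worker side: claim and remove the lowest-numbered child, or None."""
        with self._lock:
            if not self._nodes:
                return None
            name = min(self._nodes)     # zero-padding makes lexical order = numeric order
            return self._nodes.pop(name)

q = SeqQueue()
q.put("http://example.org/a.rss")
q.put("http://example.org/b.rss")
print(q.take())  # -> http://example.org/a.rss (oldest item first)
```

The zero-padded sequence suffix mirrors what ZooKeeper appends to sequential znodes, which is why taking `min()` of the child names yields FIFO order.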
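The point about each queue item carrying the already-known feed items can be sketched as a payload plus a dedup step on the worker side. This is a hypothetical illustration, assuming a JSON payload and RSS-style `guid` identifiers; the field names (`feed`, `known`) are made up for the example.

```python
import json

def make_queue_item(feed_url, known_guids):
    """Producer side: build a queue-item payload carrying the feed URL and
    the GUIDs already known to the system (hypothetical JSON schema)."""
    return json.dumps({"feed": feed_url, "known": sorted(known_guids)})

def new_entries(queue_item_json, fetched_entries):
    """Worker side: after fetching the feed, keep only entries whose GUID
    is not in the queue item's known set, so only new items get processed."""
    item = json.loads(queue_item_json)
    known = set(item["known"])
    return [e for e in fetched_entries if e["guid"] not in known]

# Example: a feed with one already-known entry and one new one.
item = make_queue_item("http://example.org/feed.rss", {"guid-1"})
fetched = [{"guid": "guid-1", "title": "old"},
           {"guid": "guid-2", "title": "new"}]
print([e["guid"] for e in new_entries(item, fetched)])  # -> ['guid-2']
```

Shipping the known GUIDs inside the queue item keeps the worker stateless: it never has to query HBase to decide what is new, which matters at ~100,000 items/hour.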