From: Patrick Hunt
Date: Mon, 12 Apr 2010 11:22:02 -0700
To: zookeeper-user@hadoop.apache.org
CC: Thomas Koch, Mahadev Konar, common-user@hadoop.apache.org, gearman@googlegroups.com
Subject: Re: feed queue fetcher with hadoop/zookeeper/gearman?

See this environment: http://bit.ly/4ekN8G. Subsequently I used the 3-server setup, each configured with 8 GB of heap in the JVM and 4 CPUs per JVM (I think I used 10-second session timeouts for this), for some additional testing that I've not written up yet. I was able to run ~500 clients (same test script) in parallel, so that means about 5 million znodes and 25 million watches.

The things to watch out for are:

1) Most important: you need to tune the GC, in particular turn on CMS and incremental GC (see the example flags just after this list). Otherwise the GC pauses will cause high latencies and you will see session timeouts.
2) You need a stable network, especially for the serving ensemble.
3) Sufficient memory available in the JVM heap.
4) No I/O issues on the serving hosts (VMs, overloaded disks, swapping, etc.).
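To make point 1 concrete, this is the sort of JVM configuration meant; treat the file location (conf/java.env), the JVMFLAGS variable, and the heap size as assumptions to adapt to your own servers:

    # hypothetical conf/java.env on each ZooKeeper server: turn on CMS and
    # incremental GC so collection pauses stay well under the session timeout
    export JVMFLAGS="-Xmx8g -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"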
In your case you've got less going on, with only 30 or so writes per second. The performance page shows that you're going to be well below the max ops/sec we see in our testing harness.

Btw, gearman would also be a good choice imo. I've looked at integrating ZK with gearman; there are two potentials: 1) as an additional persistent backend store for gearman, and 2) as a way of addressing gearman failover. 1 is pretty simple to do today; 2 is harder and would require some changes to gearman itself, but I think it would be useful (automatic failover of persistent tasks if a gearman server fails).

Patrick

On 04/12/2010 10:49 AM, Thomas Koch wrote:
> Mahadev Konar:
>> Hi Thomas,
>> There are a couple of projects inside Yahoo! that use ZooKeeper as an
>> event manager for feed processing.
>>
>> I am a little bit unclear on your example below. As I understand it:
>>
>> 1. There are 1 million feeds that will be stored in HBase.
>> 2. A MapReduce job will be run on these feeds to find out which feeds
>>    need to be fetched.
>> 3. This will create queues in ZooKeeper to fetch the feeds.
>> 4. Workers will pull items from this queue and process the feeds.
>>
>> Did I understand it correctly? Also, if the above is the case, how many
>> queue items would you anticipate accumulating every hour?
> Yes, that's exactly what I'm thinking about. Currently one node processes
> about 20,000 feeds an hour and we have 5 feed-fetch nodes. This would mean
> ~100,000 queue items/hour. Each queue item should carry some meta
> information, most importantly the feed items that are already known to the
> system, so that only new items get processed.
>
> Thomas Koch, http://www.koch.ro
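A minimal sketch of the ZooKeeper-backed queue described in points 3 and 4 above, using the standard ZooKeeper Java client. The znode path, class name, and payload format are made up for illustration; a real worker would also set watches instead of polling and retry on the NoNodeException raised when two workers race for the same item.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the feed queue: the MapReduce job enqueues feed metadata as
    // sequential znodes, a fetcher worker dequeues the oldest one. Assumes the
    // parent node /feedqueue already exists.
    public class FeedQueueSketch {
        private static final String QUEUE = "/feedqueue";   // hypothetical path
        private final ZooKeeper zk;

        public FeedQueueSketch(ZooKeeper zk) { this.zk = zk; }

        // Producer side (the MapReduce job): one sequential znode per feed to fetch.
        public void enqueue(byte[] feedMeta) throws KeeperException, InterruptedException {
            zk.create(QUEUE + "/item-", feedMeta,
                      Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        // Worker side: take the lowest-numbered item, delete it, return its payload.
        public byte[] dequeue() throws KeeperException, InterruptedException {
            List<String> items = zk.getChildren(QUEUE, false);
            if (items.isEmpty()) return null;          // nothing to fetch right now
            Collections.sort(items);                   // sequence numbers give FIFO order
            String first = QUEUE + "/" + items.get(0);
            byte[] data = zk.getData(first, false, null);
            zk.delete(first, -1);                      // -1 = delete regardless of version
            return data;
        }
    }

At ~100,000 items/hour (roughly 28 enqueues per second) this stays far below the write rates discussed earlier in the thread.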