Subject: RE: HashMap which can spill to disk for Hadoop?
From: "Runping Qi"
To: hadoop-user@lucene.apache.org
Date: Wed, 19 Dec 2007 20:20:03 -0800
Message-ID: <60499C890DBB8042BC7834CC82FB237971C501@SNV-EXVS09.ds.corp.yahoo.com>
In-Reply-To: <93549.63939.qm@web45402.mail.sp1.yahoo.com>

It would be nice if you could contribute a file-backed HashMap, or a
file-backed implementation of the unique-count aggregator.

Short of that, if you just need to count the unique values for each
event id, you can do so by using the aggregate classes with each
event-id/event-value pair as a key and simply counting the number of
occurrences of each composite key.

Runping
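For illustration, here is a rough sketch of that composite-key approach,
written against the plain mapred API rather than the aggregate framework.
Class names are made up, and input is assumed to be tab-separated
eventId/eventValue lines:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CompositeKeyUniqueCount {

      // Mapper: emit one record per (eventId, eventValue) pair, using the
      // pair itself as the key so duplicates collapse in the reduce.
      public static class PairMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter r)
            throws IOException {
          String[] fields = line.toString().split("\t", 2);
          if (fields.length == 2) {
            // Composite key: eventId + separator + eventValue.
            out.collect(new Text(fields[0] + "\u0001" + fields[1]), ONE);
          }
        }
      }

      // Reducer: each distinct composite key reaches reduce exactly once,
      // so emitting (eventId, 1) turns "distinct pair" into a countable unit.
      public static class DedupReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void reduce(Text compositeKey, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter r)
            throws IOException {
          String eventId = compositeKey.toString().split("\u0001", 2)[0];
          out.collect(new Text(eventId), ONE);
        }
      }
    }

A second, trivial sum job over the (eventId, 1) output then yields the
number of unique values per event id, with no large per-reducer HashMap
anywhere in the pipeline.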
> -----Original Message-----
> From: C G [mailto:parallelguy@yahoo.com]
> Sent: Wednesday, December 19, 2007 11:59 AM
> To: hadoop-user@lucene.apache.org
> Subject: HashMap which can spill to disk for Hadoop?
>
> Hi All:
>
> The aggregation classes in Hadoop use a HashMap to hold unique values
> in memory when computing unique counts, etc. I ran into a situation on
> a 32-node grid (4G memory/node) where a single node runs out of memory
> during the reduce phase trying to manage a very large HashMap. This was
> disappointing because the dataset is only 44M rows (4G) of data. This
> is a scenario where I am counting unique values associated with various
> events, where the total number of events is very small and the number
> of unique values is very high. Since the event IDs serve as keys and
> the number of distinct event IDs is small, there is consequently a
> small number of reducers running, and each reducer is expected to
> manage a very large HashMap of unique values.
>
> It looks like I need to build my own unique aggregator, so I am looking
> for an implementation of HashMap which can spill to disk as needed.
> I've considered using BDB as a backing store, and I've looked into
> Derby's BackingStoreHashtable as well.
>
> For the present time I can restructure my data in an attempt to get
> more reducers to run, but I can see a point in the very near future
> where even that will run out of memory.
>
> Any thoughts, comments, or flames?
>
> Thanks,
> C G
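P.S. On the spill-to-disk idea itself: short of pulling in BDB or
Derby's BackingStoreHashtable, one simple shape a reducer-side unique
counter could take is to hash-partition incoming values into temp files
and dedupe one partition at a time, so peak memory is bounded by the
largest partition rather than the whole value set. A minimal,
self-contained sketch, not a drop-in implementation (all names below
are hypothetical, and values are assumed to contain no newlines):

    import java.io.*;
    import java.util.HashSet;
    import java.util.Set;

    public class SpillingUniqueCounter implements Closeable {
      private final int numPartitions;
      private final File[] spillFiles;
      private final BufferedWriter[] writers;

      public SpillingUniqueCounter(File tmpDir, int numPartitions)
          throws IOException {
        this.numPartitions = numPartitions;
        this.spillFiles = new File[numPartitions];
        this.writers = new BufferedWriter[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
          spillFiles[i] = File.createTempFile("uniq-" + i + "-", ".tmp", tmpDir);
          writers[i] = new BufferedWriter(new FileWriter(spillFiles[i]));
        }
      }

      // Route each value to a partition file by hash; nothing accumulates
      // in memory during the add phase.
      public void add(String value) throws IOException {
        int p = (value.hashCode() & Integer.MAX_VALUE) % numPartitions;
        writers[p].write(value);
        writers[p].newLine();
      }

      // Dedupe one partition at a time; peak memory is one partition's
      // distinct values, not the full set.
      public long uniqueCount() throws IOException {
        long total = 0;
        for (int i = 0; i < numPartitions; i++) {
          writers[i].close();
          Set<String> seen = new HashSet<String>();
          BufferedReader in = new BufferedReader(new FileReader(spillFiles[i]));
          try {
            for (String line; (line = in.readLine()) != null; ) {
              seen.add(line);
            }
          } finally {
            in.close();
          }
          total += seen.size();
        }
        return total;
      }

      public void close() throws IOException {
        for (int i = 0; i < numPartitions; i++) {
          writers[i].close();  // no-op if already closed by uniqueCount()
          spillFiles[i].delete();
        }
      }
    }

Inside a reduce over one event id's values, that would be: add() every
value, then emit uniqueCount() and close(). More partitions means less
memory per pass at the cost of more open files.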