Subject: Re: Finding small subset in very large dataset
From: Miles Osborne <milesosb@gmail.com>
To: core-user@hadoop.apache.org
Date: Wed, 18 Feb 2009 16:44:38 +0000

If I remember correctly you have two sets of data:

--set A, which is very big
--set B, which is small

and you want to find all elements of A which are in B. Right?

Represent A using a variant of a Bloom filter which supports key-value
pairs; a Bloomier filter will do this for you. Each mapper then loads
up A (represented as the Bloomier filter) and works over B: whenever an
element of B is present in the representation of A, you look up the
associated value and emit it.
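Something like the following is what I have in mind for the mapper
side. This is an untested sketch: there is no Bloomier filter in the
Hadoop distribution, so I'm standing in the built-in
org.apache.hadoop.util.bloom.BloomFilter for the membership test, and
the class name SubsetMapper and the file name "filter.bloom" are made
up -- assume the filter was built over A in a separate pass and shipped
to each node.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Untested sketch. Assumes KeyValueTextInputFormat, so each record of
// B arrives as a (key, value) pair of Texts, and assumes a filter
// built over A was serialised to the (made-up) local file
// "filter.bloom", e.g. pushed out with the DistributedCache.
public class SubsetMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  public void configure(JobConf job) {
    try {
      // Load the pre-built filter over A once per task.
      DataInputStream in =
          new DataInputStream(new FileInputStream("filter.bloom"));
      filter.readFields(in);
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load filter over A", e);
    }
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Copy only the valid bytes: Text.getBytes() returns the whole
    // backing buffer, which may be longer than getLength().
    byte[] k = Arrays.copyOf(key.getBytes(), key.getLength());
    // Emit the record only if its key is (probably) also in A.
    if (filter.membershipTest(new Key(k))) {
      output.collect(key, value);
    }
  }
}

With a true Bloomier filter the membership test becomes a lookup that
hands back the associated value directly, so you would emit that
instead; with a plain Bloom filter you get occasional false positives,
so anything emitted here may need a final exact check against A.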
If even with a Bloomier filter you still need too much memory, then you
could store the data once in Hypertable.

See here for an explanation of Bloomier filters applied to the task of
storing lots of (string, probability) pairs:

Randomized Language Models via Perfect Hash Functions
http://aclweb.org/anthology-new/P/P08/P08-1058.pdf

Miles

2009/2/18 Thibaut_ :
>
> Hi Miles,
>
> I'm not following you.
> If I'm saving an associated hash or bit vector, how can I then quickly
> access the elements afterwards (the file with the data might be >100GB
> big and is on the DFS)?
>
> I could also directly save the offset of the data in the datafile as a
> reference, and then on each reducer read in that big file only once. As
> all the keys are sorted, I can get all the needed values in one big
> read step (skipping those entries I don't need).
>
> Thibaut
>
>
> Miles Osborne wrote:
>>
>> Just re-represent the associated data as a bit vector and set of hash
>> functions. You then just copy this around, rather than the raw items
>> themselves.
>>
>> Miles
>>
>> 2009/2/18 Thibaut_ :
>>>
>>> Hi,
>>>
>>> The Bloom filter solution works great, but I still have to copy the
>>> data around sometimes.
>>>
>>> I'm still wondering if I can reduce the data associated with the keys
>>> to a reference or something small (the >100 KB of data is very big),
>>> with which I can then later fetch the data in the reduce step.
>>>
>>> In the past I was using HBase to store the associated data (but
>>> unfortunately HBase proved to be very unreliable in my case). I will
>>> probably also start to compress the data in the value store, which
>>> will probably increase sorting speed (as the data there is probably
>>> uncompressed).
>>> Is there anything else I could do to speed this process up?
>>>
>>> Thanks,
>>> Thibaut
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>
> --
> View this message in context:
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.