Subject: Re: Finding small subset in very large dataset
From: Miles Osborne <milesosb@gmail.com>
To: core-user@hadoop.apache.org
Date: Wed, 18 Feb 2009 16:44:38 +0000

If I remember correctly you have two sets of data:

--set A, which is very big
--set B, which is small

and you want to find all elements of A which are in B. Right?

Represent A using a variant of a Bloom filter which supports key-value
pairs; a Bloomier filter will do this for you. Each mapper then loads
up A (represented as the Bloomier filter) and works over B: whenever an
element of B is present in the representation of A, you look up the
associated value and emit it.
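Something like the following is what I have in mind for the mapper
side. This is an untested sketch: there is no Bloomier filter in the
Hadoop distribution, so I'm standing in the built-in
org.apache.hadoop.util.bloom.BloomFilter for the membership test, and
the class name SubsetMapper and the file name "filter.bloom" are made
up -- assume the filter was built over A in a separate pass and shipped
to each node.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Untested sketch. Assumes KeyValueTextInputFormat, so each record of
// B arrives as a (key, value) pair of Texts, and assumes a filter
// built over A was serialised to the (made-up) local file
// "filter.bloom", e.g. pushed out with the DistributedCache.
public class SubsetMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  public void configure(JobConf job) {
    try {
      // Load the pre-built filter over A once per task.
      DataInputStream in =
          new DataInputStream(new FileInputStream("filter.bloom"));
      filter.readFields(in);
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load filter over A", e);
    }
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Copy only the valid bytes: Text.getBytes() returns the whole
    // backing buffer, which may be longer than getLength().
    byte[] k = Arrays.copyOf(key.getBytes(), key.getLength());
    // Emit the record only if its key is (probably) also in A.
    if (filter.membershipTest(new Key(k))) {
      output.collect(key, value);
    }
  }
}

With a true Bloomier filter the membership test becomes a lookup that
hands back the associated value directly, so you would emit that
instead; with a plain Bloom filter you get occasional false positives,
so anything emitted here may need a final exact check against A.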
If even with a Bloomier filter you still need too much memory, then you
could store the data once in Hypertable.

See here for an explanation of Bloomier filters applied to the task of
storing lots of (string, probability) pairs:

Randomized Language Models via Perfect Hash Functions
http://aclweb.org/anthology-new/P/P08/P08-1058.pdf

Miles

2009/2/18 Thibaut_ :
>
> Hi Miles,
>
> I'm not following you.
> If I'm saving an associated hash or bit vector, how can I then quickly
> access the elements afterwards (the file with the data might be >100GB
> big and is on the DFS)?
>
> I could also directly save the offset of the data in the datafile as a
> reference, and then on each reducer read in that big file only once. As
> all the keys are sorted, I can get all the needed values in one big
> read step (skipping those entries I don't need).
>
> Thibaut
>
>
> Miles Osborne wrote:
>>
>> Just re-represent the associated data as a bit vector and set of hash
>> functions. You then just copy this around, rather than the raw items
>> themselves.
>>
>> Miles
>>
>> 2009/2/18 Thibaut_ :
>>>
>>> Hi,
>>>
>>> The Bloom filter solution works great, but I still have to copy the
>>> data around sometimes.
>>>
>>> I'm still wondering if I can reduce the data associated with the keys
>>> to a reference or something small (the >100 KB of data is very big),
>>> with which I can then later fetch the data in the reduce step.
>>>
>>> In the past I was using HBase to store the associated data (but
>>> unfortunately HBase proved to be very unreliable in my case). I will
>>> probably also start to compress the data in the value store, which
>>> will probably increase sorting speed (as the data there is probably
>>> uncompressed).
>>> Is there anything else I could do to speed this process up?
>>>
>>> Thanks,
>>> Thibaut
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>
> --
> View this message in context:
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.