From: Todd Lipcon
Date: Wed, 6 Jan 2010 11:32:38 -0800
Subject: Re: Which Hadoop product is more appropriate for a quick query on a large data set?
To: general@hadoop.apache.org

Hi Xueling,

Here's a general outline:

My guess is that your "position of match" field is bounded (perhaps by the
number of base pairs in the human genome?). Given this, you can probably
write a very simple Partitioner implementation that divides this field into
ranges, each with an approximately equal number of records.

Next, write a simple MR job which takes in a line of data and outputs the
same line, but with the position-of-match as the key. This will get
partitioned by the above function, so each reducer ends up receiving all of
the records in a given range. In the reducer, simply output every 1000th
position into your "sparse" output file (along with its offset in the
non-sparse output file), and every position into the non-sparse output
file.

In your realtime query server (not part of Hadoop), load the "sparse" file
into RAM and perform binary search, etc. - find the "bins" which the range
endpoints land in, and then open the non-sparse output on HDFS to finish
the count.

Hope that helps.

Thanks
-Todd
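To make the Partitioner step above concrete, here is a rough, untested
sketch. The class name, the LongWritable/Text key-value types, and the
assumed upper bound on "position of match" (roughly the length of the human
genome) are illustrative guesses rather than anything specified in the
thread:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Splits the bounded "position of match" key space into numReduceTasks
 * equal-width ranges, so each reducer receives one contiguous slice of
 * positions. MAX_POSITION is an assumed upper bound, not a value from
 * the thread.
 */
public class PositionRangePartitioner extends Partitioner<LongWritable, Text> {

  private static final long MAX_POSITION = 3200000000L; // assumed bound on "position of match"

  @Override
  public int getPartition(LongWritable position, Text record, int numReduceTasks) {
    long rangeWidth = (MAX_POSITION / numReduceTasks) + 1;  // width of each reducer's slice
    int partition = (int) (position.get() / rangeWidth);
    return Math.min(partition, numReduceTasks - 1);         // clamp if a position exceeds the bound
  }
}

It would be registered with job.setPartitionerClass(PositionRangePartitioner.class).
Equal-width ranges only balance the reducers if positions are roughly
uniform; if they turn out to be heavily skewed, sampling the data to pick
the range boundaries (as Hadoop's TotalOrderPartitioner does) is the usual
alternative.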
On Tue, Jan 5, 2010 at 5:26 PM, Xueling Shu wrote:

> To rephrase the sentence "Or what APIs I should start with for my
> testing?": I mean "Which HDFS APIs should I start to look into for my
> testing?"
>
> Thanks,
> Xueling
>
> On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu wrote:
>
> > Hi Todd:
> >
> > After finishing some tasks I have finally gotten back to HDFS testing.
> >
> > One question about your last reply to this thread: are there any code
> > examples close to your second and third recommendations? Or which APIs
> > should I start with for my testing?
> >
> > Thanks.
> > Xueling
> >
> > On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon wrote:
> >
> >> Hi Xueling,
> >>
> >> In that case, I would recommend the following:
> >>
> >> 1) Put all of your data on HDFS.
> >> 2) Write a MapReduce job that sorts the data by position of match.
> >> 3) As a second output of this job, you can write a "sparse index" -
> >>    basically a set of entries like this:
> >>
> >>    <position of match>  <file offset>  <number of entries following>
> >>
> >> where you're basically giving offsets into every 10K records or so. If
> >> you index every 10K records, then 5 billion total will mean 500,000
> >> index entries. Each index entry shouldn't be more than 20 bytes, so
> >> 500,000 entries will be about 10MB. This is super easy to fit into
> >> memory. (You could probably index every 1000th record instead and end
> >> up with around 100MB, still easy to fit in memory.)
> >>
> >> Then, to satisfy your count-range query, you can simply scan your
> >> in-memory sparse index. Some of the indexed blocks will be completely
> >> included in the range, in which case you just add up the "number of
> >> entries following" column. The start and finish blocks will only be
> >> partially covered, so you can use the file offset info to load that
> >> file off HDFS, start reading at that offset, and finish the count.
> >>
> >> Total time per query should be <100ms, no problem.
> >>
> >> -Todd
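To make the lookup described in the quoted message above concrete, here is
a rough, untested sketch of the query-server side: load the sparse index
into RAM, binary-search for the bin containing each range endpoint, add up
the counts of the fully covered bins, and finish the partial bin by seeking
into the sorted file on HDFS. The class and method names, the tab-separated
index layout, the assumption that the sorted output is a single file with
the position as its first field, and the no-straddling assumption noted in
the comments are all illustrative guesses, not part of the thread:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Query-server side: count records whose "position of match" lies in a range. */
public class RangeCounter {

  /** One sparse-index entry: first position in the bin, byte offset of that
   *  record in the sorted data file, and how many records the bin covers. */
  private static class IndexEntry {
    final long position, fileOffset, count;
    IndexEntry(long position, long fileOffset, long count) {
      this.position = position; this.fileOffset = fileOffset; this.count = count;
    }
  }

  private final List<IndexEntry> index = new ArrayList<IndexEntry>();
  private final FileSystem fs;
  private final Path sortedData;  // assumed: a single sorted output file

  public RangeCounter(Configuration conf, Path sparseIndex, Path sortedData) throws IOException {
    this.fs = FileSystem.get(conf);
    this.sortedData = sortedData;
    // Load the whole sparse index into RAM; per the estimate above it is only a few MB.
    BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(sparseIndex)));
    String line;
    while ((line = r.readLine()) != null) {
      String[] f = line.split("\t");  // assumed layout: position \t offset \t count
      index.add(new IndexEntry(Long.parseLong(f[0]), Long.parseLong(f[1]), Long.parseLong(f[2])));
    }
    r.close();
  }

  /** Records with lo <= position < hi. */
  public long countInRange(long lo, long hi) throws IOException {
    return countBelow(hi) - countBelow(lo);
  }

  /** Number of records with position strictly less than x. Assumes records
   *  with equal positions never straddle a bin boundary (the reducer can
   *  enforce this when it writes the index). */
  private long countBelow(long x) throws IOException {
    // Binary search for the bin containing x: last entry whose start position <= x.
    int b = -1, lo = 0, hi = index.size() - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (index.get(mid).position <= x) { b = mid; lo = mid + 1; } else { hi = mid - 1; }
    }
    if (b < 0) return 0;  // x precedes the first indexed position

    long total = 0;
    for (int i = 0; i < b; i++) {  // bins entirely below x: just add their counts
      total += index.get(i).count; // (could precompute prefix sums, but ~500K adds is already fast)
    }

    // Partial bin: seek into the sorted data on HDFS and count records with position < x.
    IndexEntry bin = index.get(b);
    FSDataInputStream in = fs.open(sortedData);
    in.seek(bin.fileOffset);
    BufferedReader r = new BufferedReader(new InputStreamReader(in));
    long seen = 0;
    String line;
    while (seen < bin.count && (line = r.readLine()) != null) {
      long pos = Long.parseLong(line.split("\t", 2)[0]); // assumed: position is the first field
      if (pos >= x) break;
      total++;
      seen++;
    }
    r.close();
    return total;
  }
}

A range count then costs one pass over the in-memory index plus at most two
short reads from HDFS, one per partially covered bin.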
> >> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu wrote:
> >>
> >> > Hi Todd:
> >> >
> >> > Thank you for your reply.
> >> >
> >> > The datasets won't be updated often, but queries against a data set
> >> > are frequent, and the quicker the query the better. For example, we
> >> > have done testing on a MySQL database (5 billion records randomly
> >> > scattered across 24 tables), and the slowest query against the
> >> > biggest table (400,000,000 records) takes around 12 minutes. So if
> >> > any Hadoop product can speed up the search, that product is what we
> >> > are looking for.
> >> >
> >> > Cheers,
> >> > Xueling
> >> >
> >> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon wrote:
> >> >
> >> >> Hi Xueling,
> >> >>
> >> >> One important question that can really change the answer:
> >> >>
> >> >> How often does the dataset change? Can the changes be merged in
> >> >> bulk every once in a while, or do you need to actually update them
> >> >> randomly very often?
> >> >>
> >> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1
> >> >> second, or 10ms?
> >> >>
> >> >> Thanks
> >> >> -Todd
> >> >>
> >> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org> wrote:
> >> >>
> >> >> > Hi there:
> >> >> >
> >> >> > I am researching Hadoop to see which of its products suits our
> >> >> > need for quick queries against large data sets (billions of
> >> >> > records per set).
> >> >> >
> >> >> > The queries will be performed against chip sequencing data. Each
> >> >> > record is one line in a file. To be clear, below is a sample
> >> >> > record from the data set.
> >> >> >
> >> >> > One line (record) looks like:
> >> >> > 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> >> >> >
> >> >> > The highlighted field is called "position of match", and the
> >> >> > query we are interested in is the number of sequences in a
> >> >> > certain range of this "position of match". For instance, the
> >> >> > range can be "position of match" > 200 and "position of match"
> >> >> > + 36 < 200,000.
> >> >> >
> >> >> > Any suggestions on the Hadoop product I should start with to
> >> >> > accomplish the task? HBase, Pig, Hive, or ...?
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Xueling
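Finally, to connect the record format above to the MapReduce plan at the
top of the thread, here is a minimal, untested sketch of a map function
that pulls out the "position of match" and emits it as the key. The class
name, the 0-based field index, and the assumption that the asterisks around
103570835 are just highlighting rather than part of the data are guesses
from the sample line, not anything confirmed in the thread:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Emits each record keyed by its "position of match" so a range
 * partitioner like the one sketched earlier can route it to the right
 * reducer. The field index is a guess based on the sample record above;
 * adjust it to the real layout.
 */
public class PositionKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private static final int POSITION_FIELD = 7; // 0-based column of the position (assumed)
  private final LongWritable outKey = new LongWritable();

  @Override
  protected void map(LongWritable fileOffset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+");
    // Strip any surrounding '*' characters, which look like highlighting
    // in the sample rather than part of the data.
    String position = fields[POSITION_FIELD].replace("*", "");
    outKey.set(Long.parseLong(position));
    context.write(outKey, line);
  }
}

Wired into a Job together with the partitioner sketched earlier (and
LongWritable/Text as the map output types), this covers the routing half of
the plan; the reducer that writes the sparse and non-sparse outputs is left
out here.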