hadoop-general mailing list archives

From "Gibbon, Robert, VF-Group" <Robert.Gib...@vodafone.com>
Subject RE: Which Hadoop product is more appropriate for a quick query on a large data set?
Date Wed, 06 Jan 2010 20:00:01 GMT

Isn't this what Hadoop HBase is supposed to do? The partitioning M/R implementation - "sharding"
in street terms - is the sideways scaling that HBase is designed to excel at! Also, the indexed
HBase flavour could allow very fast ad-hoc queries for Xueling using the new yet familiar HBQL SQL dialect.

Sorry, my 10 pence worth!

-----Original Message-----
From: Xueling Shu [mailto:xshu@systemsbiology.org]
Sent: Wed 1/6/2010 8:41 PM
To: general@hadoop.apache.org
Subject: Re: Which Hadoop product is more appropriate for a quick query on a large data set?
 
Thanks for the information! I will start trying it out.

Xueling

On Wed, Jan 6, 2010 at 11:32 AM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Xueling,
>
> Here's a general outline:
>
> My guess is that your "position of match" field is bounded (perhaps by the
> number of base pairs in the human genome?) Given this, you can probably
> write a very simple Partitioner implementation that divides this field into
> ranges, each with an approximately equal number of records.
>
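A minimal sketch of such a Partitioner, written here in Java against the
org.apache.hadoop.mapreduce API (the class name and the MAX_POSITION bound are
illustrative assumptions; equal-width ranges only give roughly equal record
counts if positions are spread fairly evenly across the genome):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Splits the bounded "position of match" key space into equal-width
    // ranges, one per reducer. MAX_POSITION is an assumed upper bound
    // (roughly the length of the human genome); adjust it to the real
    // maximum position in the data.
    public class PositionRangePartitioner extends Partitioner<LongWritable, Text> {
      private static final long MAX_POSITION = 3200000000L;

      @Override
      public int getPartition(LongWritable position, Text record, int numPartitions) {
        long bucketWidth = (MAX_POSITION / numPartitions) + 1;
        int bucket = (int) (position.get() / bucketWidth);
        return Math.min(bucket, numPartitions - 1);
      }
    }
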
> Next, write a simple MR job which takes in a line of data, and outputs the
> same line, but with the position-of-match as the key. This will get
> partitioned by the above function, so you end up with each reducer
> receiving
> all of the records in a given range.
>
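A sketch of that map step, assuming the whitespace-separated record layout
shown later in this thread, where "position of match" is the eighth field
(index 7); the class name and field index are assumptions:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Re-keys each input line by its "position of match" field so that the
    // range partitioner can route it to the reducer owning that range.
    public class PositionKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      private static final int POSITION_FIELD = 7;
      private final LongWritable outKey = new LongWritable();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length <= POSITION_FIELD) {
          return; // skip malformed lines
        }
        outKey.set(Long.parseLong(fields[POSITION_FIELD]));
        context.write(outKey, line);
      }
    }
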
> In the reducer, simply output every 1000th position into your "sparse"
> output file (along with the non-sparse output file offset), and every
> position into the non-sparse output file.
>
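One way to sketch that reducer is to write every record to a per-reducer
"non-sparse" file directly on HDFS, so the byte offset can be tracked by hand,
and to emit a sparse entry every 1,000 records as the job's regular output.
The output path and sampling interval below are assumptions, not anything
prescribed in this thread:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Writes every record to a per-reducer "non-sparse" file on HDFS,
    // tracking its byte offset by hand, and emits a sparse index entry
    // (position TAB offset) every SAMPLE_INTERVAL records as normal output.
    public class SparseIndexReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
      private static final long SAMPLE_INTERVAL = 1000;
      private FSDataOutputStream fullOut;
      private long bytesWritten = 0;
      private long recordCount = 0;

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        int partition = context.getTaskAttemptID().getTaskID().getId();
        Path fullPath = new Path("/output/nonsparse/part-" + partition);
        fullOut = FileSystem.get(conf).create(fullPath);
      }

      @Override
      protected void reduce(LongWritable position, Iterable<Text> records, Context context)
          throws IOException, InterruptedException {
        for (Text record : records) {
          if (recordCount % SAMPLE_INTERVAL == 0) {
            context.write(position, new Text(Long.toString(bytesWritten)));
          }
          byte[] line = (record.toString() + "\n").getBytes("UTF-8");
          fullOut.write(line);
          bytesWritten += line.length;
          recordCount++;
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        fullOut.close();
      }
    }

Writing the big file from the reducer itself, rather than through the normal
OutputFormat, is simply the easiest way to know the exact byte offset of each
record as it is written.
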
> In your realtime query server (not part of Hadoop), load the "sparse" file
> into RAM and perform binary search, etc - find the "bins" which the range
> endpoints land in, and then open the non-sparse output on HDFS to finish
> the
> count.
>
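A rough sketch of that query-side logic, assuming the sparse index has already
been loaded into two parallel, position-sorted arrays, that the non-sparse
output is reachable as a single HDFS file (or that offsets have been adjusted
per part file), and that records keep the field layout used above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Arrays;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Answers "how many records have position in [lo, hi)" by counting whole
    // sampled bins from the in-memory sparse index and scanning only the two
    // partially covered bins out of the non-sparse file on HDFS.
    public class RangeCounter {
      private static final long SAMPLE_INTERVAL = 1000;
      private final long[] positions;  // sorted sparse positions
      private final long[] offsets;    // matching byte offsets in the non-sparse file
      private final FileSystem fs;
      private final Path nonSparse;

      public RangeCounter(long[] positions, long[] offsets, FileSystem fs, Path nonSparse) {
        this.positions = positions;
        this.offsets = offsets;
        this.fs = fs;
        this.nonSparse = nonSparse;
      }

      public long count(long lo, long hi) throws IOException {
        int loBin = bin(lo);
        int hiBin = bin(hi);
        if (loBin == hiBin) {
          return scan(loBin, lo, hi);              // range falls inside one bin
        }
        long total = scan(loBin, lo, hi)           // partial first bin
                   + scan(hiBin, lo, hi);          // partial last bin
        total += (long) (hiBin - loBin - 1) * SAMPLE_INTERVAL;  // full bins between
        return total;
      }

      // Index of the sparse entry whose bin contains the given position.
      private int bin(long position) {
        int idx = Arrays.binarySearch(positions, position);
        return idx >= 0 ? idx : Math.max(0, -idx - 2);
      }

      // Read one bin (up to SAMPLE_INTERVAL records) from HDFS, count matches.
      private long scan(int binIdx, long lo, long hi) throws IOException {
        long matches = 0;
        FSDataInputStream in = fs.open(nonSparse);
        try {
          in.seek(offsets[binIdx]);
          BufferedReader reader = new BufferedReader(new InputStreamReader(in));
          for (long i = 0; i < SAMPLE_INTERVAL; i++) {
            String line = reader.readLine();
            if (line == null) break;
            long pos = Long.parseLong(line.split("\\s+")[7]);
            if (pos >= lo && pos < hi) matches++;
          }
        } finally {
          in.close();
        }
        return matches;
      }
    }
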
> Hope that helps.
>
> Thanks
> -Todd
>
> On Tue, Jan 5, 2010 at 5:26 PM, Xueling Shu <xshu@systemsbiology.org>
> wrote:
>
> > To rephrase the sentence "Or which APIs should I start with for my testing?":
> > I mean "Which HDFS APIs should I start looking into for my testing?"
> >
> > Thanks,
> > Xueling
> >
> > On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu <xshu@systemsbiology.org>
> > wrote:
> >
> > > Hi Todd:
> > >
> > > After finishing some tasks, I am finally getting back to HDFS testing.
> > >
> > > One question for your last reply to this thread: Are there any code
> > > examples close to your second and third recommendations? Or which APIs
> > > should I start with for my testing?
> > >
> > > Thanks.
> > > Xueling
> > >
> > >
> > > On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > >
> > >> Hi Xueling,
> > >>
> > >> In that case, I would recommend the following:
> > >>
> > >> 1) Put all of your data on HDFS
> > >> 2) Write a MapReduce job that sorts the data by position of match
> > >> 3) As a second output of this job, you can write a "sparse index" -
> > >> basically a set of entries like this:
> > >>
> > >> <position of match> <offset into file> <number of entries following>
> > >>
> > >> where you're basically giving offsets into every 10K records or so. If
> > >> you index every 10K records, then 5 billion total will mean 500,000
> > >> index entries. Each index entry shouldn't be more than 20 bytes, so
> > >> 500,000 entries will be about 10MB. This is super easy to fit into
> > >> memory. (You could probably index every 100th record instead and end up
> > >> with about 1GB, which is still feasible to hold in memory.)
> > >>
> > >> Then to satisfy your count-range query, you can simply scan your
> > >> in-memory sparse index. Some of the indexed blocks will be completely
> > >> included in the range, in which case you just add up the "number of
> > >> entries following" column. The start and finish blocks will only be
> > >> partially covered, so you can use their file offsets to open the
> > >> non-sparse file on HDFS, start reading at those offsets, and finish the count.
> > >>
> > >> Total time per query should be <100ms no problem.
> > >>
> > >> -Todd
> > >>
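Since the sparse index fits comfortably in memory, the query server can read it
once at startup. A sketch of that loading step, assuming the index was written
as tab-separated lines of <position> <offset> <entries following> in part-*
files under a single directory and is already globally sorted by position
(otherwise sort the three arrays together by position after loading):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reads the tab-separated sparse index (position, file offset, entries
    // following) from HDFS into three parallel arrays held in RAM by the
    // query server. Paths and column order are assumptions for illustration.
    public class SparseIndexLoader {
      public static long[][] load(FileSystem fs, Path indexDir) throws IOException {
        List<long[]> rows = new ArrayList<long[]>();
        for (FileStatus part : fs.listStatus(indexDir)) {
          if (!part.getPath().getName().startsWith("part-")) {
            continue; // skip _logs, _SUCCESS and other non-data files
          }
          BufferedReader reader =
              new BufferedReader(new InputStreamReader(fs.open(part.getPath())));
          String line;
          while ((line = reader.readLine()) != null) {
            String[] cols = line.split("\t");
            rows.add(new long[] { Long.parseLong(cols[0]),
                                  Long.parseLong(cols[1]),
                                  Long.parseLong(cols[2]) });
          }
          reader.close();
        }
        long[] positions = new long[rows.size()];
        long[] offsets = new long[rows.size()];
        long[] counts = new long[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
          positions[i] = rows.get(i)[0];
          offsets[i] = rows.get(i)[1];
          counts[i] = rows.get(i)[2];
        }
        return new long[][] { positions, offsets, counts };
      }
    }
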
> > >> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu <xshu@systemsbiology.org>
> > >> wrote:
> > >> > Hi Todd:
> > >> >
> > >> > Thank you for your reply.
> > >> >
> > >> > The datasets won't be updated often, but queries against a data set
> > >> > are frequent, and the quicker the query, the better. For example, we
> > >> > have done testing on a MySQL database (5 billion records randomly
> > >> > scattered into 24 tables), and the slowest query against the biggest
> > >> > table (400,000,000 records) is around 12 minutes. So if any Hadoop
> > >> > product can speed up the search, then that product is what we are
> > >> > looking for.
> > >> >
> > >> > Cheers,
> > >> > Xueling
> > >> >
> > >> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > >> >
> > >> >> Hi Xueling,
> > >> >>
> > >> >> One important question that can really change the answer:
> > >> >>
> > >> >> How often does the dataset change? Can the changes be merged in
> > >> >> bulk every once in a while, or do you need to actually update them
> > >> >> randomly very often?
> > >> >>
> > >> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds,
> > >> >> 1 second, or 10ms?
> > >> >>
> > >> >> Thanks
> > >> >> -Todd
> > >> >>
> > >> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org>
> > >> >> wrote:
> > >> >> >  Hi there:
> > >> >> >
> > >> >> > I am researching Hadoop to see which of its products suits our
> > >> >> > need for quick queries against large data sets (billions of
> > >> >> > records per set).
> > >> >> >
> > >> >> > The queries will be performed against chip sequencing data. Each
> > >> >> > record is one line in a file. To be clear, below is a sample
> > >> >> > record from the data set.
> > >> >> >
> > >> >> > One line (record) looks like: 1-1-174-418
> > >> >> > TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> > >> >> >
> > >> >> > The highlighted field is called "position of match", and the query
> > >> >> > we are interested in is the # of sequences in a certain range of
> > >> >> > this "position of match". For instance, the range can be "position
> > >> >> > of match" > 200 and "position of match" + 36 < 200,000.
> > >> >> >
> > >> >> > Any suggestions on the Hadoop product I should start with to
> > >> >> > accomplish the task? HBase, Pig, Hive, or ...?
> > >> >> >
> > >> >> > Thanks!
> > >> >> >
> > >> >> > Xueling
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
> > >
> >
>

