hadoop-general mailing list archives

From Xueling Shu <x...@systemsbiology.org>
Subject Re: Which Hadoop product is more appropriate for a quick query on a large data set?
Date Sat, 12 Dec 2009 22:56:02 GMT
Great information! Thank you for your help, Todd.

Xueling

On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Xueling,
>
> In that case, I would recommend the following:
>
> 1) Put all of your data on HDFS
> 2) Write a MapReduce job that sorts the data by position of match
> 3) As a second output of this job, you can write a "sparse index" -
> basically a set of entries like this:
>
> <position of match> <offset into file> <number of entries following>
>
> where you're basically giving offsets into every 10K records or so. If
> you index every 10K records, then 5 billion total will mean 500,000
> index entries. Each index entry shouldn't be more than 20 bytes, so
> 500,000 entries will be about 10MB. This is super easy to fit into
> memory. (You could probably index every 1,000th record instead and end
> up with around 100MB, still easy to fit in memory.)
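
A rough Java sketch of the sort in step 2, keying each record on its
"position of match" so the MapReduce shuffle does the sorting. The field
index and the single global reducer are illustrative assumptions; for 5
billion records you would more likely use a TotalOrderPartitioner with many
reducers to get a set of sorted, non-overlapping output files.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByPosition {

  public static class PosMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      // Key each record by its "position of match" so the shuffle sorts on it.
      // Field index 7 (0-based) is assumed from the sample record in the thread.
      long pos = Long.parseLong(record.toString().split("\\s+")[7]);
      ctx.write(new LongWritable(pos), record);
    }
  }

  public static class PosReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(LongWritable pos, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      // Records arrive in key (position) order; write them back out unchanged.
      for (Text r : records) {
        ctx.write(r, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sort by position of match");
    job.setJarByClass(SortByPosition.class);
    job.setMapperClass(PosMapper.class);
    job.setReducerClass(PosReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(1);   // one reducer -> one globally sorted output file
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}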
>
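
One way to produce the sparse index described above is a single sequential
pass over the sorted file, writing one entry per 10K records. The paths, the
stride, and the record layout here are placeholders; the same entries could
equally be emitted as a second output of the sort job itself, as suggested
above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BuildSparseIndex {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path sorted = new Path(args[0]);      // the sorted records from the job above
    Path indexOut = new Path(args[1]);    // where to write the sparse index
    long stride = 10000;                  // one index entry per 10K records

    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(sorted), "UTF-8"));
    FSDataOutputStream out = fs.create(indexOut, true);

    long offset = 0;        // byte offset of the current line ('\n' endings assumed)
    long blockStart = 0;    // byte offset where the current index block begins
    long blockPos = 0;      // "position of match" of the block's first record
    long inBlock = 0;       // records seen so far in the current block
    String line;
    while ((line = in.readLine()) != null) {
      if (inBlock == 0) {
        blockStart = offset;
        blockPos = Long.parseLong(line.split("\\s+")[7]);
      }
      inBlock++;
      offset += line.getBytes("UTF-8").length + 1;
      if (inBlock == stride) {
        // <position of match> <offset into file> <number of entries following>
        out.write((blockPos + " " + blockStart + " " + inBlock + "\n").getBytes("UTF-8"));
        inBlock = 0;
      }
    }
    if (inBlock > 0) {      // don't lose the trailing partial block
      out.write((blockPos + " " + blockStart + " " + inBlock + "\n").getBytes("UTF-8"));
    }
    in.close();
    out.close();
  }
}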
> Then to satisfy your count-range query, you can simply scan your
> in-memory sparse index. Some of the indexed blocks will be completely
> included in the range, in which case you just add up the "number of
> entries following" column. The start and finish blocks will be only
> partially covered, so you can use the file offset info to open that
> file on HDFS, start reading at that offset, and finish the count.
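
A minimal sketch of that query path: load the small index into memory once,
add up the fully covered blocks, and only open the sorted file on HDFS for
the partially covered boundary blocks. The Entry layout and field index
follow the sketches above and are assumptions for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RangeCount {

  static class Entry { long pos, offset, count; }   // one line of the sparse index

  // Read the whole sparse index into memory; at ~10MB this is cheap to keep around.
  static List<Entry> loadIndex(FileSystem fs, Path index) throws Exception {
    List<Entry> idx = new ArrayList<Entry>();
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(index), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(" ");
      Entry e = new Entry();
      e.pos = Long.parseLong(f[0]);
      e.offset = Long.parseLong(f[1]);
      e.count = Long.parseLong(f[2]);
      idx.add(e);
    }
    in.close();
    return idx;
  }

  // Count records whose "position of match" lies in [lo, hi].
  static long count(FileSystem fs, Path sorted, List<Entry> idx, long lo, long hi)
      throws Exception {
    long total = 0;
    for (int i = 0; i < idx.size(); i++) {
      long first = idx.get(i).pos;   // smallest position in block i (file is sorted)
      long last = (i + 1 < idx.size()) ? idx.get(i + 1).pos : Long.MAX_VALUE;  // upper bound
      if (last < lo || first > hi) continue;          // block entirely outside the range
      if (first >= lo && last <= hi) {
        total += idx.get(i).count;                    // block entirely inside the range
      } else {
        // Boundary block: seek to its offset in the sorted file and count one by one.
        FSDataInputStream raw = fs.open(sorted);
        raw.seek(idx.get(i).offset);
        BufferedReader r = new BufferedReader(new InputStreamReader(raw, "UTF-8"));
        for (long n = 0; n < idx.get(i).count; n++) {
          long p = Long.parseLong(r.readLine().split("\\s+")[7]);
          if (p >= lo && p <= hi) total++;
        }
        r.close();
      }
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    List<Entry> idx = loadIndex(fs, new Path(args[1]));
    // Example from the original question: position > 200 and position + 36 < 200,000.
    System.out.println(count(fs, new Path(args[0]), idx, 201, 200000 - 36 - 1));
  }
}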
>
> Total time per query should be <100ms no problem.
>
> -Todd
>
> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu <xshu@systemsbiology.org>
> wrote:
> > Hi Todd:
> >
> > Thank you for your reply.
> >
> > The datasets won't be updated often, but queries against a data set are
> > frequent, and the quicker the query the better. For example, we have
> > done testing on a MySQL database (5 billion records randomly scattered
> > into 24 tables) and the slowest query against the biggest table
> > (400,000,000 records) is around 12 minutes. So if any Hadoop product can
> > speed up the search, then that product is what we are looking for.
> >
> > Cheers,
> > Xueling
> >
> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <todd@cloudera.com> wrote:
> >
> >> Hi Xueling,
> >>
> >> One important question that can really change the answer:
> >>
> >> How often does the dataset change? Can the changes be merged in bulk
> >> every once in a while, or do you need to apply random updates very
> >> often?
> >>
> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,
> >> or 10ms?
> >>
> >> Thanks
> >> -Todd
> >>
> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org>
> >> wrote:
> >> >  Hi there:
> >> >
> >> > I am researching Hadoop to see which of its products suits our need
> >> > for quick queries against large data sets (billions of records per
> >> > set).
> >> >
> >> > The queries will be performed against chip sequencing data. Each
> >> > record is one line in a file. To be clear, a sample record from the
> >> > data set is shown below.
> >> >
> >> >
> >> > One line (record) looks like:
> >> > 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> >> >
> >> > The highlighted field is called "position of match", and the query
> >> > we are interested in is the number of sequences in a certain range
> >> > of this "position of match". For instance, the range can be
> >> > "position of match" > 200 and "position of match" + 36 < 200,000.
> >> >
> >> > Any suggestions on the Hadoop product I should start with to
> >> > accomplish the task? HBase, Pig, Hive, or ...?
> >> >
> >> > Thanks!
> >> >
> >> > Xueling
> >> >
> >>
> >
>
