hadoop-general mailing list archives

From Xueling Shu <x...@systemsbiology.org>
Subject Re: Which Hadoop product is more appropriate for a quick query on a large data set?
Date Sat, 12 Dec 2009 18:38:01 GMT
Hi Todd:

Thank you for your reply.

The datasets won't be updated often, but queries against a data set are
frequent. The quicker the query, the better. For example, we have tested
a MySQL database (5 billion records randomly scattered across 24 tables),
and the slowest query against the biggest table (400,000,000 records)
takes around 12 minutes. So if any Hadoop product can speed up the search,
that product is what we are looking for.

Cheers,
Xueling

On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Xueling,
>
> One important question that can really change the answer:
>
> How often does the dataset change? Can the changes be merged in in
> bulk every once in a while, or do you need to actually update them
> randomly very often?
>
> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or
> 10ms?
>
> Thanks
> -Todd
>
> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org>
> wrote:
> >  Hi there:
> >
> > I am researching Hadoop to see which of its products suits our need for
> > quick queries against large data sets (billions of records per set)
> >
> > The queries will be performed against chip sequencing data. Each record
> > is one line in a file. To be clear, below is a sample record in the
> > data set.
> >
> >
> > one line (record) looks like:
> > 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> >
> > The highlighted field is called "position of match" and the query we are
> > interested in is the # of sequences in a certain range of this "position
> > of match". For instance the range can be "position of match" > 200 and
> > "position of match" + 36 < 200,000.
> >
> > Any suggestions on the Hadoop product I should start with to accomplish
> > the task? HBase, Pig, Hive, or ...?
> >
> > Thanks!
> >
> > Xueling
> >
>
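
To make the query in the original message concrete, here is a minimal Python sketch of the range count it describes. It is not a Hadoop solution and not something proposed in the thread; it only illustrates the query semantics. It assumes that "position of match" is always the 8th whitespace-separated field (as in the sample record), that the asterisks around it are just email highlighting, and that the read length is 36 as in the example range. The names POSITION_FIELD, parse_position, build_index, and count_in_range are hypothetical. Because the data rarely changes, the sketch sorts the positions once and answers each query with a binary search.

```python
from bisect import bisect_left, bisect_right

# Assumption: "position of match" is the 8th whitespace-separated field
# (0-based index 7), as in the sample record from the thread.
POSITION_FIELD = 7


def parse_position(line):
    """Extract the "position of match" from one record line."""
    fields = line.split()
    # Strip the '*' highlighting used in the email; real data likely lacks it.
    return int(fields[POSITION_FIELD].strip("*"))


def build_index(lines):
    """Sort all positions once; reasonable when the data set rarely changes."""
    return sorted(parse_position(line) for line in lines if line.strip())


def count_in_range(index, lo, hi, read_len=36):
    """Count records with position > lo and position + read_len < hi."""
    upper = hi - read_len  # position + read_len < hi  <=>  position < upper
    # bisect_right(lo) counts positions <= lo; bisect_left(upper) counts
    # positions < upper. The difference is the count with lo < position < upper.
    return max(0, bisect_left(index, upper) - bisect_right(index, lo))


if __name__ == "__main__":
    sample = "1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C"
    index = build_index([sample])
    print(count_in_range(index, 200, 200_000))
```

Each query costs two binary searches over the sorted positions instead of a scan. At the sizes mentioned in the thread (hundreds of millions of records per table), a plain Python list would probably not fit comfortably in memory, so treat this as an illustration of the query rather than something to run on the full data.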
