hadoop-general mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Which Hadoop product is more appropriate for a quick query on a large data set?
Date Sat, 12 Dec 2009 03:34:41 GMT
Hi Xueling,

One important question that can really change the answer:

How often does the dataset change? Can the changes be merged in bulk
every once in a while, or do you need to update individual records at
random, and often?

Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or 10ms?

Thanks
-Todd

On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org> wrote:
>  Hi there:
>
> I am researching Hadoop to see which of its products suits our need for
> quick queries against large data sets (billions of records per set).
>
> The queries will be performed against chip sequencing data. Each record is
> one line in a file. To be clear, a sample record from the data set is shown
> below.
>
> One line (record) looks like:
>
> 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
>
>
> The highlighted field is called "position of match", and the query we are
> interested in is the number of sequences whose "position of match" falls in
> a certain range. For instance, the range can be "position of match" > 200
> and "position of match" + 36 < 200,000.
>
> Any suggestions on the Hadoop product I should start with to accomplish the
> task? HBase, Pig, Hive, or ...?
>
> Thanks!
>
> Xueling
>
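
For concreteness, the range query described in the quoted message could be sketched as a plain linear scan in Python. This is only an illustration under stated assumptions, not a recommendation for any particular Hadoop product: it assumes fields are whitespace-separated, that "position of match" is the eighth field (index 7, as in the sample record), that the asterisks around it are just highlighting from the email, and that 36 is the read length used in the example bounds. The names `position_of_match` and `count_in_range` are hypothetical.

```python
from typing import Iterable


def position_of_match(record: str, field_index: int = 7) -> int:
    """Extract the "position of match" field from one record line.

    Assumes whitespace-separated fields with the position at index 7,
    and strips any '*' highlighting characters around the value.
    """
    fields = record.split()
    return int(fields[field_index].strip("*"))


def count_in_range(records: Iterable[str], low: int, high: int,
                   read_length: int = 36) -> int:
    """Count records with position > low and position + read_length < high."""
    count = 0
    for record in records:
        if not record.strip():
            continue  # skip blank lines
        position = position_of_match(record)
        if position > low and position + read_length < high:
            count += 1
    return count


# The sample record from the quoted message, joined onto one line.
sample = ("1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 "
          "1 4 *103570835* F .. 23G 24C")
```

A call such as `count_in_range(open(path), 200, 200000)` would scan a whole file, where `path` stands for one of the data files. Whether a scan like this is acceptable depends on how fast "quick" has to be.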
