hadoop-general mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Which Hadoop product is more appropriate for a quick query on a large data set?
Date Sat, 12 Dec 2009 21:01:39 GMT
Hi Xueling,

In that case, I would recommend the following:

1) Put all of your data on HDFS
2) Write a MapReduce job that sorts the data by position of match
3) As a second output of this job, you can write a "sparse index" -
basically a set of entries like this:

<position of match> <offset into file> <number of entries following>

where you're basically recording an offset for every 10K records or so.
If you index every 10K records, then 5 billion records total will mean
about 500,000 index entries. Each index entry shouldn't be more than 20
bytes, so 500,000 entries will be around 10 MB. This is super easy to
fit into memory. (You could even index every 100th record instead and
end up with roughly 1 GB, which is still manageable in memory.)
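
To make that concrete, here is a minimal sketch (not from the original
thread) of a sequential pass that builds the sparse index over the sorted
output. It assumes the sort job writes one record per line with the
position of match as the first tab-separated field; the class name, paths,
and stride below are all hypothetical.

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class BuildSparseIndex {
    private static final int STRIDE = 10000;  // one index entry per 10K records

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        LineReader in = new LineReader(fs.open(new Path(args[0])));    // sorted data on HDFS
        PrintWriter index = new PrintWriter(fs.create(new Path(args[1])));

        Text line = new Text();
        long offset = 0;          // byte offset of the line about to be read
        long blockStart = 0;      // byte offset where the current block began
        long blockFirstPos = 0;   // "position of match" of the block's first record
        int inBlock = 0;          // records accumulated in the current block
        int bytes;
        while ((bytes = in.readLine(line)) > 0) {
            if (inBlock == 0) {
                blockStart = offset;
                blockFirstPos = Long.parseLong(line.toString().split("\t", 2)[0]);
            }
            inBlock++;
            offset += bytes;      // readLine returns bytes consumed, including the newline
            if (inBlock == STRIDE) {
                // <position of match> <offset into file> <number of entries following>
                index.println(blockFirstPos + "\t" + blockStart + "\t" + inBlock);
                inBlock = 0;
            }
        }
        if (inBlock > 0) {        // flush the final, partially filled block
            index.println(blockFirstPos + "\t" + blockStart + "\t" + inBlock);
        }
        index.close();
        in.close();
    }
}

In practice you'd probably emit the index as a second output of the sort
job itself (e.g. with MultipleOutputs) rather than as an extra pass; the
standalone version is just easier to read.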

Then to satisfy your count-range query, you can simply scan your
in-memory sparse index. Some of the indexed blocks will be completely
included in the range, in which case you just add up the "number of
entries following" column. The start and finish blocks will only be
partially covered, so you can use their file offsets to open the sorted
data file on HDFS, start reading at those offsets, and finish the count
by scanning just those blocks.
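
A matching sketch of the query side (again not from the original thread,
and assuming the same hypothetical tab-separated layout as above): load
the sparse index into memory once, add up the counts of blocks that fall
entirely inside the range, and scan only the boundary blocks off HDFS.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RangeCount {
    static class IndexEntry {
        final long firstPos;   // position of match of the block's first record
        final long offset;     // byte offset of the block in the sorted data file
        final int count;       // number of records in the block
        IndexEntry(long p, long o, int c) { firstPos = p; offset = o; count = c; }
    }

    // Read the whole sparse index (a few MB) into memory once at startup.
    static List<IndexEntry> loadIndex(FileSystem fs, Path indexPath) throws IOException {
        List<IndexEntry> index = new ArrayList<IndexEntry>();
        BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(indexPath)));
        String line;
        while ((line = r.readLine()) != null) {
            String[] f = line.split("\t");
            index.add(new IndexEntry(Long.parseLong(f[0]), Long.parseLong(f[1]),
                    Integer.parseInt(f[2])));
        }
        r.close();
        return index;
    }

    // Count records whose position of match lies in [lo, hi].
    static long countInRange(FileSystem fs, Path data, List<IndexEntry> index,
                             long lo, long hi) throws IOException {
        long total = 0;
        for (int i = 0; i < index.size(); i++) {
            IndexEntry e = index.get(i);
            long nextFirst = (i + 1 < index.size())
                    ? index.get(i + 1).firstPos : Long.MAX_VALUE;
            if (nextFirst < lo || e.firstPos > hi) {
                continue;                                // block entirely outside the range
            } else if (e.firstPos >= lo && nextFirst <= hi) {
                total += e.count;                        // block entirely inside: just add its count
            } else {
                total += scanBlock(fs, data, e, lo, hi); // boundary block: scan it from HDFS
            }
        }
        return total;
    }

    // Seek to the block's byte offset in the sorted data file and count matching records.
    static long scanBlock(FileSystem fs, Path data, IndexEntry e,
                          long lo, long hi) throws IOException {
        long matches = 0;
        FSDataInputStream in = fs.open(data);
        in.seek(e.offset);
        BufferedReader r = new BufferedReader(new InputStreamReader(in));
        for (int n = 0; n < e.count; n++) {
            String line = r.readLine();
            if (line == null) break;
            long pos = Long.parseLong(line.split("\t", 2)[0]);
            if (pos >= lo && pos <= hi) matches++;
        }
        r.close();
        return matches;
    }
}

Since everything except the boundary blocks is answered straight from the
in-memory index, each query only reads a couple of 10K-record chunks off
HDFS.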

Total time per query should be <100ms no problem.

-Todd

On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu <xshu@systemsbiology.org> wrote:
> Hi Todd:
>
> Thank you for your reply.
>
> The datasets won't be updated often, but queries against a data set are
> frequent. The quicker the query, the better. For example, we have done
> testing on a MySQL database (5 billion records randomly scattered into 24
> tables), and the slowest query against the biggest table (400,000,000
> records) is around 12 minutes. So if any Hadoop product can speed up the
> search, then that product is what we are looking for.
>
> Cheers,
> Xueling
>
> On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> Hi Xueling,
>>
>> One important question that can really change the answer:
>>
>> How often does the dataset change? Can the changes be merged in bulk
>> every once in a while, or do you need to update individual records
>> randomly very often?
>>
>> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or
>> 10ms?
>>
>> Thanks
>> -Todd
>>
>> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org>
>> wrote:
>> >  Hi there:
>> >
>> > I am researching Hadoop to see which of its products suits our need for
>> > quick queries against large data sets (billions of records per set).
>> >
>> > The queries will be performed against chip sequencing data. Each record
>> > is one line in a file. To be clear, below is a sample record from the
>> > data set.
>> >
>> > One line (record) looks like: 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
>> >
>> > The highlighted field is called "position of match", and the query we
>> > are interested in is the number of sequences whose "position of match"
>> > falls within a certain range. For instance, the range can be "position
>> > of match" > 200 and "position of match" + 36 < 200,000.
>> >
>> > Any suggestions on which Hadoop product I should start with to
>> > accomplish the task? HBase, Pig, Hive, or ...?
>> >
>> > Thanks!
>> >
>> > Xueling
>> >
>>
>
