hadoop-general mailing list archives

From: Fred Zappert <fzapp...@gmail.com>
Subject: Re: Which Hadoop product is more appropriate for a quick query on a large data set?
Date: Sat, 12 Dec 2009 23:31:19 GMT
+1 for HBase

On Sat, Dec 12, 2009 at 2:56 PM, Xueling Shu <xshu@systemsbiology.org> wrote:

> Great information! Thank you for your help, Todd.
>
> Xueling
>
> On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
> > Hi Xueling,
> >
> > In that case, I would recommend the following:
> >
> > 1) Put all of your data on HDFS
> > 2) Write a MapReduce job that sorts the data by position of match
> > 3) As a second output of this job, you can write a "sparse index" -
> > basically a set of entries like this:
> >
> > <position of match> <offset into file> <number of entries following>
> >
> > > where you're basically recording an offset into the file for every 10K
> > > records or so. If you index every 10K records, then 5 billion total
> > > will mean 500,000 index entries. Each index entry shouldn't be more
> > > than 20 bytes, so 500,000 entries will be about 10MB. This is super
> > > easy to fit into memory. (you could even index every 100th record
> > > instead and end up with about 1GB, which still fits in memory on a
> > > well-provisioned node)
> >
> > Then to satisfy your count-range query, you can simply scan your
> > in-memory sparse index. Some of the indexed blocks will be completely
> > included in the range, in which case you just add up the "number of
> > entries following" column. The start and finish block will be
> > partially covered, so you can use the file offset info to load that
> > file off HDFS, start reading at that offset, and finish the count.
> >
> > Total time per query should be <100ms no problem.
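
A minimal Java sketch of the query path just described, with the same
assumed record layout and hypothetical names; a linear scan of the small
in-memory index stands in for a binary search, since a few hundred
thousand entries cost almost nothing to walk:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.List;

    public class RangeCounter {
        // One index entry: position of match, byte offset, entries following.
        record Entry(long position, long offset, int entries) {}

        // Count records whose "position of match" lies in [lo, hi].
        static long countRange(List<Entry> index, RandomAccessFile data,
                               long lo, long hi) throws IOException {
            long total = 0;
            for (int i = 0; i < index.size(); i++) {
                Entry e = index.get(i);
                long next = (i + 1 < index.size())
                        ? index.get(i + 1).position() : Long.MAX_VALUE;
                if (e.position() >= lo && next <= hi + 1) {
                    total += e.entries();                 // block fully inside
                } else if (next > lo && e.position() <= hi) {
                    total += scanBlock(data, e, lo, hi);  // boundary block
                }
            }
            return total;
        }

        // Seek to a boundary block's offset and count matches exactly.
        static long scanBlock(RandomAccessFile data, Entry e, long lo, long hi)
                throws IOException {
            data.seek(e.offset());
            long n = 0;
            for (int i = 0; i < e.entries(); i++) {
                String line = data.readLine();
                if (line == null) break;
                long pos = Long.parseLong(line.split("\\s+")[7]);
                if (pos >= lo && pos <= hi) n++;
            }
            return n;
        }
    }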
> >
> > -Todd
> >
> > On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu <xshu@systemsbiology.org>
> > wrote:
> > > Hi Todd:
> > >
> > > Thank you for your reply.
> > >
> > > The datasets won't be updated often. But queries against a data set
> > > are frequent. The quicker the query, the better. For example, we have
> > > done testing on a MySQL database (5 billion records randomly scattered
> > > into 24 tables), and the slowest query against the biggest table
> > > (400,000,000 records) takes around 12 minutes. So if any Hadoop product
> > > can speed up the search, then that product is what we are looking for.
> > >
> > > Cheers,
> > > Xueling
> > >
> > > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > >
> > >> Hi Xueling,
> > >>
> > >> One important question that can really change the answer:
> > >>
> > >> How often does the dataset change? Can the changes be merged in bulk
> > >> every once in a while, or do you need to update records randomly very
> > >> often?
> > >>
> > >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,
> > >> or 10ms?
> > >>
> > >> Thanks
> > >> -Todd
> > >>
> > >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <xshu@systemsbiology.org>
> > >> wrote:
> > >> > Hi there:
> > >> >
> > >> > I am researching Hadoop to see which of its products suits our need
> > >> > for quick queries against large data sets (billions of records per
> > >> > set).
> > >> >
> > >> > The queries will be performed against chip sequencing data. Each
> > >> > record is one line in a file. To be clear, a sample record from the
> > >> > data set is shown below.
> > >> >
> > >> >
> > >> > one line (record) looks like: 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC
> > >> > U2 0 0 1 4 *103570835* F .. 23G 24C
> > >> >
> > >> > The highlighted field is called "position of match" and the query we
> > >> > are interested in is the # of sequences in a certain range of this
> > >> > "position of match". For instance the range can be "position of
> > >> > match" > 200 and "position of match" + 36 < 200,000.
> > >> >
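As a hypothetical illustration, that predicate might look like this in
Java, assuming whitespace-delimited fields with "position of match" as
the 8th column:

    public class MatchFilter {
        static final int READ_LENGTH = 36; // the "+ 36" in the example

        // e.g. inRange(record, 200, 200_000) for the range given above
        static boolean inRange(String record, long lo, long hi) {
            long position = Long.parseLong(record.split("\\s+")[7]);
            return position > lo && position + READ_LENGTH < hi;
        }
    }
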
> > >> > Any suggestions on the Hadoop product I should start with to
> > >> > accomplish the task? HBase, Pig, Hive, or ...?
> > >> >
> > >> > Thanks!
> > >> >
> > >> > Xueling
> > >> >
> > >>
> > >
> >
>
