hbase-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Using HBase for storing biology data and querying it
Date Thu, 15 Apr 2010 18:38:22 GMT
I think I agree with Jesper... HBase does not seem the best fit to me
since you are concerned with batch scanning and transformation, rather
than single record access.

If you chose MapReduce you would do something like this (rough sketch
after the list):
- provide the filter (column6, greaterThan, 0.1) and pass that around
in the Job config for the Mappers to read in the init()/setup() phase
- map() would apply the filter and emit only the lines meeting the criteria
- reduce would do nothing I think (a map-only job)
- you'd write a custom output format which generates the HDF file.
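Very roughly, and completely untested (the property names, the zero-based
field index and the NullWritable key below are just my assumptions; adapt
them to whatever you actually put in the config), the mapper could look like:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only filter: emits only the lines whose chosen field exceeds the cutoff.
    public class CutoffMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

      private int fieldIndex;   // zero-based position of the field to test
      private double cutoff;    // e.g. 0.1

      @Override
      protected void setup(Context context) {
        // read the filter that was placed in the job config before submission,
        // e.g. conf.set("ld.filter.field", "6"); conf.set("ld.filter.cutoff", "0.1")
        fieldIndex = Integer.parseInt(context.getConfiguration().get("ld.filter.field", "6"));
        cutoff = Double.parseDouble(context.getConfiguration().get("ld.filter.cutoff", "0.1"));
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > fieldIndex
            && Double.parseDouble(fields[fieldIndex]) > cutoff) {
          // pass the whole line through unchanged; no reducer needed
          context.write(NullWritable.get(), line);
        }
      }
    }

Setting job.setNumReduceTasks(0) makes it map-only, so the map output goes
straight to the output format, which is where the custom HDF writer would
plug in.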

However....

This screams Hive to me.  With Hive, you load the tab-delimited file
into Hadoop HDFS.  Then you create a table (just like in your
favourite DB) and issue SQL against it, with the results going to
another file (think SQL on top of a CSV file).  The output would then
be turned into your HDF file as you already do.  Hive builds a query
plan from the SQL and launches MapReduce jobs to do the work for you.
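To make that concrete, this is roughly how it could be driven from Java
over Hive's JDBC driver (untested; the driver class, the localhost:10000
server, the table and column names and the HDFS path are all just my
assumptions, and you can equally type the same statements at the hive>
prompt):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LdCutoffQuery {
      public static void main(String[] args) throws Exception {
        // Hive's JDBC driver, talking to a Hive server on its default port
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // one Hive column per field of the LD file (names are made up)
        stmt.executeQuery("CREATE TABLE ld_data ("
            + "pos1 INT, pos2 INT, pop STRING, snp1 STRING, snp2 STRING, "
            + "v1 DOUBLE, v2 DOUBLE, v3 DOUBLE, v4 DOUBLE) "
            + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

        // point Hive at the raw file already sitting in HDFS
        stmt.executeQuery("LOAD DATA INPATH '/data/ld/chb.txt' INTO TABLE ld_data");

        // the user's cutoff query; Hive turns this into MapReduce jobs
        ResultSet rs = stmt.executeQuery("SELECT * FROM ld_data WHERE v2 > 0.1");
        while (rs.next()) {
          System.out.println(rs.getString(4) + "\t" + rs.getString(5)
              + "\t" + rs.getString(7));
        }
      }
    }

Two small notes: an EXTERNAL TABLE with a LOCATION clause lets Hive read
the files where they already are (LOAD DATA INPATH moves them into Hive's
warehouse directory), and an INSERT OVERWRITE DIRECTORY ... SELECT statement
writes the filtered result back to HDFS as a file that your existing HDF
conversion could pick up.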

I was doing custom MapReduce jobs for the same thing until I discovered
Hive.  It really is very easy to use, and this 15-minute video will
explain a lot: http://vimeo.com/3598672

Hope this helps,
Tim




On Thu, Apr 15, 2010 at 12:27 PM, Jesper Utoft <jesper.utoft@gmail.com> wrote:
> Hey.
>
> First off, I have only been playing around with HBase and Hadoop in school,
> so I have no in-depth knowledge of it.
>
> I think you should not use HBase but just store the files in HDFS directly,
> and then generate these HDF files with a map/reduce job in some way.
>
> Just my 2 cents.
>
> Cheers.
>
> 2010/4/15 Håkon Sagehaug <hakon.sagehaug@googlemail.com>
>
>> Hi
>>
>> Does anyone have any input on my question?
>>
>> Håkon
>>
>> 2010/4/9 Håkon Sagehaug <hakon.sagehaug@googlemail.com>
>>
>> > Hi all,
>> >
>> > I work on a project where we need to deal with different types of biology
>> > data. For the first case, I'm investigating whether HBase is something
>> > we might use; the scenario is like this.
>> >
>> > The raw text data is public, so we can download it and store it as
>> > regular files. The content looks like this:
>> >
>> >  1      2      3     4          5          6     7      8
>> >
>> > 24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
>> > 24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0
>> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
>> > 24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
>> > 24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
>> > 24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
>> >
>> > One file is normally between 1-2 GB (20-30 million lines), and we have
>> > between 23-60 files. The data is something called LD_data, if anyone is
>> > interested. To store this better we've turned all these files into an
>> > HDF file, which is a binary format; this can then be handed over to
>> > applications using LD_data in the analysis of biology problems. The
>> > reason we're thinking of HBase for storing the raw text files is that we
>> > want to offer users the ability to trigger the creation of these HDF
>> > files themselves, based on a cutoff value from one of the two last
>> > columns in the file as input. Until now we've just turned the whole file
>> > into an HDF file, and the application receiving the file deals with the
>> > cutoff. So a "query" from a user that needs the lines with a value of
>> > column 6 > 0.1 gets:
>> >
>> > 24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
>> > 24915 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
>> > 24915 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
>> > 24915 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
>> >
>> > Is this something that sounds reasonable to use HBase for? I guess I
>> > could also use Hadoop and do a map/reduce job, but I'm not sure how to
>> > define the map and/or the reduce step for this. Would the best approach
>> > maybe be to go through the files and map column 3, which can be looked
>> > at as a key, to a list of its values that are over the cutoff? The map
>> > output for the query above would then be:
>> >
>> >
>> > < rs2003280,    {
>> >     24915 50733 CHB rs4079417 1.0 0.130 0.09 0
>> >     }
>> > >
>> >
>> >
>> > <rs2003282,    {
>> >     24915 59354 CHB rs1500098 1.0 0.157 0.91 0,
>> >     24915 61880 CHB rs11063263 1.0 0.157 0.91 0,
>> >     24915 62481 CHB rs10774263 1.0 0.157 0.91 0
>> >     }
>> > >
>> >
>> > If HBase were used, I'm a bit unsure how the data would best be
>> > structured. One way is to store one row per line in the file, but that
>> > is maybe not the best. Another option is something like this, for the
>> > first line in the example above:
>> >
>> > rs2003280{
>> >                  col1:24915 = 24915,
>> >                  col2:31643 = 31643,
>> >                  col4:rs1500095 = rs1500095,
>> >                  col4:rs7299571 = rs7299571,
>> >                  col4:rs4079417 = rs4079417,
>> >                  value:1=1.0,
>> >                  value:2=0.0,
>> >                  value:3=0.02,
>> >                  value:4=0,
>> > }
>> >
>> >
>> >
>> > As you all can see I've got some questions; I'm still in the process of
>> > grasping the HBase and Hadoop concepts.
>> >
>> > cheers, Håkon
>> >
>>
>
