hbase-user mailing list archives

From Håkon Sagehaug <hakon.sageh...@googlemail.com>
Subject Using HBase for storing biology data and querying it
Date Fri, 09 Apr 2010 07:26:13 GMT
Hi all,

I work on a project where we need to deal with different types of biology
data. For the first use case, I'm investigating whether HBase is something
we might use; the scenario is as follows.

The raw text data is public, so we can download it and store it as regular
files. The content looks like this:

  1     2    3      4         5      6   7    8   9
24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0
24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0

One file is normally 1-2 GB (20-30 million lines), and we have
between 23 and 60 files. The data is called LD data, if anyone is
interested. For better storage we've turned all these files into an HDF
file, which is a binary format; this can then be handed over to
applications that use LD data in the analysis of biology problems. The
reason we're considering HBase for storing the raw text files is that we
want to give users the ability to trigger the creation of these HDF files
themselves, based on a cutoff value for one of the two last columns of the
file as input. Right now we just turn the whole file into an HDF file, and
the application receiving the file deals with the cutoff. So a "query"
from a user who needs the lines with a value of column 6 > 0.1 gets

24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
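The cutoff filter described above can be sketched in a few lines of plain Python (not an actual HBase or Hadoop job). One assumption to flag: in the sample data it is the 7th whitespace-separated field (0-based index 6) whose values match the output shown, so that is the index used here; adjust it if the real column numbering differs.

```python
# Sketch of the cutoff filter: keep lines whose chosen column exceeds
# a threshold. Assumption: "column 6" in the mail maps to 0-based field
# index 6, which is what matches the sample output above.

SAMPLE = """\
24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0
24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0
24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0
24916 59354 CHB rs2003282 rs1500098 1.0 0.157 0.91 0
24916 61880 CHB rs2003282 rs11063263 1.0 0.157 0.91 0
24916 62481 CHB rs2003282 rs10774263 1.0 0.157 0.91 0
"""

def filter_lines(lines, col_index=6, cutoff=0.1):
    """Yield the raw lines whose chosen numeric column exceeds the cutoff."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if float(fields[col_index]) > cutoff:
            yield line

kept = list(filter_lines(SAMPLE.splitlines()))
```

Over a real 1-2 GB file this streaming shape (one line at a time, nothing held in memory) is what a map task would do per input split.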

Does this sound like something reasonable to use HBase for? I guess I
could also use Hadoop and do a MapReduce job, but I'm not sure how to
define the map and/or the reduce step for this. Would the best approach
maybe be to go through the files and map the first rsID column, which can
be looked at as a key, to a list of its lines over the cutoff? The map
output for the query above would then be


< rs2003280,    {
    24915 50733 CHB rs4079417 1.0 0.130 0.09 0
    }
>


<rs2003282,    {
    24916 59354 CHB rs1500098 1.0 0.157 0.91 0,
    24916 61880 CHB rs11063263 1.0 0.157 0.91 0,
    24916 62481 CHB rs10774263 1.0 0.157 0.91 0
    }
>

If HBase were used, I'm a bit unsure how the data should best be
structured. One way is to store one row per line of the file, but that is
maybe not the best. Another option is something like this, for the first
key in the example above:

rs2003280 {
    col1:24915 = 24915,
    col2:31643 = 31643,
    col4:rs1500095 = rs1500095,
    col4:rs7299571 = rs7299571,
    col4:rs4079417 = rs4079417,
    value:1 = 1.0,
    value:2 = 0.0,
    value:3 = 0.02,
    value:4 = 0
}
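Building a row in that proposed layout can be sketched with plain Python dicts (this is not the HBase client API, and the family:qualifier names are just illustrative). One observation worth making: in the layout above the four numeric values appear only once per row, so per-line values would be lost; qualifying them by the partner rsID, as in this sketch, is one way to keep them.

```python
# Sketch of the proposed "one row per first rsID" layout. Keys imitate
# HBase family:qualifier names; assumption: qualifying the numeric value
# by the partner rsID keeps per-line values that the flat sketch loses.

LINES = [
    "24915 31643 CHB rs2003280 rs1500095 1.0 0.0 0.02 0",
    "24915 36594 CHB rs2003280 rs7299571 1.0 0.025 0.21 0",
    "24915 50733 CHB rs2003280 rs4079417 1.0 0.130 0.09 0",
]

def build_row(lines):
    row = {}
    rowkey = None
    for line in lines:
        fields = line.split()
        rowkey = fields[3]                    # row key = first rsID
        row["col2:" + fields[1]] = fields[1]  # position as a qualifier
        row["col4:" + fields[4]] = fields[4]  # partner rsID as a qualifier
        row["value:" + fields[4]] = fields[6] # value tied to its partner rsID
    return rowkey, row

rowkey, row = build_row(LINES)
```

With this shape, a cutoff query becomes a scan over `value:*` qualifiers per row rather than a scan over 20-30 million individual rows.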



As you can see I've got some questions; I'm still in the process of
grasping the HBase and Hadoop concepts.

cheers, Håkon
