incubator-hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: bulk load in hbase
Date Mon, 29 Sep 2008 11:36:02 GMT
> But still, if the matrix is huge (many rows, many columns), the loading
> will cause a lot of matrix-table split actions. Is that right?

Yes, that's right. But about this part:

>> Finally, we can pre-split the matrix's table in HBase first and let the
>> matrix load in parallel without splitting again.

I don't understand exactly. Do you mean creating the tablets directly by
pre-splitting and assigning them to the region servers?

Then, this is a role of HBase itself. Merges and splits are issued after
compaction, so I guess it will work the same way as the HBase compaction
mechanism.
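
For reference, creating the table pre-split would look roughly like the
sketch below. This is only illustrative: HBaseAdmin.createTable() with
split keys is from a later HBase client API, and the table name, column
family, row count, and region count here are made up.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitMatrixTable {
  public static void main(String[] args) throws Exception {
    int numRegions = 10;        // assumed initial region count
    long numRows = 1000000L;    // assumed (estimated) matrix row count
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("matrix");
    desc.addFamily(new HColumnDescriptor("column"));
    // numRegions - 1 split keys yield numRegions regions up front;
    // zero-padded keys keep lexicographic order equal to numeric order
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splits[i - 1] =
          Bytes.toBytes(String.format("%010d", i * numRows / numRegions));
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}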

/Edward

On Mon, Sep 29, 2008 at 7:21 PM, Samuel Guo <guosijie@gmail.com> wrote:
> On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <edwardyoon@apache.org>wrote:
>
>> The table is nothing more and nothing less than a matrix. So, we can
>> think about bulk load such as
>> http://wiki.apache.org/hadoop/Hbase/MapReduce
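>>
>> A rough sketch of the mapper side could look like this (written from
>> memory against the later org.apache.hadoop.hbase.mapreduce API, so the
>> exact calls may differ; the table and family names are made up):
>>
>> import java.io.IOException;
>> import org.apache.hadoop.hbase.client.Put;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.util.Bytes;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>>
>> public class MatrixLoadMapper extends
>>     Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
>>   public void map(LongWritable offset, Text line, Context context)
>>       throws IOException, InterruptedException {
>>     // one text line = one matrix row; the byte offset stands in
>>     // for a row key (a real job would carry a real row index)
>>     String[] values = line.toString().split("\\s+");
>>     byte[] row = Bytes.toBytes(offset.get());
>>     Put put = new Put(row);
>>     for (int j = 0; j < values.length; j++) {
>>       put.add(Bytes.toBytes("column"), Bytes.toBytes(j),
>>               Bytes.toBytes(Double.parseDouble(values[j])));
>>     }
>>     context.write(new ImmutableBytesWritable(row), put);
>>   }
>> }
>>
>> On the driver side, TableMapReduceUtil.initTableReducerJob("matrix",
>> null, job) would send the Puts straight to the table.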
>
>
> Yes. MapReduce should be used to load a matrix.
> But still, if the matrix is huge (many rows, many columns), the loading
> will cause a lot of matrix-table split actions. Is that right?
>
>>
>>
>> And I think we can provide some regular format to store the matrix,
>> such as the Hadoop SequenceFile format.
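>>
>> For example, one record per matrix row, keyed by row index. This is
>> only a sketch with a dummy path and dummy values; a real format would
>> likely use a vector Writable rather than Text for the value:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>>
>> public class WriteMatrixSequenceFile {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     FileSystem fs = FileSystem.get(conf);
>>     // one record per matrix row: key = row index, value = the row
>>     SequenceFile.Writer writer = SequenceFile.createWriter(
>>         fs, conf, new Path("/data/matrix.seq"),
>>         IntWritable.class, Text.class);
>>     try {
>>       for (int i = 0; i < 3; i++) {
>>         writer.append(new IntWritable(i), new Text("1.0 2.0 3.0"));
>>       }
>>     } finally {
>>       writer.close();
>>     }
>>   }
>> }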
>
>
> That would be great!
>
>
>>
>>
>> Then: file -> matrix, matrix -> file, matrix operations, ... all done.
>>
>> /Edward
>>
>> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <guosijie@gmail.com> wrote:
>> > hi all,
>> >
>> > I am thinking about how to use map/reduce to bulk-load a matrix from a
>> > file.
>> >
>> > We can split the file and let many mappers load parts of the file, but
>> > lots of region splits will happen during loading if the matrix is huge.
>> > That may affect the matrix load performance.
>> >
>> > I think a file that stores a matrix is likely to be regular. Without
>> > compression, it may look like this:
>> > d11 d12 d13 .................... d1m
>> > d21 d22 d23 .................... d2m
>> > .............................................
>> > dn1 dn2 dn3 .................... dnm
>> >
>> > An optimization method would be:
>> > (1) Read a line from the matrix file to learn its row size; assume it
>> > is RS.
>> > (2) Get the file size from the filesystem's metadata; assume it is FS.
>> > (3) Compute the number of rows: N(R) = FS / RS.
>> > (4) Knowing the number of rows, we can estimate the number of regions
>> > of the matrix.
>> > Finally, we can pre-split the matrix's table in HBase first and let the
>> > matrix load in parallel without splitting again, as in the sketch below.
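>> >
>> > In code, the estimate costs almost nothing, something like this sketch
>> > (the input path and the max region size are assumptions, and it assumes
>> > one byte per character with '\n' line endings):
>> >
>> > import java.io.BufferedReader;
>> > import java.io.InputStreamReader;
>> > import org.apache.hadoop.conf.Configuration;
>> > import org.apache.hadoop.fs.FileSystem;
>> > import org.apache.hadoop.fs.Path;
>> >
>> > public class EstimateRegions {
>> >   public static void main(String[] args) throws Exception {
>> >     FileSystem fs = FileSystem.get(new Configuration());
>> >     Path file = new Path("/data/matrix.txt");    // assumed input
>> >     long FS = fs.getFileStatus(file).getLen();   // (2) file size
>> >     BufferedReader in =
>> >         new BufferedReader(new InputStreamReader(fs.open(file)));
>> >     long RS = in.readLine().length() + 1;        // (1) row size + '\n'
>> >     in.close();
>> >     long rows = FS / RS;                         // (3) N(R) = FS / RS
>> >     long maxRegion = 256L * 1024 * 1024;         // assumed region limit
>> >     long regions = (FS + maxRegion - 1) / maxRegion;  // (4) rounded up
>> >     System.out.println(rows + " rows, ~" + regions + " regions");
>> >   }
>> > }
>> >
>> > The on-table size will differ from the raw file size, so this is only
>> > a rough estimate, but it is enough to choose the split keys.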
>> >
>> > Certainly, no one will store a matrix in a file exactly as above; some
>> > compression will be used to store a dense or sparse matrix. But even
>> > with a compressed matrix file, we can still pay a little to estimate
>> > the number of regions of the matrix and gain a large performance
>> > improvement in matrix bulk-loading.
>> >
>> > Am I right?
>> >
>> > regards,
>> >
>> > samuel
>> >
>>
>>
>>
>> --
>> Best regards, Edward J. Yoon
>> edwardyoon@apache.org
>> http://blog.udanax.org
>>
>



-- 
Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org
