> but still, if the matrix is huge (many rows, many columns), the loading will
> cause a lot of matrix-table split actions. Is that right?
Yes, but
>> finally, we can split the matrix's table in HBase first and let
>> the matrix load in parallel without splitting again.
I don't understand exactly. Do you mean creating tablets directly by
pre-splitting and assigning them to region servers?
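For example (a hypothetical sketch: it assumes row keys are zero-padded row
indices, a key scheme the thread doesn't actually specify), the split points
for pre-creating the regions could be derived like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive evenly spaced row-key split points so the
// matrix table can be created with its regions already in place.
public class SplitKeys {

    // numRegions regions need (numRegions - 1) split keys; row keys are
    // assumed to be row indices zero-padded to 10 digits.
    static List<String> splitPoints(long numRows, int numRegions) {
        List<String> keys = new ArrayList<String>();
        long rowsPerRegion = numRows / numRegions;
        for (int i = 1; i < numRegions; i++) {
            keys.add(String.format("%010d", i * rowsPerRegion));
        }
        return keys;
    }

    public static void main(String[] args) {
        // 1,000,000 estimated rows split across 4 regions
        System.out.println(splitPoints(1000000L, 4)); // [0000250000, 0000500000, 0000750000]
    }
}
```

The keys would then be handed to whatever table-creation mechanism the HBase
version at hand exposes; only the arithmetic above is fixed, the key scheme is
an assumption.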
Then, this is a role of HBase itself. Merges/splits are issued after
compaction, so I guess it will work the same way as the HBase compaction
mechanism.
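For reference, the FS/RS arithmetic in your steps (1)-(4) amounts to
something like the sketch below; the 8 KB row size and the 256 MB region size
in main() are just assumed example numbers, not anything HBase fixes:

```java
// Sketch of the estimate from steps (1)-(4): row count from file size
// divided by row size, region count from file size divided by an
// assumed per-region data size.
public class RegionEstimate {

    // N(R) = FS / RS, assuming every row occupies the same number of bytes
    static long estimateRows(long fileSizeBytes, long rowSizeBytes) {
        return fileSizeBytes / rowSizeBytes;
    }

    // One region per regionSizeBytes of data, rounded up (ceiling division)
    static long estimateRegions(long fileSizeBytes, long regionSizeBytes) {
        return (fileSizeBytes + regionSizeBytes - 1) / regionSizeBytes;
    }

    public static void main(String[] args) {
        long fs = 10L * 1024 * 1024 * 1024;  // FS: 10 GB matrix file (example)
        long rs = 8L * 1024;                 // RS: 8 KB per row (example)
        System.out.println(estimateRows(fs, rs));                     // 1310720
        System.out.println(estimateRegions(fs, 256L * 1024 * 1024));  // 40
    }
}
```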
/Edward
On Mon, Sep 29, 2008 at 7:21 PM, Samuel Guo <guosijie@gmail.com> wrote:
> On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <edwardyoon@apache.org>wrote:
>
>> The table is nothing more and nothing less than a matrix. So we can
>> think about bulk loading, as described at
>> http://wiki.apache.org/hadoop/Hbase/MapReduce
>
>
> Yes. MapReduce should be used to load a matrix.
> but still, if the matrix is huge (many rows, many columns), the loading will
> cause a lot of matrix-table split actions. Is that right?
>
>>
>>
>> And I think we can provide some regular format to store the matrix,
>> such as Hadoop's SequenceFile format.
>
>
> That would be great!
>
>
>>
>>
>> Then, file -> matrix, matrix -> file, matrix operations, ..., all done.
>>
>> /Edward
>>
>> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <guosijie@gmail.com> wrote:
>> > hi all,
>> >
>> > I am considering how to use map/reduce to bulk-load a matrix from a
>> > file.
>> >
>> > We can split the file and let many mappers load parts of the file. But
>> > lots of region splits will happen while loading if the matrix is huge,
>> > which may affect the matrix load performance.
>> >
>> > I think a file that stores a matrix may be regular.
>> > Without compression, it may look like this:
>> > d11 d12 d13 .................... d1m
>> > d21 d22 d23 .................... d2m
>> > .............................................
>> > dn1 dn2 dn3......................dnm
>> >
>> > An optimization method would be:
>> > (1) read one line from the matrix file to learn its row size; call it RS.
>> > (2) get the file size from the filesystem's metadata; call it FS.
>> > (3) compute the number of rows: N(R) = FS / RS.
>> > (4) knowing the number of rows, we can estimate the number of regions
>> > of the matrix.
>> > finally, we can split the matrix's table in HBase first and let
>> > the matrix load in parallel without splitting again.
>> >
>> > Certainly, no one will store a matrix as above in a file; some
>> > compression will be used to store a dense or sparse matrix.
>> > But even with a compressed matrix file, we can still pay a little to
>> > estimate the number of regions of the matrix and gain a significant
>> > performance improvement in matrix bulk loading.
>> >
>> > Am I right?
>> >
>> > regards,
>> >
>> > samuel
>> >
>>
>>
>>
>>
>

Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org
