incubator-hama-dev mailing list archives

From "Samuel Guo" <guosi...@gmail.com>
Subject Re: bulk load in hbase
Date Mon, 29 Sep 2008 10:21:05 GMT
On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:

> The table is nothing more and nothing less than a matrix. So we can
> think about bulk loading along the lines of
> http://wiki.apache.org/hadoop/Hbase/MapReduce


Yes, MapReduce should be used to load a matrix.
But still, if the matrix is huge (many rows, many columns), the loading will
cause a lot of split actions on the matrix table. Is that right?
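To make the loading step concrete, a rough sketch of such a load mapper, in the
spirit of the wiki page above, might look like the following. It is only an
illustration: the column family "cf" and keying each row by the line's byte
offset are assumptions, and the exact HBase MapReduce API differs across versions.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Parses one text line "d_i1 d_i2 ... d_im" and emits it as one HBase row. */
public class MatrixLoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("cf"); // assumed family name

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] values = line.toString().trim().split("\\s+");
    // The line's byte offset stands in for the row index here; a real loader
    // would carry an explicit row index (e.g. offset / RS for fixed-size rows).
    byte[] rowKey = Bytes.toBytes(offset.get());
    Put put = new Put(rowKey);
    for (int j = 0; j < values.length; j++) {
      put.addColumn(FAMILY, Bytes.toBytes(j),
          Bytes.toBytes(Double.parseDouble(values[j])));
    }
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}

The job itself would use HBase's TableOutputFormat, so that each emitted Put
becomes one row of the matrix table.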

>
>
> And I think we can provide some regular format to store the matrix,
> such as Hadoop's SequenceFile format.


That would be great!
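As a rough sketch of what such a regular format could look like, assuming a
dense matrix keyed by row index (the class name and the key/value choices below
are only illustrations, not an agreed format):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Writes a dense matrix as a SequenceFile: row index -> space-separated row. */
public class MatrixSequenceFileWriter {

  public static void write(Configuration conf, Path out, double[][] matrix)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < matrix.length; i++) {
        StringBuilder row = new StringBuilder();
        for (int j = 0; j < matrix[i].length; j++) {
          if (j > 0) row.append(' ');
          row.append(matrix[i][j]);
        }
        writer.append(new IntWritable(i), new Text(row.toString()));
      }
    } finally {
      writer.close();
    }
  }
}

A mapper could then read one (row index, row) pair at a time instead of
guessing row boundaries in a raw text file.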


>
>
> Then, file->matrix, matrix->file, matrix operations,..., all done.
>
> /Edward
>
> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <guosijie@gmail.com> wrote:
> > hi all,
> >
> > I am considering how to use map/reduce to bulk-load a matrix from a
> > file.
> >
> > We can split the file and let many mappers each load part of it, but
> > lots of region splits will happen while loading if the matrix is huge.
> > That may hurt the matrix loading performance.
> >
> > I think that a file storing a matrix may have a regular layout.
> > Without compression, it may look like the following:
> > d11 d12 d13 .................... d1m
> > d21 d22 d23 .................... d2m
> > .............................................
> > dn1 dn2 dn3 .................... dnm
> >
> > An optimization method would be:
> > (1) Read one line from the matrix file to learn its row size; assume it
> > is RS.
> > (2) Get the file size from the filesystem's metadata; assume it is FS.
> > (3) Compute the number of rows: N(R) = FS / RS.
> > (4) Knowing the number of rows, we can estimate the number of regions of
> > the matrix.
> > Finally, we can pre-split the matrix's table in HBase and load the matrix
> > in parallel without splitting again.
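
A rough sketch of that estimation and pre-split step might look like the
following. The table name "matrix", the family "cf", keying rows by their long
row index, and the target region size are all assumptions here, not a settled
design.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

/** Estimates the row count as FS/RS and pre-splits the matrix table. */
public class PreSplitMatrixTable {

  public static void create(Configuration conf, Path matrixFile,
      long bytesPerRow /* RS, measured from the first line */,
      long targetRegionSizeBytes) throws IOException {

    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(matrixFile);
    long fileSize = status.getLen();                        // FS
    long numRows = fileSize / bytesPerRow;                  // N(R) = FS / RS
    long rowsPerRegion = Math.max(1, targetRegionSizeBytes / bytesPerRow);
    int numRegions = (int) Math.max(1, numRows / rowsPerRegion);

    // One split key per region boundary: the row index where each region starts.
    byte[][] splitKeys = new byte[Math.max(0, numRegions - 1)][];
    for (int i = 1; i < numRegions; i++) {
      splitKeys[i - 1] = Bytes.toBytes(i * rowsPerRegion);
    }

    TableDescriptor desc =
        TableDescriptorBuilder.newBuilder(TableName.valueOf("matrix")) // assumed name
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))   // assumed family
            .build();

    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create(conf));
         Admin admin = conn.getAdmin()) {
      if (numRegions > 1) {
        admin.createTable(desc, splitKeys);
      } else {
        admin.createTable(desc);
      }
    }
  }
}

The loader would then write row keys in the same format as the split keys, so
each mapper's rows land in a region that already exists.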
> >
> > Certainly, no one will store a matrix in a file exactly as above; some
> > compression will be used to store a dense or sparse matrix.
> > But even with a compressed matrix file, we can still pay a little to
> > estimate the number of regions of the matrix and gain a worthwhile
> > performance improvement for matrix bulk-loading.
> >
> > Am I right?
> >
> > regards,
> >
> > samuel
> >
>
>
>
> --
> Best regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
>
