hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Row+Col Range Read/Scan
Date Wed, 10 Aug 2011 18:33:36 GMT
On Wed, Aug 10, 2011 at 2:39 AM, Wayne <wav100@gmail.com> wrote:
> As we load more and more data into HBase we are seeing the "millions of
> columns" to be a challenge for us. We have some very wide rows and we are
> taking 12-15 seconds to read those rows.

How many columns when it's taking this long, Wayne?

> Since HBase does not sort columns

They are sorted.  Columns (qualifiers) within a row are kept in
lexicographic byte order.
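"Sorted" here means byte order, not numeric order, which matters if your qualifiers encode numbers.  A self-contained illustration of the ordering (plain Java, no HBase needed; the comparator mirrors how HBase compares qualifiers, unsigned and lexicographic):

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class QualifierOrder {
    // Compare byte[] the way HBase compares qualifiers: unsigned, lexicographic.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        SortedSet<byte[]> quals = new TreeSet<>(QualifierOrder::compare);
        for (String q : new String[] {"col-10", "col-2", "col-1"}) {
            quals.add(q.getBytes());
        }
        // Note "col-10" sorts before "col-2": byte order, not numeric order.
        for (byte[] q : quals) {
            System.out.println(new String(q));
        }
    }
}
```

So if you want numeric ordering you need fixed-width (zero-padded or binary) qualifiers.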

> and thereby can not support a scan of columns

How do you mean?  You only want a subset of the columns?  Can you add
a filter or add some subset of the columns to the Scan specification?

You can also read just a piece of the row, if that is all you are
interested in (though you are on the other side of Thrift, right, and
this facility may not be exposed -- I have not checked).
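For what it's worth, from the Java client side the column-subset read looks like this (a sketch against the 0.90-era client API; the table, family, and qualifier names are illustrative, and it needs a running cluster):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SubsetRead {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Get get = new Get(Bytes.toBytes("wide-row"));
    // Ask for specific columns only; the server skips the rest of the row.
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col-000123"));
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col-000124"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col-000123"));
    table.close();
  }
}
```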

> Is/has there been any talk about building in some support for sorted columns
> and the ability to read/scan across columns? Millions of columns are
> challenging if you can only read a single column/list of columns or the
> entire thing.

When you say read/scan across columns, can you say more what you'd
like?  You'd like to read N columns at a time?

> How does bigtable support this? It seems that hbase is limited
> as a column based data store unless it can support this. Our columns are
> truly dynamic so we do not even necessarily know what they are to request
> them by name in a list. We want to be able to read/scan them just like for
> rows.

In Java you'd set a batch size on the Scan (Scan#setBatch) so each
call to next() hands back at most N columns of the row instead of the
whole thing.
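A minimal sketch of that batched read (0.90-era client API; "mytable", "cf", and the row key are illustrative, and it needs a running cluster):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedColumnScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    // Scan exactly one row: stop row is the start row plus a trailing zero byte.
    Scan scan = new Scan(Bytes.toBytes("wide-row"), Bytes.toBytes("wide-row\0"));
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setBatch(1000); // each next() returns at most 1000 columns of the row
    ResultScanner scanner = table.getScanner(scan);
    for (Result chunk : scanner) {
      // process up to 1000 KeyValues at a time, never the full 2M-column row
    }
    scanner.close();
    table.close();
  }
}
```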

> We would love the ability to support the following read method (through
> Thrift). We can of course do this on our own from the entire row but it
> requires reading the 2 million col row into memory first.

How big are the cells?  How big is the 2M-column row?  You don't know
the names, but do they fit a pattern that you could filter on?
(Though again, filters are not exposed in Thrift, but that looks like
it's getting fixed.)

> getRowWithColumnRange(tableName, row, startColumn, stopColumn)
> The above would be even better if it could be set up like a scanner where we
> could stop at any point. Basically instead of scanning rows we would scan
> columns for a given row. This would be the best way to support an offset,
> limit pattern.
> colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
> colScannerGetList(colScanID, 1000)
> Of course once these changes occurred people would be pushing the size of
> rows even more. We have seen somewhere around 20+ million columns cause OOM
> errors. One row per region should be the theoretical limit to the row size,
> but there is more work needed I am sure to ensure that this is true.

The above look useful.  Stick them into an issue, Wayne.
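The proposed getRowWithColumnRange maps naturally onto a server-side filter: ColumnRangeFilter (HBASE-3684) does exactly this column-range restriction in HBase versions that ship it.  A sketch of the proposal on top of it (same illustrative names as before; needs a running cluster):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnRangeRead {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    // Scan a single row, but only columns in [startColumn, stopColumn).
    Scan scan = new Scan(Bytes.toBytes("wide-row"), Bytes.toBytes("wide-row\0"));
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("startColumn"), true,    // min qualifier, inclusive
        Bytes.toBytes("stopColumn"), false));  // max qualifier, exclusive
    scan.setBatch(1000); // stream the range back 1000 columns per next()
    ResultScanner scanner = table.getScanner(scan);
    for (Result chunk : scanner) {
      // each chunk holds at most 1000 KeyValues from the column range;
      // closing the scanner early gives the stop-at-any-point behavior
    }
    scanner.close();
    table.close();
  }
}
```

Combined with setBatch, this gives the offset/limit pattern described above without ever materializing the whole row on either side.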

P.S. I'm still working (slowly) on the recovery tool you asked for in
your last mail.
