hbase-user mailing list archives

From Bryan Beaudreault <bbeaudrea...@hubspot.com>
Subject Dealing with large data sets in client
Date Tue, 27 Mar 2012 21:36:07 GMT

I have timeseries data, most rows have anywhere from 10 to a few thousand
columns, but outliers can have a million or more.  Each column has some
integer value (counters), and an integer identifier is the qualifier.  On
the client side, I want to scan from startDate to endDate, add up the total
values for each identifier, sort the aggregated values, and return the top
X (pagination).  We aggregate in a map, since the identifier sets of
different rows may overlap but rarely coincide.  This works fine for the
majority of our users, but for the outliers we end up running out of
memory.  Since the columns within each row are sorted, we could save memory
by stepping through the columns of all the returned rows in lockstep,
keeping only a running list of the top X as we add them up.  The problem is
that the Scan API does not expose the data this way: you must fetch the
next row, batch through that row's columns, and only then move on to the
next row.
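For concreteness, the aggregate-then-select pattern above looks roughly
like the following in plain Java (the HBase scan itself is omitted, and
modeling identifiers as ints and counters as longs is an assumption).  A
size-bounded min-heap makes the top-X selection cheap, but note that the
map of running totals still holds one entry per distinct identifier, which
is exactly what blows up on the outlier rows:

```java
import java.util.*;

public class TopXAggregator {
    // Sum values per identifier, then keep the x largest totals.
    // Each cell is {identifier, value}, as it would come out of a scan.
    public static List<Map.Entry<Integer, Long>> topX(Iterable<int[]> cells, int x) {
        Map<Integer, Long> totals = new HashMap<>();
        for (int[] cell : cells) {
            totals.merge(cell[0], (long) cell[1], Long::sum);  // running sum per identifier
        }
        // Min-heap of size <= x: the root is the smallest of the current top x.
        PriorityQueue<Map.Entry<Integer, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<Integer, Long> e : totals.entrySet()) {
            heap.offer(new AbstractMap.SimpleEntry<>(e));
            if (heap.size() > x) heap.poll();  // evict the current minimum
        }
        List<Map.Entry<Integer, Long>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<Integer, Long>comparingByValue().reversed());
        return result;
    }
}
```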

Has anyone dealt with this kind of use case?  Is there any way to implement
the read pattern above with the current API, or otherwise step through the
data?  I imagine it isn't a great idea to create a ton of scans (one per
row), which is the only way I can see to do this with what we have.
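As a sketch of what the lockstep read would look like if per-row column
cursors were available: because columns are sorted within every row, a heap
of one cursor per row yields all (qualifier, value) cells in global
qualifier order, so each identifier's total is final the moment the merge
moves past its qualifier, and memory drops to roughly O(rows in range + X)
instead of O(distinct identifiers).  The cursor type here is hypothetical;
today it could only be faked with one batched scan per row, as noted above:

```java
import java.util.*;

public class MergingTopX {
    // Each row is a stream of {qualifier, value} cells, sorted by qualifier
    // ascending within the row.  Returns the top-x {qualifier, total} pairs.
    public static List<long[]> topX(List<Iterator<long[]>> rows, int x) {
        // Heap of per-row cursors, ordered by each cursor's current qualifier.
        PriorityQueue<Cursor> cursors =
                new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head[0]));
        for (Iterator<long[]> it : rows) {
            if (it.hasNext()) cursors.add(new Cursor(it));
        }
        // Size-bounded min-heap of finalized totals.
        PriorityQueue<long[]> top =
                new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[1]));
        long currentQ = 0, sum = 0;
        boolean open = false;
        while (!cursors.isEmpty()) {
            Cursor c = cursors.poll();        // cursor holding the smallest qualifier
            long q = c.head[0], v = c.head[1];
            if (open && q != currentQ) {      // merge moved past currentQ: total is final
                offer(top, currentQ, sum, x);
                sum = 0;
            }
            currentQ = q;
            sum += v;
            open = true;
            if (c.advance()) cursors.add(c);  // re-queue if the row has more columns
        }
        if (open) offer(top, currentQ, sum, x);
        List<long[]> result = new ArrayList<>(top);
        result.sort(Comparator.comparingLong((long[] e) -> e[1]).reversed());
        return result;
    }

    private static void offer(PriorityQueue<long[]> top, long q, long sum, int x) {
        top.offer(new long[]{q, sum});
        if (top.size() > x) top.poll();       // keep only the x largest totals
    }

    private static final class Cursor {
        final Iterator<long[]> it;
        long[] head;
        Cursor(Iterator<long[]> it) { this.it = it; this.head = it.next(); }
        boolean advance() { if (!it.hasNext()) return false; head = it.next(); return true; }
    }
}
```

The same shape would also work server-side, which is why a coprocessor (or
filter) that sees cells in sorted order is the other natural home for this
kind of aggregation.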


