hbase-user mailing list archives

From SiMaYunRui <myl...@hotmail.com>
Subject How to limit columns returned by a single row in HBase
Date Sat, 19 Jul 2014 15:23:50 GMT
Hi experts,

I have a wide, flat table. During a scan, how can I limit the number of columns returned
per row, rather than across all rows (which is what ColumnCountGetFilter does)? I need to
scan multiple rows at the same time and do the aggregation on the client side.

For more background, I am designing an auditing tool that records events of the form
"(who) operates against (what) at (when)". A typical search is: given a time range from
"2014/6/14 13:45" to "2014/6/24 7:15", list all files (the "what" part, matched by prefix)
that were operated on, in DESC order of (when).

I have tens of millions of records per day and keep them for 30 to 90 days, so I am
considering two designs:

a) Row key is (file name)_(reverse of when). The problem is that people want to use a
prefix (starts-with) search to match multiple files. With this design the scan has to go
through every record of every matching file, which could be huge, and then the client has
to re-order them to display the top 500 records. It could be very slow.

b) Use a wide, flat table with row key (file_name)_(reverse of when, truncated to the day,
to partition the data) and qualifier (reverse of when). Because qualifiers within a row are
kept in sorted order, I think this design needs fewer lookups than #a. But I cannot put all
operations on a single file in one row, because the total number might exceed several million.
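
A minimal sketch of how the keys in design #b could be composed, in plain Java with no HBase dependency. The class and method names (AuditKeys, rowKey, qualifier) are hypothetical, and several details are assumptions not stated above: "reverse" is taken to mean Long.MAX_VALUE minus the value, timestamps are epoch seconds, days are UTC days, and the numbers are zero-padded to 19 digits so that string order matches numeric order.

```java
final class AuditKeys {
    private static final long SECONDS_PER_DAY = 86_400L;

    // Assumption: "reverse" means subtracting from Long.MAX_VALUE, so newer values sort first.
    // Zero-padding to 19 digits keeps lexicographic order equal to numeric order.
    private static String reversed(long value) {
        return String.format("%019d", Long.MAX_VALUE - value);
    }

    // Row key: file name plus the reversed UTC day number (one row per file per day).
    static String rowKey(String fileName, long epochSeconds) {
        long epochDay = Math.floorDiv(epochSeconds, SECONDS_PER_DAY);
        return fileName + "_" + reversed(epochDay);
    }

    // Qualifier: the reversed timestamp, so the newest operation is the first column.
    static String qualifier(long epochSeconds) {
        return reversed(epochSeconds);
    }

    public static void main(String[] args) {
        long when = 1341069600L; // sample timestamp taken from the layout in this message
        System.out.println(rowKey("fileA", when) + " / d:" + qualifier(when));
    }
}
```

With keys built this way, all rows of one file are adjacent, newer days come before older days, and within a row the newest operation is the first qualifier.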

So I am thinking of grouping the data into the following shape using #b. That brings me back
to my original question: because I only need 500 records, if the row (file A)_(2014/06/14)
contains more than that number, can I stop reading it and continue the scan with the next
row? And if I already have enough from (file A)_(2014/06/14), can I skip (file A)_(2014/06/13)
and continue with (file B)_(2014/06/14), which is a different file?

Row: (file A)_(2014/06/14)
   d:1341069600 value
   d:1341069500 value
   d:1341069400 value

Row: (file A)_(2014/06/13)
   d:1341059600 value
   d:1341059500 value
   d:1341059400 value

Row: (file B)_(2014/06/14)
   d:1341069700 value
   d:1341069580 value
   d:1341069401 value
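
To make the desired behavior concrete, here is a small in-memory model of it in plain Java. It is only a sketch of the semantics I am asking about, not HBase client code: PerFileLimit, Row, and newestPerFile are hypothetical names, and the rows are assumed to arrive already sorted by file and then by day, newest first, with the columns of each row also newest first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PerFileLimit {
    /** One stored row: a file, a day bucket, and that day's timestamps, newest first. */
    static final class Row {
        final String file;
        final String day;
        final List<Long> timestamps;

        Row(String file, String day, List<Long> timestamps) {
            this.file = file;
            this.day = day;
            this.timestamps = timestamps;
        }
    }

    /** Collects at most `limit` newest timestamps per file from rows in scan order. */
    static Map<String, List<Long>> newestPerFile(List<Row> rows, int limit) {
        Map<String, List<Long>> result = new LinkedHashMap<>();
        for (Row row : rows) {
            List<Long> taken = result.computeIfAbsent(row.file, f -> new ArrayList<>());
            if (taken.size() >= limit) {
                continue; // this file already has enough: skip its older day rows
            }
            for (long ts : row.timestamps) {
                if (taken.size() >= limit) {
                    break; // stop in the middle of the row once the limit is reached
                }
                taken.add(ts);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("file A", "2014/06/14", List.of(1341069600L, 1341069500L, 1341069400L)),
            new Row("file A", "2014/06/13", List.of(1341059600L, 1341059500L, 1341059400L)),
            new Row("file B", "2014/06/14", List.of(1341069700L, 1341069580L, 1341069401L)));
        System.out.println(newestPerFile(rows, 2));
    }
}
```

With a limit of 2 instead of 500, this takes two columns from (file A)_(2014/06/14), skips (file A)_(2014/06/13) entirely, and then takes two columns from (file B)_(2014/06/14). The client would still merge the per-file lists to get the final DESC order.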

Sent from Windows Mail