hbase-user mailing list archives

From steven zhuang <steven.zhuang.1...@gmail.com>
Subject how to do fast scan on huge table
Date Sat, 27 Mar 2010 03:02:15 GMT
hi, guys,
We have a table with about 30M rows, and sometimes we need to scan a range of roughly 20M of them. Each row contains only one column with less than 50 bytes of data, and each row key is a date string suffixed by a signature. For a given date range, we want to know how many times the most popular signatures have been used.

What we do now is open a scanner over each day's records and, much like a multi-way merge sort, count the occurrences of each signature. I think most of the time is spent on scanning, since the processing of the fetched data is really simple.
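
For concreteness, the per-day scan-and-count loop is roughly like the sketch below. The table name, column layout, date strings, and the '~' stop-key trick are placeholders for illustration, not our real schema:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DailySignatureCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");           // placeholder table name
        String[] days = {"20100325", "20100326", "20100327"}; // placeholder date range

        Map<String, Long> counts = new HashMap<String, Long>();
        for (String day : days) {
            // Row keys are "<date><signature>", so one day's rows form a
            // contiguous key range; '~' sorts after any digit or letter.
            Scan scan = new Scan(Bytes.toBytes(day), Bytes.toBytes(day + "~"));
            scan.setCaching(1000);                             // fetch many rows per RPC round trip
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // Strip the date prefix to recover the signature.
                    String sig = Bytes.toString(r.getRow()).substring(day.length());
                    Long c = counts.get(sig);
                    counts.put(sig, c == null ? 1L : c + 1);
                }
            } finally {
                scanner.close();
            }
        }
        table.close();
        System.out.println("distinct signatures seen: " + counts.size());
    }
}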

The performance is rather poor though, around 35,000 cells (rows) per second; we have 18 region servers and 5 ZooKeeper nodes.

My idea is to distribute the scanners across different machines, have each scanner read a portion of each day's rows, and merge the partial counts on a final result machine. But that means building some client/server plumbing myself (not MapReduce, since that might get stuck whenever Hadoop is busy). I wonder if there is a better way than this?
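
To make the split-and-merge idea concrete, here is a rough sketch that only parallelizes the per-day scans with threads inside a single client (not across machines), again with placeholder table and date names; the real version would put each worker on its own machine and ship the partial maps to the result machine:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelDayScan {
    // Scan one day's key range and return its per-signature counts.
    // Each worker opens its own HTable, since the HTable client should not be
    // shared across threads.
    static Map<String, Long> scanDay(Configuration conf, String day) throws Exception {
        Map<String, Long> counts = new HashMap<String, Long>();
        HTable table = new HTable(conf, "mytable");            // placeholder table name
        try {
            Scan scan = new Scan(Bytes.toBytes(day), Bytes.toBytes(day + "~"));
            scan.setCaching(1000);
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    String sig = Bytes.toString(r.getRow()).substring(day.length());
                    Long c = counts.get(sig);
                    counts.put(sig, c == null ? 1L : c + 1);
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        String[] days = {"20100325", "20100326", "20100327"};  // placeholder date range

        // One scanner per day, run in parallel; each produces a partial count map.
        ExecutorService pool = Executors.newFixedThreadPool(days.length);
        List<Future<Map<String, Long>>> parts = new ArrayList<Future<Map<String, Long>>>();
        for (final String day : days) {
            parts.add(pool.submit(new Callable<Map<String, Long>>() {
                public Map<String, Long> call() throws Exception {
                    return scanDay(conf, day);
                }
            }));
        }

        // Merge step: add the partial maps into one global tally.
        Map<String, Long> total = new HashMap<String, Long>();
        for (Future<Map<String, Long>> part : parts) {
            for (Map.Entry<String, Long> e : part.get().entrySet()) {
                Long c = total.get(e.getKey());
                total.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        pool.shutdown();
        System.out.println("distinct signatures: " + total.size());
    }
}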

Another option is to redesign the schema and move part of what is now in the row key into a separate column family; that way the data might be spread out more evenly than under the present schema. But even then, I think we would still need some kind of distributed scanning to speed things up.

If any of you have run into a similar problem, any pointers or shared experience would be appreciated.
