For our "LogSearch" product, we make a UUID for every log row when we ingest it in a Mapper. Perfectly distributed, so it'll load evenly across the cluster! On Mon, Feb 15, 2010 at 4:52 PM, Ryan Rawson wrote: > Most log data tends to be time-oriented, thus the 'natural' schema is > to use the timestamp as the row key, thus concentrating all inserts on > a single region and thus node.  This is fixable by changing the key to > something other than a monotonically increasing value. > > If you just insert on 1 region, you end up being gated by the > performance of a single node. Thus limiting intake/insert scalability. > > As for that slide, I am the originator of it, and the reasons above > are why I suggested as below. > > On Mon, Feb 15, 2010 at 4:45 PM, Otis Gospodnetic > wrote: >> Hello, >> >> I've seen the following in a few HBase presentations now: >> >> * What to store in HBase? >> * Maybe not your raw log data... >> * ...but the results of processing it with Hadoop >> >> e.g. slides 26 & 27: http://www.slideshare.net/cloudera/hw09-practical-h-base-getting-the-most-from-your-h-base-install >> >> >> Is there anything wrong in storing raw log data directly into HBase and doing so in real-time, even when that means having to insert a few hundred rows/second? >> >> Is the above advice purely because of data volume associated with storing lots of raw logs or some other reason? >> >> Thanks, >> Otis >> ---- >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >> Hadoop ecosystem search :: http://search-hadoop.com/ >> >> > -- http://www.drawntoscalehq.com -- Big Data for all. The Big Data Platform. http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science