From Paul Zimdars <>
Subject Hadoop/Hive observations
Date Tue, 14 Sep 2010 05:41:06 GMT
Hi All,

We have been using Hadoop (0.20.2+320) and Hive (0.5.0+20) for about a 
month now to see if we could migrate our existing MySQL DB into a 
Hadoop/Hive architecture (Hadoop/Hive rock BTW! :) ). Unfortunately, we 
are seeing slow response times on simple tasks such as a count query 
(e.g. hive> select count(blah_id) from blah;). We 
currently have 2.5B data points residing in a single table, and Hive 
takes approximately 5-6 minutes to count these 2.5B records 
(15-17 minutes for 6.8B records). The reduce portion is fast (a single 
reducer, since this is a count query), but the map stage takes the 
remainder of the time (~95%). We currently have 6 systems (4 x quad 
core), each with approximately 24GB of RAM. We have attempted to add 
more nodes, increase map tasktrackers (many different values), change the 
DFS block size (32M, 64M, 128M, 256M), enable LZO compression, and tune 
many other configuration variables (io.sort.factor, io.sort.mb) without 
much success in lowering the time it takes to complete the count (I do 
notice a high IO wait no matter how many tasktrackers I run). The 
size of the DB is approximately 200GB, and with MySQL it takes a few 
seconds to do both the 2.5B and 6.7B count (I am curious if running this 
locally without any nodes would result in a quicker response time since 
the delay appears to be in the mapping stage...). I have come to believe 
(and read) that hadoop/hive is unfortunately not well suited for this 
type of work and instead is suited for larger data sets. I am curious if 
anyone has any ideas on A) improving performance and/or B) similar 
experiences? I am also curious if maybe something like HBase would be 
better suited for this type of data (small dataset, many files). We 
appreciate any input, suggestions, or ideas!
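
To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python of the scan rate they imply. It assumes the ~200GB size belongs to the 2.5B-row table, that the count is a full table scan, and that the work is spread evenly over the 6 nodes. None of that is measured, and the function names are made up for illustration.

```python
# Back-of-the-envelope scan rates implied by the figures in this message.
# Assumptions (not measurements): the ~200GB size belongs to the 2.5B-row
# table, the count is a full scan, and work is spread evenly over 6 nodes.

def scan_rate(total_bytes, rows, seconds, nodes):
    """Return (aggregate MB/s, per-node MB/s, rows/s) for a full scan."""
    megabytes = total_bytes / (1024 ** 2)
    aggregate_mb_s = megabytes / seconds
    return aggregate_mb_s, aggregate_mb_s / nodes, rows / seconds


def scaling_ratio(rows_a, secs_a, rows_b, secs_b):
    """Return (row ratio, time ratio); similar values suggest linear scaling."""
    return rows_b / rows_a, secs_b / secs_a


# Midpoints of the reported ranges: 5-6 minutes and 15-17 minutes.
agg, per_node, rows_s = scan_rate(200 * 1024 ** 3, 2.5e9, 5.5 * 60, 6)
row_ratio, time_ratio = scaling_ratio(2.5e9, 5.5 * 60, 6.8e9, 16 * 60)

print(f"aggregate: {agg:.0f} MB/s, per node: {per_node:.0f} MB/s")
print(f"rows/s: {rows_s:.2e}")
print(f"row ratio: {row_ratio:.2f}, time ratio: {time_ratio:.2f}")
```

If the per-node rate comes out close to what the local disks can sustain, that would be consistent with the high IO wait mentioned above. Similar row and time ratios would suggest the count time grows roughly linearly with table size.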

Thank you!!
Paul Zimdars

