hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Low throughputs while writing Hfiles using Hfile.writer
Date Fri, 11 Jun 2010 22:07:23 GMT
Why are you using 1 MB HDFS block sizes?  Stick with the default of
64 MB; there is no reason to add 64 times the metadata overhead.
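
Concretely, for the 2 GB output files described later in the thread, a 1 MB block size means 64 times as many blocks for the NameNode to track per file. A minimal standalone arithmetic sketch (not HBase code, just the numbers from the thread):

```java
public class BlockOverhead {
    public static void main(String[] args) {
        long fileSize = 2L * 1024 * 1024 * 1024;   // 2 GB HFile, per the thread
        long smallBlock = 1L * 1024 * 1024;        // 1 MB HDFS block size in use
        long defaultBlock = 64L * 1024 * 1024;     // 64 MB HDFS default

        long smallCount = fileSize / smallBlock;       // blocks at 1 MB
        long defaultCount = fileSize / defaultBlock;   // blocks at 64 MB
        System.out.println(smallCount + " vs " + defaultCount
                + " blocks per file ("
                + (smallCount / defaultCount) + "x the NameNode metadata)");
        // prints: 2048 vs 32 blocks per file (64x the NameNode metadata)
    }
}
```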

As for HFile writing, you will want to enable compression (there is no
reason not to) and also profile the writer.  YourKit has a version
that runs reasonably in semi-production without killing performance too much.

On Fri, Jun 11, 2010 at 3:01 PM, Todd Lipcon <todd@cloudera.com> wrote:
> Hi Vidhya,
> Do you have profiling output from your HFile writers?
> Since you have a standalone program that should be doing little except
> writing, I imagine the profiler output would be pretty useful in seeing
> where the bottleneck lies.
> My guess is that you're CPU bound on serialization - serialization is often
> slow slow slow.
> -Todd
> On Fri, Jun 11, 2010 at 2:54 PM, Vidhyashankar Venkataraman <
> vidhyash@yahoo-inc.com> wrote:
>> For the last couple of days I have been running into bottleneck issues
>> with writing HFiles that I am unable to figure out. I am using
>> HFile.Writer to prepare a bunch of HFiles (HFile is similar to a TFile)
>> for bulk loading, and I have been getting suspiciously low
>> throughput values.
>> I am not using MR to create my files. I prepare data on the fly and dump
>> HFiles almost exactly the way HFileOutputFormat does.
>> This is my current setup (almost the same as in my
>> previous emails):
>> Individual output file size is 2 GB, with a block size of 1 MB. I am writing
>> multiple such files to build the entire db. Each client program writes
>> files one after another.
>> Each key-value pair is around 15 KB.
>> 5 datanodes.
>> Each datanode also runs 5 instances of my client program (25 processes in all).
>> And I get a throughput of around 100 rows per second per node (which comes
>> to around 1.5 MBps per node).
>> As expected, neither the disk nor the network is the bottleneck.
>> Are there any config values that I need to take care of?
>> With Hadoop's copyFromLocal command, I can get much better
>> throughput: 50 MBps with just one process (of course, the block size is
>> much larger in that case).
>> Thanks in advance :)
>> Vidhya
>> On 6/11/10 12:44 PM, "Pei Lin Ong" <peilin@yahoo-inc.com> wrote:
>> Hi Milind and Koji,
>> Vidhya is one of the Search devs working on the Web Crawl Cache cluster
>> (ingesting crawled content from Bing).
>> He is currently looking at different technology choices, such as HBase, for
>> the cluster configuration. Vidhya has run into a Hadoop HDFS issue and is
>> looking for help.
>> I have suggested he pose the question via this thread as Vidhya indicates
>> it is urgent due to the WCC timetable.
>> Please accommodate this request and see if you can answer Vidhya's question
>> (after he poses it). Should the question require further discussion, then
>> Vidhya or I will file a ticket.
>> Thank you!
>> Pei
> --
> Todd Lipcon
> Software Engineer, Cloudera
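
Putting the numbers from the thread together (about 15 KB per key-value pair, roughly 100 rows/sec/node, and a 50 MBps copyFromLocal baseline), a quick sanity check of the reported throughput. This is a minimal standalone sketch of the arithmetic, not HBase code:

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        // Figures reported in the thread.
        long rowsPerSecPerNode = 100;          // ~100 rows/sec per node
        long bytesPerRow = 15 * 1024;          // ~15 KB per key-value pair

        // Implied write throughput per node.
        double mbPerSec = rowsPerSecPerNode * bytesPerRow / (1024.0 * 1024.0);
        System.out.printf("%.2f MB/s per node%n", mbPerSec);
        // prints: 1.46 MB/s per node

        // Compare against the 50 MB/s copyFromLocal baseline from the thread.
        double fractionOfBaseline = mbPerSec / 50.0;
        System.out.printf("%.1f%% of the 50 MB/s baseline%n",
                fractionOfBaseline * 100);
        // prints: 2.9% of the 50 MB/s baseline
    }
}
```

At roughly 3% of the raw HDFS write speed seen with copyFromLocal, the gap is far too large to be explained by disk or network, which is consistent with the CPU-bound serialization guess above.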
