hbase-dev mailing list archives

From Jay Booth <jaybo...@gmail.com>
Subject Re: Local sockets
Date Mon, 06 Dec 2010 20:14:29 GMT
The fallocate() idea was a decent one, although the benefit probably doesn't
outweigh the complexity of adding it. Allocating 64MB at a time for a block
and then truncating when finished could prevent some out-of-disk edge cases
and improve later streaming read speed. But that's probably all very
marginal compared to having to add new native libraries and special cases
that could break.
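
For the record, a rough sketch of what preallocate-then-truncate could look
like from Java is below. It assumes a JNA binding to libc (the JDK doesn't
expose fallocate), and the class names, block path, and sizes are
illustrative only, not anything HDFS actually does:

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    // Illustrative sketch only: preallocate a full 64MB block file up front,
    // then give back the unused tail once the writer knows the real length.
    // Error handling and portability concerns are mostly glossed over.
    public class PreallocateSketch {
      public interface LibC extends Library {
        LibC INSTANCE = Native.load("c", LibC.class);
        int creat(String path, int mode);   // open-for-write, creating the file
        int posix_fallocate(int fd, long offset, long len);
        int ftruncate(int fd, long length);
        int close(int fd);
      }

      public static void main(String[] args) {
        final long BLOCK_SIZE = 64L * 1024 * 1024;

        int fd = LibC.INSTANCE.creat("/data/dfs/current/blk_1234567890", 0644);
        if (fd < 0) throw new RuntimeException("creat() failed");

        // Reserve the whole block's worth of space now: this fails fast if the
        // disk is nearly full and keeps the extents contiguous for later
        // streaming reads.
        int rc = LibC.INSTANCE.posix_fallocate(fd, 0, BLOCK_SIZE);
        if (rc != 0) throw new RuntimeException("posix_fallocate() failed: " + rc);

        // ... the writer streams the actual block data into fd here ...
        long bytesWritten = 10L * 1024 * 1024;   // pretend the block ended up short

        // Block is finalized: shrink the file back to what was actually written.
        LibC.INSTANCE.ftruncate(fd, bytesWritten);
        LibC.INSTANCE.close(fd);
      }
    }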

I think process-level I/O scheduling would add a lot of complexity and
probably lead to worse performance, as would cache bypass (presuming that
Linux's implementation with splice means it wouldn't save any data copies
anyway).

On Mon, Dec 6, 2010 at 2:56 PM, Todd Lipcon <todd@cloudera.com> wrote:

> On Mon, Dec 6, 2010 at 11:39 AM, Vladimir Rodionov
> <vrodionov@carrieriq.com> wrote:
>
> > Todd,
> >
> > There are some curious people who have spent time (and taxpayers' money
> > :) and have come to the same conclusion as me:
> >
> > http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
> >
> >
> The paper says:
>
> "By default, the FreeBSD systems used for testing employed
> a simple elevator I/O scheduler. If the system had used a
> more sophisticated scheduler that minimizes seeks, such as the
> Anticipatory Scheduler, this problem may have been masked,"
>
> which is basically what I said:
> "After tuning readahead and with a decent elevator algorithm (anticipatory
> seems best in my benchmarks) it's better to have multiple threads reading
> from a drive compared to one"
>
> Testing Hadoop on FreeBSD with UFS is an awfully odd choice, considering
> everyone runs Hadoop on Linux with ext3, ext4, or XFS. More advanced
> operating systems and filesystems do a better job of dealing with the
> concurrent reader/writer case - e.g. ext4 has delayed allocation, which
> helps writers, and properly tuned readahead on the block device should
> eliminate the small read request sizes seen in the paper. The paper
> mentions that they tried similar techniques on BSD, but admits that the
> implementation is "basic". I haven't used FreeBSD in many years so I can't
> really comment on the relative sophistication of their IO subsystem.
>
> I don't disagree with anything in the paper, but the point they seem to be
> making is that Hadoop's performance is not portable (i.e. it doesn't work
> well on FreeBSD with UFS). That's certainly valid from an academic
> standpoint, but given that no one has put any effort into that deployment
> scenario, it's hardly surprising :) Like Jay said, the paper's main point
> seems to be "our local filesystem does not scale well to multithreaded
> access unless the application implements its own IO scheduler", rather
> than anything specific about HDFS.
>
> Nonetheless, the paper was an interesting read; I hadn't seen it before. It
> always surprises me when academics write papers about open source projects
> but don't think to email out their results to any of the mailing lists!
>
> -Todd
>
> > ________________________________________
> > From: Todd Lipcon [todd@cloudera.com]
> > Sent: Monday, December 06, 2010 10:04 AM
> > To: dev@hbase.apache.org
> > Subject: Re: Local sockets
> >
> > On Mon, Dec 6, 2010 at 9:59 AM, Vladimir Rodionov
> > <vrodionov@carrieriq.com> wrote:
> >
> > > Todd,
> > >
> > > The major HDFS problem is inefficient processing of multiple streams in
> > > parallel - multiple readers/writers per physical drive result in a
> > > significant drop in overall I/O throughput on Linux (tested with ext3,
> > > ext4). There should be only one reader thread and one writer thread per
> > > physical drive (until we get AIO support in Java).
> > >
> > > Multiple data buffer copies in the pipeline do not improve the situation
> > > either.
> > >
> >
> > In my benchmarks, the copies account for only a minor amount of the
> > overhead. Do a benchmark of ChecksumLocalFilesystem vs RawLocalFilesystem
> > and you should see the 2x difference I mentioned for data that's in
> > buffer cache.
> >
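As a rough illustration of the comparison being described here, a sketch like
the following could be used. It assumes the Hadoop client libraries are on
the classpath and that the input file was written through the checksummed
local filesystem (so its .crc sidecar exists); the file path, buffer size,
and warm-up pass are arbitrary choices:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    // Rough sketch: time sequential reads of the same (cached) file through
    // the checksummed local filesystem and the raw one underneath it.
    public class ChecksumOverheadBench {
      static long timeReadMillis(FileSystem fs, Path p) throws Exception {
        byte[] buf = new byte[64 * 1024];
        long start = System.nanoTime();
        try (FSDataInputStream in = fs.open(p)) {
          while (in.read(buf) != -1) { /* just stream through the file */ }
        }
        return (System.nanoTime() - start) / 1_000_000;
      }

      public static void main(String[] args) throws Exception {
        Path p = new Path(args[0]);   // e.g. a ~1GB file already in buffer cache
        LocalFileSystem checksummed = FileSystem.getLocal(new Configuration());
        FileSystem raw = checksummed.getRawFileSystem();

        timeReadMillis(raw, p);       // warm the page cache before timing
        System.out.println("checksummed: " + timeReadMillis(checksummed, p) + " ms");
        System.out.println("raw:         " + timeReadMillis(raw, p) + " ms");
      }
    }
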
> > As for parallel reader streams, I disagree with your assessment. After
> > tuning readahead and with a decent elevator algorithm (anticipatory seems
> > best in my benchmarks) it's better to have multiple threads reading from
> > a drive compared to one, unless we had AIO. Otherwise we won't be able to
> > have multiple outstanding requests to the block device, and the elevator
> > will be powerless to do any reordering of reads.
> >
> >
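To make the multiple-outstanding-requests point concrete, the shape of a
benchmark that shows it is sketched below: N threads issuing positional
reads over disjoint slices of one large, uncached file, so the elevator has
several requests in flight to reorder, versus the same program run with a
single thread. The file path, thread count, and buffer size are arbitrary
assumptions:

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: compare 1 reader thread vs N reader threads over disjoint slices
    // of one large file. With several threads there are multiple requests
    // outstanding, so the I/O scheduler has something to reorder and merge.
    public class ParallelReadBench {
      public static void main(String[] args) throws Exception {
        String path = args[0];                    // a large file NOT in page cache
        int threads = Integer.parseInt(args[1]);  // try 1, then 4 or 8

        try (FileChannel ch = new RandomAccessFile(path, "r").getChannel()) {
          long size = ch.size();
          long slice = size / threads;
          ExecutorService pool = Executors.newFixedThreadPool(threads);
          List<Future<Long>> results = new ArrayList<>();

          long start = System.nanoTime();
          for (int i = 0; i < threads; i++) {
            final long off = i * slice;
            results.add(pool.submit(() -> {
              ByteBuffer buf = ByteBuffer.allocate(128 * 1024);
              long done = 0, pos = off;
              while (done < slice) {
                buf.clear();
                int n = ch.read(buf, pos);  // positional read, safe across threads
                if (n < 0) break;
                pos += n;
                done += n;
              }
              return done;
            }));
          }
          long total = 0;
          for (Future<Long> f : results) total += f.get();
          pool.shutdown();

          double secs = (System.nanoTime() - start) / 1e9;
          System.out.printf("%d threads: %.1f MB/s%n",
              threads, total / secs / (1024 * 1024));
        }
      }
    }
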
> > > CRC32 can be fast btw and some other hashing algos can be even faster
> > > (like murmur2 - 1.5GB per sec)
> > >
> >
> > Our CRC32 implementation goes around 750MB/sec on raw data, but for
> > whatever undiscovered reason it adds a lot more overhead when you mix it
> > into the data pipeline. HDFS-347 has some interesting benchmarks there.
> >
> > -Todd
> >
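For anyone who wants to reproduce the raw checksum throughput number on
their own hardware, a trivial microbenchmark along these lines works. Note
it exercises the JDK's java.util.zip.CRC32 rather than Hadoop's own CRC32
implementation, and the buffer size and iteration count are arbitrary:

    import java.util.Random;
    import java.util.zip.CRC32;

    // Microbenchmark: how fast can java.util.zip.CRC32 chew through bytes?
    public class Crc32Bench {
      public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024 * 1024];   // 64MB of random data
        new Random(42).nextBytes(buf);

        CRC32 crc = new CRC32();
        int iters = 20;

        // Warm up the JIT before timing.
        for (int i = 0; i < 5; i++) { crc.reset(); crc.update(buf, 0, buf.length); }

        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
          crc.reset();
          crc.update(buf, 0, buf.length);
        }
        double secs = (System.nanoTime() - start) / 1e9;
        double mbPerSec = (double) iters * buf.length / (1024 * 1024) / secs;
        System.out.printf("CRC32: %.0f MB/sec (checksum=%d)%n", mbPerSec, crc.getValue());
      }
    }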
> > >
> > > ________________________________________
> > > From: Todd Lipcon [todd@cloudera.com]
> > > Sent: Saturday, December 04, 2010 3:04 PM
> > > To: dev@hbase.apache.org
> > > Subject: Re: Local sockets
> > >
> > > On Sat, Dec 4, 2010 at 2:57 PM, Vladimir Rodionov
> > > <vrodionov@carrieriq.com> wrote:
> > >
> > > > From my own experiments, the performance difference is huge even on
> > > > sequential R/W operations (up to 300%) when you do local File I/O vs
> > > > HDFS File I/O.
> > > >
> > > > The overhead of HDFS I/O is substantial, to say the least.
> > > >
> > > >
> > > Much of this is from checksumming, though - turn off checksums and you
> > > should see about a 2x improvement at least.
> > >
> > > -Todd
> > >
> > >
> > > > Best regards,
> > > > Vladimir Rodionov
> > > > Principal Platform Engineer
> > > > Carrier IQ, www.carrieriq.com
> > > > e-mail: vrodionov@carrieriq.com
> > > >
> > > > ________________________________________
> > > > From: Todd Lipcon [todd@cloudera.com]
> > > > Sent: Saturday, December 04, 2010 12:30 PM
> > > > To: dev@hbase.apache.org
> > > > Subject: Re: Local sockets
> > > >
> > > > Hi Leen,
> > > >
> > > > Check out HDFS-347 for more info on this. I hope to pick this back up
> > > > in 2011 - in 2010 we mostly focused on stability over performance in
> > > > HBase's interactions with HDFS.
> > > >
> > > > Thanks
> > > > -Todd
> > > >
> > > > On Sat, Dec 4, 2010 at 12:28 PM, Leen Toelen <toelen@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Has anyone tested the performance impact (when there is an HDFS
> > > > > datanode and an HBase node on the same machine) of using unix domain
> > > > > socket communication or shared-memory IPC using NIO? I guess this
> > > > > should make a difference on reads?
> > > > >
> > > > > Regards,
> > > > > Leen
> > > > >
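(A side note on Leen's question: the JDK has no built-in unix domain socket
support in this era, so any experiment means JNI or a third-party library.
On JDK 16+ the basic shape of the idea can be sketched as below; the socket
path and payload are invented for illustration and this is not anything HDFS
or HBase ships.)

    import java.io.IOException;
    import java.net.StandardProtocolFamily;
    import java.net.UnixDomainSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch of local IPC over a unix domain socket (JDK 16+ APIs).
    // A datanode-side "server" could hand block bytes to a co-located
    // region server "client" without going through TCP loopback.
    public class LocalSocketSketch {
      public static void main(String[] args) throws IOException, InterruptedException {
        Path sockPath = Path.of("/tmp/dn-local.sock");   // illustrative path
        Files.deleteIfExists(sockPath);
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(sockPath);

        // "Datanode" side: bind first, then accept one connection and send bytes.
        ServerSocketChannel ssc = ServerSocketChannel.open(StandardProtocolFamily.UNIX);
        ssc.bind(addr);
        Thread server = new Thread(() -> {
          try (SocketChannel peer = ssc.accept()) {
            peer.write(ByteBuffer.wrap("block bytes...".getBytes()));
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        });
        server.start();

        // "Region server" side: connect over the socket file and read.
        try (SocketChannel ch = SocketChannel.open(addr)) {
          ByteBuffer buf = ByteBuffer.allocate(1024);
          ch.read(buf);
          System.out.println("read " + buf.position() + " bytes over the unix socket");
        }
        server.join();
        ssc.close();
        Files.deleteIfExists(sockPath);
      }
    }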
> > > >
> > > >
> > > >
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
