hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: HBase Random Read latency > 100ms
Date Tue, 08 Oct 2013 19:14:37 GMT
How many reads per second per region server are you throwing at the system?
Also, is 100ms the average latency?


On Mon, Oct 7, 2013 at 2:04 PM, lars hofhansl <larsh@apache.org> wrote:

> He still should not see 100ms latency. 20ms, sure. 100ms seems large;
> there are still 8 machines serving the requests.
>
> I agree this spec is far from optimal, but there is still something odd
> here.
>
>
> Ramu, this does not look like a GC issue. You'd see much larger (worst
> case) latencies if that were the case (dozens of seconds).
> Are you using 40 clients from 40 different machines? Or from 40 different
> processes on the same machine? Or 40 threads in the same process?
>
> Thanks.
>
> -- Lars
>
>
>
> ________________________________
>  From: Vladimir Rodionov <vrodionov@carrieriq.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Monday, October 7, 2013 11:02 AM
> Subject: RE: HBase Random Read latency > 100ms
>
>
> Ramu, your HBase configuration (128GB of heap) is far from optimal.
> Nobody runs HBase with that amount of heap, to the best of my knowledge.
> 32GB of heap is the usual upper limit. We run 8-12GB in production.
>
> What's more, your IO capacity is VERY low: 2 SATA drives in RAID 1 for a
> mostly random-read load?
> You should have 8, better 12-16, drives per server. Forget about RAID. You
> have HDFS.
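>
> Rough arithmetic, assuming ~100 random IOPS per 7200 RPM SATA drive
> (8-10 ms per seek): even if RAID 1 serves reads from both mirrors, that is
>
>   2 drives x ~100 IOPS = ~200 truly random reads/sec per server
>
> Every block-cache miss costs one of those seeks, so queues (and latency)
> build up quickly beyond that rate.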
>
> Block cache in your case does not help much, since your read amplification
> is at least x20 (a 16KB block read to return a 724 B record) - it just
> wastes RAM (heap). In your case you do not need a LARGE heap and a LARGE
> block cache.
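>
> The arithmetic behind that figure: 16384 B / 724 B = ~22.6, so every
> cache-missing get drags in roughly 20x more data than it returns.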
>
> I advise reconsidering your hardware spec, applying all the optimizations
> already mentioned in this thread, and lowering your expectations.
>
> With the right hardware you will be able to get 500-1000 truly random reads
> per server.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
>
> From: Ramu M S [ramu.malur@gmail.com]
> Sent: Monday, October 07, 2013 5:23 AM
> To: user@hbase.apache.org
> Subject: Re: HBase Random Read latency > 100ms
>
> Hi Bharath,
>
> I am a little confused about the metrics displayed by Cloudera. Even when
> there are no operations, the gc_time metric shows a constant 2s in the
> graph. Is this the CMS gc_time (in which case there is no JVM pause) or the
> GC pause?
>
> The GC timings reported earlier are the average of the gc_time metric
> across all region servers.
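>
> One way to settle this (a sketch, assuming a stock hbase-env.sh and a
> writable log path of your choosing) is to turn on GC logging on a region
> server and read the pause times directly instead of the aggregated metric:
>
>   # hbase-env.sh - log every collection with timestamps and durations
>   export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
>     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>     -Xloggc:/var/log/hbase/gc-regionserver.log"
>
> CMS-concurrent-* phases in that log do not stop the JVM; the stop-the-world
> events are the [GC ...] and [Full GC ...] entries with their pause times.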
>
> Regards,
> Ramu
>
>
> On Mon, Oct 7, 2013 at 9:10 PM, Ramu M S <ramu.malur@gmail.com> wrote:
>
> > Jean,
> >
> > Yes. It is 2 drives.
> >
> > - Ramu
> >
> >
> > On Mon, Oct 7, 2013 at 8:45 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >
> >> Quick question on the disk side.
> >>
> >> When you say:
> >> 800 GB SATA (7200 RPM) Disk
> >> Is it 1x800GB? It's RAID 1, so it might be 2 drives? What's the
> >> configuration?
> >>
> >> JM
> >>
> >>
> >> 2013/10/7 Ramu M S <ramu.malur@gmail.com>
> >>
> >> > Lars, Bharath,
> >> >
> >> > Compression is disabled for the table. This was not intended for the
> >> > evaluation; I forgot it during table creation. I will enable Snappy and
> >> > run a major compaction again.
> >> >
> >> > Please suggest other options to try out, as well as answers to the
> >> > previous questions.
> >> >
> >> > Thanks,
> >> > Ramu
> >> >
> >> >
> >> > On Mon, Oct 7, 2013 at 6:35 PM, Ramu M S <ramu.malur@gmail.com> wrote:
> >> >
> >> > > Bharath,
> >> > >
> >> > > I was about to report this. Yes, indeed there is too much GC time.
> >> > > I just verified the GC time using Cloudera Manager statistics
> >> > > (updated every minute).
> >> > >
> >> > > For each Region Server,
> >> > >  - During reads: the graph shows a constant 2s.
> >> > >  - During compaction: the graph starts at 7s and goes as high as 20s
> >> > > towards the end.
> >> > >
> >> > > Few more questions,
> >> > > 1. For the current evaluation, since the reads are completely random
> >> > > and I don't expect to read the same data again, can I set the heap to
> >> > > the default 1 GB?
> >> > >
> >> > > 2. Can I completely turn off BLOCK CACHE for this table (see the
> >> > > shell sketch after this list)?
> >> > >    http://hbase.apache.org/book/regionserver.arch.html recommends
> >> > > that for random reads.
> >> > >
> >> > > 3. In the next phase of evaluation, we are interested in using HBase
> >> > > as an in-memory KV DB by keeping the latest data in RAM (to the tune
> >> > > of around 128 GB in each RS; we are setting up a 50-100 node cluster).
> >> > > I am very curious to hear any suggestions in this regard.
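> >> > >
> >> > > On question 2, a minimal sketch of that change in the HBase shell
> >> > > (assuming the table can be taken offline briefly; 0.94 does not apply
> >> > > schema changes online by default):
> >> > >
> >> > >   disable 'usertable'
> >> > >   alter 'usertable', {NAME => 'cf', BLOCKCACHE => 'false'}
> >> > >   enable 'usertable'
> >> > >
> >> > > Note this only stops caching data blocks; index and bloom blocks are
> >> > > still cached.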
> >> > >
> >> > > Regards,
> >> > > Ramu
> >> > >
> >> > >
> >> > > On Mon, Oct 7, 2013 at 5:50 PM, Bharath Vissapragada <bharathv@cloudera.com> wrote:
> >> > >
> >> > >> Hi Ramu,
> >> > >>
> >> > >> Thanks for reporting the results back. Just curious whether you are
> >> > >> hitting any big GC pauses due to block cache churn on such a large
> >> > >> heap. Do you see any?
> >> > >>
> >> > >> - Bharath
> >> > >>
> >> > >>
> >> > >> On Mon, Oct 7, 2013 at 1:42 PM, Ramu M S <ramu.malur@gmail.com> wrote:
> >> > >>
> >> > >> > Lars,
> >> > >> >
> >> > >> > After changing the BLOCKSIZE to 16KB, the latency has reduced a
> >> > >> > little. Now the average is around 75ms.
> >> > >> > Overall throughput (I am using 40 clients to fetch records) is
> >> > >> > around 1K OPS.
> >> > >> >
> >> > >> > After compaction, hdfsBlocksLocalityIndex is 91,88,78,90,99,82,94,97
> >> > >> > across my 8 RS respectively.
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Ramu
> >> > >> >
> >> > >> >
> >> > >> > On Mon, Oct 7, 2013 at 3:51 PM, Ramu M S <ramu.malur@gmail.com> wrote:
> >> > >> >
> >> > >> > > Thanks Lars.
> >> > >> > >
> >> > >> > > I have changed the BLOCKSIZE to 16KB and triggered a major
> >> > >> > > compaction. I will report my results once it is done.
> >> > >> > >
> >> > >> > > - Ramu
> >> > >> > >
> >> > >> > >
> >> > >> > > On Mon, Oct 7, 2013 at 3:21 PM, lars hofhansl <larsh@apache.org> wrote:
> >> > >> > >
> >> > >> > >> First off: a 128GB heap per RegionServer. Wow. I'd be interested
> >> > >> > >> to hear about your experience with such a large heap for your RS.
> >> > >> > >> It's definitely big enough.
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> It's interesting that 100GB does fit into the aggregate cache
> >> > >> > >> (of 8x32GB), while 1.8TB does not.
> >> > >> > >> Looks like ~70% of the read requests would need to bring in a
> >> > >> > >> 64KB block in order to read 724 bytes.
> >> > >> > >>
> >> > >> > >> Should that take 100ms? No. Something's still amiss.
> >> > >> > >>
> >> > >> > >> Smaller blocks might help (you'd need to bring in 4, 8, or
> >> > >> > >> maybe 16KB to read the small row). You would need to issue a
> >> > >> > >> major compaction for that to take effect.
> >> > >> > >> Maybe try 16KB blocks. If that speeds up your random gets, we
> >> > >> > >> know where to look next... at the disk IO.
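> >> > >> > >>
> >> > >> > >> A minimal shell sketch of that change, assuming the table and
> >> > >> > >> family names from this thread and that the table can be disabled
> >> > >> > >> briefly:
> >> > >> > >>
> >> > >> > >>   disable 'usertable'
> >> > >> > >>   alter 'usertable', {NAME => 'cf', BLOCKSIZE => '16384'}
> >> > >> > >>   enable 'usertable'
> >> > >> > >>   major_compact 'usertable'
> >> > >> > >>
> >> > >> > >> Existing HFiles keep the old block size until the major
> >> > >> > >> compaction rewrites them.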
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> -- Lars
> >> > >> > >>
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> ________________________________
> >> > >> > >>  From: Ramu M S <ramu.malur@gmail.com>
> >> > >> > >> To: user@hbase.apache.org; lars hofhansl <larsh@apache.org>
> >> > >> > >> Sent: Sunday, October 6, 2013 11:05 PM
> >> > >> > >> Subject: Re: HBase Random Read latency > 100ms
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> Lars,
> >> > >> > >>
> >> > >> > >> In one of your old posts, you had mentioned that lowering the
> >> > >> > >> BLOCKSIZE is good for random reads (of course with an increased
> >> > >> > >> size for the block indexes).
> >> > >> > >>
> >> > >> > >> The post is at
> >> > >> > >> http://grokbase.com/t/hbase/user/11bat80x7m/row-get-very-slow
> >> > >> > >>
> >> > >> > >> Will that help in my tests? Should I give it a try? If I alter
> >> > >> > >> my table, should I trigger a major compaction again for this to
> >> > >> > >> take effect?
> >> > >> > >>
> >> > >> > >> Thanks,
> >> > >> > >> Ramu
> >> > >> > >>
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> On Mon, Oct 7, 2013 at 2:44 PM, Ramu M S <ramu.malur@gmail.com> wrote:
> >> > >> > >>
> >> > >> > >> > Sorry, the BLOCKSIZE was wrong in my earlier post; it is the
> >> > >> > >> > default 64 KB.
> >> > >> > >> >
> >> > >> > >> > {NAME => 'usertable', FAMILIES => [{NAME => 'cf',
> >> > >> > >> > DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL',
> >> > >> > >> > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
> >> > >> > >> > 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
> >> > >> > >> > KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
> >> > >> > >> > IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
> >> > >> > >> > BLOCKCACHE => 'true'}]}
> >> > >> > >> >
> >> > >> > >> > Thanks,
> >> > >> > >> > Ramu
> >> > >> > >> >
> >> > >> > >> >
> >> > >> > >> > On Mon, Oct 7, 2013 at 2:42 PM, Ramu M S <ramu.malur@gmail.com> wrote:
> >> > >> > >> >
> >> > >> > >> >> Lars,
> >> > >> > >> >>
> >> > >> > >> >> - Yes, short circuit reading is enabled on both HDFS and
> >> > >> > >> >> HBase.
> >> > >> > >> >> - I had issued a major compaction after the table was loaded.
> >> > >> > >> >> - Region Servers have the max heap set to 128 GB. Block cache
> >> > >> > >> >> size is 0.25 of the heap (so 32 GB for each Region Server).
> >> > >> > >> >> Do we need even more?
> >> > >> > >> >> - Decreasing the HFile size (the default is 1GB)? Or should I
> >> > >> > >> >> leave it at the default?
> >> > >> > >> >> - Keys are Zipfian distributed (by YCSB)
> >> > >> > >> >>
> >> > >> > >> >> Bharath,
> >> > >> > >> >>
> >> > >> > >> >> Bloom filters are enabled. Here are my table details:
> >> > >> > >> >> {NAME => 'usertable', FAMILIES => [{NAME => 'cf',
> >> > >> > >> >> DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL',
> >> > >> > >> >> REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
> >> > >> > >> >> 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
> >> > >> > >> >> KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '16384',
> >> > >> > >> >> IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
> >> > >> > >> >> BLOCKCACHE => 'true'}]}
> >> > >> > >> >>
> >> > >> > >> >> When the data size is around 100GB (100 million records),
> >> > >> > >> >> the latency is very good. I am getting a throughput of around
> >> > >> > >> >> 300K OPS.
> >> > >> > >> >> In both cases (100 GB and 1.8 TB), Ganglia stats show that
> >> > >> > >> >> disk reads are around 50-60 MB/s throughout the read cycle.
> >> > >> > >> >>
> >> > >> > >> >> Thanks,
> >> > >> > >> >> Ramu
> >> > >> > >> >>
> >> > >> > >> >>
> >> > >> > >> >> On Mon, Oct 7, 2013 at 2:21 PM, lars hofhansl <larsh@apache.org> wrote:
> >> > >> > >> >>
> >> > >> > >> >>> Have you enabled short circuit reading? See here:
> >> > >> > >> >>> http://hbase.apache.org/book/perf.hdfs.html
> >> > >> > >> >>>
> >> > >> > >> >>> How's your data locality (shown on the RegionServer UI page)?
> >> > >> > >> >>>
> >> > >> > >> >>>
> >> > >> > >> >>> How much memory are you giving your RegionServers?
> >> > >> > >> >>> If your reads are truly random and the data set does not fit
> >> > >> > >> >>> into the aggregate cache, you'll be dominated by the disk and
> >> > >> > >> >>> network.
> >> > >> > >> >>> Each read would need to bring in a 64k (default) HFile block.
> >> > >> > >> >>> If short circuit reading is not enabled you'll get two or
> >> > >> > >> >>> three context switches.
> >> > >> > >> >>>
> >> > >> > >> >>> So I would try (see the sketch after this list):
> >> > >> > >> >>> 1. Enable short circuit reading
> >> > >> > >> >>> 2. Increase the block cache size per RegionServer
> >> > >> > >> >>> 3. Decrease the HFile block size
> >> > >> > >> >>> 4. Make sure your data is local (if it is not, issue a major
> >> > >> > >> >>> compaction).
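> >> > >> > >> >>>
> >> > >> > >> >>> A minimal config sketch for items 1 and 2, assuming CDH4-era
> >> > >> > >> >>> property names and a socket path of your choosing (item 1
> >> > >> > >> >>> goes in both hdfs-site.xml and hbase-site.xml, item 2 in
> >> > >> > >> >>> hbase-site.xml):
> >> > >> > >> >>>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>dfs.client.read.shortcircuit</name>
> >> > >> > >> >>>   <value>true</value>
> >> > >> > >> >>> </property>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>dfs.domain.socket.path</name>
> >> > >> > >> >>>   <value>/var/run/hadoop-hdfs/dn._PORT</value>
> >> > >> > >> >>> </property>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>hfile.block.cache.size</name>
> >> > >> > >> >>>   <!-- fraction of the RS heap; the default is 0.25 -->
> >> > >> > >> >>>   <value>0.4</value>
> >> > >> > >> >>> </property>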
> >> > >> > >> >>>
> >> > >> > >> >>>
> >> > >> > >> >>> -- Lars
> >> > >> > >> >>>
> >> > >> > >> >>>
> >> > >> > >> >>>
> >> > >> > >> >>> ________________________________
> >> > >> > >> >>>  From: Ramu M S <ramu.malur@gmail.com>
> >> > >> > >> >>> To: user@hbase.apache.org
> >> > >> > >> >>> Sent: Sunday, October 6, 2013 10:01 PM
> >> > >> > >> >>> Subject: HBase Random Read latency > 100ms
> >> > >> > >> >>>
> >> > >> > >> >>>
> >> > >> > >> >>> Hi All,
> >> > >> > >> >>>
> >> > >> > >> >>> My HBase cluster has 8 Region Servers (CDH 4.4.0, HBase
> >> > >> > >> >>> 0.94.6).
> >> > >> > >> >>>
> >> > >> > >> >>> Each Region Server has the following configuration:
> >> > >> > >> >>> 16-core CPU, 192 GB RAM, 800 GB SATA (7200 RPM) disk
> >> > >> > >> >>> (unfortunately configured with RAID 1; I can't change this,
> >> > >> > >> >>> as the machines are leased temporarily for a month).
> >> > >> > >> >>>
> >> > >> > >> >>> I am running YCSB benchmark tests on HBase and am currently
> >> > >> > >> >>> inserting around 1.8 billion records
> >> > >> > >> >>> (1 key + 7 fields of 100 bytes = 724 bytes per record).
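> >> > >> > >> >>>
> >> > >> > >> >>> For reference, a read phase like this one would be driven
> >> > >> > >> >>> with something like the following (a sketch; the workload
> >> > >> > >> >>> file and flags are assumptions, not stated in this thread):
> >> > >> > >> >>>
> >> > >> > >> >>>   bin/ycsb run hbase -P workloads/workloadc \
> >> > >> > >> >>>     -p table=usertable -p columnfamily=cf -threads 40
> >> > >> > >> >>>
> >> > >> > >> >>> (workloadc is YCSB's 100% read, zipfian-request workload.)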
> >> > >> > >> >>>
> >> > >> > >> >>> Currently I am getting a write throughput of around 100K
> >> > >> > >> >>> OPS, but random reads are very, very slow: all gets have a
> >> > >> > >> >>> latency of 100ms or more.
> >> > >> > >> >>>
> >> > >> > >> >>> I have changed the following default configuration (the
> >> > >> > >> >>> presumed properties are sketched below):
> >> > >> > >> >>> 1. HFile Size: 16GB
> >> > >> > >> >>> 2. HDFS Block Size: 512 MB
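> >> > >> > >> >>>
> >> > >> > >> >>> Presumably via something like the following (a sketch; the
> >> > >> > >> >>> exact property names are not stated in this thread):
> >> > >> > >> >>>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>hbase.hregion.max.filesize</name>
> >> > >> > >> >>>   <!-- 16 GB -->
> >> > >> > >> >>>   <value>17179869184</value>
> >> > >> > >> >>> </property>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>dfs.block.size</name>
> >> > >> > >> >>>   <!-- 512 MB -->
> >> > >> > >> >>>   <value>536870912</value>
> >> > >> > >> >>> </property>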
> >> > >> > >> >>>
> >> > >> > >> >>> Total data size is around 1.8 TB (excluding the replicas).
> >> > >> > >> >>> My table is split into 128 regions (no pre-splitting used;
> >> > >> > >> >>> it started with 1 and grew to 128 over the insertion time).
> >> > >> > >> >>>
> >> > >> > >> >>> Taking some inputs from earlier discussions, I have made the
> >> > >> > >> >>> following changes to disable Nagle (in both the client and
> >> > >> > >> >>> server hbase-site.xml and hdfs-site.xml):
> >> > >> > >> >>>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>hbase.ipc.client.tcpnodelay</name>
> >> > >> > >> >>>   <value>true</value>
> >> > >> > >> >>> </property>
> >> > >> > >> >>>
> >> > >> > >> >>> <property>
> >> > >> > >> >>>   <name>ipc.server.tcpnodelay</name>
> >> > >> > >> >>>   <value>true</value>
> >> > >> > >> >>> </property>
> >> > >> > >> >>>
> >> > >> > >> >>> Ganglia stats show large CPU IO wait (>30% during reads).
> >> > >> > >> >>>
> >> > >> > >> >>> I agree that the disk configuration is not ideal for a
> >> > >> > >> >>> Hadoop cluster, but as mentioned earlier it can't be changed
> >> > >> > >> >>> for now.
> >> > >> > >> >>> I feel the latency is way beyond any results reported so far.
> >> > >> > >> >>>
> >> > >> > >> >>> Any pointers on what can be wrong?
> >> > >> > >> >>>
> >> > >> > >> >>> Thanks,
> >> > >> > >> >>> Ramu
> >> > >> > >> >>>
> >> > >> > >> >>
> >> > >> > >> >>
> >> > >> > >> >
> >> > >> > >>
> >> > >> > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Bharath Vissapragada
> >> > >> <http://www.cloudera.com>
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
