hbase-user mailing list archives

From "Daniel Ploeg" <dpl...@gmail.com>
Subject Re: Question about recommended heap sizes
Date Mon, 29 Sep 2008 01:48:30 GMT
Thanks Jonathan.

To answer your questions - no, there were no writes occurring at the same
time as the reads. Also, it had been a little while since I looked at the
architecture and BigTable papers, but looking at them again, the info in them
makes more sense now that I've started to throw some volume tests at HBase
and see it in action :)

I have been attempting to run further tests, trying to ramp up to 100K rows.
I got an exception while loading data, at about 43K rows: I had run out of
space on my HDFS cluster. When I looked at the HDFS web pages, I noticed that
two of the four datanodes had no remaining space available to be allocated,
though there is still 100-200GB free on each of the other two nodes. I did
notice that HBase created a great number of files, many of which contain less
data than the 64MB block size. For example (my test involves a single table
with a single column), I noticed that each region on HDFS may have several
files for the column. There are n map files and n info files there as well
(from what I could tell). Each map file's index file is well below the space
allocated for it (often <100KB), and the info files are about 10KB each. From
what I gather, each of these files is using the minimum allocation HDFS
requires to store a file (the default block size, which I'm using, is 64MB).
That seems to me like a lot of wasted capacity.
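
(For reference, here is the kind of check I mean - a rough sketch using the
Hadoop FileSystem API that walks a directory tree and prints each file's
actual length against the block size. The /hbase path is just my
hbase.rootdir, so the path and layout are assumptions and may need adjusting.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Rough sketch: recursively list files under the HBase root dir and print
  // each file's actual length so it can be compared with the 64MB block size.
  // The "/hbase" path is an assumption (my hbase.rootdir); adjust as needed.
  public class HBaseFileSizes {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      printSizes(fs, new Path("/hbase"), 64L * 1024 * 1024);
    }

    static void printSizes(FileSystem fs, Path dir, long blockSize) throws Exception {
      for (FileStatus st : fs.listStatus(dir)) {
        if (st.isDir()) {
          printSizes(fs, st.getPath(), blockSize);
        } else {
          System.out.println(st.getPath() + "  " + st.getLen() + " bytes"
              + (st.getLen() < blockSize ? "  (< one block)" : ""));
        }
      }
    }
  }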

Are there any ways to optimise this? For example, are there any existing
configuration settings I can use to better control region sizes and thus
reduce the overall number of files produced (e.g. will
hbase.hregion.max.filesize do the job)? I am using the default replication
on HDFS (I really wouldn't want to go below 3). Further, is there any way to
optimise the storage of these files? I'm not exactly sure what the info
files are for, but is there any way we could use only one per region instead
of one per mapfile (or is it possible to embed the information they contain
elsewhere)? Also, I noticed that in the Hadoop API for FileSystem you can
call the create method and pass a parameter for the block size. Is it
feasible to use a different block size (e.g. 1MB) for the index files that
are part of the mapfiles (or would this cause undue hardship on HDFS)?
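
(To illustrate the create() call I mean - this is just the plain Hadoop
FileSystem API, not something HBase itself exposes as far as I know - here is
a minimal sketch. The 1MB block size and the path are purely illustrative.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Minimal sketch of the FileSystem.create() overload that takes an explicit
  // block size. The 1MB figure and the path are illustrative only; HBase itself
  // would have to be changed to create its index/info files this way.
  public class SmallBlockWrite {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/tmp/small-block-example");
      int bufferSize = conf.getInt("io.file.buffer.size", 4096);
      short replication = 3;              // keeping the default replication of 3
      long blockSize = 1L * 1024 * 1024;  // 1MB instead of the 64MB default
      FSDataOutputStream out = fs.create(path, true, bufferSize, replication, blockSize);
      out.write("example".getBytes());
      out.close();
    }
  }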

Cheers,
Daniel


On Fri, Sep 26, 2008 at 1:42 AM, Jonathan Gray <jlist@streamy.com> wrote:

> Have you read the Architecture wiki?
> http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
>
> There's also a link to the Google BigTable paper there if you have not read
> it.
>
>
> Did you begin your querying _after_ all the writing had completed?  Or were
> there still writes going on?
>
> During writing, your tables may split and that can cause significantly
> longer than normal query times.  Freshly written data, however, is usually
> fastest to fetch because it's still in Memcache.
>
> To answer your questions:
>
> 1. This really depends on your access patterns and the size of your dataset.
> If you have somewhat large tables, so the regions are spread across the
> entire cluster, and your access pattern is random, as you say it is, then you
> should be able to ramp up the number of querying processes.  If you watch
> load you'll notice that the master plays a small role; the work is being
> done in the regionservers.  I would expect that if you changed from one
> client to N clients, where N is the number of regionservers, you'd see no
> increase at all in query times.
>
> 2. See top
>
> 3. No connection pool jira that I'm aware of.  For us, this tends to be a
> very specialized and application-dependent thing, so we have not yet had
> code to contribute.  The real solution here is a truly threaded client, not
> a process pooler, IMO.  It's being talked about and needs more people
> involved :)
>
> 4. There is some caching, but nothing in HBase that I'm aware of that would
> warm the cache as you read from it.  The Memcache contains newly written
> rows, but once they have been flushed the data stays on disk.  In Hadoop, I
> think there's something about accessing blocks and a cache.  I'm pretty fuzzy
> in general about Hadoop, but I have experienced similar behavior, though I'm
> very unclear about why you'd have 5-20 sec queries if you're not doing any
> writing at the same time.
>
> JG
>
> -----Original Message-----
> From: Daniel Ploeg [mailto:dploeg@gmail.com]
> Sent: Wednesday, September 24, 2008 10:52 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Question about recommended heap sizes
>
> The changes to the heap size and thread count have so far been successful.
> I managed to get 10K rows into HBase (loading the 100K now, it may take a
> little longer).
>
> The query results for the 10K rows came back at an average of 360 ms.
> However, at the start of the run (probably about the first quarter of the
> queries) I was getting quite a lot of slower queries, 5 sec+. The longest
> was about 20 sec. I'm just querying based on the row id, with the row ids in
> an unsorted random order, and I am only using one thread to do the querying.
>
> I do have some questions though:
> 1. If I were to ramp up the number of processes querying (e.g. 10-20
> concurrent readers), would I see any / much of an increase in the query
> times?
> 2. If I were to do writes whilst reading, would I see any / much of an
> increase in the query times?
> 3. Is there an open jira item for a connection pool or some equivalent for
> the HBase client instead of the serialized RPC? (I couldn't find one, but it
> may be in there already)
> 4. Would there be any reason why the queries at the end of my test ran a lot
> faster than the earlier ones (e.g. is there any caching involved)?
>
> Thanks,
> Daniel
>
> On Thu, Sep 25, 2008 at 9:56 AM, Jonathan Gray <jlist@streamy.com> wrote:
>
> > One thing to be aware of...
> >
> > Currently the HBase client serializes RPC calls for a process, so you are
> > not getting true insert parallelism if all inserts are coming from a single
> > Java process, despite the threading.
> >
> > Since you are also experiencing this, there must be something going on
> > here. In 0.1.3 I had been importing far more and never had to increase the
> > heap, up to hundreds of regions per server.
> >
> > We will be investigating this issue further... I have filed an issue here:
> > https://issues.apache.org/jira/browse/HBASE-900
> >
> > Stay tuned there for progress.
> >
> > Let us know how your further testing goes.
> >
> > Thanks.
> >
> > Jonathan
> >
> > -----Original Message-----
> > From: Daniel Ploeg [mailto:dploeg@gmail.com]
> > Sent: Wednesday, September 24, 2008 4:35 PM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Question about recommended heap sizes
> >
> > Hi,
> >
> > Thanks for your quick responses!
> >
> > I'm using HBase 0.18.0.
> >
> > I restarted the HBase cluster and it's telling me on the master's web page
> > that I have a total of 39 regions.
> >
> > I was using 100 threads to push data into HBase, so I might try reducing
> > that to, say, 20 on the next run. I'll also try with the heap at 2GB. If
> > that fails again, I'll reduce the batch size to 1K and try again.
> >
> > I should note that I've tuned the Hadoop configuration with the following,
> > based on the troubleshooting guide and the related jiras:
> > dfs.datanode.max.xcievers=2048
> > dfs.datanode.socket.write.timeout=0
> >
> > Cheers,
> > Daniel
> >
> >
> > On Thu, Sep 25, 2008 at 9:19 AM, Jonathan Gray <jlist@streamy.com> wrote:
> >
> > > Daniel,
> > >
> > > I have seen similar issues during large scale imports.  For now, we have
> > > gotten around the issue by increasing the regionserver heap size to 2GB.
> > > My slave machines also have 4GB of memory.
> > >
> > > How many total regions did you have when you received the OOME?
> > >
> > >
> > > Jonathan Gray
> > >
> > > -----Original Message-----
> > > From: Daniel Ploeg [mailto:dploeg@gmail.com]
> > > Sent: Wednesday, September 24, 2008 3:55 PM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: Question about recommended heap sizes
> > >
> > > Hi all,
> > >
> > > I was running a test on our local HBase cluster (1 master node, 4 region
> > > servers) and I ran into some OutOfMemory exceptions. Basically, one of the
> > > region servers went down first, then the master node followed (ouch!) as I
> > > was inserting the data for the test.
> > >
> > > I was still using the default heap size and I would like to get some
> > > recommendations as to what I should raise it to. My regionservers each
> > > have 4GB and the master node has 8GB. It may be useful if I describe the
> > > tests that I was trying to do, so here goes:
> > >
> > > The tests were to ramp up the number of rows to determine the query
> > > latency of my particular usage pattern. Each level of testing has a
> > > different number of rows (1K, 10K and 100K). My exception occurred on the
> > > 10K-row data population (about 3300 rows in).
> > >
> > > My data is a table with a single column family with 10K column instances
> > > per row. Each column contains approx. 500-1000 bytes of data.
> > >
> > > I should note that the first level of testing, with 1K rows, was returning
> > > average query responses of approx. 240ms.
> > >
> > > Could someone please advise how large you think I should set my heap
> > > space (and whether you think I should make any mods to the Hadoop heap as
> > > well)?
> > >
> > > Thanks,
> > > Daniel
> > >
> > >
> >
> >
>
>
