hbase-user mailing list archives

From Jack Levin <magn...@gmail.com>
Subject Re: Millions of photos into Hbase
Date Tue, 21 Sep 2010 05:51:22 GMT
Awesome, thanks!... I will give it a whirl on our test cluster.

-Jack

On Mon, Sep 20, 2010 at 10:15 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> So we are running this code in production:
>
> http://github.com/stumbleupon/hbase
>
> The branch off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and
> everything past that is our rebase and cherry-picked changes.
>
> We use git to manage this internally, and don't use svn.  Included are
> the LZO libraries we use, checked directly into the code, and the
> assembly changes to publish those.
>
> So when we are ready to do a deploy, we do this:
> mvn install assembly:assembly
> (or include the -DskipTests to make it go faster)
>
> and then we have a new tarball to deploy.
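For reference, a minimal sketch of that build-and-push flow; the target/ output path, the hostname, and the destination directory below are illustrative assumptions, not taken from the message:

    # build the deployable tarball, skipping tests for speed
    mvn install assembly:assembly -DskipTests

    # the assembled hbase-*-bin.tar.gz lands under target/; push it out,
    # e.g. as the hadoop user -- hostname and path are placeholders
    rsync -av target/hbase-*-bin.tar.gz hadoop@node01:/home/hadoop/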
>
> Note there is absolutely NO warranty here, not even that it will run
> for a microsecond... furthermore this is NOT an ASF release, just a
> courtesy.  If there were ever to be a release it would look
> different, because ASF releases can't include GPL code (this does)
> or depend on commercial releases of Hadoop.
>
> Enjoy,
> -ryan
>
> On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> no no, 20 GB heap per node.  Each node with 24-32 GB of RAM, etc.
>>
>> we can't rely on the linux buffer cache to save us, so we have to cache
>> in hbase ram.
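For context, a regionserver heap that large is normally set via HBASE_HEAPSIZE in conf/hbase-env.sh; the 20000 MB value below simply mirrors the 20 GB figure above and is an assumed example, not a quoted config:

    # conf/hbase-env.sh -- heap for the HBase daemons, in MB
    export HBASE_HEAPSIZE=20000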
>>
>> :-)
>>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <magnito@gmail.com> wrote:
>>> 20GB+?  Hmmm... I do plan to run 50 regionserver nodes though, with
>>> a 3 GB heap likely; this should be plenty to rip through, say, 350TB of
>>> data.
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>> yes that is the new ZK-based coordination.  when I publish the SU code
>>>> we have a patch which limits that and is faster.  2GB is a little
>>>> small for regionserver memory... in my ideal world we'd be putting
>>>> 20GB+ of RAM into a regionserver.
>>>>
>>>> I just figured you were using the DEB/RPMs because your files were in
>>>> /usr/local... I usually run everything out of /home/hadoop b/c it
>>>> allows me to easily rsync as user hadoop.
>>>>
>>>> but you are on the right track yes :-)
>>>>
>>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>> Who said anything about deb :).  I do use tarballs... Yes, so what did
>>>>> it was copying that jar under hbase/lib, and then a full restart.
>>>>> Now here is a funny thing: the master shuddered for about 10 minutes,
>>>>> spewing these messages:
>>>>>
>>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster: Event NodeCreated with state SyncConnected with path /hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event NodeCreated with path /hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeCreated state:SyncConnected path:/hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created/updated UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state M2ZK_REGION_OFFLINE
>>>>> 2010-09-20 21:23:45,828 INFO org.apache.hadoop.hbase.master.RegionServerOperation: img13,p1000319tq.jpg,1284952655960.812544765 open on 10.103.2.3,60020,1285042333293
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [ M2ZK_REGION_OFFLINE ] for region 97999366
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster: Event NodeChildrenChanged with state SyncConnected with path /hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event NodeChildrenChanged with path /hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeChildrenChanged state:SyncConnected path:/hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,830 DEBUG org.apache.hadoop.hbase.master.BaseScanner: Current assignment of img150,,1284859678248.3116007 is not valid; serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>>
>>>>>
>>>>> Does anyone know what they mean?  At first it would kill one of my
>>>>> datanodes.  But what helped was changing the heap size to 4GB for the
>>>>> master and 2GB for the datanode that was dying, and after 10 minutes I
>>>>> got into a clean state.
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>> yes, on every single machine as well, and restart.
>>>>>>
>>>>>> again, not sure how you'd do this in a scalable manner with your
>>>>>> deb packages... on the source tarball you can just replace it, rsync
>>>>>> it out and done.
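A rough sketch of the jar swap and rsync being described, assuming a tarball install; every path, jar name, and hostname here is illustrative:

    # copy the hadoop jar from the running HDFS install into hbase's lib dir,
    # replacing the hadoop-core-*.jar that hbase shipped with
    cp /path/to/hadoop/hadoop-core-0.20.2+320.jar /path/to/hbase/lib/
    rm /path/to/hbase/lib/hadoop-core-<shipped-version>.jar

    # push the change to every regionserver, then restart HDFS and HBase
    for h in rs01 rs02 rs03; do
        rsync -av /path/to/hbase/lib/ hadoop@$h:/path/to/hbase/lib/
    done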
>>>>>>
>>>>>> :-)
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>>>> Then restart, etc?  All regionservers too?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>>>> software...
>>>>>>>>
>>>>>>>> So normally installing from tarball, the jar is in
>>>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>>
>>>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>>>
>>>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>>>> the cloudera maven repo to install the correct JAR in hbase, so when
>>>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>>>> comes pre-packaged.
>>>>>>>>
>>>>>>>> Stay tuned :-)
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>>>> Ryan, the hadoop jar, what is the usual path to the file?  I just want to be
>>>>>>>>> sure, and where do I put it?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>>>> you need 2 more things:
>>>>>>>>>>
>>>>>>>>>> - restart hdfs
>>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
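For reference, the append flag discussed in this exchange is normally spelled out as a full property block, placed in both hdfs-site.xml and hbase-site.xml; a minimal sketch based on the fragment quoted further down:

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>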
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>>> <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>>>
>>>>>>>>>>> You are currently running the HMaster without HDFS append support
>>>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki for
>>>>>>>>>>> details.
>>>>>>>>>>> Master Attributes
>>>>>>>>>>> Attribute Name       | Value                                       | Description
>>>>>>>>>>> HBase Version        | 0.89.20100726, r979826                      | HBase version and svn revision
>>>>>>>>>>> HBase Compiled       | Sat Jul 31 02:01:58 PDT 2010, stack         | When HBase version was compiled and by whom
>>>>>>>>>>> Hadoop Version       | 0.20.2, r911707                             | Hadoop version and svn revision
>>>>>>>>>>> Hadoop Compiled      | Fri Feb 19 08:07:34 UTC 2010, chrisdo       | When Hadoop version was compiled and by whom
>>>>>>>>>>> HBase Root Directory | hdfs://namenode-rd.imageshack.us:9000/hbase | Location of HBase home directory
>>>>>>>>>>>
>>>>>>>>>>> Any ideas what's wrong?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>>>>>> Hey,
>>>>>>>>>>>>
>>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>>>> is running prod on.
>>>>>>>>>>>>
>>>>>>>>>>>> At this point, tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>>>> a few minor patches of our own concoction brought in.  The current DR
>>>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>>>> top of that.  I'll poke about and see if it's possible to publish to a
>>>>>>>>>>>> github branch or something.
>>>>>>>>>>>>
>>>>>>>>>>>> -ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>>>>>>>> Sounds good, the only reason I ask is because of this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are we talking about data loss in the case of a datanode going down while
>>>>>>>>>>>>> being written to, or a RegionServer going down?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>>>> release you will lose data.  You must be on 0.89 and
>>>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <stack@duboce.net> wrote:
>>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <magnito@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).  I've
>>>>>>>>>>>>>>>>> been researching how to improve our backend design in terms of data
>>>>>>>>>>>>>>>>> safety and stumbled onto the Hbase project.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>>>> very short tail, and can produce a 95% cache hit rate, so I assume this
>>>>>>>>>>>>>>> would really put cache into good use.  Some other services, however,
>>>>>>>>>>>>>>> have about a 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>>>> 'adequate', e.g. if it's slightly worse than getting data off raw disk,
>>>>>>>>>>>>>>> then it's good enough.  Safety is supremely important, then
>>>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now, I think hbase is the most beautiful thing that happened to the
>>>>>>>>>>>>>>>>> distributed DB world :).  The idea is to store image files (about
>>>>>>>>>>>>>>>>> 400KB on average) into HBASE.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The setup will include the following configuration:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What's your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-type modification
>>>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa;
>>>>>>>>>>>>>>> it then hits Haproxy again, which hits the balanced REST instances.
>>>>>>>>>>>>>>> Why REST? It was the simplest thing to run, given that it supports
>>>>>>>>>>>>>>> HTTP; potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>>>> can still use http to send and receive data (has anyone written anything
>>>>>>>>>>>>>>> like that, say in python, C or java?)
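As a point of reference, a single-cell insert through the Stargate REST interface can be done with a raw curl PUT along these lines; the host, port, table, row, and column names below are made up for illustration:

    curl -X PUT \
         -H "Content-Type: application/octet-stream" \
         --data-binary @photo.jpg \
         http://resthost:8080/img15/photo123.jpg/image:data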
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>>>> second (file size ranging from a few KB to 2-3MB, ave. 400KB) via the
>>>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>>>> regionservers; so far I've inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and the table block size
>>>>>>>>>>>>>>>>> to about 400KB, trying to match the average access block to limit HDFS
>>>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yep, I will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>>>
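For reference, the knobs discussed above map roughly to these hbase-site.xml properties; the values shown (1 GB regions, 128 MB flushes, ~40% of heap for memstores) echo the suggestions in this thread and are a sketch, not a vetted configuration:

    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>   <!-- 1 GB max region size -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>    <!-- 128 MB per-region flush threshold -->
    </property>
    <property>
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>          <!-- the ~40% of heap ceiling mentioned above -->
    </property>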
>>>>>>>>>>>>>>>>> So far the read performance was more than adequate, and of course
>>>>>>>>>>>>>>>>> write performance is nowhere near capacity.
>>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do plan
>>>>>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which is
>>>>>>>>>>>>>>>>> only about 64 TB, or 10% of the planned cluster size of 600TB.
>>>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>>>>> e.g. the system may go down but data cannot be lost.  Our Front-End
>>>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>>>> are serving about 16 Gbits at peak); however, should we need to bring
>>>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>>>> Hbase (should be no more than a few hundred Mbps) while the front end
>>>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced); after the
>>>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>>>> the remaining missing ones off Hbase.
>>>>>>>>>>>>>>>>> All in all it should be a very interesting project, and I am hoping not to
>>>>>>>>>>>>>>>>> run into any snags; however, should that happen, I am pleased to know
>>>>>>>>>>>>>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>>>>>>>>>>>>>> HBASE :).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you are
>>>>>>>>>>>>>>>> ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>>> P.P.S. I updated the wiki on stargate REST:
>>>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables
>>>>>>>>>>>>>>> and data?  e.g. is it cross compatible?
>>>>>>>>>>>>>>> Is 0.89 ready for a production environment?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
