Subject: Re: Millions of photos into Hbase
From: Jack Levin <magnito@gmail.com>
To: user@hbase.apache.org
Date: Mon, 20 Sep 2010 21:44:08 -0700

20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with 3 GB
heap likely; this should be plenty to rip through, say, 350TB of data.

-Jack

On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson wrote:
> yes that is the new ZK based coordination. when i publish the SU code
> we have a patch which limits that and is faster. 2GB is a little
> small for regionserver memory... in my ideal world we'll be putting
> 20GB+ of ram into the regionserver.
>
> I just figured you were using the DEB/RPMs because your files were in
> /usr/local... I usually run everything out of /home/hadoop b/c it
> allows me to easily rsync as user hadoop.
>
> but you are on the right track yes :-)
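For reference, the heap sizes being traded back and forth here live in
conf/hbase-env.sh. A minimal sketch, assuming a stock tarball layout; the
3 GB figure is Jack's plan above, and the per-daemon override is only an
illustration:

    # conf/hbase-env.sh -- HBASE_HEAPSIZE is the maximum JVM heap, in MB
    # (default 1000). 3 GB per regionserver node, per the plan above.
    export HBASE_HEAPSIZE=3000

    # If the master needs a different size (Jack later bumps his to 4 GB),
    # a per-daemon override is one way to do it:
    # export HBASE_MASTER_OPTS="-Xmx4g"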
> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin wrote:
>> Who said anything about deb :). I do use tarballs.... Yes, so what did
>> it was copying that jar to under hbase/lib, and then a full restart.
>> Now here is a funny thing, the master shuddered for about 10 minutes,
>> spewing those messages:
>>
>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>> Event NodeCreated with state SyncConnected with path
>> /hbase/UNASSIGNED/97999366
>> 2010-09-20 21:23:45,827 DEBUG
>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>> NodeCreated with path /hbase/UNASSIGNED/97999366
>> 2010-09-20 21:23:45,827 DEBUG
>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>> Got zkEvent NodeCreated state:SyncConnected
>> path:/hbase/UNASSIGNED/97999366
>> 2010-09-20 21:23:45,827 DEBUG
>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>> M2ZK_REGION_OFFLINE
>> 2010-09-20 21:23:45,828 INFO
>> org.apache.hadoop.hbase.master.RegionServerOperation:
>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>> 10.103.2.3,60020,1285042333293
>> 2010-09-20 21:23:45,828 DEBUG
>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>> M2ZK_REGION_OFFLINE ] for region 97999366
>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>> Event NodeChildrenChanged with state SyncConnected with path
>> /hbase/UNASSIGNED
>> 2010-09-20 21:23:45,828 DEBUG
>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>> NodeChildrenChanged with path /hbase/UNASSIGNED
>> 2010-09-20 21:23:45,828 DEBUG
>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>> Got zkEvent NodeChildrenChanged state:SyncConnected
>> path:/hbase/UNASSIGNED
>> 2010-09-20 21:23:45,830 DEBUG
>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>> img150,,1284859678248.3116007 is not valid;
>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>
>> Does anyone know what they mean? At first it would kill one of my
>> datanodes. But what helped is when I changed the heap size to 4GB for
>> the master and 2GB for the datanode that was dying, and after 10 minutes
>> I got into a clean state.
>>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson wrote:
>>> yes, on every single machine as well, and restart.
>>>
>>> again, not sure how you'd do this in a scalable manner with your
>>> deb packages... on the source tarball you can just replace it, rsync
>>> it out and done.
>>>
>>> :-)
>>>
>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin wrote:
>>>> ok, I found that file, do I replace hadoop-core.*.jar under
>>>> /usr/lib/hbase/lib? Then restart, etc? All regionservers too?
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson wrote:
>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>> policies and I have to highly recommend not using DEBs to install
>>>>> software...
>>>>>
>>>>> So normally installing from tarball, the jar is in
>>>>> /hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>
>>>>> On the CDH/DEB edition, it's somewhere silly ... locate and find will
>>>>> be your friend. It should be called hadoop-core-0.20.2+320.jar though!
>>>>>
>>>>> I'm working on a github publish of SU's production system, which uses
>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>> comes pre-packaged.
>>>>>
>>>>> Stay tuned :-)
>>>>>
>>>>> -ryan
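A sketch of the jar swap and rsync-out Ryan describes, assuming a tarball
install under /home/hadoop and a plain-text host list in ~/nodes (both
illustrative, as are the exact jar names):

    # Replace the hadoop-core jar HBase ships with by the one HDFS actually
    # runs (the CDH3b2 jar named above), then push it to every node.
    rm /home/hadoop/hbase/lib/hadoop-core-*.jar
    cp /home/hadoop/hadoop/hadoop-core-0.20.2+320.jar /home/hadoop/hbase/lib/

    # rsync the tree out as user hadoop, then do a full cluster restart.
    for h in $(cat ~/nodes); do
        rsync -a /home/hadoop/hbase/ "$h":/home/hadoop/hbase/
    done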
>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin wrote:
>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just want
>>>>>> to be sure, and where do I put it?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson wrote:
>>>>>>> you need 2 more things:
>>>>>>>
>>>>>>> - restart hdfs
>>>>>>> - make sure the hadoop jar from your install replaces the one we
>>>>>>>   ship with
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin wrote:
>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch); even though I added
>>>>>>>> dfs.support.append as true to both hdfs-site.xml and
>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>
>>>>>>>> You are currently running the HMaster without HDFS append support
>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki
>>>>>>>> for details.
>>>>>>>>
>>>>>>>> Master Attributes
>>>>>>>> Attribute Name        Value                                        Description
>>>>>>>> HBase Version         0.89.20100726, r979826                       HBase version and svn revision
>>>>>>>> HBase Compiled        Sat Jul 31 02:01:58 PDT 2010, stack          When HBase version was compiled and by whom
>>>>>>>> Hadoop Version        0.20.2, r911707                              Hadoop version and svn revision
>>>>>>>> Hadoop Compiled       Fri Feb 19 08:07:34 UTC 2010, chrisdo        When Hadoop version was compiled and by whom
>>>>>>>> HBase Root Directory  hdfs://namenode-rd.imageshack.us:9000/hbase  Location of HBase home directory
>>>>>>>>
>>>>>>>> Any ideas what's wrong?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson wrote:
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>> release, which is based on 'trunk'. We have snapshotted a series of
>>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>>> start thinking about the next major version. One of these is what SU
>>>>>>>>> is running prod on.
>>>>>>>>>
>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>> losing data it is worthwhile. SU is based on the most recent DR with
>>>>>>>>> a few minor patches of our own concoction brought in. Current works,
>>>>>>>>> but some Master ops are slow, and there are a few patches on top of
>>>>>>>>> that. I'll poke about and see if it's possible to publish to a
>>>>>>>>> github branch or something.
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin wrote:
>>>>>>>>>> Sounds good; only reason I ask is because of this:
>>>>>>>>>>
>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>
>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>>
>>>>>>>>>> Are we talking about data loss in case of a datanode going down while
>>>>>>>>>> being written to, or a RegionServer going down?
>>>>>>>>>>
>>>>>>>>>> -jack
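For reference, the dfs.support.append flag Jack mentions above is set like
this; a sketch only, with the caveats from the thread: it goes in both
hdfs-site.xml and hbase-site.xml, it needs an append-capable build (CDH3 or
the append branch), and HDFS must be restarted with the matching jar under
hbase/lib:

    <!-- hdfs-site.xml and hbase-site.xml: enable the append/sync support
         HBase needs for durable write-ahead logs. It has no effect while
         HBase is still linked against a stock Apache 0.20.2 hadoop-core
         jar, which is why the master warning above persisted. -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>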
>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson wrote:
>>>>>>>>>>> We run 0.89 in production @ Stumbleupon. We also employ 3 committers...
>>>>>>>>>>>
>>>>>>>>>>> As for safety, you have no choice but to run 0.89. If you run a 0.20
>>>>>>>>>>> release you will lose data. you must be on 0.89 and
>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>> argument around it. If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin wrote:
>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack wrote:
>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which
>>>>>>>>>>>>>> are usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55). I've
>>>>>>>>>>>>>> been researching how to improve our backend design in terms of data
>>>>>>>>>>>>>> safety and stumbled onto the HBase project.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>
>>>>>>>>>>>> Latency is the second requirement. We have some services that are
>>>>>>>>>>>> very short tail, and can produce a 95% cache hit rate, so I assume this
>>>>>>>>>>>> would really put the cache to good use. Some other services, however,
>>>>>>>>>>>> have about a 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>> 'adequate', e.g. if it's slightly worse than getting data off raw disk,
>>>>>>>>>>>> then it's good enough. Safety is supremely important, then
>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>
>>>>>>>>>>>>>> Now, I think HBase is the most beautiful thing that has happened to
>>>>>>>>>>>>>> the distributed DB world :). The idea is to store image files (about
>>>>>>>>>>>>>> 400KB on average) into HBase.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd guess some images are much bigger than this. Do you ever limit
>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The setup will include the following configuration:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Whats your frontend? Why REST? It might be more efficient if you
>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>
>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple
>>>>>>>>>>>> REST APIs.
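One of those balanced curl PUTs might look roughly like the following; the
host, port, table, and column names are invented for illustration, and this
is a sketch of Stargate's binary cell interface (Content-Type:
application/octet-stream) rather than its base64 XML/JSON forms:

    # PUT one image as a single cell via the Stargate REST API (sketch).
    # The row key is the file name, matching the region names in the logs
    # earlier in the thread.
    curl -X PUT \
         -H "Content-Type: application/octet-stream" \
         --data-binary @normal052q.jpg \
         http://stargate.example.com:8080/img15/normal052q.jpg/image:data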
>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-Type modification
>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa; it then hits Haproxy
>>>>>>>>>>>> again, which hits the balanced REST.
>>>>>>>>>>>> Why REST? It was the simplest thing to run, given that it supports
>>>>>>>>>>>> HTTP; potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>> can still use HTTP to send and receive data (has anyone written
>>>>>>>>>>>> anything like that, say in python, C or java?)
>>>>>>>>>>>>
>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on
>>>>>>>>>>>>>> dedicated boxes).
>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage
>>>>>>>>>>>>>> and edits snapshots also)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions
>>>>>>>>>>>>>> / second (file size ranging from a few KB to 2-3MB, avg. 400KB) via
>>>>>>>>>>>>>> the Stargate API. Our frontend servers receive files, and I just
>>>>>>>>>>>>>> fork-insert them into Stargate via http (curl).
>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>> regionservers; so far I have inserted about 2 TB worth of images.
>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and the table block
>>>>>>>>>>>>>> size to about 400KB, trying to match the average access block to
>>>>>>>>>>>>>> limit HDFS trips.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least. You'll
>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>
>>>>>>>>>>>> Yep, I will adjust to 1G. I thought flush was controlled by a
>>>>>>>>>>>> function of memstore HEAP, something like 40%? Or are you talking
>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>
>>>>>>>>>>>>>> So far the read performance was more than adequate, and of
>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBase. But we do plan
>>>>>>>>>>>>>> to insert about 170 million images (about 100 days worth), which is
>>>>>>>>>>>>>> only about 64 TB, or 10% of the planned cluster size of 600TB.
>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety,
>>>>>>>>>>>>>> e.g. the system may go down but data cannot be lost. Our front-end
>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>> are serving about 16 Gbits at peak), however should we need to bring
>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>> HBase (should be no more than a few hundred Mbps) while the front end
>>>>>>>>>>>>>> server is repaired (for example having its disk replaced); after the
>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>> the remaining missing ones off HBase.
>>>>>>>>>>>>>> All in all it should be a very interesting project, and I am hoping
>>>>>>>>>>>>>> not to run into any snags; however, should that happen, I am pleased
>>>>>>>>>>>>>> to know that such a great and vibrant tech group exists that supports
>>>>>>>>>>>>>> and uses HBase :).
>>>>>>>>>>>>>>
>>>>>>>>>>>>> We're definitely interested in how your project progresses. If you
>>>>>>>>>>>>> are ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>
>>>>>>>>>>>> Cool. I'd like that.
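The sizes Stack suggests would land in hbase-site.xml roughly as below (a
sketch; values are in bytes). The 40% figure Jack asks about is a separate
knob, hbase.regionserver.global.memstore.upperLimit, which caps the sum of
all memstores as a fraction of heap; the per-region flush size Stack means
is the second property here:

    <!-- Let regions grow to 1 GB before splitting, per the advice above. -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>
    </property>

    <!-- Flush each region's memstore at 128 MB instead of the 64 MB default. -->
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>
    </property>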
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>> P.P.S. I updated the wiki on stargate REST:
>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>
>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables
>>>>>>>>>>>> and data? e.g. cross compatible?
>>>>>>>>>>>> Is 0.89 ready for production environment?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
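On the "blooms" in Stack's P.S.: in the 0.89 developer releases a bloom
filter is a per-column-family attribute, set from the HBase shell; a sketch,
with the table and family names invented for illustration (ROW blooms mainly
speed up point gets like these single-image fetches):

    hbase> disable 'img'
    hbase> alter 'img', {NAME => 'image', BLOOMFILTER => 'ROW'}
    hbase> enable 'img'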