From: Jerome Boulon <jboulon@netflix.com>
To: chukwa-user@hadoop.apache.org
Date: Fri, 19 Mar 2010 12:49:53 -0700
Subject: Re: How to set up HDFS -> MySQL from trunk?

Do you have a Jira for that, so we can continue the discussion there?

The reason I'm asking is that if you need to move out of MySQL, I assume it's
because you need to scale. And if you need to scale, then you need
partitioning, and Voldemort and HBase (like all the NoSQL implementations) are
already working on this. Voldemort index/data files can be built using Hadoop,
and HBase is already using Tfile.

Thanks,
/Jerome.

On 3/19/10 12:33 PM, "Eric Yang" wrote:

> Hi Jerome,
>
> I am not planning to have SQL on top of HDFS. The Chukwa MetricDataLoader
> subsystem is an index builder. The replacement for the index builder is
> either Tfile or a streaming job to build the index, plus distributed
> processes that cache the index by keeping the Tfile open or loading the
> index into memory. The aggregation could be replaced with a second-stage
> mapreduce job, or with a workflow subsystem like Oozie. It could also be
> replaced with Hive, if the community likes this approach.
>
> Regards,
> Eric
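As a rough sketch of the Hive option mentioned above (the table and column
names here are illustrative only, not an actual Chukwa schema), the
aggregation stage could be expressed as a HiveQL down-sampling query over the
demux output, assuming it had been exposed to Hive as a system_metrics table:

    -- hypothetical second-stage aggregation: hourly averages per host
    INSERT OVERWRITE TABLE system_metrics_hourly
    SELECT host,
           floor(ts / 3600000) * 3600000 AS hour_ts,   -- bucket epoch-millisecond timestamps by hour
           avg(cpu_user_pcnt)            AS cpu_user_pcnt
    FROM system_metrics
    GROUP BY host, floor(ts / 3600000) * 3600000;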
> On 3/19/10 11:30 AM, "Jerome Boulon" wrote:
>
>> Hi Eric,
>> Correct me if I'm wrong, but for "the SQL portion of Chukwa is deprecated,
>> and the HDFS-based replacement is six months out" to hold, you need a
>> SQL-like engine; otherwise it's not a replacement.
>> So does that mean you're planning to get a SQL-like engine working on top
>> of HDFS in less than 6 months?
>> If yes, do you already have some working code?
>> What performance are you targeting, since even if MySQL is not scalable,
>> you can still do a bunch of things with it...
>>
>> Thanks,
>> /Jerome.
>>
>> On 3/18/10 8:59 PM, "Kirk True" wrote:
>>
>>> Hi Eric,
>>>
>>> Awesome - everything's working great now.
>>>
>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the
>>> HDFS-based replacement is six months out. What should I do to get the
>>> data from the adapters -> collectors -> HDFS -> HICC? Is the HDFS-based
>>> HICC replacement spec'ed out enough for others to contribute?
>>>
>>> Thanks,
>>> Kirk
>>>
>>> Eric Yang wrote:
>>>>
>>>> Hi Kirk,
>>>>
>>>> 1. The host selector currently shows hostnames collected from the
>>>> SystemMetrics table, hence you need to have top, iostat, df, and sar
>>>> collected to populate the SystemMetrics table correctly. The hostname
>>>> is also cached in the user session, so you will need to switch to a
>>>> different cluster and switch back, or restart HICC, to flush the cached
>>>> hostnames from the user session. The hostname selector should probably
>>>> pick up hostnames from a different data source in a future release.
>>>>
>>>> 2. The server should run in UTC. Timezone support was never implemented
>>>> completely, hence a server in another timezone will not work correctly.
>>>>
>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of
>>>> dbAdmin.sh; this subsystem down-samples data from the weekly table into
>>>> monthly, yearly, and decade tables. I wrote this submodule over a
>>>> weekend for a prototype show and tell. I strongly recommend avoiding
>>>> the SQL part of Chukwa altogether.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 3/18/10 1:15 PM, "Kirk True" wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df" is
>>>>> being collected, parsed, stuck into HDFS, and then pulled out again
>>>>> and placed into MySQL. However, HICC isn't showing me my data just
>>>>> yet...
>>>>>
>>>>> The disk_2098_week table is filled out with several entries and looks
>>>>> great. If I select my cluster from the "Cluster Selector" and "Last 12
>>>>> Hours" from the "Time" widget, the "Disk Statistics" widget still says
>>>>> "No Data available."
>>>>>
>>>>> It appears to be because part of the SQL query includes the host name,
>>>>> which is coming across in the SQL parameters as "". However, since the
>>>>> disk_2098_week table properly includes the host name, nothing is
>>>>> returned by the query. Just for grins, I updated the table manually in
>>>>> MySQL to blank out the host names, and I get a super cool, pretty
>>>>> graph (which looks great, BTW).
>>>>>
>>>>> Additionally, if I select other time periods such as "Last 1 Hour", I
>>>>> see the query is using UTC or something (at 1:00 PDT, I see the query
>>>>> is using a range of 19:00-20:00). However, the data in MySQL is based
>>>>> on PDT, so no matches are found. It appears that the "time_zone"
>>>>> session attribute contains the value "UTC". Where is this coming from
>>>>> and how can I change it?
>>>>>
>>>>> Problems:
>>>>>
>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host name
>>>>> so that the generated SQL queries are correct?
>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>
>>>>> Thanks,
>>>>> Kirk
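To make the failure mode above concrete, the widget's SQL template presumably
expands to something along these lines (the column names are guesses, not the
real schema); with the Hosts Selector contributing an empty host name and the
[start]/[end] macros expanding to a UTC window, it cannot match rows keyed by
real hostnames and PDT timestamps:

    SELECT timestamp, mount, used_pcnt
    FROM disk_2098_week
    WHERE timestamp BETWEEN '2010-03-19 19:00:00'   -- UTC window sent by HICC
                        AND '2010-03-19 20:00:00'
      AND host IN ('');                             -- empty hostname from the session, so zero rows match

Blanking out the host column in the table makes host = '' match again, which
is why the graph renders after that manual edit.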
>>>>> Eric Yang wrote:
>>>>>>
>>>>>> The df command is converted into the disk_xxxx_week table in MySQL,
>>>>>> if I remember correctly. In MySQL, are the database tables getting
>>>>>> created? Make sure that you have:
>>>>>>
>>>>>>   <property>
>>>>>>     <name>chukwa.post.demux.data.loader</name>
>>>>>>     <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>   </property>
>>>>>>
>>>>>> in Chukwa-demux.conf.
>>>>>>
>>>>>> The rough picture of the data flow looks like this:
>>>>>>
>>>>>> 1. demux -> Generates chukwa record outputs.
>>>>>> 2. archive -> Generates bigger files by compacting data sink files
>>>>>>    (concurrent with step 1).
>>>>>> 3. postProcess -> Looks up which files were generated by the demux
>>>>>>    process and dispatches them to the different data loaders.
>>>>>> 4. MetricDataLoaderPool -> Dispatches multiple threads to load chukwa
>>>>>>    record files through the different MDLs.
>>>>>> 5. MetricDataLoader -> Loads the sequence files into the database by
>>>>>>    record type, as defined in mdl.xml.
>>>>>> 6. HICC widgets have a descriptor language in JSON. You can find the
>>>>>>    widget descriptor files in hdfs://namenode:port/chukwa/hicc/widgets;
>>>>>>    each one embeds the full SQL template, like:
>>>>>>
>>>>>>    Query='select cpu_user_pcnt from [system_metrics] where timestamp
>>>>>>    between [start] and [end]'
>>>>>>
>>>>>>    This outputs all the metrics in JSON format, and the HICC graphing
>>>>>>    widget renders the graph.
>>>>>>
>>>>>> If there is no data, look at postProcess.log and make sure the data
>>>>>> loading is not throwing exceptions. Steps 3 to 6 are deprecated and
>>>>>> will be replaced with something else. Hope this helps.
>>>>>>
>>>>>> Regards,
>>>>>> Eric
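A quick way to confirm that step 5 actually loaded anything is to check MySQL
directly; this assumes the disk_2098_week table discussed earlier in the
thread:

    SHOW TABLES LIKE 'disk%';          -- list the per-week disk tables MDL has created

    SELECT *
    FROM disk_2098_week
    ORDER BY timestamp DESC
    LIMIT 5;                           -- spot-check recent rows and their timestamps

If rows are present but HICC still reports "No Data available", the problem is
in the query parameters (host list, time window) rather than in the data
loading itself.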
>>>>>> On 3/17/10 4:16 PM, "Kirk True" wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> Eric Yang wrote:
>>>>>>>>
>>>>>>>> Hi Kirk,
>>>>>>>>
>>>>>>>> I am working on a design which removes MySQL from Chukwa. I am
>>>>>>>> making this departure from MySQL because the MDL framework was for
>>>>>>>> prototype purposes. It will not scale in a production system where
>>>>>>>> Chukwa could be hosted on a large hadoop cluster. HICC will serve
>>>>>>>> data directly from HDFS in the future.
>>>>>>>>
>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still compatible with the
>>>>>>>> trunk version of Chukwa. You can load ChukwaRecords using the
>>>>>>>> org.apache.hadoop.chukwa.dataloader.MetricDataLoader class or
>>>>>>>> mdl.sh from Chukwa 0.3.
>>>>>>>
>>>>>>> I'm to the point where the "df" example is working and demux is
>>>>>>> storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from 0.3.0,
>>>>>>> no data is getting updated in the database.
>>>>>>>
>>>>>>> My question is: what's the process to get a custom Demux
>>>>>>> implementation to be viewable in HICC? Are the database tables
>>>>>>> magically created and populated for me? Does HICC generate a widget
>>>>>>> for me?
>>>>>>>
>>>>>>> HICC looks very nice, but when I try to add a widget to my
>>>>>>> dashboard, the preview always reads, "No Data Available." I'm
>>>>>>> running $CHUKWA_HOME/bin/start-all.sh followed by
>>>>>>> $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the bin
>>>>>>> directory).
>>>>>>>
>>>>>>> What am I missing?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kirk
>>>>>>>
>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it
>>>>>>>> will not be supported once we make the transition to Avro + Tfile.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I recently switched to trunk as I was experiencing a lot of issues
>>>>>>>>> with 0.3.0. In 0.3.0, there was a dbAdmin.sh script that would run
>>>>>>>>> and try to stick data in MySQL from HDFS. However, that script is
>>>>>>>>> gone, and when I run the system as built from trunk, nothing is
>>>>>>>>> ever populated in the database. Where are the instructions for
>>>>>>>>> setting up the HDFS -> MySQL data migration for HICC?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kirk