From: Jerome Boulon <jboulon@netflix.com>
To: chukwa-user@hadoop.apache.org
Date: Fri, 19 Mar 2010 12:49:53 -0700
Subject: Re: How to set up HDFS -> MySQL from trunk?

Do you have a Jira for that, so we can continue the discussion there?

The reason I'm asking is that if you need to move out of MySQL, I assume it's
because you need to scale. And if you need to scale, then you need
partitioning, and Voldemort and HBase (like all the NoSQL implementations) are
already working on this. Voldemort index/data files can be built using Hadoop,
and HBase is already using Tfile.

Thanks,
/Jerome.

On 3/19/10 12:33 PM, "Eric Yang" wrote:

> Hi Jerome,
>
> I am not planning to have SQL on top of HDFS. The Chukwa MetricDataLoader
> subsystem is an index builder. The replacement for the index builder is
> either Tfile or a streaming job to build the index, plus distributed
> processes that cache the index by keeping the Tfile open or loading the
> index into memory. The aggregation could be replaced with a second-stage
> mapreduce job, or with a workflow subsystem like Oozie. It could also be
> replaced with Hive, if the community likes this approach.
>
> Regards,
> Eric
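As a rough sketch of the Hive option mentioned above (the table and column
names here are illustrative only, not an actual Chukwa schema), the
aggregation stage could be expressed as a HiveQL down-sampling query over the
demux output, assuming it had been exposed to Hive as a system_metrics table:

    -- hypothetical second-stage aggregation: hourly averages per host
    INSERT OVERWRITE TABLE system_metrics_hourly
    SELECT host,
           floor(ts / 3600000) * 3600000 AS hour_ts,   -- bucket epoch-millisecond timestamps by hour
           avg(cpu_user_pcnt)            AS cpu_user_pcnt
    FROM system_metrics
    GROUP BY host, floor(ts / 3600000) * 3600000;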
> On 3/19/10 11:30 AM, "Jerome Boulon" wrote:
>
>> Hi Eric,
>> Correct me if I'm wrong, but for "the SQL portion of Chukwa is deprecated,
>> and the HDFS-based replacement is six months out" to hold, you need a
>> SQL-like engine; otherwise it's not a replacement.
>> So does that mean you're planning to get a SQL-like engine working on top
>> of HDFS in less than 6 months?
>> If yes, do you already have some working code?
>> What performance are you targeting, since even if MySQL is not scalable,
>> you can still do a bunch of things with it...
>>
>> Thanks,
>> /Jerome.
>>
>> On 3/18/10 8:59 PM, "Kirk True" wrote:
>>
>>> Hi Eric,
>>>
>>> Awesome - everything's working great now.
>>>
>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the
>>> HDFS-based replacement is six months out. What should I do to get the
>>> data from the adapters -> collectors -> HDFS -> HICC? Is the HDFS-based
>>> HICC replacement spec'ed out enough for others to contribute?
>>>
>>> Thanks,
>>> Kirk
>>>
>>> Eric Yang wrote:
>>>>
>>>> Hi Kirk,
>>>>
>>>> 1. The host selector currently shows hostnames collected from the
>>>> SystemMetrics table, hence you need to have top, iostat, df, and sar
>>>> collected to populate the SystemMetrics table correctly. The hostname
>>>> is also cached in the user session, so you will need to switch to a
>>>> different cluster and switch back, or restart HICC, to flush the cached
>>>> hostnames from the user session. The hostname selector should probably
>>>> pick up hostnames from a different data source in a future release.
>>>>
>>>> 2. The server should run in UTC. Timezone support was never implemented
>>>> completely, hence a server in another timezone will not work correctly.
>>>>
>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of
>>>> dbAdmin.sh; this subsystem down-samples data from the weekly table into
>>>> monthly, yearly, and decade tables. I wrote this submodule over a
>>>> weekend for a prototype show and tell. I strongly recommend avoiding
>>>> the SQL part of Chukwa altogether.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 3/18/10 1:15 PM, "Kirk True" wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df" is
>>>>> being collected, parsed, stuck into HDFS, and then pulled out again
>>>>> and placed into MySQL. However, HICC isn't showing me my data just
>>>>> yet...
>>>>>
>>>>> The disk_2098_week table is filled out with several entries and looks
>>>>> great. If I select my cluster from the "Cluster Selector" and "Last 12
>>>>> Hours" from the "Time" widget, the "Disk Statistics" widget still says
>>>>> "No Data available."
>>>>>
>>>>> It appears to be because part of the SQL query includes the host name,
>>>>> which is coming across in the SQL parameters as "". However, since the
>>>>> disk_2098_week table properly includes the host name, nothing is
>>>>> returned by the query. Just for grins, I updated the table manually in
>>>>> MySQL to blank out the host names, and I get a super cool, pretty
>>>>> graph (which looks great, BTW).
>>>>>
>>>>> Additionally, if I select other time periods such as "Last 1 Hour", I
>>>>> see the query is using UTC or something (at 1:00 PDT, I see the query
>>>>> is using a range of 19:00-20:00). However, the data in MySQL is based
>>>>> on PDT, so no matches are found. It appears that the "time_zone"
>>>>> session attribute contains the value "UTC". Where is this coming from
>>>>> and how can I change it?
>>>>>
>>>>> Problems:
>>>>>
>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host name
>>>>> so that the generated SQL queries are correct?
>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>
>>>>> Thanks,
>>>>> Kirk
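To make the failure mode above concrete, the widget's SQL template presumably
expands to something along these lines (the column names are guesses, not the
real schema); with the Hosts Selector contributing an empty host name and the
[start]/[end] macros expanding to a UTC window, it cannot match rows keyed by
real hostnames and PDT timestamps:

    SELECT timestamp, mount, used_pcnt
    FROM disk_2098_week
    WHERE timestamp BETWEEN '2010-03-19 19:00:00'   -- UTC window sent by HICC
                        AND '2010-03-19 20:00:00'
      AND host IN ('');                             -- empty hostname from the session, so zero rows match

Blanking out the host column in the table makes host = '' match again, which
is why the graph renders after that manual edit.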
>>>>> Eric Yang wrote:
>>>>>>
>>>>>> The df command is converted into the disk_xxxx_week table in MySQL,
>>>>>> if I remember correctly. In MySQL, are the database tables getting
>>>>>> created? Make sure that you have:
>>>>>>
>>>>>>   <property>
>>>>>>     <name>chukwa.post.demux.data.loader</name>
>>>>>>     <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>   </property>
>>>>>>
>>>>>> in Chukwa-demux.conf.
>>>>>>
>>>>>> The rough picture of the data flow looks like this:
>>>>>>
>>>>>> 1. demux -> Generates chukwa record outputs.
>>>>>> 2. archive -> Generates bigger files by compacting data sink files
>>>>>>    (concurrent with step 1).
>>>>>> 3. postProcess -> Looks up which files were generated by the demux
>>>>>>    process and dispatches them to the different data loaders.
>>>>>> 4. MetricDataLoaderPool -> Dispatches multiple threads to load chukwa
>>>>>>    record files through the different MDLs.
>>>>>> 5. MetricDataLoader -> Loads the sequence files into the database by
>>>>>>    record type, as defined in mdl.xml.
>>>>>> 6. HICC widgets have a descriptor language in JSON. You can find the
>>>>>>    widget descriptor files in hdfs://namenode:port/chukwa/hicc/widgets;
>>>>>>    each one embeds the full SQL template, like:
>>>>>>
>>>>>>    Query='select cpu_user_pcnt from [system_metrics] where timestamp
>>>>>>    between [start] and [end]'
>>>>>>
>>>>>>    This outputs all the metrics in JSON format, and the HICC graphing
>>>>>>    widget renders the graph.
>>>>>>
>>>>>> If there is no data, look at postProcess.log and make sure the data
>>>>>> loading is not throwing exceptions. Steps 3 to 6 are deprecated and
>>>>>> will be replaced with something else. Hope this helps.
>>>>>>
>>>>>> Regards,
>>>>>> Eric
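A quick way to confirm that step 5 actually loaded anything is to check MySQL
directly; this assumes the disk_2098_week table discussed earlier in the
thread:

    SHOW TABLES LIKE 'disk%';          -- list the per-week disk tables MDL has created

    SELECT *
    FROM disk_2098_week
    ORDER BY timestamp DESC
    LIMIT 5;                           -- spot-check recent rows and their timestamps

If rows are present but HICC still reports "No Data available", the problem is
in the query parameters (host list, time window) rather than in the data
loading itself.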
>>>>>> On 3/17/10 4:16 PM, "Kirk True" wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> Eric Yang wrote:
>>>>>>>>
>>>>>>>> Hi Kirk,
>>>>>>>>
>>>>>>>> I am working on a design which removes MySQL from Chukwa. I am
>>>>>>>> making this departure from MySQL because the MDL framework was for
>>>>>>>> prototype purposes. It will not scale in a production system where
>>>>>>>> Chukwa could be hosted on a large hadoop cluster. HICC will serve
>>>>>>>> data directly from HDFS in the future.
>>>>>>>>
>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still compatible with the
>>>>>>>> trunk version of Chukwa. You can load ChukwaRecords using the
>>>>>>>> org.apache.hadoop.chukwa.dataloader.MetricDataLoader class or
>>>>>>>> mdl.sh from Chukwa 0.3.
>>>>>>>
>>>>>>> I'm to the point where the "df" example is working and demux is
>>>>>>> storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from 0.3.0,
>>>>>>> no data is getting updated in the database.
>>>>>>>
>>>>>>> My question is: what's the process to get a custom Demux
>>>>>>> implementation to be viewable in HICC? Are the database tables
>>>>>>> magically created and populated for me? Does HICC generate a widget
>>>>>>> for me?
>>>>>>>
>>>>>>> HICC looks very nice, but when I try to add a widget to my
>>>>>>> dashboard, the preview always reads, "No Data Available." I'm
>>>>>>> running $CHUKWA_HOME/bin/start-all.sh followed by
>>>>>>> $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the bin
>>>>>>> directory).
>>>>>>>
>>>>>>> What am I missing?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kirk
>>>>>>>
>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it
>>>>>>>> will not be supported once we make the transition to Avro + Tfile.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I recently switched to trunk as I was experiencing a lot of issues
>>>>>>>>> with 0.3.0. In 0.3.0, there was a dbAdmin.sh script that would run
>>>>>>>>> and try to stick data in MySQL from HDFS. However, that script is
>>>>>>>>> gone, and when I run the system as built from trunk, nothing is
>>>>>>>>> ever populated in the database. Where are the instructions for
>>>>>>>>> setting up the HDFS -> MySQL data migration for HICC?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kirk