chukwa-user mailing list archives

From Eric Yang <ey...@yahoo-inc.com>
Subject Re: How to set up HDFS -> MySQL from trunk?
Date Fri, 19 Mar 2010 21:40:51 GMT
JIRA is CHUKWA-444.  Voldemort looks like a good fit on paper.  I will
investigate.  Thanks.

Regards,
Eric

On 3/19/10 12:49 PM, "Jerome Boulon" <jboulon@netflix.com> wrote:

> Do you have a Jira for that, so we can continue the discussion there?
> 
> The reason I'm asking is that I guess if you need to move off MySQL,
> it's because you need to scale. And if you need to scale, then you need
> partitioning, and Voldemort and HBase (like most NoSQL implementations)
> are already working on this.
> 
> Voldemort index/data files can be built using Hadoop, and HBase is
> already using TFile.
> 
> Thanks,
> /Jerome.
> 
> On 3/19/10 12:33 PM, "Eric Yang" <eyang@yahoo-inc.com> wrote:
> 
>> Hi Jerome,
>> 
>> I am not planning to have SQL on top of HDFS.  The Chukwa
>> MetricDataLoader subsystem is an index builder.  The replacement for
>> the index builder is either TFile or a streaming job to build the
>> index, plus distributed processes that cache the index by keeping the
>> TFile open or loading the index into memory.  Aggregation could be
>> replaced with a second-stage mapreduce job, or a workflow subsystem
>> like Oozie.  It could also be replaced with Hive, if the community
>> likes that approach.
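>> 
>> (For illustration only, a second-stage aggregation of that shape might
>> look like this in HiveQL; the table and column names below are
>> hypothetical, not an actual Chukwa schema:)
>> 
>>   -- Roll raw CPU samples up to hourly averages per host.
>>   INSERT OVERWRITE TABLE system_metrics_hourly
>>   SELECT floor(ts / 3600000) * 3600000 AS hour_ts,
>>          host,
>>          avg(cpu_user_pcnt) AS cpu_user_pcnt
>>   FROM system_metrics
>>   GROUP BY floor(ts / 3600000) * 3600000, host;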
>> 
>> Regards,
>> Eric
>> 
>> On 3/19/10 11:30 AM, "Jerome Boulon" <jboulon@netflix.com> wrote:
>> 
>>> Hi Eric,
>>> Correct me if I'm wrong, but given that “the SQL portion of Chukwa is
>>> deprecated, and the HDFS-based replacement is six months out”, you
>>> need a SQL-like engine, otherwise it's not a replacement.
>>> So does that mean you're planning to get a SQL-like engine working on
>>> top of HDFS in less than six months?
>>> If yes, do you already have some working code?
>>> What performance are you targeting, since even if MySQL is not
>>> scalable, you can still do a bunch of things with it...
>>> 
>>> Thanks,
>>>   /Jerome.
>>> 
>>> On 3/18/10 8:59 PM, "Kirk True" <kirk@mustardgrain.com> wrote:
>>> 
>>>> Hi Eric,
>>>> 
>>>> Awesome - everything's working great now.
>>>> 
>>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the
>>>> HDFS-based replacement is six months out. What should I do to get the
>>>> data from the adapters->collectors->HDFS->HICC? Is the HDFS-based
>>>> HICC replacement spec'ed out enough for others to contribute?
>>>> 
>>>> Thanks,
>>>> Kirk
>>>> 
>>>> Eric Yang wrote:
>>>>>  
>>>>> Hi Kirk,
>>>>> 
>>>>> 1. The host selector currently shows hostnames collected from the
>>>>> SystemMetrics table, hence you need top, iostat, df, and sar
>>>>> collected to populate the SystemMetrics table correctly.  The
>>>>> hostname is also cached in the user session, hence you will need to
>>>>> “switch to a different cluster, and switch back” or restart hicc to
>>>>> flush the cached hostnames from the user session.  The hostname
>>>>> selector should probably pick up hostnames from a different data
>>>>> source in a future release.
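>>>>> 
>>>>> (A sketch of the kind of initial_adaptors entries that feed
>>>>> SystemMetrics -- the ExecAdaptor class name matches trunk, but the
>>>>> exact periods and command flags here are assumptions:)
>>>>> 
>>>>>   add org.apache.hadoop.chukwa.datacollection.adaptor.ExecAdaptor Top 60 /usr/bin/top -b -n 1 0
>>>>>   add org.apache.hadoop.chukwa.datacollection.adaptor.ExecAdaptor Df 60 /bin/df -l 0
>>>>>   add org.apache.hadoop.chukwa.datacollection.adaptor.ExecAdaptor Iostat 60 /usr/bin/iostat 0
>>>>>   add org.apache.hadoop.chukwa.datacollection.adaptor.ExecAdaptor Sar 60 /usr/bin/sar -q 0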
>>>>> 
>>>>> 2. The server should run in UTC.  Timezone support was never
>>>>> implemented completely, hence a server in another timezone will not
>>>>> work correctly.
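>>>>> 
>>>>> (If the box itself can't run in UTC, one possible workaround -- an
>>>>> assumption, not a documented Chukwa setting -- is to pin the hicc
>>>>> JVM's timezone before starting it:)
>>>>> 
>>>>>   export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"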
>>>>> 
>>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of
>>>>> dbAdmin.sh; this subsystem down-samples data from the weekly tables
>>>>> into the monthly, yearly, and decade tables.  I wrote this submodule
>>>>> over a weekend as a prototype for show and tell.  I strongly
>>>>> recommend avoiding the SQL part of Chukwa altogether.
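>>>>> 
>>>>> (Conceptually, the aggregator issues statements of this shape -- the
>>>>> table and column names below are illustrative, not the exact
>>>>> generated SQL:)
>>>>> 
>>>>>   INSERT INTO disk_2098_month (timestamp, host, used, available)
>>>>>   SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(timestamp) / 3600) * 3600),
>>>>>          host, AVG(used), AVG(available)
>>>>>   FROM disk_2098_week
>>>>>   GROUP BY FLOOR(UNIX_TIMESTAMP(timestamp) / 3600), host;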
>>>>> 
>>>>> Regards,
>>>>> Eric
>>>>> 
>>>>> On 3/18/10 1:15 PM, "Kirk True" <kirk@mustardgrain.com> wrote:
>>>>> 
>>>>>> Hi Eric,
>>>>>> 
>>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df"
>>>>>> is being collected, parsed, stuck into HDFS, and then pulled out
>>>>>> again and placed into MySQL. However, HICC isn't showing me my data
>>>>>> just yet...
>>>>>> 
>>>>>> The disk_2098_week table is filled out with several entries and
>>>>>> looks great. If I select my cluster from the "Cluster Selector" and
>>>>>> "Last 12 Hours" from the "Time" widget, the "Disk Statistics" widget
>>>>>> still says "No Data available."
>>>>>> 
>>>>>> It appears to be because part of the SQL query includes the host
>>>>>> name, which is coming across in the SQL parameters as "". However,
>>>>>> since the disk_2098_week table properly includes the host name,
>>>>>> nothing is returned by the query. Just for grins, I updated the
>>>>>> table manually in MySQL to blank out the host names and I get a
>>>>>> super cool, pretty graph (which looks great, BTW).
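>>>>>> 
>>>>>> (For reference, that workaround was just blanking the column
>>>>>> directly, assuming it is named "host" -- obviously a hack, not a
>>>>>> fix:)
>>>>>> 
>>>>>>   UPDATE disk_2098_week SET host = '';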
>>>>>> 
>>>>>> Additionally, if I select other time periods such as "Last 1 Hour",
>>>>>> I see the query is using UTC or something (at 1:00 PDT, I see the
>>>>>> query is using a range of 19:00-20:00). However, the data in MySQL
>>>>>> is based on PDT, so no matches are found. It appears that the
>>>>>> "time_zone" session attribute contains the value "UTC". Where is
>>>>>> this coming from and how can I change it?
>>>>>> 
>>>>>> Problems:
>>>>>> 
>>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host
>>>>>>    name so that the generated SQL queries are correct?
>>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>> 
>>>>>> Thanks,
>>>>>> Kirk
>>>>>> 
>>>>>> Eric Yang wrote:
>>>>>>> The df command is converted into the disk_xxxx_week table in
>>>>>>> MySQL, if I remember correctly.  In MySQL, are the database tables
>>>>>>> getting created?
>>>>>>> Make sure that you have:
>>>>>>> 
>>>>>>>   <property>
>>>>>>>     <name>chukwa.post.demux.data.loader</name>
>>>>>>>     <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>>   </property>
>>>>>>> 
>>>>>>> In Chukwa-demux.conf.
>>>>>>> 
>>>>>>> The rough picture of the data flows looks like this:
>>>>>>> 
>>>>>>> 1. demux -> Generates chukwa record outputs.
>>>>>>> 2. archive -> Generates bigger files by compacting data sink files.
>>>>>>>    (Concurrent with step 1.)
>>>>>>> 3. postProcess -> Looks up which files were generated by the demux
>>>>>>>    process and dispatches them to different data loaders.
>>>>>>> 4. MetricDataLoaderPool -> Dispatches multiple threads to load
>>>>>>>    chukwa record files to different MDLs.
>>>>>>> 5. MetricDataLoader -> Loads sequence files into the database by
>>>>>>>    record type, as defined in mdl.xml.
>>>>>>> 6. HICC widgets have a descriptor language in JSON.  You can find
>>>>>>>    the widget descriptor files in
>>>>>>>    hdfs://namenode:port/chukwa/hicc/widgets, which embed the full
>>>>>>>    SQL template, like:
>>>>>>> 
>>>>>>>    Query="select cpu_user_pcnt from [system_metrics] where
>>>>>>>    timestamp between [start] and [end]"
>>>>>>> 
>>>>>>>    This will output all the metrics in JSON format, and the HICC
>>>>>>>    graphing widget will render the graph.
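>>>>>>> 
>>>>>>> (A minimal descriptor sketch, assuming the JSON layout -- the field
>>>>>>> names below are illustrative, not the exact schema; only the
>>>>>>> [start]/[end]/[system_metrics] substitution markers come from the
>>>>>>> real templates:)
>>>>>>> 
>>>>>>>   {
>>>>>>>     "title": "CPU Utilization",
>>>>>>>     "query": "select cpu_user_pcnt from [system_metrics] where timestamp between [start] and [end]",
>>>>>>>     "renderer": "timeseries"
>>>>>>>   }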
>>>>>>> 
>>>>>>> If there is no data, look at postProcess.log and make sure the data
>>>>>>> loading is not throwing exceptions.  Steps 3 to 6 are deprecated,
>>>>>>> and will be replaced with something else.  Hope this helps.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Eric
>>>>>>> 
>>>>>>> On 3/17/10 4:16 PM, "Kirk True" <kirk@mustardgrain.com> wrote:
>>>>>>> 
>>>>>>>> Hi Eric,
>>>>>>>> 
>>>>>>>> Eric Yang wrote:
>>>>>>>>> Hi Kirk,
>>>>>>>>> 
>>>>>>>>> I am working on a design which removes MySQL from Chukwa.  I am
>>>>>>>>> making this departure from MySQL because the MDL framework was
>>>>>>>>> for prototype purposes.  It will not scale in a production system
>>>>>>>>> where Chukwa could be hosted on a large Hadoop cluster.  HICC
>>>>>>>>> will serve data directly from HDFS in the future.
>>>>>>>>> 
>>>>>>>>> Meanwhile, the dbAdmin.sh from Chukwa 0.3 is still compatible
>>>>>>>>> with the trunk version of Chukwa.  You can load ChukwaRecords
>>>>>>>>> using the org.apache.hadoop.chukwa.dataloader.MetricDataLoader
>>>>>>>>> class or mdl.sh from Chukwa 0.3.
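>>>>>>>>> 
>>>>>>>>> (Roughly like this -- the argument form is an assumption, so
>>>>>>>>> check mdl.sh's usage string; the idea is to point it at the demux
>>>>>>>>> output of ChukwaRecords sequence files:)
>>>>>>>>> 
>>>>>>>>>   bin/mdl.sh hdfs://namenode:9000/chukwa/repos/<cluster>/<recordtype>/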
>>>>>>>>> 
>>>>>>>> I'm to the point where the "df" example is working and demux is
>>>>>>>> storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from
>>>>>>>> 0.3.0, no data is getting updated in the database.
>>>>>>>> 
>>>>>>>> My question is: what's the process to get a custom Demux
>>>>>>>> implementation to be viewable in HICC? Are the database tables
>>>>>>>> magically created and populated for me? Does HICC generate a
>>>>>>>> widget for me?
>>>>>>>> 
>>>>>>>> HICC looks very nice, but when I try to add a widget to my
>>>>>>>> dashboard, the preview always reads, "No Data Available." I'm
>>>>>>>> running $CHUKWA_HOME/bin/start-all.sh followed by
>>>>>>>> $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the bin
>>>>>>>> directory).
>>>>>>>> 
>>>>>>>> What am I missing?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Kirk
>>>>>>>> 
>>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it
>>>>>>>>> will not be supported once we make the transition to Avro +
>>>>>>>>> TFile.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Eric
>>>>>>>>> 
>>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" <kirk@mustardgrain.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> I recently switched to trunk as I was experiencing a lot of
>>>>>>>>>> issues with 0.3.0. In 0.3.0, there was a dbAdmin.sh script that
>>>>>>>>>> would run and try to stick data in MySQL from HDFS. However,
>>>>>>>>>> that script is gone, and when I run the system as built from
>>>>>>>>>> trunk, nothing is ever populated in the database. Where are the
>>>>>>>>>> instructions for setting up the HDFS -> MySQL data migration for
>>>>>>>>>> HICC?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Kirk