From: Apache Wiki
Date: Tue, 02 Feb 2010 20:29:07 -0000
Subject: [Hadoop Wiki] Update of "Chukwa_Processes_and_Data_Flow" by BillGraham

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Chukwa_Processes_and_Data_Flow" page has been changed by BillGraham.
http://wiki.apache.org/hadoop/Chukwa_Processes_and_Data_Flow

--------------------------------------------------

New page:

This document describes how Chukwa data is stored in HDFS and the processes that act on it.

'''HDFS File System Structure'''

The general layout of the Chukwa filesystem is as follows.

{{{
/chukwa/
   archivesProcessing/
   dataSinkArchives/
   demuxProcessing/
   finalArchives/
   logs/
   postProcess/
   repos/
   rolling/
   temp/
}}}

'''Raw Log Collection and Aggregation Workflow'''

What data is stored where is best described by stepping through the Chukwa workflow.

 1. Collectors write chunks to {{{logs/*.chukwa}}} files until a 64MB chunk size is reached or a given time interval is reached.
  * to: {{{logs/*.chukwa}}}

 1. Collectors close chunks and rename them to {{{*.done}}}
  * from: {{{logs/*.chukwa}}}
  * to: {{{logs/*.done}}}

 1. DemuxManager wakes up every 20 seconds, runs M/R to merge {{{*.done}}} files and moves them.
  * from: {{{logs/*.done}}}
  * to: {{{demuxProcessing/mrInput}}}
  * to: {{{demuxProcessing/mrOutput}}}
  * to: {{{dataSinkArchives/[yyyyMMdd]/*/*.done}}}

 1. PostProcessManager wakes up every few minutes and aggregates, orders and de-dups record files.
  * from: {{{postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt}}}
  * to: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt}}}

 1. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs to hourly.
  * from: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt}}}
  * to: {{{temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]}}}
  * to: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt}}}
  * leaves: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/}}}

 1. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily.
  * from: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt}}}
  * to: {{{temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]}}}
  * to: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt}}}
  * leaves: {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/}}}

 1. ChukwaArchiveManager runs every half hour or so, aggregating and removing dataSinkArchives data using M/R.
  * from: {{{dataSinkArchives/[yyyyMMdd]/*/*.done}}}
  * to: {{{archivesProcessing/mrInput}}}
  * to: {{{archivesProcessing/mrOutput}}}
  * to: {{{finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*}}}

'''Log Directories Requiring Cleanup'''

The following directories will grow over time and will need to be periodically pruned:

 * {{{finalArchives/[yyyyMMdd]/*}}}
 * {{{repos/[clusterName]/[dataType]/[yyyyMMdd]/*.evt}}}
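The bracketed tokens used throughout the workflow above ({{{[clusterName]}}}, {{{[dataType]}}}, {{{[yyyyMMdd]}}}, {{{[HH]}}}, {{{[mm]}}}, {{{[N]}}}) can be made concrete with a small sketch. The helper below is illustrative only (it is not part of Chukwa, and the {{{part}}}/{{{seq}}} integers standing in for {{{[N].[N]}}} are assumptions); it builds the 5-minute record path that PostProcessManager writes under {{{repos/}}}.

{{{
from datetime import datetime

def repos_record_path(cluster, data_type, ts, part=0, seq=0):
    """Build a repos/ path following the
    [clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm] layout described above.
    Illustrative helper only -- not a Chukwa API."""
    ymd = ts.strftime("%Y%m%d")
    hh = ts.strftime("%H")
    mm = ts.strftime("%M")
    return ("/chukwa/repos/%s/%s/%s/%s/%s/%s_%s_%s_%d.%d.evt"
            % (cluster, data_type, ymd, hh, mm, data_type, ymd, hh, part, seq))

print(repos_record_path("clusterA", "Hadoop", datetime(2010, 2, 2, 20, 25)))
# /chukwa/repos/clusterA/Hadoop/20100202/20/25/Hadoop_20100202_20_0.0.evt
}}}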
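A pruning pass for the directories above could be sketched as follows. This is a local-filesystem sketch under stated assumptions (the 30-day retention window is site-specific, and a real deployment would prune HDFS via {{{hadoop fs -rmr}}} or the FileSystem API rather than {{{shutil}}}):

{{{
import os
import shutil
from datetime import datetime, timedelta

RETENTION_DAYS = 30  # assumption: site-specific retention policy

def prune_final_archives(root="/chukwa/finalArchives", now=None):
    """Delete finalArchives/[yyyyMMdd] directories older than the
    retention window. Local-filesystem sketch only; a real deployment
    would go through the HDFS shell or API instead."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=RETENTION_DAYS)
    removed = []
    if not os.path.isdir(root):
        return removed
    for name in sorted(os.listdir(root)):
        try:
            day = datetime.strptime(name, "%Y%m%d")
        except ValueError:
            continue  # skip entries that are not [yyyyMMdd] directories
        if day < cutoff:
            shutil.rmtree(os.path.join(root, name))
            removed.append(name)
    return removed
}}}

The same pattern applies to {{{repos/[clusterName]/[dataType]/[yyyyMMdd]}}}, with the date directory one level deeper.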