Subject: Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?
From: Xiaobin She <xiaobinshe@gmail.com>
To: user@hive.apache.org, common-user@hadoop.apache.org
Date: Tue, 7 Feb 2012 22:09:45 +0800

Hi Bejoy and Alex,

Thank you for your advice.

Actually I looked at Scribe first, and I have also heard of Flume.

I have just read Flume's user guide, and Flume seems promising. As Bejoy said, the Flume collector can dump data into HDFS when the collector buffer reaches a particular size or after a particular time interval. This is good, and I think it can solve the problem of data delivery latency.

But what about compression?

From Flume's user guide I see that Flume supports compression of log files. But if Flume does not wait until the collector has collected one hour of logs before compressing them and sending them to HDFS, then it will send only part of that hour's logs to HDFS at a time, am I right?

So if I want to use these data in Hive (assume I have an external table in Hive), I have to specify at least two partition keys while creating the table, one for day-month-hour and one for some smaller time interval like ten minutes, and then add Hive partitions to the existing external table with the specified partition keys, roughly like the sketch below.

Is the above process right?
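Just to make the idea concrete, here is a minimal sketch of what I mean; the table name, columns, and paths are made up, and I assume gzip-compressed text files are dropped into per-partition directories:

    -- hypothetical table for one logid, partitioned by hour (dt) and by a
    -- ten-minute slice within that hour (tm)
    CREATE EXTERNAL TABLE IF NOT EXISTS login_log (
      log_time STRING,
      user_id  STRING,
      detail   STRING
    )
    PARTITIONED BY (dt STRING, tm STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/logs/login';

    -- whenever the collector closes a ten-minute file, register its directory
    ALTER TABLE login_log ADD PARTITION (dt='2012-02-07-13', tm='10')
    LOCATION '/data/logs/login/2012-02-07-13/10';

The hourly analysis job could then still select a whole hour through the dt key, even though the data arrived in ten-minute pieces.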
If this is right, then there could be some other problems: the ten-minute logs after compression may not be big enough to fill an HDFS block, which could produce lots of small files (for some of our logids this will certainly happen), and if I instead set the time interval to half an hour, then at the end of the hour it may still cause the data delivery latency problem.

This does not seem like a very good solution. Am I making some mistake or misunderstanding something here?

Thank you very much!

2012/2/7 alo alt <wget.null@googlemail.com>

> Hi,
>
> a first start with flume:
>
> http://mapredit.blogspot.com/2011/10/centralized-logfile-management-across.html
>
> Facebook's Scribe could also work for you.
>
> - Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:
>
> > Hi all,
> >
> > Sorry if it is not appropriate to send one thread to two mailing lists.
> >
> > I'm trying to use Hadoop and Hive to do some log analysis jobs.
> >
> > Our system generates lots of logs every day; for example, it produced
> > about 370 GB of logs (spread over many log files) yesterday, and the
> > volume grows every day.
> >
> > We want to use Hadoop and Hive to replace our old log analysis system.
> >
> > We distinguish our logs by logid; we have a log collector which collects
> > logs from clients and then generates log files.
> >
> > For every logid there will be one log file every hour, and for some
> > logids this hourly log file can be 1~2 GB.
> >
> > I have set up a test cluster with Hadoop and Hive, and I have run some
> > tests which look good for us.
> >
> > For reference, we will create one table in Hive for every logid, which
> > will be partitioned by hour.
> >
> > Now I have a question: what's the best practice for loading log files
> > into HDFS or the Hive warehouse dir?
> >
> > My first thought is: at the beginning of every hour, compress the log
> > file of the last hour for every logid and then use the hive cmd tool to
> > load these compressed log files into HDFS,
> >
> > using commands like "LOAD DATA LOCAL INPATH '$logname' OVERWRITE INTO
> > TABLE $tablename PARTITION (dt='$h')".
> >
> > I think this can work, and I have run some tests on our 3-node test
> > cluster.
> >
> > But the problem is that there are lots of logids, which means there are
> > lots of log files, so every hour we will have to load lots of files into
> > HDFS.
> >
> > And there is another problem: we will run hourly analysis jobs on these
> > hourly collected log files, which introduces a problem because there are
> > so many log files. If we load all of them at the same time at the
> > beginning of every hour, I think there will be a burst of network
> > traffic and there will be a data delivery latency problem.
> >
> > By the data delivery latency problem, I mean it will take some time for
> > the log files to be copied into HDFS, and this will cause our hourly log
> > analysis job to start later.
> >
> > So I wanted to figure out whether we can write or append logs to a
> > compressed file which is already located in HDFS; I have posted a thread
> > on the mailing list about this, and from what I have learned, it is not
> > possible.
> >
> > So, what's the best practice of loading logs into HDFS while using Hive
> > to do log analysis?
> >
> > Or what are the common methods to handle the problem I have described
> > above?
> >
> > Can anyone give me some advice?
> >
> > Thank you very much for your help!
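For reference, the hourly load described in the quoted mail would look roughly like the following; the table name, path, and partition value are made up, and if I understand correctly Hadoop decompresses .gz text files at read time, so the file can be loaded while still compressed:

    -- assuming an hourly table for one logid, partitioned only by dt
    LOAD DATA LOCAL INPATH '/data/logs/login/2012-02-07-13.log.gz'
    OVERWRITE INTO TABLE login_log_hourly PARTITION (dt='2012-02-07-13');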
