hive-user mailing list archives

From alo alt <>
Subject Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?
Date Tue, 07 Feb 2012 10:58:00 GMT

a first start with flume:

Facebook's Scribe could also work for you.

- Alex

Alexander Lorenz

On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:

> Hi all,
> Sorry if it is not appropriate to send one thread to two mailing lists.
> I'm trying to use Hadoop and Hive to do some log analysis jobs.
> Our system generates lots of logs every day; for example, it produced about
> 370GB of logs (spread over many log files) yesterday, and the volume
> increases every day.
> We want to use Hadoop and Hive to replace our old log analysis system.
> We distinguish our logs by logid. We have a log collector which
> collects logs from clients and then generates log files.
> For every logid there is one log file every hour; for some logids,
> this hourly log file can be 1~2GB.
> I have set up a test cluster with Hadoop and Hive, and I have run some
> tests which look good for us.
> For reference, we will create one table in Hive for every logid, which will
> be partitioned by hour.
> Now I have a question: what's the best practice for loading log files into
> HDFS or the Hive warehouse dir?
> My first thought is, at the beginning of every hour, to compress the log file
> of the last hour for every logid and then use the Hive command-line tool to
> load these compressed log files into HDFS,
> using commands like "LOAD DATA LOCAL INPATH '$logname' OVERWRITE INTO
> TABLE $tablename PARTITION (dt='$h')"
> I think this can work, and I have run some tests on our 3-node test
> cluster.
> But the problem is that there are lots of logids, which means there are lots
> of log files, so every hour we will have to load lots of files into HDFS.
> There is another problem: we will run hourly analysis jobs on these
> hourly collected log files.
> Because there are lots of log files, if we load them all at the same time
> at the beginning of every hour, I think there will be a burst of network
> traffic and a data delivery latency problem.
> By data delivery latency, I mean that it will take some time for the
> log files to be copied into HDFS, and this will cause our hourly log
> analysis job to start later.
> So I wanted to figure out whether we can write or append logs to a
> compressed file which is already located in HDFS. I posted a thread on the
> mailing list, and from what I have learned, this is not possible.
> So, what's the best practice for loading logs into HDFS while using Hive to
> do log analysis?
> Or what are the common methods for handling the problems I have described
> above?
> Can anyone give me some advice?
> Thank you very much for your help!
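
For illustration, the hourly compress-and-load step described in the quoted message could be sketched roughly as below. This is only a sketch under assumptions: the file naming `<logid>_<hour>.log`, the table naming `log_<logid>`, and the hour format `2012020710` are hypothetical and not from the original message; only the LOAD DATA statement itself follows the one quoted above. The script builds the statements and does not run them.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def compress_log(src: Path) -> Path:
    """Gzip src into a sibling file named <src>.gz; the original is kept."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def load_statement(logname: str, tablename: str, hour: str) -> str:
    """Build the LOAD DATA statement from the message for one log file."""
    return (
        f"LOAD DATA LOCAL INPATH '{logname}' OVERWRITE "
        f"INTO TABLE {tablename} PARTITION (dt='{hour}')"
    )


def hourly_statements(log_dir: Path, hour: str) -> List[str]:
    """Compress every log file of the given hour and return one statement each.

    Assumes (hypothetically) files named <logid>_<hour>.log and
    one Hive table per logid named log_<logid>.
    """
    suffix = f"_{hour}.log"
    statements = []
    for path in sorted(log_dir.glob(f"*{suffix}")):
        logid = path.name[: -len(suffix)]
        gz_path = compress_log(path)
        statements.append(load_statement(str(gz_path), f"log_{logid}", hour))
    return statements
```

Each returned statement could then be passed to the Hive command-line tool, for example with `hive -e "<statement>"`. Running the loads one logid at a time rather than all at once would spread out the network traffic, at the cost of delaying the hourly analysis job further.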
