hadoop-common-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject Re: any suggestions on IIS log storage and analysis?
Date Tue, 31 Dec 2013 01:39:58 GMT
Thanks, Yong!

The dependence never crosses files, but since HDFS splits files into blocks,
it may cross blocks, which makes the MR job difficult to write. I don't
quite understand what you mean by "WholeFileInputFormat". Actually, I have
no idea how to deal with dependences across blocks.
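For readers of the archive: the WholeFileInputFormat Yong mentions is not a class shipped with Hadoop, but the idea can be sketched roughly as below against the org.apache.hadoop.mapreduce API. Class and method bodies here are illustrative, not a tested implementation:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Reads each file as a single record, so one mapper sees the whole file
// and line-to-line dependences never cross task boundaries.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split a file across input splits
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) return false;
      // Read the entire file into one value for the single map() call.
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(context.getConfiguration());
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}
```

Note that a mapper will then hold a whole file in memory, so this only suits files that fit comfortably in a task's heap.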

2013/12/31 java8964 <java8964@hotmail.com>

> I don't know of any examples of IIS log files, but from what you described, it
> looks like analyzing one line of log data depends on data from some previous
> lines. You should be clearer about what this dependence is and what you
> are trying to do.
> Just based on your questions, you still have different options; which one
> is better depends on your requirements and data.
> 1) You know the existing default TextInputFormat is not suitable for your
> case; you just need to find alternatives, or write your own.
> 2) If the dependences never cross files, just lines, you can use a
> WholeFileInputFormat (no such class comes with Hadoop itself, but it is very
> easy to implement yourself).
> 3) If the dependences cross files, then you may have to enforce your
> business logic on the reducer side, instead of the mapper side. Without knowing
> the detailed requirements of this dependence, it is hard to give you more
> detail, but you need to find out what good KEY candidates exist for your
> dependence logic, send the data to the reducers keyed on that, and enforce
> your logic on the reducer side. If one MR job is NOT enough to resolve the
> dependence, you may need to chain several MR jobs together.
> Yong
> ------------------------------
> Date: Mon, 30 Dec 2013 15:58:57 +0800
> Subject: any suggestions on IIS log storage and analysis?
> From: raofengyun@gmail.com
> To: user@hadoop.apache.org
> Hi,
> HDFS splits files into blocks, and MapReduce runs a map task for each
> block. However, fields can change within IIS log files, which means
> fields in one block may depend on another block, making the files
> unsuitable for a MapReduce job as-is. It seems some preprocessing is needed
> before storing and analyzing the IIS log files. We plan to parse each line
> into the same fields and store them in compressed Avro files. Any other
> alternatives? HBase? Or any suggestions on analyzing IIS log files?
> thanks!
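For context on the field-change problem described above: IIS writes the W3C extended log format, where "#Fields:" directive lines declare the columns for the data lines that follow. A toy parser (illustrative only; space-splitting is simplified and class names are made up) shows why a line cannot be interpreted without the most recent directive, i.e. why naive block splitting breaks:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative parser for W3C extended log format (as written by IIS).
// A "#Fields:" directive defines the columns for subsequent data lines,
// so parsing any line depends on the last directive seen before it.
public class W3CLogParser {
  private List<String> fields = new ArrayList<>();

  // Returns a field-name -> value map for a data line, or null for
  // directive/comment lines (which update parser state instead).
  public Map<String, String> parseLine(String line) {
    if (line.startsWith("#")) {
      if (line.startsWith("#Fields:")) {
        fields = new ArrayList<>();
        for (String f : line.substring("#Fields:".length()).trim().split("\\s+")) {
          fields.add(f);
        }
      }
      return null;  // #Version, #Date, etc. carry no row data
    }
    String[] values = line.split(" ");
    Map<String, String> record = new LinkedHashMap<>();
    for (int i = 0; i < fields.size() && i < values.length; i++) {
      record.put(fields.get(i), values[i]);
    }
    return record;
  }
}
```

Normalizing every line to one fixed schema with such a parser, then writing Avro, would remove the cross-line dependence and leave files that MapReduce can split freely.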
