hadoop-hdfs-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: any suggestions on IIS log storage and analysis?
Date Tue, 31 Dec 2013 02:17:01 GMT
Google "Hadoop WholeFileInputFormat" or search it in book "Hadoop: The Definitive Guide"

Date: Tue, 31 Dec 2013 09:39:58 +0800
Subject: Re: any suggestions on IIS log storage and analysis?
From: raofengyun@gmail.com
To: user@hadoop.apache.org

Thanks, Yong!
The dependences never cross files, but since HDFS splits files into blocks, they may cross blocks,
which makes it difficult to write an MR job. I don't quite understand what you mean by "WholeFileInputFormat".
Actually, I have no idea how to deal with dependences across blocks.

2013/12/31 java8964 <java8964@hotmail.com>

I don't know any examples of IIS log files. But from what you described, it looks like analyzing
one line of log data depends on some previous lines' data. You should be more clear about what
this dependence is and what you are trying to do.

Just based on your questions, you still have different options; which one is better depends
on your requirements and data.
1) You know the existing default TextInputFormat is not suitable for your case; you just need
to find an alternative, or write your own.
2) If the dependences never cross files, just lines, you can use a WholeFileInputFormat
(no such class comes from Hadoop itself, but it is very easy to implement yourself).
3) If the dependences cross files, then you may have to enforce your business logic on the
reducer side, instead of the mapper side. Without knowing the detailed requirements of this
dependence, it is hard to give you more detail, but you need to find out what good KEY
candidates are for your dependence logic, send the data to the reducers keyed on that, and
enforce your logic on the reducer side. If one MR job is not enough to resolve your dependence,
you may need to chain several MR jobs together.
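The WholeFileInputFormat idea Yong points at (covered in "Hadoop: The Definitive Guide") changes the record boundary: instead of one record per line with splits cut at block edges, each file becomes a single record, so line-to-line dependences inside a file can never straddle two mappers. The sketch below is plain Python, not the Hadoop API; it only illustrates the difference in where the boundaries fall.

```python
# Illustration only -- not Hadoop code. Shows the record boundaries
# the two input-format strategies produce.

def text_input_splits(data, block_size):
    """Like TextInputFormat over HDFS blocks: the byte stream is cut
    every block_size bytes, so a dependence between two lines can end
    up in different splits (different mappers)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def whole_file_records(files):
    """Like a WholeFileInputFormat: one record per file, so every line
    of a file reaches the same mapper and in-file dependences hold."""
    return [(name, content) for name, content in files.items()]

files = {
    "u_ex131230.log": "#Fields: date time\n2013-12-30 15:58\n2013-12-30 15:59\n",
    "u_ex131231.log": "#Fields: date time\n2013-12-31 02:17\n",
}
# With whole-file records, each mapper sees a complete file at once.
records = whole_file_records(files)
```

The trade-off is the usual one: whole-file records lose split parallelism within a file, so this works best when individual files are small relative to the block size.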


Date: Mon, 30 Dec 2013 15:58:57 +0800
Subject: any suggestions on IIS log storage and analysis?
From: raofengyun@gmail.com

To: user@hadoop.apache.org

HDFS splits files into blocks, and MapReduce runs a map task for each block. However, fields
can change within IIS log files, which means fields in one block may depend on another block,
making the files unsuitable for a MapReduce job as-is. It seems there should be some
preprocessing before storing and analyzing the IIS log files. We plan to parse each line into
the same set of fields and store them in Avro files with compression. Any other alternatives?
HBase? Or any suggestions on analyzing IIS log files?
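The field dependence described here comes from the W3C extended log format IIS writes: a "#Fields:" directive line declares the column layout for the entries that follow, and a new directive can change it mid-file. A preprocessor that tracks the directive in effect can flatten every entry onto one fixed schema, which is the kind of normalization proposed above before writing Avro. A minimal sketch (the target field list is just an example, not a full IIS schema):

```python
# Sketch of the preprocessing step: normalize IIS W3C log lines onto one
# common schema by tracking the current "#Fields:" directive.
# TARGET_FIELDS is an example target schema, not the full IIS field set.

TARGET_FIELDS = ["date", "time", "c-ip", "cs-uri-stem", "sc-status"]

def normalize(lines, target=TARGET_FIELDS):
    """Yield one dict per log entry, keyed by the target schema.
    Fields absent from the directive currently in effect become None."""
    current = []                          # field layout in effect
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            current = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line:
            continue                      # other directives / blank lines
        else:
            row = dict(zip(current, line.split()))
            yield {f: row.get(f) for f in target}

log = [
    "#Fields: date time c-ip sc-status",
    "2013-12-30 15:58:57 10.0.0.1 200",
    "#Fields: date time c-ip cs-uri-stem sc-status",
    "2013-12-30 15:59:01 10.0.0.2 /index.html 404",
]
rows = list(normalize(log))
```

Once every entry carries the same fields, the records are plain line-at-a-time data again, so the default TextInputFormat (or an Avro input format) works without any cross-block dependence.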


