hadoop-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject Re: any suggestions on IIS log storage and analysis?
Date Fri, 03 Jan 2014 06:16:09 GMT
Thanks, Peyman. The problem is that the dependency is not simply a key; it is
complicated enough that without the "#Fields" line from one block, it isn't
even possible to parse a single line in another block.
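To make the dependency concrete: IIS writes W3C extended logs, where a "#Fields" directive defines the schema for every data line after it, so a block cut off from its directive is unparseable. A minimal stateful parser sketch (field names and timestamps here are illustrative, not from the thread):

```python
# Sketch: why a W3C extended (IIS) log block can't be parsed without
# the "#Fields" directive that precedes it. Sample data is invented.

def parse_w3c_lines(lines):
    """Parse lines statefully: each '#Fields:' directive redefines the
    schema for all data lines that follow it."""
    fields = None
    records = []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#"):
            continue  # other directives: #Version, #Date, ...
        elif fields is None:
            # This is exactly the situation of a block split away
            # from its #Fields line.
            raise ValueError("data line seen before any #Fields directive")
        else:
            records.append(dict(zip(fields, line.split())))
    return records

log = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method",
    "2014-01-03 06:16:09 10.0.0.1 GET",
    "#Fields: date time c-ip cs-method sc-status",  # schema changes mid-file
    "2014-01-03 06:16:10 10.0.0.2 POST 200",
]
records = parse_w3c_lines(log)
```

Feeding the parser a data line on its own (as a mapper would see at the start of a mid-file block) raises, which is the crux of the problem.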


2014/1/1 Peyman Mohajerian <mohajeri@gmail.com>

> You can run a series of map-reduce jobs on your data. If some log line is
> related to another line, e.g. based on sessionId, you can emit the
> sessionId as the key of your mapper output, with the value being the rows
> associated with that sessionId, so that on the reducer side data from
> different blocks comes together. That is just one example; the point is
> that the fact that your file content gets split doesn't have to hurt your
> analysis even if you have inter-dependencies.
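The map/shuffle/reduce flow described above can be mimicked outside Hadoop in a few lines; the row format (sessionId as the first comma-separated field) is invented here for illustration:

```python
from collections import defaultdict

# Toy model of the grouping described in the thread: the mapper emits
# (sessionId, row); the shuffle brings every row for a session to one
# reducer regardless of which HDFS block it came from.

def mapper(row):
    session_id = row.split(",")[0]  # assumed row layout: "sessionId,event"
    return session_id, row

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(session_id, rows):
    return session_id, len(rows)  # e.g. count events per session

block_a = ["s1,login", "s2,view"]   # pretend these are two HDFS blocks
block_b = ["s1,logout"]             # s1's rows span both blocks
pairs = [mapper(r) for r in block_a + block_b]
results = dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

The point of the sketch: s1's rows live in different "blocks", yet the reducer still sees them together because they share a key.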
>
>
> On Mon, Dec 30, 2013 at 7:31 PM, Fengyun RAO <raofengyun@gmail.com> wrote:
>
>> Thanks, I understand now, but I don't think this is what we need. The IIS
>> log files are very big (e.g., several GB per file), so we need to split
>> them for parallel processing. However, this could be used as a kind of
>> preprocessing step, transforming the original log files into splittable
>> files such as Avro files.
>>
>>
>>
>>
>> 2013/12/31 java8964 <java8964@hotmail.com>
>>
>>> Google "Hadoop WholeFileInputFormat" or search it in book " Hadoop: The
>>> Definitive Guide<http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA242&lpg=PA242&dq=hadoop+definitive+guide+WholeFileInputFormat&source=bl&ots=i7BUTBU8Vw&sig=0m5effHuOY1kuqiRofqTbeEl7KU&hl=en&sa=X&ei=yijCUs_YLqHJsQSZ1oD4DQ&ved=0CD0Q6AEwAA>
>>> "
>>>
>>> Yong
>>>
>>>
>>> ------------------------------
>>> Date: Tue, 31 Dec 2013 09:39:58 +0800
>>> Subject: Re: any suggestions on IIS log storage and analysis?
>>>
>>> From: raofengyun@gmail.com
>>> To: user@hadoop.apache.org
>>>
>>> Thanks, Yong!
>>>
>>> The dependency never crosses files, but since HDFS splits files into
>>> blocks, it may cross blocks, which makes it difficult to write an MR job.
>>> I don't quite understand what you mean by "WholeFileInputFormat";
>>> actually, I have no idea how to deal with dependencies across blocks.
>>>
>>>
>>> 2013/12/31 java8964 <java8964@hotmail.com>
>>>
>>> I don't know the exact layout of IIS log files, but from what you
>>> described, it looks like analyzing one line of log data depends on some
>>> previous lines. You should be clearer about what this dependency is and
>>> what you are trying to do.
>>>
>>> Just based on your questions, you have several options; which one is
>>> better depends on your requirements and data.
>>>
>>> 1) You know the existing default TextInputFormat is not suitable for
>>> your case, so you need to find an alternative, or write your own.
>>> 2) If the dependencies never cross files, just lines, you can use a
>>> WholeFileInputFormat (no such class ships with Hadoop itself, but it is
>>> easy to implement yourself).
>>> 3) If the dependencies cross files, then you may have to enforce your
>>> business logic on the reducer side instead of the mapper side. Without
>>> knowing the details of this dependency it is hard to say more, but you
>>> need to find good KEY candidates for your dependency logic, send the
>>> data to the reducers keyed on them, and enforce your logic on the
>>> reducer side. If one MR job is not enough to resolve the dependency,
>>> you may need to chain several MR jobs together.
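The "whole file as one record" idea behind option 2 is simple to illustrate. In Hadoop you would subclass FileInputFormat with isSplitable() returning false; the plain-Python stand-in below (file name and contents are invented) just shows the contract: one record per file, so every data line arrives together with its directives:

```python
import os
import tempfile

# Stand-in for a WholeFileInputFormat: each file yields exactly one
# (key, value) record, where key is the file path and value is the
# entire file contents -- so the "#Fields" directive can never be
# separated from the lines that depend on it.

def whole_file_records(paths):
    for path in paths:
        with open(path) as f:
            yield path, f.read()

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "example_iis.log")  # hypothetical log file
    with open(p, "w") as f:
        f.write("#Fields: date c-ip\n2014-01-03 10.0.0.1\n")
    records = list(whole_file_records([p]))
```

The trade-off, as noted later in the thread, is that a multi-GB file then becomes one unsplittable map task, which is why preprocessing into a splittable container (such as Avro) was considered instead.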
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Date: Mon, 30 Dec 2013 15:58:57 +0800
>>> Subject: any suggestions on IIS log storage and analysis?
>>> From: raofengyun@gmail.com
>>> To: user@hadoop.apache.org
>>>
>>>
>>> Hi,
>>>
>>> HDFS splits files into blocks, and MapReduce runs a map task for each
>>> block. However, the fields can change within an IIS log file, which means
>>> the fields in one block may depend on another block, making the files
>>> unsuitable for a MapReduce job as-is. It seems some preprocessing is
>>> needed before storing and analyzing the IIS log files. We plan to parse
>>> each line into the same set of fields and store them in Avro files with
>>> compression. Are there any alternatives? HBase? Or any suggestions on
>>> analyzing IIS log files?
>>>
>>> thanks!
>>>
>>>
>>>
>>>
>>
>
