Date: Tue, 31 Dec 2013 09:39:58 +0800
Subject: Re: any suggestions on IIS log storage and analysis?
From: Fengyun RAO <raofengyun@gmail.com>
To: user@hadoop.apache.org

Thanks, Yong!

The dependence never crosses files, but since HDFS splits files into blocks, it may cross blocks, which makes it difficult to write an MR job. I don't quite understand what you mean by "WholeFileInputFormat". Actually, I have no idea how to deal with dependence across blocks.

2013/12/31 java8964 <java8964@hotmail.com>

> I don't know of any example of IIS log files, but from what you described,
> it looks like analyzing one line of log data depends on some previous
> lines' data. You should be clearer about what this dependence is and what
> you are trying to do.
>
> Just based on your questions, you have several options; which one is
> better depends on your requirements and data.
>
> 1) You know the existing default TextInputFormat is not suitable for your
> case, so you need to find an alternative, or write your own.
> 2) If the dependences never cross files, just lines, you can use a
> WholeFileInputFormat (no such class comes with Hadoop itself, but it is
> very easy to implement yourself).
> 3) If the dependences cross files, then you may have to enforce your
> business logic on the reducer side, instead of the mapper side.
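The cross-line dependence behind option 2 comes from the W3C format IIS uses: `#Fields:` directive lines redefine the columns for every data line that follows, so a line cannot be parsed in isolation. Below is a minimal pure-Python sketch of that parsing logic applied to a whole file at once, which is what a WholeFileInputFormat buys you; a real InputFormat would be Java code against the Hadoop API, and the field names here are typical IIS examples, not taken from the thread.

```python
def parse_iis_log(text):
    """Parse W3C-format IIS log text into a list of dicts.

    '#Fields:' directives redefine the column layout for all data
    lines that follow -- exactly the earlier-line dependence that
    breaks naive per-line (per-block) processing.
    """
    fields = None
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Directive line: reset the schema for subsequent lines.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            continue  # other directives (#Version, #Date) carry no columns
        elif fields is not None:
            records.append(dict(zip(fields, line.split())))
    return records

sample = """#Fields: date time c-ip cs-uri-stem
2013-12-30 15:58:57 10.0.0.1 /index.html
#Fields: date time c-ip cs-uri-stem sc-status
2013-12-30 15:59:01 10.0.0.2 /login 200
"""
recs = parse_iis_log(sample)
print(recs[0]["cs-uri-stem"])   # /index.html
print(recs[1]["sc-status"])     # 200
```

Note that the second record has a column (`sc-status`) the first lacks; parsing the whole file with its directives in view is what makes that recoverable.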
> Without knowing your detailed requirements for this dependence, it is hard
> to give more detail, but you need to find out what good KEY candidates are
> for your dependence logic, send the data to the reducers based on that,
> and enforce your logic on the reducer side. If one MR job is not enough to
> resolve the dependence, you may need to chain several MR jobs together.
>
> Yong
>
> ------------------------------
> Date: Mon, 30 Dec 2013 15:58:57 +0800
> Subject: any suggestions on IIS log storage and analysis?
> From: raofengyun@gmail.com
> To: user@hadoop.apache.org
>
> Hi,
>
> HDFS splits files into blocks, and MapReduce runs a map task for each
> block. However, fields can change within IIS log files, which means fields
> in one block may depend on another block, making the files unsuitable for
> a MapReduce job as-is. It seems some preprocessing is needed before
> storing and analyzing the IIS log files. We plan to parse each line into
> the same fields and store them in Avro files with compression. Any other
> alternatives? HBase? Or any suggestions on analyzing IIS log files?
>
> thanks!
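The reducer-side approach from option 3 can be sketched as a pure-Python simulation of the map/shuffle/reduce flow: key each line by its source file so all lines that can depend on one another meet at the same reducer, and carry the line offset in the value so the reducer can restore original order before parsing. All names here (the functions, the sample file name) are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def mapper(filename, offset, line):
    # KEY candidate: the source file name, so dependent lines
    # (directives and the data lines they govern) co-locate.
    yield filename, (offset, line)

def reducer(filename, values):
    fields = None
    for _, line in sorted(values):          # restore original file order
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif not line.startswith("#") and fields:
            yield dict(zip(fields, line.split()))

# Simulated shuffle: group mapper output by key.
lines = [
    ("u_ex131230.log", 0, "#Fields: date c-ip"),
    ("u_ex131230.log", 1, "2013-12-30 10.0.0.1"),
]
groups = defaultdict(list)
for fn, off, ln in lines:
    for key, value in mapper(fn, off, ln):
        groups[key].append(value)
records = [r for key, vals in groups.items() for r in reducer(key, vals)]
print(records[0]["c-ip"])   # 10.0.0.1
```

The reducer's output dicts map naturally onto the Avro records the original question proposes: once every line carries the same named fields, the cross-block dependence is gone and downstream jobs can split freely.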