Date: Tue, 31 Dec 2013 09:39:58 +0800
Subject: Re: any suggestions on IIS log storage and analysis?
From: Fengyun RAO <raofengyun@gmail.com>
To: user@hadoop.apache.org

Thanks, Yong!

The dependence never crosses files, but since HDFS splits files into blocks, it may cross blocks, which makes it difficult to write an MR job. I don't quite understand what you mean by "WholeFileInputFormat". Actually, I have no idea how to deal with dependence across blocks.

2013/12/31 java8964 <java8964@hotmail.com>

> I don't know of any example of IIS log files, but from what you described,
> it looks like analyzing one line of log data depends on some previous
> lines' data. You should be clearer about what this dependence is and what
> you are trying to do.
>
> Just based on your questions, you have several options; which one is
> better depends on your requirements and data.
>
> 1) You know the existing default TextInputFormat is not suitable for your
> case, so you need to find an alternative, or write your own.
> 2) If the dependences never cross files, just lines, you can use a
> WholeFileInputFormat (no such class comes with Hadoop itself, but it is
> very easy to implement yourself).
> 3) If the dependences cross files, then you may have to enforce your
> business logic on the reducer side, instead of the mapper side.
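The cross-line dependence behind option 2 comes from the W3C format IIS uses: `#Fields:` directive lines redefine the columns for every data line that follows, so a line cannot be parsed in isolation. Below is a minimal pure-Python sketch of that parsing logic applied to a whole file at once, which is what a WholeFileInputFormat buys you; a real InputFormat would be Java code against the Hadoop API, and the field names here are typical IIS examples, not taken from the thread.

```python
def parse_iis_log(text):
    """Parse W3C-format IIS log text into a list of dicts.

    '#Fields:' directives redefine the column layout for all data
    lines that follow -- exactly the earlier-line dependence that
    breaks naive per-line (per-block) processing.
    """
    fields = None
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Directive line: reset the schema for subsequent lines.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            continue  # other directives (#Version, #Date) carry no columns
        elif fields is not None:
            records.append(dict(zip(fields, line.split())))
    return records

sample = """#Fields: date time c-ip cs-uri-stem
2013-12-30 15:58:57 10.0.0.1 /index.html
#Fields: date time c-ip cs-uri-stem sc-status
2013-12-30 15:59:01 10.0.0.2 /login 200
"""
recs = parse_iis_log(sample)
print(recs[0]["cs-uri-stem"])   # /index.html
print(recs[1]["sc-status"])     # 200
```

Note that the second record has a column (`sc-status`) the first lacks; parsing the whole file with its directives in view is what makes that recoverable.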
> Without knowing your detailed requirements for this dependence, it is hard
> to give more detail, but you need to find out what good KEY candidates are
> for your dependence logic, send the data to the reducers based on that,
> and enforce your logic on the reducer side. If one MR job is not enough to
> resolve the dependence, you may need to chain several MR jobs together.
>
> Yong
>
> ------------------------------
> Date: Mon, 30 Dec 2013 15:58:57 +0800
> Subject: any suggestions on IIS log storage and analysis?
> From: raofengyun@gmail.com
> To: user@hadoop.apache.org
>
> Hi,
>
> HDFS splits files into blocks, and MapReduce runs a map task for each
> block. However, fields can change within IIS log files, which means fields
> in one block may depend on another block, making the files unsuitable for
> a MapReduce job as-is. It seems some preprocessing is needed before
> storing and analyzing the IIS log files. We plan to parse each line into
> the same fields and store them in Avro files with compression. Any other
> alternatives? HBase? Or any suggestions on analyzing IIS log files?
>
> thanks!
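The reducer-side approach from option 3 can be sketched as a pure-Python simulation of the map/shuffle/reduce flow: key each line by its source file so all lines that can depend on one another meet at the same reducer, and carry the line offset in the value so the reducer can restore original order before parsing. All names here (the functions, the sample file name) are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def mapper(filename, offset, line):
    # KEY candidate: the source file name, so dependent lines
    # (directives and the data lines they govern) co-locate.
    yield filename, (offset, line)

def reducer(filename, values):
    fields = None
    for _, line in sorted(values):          # restore original file order
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif not line.startswith("#") and fields:
            yield dict(zip(fields, line.split()))

# Simulated shuffle: group mapper output by key.
lines = [
    ("u_ex131230.log", 0, "#Fields: date c-ip"),
    ("u_ex131230.log", 1, "2013-12-30 10.0.0.1"),
]
groups = defaultdict(list)
for fn, off, ln in lines:
    for key, value in mapper(fn, off, ln):
        groups[key].append(value)
records = [r for key, vals in groups.items() for r in reducer(key, vals)]
print(records[0]["c-ip"])   # 10.0.0.1
```

The reducer's output dicts map naturally onto the Avro records the original question proposes: once every line carries the same named fields, the cross-block dependence is gone and downstream jobs can split freely.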