From: java8964
To: user@hadoop.apache.org
Subject: RE: What if file format is dependent upon first few lines?
Date: Thu, 27 Feb 2014 09:17:08 -0500

If the file is big enough and you want to split it for parallel processing, then one option is that in your mapper you can always get the full file path from the InputSplit, then open the file (having the path means you can read from the beginning), read the first 4 lines, and, based on their content, process the current split.

I believe a file in HDFS can support concurrent reads without any problem.
Yong
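
To make the "read the first 4 lines" step concrete, here is a minimal sketch of the header-reading part only. The class and method names (IisHeaderReader, readFieldNames) are made up for illustration. In a mapper's setup() the reader would wrap the stream returned by FileSystem.open(path), with the path taken from ((FileSplit) context.getInputSplit()).getPath(); that wiring is left out so the sketch runs without Hadoop, and it assumes an input format whose splits are FileSplits.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of the original thread.
class IisHeaderReader {
    private static final String FIELDS_PREFIX = "#Fields:";

    /**
     * Reads the leading '#' directive lines and returns the names listed
     * after "#Fields:". Reading stops at the first non-directive line, so
     * only the header at the start of the file is consumed.
     */
    static List<String> readFieldNames(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null && line.startsWith("#")) {
            if (line.startsWith(FIELDS_PREFIX)) {
                String names = line.substring(FIELDS_PREFIX.length()).trim();
                if (names.isEmpty()) {
                    return new ArrayList<>();
                }
                return new ArrayList<>(Arrays.asList(names.split("\\s+")));
            }
        }
        return new ArrayList<>(); // no #Fields directive in the header
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream a mapper would open on the split's file.
        String sample = "#Software: Microsoft Internet Information Services 7.5\n"
                + "#Version: 1.0\n"
                + "#Date: 2013-07-04 20:00:00\n"
                + "#Fields: date time s-ip cs-method\n"
                + "2013-07-04 20:00:00 1.1.1.1 GET\n";
        System.out.println(readFieldNames(new BufferedReader(new StringReader(sample))));
    }
}
```

Doing this once per mapper (in setup() rather than in every map() call) keeps the extra read to a few lines per split.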


Date: Thu, 27 Feb 2014 17:59:38 +0800
Subject: What if file format is dependent upon first few lines?
From: raofengyun@gmail.com
To: user@hadoop.apache.org
Below is a fake sample of a Microsoft IIS log:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
...

The first four lines describe the file format, which is required to parse each log line. This means the log file can NOT simply be split; otherwise the second split would lose the "file format" information.
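
To illustrate why a data line cannot be interpreted without the header, here is a small sketch that pairs the values of one log line with the names from the #Fields directive. IisLineParser and parseLine are hypothetical names, and splitting on whitespace is an assumption that holds for the sample above, where no value contains a space.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
class IisLineParser {
    /**
     * Pairs each whitespace-separated value in a log line with the field
     * name at the same position in the #Fields directive.
     */
    static Map<String, String> parseLine(List<String> fieldNames, String line) {
        String[] values = line.trim().split("\\s+");
        if (values.length != fieldNames.size()) {
            throw new IllegalArgumentException(
                    "expected " + fieldNames.size() + " values but got " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>(); // keeps field order
        for (int i = 0; i < values.length; i++) {
            record.put(fieldNames.get(i), values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // Field names copied from the #Fields line of the sample log.
        List<String> fields = Arrays.asList(
                "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
                "s-port", "cs-username", "c-ip", "cs(User-Agent)", "sc-status",
                "sc-substatus", "sc-win32-status", "time-taken");
        String line = "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390";
        System.out.println(parseLine(fields, line));
    }
}
```

A mapper that only sees the second split has the data lines but not the field list, so it has nothing to pass as fieldNames.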

How could each mapper get the first few lines in the file?