From: "Abdul Qadeer (JIRA)"
To: core-dev@hadoop.apache.org
Date: Wed, 3 Sep 2008 18:23:44 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-4010) Changing LineRecordReader algo so that it does not need to skip backwards in the stream
Message-ID: <1651906312.1220491424333.JavaMail.jira@brutus>
In-Reply-To: <1006973248.1219452044208.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628224#action_12628224 ]

Abdul Qadeer commented on HADOOP-4010:
--------------------------------------

bq. Due to the new LineRecordReader algorithm, the first split will process one more line than the other mappers.

bq. That's probably not going to be acceptable to users of NLineInputFormat. Users employing N formatted lines to initialize and run a mapper may find their jobs no longer work if the input is offset or if a map receives N+1 lines. If this is necessary for the new algorithm, rewriting or somehow accommodating this case may be required.

I have changed NLineInputFormat to work with the new LineRecordReader algorithm; the diff of the file follows. After this change, I did not need to make any change in the TestLineInputFormat test case.

--- src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java      (revision 687954)
+++ src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java      (working copy)
@@ -93,10 +93,19 @@
       long begin = 0;
       long length = 0;
       int num = -1;
-      while ((num = lr.readLine(line)) > 0) {
+      while ((num = lr.readLine(line)) > 0) {
         numLines++;
         length += num;
         if (numLines == N) {
+          // NLineInputFormat uses LineRecordReader, which
+          // always reads at least one character out of its
+          // upper split boundary. So to use LineRecordReader
+          // such that there are N lines in each split, we
+          // move back the upper split limits of each split
+          // by one character.
+          if (begin == 0) {
+            length--;
+          }
           splits.add(new FileSplit(fileName, begin, length, new String[]{}));
           begin += length;
           length = 0;
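To see why decrementing length only for the first split is enough, here is a small standalone sketch of my own (not part of the patch; the printed intervals stand in for FileSplit objects, and the numLines reset happens outside the visible hunk) that runs the same loop for N = 2 over five 10-byte lines. Because begin += length carries the one-character shift forward, every split's upper limit lands one character short of a line boundary, so each LineRecordReader reads exactly N lines:

// Standalone illustration of the split arithmetic in the patch above.
// Not Hadoop code; intervals printed here stand in for FileSplit objects.
public class SplitOffsetDemo {
  public static void main(String[] args) {
    int N = 2;                                 // lines per split
    int[] lineLengths = {10, 10, 10, 10, 10};  // bytes per line, newline included

    long begin = 0;
    long length = 0;
    int numLines = 0;
    for (int num : lineLengths) {
      numLines++;
      length += num;
      if (numLines == N) {
        // Same one-character adjustment as the patch: only the first
        // split is shortened; 'begin += length' then shifts every
        // later split back by that same one character.
        if (begin == 0) {
          length--;
        }
        System.out.println("split [" + begin + ", " + (begin + length) + ")");
        begin += length;
        length = 0;
        numLines = 0;  // reset is outside the hunk shown in the diff
      }
    }
    if (numLines > 0) {
      System.out.println("split [" + begin + ", " + (begin + length) + ")");
    }
    // Expected output:
    // split [0, 19)   -- ends one byte before the end of line 2
    // split [19, 39)  -- ends one byte before the end of line 4
    // split [39, 50)  -- tail split with the remaining line
  }
}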
> Changing LineRecordReader algo so that it does not need to skip backwards in the stream
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4010
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch
>
>
> The current algorithm of the LineRecordReader needs to move backwards in the stream (in its constructor) to position itself correctly. It moves back one byte from the start of its split, reads a record (i.e. a line), and throws that line away, because the line is sure to be handled by some other mapper. This algorithm is awkward and inefficient when used on a compressed stream, where data reaches the LineRecordReader through a codec. (In the current implementation, Hadoop does not split a compressed file; it makes a single split from the start to the end of the file, so only one mapper handles it. We are currently working on a BZip2 codec that supports splitting in Hadoop, so this proposed change will make it possible to handle plain and compressed streams uniformly.)
> In the new algorithm, each mapper always skips its first line, because that line is sure to have been read by some other mapper. Each mapper must then finish its reading at a record boundary that lies beyond its upper split limit. With this change, LineRecordReader never needs to move backwards in the stream.
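As a companion to the description above, here is a minimal, hypothetical sketch of the forward-only rule. The class and all names (splitStart, splitEnd, readLine) are mine, not Hadoop's; the real LineRecordReader adds buffering, Text handling, and codec support. One detail the description glosses over, which the NLineInputFormat comment above compensates for, is that the mapper whose split starts at offset 0 has no predecessor, so it must keep its first line. The two essential points are in the constructor (skip the first line unless the split starts at offset 0) and in next() (keep returning lines until one completes at or past the upper split limit):

// Sketch only: a forward-only line reader over a plain InputStream that
// the caller has already positioned at splitStart. No backwards seeks.
import java.io.IOException;
import java.io.InputStream;

public class ForwardOnlyLineReader {
  private final InputStream in;
  private final long splitEnd;
  private long pos;

  public ForwardOnlyLineReader(InputStream in, long splitStart, long splitEnd)
      throws IOException {
    this.in = in;
    this.splitEnd = splitEnd;
    this.pos = splitStart;
    // Any split that does not begin at offset 0 discards its first
    // (possibly partial) line: the previous mapper read past its own
    // upper limit to finish that line.
    if (splitStart != 0) {
      readLine(new StringBuilder());
    }
  }

  /** Returns false once a full line has been consumed at or past splitEnd. */
  public boolean next(StringBuilder record) throws IOException {
    if (pos > splitEnd) {
      return false;  // the line straddling splitEnd was already returned
    }
    record.setLength(0);
    return readLine(record);
  }

  // Reads one '\n'-terminated line, advancing pos; may read past splitEnd,
  // which is exactly the "at least one character out of its upper split
  // boundary" behavior the NLineInputFormat patch accounts for.
  private boolean readLine(StringBuilder sb) throws IOException {
    int b;
    boolean sawAny = false;
    while ((b = in.read()) != -1) {
      pos++;
      sawAny = true;
      if (b == '\n') {
        return true;
      }
      sb.append((char) b);
    }
    return sawAny;  // a final line without a trailing newline still counts
  }
}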