Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Tue, 6 Aug 2013 19:29:48 +0000 (UTC)
From: "BitsOfInfo (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731149#comment-13731149 ]

BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------

Asokan: Sure, go ahead and make whatever changes are necessary. I have no time to work on this anymore, but I would like to see it put into the project; I had a use for it when I created it, and I'm sure others do as well.

BTW: I never had my original question from a few years ago about the "design" answered; maybe I was missing something.

bq. "Hmm, ok, do you have a suggestion on how I detect where one record begins and one record ends when records are not identifiable by any sort of consistent "start" character or "end" character "boundary" but just flow together? I could see the RecordReader detecting that it only read < RECORD LENGTH bytes and hitting the end of the split and discarding it. But I am not sure how it would detect the start of a record, with a split that has partial data at the start of it. Especially if there is no consistent boundary/char marker that identifies the start of a record."

> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Any
>            Reporter: BitsOfInfo
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into the mapreduce.lib.input package. These two classes can be used when you need to read data from files containing fixed length (fixed width) records. Such files have no CR/LF (or any combination thereof), no delimiters etc., but each record is a fixed length, and extra data is padded with spaces. The data is one gigantic line within a file.
> Provided are two classes: FixedLengthInputFormat and its corresponding FixedLengthRecordReader.
> When creating a job that specifies this input format, the job must have the "mapreduce.input.fixedlengthinputformat.record.length" property set as follows:
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length", [myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]);
> This input format overrides computeSplitSize() to ensure that InputSplits do not contain any partial records, since with fixed-length records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record. The override of computeSplitSize() delegates to FileInputFormat's compute method, and then adjusts the returned split size as follows: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)
> This suite of fixed-length input format classes does not support compressed files.
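
For illustration, the split-size adjustment described in the issue can be sketched in plain Java. This is only a minimal sketch of the arithmetic, not the code from the attached patches: the class name SplitAligner and the method alignSplitSize are hypothetical, Hadoop's FileInputFormat is left out entirely, and the fallback to one record when the proposed size is smaller than a record is an assumption added here.

```java
public class SplitAligner {

    /**
     * Rounds a proposed split size down to a whole number of records,
     * i.e. floor(proposedSplitSize / recordLength) * recordLength.
     * Hypothetical helper for illustration only.
     */
    public static long alignSplitSize(long proposedSplitSize, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        // Long division truncates, which equals floor for non-negative sizes.
        long aligned = (proposedSplitSize / recordLength) * recordLength;
        // Assumption: never return 0; use at least one record per split.
        return Math.max(aligned, (long) recordLength);
    }

    public static void main(String[] args) {
        long proposed = 64L * 1024 * 1024; // 67108864 bytes
        int recordLength = 100;
        // prints 67108800
        System.out.println(alignSplitSize(proposed, recordLength));
    }
}
```

Because every split is a multiple of the record length, each split offset is also a multiple of it, so every split begins on a record boundary. Only the final split of a file can end short, and only if the file length itself is not a multiple of the record length.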