hadoop-mapreduce-issues mailing list archives

From "BitsOfInfo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
Date Fri, 20 Nov 2009 01:44:39 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780367#action_12780367 ]

BitsOfInfo commented on MAPREDUCE-1176:

>>>Why can't you just keep defaultSize and recordLength as longs?

Because findbugs threw warnings if they were not cast; besides, the code works as expected.
Please just send over how you want that calculation rewritten and I can certainly change it.

>>> - In isSplitable, you catch the exception generated by getRecordLength and turn off splitting.
>>> If there is no record length specified doesn't that mean the input format won't work at all?

Nope, it would still work; I have yet to see an original raw data file of fixed-width records
that, for some reason, does not contain complete records. But that's fine, we can just exit
here to let the user know they need to configure that property. If there is a better place
to check for the existence of that property, please let me know.
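A minimal sketch of such a fail-fast check, using the property key documented in this issue. The class name, the use of java.util.Properties in place of Hadoop's Configuration, and the exact error messages are illustrative assumptions, not the patch's actual code:

```java
import java.io.IOException;
import java.util.Properties;

public class RecordLengthCheck {
    // Property key as documented in this issue's description.
    static final String FIXED_RECORD_LENGTH =
        "mapreduce.input.fixedlengthinputformat.record.length";

    // Fails fast with a clear message when the property is absent or
    // invalid, instead of silently disabling splitting.
    static int getRecordLength(Properties conf) throws IOException {
        String value = conf.getProperty(FIXED_RECORD_LENGTH);
        if (value == null) {
            throw new IOException(
                "Missing required property: " + FIXED_RECORD_LENGTH);
        }
        int length = Integer.parseInt(value);
        if (length <= 0) {
            throw new IOException(
                "Record length must be positive, got: " + length);
        }
        return length;
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.setProperty(FIXED_RECORD_LENGTH, "80");
        System.out.println(getRecordLength(conf)); // 80
    }
}
```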

>>> - FixedLengthRecordReader: "This record reader does not support compressed files." Is this true?

Correct, as stated in the docs. The reason is that when I wrote this I was not dealing with
compressed files. Secondly, if an input file were compressed, I was not sure of the procedure
to properly compute the splits against it: the byte lengths of the records would differ in
compressed form vs. what is passed to the RecordReader.

>>> - Throughout, you've still got 4-space indentation in the method bodies. Indentation should be by 2.

Does anyone know of an automated tool that will fix this? It's driving me nuts going line by
line and hitting delete twice... When I look at this in Eclipse I am not seeing 4 spaces.

>>> - In FixedLengthRecordReader, you hard code a 64KB buffer. Why's this? You should let the filesystem use its default.

Sure, I can get rid of that.

>>> - In your read loop, you're not accounting for the case of read returning 0 or -1, which I believe can happen at EOF, right? Consider using o.a.h.io.IOUtils.readFully() to replace this loop.

Ditto, I can change to that.
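For illustration, a self-contained loop with the semantics the reviewer is pointing at: keep reading until the buffer is full and treat a -1 return (EOF) as an error, which is the behavior of o.a.h.io.IOUtils.readFully. The helper below is a stand-in sketch, not the Hadoop implementation itself:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    // Loops until len bytes have been read. InputStream.read may return
    // fewer bytes than requested; a negative return means EOF was hit
    // before the record was complete, so we fail instead of returning a
    // partial record.
    static void readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        while (len > 0) {
            int n = in.read(buf, off, len);
            if (n < 0) {
                throw new EOFException(
                    "Premature EOF: still needed " + len + " bytes");
            }
            off += n;
            len -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[5];
        InputStream in = new ByteArrayInputStream("HELLOWORLD".getBytes());
        readFully(in, record, 0, record.length);
        System.out.println(new String(record)); // HELLO
    }
}
```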

>>> As a general note, I'm not sure I agree with the design here. Rather than forcing the split to lie on record boundaries,

Ok, that's fine; I just wanted to contribute what I wrote, which is working for my case.

>>> open the record reader, skip forward to the next record boundary 

Hmm, ok. Do you have a suggestion on how to detect where one record begins and another ends
when records are not identifiable by any consistent "start" or "end" boundary character, but
just flow together? I could see the RecordReader detecting that it read fewer than RECORD
LENGTH bytes when hitting the end of the split, and discarding them. But I am not sure how
it would detect the start of a record in a split that has partial data at its start,
especially if there is no consistent boundary/marker character that identifies the start of
a record.
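One answer to this question, sketched under the reviewer's assumption: with fixed-length records, boundaries fall at exact multiples of the record length from the start of the file, so a reader handed an arbitrary split offset can compute where the next record begins by arithmetic alone, with no marker characters needed. The class and method names here are hypothetical:

```java
public class BoundaryDemo {
    // With fixed-length records, record N starts at byte N * recordLength,
    // so the next boundary at or after any split offset is just the offset
    // rounded up to a multiple of the record length.
    static long nextRecordStart(long splitStart, long recordLength) {
        long remainder = splitStart % recordLength;
        return remainder == 0
            ? splitStart
            : splitStart + (recordLength - remainder);
    }

    public static void main(String[] args) {
        // A split starting mid-record at byte 103 with 10-byte records
        // skips forward to byte 110; byte 100 is already a boundary.
        System.out.println(nextRecordStart(103, 10)); // 110
        System.out.println(nextRecordStart(100, 10)); // 100
    }
}
```

The record the split cuts into at its start would then be read by whichever reader owns the preceding split, which is how TextInputFormat handles lines that straddle split boundaries.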

> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Any
>            Reporter: BitsOfInfo
>            Priority: Minor
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch
> Hello,
> I would like to contribute the following two classes for incorporation into the mapreduce.lib.input
package. These two classes can be used when you need to read data from files containing fixed
length (fixed width) records. Such files have no CR/LF (or any combination thereof), no delimiters
etc, but each record is a fixed length, and extra data is padded with spaces. The data is
one gigantic line within a file.
> Provided are two classes: FixedLengthInputFormat and its corresponding FixedLengthRecordReader.
When creating a job that specifies this input format, the job must have the "mapreduce.input.fixedlengthinputformat.record.length"
property set as follows:
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]);
> This input format overrides computeSplitSize() in order to ensure that InputSplits do
not contain any partial records since with fixed records there is no way to determine where
a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader
will start at the beginning of a record, and the last byte in the InputSplit will be the last
byte of a record. The override of computeSplitSize() delegates to FileInputFormat's compute
method, and then adjusts the returned split size by doing the following: (Math.floor(fileInputFormatsComputedSplitSize
/ fixedRecordLength) * fixedRecordLength)
> This suite of fixed length input format classes does not support compressed files.
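The split-size adjustment described in the issue can be sketched as follows; integer division in Java already floors for non-negative values, matching the Math.floor in the description. The class and method names are illustrative, not the patch's actual code:

```java
public class SplitSizeDemo {
    // Rounds the split size computed by FileInputFormat down to the
    // nearest multiple of the record length, so no split ends mid-record.
    static long adjustSplitSize(long computedSplitSize,
                                long fixedRecordLength) {
        return (computedSplitSize / fixedRecordLength) * fixedRecordLength;
    }

    public static void main(String[] args) {
        // e.g. a 67,108,864-byte (64MB) computed split with 100-byte
        // records becomes 67,108,800 bytes, an exact multiple of 100.
        System.out.println(adjustSplitSize(67108864L, 100L)); // 67108800
    }
}
```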

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
