hadoop-common-user mailing list archives

From Amareshwari Sri Ramadasu <amar...@yahoo-inc.com>
Subject Re: File Split
Date Tue, 22 Dec 2009 04:43:02 GMT
Hi Cao,

My answers are inline.

On 12/21/09 8:42 PM, "Cao Kang" <weliam.kc@gmail.com> wrote:

Hi Amareshwari,
Thanks for your reply.
But another question is, where and how should I define the split boundaries?
Should I define it in FileSplit constructor?

I don't think you can extend FileSplit directly. I think you should write your own split
class, say ImageSplit, in which you can represent your image region fully.
For example, FileSplit represents the split using an offset and a length. You may need all
four coordinates of your image.
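
A minimal sketch of what such a split could look like. The class name ImageSplit comes from the suggestion above; the field names and the roundTrip helper are assumptions for illustration. In the old org.apache.hadoop.mapred API a split implements InputSplit, which extends Writable (write/readFields) and also requires getLength() and getLocations(). The Hadoop dependency is left out here so the sketch compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical split describing one rectangular tile of an image file.
// A real split would additionally implement Hadoop's InputSplit interface.
class ImageSplit {
    String path;  // input image file
    int x;        // left column of the tile, in pixels
    int y;        // top row of the tile, in pixels
    int width;    // tile width, in pixels
    int height;   // tile height, in pixels

    // No-arg constructor so that readFields can populate an empty instance.
    ImageSplit() { }

    ImageSplit(String path, int x, int y, int width, int height) {
        this.path = path;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Same shape as Writable.write: serialize the tile description.
    void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeInt(x);
        out.writeInt(y);
        out.writeInt(width);
        out.writeInt(height);
    }

    // Same shape as Writable.readFields: restore fields in the order written.
    void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        x = in.readInt();
        y = in.readInt();
        width = in.readInt();
        height = in.readInt();
    }

    // Illustration only: serialize a split to bytes and read it back,
    // similar to what happens when a split is shipped to a map task.
    static ImageSplit roundTrip(ImageSplit split) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            split.write(new DataOutputStream(buffer));
            ImageSplit copy = new ImageSplit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The split only carries the description of the tile (file plus rectangle), not the pixel data itself; reading the pixels is the job of the RecordReader.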

Furthermore, as far as I have seen, all examples there use LongWritable to
represent the offset of that split in the input file. What if the split is
not sequential?

Yes. FileSplit is used for representing text data.
For example, in the image split, the byte arrays of the sub-images
are not sequential in the input image. The split looks like this:

+-------+-------+
|       |       |
|   1   |   2   |
|       |       |
+-------+-------+
|       |       |
|   3   |   4   |
|       |       |
+-------+-------+
Each sub-image split will consist of an array. Where and how
should this be defined in the InputFormat? Many thanks.

In your InputFormat, you should define the getSplits() method, which returns your ImageSplits.
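
As a rough sketch of the computation getSplits() would perform for the four-tile case, the code below divides an image rectangle into a 2x2 grid. The Tile and QuadrantSplitter names are hypothetical, and the image dimensions are passed in directly; a real getSplits() would have to read them from the image header and wrap each tile in a split object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tile description: top-left corner plus size, in pixels.
class Tile {
    final int x;
    final int y;
    final int width;
    final int height;

    Tile(int x, int y, int width, int height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

class QuadrantSplitter {
    // Divides the image into a 2x2 grid, ordered left to right, top to bottom.
    // With odd dimensions the right column and bottom row get the extra pixel.
    static List<Tile> quadrants(int imageWidth, int imageHeight) {
        int halfW = imageWidth / 2;
        int halfH = imageHeight / 2;
        List<Tile> tiles = new ArrayList<>();
        tiles.add(new Tile(0, 0, halfW, halfH));                                      // tile 1
        tiles.add(new Tile(halfW, 0, imageWidth - halfW, halfH));                     // tile 2
        tiles.add(new Tile(0, halfH, halfW, imageHeight - halfH));                    // tile 3
        tiles.add(new Tile(halfW, halfH, imageWidth - halfW, imageHeight - halfH));   // tile 4
        return tiles;
    }
}
```

Each returned tile would become one split, and therefore one map task.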


On Mon, Dec 21, 2009 at 6:37 AM, Amareshwari Sri Ramadasu <
amarsri@yahoo-inc.com> wrote:

> You should implement your split to represent the split information. Then
> you should implement getSplits in InputFormat to get the splits from your
> input, which divides the whole input into chunks. Here, each split will be
> given to a map task.
> You should also define a RecordReader, which reads records from the split. The map
> task processes one record at a time.
> See
> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Job+Input
> Thanks
> Amareshwari
> On 12/21/09 2:22 AM, "Cao Kang" <cakang@clarku.edu> wrote:
> Hi,
> I have spent several days on the customized file input format in hadoop.
> Basically, we need to split one giant square-shaped image (.tif) into four
> smaller square-shaped images. Where does the split really happen?  Should I
> implement it in the "getSplits" function or in the "next" function? It's quite
> confusing.
> Does anyone know or can anyone provide some examples of it? Thanks.
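
On the RecordReader mentioned in the quoted message: because a tile's bytes are not contiguous in the file, the reader has to gather them row by row. The sketch below shows the offset arithmetic, assuming an uncompressed, row-major pixel buffer with a fixed number of bytes per pixel. That layout is an assumption; a real .tif may be compressed or stored in strips or tiles and would need a TIFF decoder. TileReader and readTile are hypothetical names.

```java
class TileReader {
    // Copies one rectangular tile out of a row-major pixel buffer.
    // Each tile row is a contiguous run in the source, but consecutive
    // tile rows are separated by the rest of the full image row.
    static byte[] readTile(byte[] image, int imageWidth, int bytesPerPixel,
                           int x, int y, int width, int height) {
        int rowBytes = width * bytesPerPixel;
        byte[] tile = new byte[rowBytes * height];
        for (int row = 0; row < height; row++) {
            // Byte offset of the first pixel of this tile row in the image.
            int src = ((y + row) * imageWidth + x) * bytesPerPixel;
            System.arraycopy(image, src, tile, row * rowBytes, rowBytes);
        }
        return tile;
    }
}
```

For a 4x4 image with one byte per pixel holding the values 0 to 15, the top-right 2x2 tile (x=2, y=0) comes out as the bytes 2, 3, 6, 7.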
