hadoop-common-user mailing list archives

From Cao Kang <weliam...@gmail.com>
Subject Re: File Split
Date Tue, 22 Dec 2009 16:27:30 GMT
Hi Amareshwari,
Thanks for your replies. They are really good suggestions.
But I have one more question. HDFS splits the input file into 64 MB
blocks sequentially, by byte offset in the input file, right?
That seems to conflict with the idea of splitting the image into sub-images
by its four corners. Is there a way to configure HDFS to make it compatible
with the image split?
Many thanks!


Cao
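
The mismatch described above can be made concrete with a small sketch. It
assumes a raw, uncompressed, row-major pixel layout with no header, which is
a simplification (a real .tif may be stored in strips or tiles and may be
compressed). TileBytes, rowOffset and readTile are hypothetical names, not
Hadoop APIs. Under that assumption, each row of a sub-image is one contiguous
byte run, but consecutive rows are far apart in the file, so a tile is a set
of byte ranges rather than a single (offset, length) pair.

```java
/** Hypothetical helper: locates the bytes of one tile in a raw row-major image. */
class TileBytes {

    /** Byte offset, within the whole image, of row `row` of the tile. */
    static long rowOffset(int imageWidth, int bytesPerPixel,
                          int tileX, int tileY, int row) {
        return ((long) (tileY + row) * imageWidth + tileX) * bytesPerPixel;
    }

    /**
     * Copies one tile out of an image held in memory.
     * Assumption: in a real job the array copy would be replaced by a seek
     * and read on the input stream for each row segment.
     */
    static byte[] readTile(byte[] image, int imageWidth, int bytesPerPixel,
                           int tileX, int tileY, int tileW, int tileH) {
        int rowBytes = tileW * bytesPerPixel;
        byte[] tile = new byte[rowBytes * tileH];
        for (int row = 0; row < tileH; row++) {
            int src = (int) rowOffset(imageWidth, bytesPerPixel, tileX, tileY, row);
            System.arraycopy(image, src, tile, row * rowBytes, rowBytes);
        }
        return tile;
    }

    public static void main(String[] args) {
        // 4x4 image, one byte per pixel, pixel values 0..15 in row-major order.
        byte[] image = new byte[16];
        for (int i = 0; i < image.length; i++) {
            image[i] = (byte) i;
        }
        // Top-right quadrant: x=2, y=0, 2x2 pixels.
        byte[] tile = readTile(image, 4, 1, 2, 0, 2, 2);
        System.out.println(java.util.Arrays.toString(tile));
    }
}
```

HDFS blocks are a storage detail: a client reads the file as one continuous
byte stream and can seek across block boundaries, so the logical split does
not have to follow the 64 MB blocks. The likely cost is data locality, since
one tile's rows may live in several blocks.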

On Mon, Dec 21, 2009 at 11:43 PM, Amareshwari Sri Ramadasu <
amarsri@yahoo-inc.com> wrote:

> Hi Cao,
>
> My answers are inline.
>
> On 12/21/09 8:42 PM, "Cao Kang" <weliam.kc@gmail.com> wrote:
>
> Hi Amareshwari,
> Thanks for your reply.
> But another question is, where and how should I define the split
> boundaries?
> Should I define them in the FileSplit constructor?
>
> I don't think you can extend FileSplit directly. I think you should write
> your own split, say ImageSplit, in which you can represent your image fully.
> For example, FileSplit represents the split using an offset and a length.
> You may need all four co-ordinates of your image.
>
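
A minimal sketch of what such an ImageSplit might carry is shown here. It is
plain Java with no Hadoop dependency, so it only mirrors the shape of the
Writable methods (write and readFields); in a real job the class would also
have to satisfy Hadoop's InputSplit contract (for example getLength and
getLocations). The field names and the roundTrip helper are assumptions made
for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical split describing one rectangular tile of an image file. */
class ImageSplit {
    String path;        // image file the tile belongs to
    int x, y;           // top-left corner of the tile, in pixels
    int width, height;  // tile size, in pixels

    // No-arg constructor: a split is typically created empty and then
    // filled in by readFields when it is deserialized.
    ImageSplit() {}

    ImageSplit(String path, int x, int y, int width, int height) {
        this.path = path;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Same shape as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeInt(x);
        out.writeInt(y);
        out.writeInt(width);
        out.writeInt(height);
    }

    // Same shape as Writable.readFields(DataInput); reads in write order.
    void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        x = in.readInt();
        y = in.readInt();
        width = in.readInt();
        height = in.readInt();
    }

    /** Demo helper: serializes a split and rebuilds it from the bytes. */
    static ImageSplit roundTrip(ImageSplit split) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            split.write(new DataOutputStream(buffer));
            ImageSplit copy = new ImageSplit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The split holds only the tile's coordinates, not its pixels; reading the
pixel data is left to the RecordReader.
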
> Furthermore, as far as I have seen, all the examples use LongWritable to
> represent the offset of that split in the input file. What if the split is
> not sequential?
>
> Yes. FileSplit is used for representing text data.
> For example, in the image split, the byte arrays of the sub-images
> are not sequential in the input image. The split looks like this:
>
> |---------------|---------------|
> |               |               |
> |       1       |       2       |
> |               |               |
> |---------------|---------------|
> |               |               |
> |       3       |       4       |
> |               |               |
> |---------------|---------------|
>
> Each sub-image split will consist of a byte array. Where and how should
> this be defined in the InputFormat? Many thanks.
>
> In your InputFormat, you should define the getSplits() method, which
> returns your ImageSplits.
>
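
The tiling logic behind such a getSplits() could look roughly like the sketch
below. It is plain Java, not the Hadoop method signature: TileGrid is a
hypothetical class, and each tile is returned as an int array {x, y, width,
height} where a real InputFormat would return split objects. Tiles are
emitted in row-major order, matching the numbering 1 to 4 in the diagram.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides an image into a grid of rectangular tiles. */
class TileGrid {

    /**
     * Returns one {x, y, width, height} entry per tile, row by row.
     * Sizes that do not divide evenly are absorbed by the later tiles,
     * so the tiles always cover the whole image without overlap.
     */
    static List<int[]> getSplits(int imageWidth, int imageHeight,
                                 int cols, int rows) {
        List<int[]> tiles = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            int y0 = (int) ((long) r * imageHeight / rows);
            int y1 = (int) ((long) (r + 1) * imageHeight / rows);
            for (int c = 0; c < cols; c++) {
                int x0 = (int) ((long) c * imageWidth / cols);
                int x1 = (int) ((long) (c + 1) * imageWidth / cols);
                tiles.add(new int[] {x0, y0, x1 - x0, y1 - y0});
            }
        }
        return tiles;
    }

    public static void main(String[] args) {
        // Four quadrants of a 1000x1000 image.
        for (int[] t : getSplits(1000, 1000, 2, 2)) {
            System.out.println(java.util.Arrays.toString(t));
        }
    }
}
```

Each returned tile would then be handed to one map task, so a 2x2 grid gives
four map tasks.
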
> Thanks
> Amareshwari
>
>
> On Mon, Dec 21, 2009 at 6:37 AM, Amareshwari Sri Ramadasu <
> amarsri@yahoo-inc.com> wrote:
>
> > You should implement your split to represent the split information. Then
> > you should implement getSplits in InputFormat to get the splits from your
> > input, which divides the whole input into chunks. Here, each split will be
> > given to a map task.
> > You should also define a RecordReader, which reads records from the split.
> > The map task processes one record at a time.
> >
> > See
> > http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Job+Input
> >
> > Thanks
> > Amareshwari
> >
> > On 12/21/09 2:22 AM, "Cao Kang" <cakang@clarku.edu> wrote:
> >
> > Hi,
> > I have spent several days on a custom file input format in Hadoop.
> > Basically, we need to split one giant square image (.tif) into four
> > smaller square images. Where does the split really happen? Should I
> > implement it in the "getSplits" function or in the "next" function? It's
> > quite confusing.
> > Does anyone know, or can anyone provide some examples of it? Thanks.
> >
> >
>
>
