hadoop-common-issues mailing list archives

From "Sunil Goyal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7404) Data Blocks Splitting should be record oriented or provide an option to give the splitting locations (offsets) as an input file
Date Mon, 20 Jun 2011 18:34:47 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052139#comment-13052139 ]

Sunil Goyal commented on HADOOP-7404:

Hi Harsh,
        Thanks for your response and for the feedback on the problem. Yes, splitting the file
into smaller chunks would be a viable option for me, though I do not know how much advantage I
would gain from it. Let me explain the problem in more detail (maybe you can suggest a better
solution for it). A simplified version is as follows:

Consider a directory containing a large number of files (20,000 to 1 million). Each file
contains data in the same format (say text). I have concatenated this whole directory into a
single file (20 GB to 100 GB). Now I want to perform some operation on some of these files and
then dump the result back out as a single file again.
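For illustration only, the merge step I describe could record each input file's start offset while concatenating, which is exactly the offset list I would later want Hadoop to split on. This is a minimal plain-Java sketch (hypothetical class and file names, java.nio only, not any Hadoop API):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConcatWithIndex {
    // Concatenate the given files into one merged file and return the
    // start offset of each input file inside the merged file.
    static List<Long> concat(List<Path> inputs, Path merged) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path in : inputs) {
                offsets.add(pos);                       // where this file begins
                byte[] data = Files.readAllBytes(in);
                out.write(data);
                pos += data.length;
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("concat-demo");
        Path a = Files.write(dir.resolve("a.txt"), "hello\n".getBytes());
        Path b = Files.write(dir.resolve("b.txt"), "world!\n".getBytes());
        List<Long> offsets = concat(Arrays.asList(a, b), dir.resolve("merged.txt"));
        System.out.println(offsets); // start offsets of a.txt and b.txt: [0, 6]
    }
}
```

The returned offset list is the "input file of splitting locations" proposed below; dumping it next to the merged file would let any tool recover the original per-file boundaries.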

  If I am able to split the file at those offsets, each piece can be processed independently,
and I can gain the maximum advantage of Hadoop's processing nodes. Splitting according to
offset would therefore be a very useful option for me: it would give a great advantage and
remove one extra step.
I have no idea about FSes. Can you please point me to some documentation for them? (I was
unable to find it via Google.)

Sunil Goyal 

> Data Blocks Splitting should be record oriented or provide an option to give the splitting
locations (offsets) as an input file
> -------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-7404
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7404
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Sunil Goyal
> Old Bug :  https://issues.apache.org/jira/browse/HADOOP-106
> It is difficult to do padding in the existing records, for the following reasons:
> 1. Records have different sizes (some may be bytes, some may be GB) but are in the same file.
> 2. Padding causes compatibility issues with other standard tools.
> 3. It increases the file size without any benefit to other tools (those not running on Hadoop).
> I think there should be an option for the splitting process like this:
> 1. A file contains the offsets at which the splits should be made (like 10, 100, 120, and
so on).
> 2. Hadoop should do the splitting according to it (10 - 0 = 10, 100 - 10 = 90, etc.).
> 3. This file can be generated easily by other tools.
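The arithmetic in step 2 above, turning a list of boundary offsets into (start, length) split pairs, can be sketched in plain Java. This is only an illustration of the proposed behavior (hypothetical class name, not the actual Hadoop InputFormat machinery):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetSplits {
    // Given split boundaries (exclusive end offsets, sorted ascending) and
    // the total file length, return {start, length} pairs. For example,
    // boundaries [10, 100] on a 120-byte file yield (0,10), (10,90), (100,20).
    static List<long[]> splits(List<Long> boundaries, long fileLength) {
        List<long[]> result = new ArrayList<>();
        long start = 0;
        for (long end : boundaries) {
            result.add(new long[] { start, end - start }); // e.g. 100 - 10 = 90
            start = end;
        }
        if (start < fileLength) {
            result.add(new long[] { start, fileLength - start }); // tail split
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : splits(Arrays.asList(10L, 100L), 120L)) {
            System.out.println("start " + s[0] + ", length " + s[1]);
        }
    }
}
```

Each resulting (start, length) pair is independent of the others, which is what would let Hadoop assign the pieces to different map tasks without splitting mid-record.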

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

