hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3307) Archives in Hadoop.
Date Wed, 04 Jun 2008 00:10:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602124#action_12602124

Mahadev konar commented on HADOOP-3307:

Repsonse to devaraj's comments - 

> 1) The query part in the creation of the URI can be removed (in fact we probably should
flag an error if the har path contains a '?' since it is not a valid Path)

> 2) decodeURI should be done first and then the har archive path can be extracted

> 3) getHarAuth needn't be parsing the uri everytime since it is constant. The auth can
just be stored in a class variable.
will do

> 4) open() & other filesystem calls should support taking just the fragment path to
a file within the archive
makes sense

> 5) why is fileStatusInIndex storing the Store object in a list while going through the
master index? Isn't the list going to be always of size 1 (if the file is present in the archive)
no the list can be more than one since hashcode will have collisions and might end up between
different buckets

> 6) the index files are not closed in the fileStatusInIndex call. This might lead to problems
in the cases where the underlying filesystem is the localfs (where open actually returns a
filedescriptor). But I am also not sure whether we should open and close on every call to
fileStatusInIndex. Can we somehow cache the handles to the index files and reuse them.
for now I will just be closing and opening the files. Ill leave this optimization for later.

> 7) When we create a part file, can we record the things like replication factor, permissions,
etc. and emit them just like we emit the other info like partfilename, etc. during archive
creation and store them in the index file. That way we don't have to fake everything in the

this was stated in the  design that we will be ignoring permissions in this version. In  later
versions we can persist the permissions as well.  

> 8) In listStatus, the start and end braces are missing for the if/else block
will fix

> In listStatus, the check hstatus.isDir()?0:hstatus.getLength() seems redundant. hstatus.isDir
is always going to be false
will fix

> 10) I don't understand clearly why makeRelative is done in the listStatus and getFileStatus
Its just to join two paths that are assolute. 
SO a path in archive is persisted as /user/mahadev so to have this as the later componet of
the archive its just made relative to create a new Path 

> Do you enforce the .har in the archive name when it is created?

> 1) In writeTopLevelDirs, remove the comment "invert the paths"
will do

> 2) the SRC_LIST_LABEL file needs to have a high replication factor (maybe 10 or something)
makes sense

> Use NullOutputFormat instead of the HarOutputFormat
Did not know we had a nulloutputformat!! :) 

> 4) Overall the coding convention is that we have starting/terminating braces even for
single statement blocks. Please update the code w.r.t that.
will do

> Overall, the path manipulations (like makeRelative/absolute) confuses me. It'd be nice
to cleanup the code in that aspect if possible.
will try to clean up the code.

> In the testcase, it'd be nice if you read the whole file content instead of just 4 bytes,
and then validate. That'd tell you that there is no extra (spurious) bytes in the archived
files, right?
makes sense.

> Archives in Hadoop.
> -------------------
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch
> This is a new feature for archiving and unarchiving files in HDFS. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message