hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: C API for Hadoop DFS
Date Wed, 26 Apr 2006 16:56:28 GMT
Devaraj Das wrote:
> Attached is a draft of the C API specification that some of us (in Yahoo)
> have been thinking about. The specification is closely tied to the API
> exported by Hadoop's FileSystem class. 
> Will really appreciate any comments, etc. on the specification.

Overall, this looks great!  Thanks for working on this!

>   /** 
>   * dfsFileLocationInfo
>   * used to get the mapping between file blocks and the hostnames where
>   * they are stored. Due to replication, a file block could be stored on
>   * multiple hosts.
>   */
>   typedef struct  {
>     char **hostname;
>     int numHosts;
>   } dfsFileLocationInfo;
>   /** 
>   * dfsStat
>   * used for getting information about a file/directory
>   */
>   typedef struct  {
>     tObjectKind mKind;	/** file or directory */
>     char *mName; /* the name of the file */
>     tTime mCreationTime;
>     dfsFileLocationInfo *fileLocationInfo; /*the last element 
>                                        in the array is NULL*/
>     long  mSize; /*the size of the file in bytes */
>     bool replicated; /*whether this file is replicated */
>   } dfsFileInfo;
>   /** return information about a path as a (dynamically allocated) array 
>   * of dfsFileInfo.
>   * numEntries is set to the number of elements in the array.
>   * If the path happens to be a file, the array will have just one element.
>   * If the path happens to be a directory, the dfsFileInfo elements in the
>   * array will contain information about the files/sub-dirs within the path.
>   * NULL is returned if the path does not exist or some other error is 
>   * encountered. freeDfsFileInfo should be called passing the array and 
>   * numEntries when it is no longer needed.
>   */
>   dfsFileInfo *dfsGetPathInfo(dfsFS fs, char *path, int *numEntries);

I'm a little confused about dfsFileLocationInfo.  It exposes more of 
the filesystem's internals than applications require.  It's also 
expensive to return full block lists with directory listings.

Instead, I think we need the following two functions:

tOffset getBlockSize(dfsFs fs);

char** getHosts(dfsFs fs, char* file, tOffset pos);

This would return an array of hosts that contain the specified position 
in a file.  Does that make sense?
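
As a rough sketch of how a caller might use the two proposed functions, the 
code below walks a file one block at a time and counts the blocks that have a 
replica on a given host. Only the two signatures come from this message (the 
second is spelled getHosts here on the assumption that this is the intended 
name). The tOffset and dfsFs definitions, the mock bodies with canned host 
names, the NULL-terminated return array, and the countLocalBlocks helper are 
all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: this message does not define these types. */
typedef long long tOffset;
typedef struct { tOffset blockSize; } mockFs;
typedef mockFs *dfsFs;

/* Mock of the proposed getBlockSize(): returns the filesystem block size. */
tOffset getBlockSize(dfsFs fs) { return fs->blockSize; }

/* Mock of the proposed getHosts(): returns the hosts storing the block that
 * contains `pos`. The canned data and the NULL terminator are assumptions. */
char **getHosts(dfsFs fs, char *file, tOffset pos) {
    static char *evenBlock[] = { "host-a", "host-b", NULL };
    static char *oddBlock[]  = { "host-b", "host-c", NULL };
    (void)file;                       /* the mock ignores the file name */
    assert(pos >= 0);
    return (pos / getBlockSize(fs)) % 2 == 0 ? evenBlock : oddBlock;
}

/* Hypothetical helper: count the blocks of a `fileSize`-byte file that have
 * a replica on `host`, probing one position per block. */
int countLocalBlocks(dfsFs fs, char *file, tOffset fileSize, const char *host) {
    tOffset blockSize = getBlockSize(fs);
    int local = 0;
    for (tOffset pos = 0; pos < fileSize; pos += blockSize) {
        char **hosts = getHosts(fs, file, pos);
        for (size_t i = 0; hosts != NULL && hosts[i] != NULL; i++) {
            if (strcmp(hosts[i], host) == 0) { local++; break; }
        }
    }
    return local;
}
```

The point of the sketch is that a scheduler only asks for locations of the 
positions it cares about, so directory listings never have to carry block 
lists.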

>   int dfsCopyFromLocalFile(dfsFs fs, char *src, char *dst);
>   int dfsCopyToLocalFile(dfsFs fs, char *src, char *dst);
>   int dfsMoveFromLocalFile(dfsFs fs, char *src, char *dst);

These are utility methods that could be implemented in user code, i.e., 
they are not core methods.  That's fine.  But perhaps we should add another:

int dfsCopy(dfsFs fs, char* src, char* dst);

Otherwise lots of applications will end up writing this themselves.
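
For a sense of what each application would otherwise write, here is a minimal 
sketch of the read/write loop that a dfsCopy could wrap. The excerpt quoted 
here does not show the spec's open, read, or write calls, so the sketch uses 
hypothetical in-memory stand-ins (mockFile, mockRead, mockWrite) rather than 
any real DFS function, and copyStream is an illustrative name, not part of 
the proposal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "file"; stands in for whatever handle the API uses. */
typedef struct {
    char data[256];
    size_t size;   /* bytes currently stored */
    size_t pos;    /* current read offset */
} mockFile;

/* Hypothetical read: copies up to `len` bytes, returns the count (0 at EOF). */
static size_t mockRead(mockFile *f, char *buf, size_t len) {
    size_t left = f->size - f->pos;
    size_t n = len < left ? len : left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return n;
}

/* Hypothetical write: appends up to `len` bytes, returns the count written
 * (short if the mock file is full). */
static size_t mockWrite(mockFile *f, const char *buf, size_t len) {
    size_t room = sizeof f->data - f->size;
    size_t n = len < room ? len : room;
    memcpy(f->data + f->size, buf, n);
    f->size += n;
    return n;
}

/* The loop a dfsCopy() might wrap: returns 0 on success, -1 on a short write. */
int copyStream(mockFile *src, mockFile *dst) {
    char buf[16];
    size_t n;
    while ((n = mockRead(src, buf, sizeof buf)) > 0) {
        if (mockWrite(dst, buf, n) != n) return -1;
    }
    return 0;
}
```

Providing this once in the library would also let the implementation pick the 
buffer size and error handling in one place.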

Thanks again,

