hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Charlie Groves <char...@threerings.net>
Subject Questions about the DataStorage API
Date Mon, 03 Mar 2008 01:01:43 GMT
I used the new DataStorage API that showed up as part of PIG-32 a  
fair bit this weekend while updating PIG-55, and I'm a little  
perplexed by its design.  A few questions about why things are the  
way they are follow.  I'd be happy to make some patches to address  
these issues, but I wanted to make sure I'm not missing something first.

Why are the navigation functions on DataStorage and not  
ContainerDescriptor?  It seems natural to add a couple methods to  
ContainerDescriptor to get a subelement or subcontainer given a  
String.  The current setup seems to require calling getDataStorage on  
the Container then calling asContainer or asElement on it with the  
same Container as an argument.  If the navigation moves to  
ContainerDescriptor, DataStorage could just have a single method to  
get an ElementDescriptor given a String rather than its current  
proliferation of as* methods.

Why does ContainerDescriptor extend ElementDescriptor?   
ElementDescriptor exposes several methods that make no sense for a  
directory, so this forces every ContainerDescriptor implementation to  
disallow those methods and return a dummy InputStream for create.  A  
common superinterface with the shared operations would make things  
much easier for DataStorage implementors.

Why does ContainerDescriptor only expose listing its subelements  
through being an Iterable?  Having it be Iterable is definitely nice,  
but there are always times when you need to look at all the files at  
once, so this forces any client code to build an array by hand out of  
the Iterable.  Since both of the existing implementations are already  
turning a returned array into an iterable, why not expose that and  
save some work for clients?

What's the distinction between getConfiguration and getStatistics?   
Is it that the things in Configuration are settable and Statistics  
aren't?  If that's the case, why not just have a getProperties method  
and note if a given key is settable in its javadoc.  A user is  
already going to have to lookup the key to figure out how a given  
DataStorage implementation's configuration maps into the common  
DataStorage operations.

Could keys common to all DataStorage implementations be moved to  
methods on ElementDescriptor?  The existing keys all seem like they'd  
be available from any DataStorage, so making them regular methods on  
the ElementDescriptor would make them much more pleasant to use not  
to mention that it'd remove the need to create a Map for every access  
to these rather commonly used attributes.

Is toString the correct method to produce a String representation of  
a ElementDescriptor?  I didn't see anything else on there to produce  
an absolute String representation of a path, and that's hugely useful  
for serialization.  It seems like a bad idea to expose that through  
toString since toString is generally used for debugging, and there's  
nothing to guide DataStorage implementors to use toString as such.

The last thing that's bothering me about the API is the names of the  
interfaces: ElementDescriptor and ContainerDescriptor.  Those names  
tell me almost nothing about what the interfaces do.  Container gives  
me a little hint that that interface will probably have other things  
inside of it, but the other two words are generic enough in  
programming to be meaningless.  I realize that something other than a  
filesystem may be exposed through these interfaces, but the  
operations exposed through the interfaces are inherently file-like,  
so calling them something like PigFile and PigDirectory would convey  
loads more information about how they're to be used to a programmer  
encountering them for the first time.


View raw message