hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-160) Change LoadFunc interface to work with new types
Date Thu, 05 Jun 2008 03:03:45 GMT

    [ https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602524#action_12602524
] 

Shravan Matthur Narayanamurthy commented on PIG-160:
----------------------------------------------------

I have a few comments/questions on the design.
1) Why do we have a separate implementation of converter for PigStorage and not for BinStorage?
2) I did not see toBytes methods in BinStorage.
3) Wouldn't it be better to use composition here, as Java doesn't support multiple inheritance?
So make an interface for Converter with all the bytesTo and toBytes methods. Write different
converters implementing Converter. Use a Converter object inside the storage functions and
delegate the calls to the Converter object. We can have a default Converter set for PigStorage,
which can be the UTF8StorageConverter, and users can configure a different converter if
they want to.
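
A minimal sketch of the composition suggested in 3), under stated assumptions: only the
names Converter and UTF8StorageConverter come from the comment above. The method subset,
the SketchStorage class (a stand-in for a storage function such as PigStorage), and
setConverter are hypothetical and are not taken from the attached patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical interface: gathers the bytesTo* and toBytes methods in one place.
// Only two types are shown; a full version would cover every type in LoadFunc.
interface Converter {
    Integer bytesToInteger(byte[] b) throws IOException;
    String bytesToCharArray(byte[] b) throws IOException;
    byte[] toBytes(Integer i) throws IOException;
    byte[] toBytes(String s) throws IOException;
}

// One possible converter: values are stored as UTF-8 text.
class UTF8StorageConverter implements Converter {
    public Integer bytesToInteger(byte[] b) throws IOException {
        try {
            return Integer.valueOf(new String(b, StandardCharsets.UTF_8).trim());
        } catch (NumberFormatException e) {
            throw new IOException("Cannot cast bytes to integer", e);
        }
    }

    public String bytesToCharArray(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public byte[] toBytes(Integer i) {
        return i.toString().getBytes(StandardCharsets.UTF_8);
    }

    public byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for a storage function: it holds a Converter
// and delegates every cast to it instead of inheriting the casts.
class SketchStorage {
    // Default converter, as proposed for PigStorage in the comment.
    private Converter converter = new UTF8StorageConverter();

    // Users who want a different encoding swap the converter here.
    public void setConverter(Converter converter) {
        this.converter = converter;
    }

    public Integer bytesToInteger(byte[] b) throws IOException {
        return converter.bytesToInteger(b);
    }

    public String bytesToCharArray(byte[] b) throws IOException {
        return converter.bytesToCharArray(b);
    }

    public byte[] toBytes(Integer i) throws IOException {
        return converter.toBytes(i);
    }
}
```

With this shape a storage function can still extend some other base class, since the
conversion logic is held as a field rather than inherited.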

> Change LoadFunc interface to work with new types
> ------------------------------------------------
>
>                 Key: PIG-160
>                 URL: https://issues.apache.org/jira/browse/PIG-160
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: loadfuncs_v1.patch
>
>
> The LoadFunc interface needs to change to support new types.  The load function will
> need to support two new features:
> 1) type conversion, how to get the bytes read from the source converted to Java Integer,
> Float, String, etc.
> 2) schema discovery, as we want to support self-describing data such as JSON, and we will
> need the load function to tell us that schema.
> The proposed new interface is:
> {code:title=Bar.java|borderStyle=solid}
> /**
>  * This interface is used to implement functions to parse records
>  * from a dataset.  This also includes functions to cast raw byte data into various
>  * datatypes.  These are external functions because we want loaders, whenever
>  * possible, to delay casting of datatypes until the last possible moment (i.e.
>  * don't do it on load).  This means we need to expose the functionality so that
>  * other sections of the code can call back to the loader to do the cast.
>  */
> public interface LoadFunc {
>     /**
>      * Specifies a portion of an InputStream to read tuples. Because the
>      * starting and ending offsets may not be on record boundaries it is up to
>      * the implementor to deal with figuring out the actual starting and ending
>      * offsets in such a way that an arbitrarily sliced up file will be processed
>      * in its entirety.
>      * <p>
>      * A common way of handling slices in the middle of records is to start at
>      * the given offset and, if the offset is not zero, skip to the end of the
>      * first record (which may be a partial record) before reading tuples.
>      * Reading continues until a tuple has been read that ends at an offset past
>      * the ending offset.
>      * <p>
>      * <b>The load function should not do any buffering on the input stream</b>.
>      * Buffering will cause the offsets returned by is.getPos() to be unreliable.
>      *  
>      * @param fileName the name of the file to be read
>      * @param is the stream representing the file to be processed, and which can also
>      * provide its position.
>      * @param offset the offset to start reading tuples.
>      * @param end the ending offset for reading.
>      * @throws IOException
>      */
>     public void bindTo(String fileName,
>                        BufferedPositionedInputStream is,
>                        long offset,
>                        long end) throws IOException;
>     /**
>      * Retrieves the next tuple to be processed.
>      * @return the next tuple to be processed or null if there are no more tuples
>      * to be processed.
>      * @throws IOException
>      */
>     public Tuple getNext() throws IOException;
>     
>     /**
>      * Cast data from bytes to boolean value.  
>      * @param b byte array to be cast.
>      * @return Boolean value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Boolean bytesToBoolean(byte[] b) throws IOException;
>     
>     /**
>      * Cast data from bytes to integer value.  
>      * @param b byte array to be cast.
>      * @return Integer value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Integer bytesToInteger(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to long value.  
>      * @param b byte array to be cast.
>      * @return Long value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Long bytesToLong(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to float value.  
>      * @param b byte array to be cast.
>      * @return Float value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Float bytesToFloat(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to double value.  
>      * @param b byte array to be cast.
>      * @return Double value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Double bytesToDouble(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to chararray value.  
>      * @param b byte array to be cast.
>      * @return String value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public String bytesToCharArray(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to map value.  
>      * @param b byte array to be cast.
>      * @return Map value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Map<Object, Object> bytesToMap(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to tuple value.  
>      * @param b byte array to be cast.
>      * @return Tuple value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Tuple bytesToTuple(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to bag value.  
>      * @param b byte array to be cast.
>      * @return Bag value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public DataBag bytesToBag(byte[] b) throws IOException;
>     /**
>      * Indicate to the loader fields that will be needed.  This can be useful for
>      * loaders that access data that is stored in a columnar format where indicating
>      * columns to be accessed ahead of time will save scans.  If the loader
>      * function cannot make use of this information, it is free to ignore it.
>      * @param schema Schema indicating which columns will be needed.
>      */
>     public void fieldsToRead(Schema schema);
>     /**
>      * Find the schema from the loader.  This function will be called at parse time
>      * (not run time) to see if the loader can provide a schema for the data.  The
>      * loader may be able to do this if the data is self describing (e.g. JSON).  If
>      * the loader cannot determine the schema, it can return a null.
>      * @param fileName Name of the file to be read.
>      * @param in input stream, so that the function can read enough of the
>      * data to determine the schema.
>      * @param end Function should not read past this position in the stream.
>      * @return a Schema describing the data if possible, or null otherwise.
>      * @throws IOException
>      */
>     public Schema determineSchema(String fileName,
>                                   BufferedPositionedInputStream in,
>                                   long end) throws IOException;
> }
> {code} 
> This bug also covers the work to convert existing load functions (e.g. PigStorage, BinStorage)
> to the new interface.
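
For reference, the slice-handling convention described in the bindTo javadoc above can be
sketched roughly as follows. This is an illustration only, under stated assumptions: records
are newline-delimited, an in-memory byte array stands in for BufferedPositionedInputStream,
and SliceReader and readSlice are hypothetical names that do not exist in Pig.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads the newline-delimited records that belong to
// the slice [offset, end) of a larger file held in memory.
class SliceReader {
    static List<String> readSlice(byte[] data, long offset, long end) {
        int pos = (int) offset;
        if (offset != 0) {
            // Skip to the end of the first (possibly partial) record.
            // The slice before this one is responsible for reading it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }
        List<String> records = new ArrayList<>();
        // Keep reading while a record starts at or before 'end', so the last
        // record read is the one that ends past the ending offset.
        while (pos < data.length && pos <= end) {
            int start = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            records.add(new String(data, start, pos - start, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }
}
```

Under these assumptions, splitting a file at any byte offset and reading both slices yields
every record exactly once: the record that straddles the boundary is read by the first slice
and skipped by the second.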

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

