hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-160) Change LoadFunc interface to work with new types
Date Tue, 03 Jun 2008 21:18:45 GMT

     [ https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Pradeep Kamath updated PIG-160:
-------------------------------

    Attachment: loadfuncs_v1.patch

Patch with implementation of LoadFunc interface for flat types (for non flat types - map,
bag and tuple the bytesToXXX() methods return null and the corresponding toBytes throw 
IOException. The implementation for these will be added later after more design review.

The patch also includes unit test cases for the simple type conversion routines


> Change LoadFunc interface to work with new types
> ------------------------------------------------
>
>                 Key: PIG-160
>                 URL: https://issues.apache.org/jira/browse/PIG-160
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: loadfuncs_v1.patch
>
>
> The LoadFunc interface needs to change to support new types.  The load function will
need to support two new features:
> 1) type conversion, how to get the bytes read from the source converted to java Integer,
Float, String, etc.
> 2) schema discovery, as we want to support self-describing data such JSON, and we will
need the load function to tell us that schema.
> The proposed new interface is:
> {code:title=Bar.java|borderStyle=solid}
> /**
>  * This interface is used to implement functions to parse records
>  * from a dataset.  This also includes functions to cast raw byte data into various
>  * datatypes.  These are external functions because we want loaders, whenever
>  * possible, to delay casting of datatypes until the last possible moment (i.e.
>  * don't do it on load).  This means we need to expose the functionality so that
>  * other sections of the code can call back to the loader to do the cast.
>  */
> public interface LoadFunc {
>     /**
>      * Specifies a portion of an InputStream to read tuples. Because the
>      * starting and ending offsets may not be on record boundaries it is up to
>      * the implementor to deal with figuring out the actual starting and ending
>      * offsets in such a way that an arbitrarily sliced up file will be processed
>      * in its entirety.
>      * <p>
>      * A common way of handling slices in the middle of records is to start at
>      * the given offset and, if the offset is not zero, skip to the end of the
>      * first record (which may be a partial record) before reading tuples.
>      * Reading continues until a tuple has been read that ends at an offset past
>      * the ending offset.
>      * <p>
>      * <b>The load function should not do any buffering on the input stream</b>.
Buffering will
>      * cause the offsets returned by is.getPos() to be unreliable.
>      *  
>      * @param fileName the name of the file to be read
>      * @param is the stream representing the file to be processed, and which can also
provide its position.
>      * @param offset the offset to start reading tuples.
>      * @param end the ending offset for reading.
>      * @throws IOException
>      */
>     public void bindTo(String fileName,
>                        BufferedPositionedInputStream is,
>                        long offset,
>                        long end) throws IOException;
>     /**
>      * Retrieves the next tuple to be processed.
>      * @return the next tuple to be processed or null if there are no more tuples
>      * to be processed.
>      * @throws IOException
>      */
>     public Tuple getNext() throws IOException;
>     
>     /**
>      * Cast data from bytes to boolean value.  
>      * @param bytes byte array to be cast.
>      * @return Boolean value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Boolean bytesToBoolean(byte[] b) throws IOException;
>     
>     /**
>      * Cast data from bytes to integer value.  
>      * @param bytes byte array to be cast.
>      * @return Integer value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Integer bytesToInteger(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to long value.  
>      * @param bytes byte array to be cast.
>      * @return Long value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Long bytesToLong(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to float value.  
>      * @param bytes byte array to be cast.
>      * @return Float value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Float bytesToFloat(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to double value.  
>      * @param bytes byte array to be cast.
>      * @return Double value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Double bytesToDouble(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to chararray value.  
>      * @param bytes byte array to be cast.
>      * @return String value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public String bytesToCharArray(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to map value.  
>      * @param bytes byte array to be cast.
>      * @return Map value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Map<Object, Object> bytesToMap(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to tuple value.  
>      * @param bytes byte array to be cast.
>      * @return Tuple value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Tuple bytesToTuple(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to bag value.  
>      * @param bytes byte array to be cast.
>      * @return Bag value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public DataBag bytesToBag(byte[] b) throws IOException;
>     /**
>      * Indicate to the loader fields that will be needed.  This can be useful for
>      * loaders that access data that is stored in a columnar format where indicating
>      * columns to be accessed a head of time will save scans.  If the loader
>      * function cannot make use of this information, it is free to ignore it.
>      * @param schema Schema indicating which columns will be needed.
>      */
>     public void fieldsToRead(Schema schema);
>     /**
>      * Find the schema from the loader.  This function will be called at parse time
>      * (not run time) to see if the loader can provide a schema for the data.  The
>      * loader may be able to do this if the data is self describing (e.g. JSON).  If
>      * the loader cannot determine the schema, it can return a null.
>      * @param fileName Name of the file to be read.
>      * @param in inpu stream, so that the function can read enough of the
>      * data to determine the schema.
>      * @param end Function should not read past this position in the stream.
>      * @return a Schema describing the data if possible, or null otherwise.
>      * @throws IOException.
>      */
>     public Schema determineSchema(String fileName,
>                                   BufferedPositionedInputStream in,
>                                   long end) throws IOException;
> }
> {code} 
> This bug also covers the work to convert existing load function (eg PigStorage, BinStorage)
to the new interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message