hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-160) Change LoadFunc interface to work with new types
Date Wed, 19 Mar 2008 20:14:24 GMT
Change LoadFunc interface to work with new types
------------------------------------------------

                 Key: PIG-160
                 URL: https://issues.apache.org/jira/browse/PIG-160
             Project: Pig
          Issue Type: Sub-task
          Components: impl
            Reporter: Alan Gates
            Assignee: Alan Gates


The LoadFunc interface needs to change to support new types.  The load function will need
to support two new features:

1) type conversion, how to get the bytes read from the source converted to java Integer, Float,
String, etc.
2) schema discovery, as we want to support self-describing data such JSON, and we will need
the load function to tell us that schema.

The proposed new interface is:

{code:title=Bar.java|borderStyle=solid}
/**
 * This interface is used to implement functions to parse records
 * from a dataset.  This also includes functions to cast raw byte data into various
 * datatypes.  These are external functions because we want loaders, whenever
 * possible, to delay casting of datatypes until the last possible moment (i.e.
 * don't do it on load).  This means we need to expose the functionality so that
 * other sections of the code can call back to the loader to do the cast.
 */
public interface LoadFunc {
    /**
     * Specifies a portion of an InputStream to read tuples. Because the
     * starting and ending offsets may not be on record boundaries it is up to
     * the implementor to deal with figuring out the actual starting and ending
     * offsets in such a way that an arbitrarily sliced up file will be processed
     * in its entirety.
     * <p>
     * A common way of handling slices in the middle of records is to start at
     * the given offset and, if the offset is not zero, skip to the end of the
     * first record (which may be a partial record) before reading tuples.
     * Reading continues until a tuple has been read that ends at an offset past
     * the ending offset.
     * <p>
     * <b>The load function should not do any buffering on the input stream</b>.
Buffering will
     * cause the offsets returned by is.getPos() to be unreliable.
     *  
     * @param fileName the name of the file to be read
     * @param is the stream representing the file to be processed, and which can also provide
its position.
     * @param offset the offset to start reading tuples.
     * @param end the ending offset for reading.
     * @throws IOException
     */
    public void bindTo(String fileName,
                       BufferedPositionedInputStream is,
                       long offset,
                       long end) throws IOException;

    /**
     * Retrieves the next tuple to be processed.
     * @return the next tuple to be processed or null if there are no more tuples
     * to be processed.
     * @throws IOException
     */
    public Tuple getNext() throws IOException;
    
    /**
     * Cast data from bytes to boolean value.  
     * @param bytes byte array to be cast.
     * @return Boolean value.
     * @throws IOException if the value cannot be cast.
     */
    public Boolean bytesToBoolean(byte[] b) throws IOException;
    
    /**
     * Cast data from bytes to integer value.  
     * @param bytes byte array to be cast.
     * @return Integer value.
     * @throws IOException if the value cannot be cast.
     */
    public Integer bytesToInteger(byte[] b) throws IOException;

    /**
     * Cast data from bytes to long value.  
     * @param bytes byte array to be cast.
     * @return Long value.
     * @throws IOException if the value cannot be cast.
     */
    public Long bytesToLong(byte[] b) throws IOException;

    /**
     * Cast data from bytes to float value.  
     * @param bytes byte array to be cast.
     * @return Float value.
     * @throws IOException if the value cannot be cast.
     */
    public Float bytesToFloat(byte[] b) throws IOException;

    /**
     * Cast data from bytes to double value.  
     * @param bytes byte array to be cast.
     * @return Double value.
     * @throws IOException if the value cannot be cast.
     */
    public Double bytesToDouble(byte[] b) throws IOException;

    /**
     * Cast data from bytes to chararray value.  
     * @param bytes byte array to be cast.
     * @return String value.
     * @throws IOException if the value cannot be cast.
     */
    public String bytesToCharArray(byte[] b) throws IOException;

    /**
     * Cast data from bytes to map value.  
     * @param bytes byte array to be cast.
     * @return Map value.
     * @throws IOException if the value cannot be cast.
     */
    public Map<Object, Object> bytesToMap(byte[] b) throws IOException;

    /**
     * Cast data from bytes to tuple value.  
     * @param bytes byte array to be cast.
     * @return Tuple value.
     * @throws IOException if the value cannot be cast.
     */
    public Tuple bytesToTuple(byte[] b) throws IOException;

    /**
     * Cast data from bytes to bag value.  
     * @param bytes byte array to be cast.
     * @return Bag value.
     * @throws IOException if the value cannot be cast.
     */
    public DataBag bytesToBag(byte[] b) throws IOException;

    /**
     * Indicate to the loader fields that will be needed.  This can be useful for
     * loaders that access data that is stored in a columnar format where indicating
     * columns to be accessed a head of time will save scans.  If the loader
     * function cannot make use of this information, it is free to ignore it.
     * @param schema Schema indicating which columns will be needed.
     */
    public void fieldsToRead(Schema schema);

    /**
     * Find the schema from the loader.  This function will be called at parse time
     * (not run time) to see if the loader can provide a schema for the data.  The
     * loader may be able to do this if the data is self describing (e.g. JSON).  If
     * the loader cannot determine the schema, it can return a null.
     * @param fileName Name of the file to be read.
     * @param in inpu stream, so that the function can read enough of the
     * data to determine the schema.
     * @param end Function should not read past this position in the stream.
     * @return a Schema describing the data if possible, or null otherwise.
     * @throws IOException.
     */
    public Schema determineSchema(String fileName,
                                  BufferedPositionedInputStream in,
                                  long end) throws IOException;
}
{code} 

This bug also covers the work to convert existing load function (eg PigStorage, BinStorage)
to the new interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message