hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-493) determineSchema() should be called on the deserializer in Streaming command to possibly determine schema of the output of the streaming command
Date Tue, 14 Oct 2008 17:41:44 GMT

     [ https://issues.apache.org/jira/browse/PIG-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-493:
-------------------------------

    Assignee: Pradeep Kamath
      Status: Patch Available  (was: Open)

Patch attached. Some notes on it:
- determineSchema() was never called from LOLoad because the parser always passed schemaFile
as null. I have changed the signature of this method so that implementations can open the
input file, if they need to, in order to determine the schema. Here is the new API:
{code}
    /**
     * Find the schema from the loader.  This function will be called at parse time
     * (not run time) to see if the loader can provide a schema for the data.  The
     * loader may be able to do this if the data is self-describing (e.g. JSON).  If
     * the loader cannot determine the schema, it can return null.
     * LoadFunc implementations which need to open the input "fileName" can use
     * FileLocalizer.open(String fileName, ExecType execType, DataStorage storage) to get
     * an InputStream which they can use to initialize their loader implementation. They
     * can then use this to read the input data to discover the schema. Note: this will
     * work only when fileName represents a file on the local file system or the Hadoop
     * file system.
     * @param fileName name of the file to be read (this will be the same as the filename
     * in the "load" statement of the script)
     * @param execType execution mode of the Pig script, one of ExecType.LOCAL or ExecType.MAPREDUCE
     * @param storage the DataStorage object corresponding to the execType
     * @return a Schema describing the data if possible, or null otherwise.
     * @throws IOException
     */
    public Schema determineSchema(String fileName, ExecType execType, DataStorage storage) throws IOException;
{code}

As noted in the Javadoc above, I have also added a static helper method to FileLocalizer.
LoadFunc implementations that need to open the input "fileName" can call
FileLocalizer.open(String fileName, ExecType execType, DataStorage storage) to get an
InputStream with which to initialize their loader implementation. There are also some related
changes in TypeCastInserter and Schema to handle a schema specified in the load statement
(that is, a schema supplied in addition to the one determined by determineSchema()).
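
To illustrate the intended pattern, here is a minimal, self-contained sketch of schema discovery
from an InputStream. It does not use the Pig classes: HeaderSchemaSketch and determineFieldNames
are hypothetical names, the list of field names is a stand-in for Schema, and the assumption
that the first line is a tab-separated header is made up for the example. In a real loader the
stream would come from the FileLocalizer.open(fileName, execType, storage) helper described above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a loader; not part of Pig.
class HeaderSchemaSketch {

    /**
     * Reads the first line of the stream and treats it as a tab-separated
     * list of field names (an assumption made for this example only).
     * Returns null when no header is present, mirroring the "return null if
     * the schema cannot be determined" contract of determineSchema().
     */
    static List<String> determineFieldNames(InputStream in) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String header = reader.readLine();
            if (header == null || header.trim().isEmpty()) {
                return null; // nothing to infer a schema from
            }
            return Arrays.asList(header.split("\t"));
        }
    }

    public static void main(String[] args) throws IOException {
        // In a real loader this stream would be obtained from FileLocalizer.open(...).
        InputStream in = new ByteArrayInputStream(
            "name\tage\tgpa\nalice\t20\t3.5\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(determineFieldNames(in)); // prints [name, age, gpa]
    }
}
```

Because the discovery step takes only an InputStream, the same logic would apply whether the
file lives on the local file system or on the Hadoop file system.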

 - Reviewers, please also look at https://issues.apache.org/jira/browse/PIG-492 to make sure
that there are no blockers for that issue. It seems that for that issue we would need to
serialize each of the loaders in the query and send them to the backend, which may not be trivial.

> determineSchema() should be called on the deserializer in Streaming command to possibly determine schema of the output of the streaming command
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-493
>                 URL: https://issues.apache.org/jira/browse/PIG-493
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>
> Currently determineSchema() method of the deserializer is never called to determine schema of the output of the streaming command.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

