Subject: svn commit: r1421461 - in /pig/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/func.xml src/org/apache/pig/builtin/PigStorage.java src/org/apache/pig/impl/util/Utils.java test/org/apache/pig/test/TestPigStorage.java
Date: Thu, 13 Dec 2012 19:57:47 -0000
To: commits@pig.apache.org
From: cheolsoo@apache.org
Message-Id: <20121213195748.CA10C2388962@eris.apache.org>

Author: cheolsoo
Date: Thu Dec 13 19:57:46 2012
New Revision: 1421461

URL: http://svn.apache.org/viewvc?rev=1421461&view=rev
Log:
PIG-2857: Add a -tagPath option to PigStorage (prkommireddi via cheolsoo)

Modified:
    pig/trunk/CHANGES.txt
    pig/trunk/src/docs/src/documentation/content/xdocs/func.xml
    pig/trunk/src/org/apache/pig/builtin/PigStorage.java
    pig/trunk/src/org/apache/pig/impl/util/Utils.java
    pig/trunk/test/org/apache/pig/test/TestPigStorage.java

Modified: pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/pig/trunk/CHANGES.txt?rev=1421461&r1=1421460&r2=1421461&view=diff
==============================================================================
--- pig/trunk/CHANGES.txt (original)
+++ pig/trunk/CHANGES.txt Thu Dec 13 19:57:46 2012
@@ -24,6 +24,8 @@ INCOMPATIBLE CHANGES
 
 IMPROVEMENTS
 
+PIG-2857: Add a -tagPath option to PigStorage (prkommireddi via cheolsoo)
+
 PIG-2341: Need better documentation on Pig/HBase integration (jthakrar and billgraham via billgraham)
 
 PIG-3075: Allow AvroStorage STORE Operations To Use Schema Specified By URI (nwhite via cheolsoo)

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/func.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/func.xml?rev=1421461&r1=1421460&r2=1421461&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/func.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/func.xml Thu Dec 13 19:57:46 2012
@@ -1436,10 +1436,12 @@ STORE X INTO 'output' USING PigDump();

A string that contains space-separated options (‘optionA optionB optionC’)

Currently supported options are:

-  • (‘schema’) - Stores the schema of the relation using a hidden JSON file.
-  • (‘noschema’) - Ignores a stored schema during the load.
-  • ('tagsource') - Add a first column indicates the input file of the record.
+  • (‘schema’) - Stores the schema of the relation using a hidden JSON file.
+  • (‘noschema’) - Ignores a stored schema during the load.
+  • ('tagsource') - (deprecated, Use tagPath instead) Add a first column indicates the input file of the record.
+  • ('tagPath') - Add a first column indicates the input path of the record.
+  • ('tagFile') - Add a first column indicates the input file name of the record.
@@ -1471,7 +1473,7 @@ STORE X INTO 'output' USING PigDump();

    Note that regardless of whether or not you store the schema, you always need to specify the correct delimiter to read your data. If you store reading delimiter "#" and then load using the default delimiter, your data will not be parsed correctly.

    Record Provenance

-    If tagsource option is specified, PigStorage will add a psudo-column INPUT_FILE_NAME to the beginning of the record. As the name suggests, it is the input file name containing this particular record.
+    If tagPath or tagFile option is specified, PigStorage will add a pseudo-column INPUT_FILE_PATH or INPUT_FILE_NAME respectively to the beginning of the record. As the name suggests, it is the input file path/name containing this particular record. Please note tagsource is deprecated.

    Complex Data Types

    The formats for complex data types are shown here:

Modified: pig/trunk/src/org/apache/pig/builtin/PigStorage.java
URL: http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/PigStorage.java?rev=1421461&r1=1421460&r2=1421461&view=diff
==============================================================================
--- pig/trunk/src/org/apache/pig/builtin/PigStorage.java (original)
+++ pig/trunk/src/org/apache/pig/builtin/PigStorage.java Thu Dec 13 19:57:46 2012
@@ -41,8 +41,8 @@ import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.OutputFormat;
 import org.apache.hadoop.mapreduce.RecordReader;
 import org.apache.hadoop.mapreduce.RecordWriter;
-import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.pig.Expression;
 import org.apache.pig.FileInputLoadFunc;
@@ -84,7 +84,8 @@ import org.apache.pig.parser.ParserExcep
  * • -schema Reads/Stores the schema of the relation using a
  *  hidden JSON file.
  * • -noschema Ignores a stored schema during loading.
- * • -tagsource Appends input source file path to beginning of each tuple.
+ * • -tagFile Appends input source file name to beginning of each tuple.
+ * • -tagPath Appends input source file path to beginning of each tuple.
  *
  *
  * Schemas
@@ -101,9 +102,12 @@ import org.apache.pig.parser.ParserExcep
  * files with header lines easier (just cat the header to your data).
  *
  * Source tagging
- * If -tagsource is specified, PigStorage will prepend input split path to each Tuple/row.
- * Usage: A = LOAD 'input' using PigStorage(',','-tagsource'); B = foreach A generate $0;
- * The first field (0th index) in each Tuple will contain input path
+ * If -tagFile is specified, PigStorage will prepend input split name to each Tuple/row.
+ * Usage: A = LOAD 'input' using PigStorage(',','-tagFile'); B = foreach A generate $0;
+ * The first field (0th index) in each Tuple will contain input file name.
+ * If -tagPath is specified, PigStorage will prepend input split path to each Tuple/row.
+ * Usage: A = LOAD 'input' using PigStorage(',','-tagPath'); B = foreach A generate $0;
+ * The first field (0th index) in each Tuple will contain input file path
  *
  * Note that regardless of whether or not you store the schema, you always need to specify
  * the correct delimiter to read your data. If you store reading delimiter "#" and then load using
@@ -144,15 +148,19 @@ LoadPushDown, LoadMetadata, StoreMetadat
     protected boolean[] mRequiredColumns = null;
     private boolean mRequiredColumnsInitialized = false;
 
-    //Indicates whether the input file path should be read.
-    private boolean tagSource = false;
-    private static final String TAG_SOURCE_PATH = "tagsource";
+    // Indicates whether the input file name/path should be read.
+    private boolean tagFile = false;
+    private static final String TAG_SOURCE_FILE = "tagFile";
+    private boolean tagPath = false;
+    private static final String TAG_SOURCE_PATH = "tagPath";
     private Path sourcePath = null;
 
     private void populateValidOptions() {
         validOptions.addOption("schema", false, "Loads / Stores the schema of the relation using a hidden JSON file.");
         validOptions.addOption("noschema", false, "Disable attempting to load data schema from the filesystem.");
-        validOptions.addOption(TAG_SOURCE_PATH, false, "Appends input source file path to beginning of each tuple. ");
+        validOptions.addOption(TAG_SOURCE_FILE, false, "Appends input source file name to beginning of each tuple.");
+        validOptions.addOption(TAG_SOURCE_PATH, false, "Appends input source file path to beginning of each tuple.");
+        validOptions.addOption("tagsource", false, "Appends input source file name to beginning of each tuple.");
     }
 
     public PigStorage() {
@@ -178,7 +186,8 @@ LoadPushDown, LoadMetadata, StoreMetadat
      *
      * • -schema Loads / Stores the schema of the relation using a hidden JSON file.
      * • -noschema Ignores a stored schema during loading.
-     * • -tagsource Appends input source file path to beginning of each tuple.
+     * • -tagFile Appends input source file name to beginning of each tuple.
+     * • -tagPath Appends input source file path to beginning of each tuple.
      *
      * @param delimiter the single byte character that is used to separate fields.
      * @param options a list of options that can be used to modify PigStorage behavior
@@ -192,7 +201,14 @@ LoadPushDown, LoadMetadata, StoreMetadat
             configuredOptions = parser.parse(validOptions, optsArr);
             isSchemaOn = configuredOptions.hasOption("schema");
             dontLoadSchema = configuredOptions.hasOption("noschema");
-            tagSource = configuredOptions.hasOption(TAG_SOURCE_PATH);
+            tagFile = configuredOptions.hasOption(TAG_SOURCE_FILE);
+            tagPath = configuredOptions.hasOption(TAG_SOURCE_PATH);
+            // TODO: Remove -tagsource in 0.13. For backward compatibility, we
+            // need tagsource to be supported until at least 0.12
+            if (configuredOptions.hasOption("tagsource")) {
+                mLog.warn("'-tagsource' is deprecated. Use '-tagFile' instead.");
+                tagFile = true;
+            }
         } catch (ParseException e) {
             HelpFormatter formatter = new HelpFormatter();
             formatter.printHelp( "PigStorage(',', '[options]')", validOptions);
@@ -213,8 +229,10 @@ LoadPushDown, LoadMetadata, StoreMetadat
             mRequiredColumnsInitialized = true;
         }
         //Prepend input source path if source tagging is enabled
-        if(tagSource) {
-            mProtoTuple.add(new DataByteArray(sourcePath.getName()));
+        if(tagFile) {
+            mProtoTuple.add(new DataByteArray(sourcePath.getName()));
+        } else if (tagPath) {
+            mProtoTuple.add(new DataByteArray(sourcePath.toString()));
         }
 
         try {
@@ -362,8 +380,8 @@ LoadPushDown, LoadMetadata, StoreMetadat
     @Override
     public void prepareToRead(RecordReader reader, PigSplit split) {
         in = reader;
-        if(tagSource) {
-            sourcePath = ((FileSplit)split.getWrappedSplit()).getPath();
+        if (tagFile || tagPath) {
+            sourcePath = ((FileSplit)split.getWrappedSplit()).getPath();
         }
     }
 
@@ -471,8 +489,10 @@ LoadPushDown, LoadMetadata, StoreMetadat
         schema = (new JsonMetadata()).getSchema(location, job, isSchemaOn);
 
         if (signature != null && schema != null) {
-            if(tagSource) {
-                schema = Utils.getSchemaWithInputSourceTag(schema);
+            if(tagFile) {
+                schema = Utils.getSchemaWithInputSourceTag(schema, "INPUT_FILE_NAME");
+            } else if(tagPath) {
+                schema = Utils.getSchemaWithInputSourceTag(schema, "INPUT_FILE_PATH");
             }
             Properties p = UDFContext.getUDFContext().getUDFProperties(this.getClass(),
                     new String[] {signature});

Modified: pig/trunk/src/org/apache/pig/impl/util/Utils.java
URL: http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/impl/util/Utils.java?rev=1421461&r1=1421460&r2=1421461&view=diff
==============================================================================
--- pig/trunk/src/org/apache/pig/impl/util/Utils.java (original)
+++ pig/trunk/src/org/apache/pig/impl/util/Utils.java Thu Dec 13 19:57:46 2012
@@ -17,7 +17,6 @@
  */
 package org.apache.pig.impl.util;
 
-import java.io.ByteArrayInputStream;
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileNotFoundException;
@@ -51,7 +50,6 @@ import org.apache.pig.impl.io.ReadToEndL
 import org.apache.pig.impl.io.TFileStorage;
 import org.apache.pig.impl.logicalLayer.schema.Schema;
 import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;
-import org.apache.pig.newplan.logical.Util;
 import org.apache.pig.newplan.logical.relational.LogicalSchema;
 import org.apache.pig.parser.ParserException;
 import org.apache.pig.parser.QueryParserDriver;
@@ -62,7 +60,7 @@ import com.google.common.collect.Lists;
  * Class with utility static methods
  */
 public class Utils {
-    private static final Log log = LogFactory.getLog(Utils.class);
+    private static final Log log = LogFactory.getLog(Utils.class);
     /**
      * This method is a helper for classes to implement {@link java.lang.Object#equals(java.lang.Object)}
      * checks if two objects are equals - two levels of checks are
@@ -204,24 +202,26 @@ public class Utils {
         return getSchemaFromString(unwrappedSchemaString);
     }
 
-    public static LogicalSchema parseSchema(String schemaString) throws ParserException {
-        QueryParserDriver queryParser = new QueryParserDriver( new PigContext(),
-                "util", new HashMap() ) ;
+    public static LogicalSchema parseSchema(String schemaString) throws ParserException {
+        QueryParserDriver queryParser = new QueryParserDriver( new PigContext(),
+                "util", new HashMap() ) ;
         LogicalSchema schema = queryParser.parseSchema(schemaString);
-        return schema;
-    }
-
+        return schema;
+    }
+
     /**
      * This method adds FieldSchema of 'input source tag/path' as the first
      * field. This will be called only when PigStorage is invoked with
-     * '-tagsource' option and the schema file is present to be loaded.
+     * '-tagFile' or '-tagPath' option and the schema file is present to be
+     * loaded.
      *
      * @param schema
+     * @param fieldName
      * @return ResourceSchema
      */
-    public static ResourceSchema getSchemaWithInputSourceTag(ResourceSchema schema) {
+    public static ResourceSchema getSchemaWithInputSourceTag(ResourceSchema schema, String fieldName) {
         ResourceFieldSchema[] fieldSchemas = schema.getFields();
-        ResourceFieldSchema sourceTagSchema = new ResourceFieldSchema(new FieldSchema("INPUT_FILE_NAME", DataType.CHARARRAY));
+        ResourceFieldSchema sourceTagSchema = new ResourceFieldSchema(new FieldSchema(fieldName, DataType.CHARARRAY));
         ResourceFieldSchema[] fieldSchemasWithSourceTag = new ResourceFieldSchema[fieldSchemas.length + 1];
         fieldSchemasWithSourceTag[0] = sourceTagSchema;
         for(int j = 0; j < fieldSchemas.length; j++) {
@@ -324,15 +324,15 @@ public class Utils {
     }
 
     public static InputStream getCompositeStream(InputStream in, Properties properties) {
-        //Load default ~/.pigbootup if not specified by user
-        final String bootupFile = properties.getProperty("pig.load.default.statements", System.getProperty("user.home") + "/.pigbootup");
-        try {
-            final InputStream inputSteam = new FileInputStream(new File(bootupFile));
-            return new SequenceInputStream(inputSteam, in);
-        } catch(FileNotFoundException fe) {
-            log.info("Default bootup file " +bootupFile+ " not found");
-            return in;
-        }
+        //Load default ~/.pigbootup if not specified by user
+        final String bootupFile = properties.getProperty("pig.load.default.statements", System.getProperty("user.home") + "/.pigbootup");
+        try {
+            final InputStream inputSteam = new FileInputStream(new File(bootupFile));
+            return new SequenceInputStream(inputSteam, in);
+        } catch(FileNotFoundException fe) {
+            log.info("Default bootup file " +bootupFile+ " not found");
+            return in;
+        }
     }
 
     /**

Modified: pig/trunk/test/org/apache/pig/test/TestPigStorage.java
URL: http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/TestPigStorage.java?rev=1421461&r1=1421460&r2=1421461&view=diff
==============================================================================
--- pig/trunk/test/org/apache/pig/test/TestPigStorage.java (original)
+++ pig/trunk/test/org/apache/pig/test/TestPigStorage.java Thu Dec 13 19:57:46 2012
@@ -28,7 +28,6 @@ import java.io.File;
 import java.io.FileWriter;
 import java.io.IOException;
 import java.io.PrintWriter;
-import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Iterator;
 import java.util.List;
@@ -59,9 +58,9 @@ import org.codehaus.jackson.map.JsonMapp
 import org.codehaus.jackson.map.ObjectMapper;
 import org.junit.After;
 import org.junit.AfterClass;
+import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
-import org.junit.Assert;
 
 public class TestPigStorage {
 
@@ -468,7 +467,7 @@ public class TestPigStorage {
 
     /**
      * This is for testing source tagging option on PigStorage. When a user
-     * specifies '-tagsource' as an option, PigStorage must prepend the input
+     * specifies '-tagFile' as an option, PigStorage must prepend the input
      * source path to the tuple and "INPUT_FILE_NAME" to schema.
      *
      * @throws Exception
@@ -482,18 +481,29 @@ public class TestPigStorage {
         pig.store("a", datadir + "aout", "PigStorage('\\t', '-schema')");
 
         // aout now has a schema.
-        // Verify that loading a-out with '-tagsource' produces
+        // Verify that loading a-out with '-tagFile' produces
         // the original schema, and prepends 'INPUT_FILE_NAME' to
         // original schema.
-        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagsource');");
+        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagFile');");
         Schema genSchema = pig.dumpSchema("b");
-        // Verify that -tagsource schema works
-        String[] aliases = {"INPUT_FILE_NAME", "f1", "f2"};
-        byte[] types = {DataType.CHARARRAY, DataType.CHARARRAY, DataType.INTEGER};
+        String[] fileAliases = {"INPUT_FILE_NAME", "f1", "f2"};
+        byte[] fileTypes = {DataType.CHARARRAY, DataType.CHARARRAY, DataType.INTEGER};
         Schema newSchema = TypeCheckingTestUtil.genFlatSchema(
-                aliases,types);
-        Assert.assertTrue("schema with -tagsource preprends INPUT_FILE_NAME",
+                fileAliases,fileTypes);
+        Assert.assertTrue("schema with -tagFile preprends INPUT_FILE_NAME",
+                Schema.equals(newSchema, genSchema, true, false));
+
+        // Verify that loading a-out with '-tagPath' produces
+        // the original schema, and prepends 'INPUT_FILE_PATH' to
+        // original schema.
+        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagPath');");
+        genSchema = pig.dumpSchema("b");
+        String[] pathAliases = {"INPUT_FILE_PATH", "f1", "f2"};
+        byte[] pathTypes = {DataType.CHARARRAY, DataType.CHARARRAY, DataType.INTEGER};
+        newSchema = TypeCheckingTestUtil.genFlatSchema(pathAliases,pathTypes);
+        Assert.assertTrue("schema with -tagPath preprends INPUT_FILE_PATH",
                 Schema.equals(newSchema, genSchema, true, false));
+
         // Verify that explicitly requesting no schema works
         pig.registerQuery("d = LOAD '" + datadir + "aout' using PigStorage('\\t', '-noschema');");
 
@@ -501,7 +511,7 @@ public class TestPigStorage {
         assertNull(genSchema);
 
         // Verify specifying your own schema works
-        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagsource') " +
+        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagFile') " +
         "as (input_file:chararray, foo:chararray, bar:int);");
         genSchema = pig.dumpSchema("b");
         String[] newAliases = {"input_file", "foo", "bar"};
@@ -522,14 +532,14 @@ public class TestPigStorage {
         // Storing in 'aout' directory will store contents in part-m-00000
         pig.store("a", datadir + "aout", "PigStorage('\\t', '-schema')");
 
-        // Verify input source tag is present when using -tagsource
-        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagsource');");
+        // Verify input source tag is present when using -tagFile or -tagPath
+        pig.registerQuery("b = LOAD '" + datadir + "aout' using PigStorage('\\t', '-tagFile');");
         pig.registerQuery("c = foreach b generate INPUT_FILE_NAME;");
         Iterator iter = pig.openIterator("c");
         while(iter.hasNext()) {
             Tuple tuple = iter.next();
             String inputFileName = (String)tuple.get(0);
-            assertEquals("tagsource value must be part-m-00000", inputFileName, storeFileName);
+            assertEquals("tagFile value must be part-m-00000", inputFileName, storeFileName);
         }
     }
 }
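
Reader's note (not part of the commit): the diff resolves the options to a pseudo-column name and then prepends that name to the stored schema, as in getSchemaWithInputSourceTag. The sketch below illustrates those two steps with plain strings, assuming simple whitespace splitting in place of the commons-cli parser that PigStorage actually uses, and String arrays in place of ResourceFieldSchema. SchemaTagSketch, tagFieldFor, and withSourceTag are hypothetical names.

```java
import java.util.Arrays;

public class SchemaTagSketch {

    // Resolves an option string to the pseudo-column name, or null if no tag is requested.
    // Assumption: options are whitespace-separated and written with a leading dash.
    static String tagFieldFor(String options) {
        boolean tagFile = false;
        boolean tagPath = false;
        for (String opt : options.trim().split("\\s+")) {
            if (opt.equals("-tagFile") || opt.equals("-tagsource")) {
                tagFile = true;   // the deprecated -tagsource behaves like -tagFile
            } else if (opt.equals("-tagPath")) {
                tagPath = true;
            }
        }
        if (tagFile) {
            return "INPUT_FILE_NAME";   // file name tag is checked first, as in the diff
        }
        if (tagPath) {
            return "INPUT_FILE_PATH";
        }
        return null;
    }

    // Returns a copy of fieldNames with tagName inserted as the first field.
    static String[] withSourceTag(String[] fieldNames, String tagName) {
        String[] tagged = new String[fieldNames.length + 1];
        tagged[0] = tagName;
        System.arraycopy(fieldNames, 0, tagged, 1, fieldNames.length);
        return tagged;
    }

    public static void main(String[] args) {
        String[] stored = {"f1", "f2"};
        System.out.println(Arrays.toString(withSourceTag(stored, tagFieldFor("-tagPath"))));
        // prints [INPUT_FILE_PATH, f1, f2]
        System.out.println(Arrays.toString(withSourceTag(stored, tagFieldFor("-schema -tagsource"))));
        // prints [INPUT_FILE_NAME, f1, f2]
    }
}
```

The resulting field order matches what the updated TestPigStorage assertions expect: the tag column first, followed by the stored fields.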