pig-dev mailing list archives

From "Cheolsoo Park" <cheol...@cloudera.com>
Subject Re: Review Request: PIG-2492 AvroStorage should recognize globs and commas
Date Thu, 19 Jul 2012 01:23:53 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5936/
-----------------------------------------------------------

(Updated July 19, 2012, 1:23 a.m.)


Review request for pig.


Changes
-------

1) Added more unit tests including some negative tests.

2) Removed getPathsFromString() because I realized that fs.globStatus() implicitly expands
a comma-separated string into paths, so doing it explicitly is redundant.
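
For readers unfamiliar with comma-separated alternatives inside a glob, here is a minimal
sketch. It uses the JDK's java.nio glob matcher as a stand-in, not Hadoop's fs.globStatus(),
whose syntax may differ in details. The class name, helper name and sample paths are
hypothetical and not part of the patch.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class BraceGlobSketch {
    // Hypothetical helper: true if relativePath matches the glob pattern.
    // Uses java.nio's "glob:" syntax as a stand-in for Hadoop globbing.
    static boolean matches(String glob, String relativePath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        // {2012,2013} lists two comma-separated alternatives for one path segment.
        String glob = "data/{2012,2013}/*.avro";
        System.out.println(matches(glob, "data/2012/part-0.avro")); // listed alternative
        System.out.println(matches(glob, "data/2013/part-0.avro")); // listed alternative
        System.out.println(matches(glob, "data/2014/part-0.avro")); // not listed
    }
}
```

The first two paths match and the third does not, because 2014 is not one of the listed
alternatives.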

3) Changed the type of the first parameter of getAllSubDirs() from URI to hadoop.fs.Path. This
is needed because '{' and '}' are not allowed in a URI, so URI.create() fails on a glob pattern
(it throws an IllegalArgumentException that wraps a URISyntaxException). These characters are
automatically escaped when constructing a Path, however. Note that this wasn't an issue in my
previous patch because getPathsFromString() used to implicitly convert a glob pattern to paths,
but now that I have removed getPathsFromString(), I have to do the conversion explicitly.

In fact, this reverts some changes made by PIG-2540 (https://issues.apache.org/jira/browse/PIG-2540).
However, this does not break S3 support because inside getAllSubDirs(), the file system is still
obtained for the given URI, and globStatus() is called on that file system.

FileSystem fs = FileSystem.get(path.toUri(), job.getConfiguration());
FileStatus[] matchedFiles = fs.globStatus(path);

So if path is an S3 URI, the S3 file system will be used.
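
The URI behaviour described above can be reproduced with java.net.URI alone, so the following
sketch runs without Hadoop. It shows that URI.create() rejects a glob containing braces, while
the multi-argument URI constructor quotes the illegal characters and keeps the scheme. That
Path relies on this kind of quoting internally is an assumption here, not something verified
against the Hadoop source. The class name, helper names, bucket and path are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class GlobUriSketch {
    // True if URI.create() rejects the string with an IllegalArgumentException
    // whose cause is a URISyntaxException.
    static boolean uriCreateRejects(String s) {
        try {
            URI.create(s);
            return false;
        } catch (IllegalArgumentException e) {
            return e.getCause() instanceof URISyntaxException;
        }
    }

    // Builds a URI from components. This constructor percent-encodes characters
    // that are illegal in a path, such as '{' and '}'.
    static URI quoted(String scheme, String authority, String path) {
        try {
            return new URI(scheme, authority, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String glob = "/logs/{2012,2013}/*.avro"; // hypothetical glob pattern
        System.out.println(uriCreateRejects("s3://bucket" + glob)); // braces are rejected
        URI uri = quoted("s3", "bucket", glob);
        System.out.println(uri.getRawPath()); // braces appear as %7B and %7D
        System.out.println(uri.getPath());    // decoded form still holds the glob
        System.out.println(uri.getScheme());  // scheme survives for file system lookup
    }
}
```

Because the scheme is preserved, a lookup keyed on the URI, such as the FileSystem.get() call
quoted above, can still select the S3 file system for a glob path.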


Description
-------

Add glob support to AvroStorage:

https://issues.apache.org/jira/browse/PIG-2492


This addresses bug PIG-2492.
    https://issues.apache.org/jira/browse/PIG-2492


Diffs (updated)
-----

  contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
0f8ef27 
  contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
c7de726 
  contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
48b093b 
  contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorageUtils.java
e5d0c38 

Diff: https://reviews.apache.org/r/5936/diff/


Testing
-------

1. Added new unit tests as follows:

- testDir verifies that AvroStorage recursively loads files in a directory and its sub-directories.
- testGlob1 through testGlob3 verify that glob patterns are expanded properly.

To run the tests, please do the following:

wget https://issues.apache.org/jira/secure/attachment/12536534/avro_test_files.tar.gz 
tar -xf avro_test_files.tar.gz
ant clean compile-test piggybank -Dhadoopversion=20
cd contrib/piggybank/java
ant test -Dtestcase=TestAvroStorage

2. Both TestAvroStorage and TestAvroStorageUtils pass.


Thanks,

Cheolsoo Park

