Return-Path: X-Original-To: apmail-parquet-commits-archive@minotaur.apache.org Delivered-To: apmail-parquet-commits-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2F05718E28 for ; Fri, 1 May 2015 00:45:13 +0000 (UTC) Received: (qmail 71416 invoked by uid 500); 1 May 2015 00:45:13 -0000 Delivered-To: apmail-parquet-commits-archive@parquet.apache.org Received: (qmail 71362 invoked by uid 500); 1 May 2015 00:45:13 -0000 Mailing-List: contact commits-help@parquet.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@parquet.apache.org Delivered-To: mailing list commits@parquet.apache.org Received: (qmail 71346 invoked by uid 99); 1 May 2015 00:45:13 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 01 May 2015 00:45:13 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id D1FADE3A5E; Fri, 1 May 2015 00:45:12 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: alexlevenson@apache.org To: commits@parquet.apache.org Date: Fri, 01 May 2015 00:45:13 -0000 Message-Id: <84d9b10afa8445eb829896d7d9aa27b0@git.apache.org> In-Reply-To: <25540a72200b4d04822b54c2da3726e7@git.apache.org> References: <25540a72200b4d04822b54c2da3726e7@git.apache.org> X-Mailer: ASF-Git Admin Mailer Subject: [2/2] parquet-mr git commit: PARQUET-229 Add a strict thrift projection API with backwards compat support PARQUET-229 Add a strict thrift projection API with backwards compat support Currently, the thrift projection API accepts strings in a very general glob format that supports not only wildcards like `*` and `?` and expansions (`{x,y,z}`) but also character classes `[abc]`, and negation. Because of this flexibility, it's hard to give users good error reporting, for example letting them know that when they requested columns `foo.bar.{a,b,c}` there is actually no such column `foo.bar.c`. This PR introduces a new syntax that supports a more restrictive form of glob syntax and enforces that all **expansions** of a glob match a column, not just that all globs match a column. The new syntax is very simple and only has four special characters: `{`,`}`,`,`, and `*` It supports glob expansion, for example: `home.{phone,address}` or `org.apache{-incubator,}.foo` And the wildcard `*` which is treated the same way as java regex treats `(.*)`, for example: `home.*` or `org.apache*.foo` In the new syntax glob paths mean "keep all the child fields of the field matched by this glob", just like variable access would work in a programming language. For example: `x.y.z` means keep field `z` and all of its children (if any). So it's not necessary to do `x.y.z.*`. However, `x.y.z` would not keep field `x.y.zoo`. If that was desired, then `x.y.z*` could be used instead. Setting `"parquet.thrift.column.filter"` will result in the same behavior that it does currently in master, though a deprecation warning will be logged. The classes that implement the current behavior have been marked as deprecated, and using this will log a warning. Setting `"parquet.thrift.column.projection.globs"` will instead use this new syntax, and entry points in the various Builder's in the codebase is added as well. This PR does a little bit of cleanup as well, moving some shared methods to a `Strings` class and simplifying some of the class hierarchy in `ThriftSchemaConverterVisitor`. There are a few `// TODO Why?` added as well that I wanted to ask about. Author: Alex Levenson Closes #150 from isnotinvain/alexlevenson/strict-projection and squashes the following commits: 6c58e1c [Alex Levenson] clean up docs 1aab666 [Alex Levenson] Merge branch 'master' into alexlevenson/strict-projection 92b6ba6 [Alex Levenson] Merge branch 'master' into alexlevenson/strict-projection ceaf6cd [Alex Levenson] update packages a28dc19 [Alex Levenson] Merge branch 'master' into alexlevenson/strict-projection ebc4761 [Alex Levenson] Remove unneeded TODO c2e12c5 [Alex Levenson] Update docs eecf5f3 [Alex Levenson] Merge branch 'master' into alexlevenson/strict-projection 671f0b5 [Alex Levenson] Merge branch 'master' into alexlevenson/strict-projection 298cad8 [Alex Levenson] Add warning 8b7e4bb [Alex Levenson] Add more comments to StrictFieldProjectionFilter 8f65ed2 [Alex Levenson] Add tests for strict projection filter c81d9e1 [Alex Levenson] Docs and cleanup for FieldProjectionFilter 71139a7 [Alex Levenson] Add tests for FieldsPath 7d17068 [Alex Levenson] Tests for WildcardPath 8a3d2af [Alex Levenson] Add some tests f3fd931 [Alex Levenson] More docs 0b190c3 [Alex Levenson] Add more comments 6e67df5 [Alex Levenson] Add a strict thrift projection API with backwards support for the current API Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/7fc79983 Tree: http://git-wip-us.apache.org/repos/asf/parquet-mr/tree/7fc79983 Diff: http://git-wip-us.apache.org/repos/asf/parquet-mr/diff/7fc79983 Branch: refs/heads/master Commit: 7fc7998398373a14b4cdc0ce18abdeb221b1ccf9 Parents: 22c6d08 Author: Alex Levenson Authored: Thu Apr 30 17:45:11 2015 -0700 Committer: Alex Levenson Committed: Thu Apr 30 17:45:11 2015 -0700 ---------------------------------------------------------------------- .../parquet/cascading/ParquetValueScheme.java | 42 +++- .../main/java/org/apache/parquet/Strings.java | 110 +++++++++ .../org/apache/parquet/glob/GlobExpander.java | 114 ++++++++++ .../java/org/apache/parquet/glob/GlobNode.java | 157 +++++++++++++ .../org/apache/parquet/glob/GlobParser.java | 224 +++++++++++++++++++ .../org/apache/parquet/glob/WildcardPath.java | 122 ++++++++++ .../parquet/hadoop/metadata/ColumnPath.java | 12 +- .../java/org/apache/parquet/glob/TestGlob.java | 144 ++++++++++++ .../apache/parquet/glob/TestWildcardPath.java | 125 +++++++++++ .../hadoop/thrift/ThriftReadSupport.java | 78 +++++-- .../thrift/ThriftSchemaConvertVisitor.java | 36 ++- .../parquet/thrift/ThriftSchemaConverter.java | 69 +++--- .../projection/FieldProjectionFilter.java | 92 +++----- .../parquet/thrift/projection/FieldsPath.java | 61 ++--- .../thrift/projection/PathGlobPattern.java | 184 --------------- .../projection/StrictFieldProjectionFilter.java | 182 +++++++++++++++ .../DeprecatedFieldProjectionFilter.java | 107 +++++++++ .../projection/deprecated/PathGlobPattern.java | 187 ++++++++++++++++ ...stParquetToThriftReadWriteAndProjection.java | 2 +- .../thrift/TestThriftSchemaConverter.java | 73 +++--- .../thrift/projection/PathGlobPatternTest.java | 59 ----- .../thrift/projection/TestFieldsPath.java | 166 ++++++++++++++ .../TestStrictFieldProjectionFilter.java | 119 ++++++++++ .../deprecated/PathGlobPatternTest.java | 59 +++++ parquet_cascading.md | 71 +++--- pom.xml | 3 + 26 files changed, 2098 insertions(+), 500 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetValueScheme.java ---------------------------------------------------------------------- diff --git a/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetValueScheme.java b/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetValueScheme.java index 33d1bb7..9549ef4 100644 --- a/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetValueScheme.java +++ b/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetValueScheme.java @@ -51,17 +51,21 @@ public abstract class ParquetValueScheme extends Scheme implements Serializable { private final FilterPredicate filterPredicate; - private final String projectionString; + private final String deprecatedProjectionString; + private final String strictProjectionString; private final Class klass; - private Config(Class klass, FilterPredicate filterPredicate, String projectionString) { + + private Config(Class klass, FilterPredicate filterPredicate, String deprecatedProjectionString, String strictProjectionString) { this.filterPredicate = filterPredicate; - this.projectionString = projectionString; + this.deprecatedProjectionString = deprecatedProjectionString; + this.strictProjectionString = strictProjectionString; this.klass = klass; } public Config() { filterPredicate = null; - projectionString = null; + deprecatedProjectionString = null; + strictProjectionString = null; klass = null; } @@ -69,8 +73,13 @@ public abstract class ParquetValueScheme extends Scheme getKlass() { @@ -78,15 +87,20 @@ public abstract class ParquetValueScheme extends Scheme withFilterPredicate(FilterPredicate f) { - return new Config(this.klass, checkNotNull(f, "filterPredicate"), this.projectionString); + return new Config(this.klass, checkNotNull(f, "filterPredicate"), this.deprecatedProjectionString, this.strictProjectionString); } + @Deprecated public Config withProjectionString(String p) { - return new Config(this.klass, this.filterPredicate, checkNotNull(p, "projectionFilter")); + return new Config(this.klass, this.filterPredicate, checkNotNull(p, "projectionString"), this.strictProjectionString); + } + + public Config withStrictProjectionString(String p) { + return new Config(this.klass, this.filterPredicate, this.deprecatedProjectionString, checkNotNull(p, "projectionString")); } public Config withRecordClass(Class klass) { - return new Config(checkNotNull(klass, "recordClass"), this.filterPredicate, this.projectionString); + return new Config(checkNotNull(klass, "recordClass"), this.filterPredicate, this.deprecatedProjectionString, this.strictProjectionString); } } @@ -105,9 +119,16 @@ public abstract class ParquetValueScheme extends Scheme extends Scheme jobConfFlowProcess, Tap jobConfRecordReaderOutputCollectorTap, final JobConf jobConf) { setPredicatePushdown(jobConf); setProjectionPushdown(jobConf); + setStrictProjectionPushdown(jobConf); setRecordClass(jobConf); } http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/main/java/org/apache/parquet/Strings.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/main/java/org/apache/parquet/Strings.java b/parquet-common/src/main/java/org/apache/parquet/Strings.java new file mode 100644 index 0000000..4d9fcc9 --- /dev/null +++ b/parquet-common/src/main/java/org/apache/parquet/Strings.java @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; + +import org.apache.parquet.glob.GlobExpander; +import org.apache.parquet.glob.WildcardPath; + +public final class Strings { + private Strings() { } + + /** + * Join an Iterable of Strings into a single string with a delimiter. + * For example, join(Arrays.asList("foo","","bar","x"), "|") would return + * "foo||bar|x" + * + * @param s an iterable of strings + * @param on the delimiter + * @return a single joined string + */ + public static String join(Iterable s, String on) { + Iterator iter = s.iterator(); + StringBuilder sb = new StringBuilder(); + while (iter.hasNext()) { + sb.append(iter.next()); + if (iter.hasNext()) { + sb.append(on); + } + } + return sb.toString(); + } + + /** + * Join an Array of Strings into a single string with a delimiter. + * For example, join(new String[] {"foo","","bar","x"}, "|") would return + * "foo||bar|x" + * + * @param s an iterable of strings + * @param on the delimiter + * @return a single joined string + */ + public static String join(String[] s, String on) { + return join(Arrays.asList(s), on); + } + + /** + * Returns true if s.isEmpty() or s == null + */ + public static boolean isNullOrEmpty(String s) { + return s == null || s.isEmpty(); + } + + /** + * Expands a string with braces ("{}") into all of its possible permutations. + * We call anything inside of {} braces a "one-of" group. + * + * The only special characters in this glob syntax are '}', '{' and ',' + * + * The top-level pattern must not contain any commas, but a "one-of" group separates + * its elements with commas, and a one-of group may contain sub one-of groups. + * + * For example: + * start{a,b,c}end -> startaend, startbend, startcend + * start{a,{b,c},d} -> startaend, startbend, startcend, startdend + * {a,b,c} -> a, b, c + * start{a, b{x,y}} -> starta, startbx, startby + * + * @param globPattern a string in the format described above + * @return a list of all the strings that would satisfy globPattern, including duplicates + */ + public static List expandGlob(String globPattern) { + return GlobExpander.expand(globPattern); + } + + /** + * Expands a string according to {@link #expandGlob(String)}, and then constructs a {@link WildcardPath} + * for each expanded result which can be used to match strings as described in {@link WildcardPath}. + * + * @param globPattern a String to be passed to {@link #expandGlob(String)} + * @param delim the delimeter used by {@link WildcardPath} + */ + public static List expandGlobToWildCardPaths(String globPattern, char delim) { + List ret = new ArrayList(); + for (String expandedGlob : Strings.expandGlob(globPattern)) { + ret.add(new WildcardPath(globPattern, expandedGlob, delim)); + } + return ret; + } +} http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/main/java/org/apache/parquet/glob/GlobExpander.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/main/java/org/apache/parquet/glob/GlobExpander.java b/parquet-common/src/main/java/org/apache/parquet/glob/GlobExpander.java new file mode 100644 index 0000000..79b633c --- /dev/null +++ b/parquet-common/src/main/java/org/apache/parquet/glob/GlobExpander.java @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.glob; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import org.apache.parquet.glob.GlobNode.Atom; +import org.apache.parquet.glob.GlobNode.GlobNodeSequence; +import org.apache.parquet.glob.GlobNode.OneOf; + +/** + * Implementation of {@link org.apache.parquet.Strings#expandGlob(String)} + */ +public final class GlobExpander { + private GlobExpander() { } + + /** + * See {@link org.apache.parquet.Strings#expandGlob(String)} for docs. + */ + public static List expand(String globPattern) { + return GlobExpanderImpl.expand(GlobParser.parse(globPattern)); + } + + /** + * Transforms a tree of {@link GlobNode} into a list of all the strings that satisfy + * this tree. + */ + private final static class GlobExpanderImpl implements GlobNode.Visitor> { + private static final GlobExpanderImpl INSTANCE = new GlobExpanderImpl(); + + private GlobExpanderImpl() {} + + public static List expand(GlobNode node) { + return node.accept(INSTANCE); + } + + @Override + public List visit(Atom atom) { + // atoms are the base case, just return a singleton list + return Arrays.asList(atom.get()); + } + + @Override + public List visit(OneOf oneOf) { + // in the case of OneOf, we just need to take all of + // the possible values the OneOf represents and + // union them together + List results = new ArrayList(); + for (GlobNode n : oneOf.getChildren()) { + results.addAll(n.accept(this)); + } + return results; + } + + @Override + public List visit(GlobNodeSequence seq) { + // in the case of a sequence, for each child + // we need to expand the child into all of its + // possibilities, then do a cross product of + // all the children, in order. + + List results = new ArrayList(); + for (GlobNode n : seq.getChildren()) { + results = crossOrTakeNonEmpty(results, n.accept(this)); + } + return results; + } + + /** + * Computes the cross product of two lists by adding each string in list1 to each string in list2. + * If one of the lists is empty, a copy of the other list is returned. + * If both are empty, an empty list is returned. + */ + public static List crossOrTakeNonEmpty(List list1, List list2) { + if (list1.isEmpty()) { + ArrayList result = new ArrayList(list2.size()); + result.addAll(list2); + return result; + } + + if (list2.isEmpty()) { + ArrayList result = new ArrayList(list1.size()); + result.addAll(list1); + return result; + } + + List result = new ArrayList(list1.size() * list2.size()); + for (String s1 : list1) { + for (String s2 : list2) { + result.add(s1 + s2); + } + } + return result; + } + } +} \ No newline at end of file http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/main/java/org/apache/parquet/glob/GlobNode.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/main/java/org/apache/parquet/glob/GlobNode.java b/parquet-common/src/main/java/org/apache/parquet/glob/GlobNode.java new file mode 100644 index 0000000..c63c780 --- /dev/null +++ b/parquet-common/src/main/java/org/apache/parquet/glob/GlobNode.java @@ -0,0 +1,157 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.glob; + +import java.util.List; + +/** + * A GlobNode represents a tree structure for describing a parsed glob pattern. + * + * GlobNode uses the visitor pattern for tree traversal. + * + * See {@link org.apache.parquet.Strings#expandGlob(String)} + */ +interface GlobNode { + R accept(Visitor visitor); + + static interface Visitor { + T visit(Atom atom); + T visit(OneOf oneOf); + T visit(GlobNodeSequence seq); + } + + /** + * An Atom is just a String, it's a concrete String that is either part + * of the top-level pattern, or one of the choices in a OneOf clause, or an + * element in a GlobNodeSequence. In this sense it's the base case or leaf node + * of a GlobNode tree. + * + * For example, in pre{x,y{a,b}}post pre, x, y, z, b, and post are all Atoms. + */ + static class Atom implements GlobNode { + private final String s; + + public Atom(String s) { + this.s = s; + } + + public String get() { + return s; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + return getClass() == o.getClass() && s.equals(((Atom) o).s); + } + + @Override + public int hashCode() { + return s.hashCode(); + } + + @Override + public String toString() { + return "Atom(" + s + ")"; + } + + @Override + public R accept(Visitor visitor) { + return visitor.visit(this); + } + } + + /** + * A OneOf represents a {} clause in a glob pattern, which means + * "one of the elements of this set must be satisfied", for example: + * in pre{x,y} {x,y} is a OneOf, and in or pre{x, {a,b}}post both {x, {a,b}} + * and {a,b} are OneOfs. + */ + static class OneOf implements GlobNode { + private final List children; + + public OneOf(List children) { + this.children = children; + } + + public List getChildren() { + return children; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + return getClass() == o.getClass() && children.equals(((OneOf) o).children); + } + + @Override + public int hashCode() { + return children.hashCode(); + } + + @Override + public String toString() { + return "OneOf" + children; + } + + @Override + public R accept(Visitor visitor) { + return visitor.visit(this); + } + } + + /** + * A GlobNodeSequence is an ordered collection of GlobNodes that must be satisfied in order, + * and represents structures like pre{x,y}post or {x,y}{a,b}. In {test, pre{x,y}post}, pre{x,y}post is a + * GlobNodeSequence. Unlike a OneOf, GlobNodeSequence's children have an ordering that is meaningful and + * the requirements of its children must each be satisfied. + */ + static class GlobNodeSequence implements GlobNode { + private final List children; + + public GlobNodeSequence(List children) { + this.children = children; + } + + public List getChildren() { + return children; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + return getClass() == o.getClass() && children.equals(((OneOf) o).children); + } + + @Override + public int hashCode() { + return children.hashCode(); + } + + @Override + public String toString() { + return "GlobNodeSequence" + children; + } + + @Override + public R accept(Visitor visitor) { + return visitor.visit(this); + } + } +} http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/main/java/org/apache/parquet/glob/GlobParser.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/main/java/org/apache/parquet/glob/GlobParser.java b/parquet-common/src/main/java/org/apache/parquet/glob/GlobParser.java new file mode 100644 index 0000000..d49bd52 --- /dev/null +++ b/parquet-common/src/main/java/org/apache/parquet/glob/GlobParser.java @@ -0,0 +1,224 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.glob; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import org.apache.parquet.glob.GlobNode.Atom; +import org.apache.parquet.glob.GlobNode.GlobNodeSequence; +import org.apache.parquet.glob.GlobNode.OneOf; + +final class GlobParser { + private GlobParser() { } + + /** + * Parse a String into a {@link GlobNodeSequence} + * + * See {@link org.apache.parquet.Strings#expandGlob(String)} + */ + public static GlobNodeSequence parse(String pattern) { + /* + * The parse algorithm works as follows, assuming we are parsing: + * "apache{one,pre{x,y}post,two}parquet{a,b}" + * + * 1) Begin scanning the string until we find the first { + * + * 2) Now that we've found the beginning of a glob group, scan forwards + * until the end of this glob group (by counting { and } we see until we find + * the closing } for the group we found in step 1). + * + * 3) Once the matching closing } is found we need to do two things. First, everything + * from the end of the last group up to start of this group is an Atom, so in the example + * above, once we've found that "{one,pre{x,y}post,two}" is the first group, we need to grab + * "apache" and treat it as an atom and add it to our sequence. + * Then, we parse "{one,pre{x,y}post,two}" using a similar but slightly different function (parseOneOf) + * and add the result from that to our sequence. + * + * 4) Repeat until the end of the string -- so next we find {a,b} and add "parquet" as an Atom and parse + * {a,b} using parseOneOf. + */ + + if (pattern.isEmpty() || pattern.equals("{}")) { + return new GlobNodeSequence(Arrays.asList(new Atom(""))); + } + + // the outer parse method needs to parse the pattern into a + // GlobNodeSequence, though it may end up being a singleton sequence + List children = new ArrayList(); + + int unmatchedBraces = 0; // count of unmatched braces + int firstBrace = 0; // open brace of current group being processsed + int anchor = 0; // first un-parsed character position + + for (int i = 0; i < pattern.length(); i++) { + char c = pattern.charAt(i); + + switch (c) { + case ',': + if (unmatchedBraces == 0) { + // commas not allowed in the top level expression + // TODO: maybe turn this check off? + throw new GlobParseException("Unexpected comma outside of a {} group:\n" + + annotateMessage(pattern, i)); + } + break; + case '{': + if (unmatchedBraces == 0) { + // this is the first brace of an outermost {} group + firstBrace = i; + } + unmatchedBraces++; + break; + case '}': + unmatchedBraces--; + if (unmatchedBraces < 0) { + throw new GlobParseException("Unexpected closing }:\n" + + annotateMessage(pattern, i)); + } + if (unmatchedBraces == 0) { + // grab everything from the end of the last group up to here, + // not including the close brace, it is an Atom in our sequence + // (assuming it's not empty) + if (anchor != firstBrace) { + // not empty! + // (substring's end param is exclusive) + children.add(new Atom(pattern.substring(anchor, firstBrace))); + } + + // grab the group, parse it, add it to our sequence, and then continue + // note that we skip the braces on both sides (substring's end param is exclusive) + children.add(parseOneOf(pattern.substring(firstBrace + 1, i))); + + // we have now parsed all the way up to here, the next un-parsed char is i + 1 + anchor = i + 1; + } + break; + } + } + + if (unmatchedBraces > 0) { + throw new GlobParseException("Not enough close braces in: " + pattern); + } + + if (anchor != pattern.length()) { + // either there were no {} groups, or there were some characters after the + // last }, either way whatever is left (could be the entire input) is an Atom + // in our sequence + children.add(new Atom(pattern.substring(anchor, pattern.length()))); + } + + return new GlobNodeSequence(children); + } + + private static OneOf parseOneOf(String pattern) { + /* + * This method is only called when parsing the inside of a {} expression. + * So in the example above, of calling parse("apache{one,pre{x,y}post,two}parquet{a,b}") + * this method will get called on first "one,pre{x,y}post,two", then on "x,y" and then on "a,b" + * + * The inside of a {} expression essentially means "one of these comma separated expressions". + * So this gets parsed slightly differently than the top level string passed to parse(). + * + * The algorithm works as follows: + * 1) Split the string on ',' -- but only commas that are not inside of {} expressions + * 2) Each of the splits can be parsed via the parse() method above + * 3) Add all parsed splits to a single parent OneOf. + */ + + // this inner parse method needs to parse the pattern into a + // OneOf, though it may end up being a singleton OneOf + List children = new ArrayList(); + + int unmatchedBraces = 0; // count of unmatched braces + int anchor = 0; // first un-parsed character position + + for (int i = 0; i < pattern.length(); i++) { + char c = pattern.charAt(i); + + switch (c) { + case ',': + // only "split" on commas not nested inside of {} + if (unmatchedBraces == 0) { + // ok, this comma is not inside of a {}, so + // grab everything from anchor to here, parse it, and add it + // as one of the options in this OneOf + children.add(parse(pattern.substring(anchor, i))); + + // we have now parsed up to this comma, the next un-parsed char is i + 1 + anchor = i + 1; + } + break; + case '{': + unmatchedBraces++; + break; + case '}': + unmatchedBraces--; + if (unmatchedBraces < 0) { + throw new GlobParseException("Unexpected closing }:\n" + + annotateMessage(pattern, i)); + } + break; + } + } + + if (unmatchedBraces > 0) { + throw new GlobParseException("Not enough close braces in: " + pattern); + } + + if (anchor != pattern.length()) { + // either there were no commas outside of {} groups, or there were some characters after the + // last comma, either way whatever is left (could be the entire input) is an Atom + // in our sequence + children.add(parse(pattern.substring(anchor, pattern.length()))); + } + + if (pattern.length() > 0 && pattern.charAt(pattern.length() - 1) == ',') { + // the above loop won't handle a trailing comma + children.add(parse("")); + } + + return new OneOf(children); + } + + // for pretty printing which character had the error + private static String annotateMessage(String message, int pos) { + StringBuilder sb = new StringBuilder(message); + sb.append('\n'); + for (int i = 0; i < pos; i++) { + sb.append('-'); + } + sb.append('^'); + return sb.toString(); + } + + public static class GlobParseException extends RuntimeException { + public GlobParseException() { + } + + public GlobParseException(String message) { + super(message); + } + + public GlobParseException(String message, Throwable cause) { + super(message, cause); + } + } +} http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/main/java/org/apache/parquet/glob/WildcardPath.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/main/java/org/apache/parquet/glob/WildcardPath.java b/parquet-common/src/main/java/org/apache/parquet/glob/WildcardPath.java new file mode 100644 index 0000000..e1e4bd3 --- /dev/null +++ b/parquet-common/src/main/java/org/apache/parquet/glob/WildcardPath.java @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.glob; + +import java.util.regex.Pattern; + +import org.apache.parquet.Preconditions; + +/** + * Holds a String with wildcards ('*'), and can answer whether a given string matches this WildcardPath. + * For example: + * "foo.*.baz" or "foo*baz.bar*" + * + * The '*' in "foo*bar" is treated the same way that java regex treats "(.*)", + * and all WildcardPath's are considered to match child paths. + * For example, "foo.bar" will match "foo.bar.baz". It will not match "foo.barbaz" however. + * To match "foo.barbaz" the pattern "foo.bar*" could be used, which would also match "foo.barbaz.x" + * + * Only '*' is considered a special character. + * All other characters are not treated as special characters, including '{', '}', '.', and '/' + * with one exception -- the delimiter character is used for matching against child paths as explained above. + * + * It is assumed that {} globs have already been expanded before constructing + * this object. + */ +public class WildcardPath { + private static final String STAR_REGEX = "(.*)"; + private static final String MORE_NESTED_FIELDS_TEMPLATE = "((%s).*)?"; + private final String parentGlobPath; + private final String originalPattern; + private final Pattern pattern; + + public WildcardPath(String parentGlobPath, String wildcardPath, char delim) { + this.parentGlobPath = Preconditions.checkNotNull(parentGlobPath, "parentGlobPath"); + this.originalPattern = Preconditions.checkNotNull(wildcardPath, "wildcardPath"); + this.pattern = Pattern.compile(buildRegex(wildcardPath, delim)); + } + + public static String buildRegex(String wildcardPath, char delim) { + if (wildcardPath.isEmpty()) { + return wildcardPath; + } + + String delimStr = Pattern.quote(Character.toString(delim)); + + String[] splits = wildcardPath.split("\\*", -1); // -1 means keep trailing empty strings + StringBuilder regex = new StringBuilder(); + + for (int i = 0; i < splits.length; i++) { + if ((i == 0 || i == splits.length - 1) && splits[i].isEmpty()) { + // there was a * at the beginning or end of the string, so add a regex wildcard + regex.append(STAR_REGEX); + continue; + } + + if (splits[i].isEmpty()) { + // means there was a double asterisk, we've already + // handled this just keep going. + continue; + } + + // don't treat this part of the string as a regex, escape + // the entire thing + regex.append(Pattern.quote(splits[i])); + + if (i < splits.length - 1) { + // this isn't the last split, so add a * + regex.append(STAR_REGEX); + } + } + // x.y.z should match "x.y.z" and also "x.y.z.foo.bar" + regex.append(String.format(MORE_NESTED_FIELDS_TEMPLATE, delimStr)); + return regex.toString(); + } + + public boolean matches(String path) { + return pattern.matcher(path).matches(); + } + + public String getParentGlobPath() { + return parentGlobPath; + } + + public String getOriginalPattern() { + return originalPattern; + } + + @Override + public String toString() { + return String.format("WildcardPath(parentGlobPath: '%s', pattern: '%s')", + parentGlobPath, originalPattern); + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + WildcardPath wildcardPath = (WildcardPath) o; + return originalPattern.equals(wildcardPath.originalPattern); + } + + @Override + public int hashCode() { + return originalPattern.hashCode(); + } +} http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java b/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java index a34cb91..4383b0f 100644 --- a/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java +++ b/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java @@ -22,6 +22,8 @@ import java.io.Serializable; import java.util.Arrays; import java.util.Iterator; +import org.apache.parquet.Strings; + import static org.apache.parquet.Preconditions.checkNotNull; public final class ColumnPath implements Iterable, Serializable { @@ -66,15 +68,7 @@ public final class ColumnPath implements Iterable, Serializable { } public String toDotString() { - Iterator iter = Arrays.asList(p).iterator(); - StringBuilder sb = new StringBuilder(); - while (iter.hasNext()) { - sb.append(iter.next()); - if (iter.hasNext()) { - sb.append('.'); - } - } - return sb.toString(); + return Strings.join(p, "."); } @Override http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/test/java/org/apache/parquet/glob/TestGlob.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/test/java/org/apache/parquet/glob/TestGlob.java b/parquet-common/src/test/java/org/apache/parquet/glob/TestGlob.java new file mode 100644 index 0000000..d9902a3 --- /dev/null +++ b/parquet-common/src/test/java/org/apache/parquet/glob/TestGlob.java @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.glob; + +import java.util.Arrays; + +import org.apache.parquet.Strings; +import org.apache.parquet.glob.GlobParser.GlobParseException; +import org.junit.Test; + +import junit.framework.Assert; + +import static junit.framework.Assert.fail; +import static org.junit.Assert.assertEquals; + +public class TestGlob { + + @Test + public void testNoGlobs() { + assertEquals(Arrays.asList("foo"), Strings.expandGlob("foo")); + } + + @Test + public void testEmptyGroup() { + assertEquals(Arrays.asList(""), Strings.expandGlob("")); + assertEquals(Arrays.asList(""), Strings.expandGlob("{}")); + assertEquals(Arrays.asList("a"), Strings.expandGlob("a{}")); + assertEquals(Arrays.asList("ab"), Strings.expandGlob("a{}b")); + assertEquals(Arrays.asList("a"), Strings.expandGlob("{}a")); + assertEquals(Arrays.asList("a"), Strings.expandGlob("a{}")); + assertEquals(Arrays.asList("", ""), Strings.expandGlob("{,}")); + assertEquals(Arrays.asList("ab", "a", "ac"), Strings.expandGlob("a{b,{},c}")); + } + + @Test + public void testSingleLevel() { + assertEquals(Arrays.asList("foobar", "foobaz"), Strings.expandGlob("foo{bar,baz}")); + assertEquals(Arrays.asList("startfooend", "startbarend"), Strings.expandGlob("start{foo,bar}end")); + assertEquals(Arrays.asList("fooend", "barend"), Strings.expandGlob("{foo,bar}end")); + assertEquals(Arrays.asList( + "startfooenda", "startfooendb", "startfooendc", "startfooendd", + "startbarenda", "startbarendb", "startbarendc", "startbarendd"), + Strings.expandGlob("start{foo,bar}end{a,b,c,d}")); + assertEquals(Arrays.asList("xa", "xb", "xc", "ya", "yb", "yc"), Strings.expandGlob("{x,y}{a,b,c}")); + assertEquals(Arrays.asList("x", "y", "z"), Strings.expandGlob("{x,y,z}")); + } + + @Test + public void testNested() { + assertEquals(Arrays.asList( + "startoneend", "startpretwopostend", "startprethreepostend", + "startfourend", "startfiveend", "a", "b", "foox", "fooy"), + Strings.expandGlob("{start{one,pre{two,three}post,{four,five}}end,a,b,foo{x,y}}")); + } + + @Test + public void testExtraBraces() { + assertEquals(Arrays.asList("x", "y", "z"), Strings.expandGlob("{{x,y,z}}")); + assertEquals(Arrays.asList("x", "y", "z"), Strings.expandGlob("{{{x,y,z}}}")); + assertEquals(Arrays.asList("startx", "starta", "startb", "starty"), Strings.expandGlob("start{x,{a,b},y}")); + } + + @Test + public void testCommaInTopLevel() { + try { + Strings.expandGlob("foo,bar"); + fail("This should throw"); + } catch (GlobParseException e) { + Assert.assertEquals("Unexpected comma outside of a {} group:\n" + + "foo,bar\n" + + "---^", e.getMessage()); + } + } + + @Test + public void testCommaCornerCases() { + // single empty string in each location + assertEquals(Arrays.asList("foobar", "foo", "foobaz"), Strings.expandGlob("foo{bar,,baz}")); + assertEquals(Arrays.asList("foo", "foobar", "foobaz"), Strings.expandGlob("foo{,bar,baz}")); + assertEquals(Arrays.asList("foobar", "foobaz", "foo"), Strings.expandGlob("foo{bar,baz,}")); + + // multiple empty strings + assertEquals(Arrays.asList("foobar", "foo", "foo", "foobaz"), Strings.expandGlob("foo{bar,,,baz}")); + assertEquals(Arrays.asList("foo", "foo", "foobar", "foobaz"), Strings.expandGlob("foo{,,bar,baz}")); + assertEquals(Arrays.asList("foobar", "foobaz", "foo", "foo"), Strings.expandGlob("foo{bar,baz,,}")); + + // between groups + assertEquals(Arrays.asList("x", "y", "", "a", "b"), Strings.expandGlob("{{x,y},,{a,b}}")); + } + + private void assertNotEnoughCloseBraces(String s) { + String expected = "Not enough close braces in: "; + try { + Strings.expandGlob(s); + fail("this should throw"); + } catch (GlobParseException e) { + Assert.assertEquals(expected, e.getMessage().substring(0, expected.length())); + } + } + + private void assertTooManyCloseBraces(String s) { + String expected = "Unexpected closing }:"; + try { + Strings.expandGlob(s); + fail("this should throw"); + } catch (GlobParseException e) { + Assert.assertEquals(expected, e.getMessage().substring(0, expected.length())); + } + } + + + @Test + public void testMismatchedBraces() { + assertNotEnoughCloseBraces("{"); + assertNotEnoughCloseBraces("{}{}{}{{}{}{"); + assertNotEnoughCloseBraces("foo{bar"); + assertNotEnoughCloseBraces("foo{{bar}"); + assertNotEnoughCloseBraces("foo{}{{bar}"); + + assertTooManyCloseBraces("{}}{"); + assertTooManyCloseBraces("}"); + assertTooManyCloseBraces("{}{}{}}{}{}{"); + assertTooManyCloseBraces("foo}bar"); + assertTooManyCloseBraces("foo}}bar}"); + assertTooManyCloseBraces("foo{}{{bar}}}"); + } + +} http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-common/src/test/java/org/apache/parquet/glob/TestWildcardPath.java ---------------------------------------------------------------------- diff --git a/parquet-common/src/test/java/org/apache/parquet/glob/TestWildcardPath.java b/parquet-common/src/test/java/org/apache/parquet/glob/TestWildcardPath.java new file mode 100644 index 0000000..d8af8b4 --- /dev/null +++ b/parquet-common/src/test/java/org/apache/parquet/glob/TestWildcardPath.java @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.glob; + +import org.junit.Test; + +import static org.junit.Assert.fail; + + +public class TestWildcardPath { + + private static void assertMatches(WildcardPath wp, String... strings) { + for (String s : strings) { + if (!wp.matches(s)) { + fail(String.format("String '%s' was expected to match '%s'", s, wp)); + } + } + } + + private static void assertDoesNotMatch(WildcardPath wp, String... strings) { + for (String s : strings) { + if (wp.matches(s)) { + fail(String.format("String '%s' was not expected to match '%s'", s, wp)); + } + } + } + + + @Test + public void testNoWildcards() { + WildcardPath wp = new WildcardPath("", "foo", '.'); + assertMatches(wp, "foo", "foo.x", "foo.x.y"); + assertDoesNotMatch(wp, "xfoo", "xfoox", "fooa.x.y"); + } + + @Test + public void testStarMatchesEverything() { + WildcardPath wp = new WildcardPath("", "*", '.'); + assertMatches(wp, "", ".", "hi", "foo.bar", "*", "foo."); + } + + @Test + public void testChildrenPathsMatch() { + WildcardPath wp = new WildcardPath("", "x.y.z", '.'); + assertMatches(wp, "x.y.z", "x.y.z.bar", "x.y.z.bar.baz.bop"); + assertDoesNotMatch(wp, "x.y.zzzz", "x.y.b", "x.y.a.z", "x.y.zhi.z"); + } + + @Test + public void testEmptyString() { + WildcardPath wp = new WildcardPath("", "", '.'); + assertMatches(wp, ""); + assertDoesNotMatch(wp, "x"); + } + + @Test + public void testDoubleStarsIgnored() { + WildcardPath wp = new WildcardPath("", "foo**bar", '.'); + assertMatches(wp, "foobar", "fooxyzbar", "foo.x.y.z.bar"); + assertDoesNotMatch(wp, "fobar", "hi", "foobazr"); + + wp = new WildcardPath("", "foo********bar", '.'); + assertMatches(wp, "foobar", "fooxyzbar", "foo.x.y.z.bar"); + assertDoesNotMatch(wp, "fobar", "hi", "foobazr"); + } + + @Test + public void testStarsAtBeginAndEnd() { + WildcardPath wp = new WildcardPath("", "*x.y.z", '.'); + assertMatches(wp, "a.b.c.x.y.z", "x.y.z", "zoopx.y.z", "zoopx.y.z.child"); + assertDoesNotMatch(wp, "a.b.c.x.y", "xy.z", "hi"); + + wp = new WildcardPath("", "*.x.y.z", '.'); + assertMatches(wp, "a.b.c.x.y.z", "foo.x.y.z", "foo.x.y.z.child"); + assertDoesNotMatch(wp, "x.y.z", "a.b.c.x.y", "xy.z", "hi", "zoopx.y.z", "zoopx.y.z.child"); + + + wp = new WildcardPath("", "x.y.z*", '.'); + assertMatches(wp, "x.y.z", "x.y.z.foo", "x.y.zoo", "x.y.zoo.bar"); + assertDoesNotMatch(wp, "a.b.c.x.y.z", "foo.x.y.z", "hi"); + + wp = new WildcardPath("", "x.y.z.*", '.'); + assertMatches(wp, "x.y.z.foo", "x.y.z.bar.baz"); + assertDoesNotMatch(wp, "x.y.z", "a.b.c.x.y.z", "x.y.zoo", "foo.x.y.z", "hi", "x.y.zoo.bar"); + } + + @Test + public void testComplex() { + WildcardPath wp = new WildcardPath("", "*.street", '.'); + assertMatches(wp, + "home.address.street", "home.address.street.number", "work.address.street", "work.address.street.foo", + "street.street", "street.street.street.street", "thing.street.thing" + ); + + assertDoesNotMatch(wp, + "home.address.street_2", "home.address.street_2.number", "work.addressstreet", "work.addressstreet.foo", "", + "x.y.z.street2", "x.y.z.street2.z" + ); + + wp = new WildcardPath("", "x.y.*_stat.average", '.'); + assertMatches(wp, "x.y.z_stat.average", "x.y.foo_stat.average", "x.y.z.a.b_stat.average", + "x.y.z.a.b_stat.average.child", "x.y.z._stat.average"); + assertDoesNotMatch(wp, "x.y.z_stats.average", "x.y.z_stat.averages", "x.y_stat.average", "x.yyy.foo_stat.average"); + + wp = new WildcardPath("", "x.y.pre*.bar", '.'); + assertMatches(wp, "x.y.pre.bar", "x.y.preabc.bar", "x.y.prebar.bar"); + assertDoesNotMatch(wp, "x.y.pre.baraaaa", "x.y.preabc.baraaaa"); + } +} http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java ---------------------------------------------------------------------- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java b/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java index 871f817..cb9bf66 100644 --- a/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java +++ b/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java @@ -28,6 +28,7 @@ import org.apache.thrift.TBase; import org.apache.thrift.protocol.TProtocol; import org.apache.parquet.Log; +import org.apache.parquet.Strings; import org.apache.parquet.hadoop.api.InitContext; import org.apache.parquet.hadoop.api.ReadSupport; import org.apache.parquet.io.ParquetDecodingException; @@ -38,20 +39,29 @@ import org.apache.parquet.thrift.ThriftMetaData; import org.apache.parquet.thrift.ThriftRecordConverter; import org.apache.parquet.thrift.ThriftSchemaConverter; import org.apache.parquet.thrift.projection.FieldProjectionFilter; +import org.apache.parquet.thrift.projection.StrictFieldProjectionFilter; import org.apache.parquet.thrift.projection.ThriftProjectionException; +import org.apache.parquet.thrift.projection.deprecated.DeprecatedFieldProjectionFilter; import org.apache.parquet.thrift.struct.ThriftType.StructType; public class ThriftReadSupport extends ReadSupport { private static final Log LOG = Log.getLog(ThriftReadSupport.class); /** - * configuration key for thrift read projection schema + * Deprecated. Use {@link #STRICT_THRIFT_COLUMN_FILTER_KEY} + * Accepts a ";" delimited list of globs in the syntax implemented by {@link DeprecatedFieldProjectionFilter} */ - public static final String THRIFT_COLUMN_FILTER_KEY = "parquet.thrift.column.filter"; + @Deprecated + public static final String DEPRECATED_THRIFT_COLUMN_FILTER_KEY = "parquet.thrift.column.filter"; + + /** + * Accepts a ";" delimited list of glob paths, in the syntax implemented by {@link StrictFieldProjectionFilter} + */ + public static final String STRICT_THRIFT_COLUMN_FILTER_KEY = "parquet.thrift.column.projection.globs"; + private static final String RECORD_CONVERTER_DEFAULT = TBaseRecordConverter.class.getName(); public static final String THRIFT_READ_CLASS_KEY = "parquet.thrift.read.class"; - /** * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such @@ -87,6 +97,43 @@ public class ThriftReadSupport extends ReadSupport { conf.set(RECORD_CONVERTER_CLASS_KEY, klass.getName()); } + @Deprecated + public static void setProjectionPushdown(JobConf jobConf, String projectionString) { + jobConf.set(DEPRECATED_THRIFT_COLUMN_FILTER_KEY, projectionString); + } + + public static void setStrictFieldProjectionFilter(Configuration conf, String semicolonDelimitedGlobs) { + conf.set(STRICT_THRIFT_COLUMN_FILTER_KEY, semicolonDelimitedGlobs); + } + + public static FieldProjectionFilter getFieldProjectionFilter(Configuration conf) { + String deprecated = conf.get(DEPRECATED_THRIFT_COLUMN_FILTER_KEY); + String strict = conf.get(STRICT_THRIFT_COLUMN_FILTER_KEY); + + if (Strings.isNullOrEmpty(deprecated) && Strings.isNullOrEmpty(strict)) { + return null; + } + + if(!Strings.isNullOrEmpty(deprecated) && !Strings.isNullOrEmpty(strict)) { + throw new ThriftProjectionException( + "You cannot provide both " + + DEPRECATED_THRIFT_COLUMN_FILTER_KEY + + " and " + + STRICT_THRIFT_COLUMN_FILTER_KEY + +"! " + + DEPRECATED_THRIFT_COLUMN_FILTER_KEY + + " is deprecated." + ); + } + + if (!Strings.isNullOrEmpty(deprecated)) { + LOG.warn(String.format("Using %s is deprecated. Please see the docs for %s!", + DEPRECATED_THRIFT_COLUMN_FILTER_KEY, STRICT_THRIFT_COLUMN_FILTER_KEY)); + return new DeprecatedFieldProjectionFilter(deprecated); + } + + return StrictFieldProjectionFilter.fromSemicolonDelimitedString(strict); + } /** * used from hadoop @@ -102,29 +149,31 @@ public class ThriftReadSupport extends ReadSupport { this.thriftClass = thriftClass; } - - @Override public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(InitContext context) { final Configuration configuration = context.getConfiguration(); final MessageType fileMessageType = context.getFileSchema(); MessageType requestedProjection = fileMessageType; String partialSchemaString = configuration.get(ReadSupport.PARQUET_READ_SCHEMA); - String projectionFilterString = configuration.get(THRIFT_COLUMN_FILTER_KEY); - if (partialSchemaString != null && projectionFilterString != null) - throw new ThriftProjectionException("PARQUET_READ_SCHEMA and THRIFT_COLUMN_FILTER_KEY are both specified, should use only one."); + FieldProjectionFilter projectionFilter = getFieldProjectionFilter(configuration); + + if (partialSchemaString != null && projectionFilter != null) { + throw new ThriftProjectionException( + String.format("You cannot provide both a partial schema and field projection filter." + + "Only one of (%s, %s, %s) should be set.", + PARQUET_READ_SCHEMA, STRICT_THRIFT_COLUMN_FILTER_KEY, DEPRECATED_THRIFT_COLUMN_FILTER_KEY)); + } //set requestedProjections only when it's specified if (partialSchemaString != null) { requestedProjection = getSchemaForRead(fileMessageType, partialSchemaString); - } else if (projectionFilterString != null && !projectionFilterString.isEmpty()) { - FieldProjectionFilter fieldProjectionFilter = new FieldProjectionFilter(projectionFilterString); + } else if (projectionFilter != null) { try { initThriftClassFromMultipleFiles(context.getKeyValueMetadata(), configuration); - requestedProjection = getProjectedSchema(fieldProjectionFilter); + requestedProjection = getProjectedSchema(projectionFilter); } catch (ClassNotFoundException e) { - throw new ThriftProjectionException("can not find thriftClass from configuration"); + throw new ThriftProjectionException("can not find thriftClass from configuration", e); } } @@ -132,6 +181,7 @@ public class ThriftReadSupport extends ReadSupport { return new ReadContext(schemaForRead); } + @SuppressWarnings("unchecked") protected MessageType getProjectedSchema(FieldProjectionFilter fieldProjectionFilter) { return new ThriftSchemaConverter(fieldProjectionFilter).convert((Class>)thriftClass); } @@ -188,8 +238,4 @@ public class ThriftReadSupport extends ReadSupport { throw new RuntimeException("Unable to create Thrift Converter for Thrift metadata " + thriftMetaData, t); } } - - public static void setProjectionPushdown(JobConf jobConf, String projectionString) { - jobConf.set(THRIFT_COLUMN_FILTER_KEY, projectionString); - } } http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java ---------------------------------------------------------------------- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java b/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java index ec5a371..3081d87 100644 --- a/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java +++ b/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConvertVisitor.java @@ -35,6 +35,7 @@ import static org.apache.parquet.schema.Types.primitive; import java.util.ArrayList; import java.util.List; +import org.apache.parquet.Preconditions; import org.apache.parquet.schema.GroupType; import org.apache.parquet.schema.MessageType; import org.apache.parquet.schema.OriginalType; @@ -49,25 +50,21 @@ import org.apache.parquet.thrift.struct.ThriftField; import org.apache.parquet.thrift.struct.ThriftType; /** - * Visitor Class for converting a thrift definiton to parquet message type. + * Visitor Class for converting a thrift definition to parquet message type. * Projection can be done by providing a {@link FieldProjectionFilter} * * @author Tianshuo Deng */ public class ThriftSchemaConvertVisitor implements ThriftType.TypeVisitor { + private final FieldProjectionFilter fieldProjectionFilter; + private final FieldsPath currentFieldPath = new FieldsPath(); - public FieldProjectionFilter getFieldProjectionFilter() { - return fieldProjectionFilter; - } - - FieldProjectionFilter fieldProjectionFilter; - Type currentType; - FieldsPath currentFieldPath = new FieldsPath(); - Type.Repetition currentRepetition = Type.Repetition.REPEATED;//MessageType is repeated GroupType - String currentName = "ParquetSchema"; + private Type currentType; + private Type.Repetition currentRepetition = Type.Repetition.REPEATED; // MessageType is repeated GroupType + private String currentName = "ParquetSchema"; public ThriftSchemaConvertVisitor(FieldProjectionFilter fieldProjectionFilter) { - this.fieldProjectionFilter = fieldProjectionFilter; + this.fieldProjectionFilter = Preconditions.checkNotNull(fieldProjectionFilter, "fieldProjectionFilter"); } @Override @@ -120,9 +117,7 @@ public class ThriftSchemaConvertVisitor implements ThriftType.TypeVisitor { currentRepetition = REPEATED; setElemField.getType().accept(this); //after conversion, currentType is the nested type - if (currentType == null) { - return; - } else { + if (currentType != null) { currentType = listType(setRepetition, setName, currentType); } } @@ -136,12 +131,9 @@ public class ThriftSchemaConvertVisitor implements ThriftType.TypeVisitor { currentRepetition = REPEATED; setElemField.getType().accept(this); //after conversion, currentType is the nested type - if (currentType == null) { - return; - } else { + if (currentType != null) { currentType = listType(listRepetition, listName, currentType); } - } public MessageType getConvertedMessageType() { @@ -173,10 +165,8 @@ public class ThriftSchemaConvertVisitor implements ThriftType.TypeVisitor { private List getFieldsTypes(List fields) { List types = new ArrayList(); - for (int i = 0; i < fields.size(); i++) { - ThriftField field = fields.get(i); - Type.Repetition rep = getRepetition(field); - currentRepetition = rep; + for (ThriftField field : fields) { + currentRepetition = getRepetition(field); currentName = field.getName(); currentFieldPath.push(field); field.getType().accept(this); @@ -190,7 +180,7 @@ public class ThriftSchemaConvertVisitor implements ThriftType.TypeVisitor { } private boolean isCurrentlyMatchedFilter(){ - if(!fieldProjectionFilter.isMatched(currentFieldPath)){ + if(!fieldProjectionFilter.keep(currentFieldPath)){ currentType = null; return false; } http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java ---------------------------------------------------------------------- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java b/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java index c580b54..f9f23eb 100644 --- a/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java +++ b/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java @@ -26,8 +26,6 @@ import org.apache.thrift.TUnion; import org.apache.parquet.schema.MessageType; import org.apache.parquet.thrift.projection.FieldProjectionFilter; -import org.apache.parquet.thrift.projection.PathGlobPattern; -import org.apache.parquet.thrift.projection.ThriftProjectionException; import org.apache.parquet.thrift.struct.ThriftField; import org.apache.parquet.thrift.struct.ThriftField.Requirement; import org.apache.parquet.thrift.struct.ThriftType; @@ -44,15 +42,10 @@ import java.util.List; * a {@link FieldProjectionFilter} can be specified for projection pushdown. */ public class ThriftSchemaConverter { - private final FieldProjectionFilter fieldProjectionFilter; - public static > StructOrUnionType structOrUnionType(Class klass) { - return TUnion.class.isAssignableFrom(klass) ? StructOrUnionType.UNION : StructOrUnionType.STRUCT; - } - public ThriftSchemaConverter() { - this(new FieldProjectionFilter()); + this(FieldProjectionFilter.ALL_COLUMNS); } public ThriftSchemaConverter(FieldProjectionFilter fieldProjectionFilter) { @@ -60,52 +53,43 @@ public class ThriftSchemaConverter { } public MessageType convert(Class> thriftClass) { - return convert(new ThriftStructConverter().toStructType(thriftClass)); + return convert(toStructType(thriftClass)); } public MessageType convert(StructType thriftClass) { ThriftSchemaConvertVisitor visitor = new ThriftSchemaConvertVisitor(fieldProjectionFilter); thriftClass.accept(visitor); MessageType convertedMessageType = visitor.getConvertedMessageType(); - checkUnmatchedProjectionFilter(visitor.getFieldProjectionFilter()); + fieldProjectionFilter.assertNoUnmatchedPatterns(); return convertedMessageType; } - private void checkUnmatchedProjectionFilter(FieldProjectionFilter filter) { - List unmatched = filter.getUnMatchedPatterns(); - if (unmatched.size() != 0) { - throw new ThriftProjectionException("unmatched projection filters: " + unmatched.toString()); - } + public static > StructOrUnionType structOrUnionType(Class klass) { + return TUnion.class.isAssignableFrom(klass) ? StructOrUnionType.UNION : StructOrUnionType.STRUCT; } - public ThriftType.StructType toStructType(Class> thriftClass) { - return new ThriftStructConverter().toStructType(thriftClass); + public static ThriftType.StructType toStructType(Class> thriftClass) { + final TStructDescriptor struct = TStructDescriptor.getInstance(thriftClass); + return toStructType(struct); } - private static class ThriftStructConverter { - - public ThriftType.StructType toStructType(Class> thriftClass) { - final TStructDescriptor struct = TStructDescriptor.getInstance(thriftClass); - return toStructType(struct); - } - - private StructType toStructType(TStructDescriptor struct) { - List fields = struct.getFields(); - List children = new ArrayList(fields.size()); - for (int i = 0; i < fields.size(); i++) { - Field field = fields.get(i); - Requirement req = - field.getFieldMetaData() == null ? - Requirement.OPTIONAL : - Requirement.fromType(field.getFieldMetaData().requirementType); - children.add(toThriftField(field.getName(), field, req)); - } - return new StructType(children, structOrUnionType(struct.getThriftClass())); + private static StructType toStructType(TStructDescriptor struct) { + List fields = struct.getFields(); + List children = new ArrayList(fields.size()); + for (int i = 0; i < fields.size(); i++) { + Field field = fields.get(i); + Requirement req = + field.getFieldMetaData() == null ? + Requirement.OPTIONAL : + Requirement.fromType(field.getFieldMetaData().requirementType); + children.add(toThriftField(field.getName(), field, req)); } + return new StructType(children, structOrUnionType(struct.getThriftClass())); + } - private ThriftField toThriftField(String name, Field field, ThriftField.Requirement requirement) { - ThriftType type; - switch (ThriftTypeID.fromByte(field.getType())) { + private static ThriftField toThriftField(String name, Field field, ThriftField.Requirement requirement) { + ThriftType type; + switch (ThriftTypeID.fromByte(field.getType())) { case STOP: case VOID: default: @@ -138,8 +122,8 @@ public class ThriftSchemaConverter { final Field mapKeyField = field.getMapKeyField(); final Field mapValueField = field.getMapValueField(); type = new ThriftType.MapType( - toThriftField(mapKeyField.getName(), mapKeyField, requirement), - toThriftField(mapValueField.getName(), mapValueField, requirement)); + toThriftField(mapKeyField.getName(), mapKeyField, requirement), + toThriftField(mapValueField.getName(), mapValueField, requirement)); break; case SET: final Field setElemField = field.getSetElemField(); @@ -157,9 +141,8 @@ public class ThriftSchemaConverter { } type = new EnumType(values); break; - } - return new ThriftField(name, field.getId(), requirement, type); } + return new ThriftField(name, field.getId(), requirement, type); } } http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldProjectionFilter.java ---------------------------------------------------------------------- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldProjectionFilter.java b/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldProjectionFilter.java index e038406..5498c8e 100644 --- a/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldProjectionFilter.java +++ b/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldProjectionFilter.java @@ -1,4 +1,4 @@ -/* +/* * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information @@ -6,9 +6,9 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + * * http://www.apache.org/licenses/LICENSE-2.0 - * + * * Unless required by applicable law or agreed to in writing, * software distributed under the License is distributed on an * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -18,76 +18,44 @@ */ package org.apache.parquet.thrift.projection; -import java.util.LinkedList; -import java.util.List; /** - * Filter thrift attributes using glob syntax. + * A field projection filter decides whether a thrift field (column) should + * be included when reading thrift data. It is used to implement projection push down. * - * @author Tianshuo Deng + * See {@link StrictFieldProjectionFilter} and + * {@link parquet.thrift.projection.deprecated.DeprecatedFieldProjectionFilter} */ -public class FieldProjectionFilter { - public static final String PATTERN_SEPARATOR = ";"; - List filterPatterns; +public interface FieldProjectionFilter { /** - * Class for remembering if a glob pattern has matched anything. - * If there is an invalid glob pattern that matches nothing, it should throw. + * Decide whether to keep the field (column) represented by path. + * + * @param path the path to the field (column) + * @return true to keep, false to discard (project out) */ - private static class PathGlobPatternStatus { - PathGlobPattern pattern; - boolean hasMatchingPath = false; - - PathGlobPatternStatus(String pattern) { - this.pattern = new PathGlobPattern(pattern); - } - - public boolean matches(FieldsPath path) { - if (this.pattern.matches(path.toString())) { - this.hasMatchingPath = true; - return true; - } else { - return false; - } - } - } - - public FieldProjectionFilter() { - filterPatterns = new LinkedList(); - } - - public FieldProjectionFilter(String filterDescStr) { - filterPatterns = new LinkedList(); - - if (filterDescStr == null || filterDescStr.isEmpty()) - return; + boolean keep(FieldsPath path); - String[] rawPatterns = filterDescStr.split(PATTERN_SEPARATOR); - for (String rawPattern : rawPatterns) { - filterPatterns.add(new PathGlobPatternStatus(rawPattern)); - } - } + /** + * Should throw a ThriftProjectionException if this FieldProjectionFilter has remaining patterns / columns + * that didn't match any of paths passed to {@link #keep(FieldsPath)}. + * + * Will be called once after all paths have been passed to {@link #keep(FieldsPath)}. + */ + void assertNoUnmatchedPatterns() throws ThriftProjectionException; - public boolean isMatched(FieldsPath path) { - if (filterPatterns.size() == 0) + /** + * A filter that keeps all of the columns. + */ + public static final FieldProjectionFilter ALL_COLUMNS = new FieldProjectionFilter() { + @Override + public boolean keep(FieldsPath path) { return true; - - for (int i = 0; i < filterPatterns.size(); i++) { - - if (filterPatterns.get(i).matches(path)) - return true; } - return false; - } - public List getUnMatchedPatterns() { - List unmatched = new LinkedList(); - for (PathGlobPatternStatus p : filterPatterns) { - if (!p.hasMatchingPath) { - unmatched.add(p.pattern); - } - } - return unmatched; - } + @Override + public void assertNoUnmatchedPatterns() throws ThriftProjectionException { + } + }; } http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldsPath.java ---------------------------------------------------------------------- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldsPath.java b/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldsPath.java index ac20b0d..aa5ebaf 100644 --- a/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldsPath.java +++ b/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/FieldsPath.java @@ -1,4 +1,4 @@ -/* +/* * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information @@ -6,9 +6,9 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + * * http://www.apache.org/licenses/LICENSE-2.0 - * + * * Unless required by applicable law or agreed to in writing, * software distributed under the License is distributed on an * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -24,61 +24,64 @@ import org.apache.parquet.thrift.struct.ThriftField; import org.apache.parquet.thrift.struct.ThriftType; /** - * represent field path for thrift field + * Represents a column path as a sequence of fields. * * @author Tianshuo Deng */ public class FieldsPath { - ArrayList fields = new ArrayList(); + private final ArrayList fields = new ArrayList(); - public void push(ThriftField field) { - this.fields.add(field); + public void push(ThriftField f) { + this.fields.add(f); } public ThriftField pop() { return this.fields.remove(fields.size() - 1); } - @Override - public String toString() { - StringBuffer pathStrBuffer = new StringBuffer(); + public ArrayList getFields() { + return fields; + } + + public String toDelimitedString(String delim) { + StringBuilder delimited = new StringBuilder(); for (int i = 0; i < fields.size(); i++) { ThriftField currentField = fields.get(i); if (i > 0) { ThriftField previousField = fields.get(i - 1); - if (isKeyFieldOfMap(currentField, previousField)) { - pathStrBuffer.append("key/"); + if (FieldsPath.isKeyFieldOfMap(currentField, previousField)) { + delimited.append("key"); + delimited.append(delim); continue; - } else if (isValueFieldOfMap(currentField, previousField)) { - pathStrBuffer.append("value/"); + } else if (FieldsPath.isValueFieldOfMap(currentField, previousField)) { + delimited.append("value"); + delimited.append(delim); continue; } } - - pathStrBuffer.append(currentField.getName()).append("/"); + delimited.append(currentField.getName()).append(delim); } - if (pathStrBuffer.length() == 0) { + if (delimited.length() == 0) { return ""; } else { - String pathStr = pathStrBuffer.substring(0, pathStrBuffer.length() - 1); - return pathStr; + return delimited.substring(0, delimited.length() - 1); } } - private boolean isValueFieldOfMap(ThriftField currentField, ThriftField previousField) { + @Override + public String toString() { + return toDelimitedString("."); + } + + private static boolean isValueFieldOfMap(ThriftField currentField, ThriftField previousField) { ThriftType previousType = previousField.getType(); - if(!(previousType instanceof ThriftType.MapType)) { - return false; - } - return ((ThriftType.MapType)previousType).getValue()==currentField; + return previousType instanceof ThriftType.MapType && ((ThriftType.MapType) previousType).getValue() == currentField; } - private boolean isKeyFieldOfMap(ThriftField currentField, ThriftField previousField) { + private static boolean isKeyFieldOfMap(ThriftField currentField, ThriftField previousField) { ThriftType previousType = previousField.getType(); - if(!(previousType instanceof ThriftType.MapType)) { - return false; - } - return ((ThriftType.MapType)previousType).getKey()==currentField; + return previousType instanceof ThriftType.MapType && ((ThriftType.MapType) previousType).getKey() == currentField; } + } http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/7fc79983/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/PathGlobPattern.java ---------------------------------------------------------------------- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/PathGlobPattern.java b/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/PathGlobPattern.java deleted file mode 100644 index 0d35b82..0000000 --- a/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/PathGlobPattern.java +++ /dev/null @@ -1,184 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ -package org.apache.parquet.thrift.projection; - -import org.apache.hadoop.fs.GlobPattern; - -import java.util.regex.Pattern; -import java.util.regex.PatternSyntaxException; - -/** - * Enhanced version of GlobPattern class that is defined in hadoop with extra capability of matching - * full path separated by '/', and double star matching - * @author Tianshuo Deng - */ - -public class PathGlobPattern { - private static final char BACKSLASH = '\\'; - private static final char PATH_SEPARATOR = '/'; - private Pattern compiled; - private boolean hasWildcard = false; - - /** - * Construct the glob pattern object with a glob pattern string - * - * @param globPattern the glob pattern string - */ - public PathGlobPattern(String globPattern) { - set(globPattern); - } - - /** - * Compile glob pattern string - * - * @param globPattern the glob pattern - * @return the pattern object - */ - public static Pattern compile(String globPattern) { - return new GlobPattern(globPattern).compiled(); - } - - private static void error(String message, String pattern, int pos) { - throw new PatternSyntaxException(message, pattern, pos); - } - - /** - * @return the compiled pattern - */ - public Pattern compiled() { - return compiled; - } - - /** - * Match input against the compiled glob pattern - * - * @param s input chars - * @return true for successful matches - */ - public boolean matches(CharSequence s) { - return compiled.matcher(s).matches(); - } - - /** - * Set and compile a glob pattern - * - * @param glob the glob pattern string - */ - public void set(String glob) { - StringBuilder regex = new StringBuilder(); - int setOpen = 0; - int curlyOpen = 0; - int len = glob.length(); - hasWildcard = false; - - for (int i = 0; i < len; i++) { - char c = glob.charAt(i); - - switch (c) { - case BACKSLASH: - if (++i >= len) { - error("Missing escaped character", glob, i); - } - regex.append(c).append(glob.charAt(i)); - continue; - case '.': - case '$': - case '(': - case ')': - case '|': - case '+': - // escape regex special chars that are not glob special chars - regex.append(BACKSLASH); - break; - case '*': - if (i + 1 < len && glob.charAt(i + 1) == '*') { - regex.append('.'); - i++; - break; - } - regex.append("[^" + PATH_SEPARATOR + "]"); - hasWildcard = true; - break; - case '?': - regex.append('.'); - hasWildcard = true; - continue; - case '{': // start of a group - regex.append("(?:"); // non-capturing - curlyOpen++; - hasWildcard = true; - continue; - case ',': - regex.append(curlyOpen > 0 ? '|' : c); - continue; - case '}': - if (curlyOpen > 0) { - // end of a group - curlyOpen--; - regex.append(")"); - continue; - } - break; - case '[': - if (setOpen > 0) { - error("Unclosed character class", glob, i); - } - setOpen++; - hasWildcard = true; - break; - case '^': // ^ inside [...] can be unescaped - if (setOpen == 0) { - regex.append(BACKSLASH); - } - break; - case '!': // [! needs to be translated to [^ - regex.append(setOpen > 0 && '[' == glob.charAt(i - 1) ? '^' : '!'); - continue; - case ']': - // Many set errors like [][] could not be easily detected here, - // as []], []-] and [-] are all valid POSIX glob and java regex. - // We'll just let the regex compiler do the real work. - setOpen = 0; - break; - default: - } - regex.append(c); - } - - if (setOpen > 0) { - error("Unclosed character class", glob, len); - } - if (curlyOpen > 0) { - error("Unclosed group", glob, len); - } - compiled = Pattern.compile(regex.toString()); - } - - @Override - public String toString() { - return compiled.toString(); - } - - /** - * @return true if this is a wildcard pattern (with special chars) - */ - public boolean hasWildcard() { - return hasWildcard; - } -}