incubator-hcatalog-user mailing list archives

From: Timothy Potter <thelabd...@gmail.com>
Subject: Re: Pig partition filter using operator other than ==
Date: Tue, 20 Nov 2012 14:35:06 GMT
Thanks for your help, Travis and Aniket. I ended up applying the patch from
HIVE-2084 and also making the FCOMMENT to COMMENT change to package.jdo
that Travis suggested. It seems to be working now, except that I've hit a
new problem: if my Pig filter contains only a single clause that filters on
the partition field, the filter fails to get "pushed" into the load, i.e.

signals_for_day = filter signals by day >= '2012-10-31_2000';
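
For reference, signals is loaded the same way as in my original script
(quoted at the bottom of this thread):

signals = load 'signals' using org.apache.hcatalog.pig.HCatLoader();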
This fails at the following place in the 0.4.0 code (note: the line numbers
may be slightly off, as I've added some debug statements here and there):

at org.apache.pig.newplan.PColFilterExtractor.logInternalErrorAndSetFlag(PColFilterExtractor.java:482)
at org.apache.pig.newplan.PColFilterExtractor.getExpression(PColFilterExtractor.java:434)
at org.apache.pig.newplan.PColFilterExtractor.getExpression(PColFilterExtractor.java:473)
at org.apache.pig.newplan.PColFilterExtractor.getExpression(PColFilterExtractor.java:461)
at org.apache.pig.newplan.PColFilterExtractor.getExpression(PColFilterExtractor.java:473)
at org.apache.pig.newplan.PColFilterExtractor.getExpression(PColFilterExtractor.java:449)
at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:115)
at org.apache.pig.newplan.logical.rules.PartitionFilterOptimizer$PartitionFilterPushDownTransformer.transform(PartitionFilterOptimizer.java:160)
at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)


However, if I add another clause to the filter, then it works fine, i.e.

signals_for_day = filter signals by (day >= '2012-10-31_2000' AND service IS NOT NULL);

Note that service is not a partition field. This filter works fine, and the
partition filter does appear to get pushed into the load, judging by the
number of input paths reported by the MR job.

I'm using Pig 0.10 on CDH4. Seems like a Pig bug to me ...
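
In case it helps anyone else hitting this, the workaround is simply to make
sure the filter has a second clause on some non-partition column. A minimal
sketch (service stands in for any non-partition column in your table):

signals = load 'signals' using org.apache.hcatalog.pig.HCatLoader();
-- day is the partition column; the extra clause on the non-partition
-- column is what keeps the partition filter getting pushed into the load
signals_for_day = filter signals by (day >= '2012-10-31_2000' AND service IS NOT NULL);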

Cheers,
Tim


On Mon, Nov 19, 2012 at 5:17 PM, Travis Crawford
<traviscrawford@gmail.com> wrote:

> A quick fix is to force these jar versions in your build:
>
> datanucleus-core 2.2.5
> datanucleus-rdbms 2.2.4
>
> If you do a regular "ant package" and then update these jars in the
> dist dir, it works fine. Note that if you're using the metastore thrift
> service, you only need to do this on the server side, since clients do
> not use datanucleus at all.
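>
> Concretely, something like this (the copy destination is a placeholder;
> point it at wherever your dist dir keeps the datanucleus jars):
>
> ant package
> # overwrite the bundled datanucleus jars with the pinned versions
> cp datanucleus-core-2.2.5.jar <dist-dir>/lib/
> cp datanucleus-rdbms-2.2.4.jar <dist-dir>/lib/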
>
> I agree this is a major issue; I just added fix version 0.5 to
> https://issues.apache.org/jira/browse/HCATALOG-209 so we don't forget
> about this in the next release.
>
> --travis
>
>
> On Mon, Nov 19, 2012 at 3:50 PM, Aniket Mokashi <aniket486@gmail.com>
> wrote:
> > There is an easy way to fix this. You need to re-compile the fix
> > suggested in HIVE-2609 and jar it up into the datanucleus-rdbms jar
> > along with the other class files.
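> >
> > For example, something along these lines (the jar version and the class
> > path are placeholders, not the actual ones from the patch):
> >
> > # rebuild the patched source, then splice the class files into the
> > # existing datanucleus-rdbms jar
> > jar uf datanucleus-rdbms-<version>.jar org/datanucleus/<patched-class>.class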
> >
> > ~Aniket
> >
> >
> > On Mon, Nov 19, 2012 at 12:51 PM, Timothy Potter <thelabdude@gmail.com>
> > wrote:
> >>
> >> Ok, never mind - looks like a known issue with Hive's datanucleus
> >> dependency: https://issues.apache.org/jira/browse/PIG-2339
> >>
> >> Will move to Postgres!
> >>
> >>
> >> On Mon, Nov 19, 2012 at 1:30 PM, Timothy Potter <thelabdude@gmail.com>
> >> wrote:
> >>>
> >>> More on this ... I finally tracked down the Hive server log and am
> >>> seeing this:
> >>>
> >>> 2012-11-19 19:42:53,700 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
> >>> java.lang.NullPointerException
> >>> at org.datanucleus.store.mapped.mapping.MappingHelper.getMappingIndices(MappingHelper.java:35)
> >>> at org.datanucleus.store.mapped.expression.StatementText.applyParametersToStatement(StatementText.java:194)
> >>> at org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getPreparedStatementForQuery(RDBMSQueryUtils.java:233)
> >>> at org.datanucleus.store.rdbms.query.legacy.SQLEvaluator.evaluate(SQLEvaluator.java:115)
> >>> at org.datanucleus.store.rdbms.query.legacy.JDOQLQuery.performExecute(JDOQLQuery.java:288)
> >>> at org.datanucleus.store.query.Query.executeQuery(Query.java:1657)
> >>> at org.datanucleus.store.rdbms.query.legacy.JDOQLQuery.executeQuery(JDOQLQuery.java:245)
> >>> at org.datanucleus.store.query.Query.executeWithMap(Query.java:1526)
> >>> at org.datanucleus.jdo.JDOQuery.executeWithMap(JDOQuery.java:334)
> >>> at org.apache.hadoop.hive.metastore.ObjectStore.listMPartitionsByFilter(ObjectStore.java:1711)
> >>> at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:1581)
> >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>> at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:111)
> >>> at $Proxy4.getPartitionsByFilter(Unknown Source)
> >>> at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:2466)
> >>> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_by_filter.getResult(ThriftHiveMetastore.java:5817)
> >>> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_partitions_by_filter.getResult(ThriftHiveMetastore.java:5805)
> >>> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
> >>> at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:115)
> >>> at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:112)
> >>> at java.security.AccessController.doPrivileged(Native Method)
> >>> at javax.security.auth.Subject.doAs(Subject.java:396)
> >>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
> >>> at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:520)
> >>> at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:123)
> >>> at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
> >>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>> at java.lang.Thread.run(Thread.java:662)
> >>>
> >>>
> >>> On Mon, Nov 19, 2012 at 12:53 PM, Timothy Potter <thelabdude@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm using HCatalog 0.4.0 with Pig 0.10 and am not having success using
> >>>> an operator other than (==) with my partition field.
> >>>>
> >>>> For example, the following works (day is my partition field):
> >>>>
> >>>> signals = load 'signals' using org.apache.hcatalog.pig.HCatLoader();
> >>>>
> >>>> signals_for_day = filter signals by (day == '2012-10-30_1200' AND
> >>>> service IS NOT NULL);
> >>>>
> >>>> samp1 = sample signals_for_day 0.01;
> >>>>
> >>>> dump samp1;
> >>>>
> >>>>
> >>>> But if I change my filter to:
> >>>>
> >>>> signals_for_day = filter signals by (day >= '2012-10-30_1200' AND service IS NOT NULL);
> >>>>
> >>>> Then I get the following error:
> >>>>
> >>>> Caused by: java.io.IOException: org.apache.thrift.transport.TTransportException
> >>>> at org.apache.hcatalog.mapreduce.HCatInputFormat.setInput(HCatInputFormat.java:42)
> >>>> at org.apache.hcatalog.pig.HCatLoader.setLocation(HCatLoader.java:90)
> >>>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:380)
> >>>> ... 19 more
> >>>> Caused by: org.apache.thrift.transport.TTransportException
> >>>> at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> >>>> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >>>> at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> >>>> at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> >>>> at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> >>>> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> >>>> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_filter(ThriftHiveMetastore.java:1511)
> >>>> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_filter(ThriftHiveMetastore.java:1495)
> >>>> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:691)
> >>>> at org.apache.hcatalog.mapreduce.InitializeInput.getSerializedHcatKeyJobInfo(InitializeInput.java:98)
> >>>> at org.apache.hcatalog.mapreduce.InitializeInput.setInput(InitializeInput.java:73)
> >>>> at org.apache.hcatalog.mapreduce.HCatInputFormat.setInput(HCatInputFormat.java:40)
> >>>> ... 21 more
> >>>>
> >>>> I can start debugging, but first I'd like to know: is HCatalog supposed
> >>>> to support this type of filtering on partition fields?
> >>>>
> >>>> Thanks.
> >>>> Tim
> >>>>
> >>>
> >>
> >
> >
> >
> > --
> > "...:::Aniket:::... Quetzalco@tl"
>
