pig-user mailing list archives

From Chandeep Singh ...@chandeep.com>
Subject Re: Using NOT NULL in a Pig FILTER statement.
Date Thu, 18 Feb 2016 23:06:51 GMT
So, I added one record to your sample so that one row matches all the conditions in your filter statement.

New input: 
[csingh]$ hadoop fs -cat test.txt
1,,2,76
1,,,76
,2,,76
1,1,2,
1,1,1,76
1,2,1,76

I modified the LOAD statement to use PigStorage with a comma delimiter.

D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);

Output:
(1,2,1,76)

So the IS NOT NULL checks seem to be working.
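
To make the expected result easy to check by hand, here is a minimal Python sketch that emulates the filter on the six sample rows. It is not Pig: it assumes an empty field loaded with an INT schema becomes null (represented as None), that a comparison against null evaluates to null, and that FILTER keeps a row only when the whole condition is true. The helper names (parse_line, eq, and3, keep) are made up for this illustration.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], ...]

# The six sample rows from test.txt shown above.
SAMPLE = """1,,2,76
1,,,76
,2,,76
1,1,2,
1,1,1,76
1,2,1,76"""


def parse_line(line: str) -> Row:
    # Assumption: an empty field under an INT schema loads as null (None here).
    return tuple(int(field) if field != "" else None for field in line.split(","))


def eq(value: Optional[int], expected: int) -> Optional[bool]:
    # Assumption: comparing null with anything yields null, not false.
    return None if value is None else value == expected


def and3(*terms: Optional[bool]) -> Optional[bool]:
    # Three-valued AND: false wins, then null, otherwise true.
    if any(term is False for term in terms):
        return False
    if any(term is None for term in terms):
        return None
    return True


def keep(row: Row, null_checks: bool = True) -> bool:
    is_reported, status_id, program_id, affinity_id = row
    terms = [eq(is_reported, 1), eq(program_id, 1), eq(status_id, 2), eq(affinity_id, 76)]
    if null_checks:
        # Mirrors the explicit "is not null" terms in the FILTER statement.
        terms = [is_reported is not None, status_id is not None] + terms
    # A row passes only when the condition is true; false and null are both dropped.
    return and3(*terms) is True


rows: List[Row] = [parse_line(line) for line in SAMPLE.splitlines()]
result = [row for row in rows if keep(row)]
result_without_checks = [row for row in rows if keep(row, null_checks=False)]

print(result)
print(result_without_checks)
```

Under these assumptions both lists contain only the row (1, 2, 1, 76), which matches the DUMP output. It also suggests the explicit null checks are redundant next to the equality tests, since a null on either side of == can never make the condition true.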

Pig logs:

grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
grunt> DUMP X;
2016-02-18 23:01:06,336 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2016-02-18 23:01:06,366 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
2016-02-18 23:01:06,480 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-02-18 23:01:10,798 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-02-18 23:01:11,345 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1454499131434_9884
2016-02-18 23:01:11,542 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1454499131434_9884
2016-02-18 23:01:11,597 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-02-18 23:01:31,393 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-02-18 23:01:36,818 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-02-18 23:01:36,875 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-02-18 23:01:36,878 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.6.0-cdh5.4.8	0.12.0-cdh5.4.8	csingh	2016-02-18 23:01:06	2016-02-18 23:01:36	FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1454499131434_9884	1	0	8	8	8	8	n/a	n/a	n/a	n/a	D,X	MAP_ONLY

Input(s):
Successfully read 6 records (418 bytes) from: 

Output(s):
Successfully stored 1 records (10 bytes) in: 

Counters:
Total records written : 1
Total bytes written : 10
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1454499131434_9884

2016-02-18 23:01:36,976 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-02-18 23:01:36,992 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-18 23:01:36,993 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,2,1,76)



> On Feb 18, 2016, at 10:13 PM, Parth Sawant <parth.sawant90@gmail.com> wrote:
> 
> Attaching a sample input. Basically 5 rows with only 4 Integer values in each. Some are NULL values.
> 
> Thanks.
> 
> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <cs@chandeep.com> wrote:
> I’m just looking for one sample record (which has NULLs) and not the entire input, so that it's easier for me to debug.
> 
> > On Feb 18, 2016, at 9:40 PM, Parth Sawant <parth.sawant90@gmail.com> wrote:
> >
> > The input is simply too large to relay to others. A simplified schema is
> > below. I only have INT columns with some null values in them. This is my
> > Pig code snippet:
> >
> > D= LOAD 'src_locatn' as
> > IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
> > AFFINITY_GROUP_ID:INT;
> >
> > X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> > null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> > (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
> >
> > Thanks
> >
> > On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <cs@chandeep.com> wrote:
> >
> >> Any chance you could share a sample record which has NULLs in it, as well
> >> as your Pig script?
> >>
> >>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <parth.sawant90@gmail.com> wrote:
> >>>
> >>> I had anticipated it would throw a similar error with this suggestion as
> >>> the last one... and it did. My fields are declared as INT, just to
> >>> re-iterate. I don't think they can be compared to regexes. Here is the
> >>> error:
> >>>
> >>> ERROR 1037:
> >>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> >>> CharArray only :(Name: Regex Type: null Uid: null)
> >>>
> >>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
> >>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
> >>> CharArray only :(Name: Regex Type: null Uid: null)
> >>>
> >>>
> >>>
> >>> Thanks.
> >>>
> >>>
> >>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <cs@chandeep.com> wrote:
> >>>
> >>>> Since you have integers in this field, can you try matching against a regular
> >>>> expression?
> >>>>
> >>>> Something like: X matches '\\d+'
> >>>>
> >>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <parth.sawant90@gmail.com> wrote:
> >>>>>
> >>>>> Hi Chandeep. I tried that already but it gave me the following error:
> >>>>>
> >>>>> ERROR 1039:
> >>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
> >>>>> types in NotEqual Operator left hand side:int right hand
> >>>>> side:chararray.
> >>>>>
> >>>>> The error makes sense because the fields I have are INT type and hence
> >>>>> cannot be compared to a chararray.
> >>>>>
> >>>>>
> >>>>> Thanks for the prompt response though.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <cs@chandeep.com> wrote:
> >>>>>
> >>>>>> Try adding != '' along with IS NOT NULL.
> >>>>>>
> >>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <parth.sawant90@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I'm trying to filter out some null fields in Pig using 'IS NOT NULL'. For
> >>>>>>> some reason the null data values persist.
> >>>>>>> For example, the stored output of the following filter still contains null
> >>>>>>> values for ABC and PQR:
> >>>>>>>
> >>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS NOT NULL);
> >>>>>>>
> >>>>>>>
> >>>>>>> Can someone help with this?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> Parth S
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
> 
> 
> <Sample_in.txt>

