pig-dev mailing list archives

From "Russell Jurney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3015) Rewrite of AvroStorage
Date Mon, 18 Feb 2013 03:37:17 GMT

    [ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580386#comment-13580386 ]

Russell Jurney commented on PIG-3015:
-------------------------------------

Loading data without going to Piggybank is amazing. However, TrevniStorage fails to store
my emails. The schema and the error are below.

You can reproduce this data with your own Gmail emails (you only need a few) by following these instructions:
https://github.com/rjurney/Agile_Data_Code/tree/master/ch03

grunt> describe emails
emails: {message_id: chararray,thread_id: chararray,in_reply_to: chararray,subject: chararray,body: chararray,date: chararray,from: (real_name: chararray,address: chararray),tos: {to: (real_name: chararray,address: chararray)},ccs: {cc: (real_name: chararray,address: chararray)},bccs: {bcc: (real_name: chararray,address: chararray)},reply_tos: {reply_to: (real_name: chararray,address: chararray)}}

Error:

2013-02-17 18:03:31,574 [Thread-6] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-02-17 18:03:31,680 [Thread-6] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-02-17 18:03:31,680 [Thread-6] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-02-17 18:03:31,699 [Thread-6] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2013-02-17 18:03:31,713 [Thread-6] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map - Aliases being processed per job phase (AliasName[line,offset]): M: emails[2,9],null[-1,-1],null[-1,-1],token_records[-1,-1],doc_word_totals[5,18],1-84[5,27] C: doc_word_totals[5,18],1-84[5,27] R: doc_word_totals[5,18]
2013-02-17 18:03:31,748 [Thread-6] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.LuceneTokenize)[bag] - scope-19 Operator Key: scope-19) children: null at []]: java.lang.NullPointerException
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:370)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNext(POPreCombinerLocalRearrange.java:126)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:242)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:263)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NullPointerException
	at org.apache.lucene.analysis.standard.std31.StandardTokenizerImpl31.zzRefill(StandardTokenizerImpl31.java:795)
	at org.apache.lucene.analysis.standard.std31.StandardTokenizerImpl31.getNextToken(StandardTokenizerImpl31.java:1002)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
	at org.apache.pig.builtin.LuceneTokenize.exec(LuceneTokenize.java:70)
	at org.apache.pig.builtin.LuceneTokenize.exec(LuceneTokenize.java:51)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:380)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:341)
	... 18 more
2013-02-17 18:03:31,811 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
2013-02-17 18:03:31,811 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases 1-84,doc_word_totals,emails,token_records
2013-02-17 18:03:31,811 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: emails[2,9],null[-1,-1],null[-1,-1],token_records[-1,-1],doc_word_totals[5,18],1-84[5,27] C: doc_word_totals[5,18],1-84[5,27] R: doc_word_totals[5,18]
2013-02-17 18:03:31,813 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-02-17 18:03:31,817 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-02-17 18:03:31,817 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0001 has failed! Stop running all dependent jobs
2013-02-17 18:03:31,817 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-02-17 18:03:31,818 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-02-17 18:03:31,818 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2013-02-17 18:03:31,819 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
1.0.3	0.12.0-SNAPSHOT	rjurney	2013-02-17 18:03:31	2013-02-17 18:03:31	HASH_JOIN,GROUP_BY

Failed!

Failed Jobs:
JobId	Alias	Feature	Message	Outputs
job_local_0001	1-84,doc_word_totals,emails,token_records	MULTI_QUERY,COMBINER	Message: Job failed! Error - NA	

Input(s):
Failed to read data from "/me/Data/test_mbox"

Output(s):

Job DAG:
job_local_0001	->	null,
null	->	null,null,
null	->	null,
null	->	null,
null	->	null,
null


2013-02-17 18:03:31,819 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!

                
> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>            Assignee: Joseph Adler
>         Attachments: bad.avro, good.avro, PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, PIG-3015-5.patch, PIG-3015-6.patch, PIG-3015-7.patch, PIG-3015-doc.patch, TestInput.java, Test.java
>
>
> The current AvroStorage implementation has a lot of issues: it requires old versions
> of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet
> peeve of mine is that old versions of Avro don't support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation
> is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled
> me to implement support for Trevni (as TrevniStorage).
> I'm opening this ticket to facilitate discussion while I figure out the best way to contribute
> the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
