crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Help Reading Orc Files
Date Wed, 03 Feb 2016 23:38:02 GMT
Not super-sure myself, but it looks like something the underlying
OrcInputFormat expects to have been set in Hive. From what I can see, it
corresponds to the hive.exec.orc.split.strategy property in HiveConf:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java

"hive.exec.orc.split.strategy", "HYBRID", new StringSet
<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/Validator.java#Validator.StringSet>("HYBRID",
"BI", "ETL"),

1014 <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#1014>

<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#>

        "This is not a user level config. BI strategy is used when the
requirement is to spend less time in split generation" +

1015 <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#1015>

<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#>

        " as opposed to query execution (split generation does not
read or cache file footers)." +

1016 <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#1016>

<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#>

        " ETL strategy is used when spending little more time in split
generation is acceptable" +

1017 <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#1017>

<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#>

        " (split generation reads and caches file footers). HYBRID
chooses between the above strategies" +

1018 <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#1018>

<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-common/1.2.0/org/apache/hadoop/hive/conf/HiveConf.java#>

        " based on heuristics."),

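For reference, a minimal sketch (not from your code, and assuming the Configuration handed to the Crunch MRPipeline is the one OrcInputFormat ends up reading) of setting that property explicitly. The class and method names (getConf(), DataQualityDriver, MRPipeline) are borrowed from the driver code quoted below; the property name and its allowed values come from the HiveConf entry above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;

    // Hypothetical sketch: pin the ORC split strategy on the Hadoop Configuration
    // backing the Crunch pipeline before any reads happen.
    Configuration crunchConf = getConf();                  // getConf() as in the quoted driver
    crunchConf.set("hive.exec.orc.split.strategy", "BI");  // "BI" is just one of the allowed values
    Pipeline pipeline = new MRPipeline(DataQualityDriver.class, crunchConf);
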

On Wed, Feb 3, 2016 at 1:59 PM, Robinson, Landon - Landon <landon.t.robinson@lowes.com> wrote:

> Crunch Gurus,
>
> Need some advice. I have experience writing Orc files in Crunch, and I can
> successfully read them in Crunch and print them out.
> But when I attempt to process them with a DoFn, I get this error. What
> should I do?
>
> Exception in thread "Thread-5" java.lang.NoSuchFieldError:
> HIVE_ORC_SPLIT_STRATEGY
>
> Here’s my code:
>
>         logger.info("Generating Hadoop Configuration...");
>         Configuration crunchConf = getConf();
>         logger.info("Establishing OrcFile Target for Final Output...");
>         OrcFileTarget target = new OrcFileTarget(new Path(outputPath));
>         //Establish Pipeline
>         logger.info("Generating Crunch Map-Reduce Pipeline...");
>         Pipeline pipeline = new MRPipeline(DataQualityDriver.class,crunchConf);
>
>         //Establish OrcFileSource (emulates a Java class) linked to HDFS Path
>         logger.info("Generating Orc File Source around given HDFS path...");
>
>         OrcFileSource<Verint1978Record> orcsource = new OrcFileSource<Verint1978Record>(new Path(inputPath), Orcs.reflects(Verint1978Record.class));
>
> //        Ingest the Orc File into a PCollection
>         logger.info("Generating PCollection of Verint1978Record from Data...");
>         PCollection<Verint1978Record> data = pipeline.read(orcsource);
> //
>
> 	for (Verint1978Record record : data.materialize()){
>     		System.out.println(record.getAllColumns());
> 	}
>
> //this all works fine until THIS point
>
> 	// can’t run these files through a DoFn or write them out without getting the above error
>
> 	// this DoFn simply reads the previous PCollection and prints it back out as a string (just to test the DoFn)
>
>         PCollection<String> newData = data.parallelDo(DataQualityDoFns.DoFn_ProduceSameRecords(), Writables.strings());
>         for (String record : newData.materialize()){
>             System.out.println(record);
>         }
>
> PipelineResult result = pipeline.done();
>
>
> DoFn (super lazy):
>
> static DoFn<Verint1978Record, String> DoFn_ProduceSameRecords(){
>     return new DoFn<Verint1978Record, String>() {
>         @Override
>         public void process(Verint1978Record input, Emitter<String> emitter) {
>
>             emitter.emit(input.getLct_nbr() + "" + input.getVid_caa_id() + "" + input.getHrs_nbr()
>                     + "" + input.getMte_nbr() + "" + input.getAcl_idc() + "" + input.getSec_dur()
>                     + "" + input.getSec_to_pcs() + "" + input.getSec_pcd() + "" + input.getUse_for_rpr_idc()
>                     + "" + input.getGrp_cnt() + "" + input.getSng_cnt() + "" + input.getUpd_dt()
>                     + "" + input.getUpd_id() + "" + input.getCal_dt());
>
>         }
>     };
> }
>
> ---------------------------------------------------------------------------
> Landon Robinson
> Big Data & Hadoop Engineer
> IT Business Intelligence, Lowe’s Companies Inc.
> ---------------------------------------------------------------------------
>
