hive-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5324) Extend record writer and ORC reader/writer interfaces to provide statistics
Date Sat, 28 Sep 2013 09:33:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780763#comment-13780763 ]

Hudson commented on HIVE-5324:
------------------------------

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #119 (See [https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/119/])
HIVE-5324 : Extend record writer and ORC reader/writer interfaces to provide statistics (Prasanth J via Ashutosh Chauhan)
(hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1527149)
* /hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64/Base64TextOutputFormat.java
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHFileOutputFormat.java
* /hive/trunk/hcatalog/core/src/test/java/org/apache/hcatalog/cli/DummyStorageHandler.java
* /hive/trunk/hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseBaseOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/FSRecordWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinaryOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveIgnoreKeyTextOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveNullValueSequenceFileOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HivePassThroughOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HivePassThroughRecordWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSequenceFileOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroContainerOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/ReaderImpl.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/Writer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/udf/Rot13OutputFormat.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java

                
> Extend record writer and ORC reader/writer interfaces to provide statistics
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-5324
>                 URL: https://issues.apache.org/jira/browse/HIVE-5324
>             Project: Hive
>          Issue Type: New Feature
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, statistics
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5324.1.patch.txt, HIVE-5324.2.patch.txt, HIVE-5324.3.patch.txt,
>                      HIVE-5324.4.patch.txt
>
>
> The current implementation computes statistics (number of rows and raw data size) for
> every single row processed. The processOp() method in FileSinkOperator gets the raw data
> size for each row from the serde and accumulates it in a hash map while counting the
> number of rows. These accumulated statistics are then published to the metastore.
> In the case of ORC, the file format already stores enough statistics internally, and these
> can be used when publishing the stats to the metastore. This avoids duplicating the work
> done in processOp(). Getting the statistics directly from ORC is also very cheap (they
> can be read directly from the file footer).
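
The contrast described in the issue can be illustrated with a minimal sketch. It is not the code from the patch: the names StatsSketch, Stats, StatsProvidingWriter, and FooterTrackingWriter are hypothetical and only stand in for the idea that a writer which already tracks row count and raw data size can hand those totals to the caller, so the caller does not have to measure every row itself.

```java
import java.util.Arrays;
import java.util.List;

public class StatsSketch {
    /** Hypothetical holder for the two statistics named in the issue. */
    static final class Stats {
        final long rowCount;
        final long rawDataSize;

        Stats(long rowCount, long rawDataSize) {
            this.rowCount = rowCount;
            this.rawDataSize = rawDataSize;
        }
    }

    /** Hypothetical writer interface that can report statistics it already tracks. */
    interface StatsProvidingWriter {
        void write(byte[] row);

        Stats getStats();
    }

    /** Toy writer that keeps running totals, standing in for a format that stores statistics itself. */
    static final class FooterTrackingWriter implements StatsProvidingWriter {
        private long rows;
        private long bytes;

        @Override
        public void write(byte[] row) {
            rows++;
            bytes += row.length;
        }

        @Override
        public Stats getStats() {
            return new Stats(rows, bytes);
        }
    }

    /** Per-row approach: the caller measures and accumulates every row on its own. */
    static Stats collectPerRow(List<byte[]> rows) {
        long count = 0;
        long size = 0;
        for (byte[] row : rows) {
            size += row.length;
            count++;
        }
        return new Stats(count, size);
    }

    /** Writer-provided approach: the caller only writes, then asks the writer once at the end. */
    static Stats collectFromWriter(List<byte[]> rows, StatsProvidingWriter writer) {
        for (byte[] row : rows) {
            writer.write(row);
        }
        return writer.getStats();
    }

    public static void main(String[] args) {
        List<byte[]> rows = Arrays.asList(new byte[3], new byte[5]);
        Stats perRow = collectPerRow(rows);
        Stats fromWriter = collectFromWriter(rows, new FooterTrackingWriter());
        System.out.println(perRow.rowCount + " rows, " + perRow.rawDataSize + " bytes");
        System.out.println(fromWriter.rowCount + " rows, " + fromWriter.rawDataSize + " bytes");
    }
}
```

Both paths produce the same totals. In the second one the caller does no per-row bookkeeping of its own, which is the duplication the issue aims to remove.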

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
