hive-issues mailing list archives

From "Hive QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-13985) ORC improvements for reducing the file system calls in task side
Date Tue, 21 Jun 2016 01:18:57 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340842#comment-15340842 ]

Hive QA commented on HIVE-13985:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12811988/HIVE-13985.6.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 22 failed/errored test(s), 10234 tests executed
*Failed tests:*
{noformat}
TestMiniTezCliDriver-vector_interval_2.q-dynamic_partition_pruning.q-vectorization_10.q-and-12-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constantPropagateForSubQuery
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_13
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.org.apache.hadoop.hive.cli.TestMiniTezCliDriver
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_subq_in
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cte_4
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_all_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_disable_merge_for_bucketing
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_empty_join
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_groupby1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_groupby3
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_into2
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_orc_merge7
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_partition_column_names_with_leading_and_trailing_spaces
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_schema_evol_orc_vec_mapwork_part_all_primitive
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_groupby_reduce
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_inner_join
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_struct_in
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_case
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInputFormatEndToEnd
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/196/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/196/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-196/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 22 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12811988 - PreCommit-HIVE-MASTER-Build

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.3.0, 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>             Fix For: 1.3.0, 2.1.0, 2.2.0
>
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch,
HIVE-13985-branch-1.patch, HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch,
HIVE-13985.3.patch, HIVE-13985.4.patch, HIVE-13985.5.patch, HIVE-13985.6.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during split generation.
Similarly, this jira will fix issues with additional file system invocations on the task side.
To avoid reading footers on the task side, users can set hive.orc.splits.include.file.footer
to true, which serializes the ORC footers into the splits. However, this also serializes
unneeded information, such as column statistics and other metadata, that is not required
for reading an ORC split on the task side. We can reduce the payload of the ORC splits by
serializing only the minimum required information (stripe information, types, compression
details). The smaller payload can potentially avoid OOMs in the application master (AM)
during split generation. This jira also addresses other issues concerning the AM cache.
The local cache used by the AM is a soft reference cache, which can introduce unpredictability
across multiple runs of the same query. We can cache the serialized footer in the local cache
and also use a strong reference cache, which should avoid memory pressure and give better
predictability.
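
The difference between the two cache styles can be illustrated with a minimal sketch. It is not Hive's implementation: the class name FooterCache, the String path key, and the entry-count limit are all assumptions made for illustration. The cache holds serialized footers through ordinary (strong) references and evicts the least recently used entry once a fixed limit is reached, so eviction depends only on the access pattern and not on garbage-collector pressure, as it would with soft references.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical strong-reference LRU cache of serialized footers; not Hive's actual class. */
final class FooterCache {
    private final int maxEntries;
    private final Map<String, byte[]> entries;

    FooterCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true keeps the least recently used entry first in iteration order.
        this.entries = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict deterministically once the configured limit is exceeded.
                return size() > FooterCache.this.maxEntries;
            }
        };
    }

    synchronized void put(String path, byte[] serializedFooter) {
        entries.put(path, serializedFooter);
    }

    /** Returns the cached bytes, or null if the path was never cached or has been evicted. */
    synchronized byte[] get(String path) {
        return entries.get(path);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

With a limit of two entries, caching a third footer evicts whichever of the first two was used least recently, and the same sequence of calls always evicts the same entry.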
> One other improvement: when hive.orc.splits.include.file.footer is set
to false, the task side makes one additional file system call to find the size of the
file. If we serialize the file length in the ORC split, this call can be avoided.
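
The idea of carrying the file length inside the split can be sketched as follows. This is only an illustration under stated assumptions: SplitPayload, its field names, and its wire layout are hypothetical and do not reflect the actual OrcSplit format or the patch attached to this issue. The point is that a value already known at split generation time costs eight bytes to ship, after which the task can read it from the split instead of asking the file system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical split descriptor; names and layout are illustrative, not Hive's OrcSplit. */
final class SplitPayload {
    final String path;
    final long start;      // byte offset where this split begins
    final long length;     // number of bytes covered by this split
    final long fileLength; // total file size, recorded once during split generation

    SplitPayload(String path, long start, long length, long fileLength) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.fileLength = fileLength;
    }

    /** Serializes only the fields a task needs; no column statistics or other metadata. */
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(path);
            out.writeLong(start);
            out.writeLong(length);
            out.writeLong(fileLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Task side: recovers the file length from the payload, with no file system call. */
    static SplitPayload fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String path = in.readUTF();
            long start = in.readLong();
            long length = in.readLong();
            long fileLength = in.readLong();
            return new SplitPayload(path, start, length, fileLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A task that deserializes such a payload can locate the end of the file from fileLength directly, which is the extra lookup the description proposes to eliminate.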



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
