hive-dev mailing list archives

From "Hive QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
Date Wed, 10 Sep 2014 20:17:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129029#comment-14129029 ]

Hive QA commented on HIVE-8038:
-------------------------------



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12667782/HIVE-8038.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 6195 tests executed
*Failed tests:*
{noformat}
org.apache.hive.hcatalog.pig.TestOrcHCatLoader.testReadDataPrimitiveTypes
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/728/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/728/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-728/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12667782

> Decouple ORC files split calculation logic from Filesystem's get file location implementation
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8038
>                 URL: https://issues.apache.org/jira/browse/HIVE-8038
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.13.1
>            Reporter: Pankit Thapar
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8038.patch
>
>
> What is the Current Logic
> ======================
> 1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an array of BlockLocation.
> 2. In SplitGenerator.createSplit(), check whether the split spans one block or multiple blocks.
> 3. If the split spans just one block, use the array index (index = offset / blockSize) to get the host for the corresponding BlockLocation.
> 4. If the split spans multiple blocks, get all hosts that hold at least 80% of the maximum amount of split data hosted by any single host.
> 5. Add the split to the list of splits.
>
> Issue with Current Logic
> =====================
> The logic depends on how the FileSystem API computes block locations. getFileBlockLocations() returns an array, and indexing directly into that array is only correct if the FileSystem makes all blocks the same size.
>
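To make the issue concrete, here is a minimal Java sketch of the index-based lookup described above. The `Block` class is a hypothetical stand-in for Hadoop's BlockLocation (start offset, length, one host), and `blockByIndex` is an illustrative name rather than the actual Hive method.

```java
import java.util.Arrays;
import java.util.List;

public class ArrayIndexLookup {
    /** Hypothetical stand-in for BlockLocation: start offset, length, one host. */
    static final class Block {
        final long offset;
        final long length;
        final String host;

        Block(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    /** Index-based lookup: correct only if every block is exactly blockSize bytes. */
    static Block blockByIndex(List<Block> blocks, long offset, long blockSize) {
        return blocks.get((int) (offset / blockSize));
    }

    public static void main(String[] args) {
        // Blocks of 100, 50, and 100 bytes: the middle block is short.
        List<Block> uneven = Arrays.asList(
                new Block(0, 100, "host1"),
                new Block(100, 50, "host2"),
                new Block(150, 100, "host3"));

        // Offset 160 lies in the third block, but 160 / 100 == 1 selects the second.
        Block picked = blockByIndex(uneven, 160, 100);
        System.out.println(picked.host); // host2, although the data lives on host3
    }
}
```

With equally sized blocks the index arithmetic is exact; as soon as one block is shorter than blockSize, every lookup past it can land on the wrong entry.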
> What is the Fix
> =============
> 1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an array of BlockLocation.
> 1b. Convert the array into a TreeMap<offset, BlockLocation> and return it through getLocationsWithOffSet().
> 2. In SplitGenerator.createSplit(), check whether the split spans one block or multiple blocks.
> 3. If the split spans just one block, use TreeMap.floorEntry(key) to get the entry with the greatest key less than or equal to the split's offset, and take the corresponding host.
> 4a. If the split spans multiple blocks, get a submap containing all BlockLocation entries from offset to offset + length.
> 4b. Get all hosts that hold at least 80% of the maximum amount of split data hosted by any single host.
> 5. Add the split to the list of splits.
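The map-based lookup in steps 1b, 3, and 4a can be sketched with java.util.TreeMap as follows. This is an illustration under assumptions, not the patch itself: `Block` is a hypothetical stand-in for BlockLocation, and the method names `byOffset`, `blockAt`, and `blocksInRange` are invented for the example.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class OffsetMapLookup {
    /** Hypothetical stand-in for BlockLocation: start offset, length, one host. */
    static final class Block {
        final long offset;
        final long length;
        final String host;

        Block(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    /** Step 1b: index the blocks by start offset (done once per file). */
    static TreeMap<Long, Block> byOffset(Block[] blocks) {
        TreeMap<Long, Block> map = new TreeMap<>();
        for (Block b : blocks) {
            map.put(b.offset, b);
        }
        return map;
    }

    /** Step 3: the block with the greatest start offset <= the given offset. */
    static Block blockAt(TreeMap<Long, Block> map, long offset) {
        Map.Entry<Long, Block> entry = map.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    /** Step 4a: all blocks that start before offset + length, beginning with the one containing offset. */
    static NavigableMap<Long, Block> blocksInRange(TreeMap<Long, Block> map, long offset, long length) {
        Long first = map.floorKey(offset);
        if (first == null) {
            first = offset; // no block starts at or before offset
        }
        return map.subMap(first, true, offset + length, false);
    }
}
```

The lookup starts from floorKey(offset) rather than offset itself because the block containing the start of the split usually begins before the split does.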
> What are the major changes in logic
> ==============================
> 1. Store BlockLocations in a Map instead of an array.
> 2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
> 3. The one-block case is checked by "if(offset + length <= start.getOffset() + start.getLength())" instead of "if((offset % blockSize) + length <= blockSize)".
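The two one-block conditions quoted in item 3 can be compared side by side. The sketch below copies both expressions into small helper methods; the method names and the block layout in main are assumptions made for the example.

```java
public class SingleBlockCheck {
    /** Previous check: assumes blocks start at multiples of blockSize. */
    static boolean oldCheck(long offset, long length, long blockSize) {
        return (offset % blockSize) + length <= blockSize;
    }

    /** New check: compares the split's end with the end of the block containing its start. */
    static boolean newCheck(long offset, long length, long blockStart, long blockLength) {
        return offset + length <= blockStart + blockLength;
    }

    public static void main(String[] args) {
        // Assumed layout: blocks [0,100), [100,150), [150,250); nominal blockSize 100.
        // A split at offset 110 with length 50 ends at 160, past the end of block [100,150).
        System.out.println(oldCheck(110, 50, 100));     // true: wrongly treated as one block
        System.out.println(newCheck(110, 50, 100, 50)); // false: the split crosses into the next block
    }
}
```

When every block really is blockSize bytes long, the two checks agree; they diverge only when block lengths vary.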
> What is the effect on Complexity (Big O)
> =================================
> 1. We add a loop of n insertions to build a TreeMap from the array (O(n log n) in total, since each TreeMap insertion is O(log n)), but it is a one-time cost and is not repeated for each split.
> 2. In the one-block case, we get the block in O(log n) worst case, which was O(1) before.
> 3. Getting the submap is O(log n).
> 4. In the multiple-block case, building the list of hosts is O(m), which was O(n), with m < n: previously we iterated over all the block locations, but now we iterate only over the blocks that belong to the range of offsets we need.
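A rough sketch of the multiple-block case shows why only the m blocks inside the submap are visited. The byte accounting here (crediting each host with the overlap between the split and each block it stores) is an assumption made for illustration and may differ from what the patch does; `Block` and `hostsForSplit` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class HostSelection {
    /** Hypothetical stand-in for BlockLocation: start offset, length, replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Hosts holding at least 80% of the most split bytes held by any single host. */
    static List<String> hostsForSplit(TreeMap<Long, Block> byOffset, long offset, long length) {
        long end = offset + length;
        Long first = byOffset.floorKey(offset);
        if (first == null) {
            first = offset;
        }
        // Only blocks starting in [first, end) are visited: O(m), not O(n).
        NavigableMap<Long, Block> range = byOffset.subMap(first, true, end, false);

        Map<String, Long> bytesPerHost = new HashMap<>();
        long max = 0;
        for (Block b : range.values()) {
            long overlap = Math.min(end, b.offset + b.length) - Math.max(offset, b.offset);
            if (overlap <= 0) {
                continue;
            }
            for (String host : b.hosts) {
                long total = bytesPerHost.merge(host, overlap, Long::sum);
                max = Math.max(max, total);
            }
        }

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
            // Integer form of: bytes >= 0.8 * max
            if (e.getValue() * 10 >= max * 8) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result); // deterministic order for the example
        return result;
    }
}
```

For a file with blocks on hosts {a,b}, {a,c}, {b,c}, {d} at offsets 0, 100, 200, 300 (100 bytes each), a split covering bytes 50 to 250 credits a and c with 150 bytes and b with 100, so only a and c pass the 80% threshold, and the block on d is never touched.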
> What are the benefits of the change
> ==============================
> 1. With this fix, we do not depend on the blockLocations returned by FileSystem to figure out the block corresponding to the offset and blockSize.
> 2. Also, block lengths need not be the same for all blocks on all FileSystems.
> 3. Previously we used blockSize for the one-block case and block.length for the multiple-block case; that is no longer so. We now determine the block from its actual offset and length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
