hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Hanson" <eh...@microsoft.com>
Subject Re: Review Request: Change ORC tree readers to return batches of rows instead of a row
Date Tue, 23 Apr 2013 19:33:27 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10712/#review19568
-----------------------------------------------------------


I didn't get a chance to go through the whole thing -- here are just some initial comments.


ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java
<https://reviews.apache.org/r/10712/#comment40418>

    should be totalLength not totalLenght



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java
<https://reviews.apache.org/r/10712/#comment40420>

    the previous argument and the return value are always of class VectorizedRowBatch so why
not declare them to be VectorizedRowBatch instead of Object?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40461>

    put parens around expression on right side of =, for clarity



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40462>

    if noNulls is false, that means there are nulls, so you can't set noNulls back to true,
can you?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40423>

    You should never have to create a new column vector here because there should always be
one passed down into this code as part of the batch.
    
    Consider assuming there will always be a vector. The code will just fail if there isn't
one.



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40463>

    It appears you are replacing one isNull array (the one passed in) with another. We should
be writing null information into the same Boolean array over and over. Am I missing something?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40424>

    this code for IntTreeReader appears identical to the ShortTreeReader case. Can the code
be shared, e.g. by inheriting from super-class?
    
    Since this is a vector, consider calling it nextVector



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40425>

    same comment -- can the code be shared for all int types?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40474>

    for float/double, vector entries are supposed to be set to NaN for null values.



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40478>

    I think this could incorrectly decide something isRepeating if there are nulls, the way
the code is now. But using NaN for nulls should prevent this. 
    
    I think this isRepeating check is probably worth it for performance but it might not really
pay off. It might be worth a performance test sometime in the future to determine this.



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40486>

    I'd said that NULL string entries should be set to the empty string in the vector. I'm
not sure if any code depends on that though. So this is probably okay to not set null entries.



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40427>

    Please add comment to say what is the purpose of the file created



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40489>

    Should create the batch up here, not use a null batch. 
    
    Or at least have a second test that creates the batch outside and passes it down.



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40490>

    style guidelines say put a blank line in front of all comments



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40445>

    I think this should take and return VectorizedRowBatch since it would not make sense for
it to be another type -- the method name even says "Batch"
    



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40488>

    should add another test with some null values


- Eric Hanson


On April 22, 2013, 10:26 p.m., Sarvesh Sakalanaga wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10712/
> -----------------------------------------------------------
> 
> (Updated April 22, 2013, 10:26 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> The patch contains changes to ORC reader to return a batch of rows instead of a row.
A new method called nextBatch() is added to ORC reader and tree readers of ORC. Currently
only int,long,short,double,float,string and struct support batch processing.
> 
> 
> This addresses bug HIVE-4370.
>     https://issues.apache.org/jira/browse/HIVE-4370
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/vector/BytesColumnVector.java 246170d 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java fc4e53b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java 05240ce 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java d044cd8 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java 2825c64 
>   ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java PRE-CREATION

> 
> Diff: https://reviews.apache.org/r/10712/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sarvesh Sakalanaga
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message