hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From j.prasant...@gmail.com
Subject Re: Review Request 43626: HIVE-12988
Date Tue, 23 Feb 2016 07:35:38 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43626/#review120265
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 1499)
<https://reviews.apache.org/r/43626/#comment181697>

    nit: I don't see newFiles modified outside. Can this be made as an return argument from
Hive.copyFiles()? Instead of InOut argument.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 1746)
<https://reviews.apache.org/r/43626/#comment181698>

    nit: same here. newFiles not modified anywhere other than copyFiles.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2592)
<https://reviews.apache.org/r/43626/#comment181695>

    I think the thread pool size should be bounded. Cached thread pool creates threads on-demand.
On sudden burst of events this could end up creating too many threads that increases context
switch times.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2593)
<https://reviews.apache.org/r/43626/#comment181699>

    Can you also name the thread pool with setNameFormat()?



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2643)
<https://reviews.apache.org/r/43626/#comment181701>

    I think this should happen inside the callable and not in notification listener. The cost
of creating path is cheaper when compared to moving or copying. 
    
    The default  method of addCallback() handles notification in the same thread that sent
the notification. As per documentation, "Note: If the callback is slow or heavyweight, consider
supplying an executor. If you do not supply an executor, addCallback will use a direct executor,
which carries some caveats for heavier operations. For example, the callback may run on an
unpredictable or undesirable thread".



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2645)
<https://reviews.apache.org/r/43626/#comment181706>

    I can see 2 expensive operations here
    1) listStatus
    2) mvFile
    
    I think both should be handled by the threadpool.
    Typical sequence of operations would be 
     - For each directory, submit to threadpool to compute listStatus. This returns List<Future>
     - Iterate the futures, invoke future.get() to get List<Path> for each directories.
If get() throws exception then shutdown the executor and log exception.
     - Iterator the List<Path>, submit to threadpool to perform mvFile. This returns
List<Future>
     - Iterate the futures, invoke futures.get() to get void/boolean. Same way of exception
handling here as well. 
     
     I think addCallback is good approach but it adds complications when multi-level futures
are involved like above and waiting can be tricking like the 1 hour waiting below.


- Prasanth_J


On Feb. 16, 2016, 9:20 p.m., Ashutosh Chauhan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43626/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2016, 9:20 p.m.)
> 
> 
> Review request for hive and Prasanth_J.
> 
> 
> Bugs: HIVE-12988
>     https://issues.apache.org/jira/browse/HIVE-12988
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Improve dynamic partition loading IV Parallelize copyFiles()
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a92c002 
> 
> Diff: https://reviews.apache.org/r/43626/diff/
> 
> 
> Testing
> -------
> 
> Regression suite
> 
> 
> Thanks,
> 
> Ashutosh Chauhan
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message