impala-user mailing list archives

From Lars Volker
Subject Re: Adding impala daemons on servers without local HDFS storage
Date Thu, 19 Apr 2018 17:09:33 GMT
You can find documentation on the -default_query_options flag here:

Keep in mind that setting replica_preference to REMOTE will make Impala
ignore any locality when deciding where to schedule a read. Even within the
group of impalads that have local storage attached, Impala will pick a
randomized assignment, optimizing for the number of bytes read by each
node. There is currently no logic to schedule a fraction of the reads
locally and assign the rest to remote impalads (such a scenario wasn't part
of the considerations when working on the scheduler).
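
To illustrate the behaviour described above, here is a toy sketch of an assignment that ignores locality and only balances the bytes given to each node. It is not Impala's scheduler code. The function name, the host names, and the 128 MiB range size are all made up for the illustration, and the "least-loaded node, random tie-break" rule is a simplifying assumption.

```python
import random


def assign_ignoring_locality(scan_ranges, executors, seed=0):
    """Toy model: give each scan range to the executor with the fewest
    bytes assigned so far, breaking ties at random. Replica locations are
    never consulted, which is the point of the illustration."""
    rng = random.Random(seed)
    assigned_bytes = {executor: 0 for executor in executors}
    placement = {}
    for range_id, size in scan_ranges:
        least = min(assigned_bytes.values())
        candidates = [e for e in executors if assigned_bytes[e] == least]
        chosen = rng.choice(candidates)
        placement[range_id] = chosen
        assigned_bytes[chosen] += size
    return placement, assigned_bytes


# Hypothetical cluster: two hosts with a local DataNode, two without.
executors = ["storage-1", "storage-2", "compute-1", "compute-2"]
# Hypothetical scan: 16 ranges of 128 MiB each.
scan_ranges = [(f"range-{i}", 128 * 1024 * 1024) for i in range(16)]

placement, assigned_bytes = assign_ignoring_locality(scan_ranges, executors)
for executor, total in assigned_bytes.items():
    print(executor, total // (1024 * 1024), "MiB")
```

In this toy model the hosts with local storage get no preference, so they end up reading remotely about as often as the compute-only hosts do.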

On Thu, Apr 19, 2018 at 9:47 AM, Fawze Abujaber wrote:

> Thanks Tim for your quick response, as usual.
> Can you send me documentation on how to do that, or a detailed example of
> how to do it globally and per pool?
> Again, I much appreciate your readiness to help.
> On Thu, 19 Apr 2018 at 19:43 Tim Armstrong wrote:
>> We have a way to set global and per-pool defaults for query options. You
>> can set default query options via the --default_query_options startup flag
>> or, if you have resource pools set up, you can set default query option
>> values for queries submitted to each resource pool (including the default
>> pool).
>> On Tue, Apr 17, 2018 at 3:27 AM, Fawze Abujaber wrote:
>>> Thanks Tim,
>>> That means that I cannot disable this across the Impala cluster and I
>>> need to manage this at the query level, right?
>>> Is there any configuration at the cluster level to disable this?
>>> On Wed, Apr 4, 2018 at 3:44 AM, Tim Armstrong wrote:
>>>> I agree with Jim's answers.
>>>> You may run into challenges if you have some Impala daemons that have
>>>> local DataNodes and some that do not have local DataNodes. By default
>>>> Impala always chooses a daemon with a local copy of the data, which would
>>>> mean that daemons without a co-located DataNode might never get fragments
>>>> scheduled on them. We do have a knob that lets you disable locality-based
>>>> scheduling
>>>> replica_preference.html but that may be too blunt an instrument.
>>>> On Tue, Apr 3, 2018 at 11:34 AM, Jim Apple wrote:
>>>>> I think the answers are:
>>>>> 1. It depends on your workload and your network. I know some users run
>>>>> with ONLY remote reads and still get performance they are happy with.
>>>>> Existing nodes will continue to be able to short-circuit read.
>>>>> 2. This is highly workload-dependent. You want to try to avoid
>>>>> spilling, obviously, but if your spinning disk can write 200MB/s, the
>>>>> 600GB disk would take 3000 seconds, which is 50 minutes, to fill up.
>>>>> 3. I think the impalads are smart enough to not try and do a
>>>>> short-circuit read on data that isn't local.
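
The estimate in point 2 above can be reproduced with a short calculation. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB) and a sustained write rate for the whole fill. The function name is made up for illustration.

```python
def seconds_to_fill(disk_gb, write_mb_per_s):
    """Time to fill a disk at a constant write rate, assuming 1 GB = 1000 MB."""
    return disk_gb * 1000 / write_mb_per_s


seconds = seconds_to_fill(600, 200)  # 600GB disk, 200MB/s writes
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.0f} min")  # prints "3000 s = 50 min"
```

In practice the OS and other files already occupy part of the disk, so the space available for spilling would fill sooner than this upper bound suggests.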
>>>>> On Tue, Apr 3, 2018 at 10:22 AM, Fawze Abujaber wrote:
>>>>>> Hi All,
>>>>>> I have reached a point in my cluster where I don't need more storage
>>>>>> for HDFS and I need to add processing power. I'm using YARN, Spark, and
>>>>>> Impala on the normal nodes for processing.
>>>>>> My questions:
>>>>>> 1- How much will data locality impact Impala performance, as I
>>>>>> know Impala relies on data locality in its processing?
>>>>>> 2- I have an OS disk with 600GB. Will this be enough to spill
>>>>>> to disk when needed? Does it depend on other factors? The Impala
>>>>>> memory limit is 35GB.
>>>>>> 3- Should I disable *HDFS Short Circuit Read* on these nodes?
>>>>>> I will be happy to get more recommendations on this.
>>>>>> --
>>>>>> Take Care
>>>>>> Fawze Abujaber
