impala-user mailing list archives

From Alexander Behm <alex.b...@cloudera.com>
Subject Re: performance issue on big table join
Date Fri, 03 Nov 2017 03:59:55 GMT
You are welcome.

You can track IMPALA-3902 to see our progress on supporting fully
multi-threaded execution.

Support for multi-threaded aggregations is already available. Certain other
queries will also work in multi-threaded mode. The big limitation is that
distributed joins and unions do not yet work (local joins for nested types
are ok).
We even enable multi-threading by default for some operations like COMPUTE
STATS on Parquet.

You can play around with multi-threaded execution using the MT_DOP query
option.
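
For example, a minimal impala-shell sketch. The table store_sales and the
column ss_store_sk are hypothetical, and 4 is an arbitrary value. Given the
limitation above, it sticks to an aggregation-only query rather than one
with a distributed join:

```sql
-- Ask for a degree of parallelism of 4 on each host (example value).
SET MT_DOP=4;

-- Aggregation-only query on a hypothetical table; no joins or unions.
SELECT ss_store_sk, COUNT(*) FROM store_sales GROUP BY ss_store_sk;

-- Go back to the default shown in the option list quoted below.
SET MT_DOP=0;
```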

On Thu, Nov 2, 2017 at 6:17 PM, 俊杰陈 <cjjnjust@gmail.com> wrote:

> Thanks, Alex, for replying again.
>
> Is there a plan to support multi-threaded joins and aggregations? Or is it
> intended to stay single-threaded to maximize query throughput?
>
>
>
> 2017-11-03 0:32 GMT+08:00 Alexander Behm <alex.behm@cloudera.com>:
>
>> See my response on the other thread you started. The probe side of a join
>> is executed in a single thread per host. Impala can run multiple
>> builds in parallel - but each build uses only a single thread.
>> A single query might not be able to max out your CPU, but most realistic
>> workloads run several queries concurrently.
>>
>> On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma <interma@outlook.com> wrote:
>>
>> > Thanks LL. Your query options look good.
>> >
>> > As Xu Cheng mentioned, I also noticed that Impala does hash joins slowly in
>> > some big data situations.
>> > Very curious about the root cause.
>> >
>> >
>> > On 02/11/2017 10:00, 俊杰陈 wrote:
>> >
>> > +user list
>> >
>> > 2017-11-02 9:57 GMT+08:00 俊杰陈 <cjjnjust@gmail.com>:
>> >
>> >
>> > Hi Mostafa
>> >
>> > Cheng already put the profile in the thread.
>> >
>> > Here is another profile for the Impala release version. You can also see the
>> > attachment.
>> >
>> >
>> > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <mmokhtar@cloudera.com>:
>> >
>> >
>> > Attaching the query profile will be most helpful to investigate this
>> > issue.
>> >
>> > If you can capture the profile from the WebUI on the coordinator node it
>> > would be great.
>> >
>> > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <cjjnjust@gmail.com> wrote:
>> >
>> >
>> > Thanks Hongxu,
>> >
>> > Here are the query options on my cluster; most of them are default values.
>> > Which item do you think might have an impact?
>> >
>> >         ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
>> >         ABORT_ON_ERROR: [0]
>> >         ALLOW_UNSUPPORTED_FORMATS: [0]
>> >         APPX_COUNT_DISTINCT: [0]
>> >         BATCH_SIZE: [0]
>> >         COMPRESSION_CODEC: [NONE]
>> >         DEBUG_ACTION: []
>> >         DEFAULT_ORDER_BY_LIMIT: [-1]
>> >         DISABLE_CACHED_READS: [0]
>> >         DISABLE_CODEGEN: [0]
>> >         DISABLE_OUTERMOST_TOPN: [0]
>> >         DISABLE_ROW_RUNTIME_FILTERING: [0]
>> >         DISABLE_STREAMING_PREAGGREGATIONS: [0]
>> >         DISABLE_UNSAFE_SPILLS: [0]
>> >         ENABLE_EXPR_REWRITES: [1]
>> >         EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
>> >         EXPLAIN_LEVEL: [1]
>> >         HBASE_CACHE_BLOCKS: [0]
>> >         HBASE_CACHING: [0]
>> >         MAX_BLOCK_MGR_MEMORY: [0]
>> >         MAX_ERRORS: [100]
>> >         MAX_IO_BUFFERS: [0]
>> >         MAX_NUM_RUNTIME_FILTERS: [10]
>> >         MAX_SCAN_RANGE_LENGTH: [0]
>> >         MEM_LIMIT: [0]
>> >         MT_DOP: [0]
>> >         NUM_NODES: [0]
>> >         NUM_SCANNER_THREADS: [0]
>> >         OPTIMIZE_PARTITION_KEY_SCANS: [0]
>> >         PARQUET_ANNOTATE_STRINGS_UTF8: [0]
>> >         PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
>> >         PARQUET_FILE_SIZE: [0]
>> >         PREFETCH_MODE: [1]
>> >         QUERY_TIMEOUT_S: [0]
>> >         REPLICA_PREFERENCE: [0]
>> >         REQUEST_POOL: []
>> >         RESERVATION_REQUEST_TIMEOUT: [0]
>> >         RM_INITIAL_MEM: [0]
>> >         RUNTIME_BLOOM_FILTER_SIZE: [1048576]
>> >         RUNTIME_FILTER_MAX_SIZE: [16777216]
>> >         RUNTIME_FILTER_MIN_SIZE: [1048576]
>> >         RUNTIME_FILTER_MODE: [2]
>> >         RUNTIME_FILTER_WAIT_TIME_MS: [0]
>> >         S3_SKIP_INSERT_STAGING: [1]
>> >         SCAN_NODE_CODEGEN_THRESHOLD: [1800000]
>> >         SCHEDULE_RANDOM_REPLICA: [0]
>> >         SCRATCH_LIMIT: [-1]
>> >         SEQ_COMPRESSION_MODE: [0]
>> >         STRICT_MODE: [0]
>> >         SUPPORT_START_OVER: [false]
>> >         SYNC_DDL: [0]
>> >         V_CPU_CORES: [0]
>> >
>> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma <interma@outlook.com>:
>> >
>> >
>> > Hi JJ
>> > Considering it only takes 3 minutes on Spark SQL, maybe there are some
>> > mistakes in the query options.
>> > Try running "set;" in impala-shell and check all query options, e.g.:
>> >     BATCH_SIZE: [0]
>> >     DISABLE_CODEGEN: [0]
>> >     RUNTIME_FILTER_MODE: GLOBAL
>> >
>> > Just a guess, thanks.
>> >
>> > On 27/10/2017 10:25, 俊杰陈 wrote:
>> > The profile file is damaged. Here is a screenshot of the exec summary:
>> > [cid:ii_j999ymep1_15f5ba563aeabb91]
>> >
>> >
>> > 2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnjust@gmail.com>:
>>
>> > Hi Devs
>> >
>> > I ran into a performance issue on a big table join. The query takes more
>> > than 3 hours on Impala and only 3 minutes on Spark SQL on the same 5-node
>> > cluster. When running the query, the left scanner and exchange node are
>> > very slow. Did I miss some key arguments?
>> >
>> > You can see the profile file in the attachment.
>> >
>> > [cid:ii_j9998pph2_15f5b92f2cf47020]
>> >
>> > --
>> > Thanks & Best Regards
>> >
>> > --
>> > Regards,
>> > Hongxu.
>>
>
>
>
> --
> Thanks & Best Regards
>
