impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From yu feng <olaptes...@gmail.com>
Subject Re: big issue on retrieving 400MB data
Date Sat, 29 Apr 2017 05:50:48 GMT
We has faced this problem before with JDBC, this query speed too much time
on fetching result, and the query is running. which will occupy slot of
request pool(when we configured admission control), we have to cancel the
query after a Certain time time such as 5 minutes.

In our solution, JDBC fetch result with default fetchSize = 50, we try to
set fetchSize = 10000(by call setFetchSize()) and the acceleration is
obvious(in our test, result size is 2000+W and this can speed up from more
than 1 hour to 500 seconds.).

2017-04-29 0:39 GMT+08:00 Mostafa Mokhtar <mmokhtar@cloudera.com>:

> Hi Jeszy
>
>
> One option for large data extracts is to create a text table in HDFS then
> use "hdfs dfs -copyToLocal
> <https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/
> FileSystemShell.html#copyToLocal>"
> to extract the text file(s), for large data dumps storing the data in HDFS
> then extracting it is likely to be faster than impala-shell or JDCB.
>
> When using impala-shell it is best practice to use -B to speedup fetching
> the results.
>
> Below are the results for two experiments, storing the results in an HDFS
> text table and writing the out of HDFS file took 14 seconds compared to 40
> seconds when using just impala-shell.
>
> Lineitem table was used, output data size is 390MB which is comparable to
> your use case.
>
> *Write to HDFS then extract*
>
> time (impala-shell -q "drop table if exists tpch_300_parquet.lineitem_3m;
> create table tpch_300_parquet.lineitem_3m stored as textfile as select *
> from tpch_300_parquet.lineitem limit 3000000;" ; hdfs dfs -copyToLocal
> hdfs://
> vd1315.domain.com:8020/user/hive/warehouse/tpch_300_parquet.db/lineitem_3m
> .; du -sh lineitem_3m )
> Starting Impala Shell without Kerberos authentication
> Connected to server1.domain.com:21000
> Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
> dc25acd558e9388290503e0e1100baac2b3c800b)
> Query: drop table if exists tpch_300_parquet.lineitem_3m
> Query: create table tpch_300_parquet.lineitem_3m stored as textfile as
> select * from tpch_300_parquet.lineitem limit 3000000
> Query submitted at: 2017-04-28 09:26:23 (Coordinator:
> http://server1.domain.com:25000)
> Query progress can be monitored at:
> http://server1.domain.com:25000/query_plan?query_id=a24f86518c3de1ce:
> 2f35ef9b00000000
> +-------------------------+
> | summary                 |
> +-------------------------+
> | Inserted 3000000 row(s) |
> +-------------------------+
> Fetched 1 row(s) in 11.23s
> 390M lineitem_3m
>
> *real 0m14.392s*
> user 0m3.503s
> sys 0m0.655s
>
>
> *Extract results using impala-shell*
>
> [mmokhtar@server1 ~]$ time (impala-shell -B -q "select * From
> tpch_300_parquet.lineitem_3m " > lineitem_3m_impala_shell.txt; du -sh
> lineitem_3m_impala_shell.txt; )
> Starting Impala Shell without Kerberos authentication
> Connected to server1.domain.com:21000
> Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
> dc25acd558e9388290503e0e1100baac2b3c800b)
> Query: select * From tpch_300_parquet.lineitem_3m
> Query submitted at: 2017-04-28 09:27:39 (Coordinator:
> http://server1.domain.com:25000)
> Query progress can be monitored at:
> http://server1.domain.com:25000/query_plan?query_id=ee421b6d4a2226d3:
> 8acbb75f00000000
> Fetched 3000000 row(s) in 38.88s
> 390M lineitem_3m_impala_shell.txt
>
> *real 0m39.152s*
> user 0m26.012s
> sys 0m0.668s
>
>
> On Fri, Apr 28, 2017 at 7:26 AM, Jeszy <jeszyb@gmail.com> wrote:
>
> > Hey,
> >
> > It looks like all the time is spent waiting for the client to fetch the
> > results:
> >      - ClientFetchWaitTimer: 17m31s
> >
> > Try doing:
> > impala-shell -B -q '<query>'
> >
> > HTH
> >
> > 2017-04-28 14:51 GMT+02:00 吴朱华 <ikewu83@gmail.com>:
> > > Maybe I just paste some main thing on mail , and congratulation on IPO
> > > thing.
> > >
> > > Unregister query: 17m42s (17m42s)
> > >
> > > Fetched 317246 row(s) in 1062.84s
> > > Query Runtime Profile:
> > > Query (id=8149e2439f43b15a:f08e570d7fbf1085):
> > >   Summary:
> > >     Session ID: 35436d1112b79287:9045c79c795858a5
> > >     Session Type: BEESWAX
> > >     Start Time: 2017-04-28 11:50:00.292615000
> > >     End Time: 2017-04-28 12:07:43.133484000
> > >     Query Type: QUERY
> > >     Query State: FINISHED
> > >     Query Status: OK
> > >     Impala Version: impalad version 2.5.0-cdh5-INTERNAL RELEASE (build
> > > 43880282edc04c03c162bbea6fc85b5388e7fdde)
> > >     User: impala
> > >     Connected User: impala
> > >     Delegated User:
> > >     Network Address: ::ffff:10.44.10.186:36325
> > >     Default Db: sjzy
> > >     Sql Statement: select
> > > MRECID,UNITID,PCQDM,PCQMC,PCXQDM,PCXQMC,DM,H001,H002,
> > H003,H021,H022,H023,H024,H025,H026A,H026B,H026C,H026D,H026E,
> > H026F,H026G,H027,H028,H029,H030,H031,H032,H033,H034,H035,
> > H036,H037A,H037B,H037C,H038,H039,H040,H041,H042,H043A,
> > H043B,H043C,H043D,H043E,H043F,H043G,H043H,H043I,H043J,H043K,
> > H043L,H044A,H044B,H044C,H044D,H044E,H044F,H044G,H044H,H044I,
> > H050,H051,H052,H053,H054,H055,H056,H061,H062,H063,H064,H065,
> > H066,H070,H071,H072,H073,H074,H075,H080,H100,H111,H112,H113,
> > H120,H200,H201,H202,H203,H204,H205,H206,H207,H208,H209,H210,
> > H211,H300,H320,H321,H322,H323,H324,H400,H401,H402,H403,H404,
> > H405,H406,H500,H600,H601,H602,H603,H604,H605,H606,H607,H608,
> > H609,H610,H611,H612,H613,H614,H615,H616,H621A,H621B,H621C,
> > H621D,H621E,H621F,H622A,H622B,H622C,H801,H802,H803,H804,H901,H902,H903
> > > FROM NP_2017_NP601 WHERE DS_AREACODE LIKE '445281%'
> > >     Coordinator: node1.sky.org:22000
> > >     Query Options (non default):
> > >     Plan:
> > > ----------------
> > > Estimated Per-Host Requirements: Memory=4.50GB VCores=1
> > >
> > > 01:EXCHANGE [UNPARTITIONED]
> > > |  hosts=4 per-host-mem=unavailable
> > > |  tuple-ids=0 row-size=1.67KB cardinality=1155911
> > > |
> > > 00:SCAN HDFS [sjzy.np_2017_np601, RANDOM]
> > >    partitions=1/1 files=20 size=1.06GB
> > >    predicates: DS_AREACODE LIKE '445281%'
> > >    table stats: 11559109 rows total
> > >    column stats: all
> > >    hosts=4 per-host-mem=4.50GB
> > >    tuple-ids=0 row-size=1.67KB cardinality=1155911
> > > ----------------
> > >     Estimated Per-Host Mem: 4831838208
> > >     Estimated Per-Host VCores: 1
> > >     Request Pool: default-pool
> > >     ExecSummary:
> > > Operator       #Hosts  Avg Time  Max Time    #Rows  Est. #Rows   Peak
> Mem
> > >  Est. Peak Mem  Detail
> > > ------------------------------------------------------------
> > -------------------------------------------------
> > > 01:EXCHANGE         1  32.314ms  32.314ms  317.25K       1.16M
>   0
> > >      -1.00 B  UNPARTITIONED
> > > 00:SCAN HDFS       20   1s137ms   1s348ms  317.25K       1.16M  163.85
> MB
> > >      4.50 GB  sjzy.np_2017_np601
> > >     Planner Timeline: 53.683ms
> > >        - Analysis finished: 24.565ms (24.565ms)
> > >        - Equivalence classes computed: 26.389ms (1.823ms)
> > >        - Single node plan created: 33.607ms (7.218ms)
> > >        - Runtime filters computed: 33.684ms (76.568us)
> > >        - Distributed plan created: 39.125ms (5.441ms)
> > >        - Planning finished: 53.683ms (14.558ms)
> > >     Query Timeline: 17m42s
> > >        - Start execution: 43.792us (43.792us)
> > >        - Planning finished: 60.640ms (60.596ms)
> > >        - Ready to start 20 remote fragments: 65.111ms (4.471ms)
> > >        - All 20 remote fragments started: 74.572ms (9.461ms)
> > >        - Rows available: 744.300ms (669.728ms)
> > >        - First row fetched: 790.128ms (45.828ms)
> > >        - Unregister query: 17m42s (17m42s)
> > >   ImpalaServer:
> > >      - ClientFetchWaitTimer: 17m31s
> > >      - RowMaterializationTimer: 10s024ms
> > >
> > > 2017-04-28 19:44 GMT+08:00 Jim Apple <jbapple@cloudera.com>:
> > >
> > >> dev@ does not appear to accept attachments. You can upload it
> somewhere
> > >> and
> > >> post a link, though.
> > >>
> > >> On Thu, Apr 27, 2017 at 11:35 PM, 吴朱华 <ikewu83@gmail.com> wrote:
> > >>
> > >> > Oops, I just resend it, you know the chinese network^_^
> > >> >
> > >> > 2017-04-28 14:20 GMT+08:00 Mostafa Mokhtar <mmokhtar@cloudera.com>:
> > >> >
> > >> >> Btw the profile wasn't attached.
> > >> >> Please resend.
> > >> >>
> > >> >> On Thu, Apr 27, 2017 at 11:11 PM, 吴朱华 <ikewu83@gmail.com>
wrote:
> > >> >>
> > >> >>> Profile is in the attachment, thanks
> > >> >>>
> > >> >>>
> > >> >>> 2017-04-28 13:10 GMT+08:00 Dimitris Tsirogiannis <
> > >> >>> dtsirogiannis@cloudera.com>:
> > >> >>>
> > >> >>>> Maybe you also want to post some information about the
schema
> (how
> > >> wide
> > >> >>>> your table is, does it use nested types, etc) as well
as the
> > profile
> > >> of
> > >> >>>> the
> > >> >>>> slow query.
> > >> >>>>
> > >> >>>> Dimitris
> > >> >>>>
> > >> >>>> On Thu, Apr 27, 2017 at 9:30 PM, 吴朱华 <ikewu83@gmail.com>
wrote:
> > >> >>>>
> > >> >>>> > Hi guys:
> > >> >>>> > we can facing a big issue when select * from a big
table.
> > >> >>>> > The performance is 17minutes for retrieving 400MB
data. Even
> slow
> > >> >>>> under
> > >> >>>> > JDBC situation.
> > >> >>>> > Is there anyway to improve it?^_^
> > >> >>>> >
> > >> >>>>
> > >> >>>
> > >> >>>
> > >> >>
> > >> >
> > >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message