impala-issues mailing list archives

From "Dimitris Tsirogiannis (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-1823) Adding New Data To Tables is EXTREMELY SLOW
Date Thu, 06 Apr 2017 16:57:41 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimitris Tsirogiannis resolved IMPALA-1823.
-------------------------------------------
    Resolution: Information Provided

IMPALA-1480 is resolved and we've improved the performance of metadata loading. Please reopen
if the issue persists. 

> Adding New Data To Tables is EXTREMELY SLOW
> -------------------------------------------
>
>                 Key: IMPALA-1823
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1823
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Perf Investigation
>    Affects Versions: Impala 2.1.1
>            Reporter: Brian Helm
>            Assignee: Dimitris Tsirogiannis
>            Priority: Minor
>              Labels: performance, usability
>
> First, a little background to help you see whether the slowness is caused by something we are doing.
> We have thousands of incoming data streams sending us data.  Because we need to enforce quotas on the size and number of days stored for each individual stream, each stream gets its own DB with individual tables.
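> Roughly what the per-stream layout looks like (schema simplified, all names here made up for illustration):
>
>   CREATE DATABASE IF NOT EXISTS stream_00042;
>   CREATE TABLE stream_00042.events (ts BIGINT, payload STRING)
>     PARTITIONED BY (dt STRING)
>     STORED AS PARQUET;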
> We have created a MapReduce job that, on the map side, pulls data in for all streams from Apache Kafka for 60 seconds, then emits key-value pairs for the reducers, using the final HDFS storage location of the data as the key and the data itself as the value.  On the reduce side, each task takes a key (the intended file location) and its aggregated values, stores them in HDFS, and updates Impala in the process.
> Now here's where our struggles lie.  We have attempted two methods to get Impala to use the new data.  First, we stored the data in a "staging" location and then called LOAD DATA on each file from the reducers to place it in its final storage location.  Going this route, the reduce phase of our job took around 2 hours to complete.
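> In that approach, each reducer issues one statement per file, along these lines (paths and names made up):
>
>   LOAD DATA INPATH '/staging/stream_00042/part-r-00007'
>   INTO TABLE stream_00042.events PARTITION (dt='2015-03-20');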
> The second method we used was to have the reducer store the data directly in its final storage location, then call REFRESH to have Impala update the metadata to include the new files.  Again, this method also takes ~2 hours to complete for all ingested data.
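> In the second approach, the per-table statement is just something like:
>
>   REFRESH stream_00042.events;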
> If we take out all logic that interacts with Impala from the reducers, they take around 1.5 minutes to complete.
> Each run of the M/R job ingests approximately 20 million pieces of data.  Because of the constant flow of data, we need to be able to run the M/R jobs one right after the other, and we need job completion times in the range we are seeing for the latter case, where Impala is not updated in the reduce phase.
> We are using daily partitions for each table, and only the most recent one is updated
via this process.  With each M/R job run, I would estimate that a total of around 60,000 individual
partitions are receiving new data.
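> For reference, a day's partition is added with a statement like the following (names made up as before):
>
>   ALTER TABLE stream_00042.events ADD PARTITION (dt='2015-03-20');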
> When looking at the query times for LOAD DATA and REFRESH, we're seeing times of 40-60 seconds for each query.
> Is there anything in the methodology above that would cause metadata updates in Impala to be so extremely slow?
> Do you have any suggestions on what we can do to work around this?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
