impala-user mailing list archives

From Alexander Behm <alex.b...@cloudera.com>
Subject Re: REFRESH partitions
Date Mon, 19 Mar 2018 23:21:49 GMT
Did you have a different option in mind that might suit your needs better?

These are your options for discovering metadata changes external to Impala:
refresh <table>
refresh <table> PARTITION (partition_spec)
invalidate metadata <table>
recover partitions <table>
invalidate metadata (don't do this)

Those commands all do different things, so it really depends on your goals.

If you want new files/partitions to be incrementally discovered by Impala,
then use refresh.
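
For example, with a made-up table partitioned by year/month/day (the table
and column names here are placeholders, not from this thread):

  -- pick up new files in existing partitions, and newly added partitions
  REFRESH mydb.events;

  -- cheaper: reload metadata for a single partition only
  REFRESH mydb.events PARTITION (year=2018, month=3, day=19);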



On Mon, Mar 19, 2018 at 12:49 PM, Fawze Abujaber <fawzeaj@gmail.com> wrote:

> Thanks Tim and Juan,
>
> So there are no options other than running the refresh statement every hour
> or letting the Spark job run it after writing the Parquet files.
>
> On Mon, Mar 19, 2018 at 9:34 PM, Tim Armstrong <tarmstrong@cloudera.com>
> wrote:
>
>> Don't use the -r option to impala-shell! That option was a mistake, and
>> it has been removed in Impala 3.0. The problem is that it does a global
>> invalidate, which is expensive because it requires reloading all metadata.
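>>
>> In other words, connecting with
>>
>>   impala-shell -r
>>
>> is roughly the same as running a bare INVALIDATE METADATA (no table name)
>> right after connecting.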
>>
>> On 19 Mar. 2018 10:35, "Juan" <anyion@gmail.com> wrote:
>>
>>> If the table is partitioned by year, month, day, but not hour, running
>>> recover partitions is not a good idea.
>>> Recover partitions only loads metadata when it discovers a new partition;
>>> for existing partitions, even if there is new data, recover partitions will
>>> ignore it, so the table metadata could be out of date and queries could
>>> return wrong results.
>>>
>>> If the Spark job is not running very frequently, you can run a refresh
>>> statement for the specific partition after the job completes, or run it
>>> once per hour:
>>>
>>> REFRESH [db_name.]table_name [PARTITION (key_col1=val1 [, key_col2=val2...])]
>>>
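>>> For instance, with hypothetical table and partition column names:
>>>
>>>   REFRESH mydb.events PARTITION (year=2018, month=3, day=17);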
>>>
>>> On Sat, Mar 17, 2018 at 1:10 AM, Fawze Abujaber <fawzeaj@gmail.com>
>>> wrote:
>>>
>>>> Hello Guys,
>>>>
>>>> I have Parquet files that a Spark job generates. I'm defining an external
>>>> table on these Parquet files, partitioned by year, month, and day. The
>>>> Spark job feeds these tables each hour.
>>>>
>>>> I have a cron job that runs every hour and executes the command:
>>>>
>>>>  alter table $(table_name) recover partitions
>>>>
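>>>> The crontab entry looks roughly like this (the host and table name are
>>>> placeholders):
>>>>
>>>>   0 * * * * impala-shell -i some-impalad-host -q "alter table mydb.events recover partitions"
>>>>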
>>>> I'm looking for other solutions, if Impala offers any, such as a
>>>> configuration option; for example, I'm wondering whether I need to educate
>>>> the end users to use the -r option to refresh the table.
>>>>
>>>>
>>>> Are there any other solutions for recovering partitions?
