hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Furcy Pin <pin.fu...@gmail.com>
Subject Re: partitioned hive table
Date Tue, 31 Oct 2017 08:37:21 GMT
Hi,

If you want to load pre-existing records, instead of inserting data in this
partition, you should use the ADD PARTITION statement
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions

or simply the MSCK REPAIR TABLE statement
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)


Something that new Hive users often miss out is that Hive does not detect
automatically external data, and that sometimes a table's data and metadata
can be off sync.

What happened in your case is that Hive keeps a record count for the
partition in its metadata,
when you insert data in the partition, Hive updates its count on the fly
with the number of rows you have inserted (here: 1).

So when you do a SELECT *, all your json files are read, but when you do a
SELECT COUNT(*), Hive will just fetch that number to respond faster.

By running COMPUTE STATISTICS this number is updated, but the correct way
is to use MSCK REPAIR TABLE to tell Hive to update its partition metadata.

Regards,

Furcy









2017-10-31 1:25 GMT+01:00 Jiewen Shao <fifistorm123@gmail.com>:

> Thanks Mich,
> ANALYZE TABLE PARTITION(dt='2017-08-20, bar='hello'')  COMPUTE STATISTICS
> indeed make count(*) returns correct value (for the partition only).
>
> but my hive table was not able to get data from those pre-existed json
> file unless I insert one record for the partition AND run ANALYZE TABLE
> ... COMPUTE STATISTICS for the partition. I must have missed something.
>
> How to make those preexisted json visible in hive table?
>
> On Mon, Oct 30, 2017 at 4:53 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> have you analyzed table for the partition?
>>
>> ANALYZE TABLE test_table PARTITION('2017-08-20, bar='hello'') COMPUTE
>> STATISTICS;
>>
>> and do count(*) from table
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 October 2017 at 22:29, Jiewen Shao <fifistorm123@gmail.com> wrote:
>>
>>> Hi, I have persisted lots of JSON files on S3 under partitioned
>>> directories such as /bucket/table1/dt=2017-10-28/bar=hello/*
>>>
>>> 1. Now I created a hive table:
>>> CREATE EXTERNAL TABLE table1 (.... )
>>> PARTITIONED BY (dt string, bar string) ROW FORMAT serde
>>> 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION 's3://bucket/table1';
>>>
>>> 2. Under hive commandline
>>> select * from table1;    // return nothing
>>>
>>> 3. INSERT INTO TABLE  table1 PARTITION (dt='2017-08-28', bar='hello')
>>> select ....;
>>>
>>> 4. now select * from table1;    // return all the data from that
>>> partition
>>>
>>> 5. select count(*) from table1;  // returns 1
>>>
>>> Can someone explain what did  I miss?
>>>
>>> Thanks a lot!
>>>
>>
>>
>
2017-10-31 1:25 GMT+01:00 Jiewen Shao <fifistorm123@gmail.com>:

> Thanks Mich,
> ANALYZE TABLE PARTITION(dt='2017-08-20, bar='hello'')  COMPUTE STATISTICS
> indeed make count(*) returns correct value (for the partition only).
>
> but my hive table was not able to get data from those pre-existed json
> file unless I insert one record for the partition AND run ANALYZE TABLE
> ... COMPUTE STATISTICS for the partition. I must have missed something.
>
> How to make those preexisted json visible in hive table?
>
> On Mon, Oct 30, 2017 at 4:53 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> have you analyzed table for the partition?
>>
>> ANALYZE TABLE test_table PARTITION('2017-08-20, bar='hello'') COMPUTE
>> STATISTICS;
>>
>> and do count(*) from table
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 October 2017 at 22:29, Jiewen Shao <fifistorm123@gmail.com> wrote:
>>
>>> Hi, I have persisted lots of JSON files on S3 under partitioned
>>> directories such as /bucket/table1/dt=2017-10-28/bar=hello/*
>>>
>>> 1. Now I created a hive table:
>>> CREATE EXTERNAL TABLE table1 (.... )
>>> PARTITIONED BY (dt string, bar string) ROW FORMAT serde
>>> 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION 's3://bucket/table1';
>>>
>>> 2. Under hive commandline
>>> select * from table1;    // return nothing
>>>
>>> 3. INSERT INTO TABLE  table1 PARTITION (dt='2017-08-28', bar='hello')
>>> select ....;
>>>
>>> 4. now select * from table1;    // return all the data from that
>>> partition
>>>
>>> 5. select count(*) from table1;  // returns 1
>>>
>>> Can someone explain what did  I miss?
>>>
>>> Thanks a lot!
>>>
>>
>>
>

Mime
View raw message