hive-user mailing list archives

From "@Sanjiv Singh" <sanjiv.is...@gmail.com>
Subject Re: Spark SQL is not returning records for hive bucketed tables on HDP
Date Tue, 23 Feb 2016 09:19:43 GMT
Yes, it is very strange, and quite contrary to my understanding of how Spark SQL
behaves with hive tables.

I am facing this issue on the HDP setup, on which compaction is required only
once.
The plain Apache setup, on the other hand, doesn't require compaction even once.

Maybe something gets triggered in the metastore after compaction, after which Spark SQL
starts recognizing the delta files.

Let me know if you need any other details to get to the root cause.



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 2:28 PM, Varadharajan Mukundan <srinathsmn@gmail.com
> wrote:

> That's interesting. I'm not sure why the first compaction is needed but not on
> the subsequent inserts. Maybe it's just to create some metadata. Thanks for
> clarifying this :)
>
> On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh <sanjiv.is.on@gmail.com>
> wrote:
>
>> Try this,
>>
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into default.foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> hive> insert into foo values(20);
>>
>> scala> sqlContext.table("default.foo").count // Gives 2, no compaction
>> required.
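>>
>> As a side note (a hedged check; the warehouse path below is an assumption and
>> depends on your configuration), you can confirm the major compaction actually
>> produced a base directory before re-running the Spark count:
>>
>> hadoop fs -ls /apps/hive/warehouse/foo
>>
>> Only once a base_N directory appears alongside the delta_* directories would I
>> expect the Spark-side counts above to change.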
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <
>> srinathsmn@gmail.com> wrote:
>>
>>> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure
>>> if that behaves differently.
>>>
>>> Please walk through the commands below in the same order to understand
>>> the sequence.
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>>> because data is still in delta files
>>>
>>> Now run major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>>
>>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is.on@gmail.com>
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>>
>>>> That is the point: Spark SQL is able to recognize delta files. See the
>>>> directory structure below, with ONE BASE (43 records) and one DELTA (created
>>>> after the last insert). And I am able to see the last insert through Spark SQL.
>>>>
>>>>
>>>> See the complete scenario below:
>>>>
>>>> Steps:
>>>>
>>>>    - Inserted 43 records into the table.
>>>>    - Ran major compaction on the table:
>>>>      alter table mytable COMPACT 'major';
>>>>    - Disabled auto compaction on the table:
>>>>      alter table mytable set
>>>>      TBLPROPERTIES("NO_AUTO_COMPACTION"="true");
>>>>    - Inserted 1 record into the table.
>>>>
>>>>
>>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000087
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>>>
>>>> SPARK JDBC:
>>>>
>>>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>>>> +------+
>>>> | _c0  |
>>>> +------+
>>>> | 44   |
>>>> +------+
>>>> 1 row selected (1.196 seconds)
>>>>
>>>> HIVE JDBC:
>>>>
>>>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>>>> +------+--+
>>>> | _c0  |
>>>> +------+--+
>>>> | 44   |
>>>> +------+--+
>>>> 1 row selected (0.121 seconds)
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>>>> srinathsmn@gmail.com> wrote:
>>>>
>>>>> Hi Sanjiv,
>>>>>
>>>>> Yes. If we make use of Hive JDBC we should be able to retrieve all
>>>>> the rows, since it is Hive that processes the query. But I think the
>>>>> problem with Hive JDBC is that there are two layers of processing: Hive first,
>>>>> and then Spark working on the result set. Another issue is that performance is
>>>>> limited to that single HiveServer2 node and its network.
>>>>>
>>>>> But if we make use of the sqlContext.table function in Spark to access
>>>>> Hive tables, it is supposed to read the files directly from HDFS, skipping the
>>>>> Hive layer. However, it doesn't read the delta files and only reads the contents
>>>>> of the base folder. Only after a major compaction are the delta files
>>>>> merged into the base folder and made visible to Spark SQL.
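>>>>>
>>>>> A quick way to see this from the Spark side (just a sketch; it assumes Spark
>>>>> 1.4+, where a DataFrame exposes inputFiles, and reuses the default.foo table
>>>>> from elsewhere in this thread):
>>>>>
>>>>> scala> // list the HDFS paths Spark will actually scan for the table
>>>>> scala> sqlContext.table("default.foo").inputFiles.foreach(println)
>>>>>
>>>>> Before compaction I would expect to see only base_* paths here (or nothing at
>>>>> all), with the delta_* directories written by transactional inserts missing
>>>>> from the list.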
>>>>>
>>>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <
>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>
>>>>>> Hi Varadharajan,
>>>>>>
>>>>>> Can you elaborate on this (which you quoted in a previous mail):
>>>>>> "I observed that hive transaction storage structure do not work with
>>>>>> spark yet"
>>>>>>
>>>>>>
>>>>>> If it is about delta files being created after each transaction and
>>>>>> Spark not being able to recognize them, then consider my table
>>>>>> mytable (ORC, BUCKETED, NON-SORTED), which has already gone through lots of
>>>>>> inserts, updates and deletes. I can see the delta files created in HDFS (see
>>>>>> below), yet I am still able to fetch consistent records through both Spark JDBC
>>>>>> and Hive JDBC.
>>>>>>
>>>>>> No compaction has been triggered for that table.
>>>>>>
>>>>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>>>>>
>>>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>>>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>
>>>>>>> Actually, auto compaction, if enabled, is triggered based on the
>>>>>>> volume of changes. It doesn't automatically run after every insert. I think
>>>>>>> it's possible to reduce the thresholds, but that might reduce performance by
>>>>>>> a big margin. As of now, we run compaction after the batch insert completes.
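>>>>>>>
>>>>>>> For reference, these are the thresholds I mean (a hedged sketch of the
>>>>>>> relevant hive-site.xml knobs with their usual defaults; the values are not
>>>>>>> tuned for any particular cluster):
>>>>>>>
>>>>>>>     <property>
>>>>>>>       <!-- number of delta directories that triggers a minor compaction -->
>>>>>>>       <name>hive.compactor.delta.num.threshold</name>
>>>>>>>       <value>10</value>
>>>>>>>     </property>
>>>>>>>
>>>>>>>     <property>
>>>>>>>       <!-- delta-to-base size ratio that triggers a major compaction -->
>>>>>>>       <name>hive.compactor.delta.pct.threshold</name>
>>>>>>>       <value>0.1</value>
>>>>>>>     </property>
>>>>>>>
>>>>>>> Lowering them makes the initiator compact more aggressively, at the cost of
>>>>>>> extra background work.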
>>>>>>>
>>>>>>> The only other way to solve this problem at the moment is to use the Hive
>>>>>>> JDBC API.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <
>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>
>>>>>>>> Compaction should have been triggered automatically, as the following
>>>>>>>> properties are already set in hive-site.xml, and the
>>>>>>>> NO_AUTO_COMPACTION property has not been set for these tables.
>>>>>>>>
>>>>>>>>
>>>>>>>>     <property>
>>>>>>>>
>>>>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>>>>
>>>>>>>>       <value>true</value>
>>>>>>>>
>>>>>>>>     </property>
>>>>>>>>
>>>>>>>>     <property>
>>>>>>>>
>>>>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>>>>
>>>>>>>>       <value>1</value>
>>>>>>>>
>>>>>>>>     </property>
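>>>>>>>>
>>>>>>>> One way to check whether the initiator is actually picking these tables up
>>>>>>>> (just a suggestion; the exact columns shown differ between Hive versions) is
>>>>>>>> to look at the compaction queue and history from the Hive CLI:
>>>>>>>>
>>>>>>>> hive> SHOW COMPACTIONS;
>>>>>>>>
>>>>>>>> Entries with type MAJOR/MINOR and a state such as "initiated", "working" or
>>>>>>>> "ready for cleaning" would indicate the automatic compactor has kicked in.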
>>>>>>>>
>>>>>>>>
>>>>>>>> The documentation can be misleading sometimes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, I was burned by this issue a couple of weeks back. This
>>>>>>>>> also means that after every insert job, a compaction should be run so that the
>>>>>>>>> new rows are accessible from Spark. Sad that this issue is not documented or
>>>>>>>>> mentioned anywhere.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Varadharajan,
>>>>>>>>>>
>>>>>>>>>> Thanks for your response.
>>>>>>>>>>
>>>>>>>>>> Yes, it is a transactional table; see the output of show create
>>>>>>>>>> table below.
>>>>>>>>>>
>>>>>>>>>> The table hardly has 3 records, and after triggering
>>>>>>>>>> compaction on the table, it started showing results in
>>>>>>>>>> Spark SQL.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> > ALTER TABLE hivespark COMPACT 'major';
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> > show create table hivespark;
>>>>>>>>>>
>>>>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>>>>
>>>>>>>>>>     `id` int,
>>>>>>>>>>
>>>>>>>>>>     `name` string)
>>>>>>>>>>
>>>>>>>>>>   CLUSTERED BY (
>>>>>>>>>>
>>>>>>>>>>     id)
>>>>>>>>>>
>>>>>>>>>>   INTO 32 BUCKETS
>>>>>>>>>>
>>>>>>>>>>   ROW FORMAT SERDE
>>>>>>>>>>
>>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>>>
>>>>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>>>>
>>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>>>
>>>>>>>>>>   OUTPUTFORMAT
>>>>>>>>>>
>>>>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>>>
>>>>>>>>>>   LOCATION
>>>>>>>>>>
>>>>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>>>   TBLPROPERTIES (
>>>>>>>>>>
>>>>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>>>
>>>>>>>>>>     'last_modified_by'='root',
>>>>>>>>>>
>>>>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>>>>
>>>>>>>>>>     'numFiles'='37',
>>>>>>>>>>
>>>>>>>>>>     'numRows'='3',
>>>>>>>>>>
>>>>>>>>>>     'rawDataSize'='0',
>>>>>>>>>>
>>>>>>>>>>     'totalSize'='11383',
>>>>>>>>>>
>>>>>>>>>>     'transactional'='true',
>>>>>>>>>>
>>>>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sanjiv Singh
>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>>>>> srinathsmn@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Is the transaction attribute set on your table? I observed that
>>>>>>>>>>> the Hive transactional storage structure does not work
>>>>>>>>>>> with Spark yet. You can confirm this by looking at the transactional
>>>>>>>>>>> attribute in the output of
>>>>>>>>>>> "desc extended <tablename>" in the hive console.
>>>>>>>>>>>
>>>>>>>>>>> If you need to access a transactional table,
>>>>>>>>>>> consider doing a
>>>>>>>>>>> major compaction and then try accessing the table again.
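>>>>>>>>>>>
>>>>>>>>>>> For example (illustration only; substitute your own table name), look for
>>>>>>>>>>> transactional=true among the table parameters in the output of either of:
>>>>>>>>>>>
>>>>>>>>>>> hive> desc extended mytable;
>>>>>>>>>>> hive> desc formatted mytable;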
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>>>>> sanjiv.is.on@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I have observed that Spark SQL is not returning records for
>>>>>>>>>>>> hive bucketed ORC tables on HDP.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In Spark SQL, I am able to list all tables, but queries on
>>>>>>>>>>>> hive bucketed tables are not returning records.
>>>>>>>>>>>>
>>>>>>>>>>>> I have also tried the same for non-bucketed hive tables; they are
>>>>>>>>>>>> working fine.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know if you need other details.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Sanjiv Singh
>>>>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks,
>>>>>>>>>>> M. Varadharajan
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> "Experience is what you get when you didn't get
what you wanted"
>>>>>>>>>>>                -By Prof. Randy Pausch in "The
Last Lecture"
>>>>>>>>>>>
>>>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks,
>>>>>>>>> M. Varadharajan
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> "Experience is what you get when you didn't get what
you wanted"
>>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>>
>>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>
