hive-dev mailing list archives

From "Selina Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7368) datanucleus sometimes returns an empty result instead of an error or data
Date Wed, 10 Sep 2014 01:57:29 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127931#comment-14127931 ]

Selina Zhang commented on HIVE-7368:
------------------------------------

Hi Sush,

My sense of the root cause is probably not the same as yours, so I just want to offer another
possibility. Please correct me if I am wrong.

Existing databases (tables) reported as non-existent: This happens because the connection request
to the db (mysql/oracle) was rejected, since the connection pool is small and the thread wait time
is too short. Currently this "internal error" exception is mistakenly converted to a
NoSuchObjectException. We have to fix the misleading error message.
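
To make the failure mode concrete, here is a minimal Python sketch of what I mean. It is not
Hive code: TinyPool, the exception classes and both get_database_* functions are hypothetical
names made up for illustration. It only shows how a pool timeout can surface as "not found" when
every failure is mapped to the same exception, and how keeping the two cases apart avoids that.

```python
import threading


class PoolTimeoutError(Exception):
    """Hypothetical: no pooled connection became free within the wait time."""


class NoSuchObjectError(Exception):
    """Hypothetical stand-in for the metastore's NoSuchObjectException."""


class MetaStoreInternalError(Exception):
    """Hypothetical stand-in for an internal metastore error."""


class TinyPool:
    """Toy connection pool: free slots are tracked by a semaphore."""

    def __init__(self, size, max_wait_seconds):
        self._slots = threading.Semaphore(size)
        self._max_wait = max_wait_seconds

    def borrow(self):
        # Give up if no slot frees within the configured wait time.
        if not self._slots.acquire(timeout=self._max_wait):
            raise PoolTimeoutError("no free connection")

    def give_back(self):
        self._slots.release()


DATABASES = {"default"}  # pretend backing store


def _lookup(pool, name):
    pool.borrow()
    try:
        if name not in DATABASES:
            raise KeyError(name)
        return name
    finally:
        pool.give_back()


def get_database_misleading(pool, name):
    # Collapses every failure, including pool exhaustion, into "not found".
    try:
        return _lookup(pool, name)
    except Exception as exc:
        raise NoSuchObjectError(name) from exc


def get_database_fixed(pool, name):
    # Only a genuine miss becomes "not found"; pool trouble stays an internal error.
    try:
        return _lookup(pool, name)
    except KeyError as exc:
        raise NoSuchObjectError(name) from exc
    except PoolTimeoutError as exc:
        raise MetaStoreInternalError("could not obtain a connection") from exc
```

With a pool of size 1 whose only slot is already borrowed, get_database_misleading(pool, "default")
raises NoSuchObjectError even though "default" exists, while get_database_fixed raises
MetaStoreInternalError.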

Parallel execution: The metastore usually holds connections for a very long time because many
drop/add/alter operations involve HDFS operations. Sometimes table stats are also collected
during that window. In addition, connections to the db are not shared between metastore
clients. So the best practice for parallelism is to increase the size of the connection
pool (DBCP, for example). The db load is not heavy at all, so we can rely on the concurrency of
the existing RDBMS. DirectSQL get_database will definitely hold a connection for much less time
than ORM get_database, so the connection shortage problem may not be as obvious there.
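
A rough way to see the effect of pool size when connections are held for a long time is the
small Python simulation below. It is only a sketch under assumptions: count_failures and its
parameters are hypothetical, the sleep stands in for an HDFS-bound operation, and the numbers are
arbitrary rather than measured from any metastore.

```python
import threading
import time


def count_failures(pool_size, workers, hold_seconds, max_wait_seconds):
    """Return how many workers fail to get a connection slot in time."""
    slots = threading.Semaphore(pool_size)
    start = threading.Barrier(workers)
    failures = []
    lock = threading.Lock()

    def worker():
        start.wait()  # release all workers at once
        if not slots.acquire(timeout=max_wait_seconds):
            with lock:
                failures.append(1)
            return
        try:
            time.sleep(hold_seconds)  # stands in for a slow HDFS-bound operation
        finally:
            slots.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(failures)
```

For example, count_failures(2, 6, 0.3, 0.05) models six clients competing for two slots with a
short wait, and count_failures(6, 6, 0.3, 0.05) models the same load with one slot per client.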

I think "datanucleus.connectionPool.testSQL=SELECT 1" is the validation query DBCP uses to
validate the underlying connection to the RDBMS. With it set, DBCP will guarantee that each
connection we borrow from the connection pool is valid.
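
The idea of validating on borrow can be sketched in a few lines of Python. This is a toy
illustration, not DBCP or DataNucleus code: ValidatingPool is a hypothetical class, and sqlite3 is
used only because it ships with Python and rejects queries on a closed connection.

```python
import sqlite3
from collections import deque


class ValidatingPool:
    """Toy pool that runs a test query before handing out an idle connection."""

    def __init__(self, factory, test_sql="SELECT 1"):
        self._factory = factory      # callable that opens a new connection
        self._test_sql = test_sql    # validation query, analogous to testSQL
        self._idle = deque()

    def _is_valid(self, conn):
        try:
            conn.execute(self._test_sql).fetchone()
            return True
        except sqlite3.Error:
            return False

    def borrow(self):
        # Skip idle connections that fail validation; open a new one if none pass.
        while self._idle:
            conn = self._idle.popleft()
            if self._is_valid(conn):
                return conn
        return self._factory()

    def give_back(self, conn):
        self._idle.append(conn)
```

If a connection is closed while idle and then returned to the pool, the next borrow() discards it
and hands out a fresh one instead of passing the broken connection to the caller.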

Thanks,


> datanucleus sometimes returns an empty result instead of an error or data
> -------------------------------------------------------------------------
>
>                 Key: HIVE-7368
>                 URL: https://issues.apache.org/jira/browse/HIVE-7368
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.12.0
>            Reporter: Sushanth Sowmyan
>
> I investigated a scenario wherein a user needed to use a large number of concurrent hive
> clients doing simple DDL tasks, while not using a standalone metastore server. Say, e.g.,
> each of them doing "drop table if exists tmp_blah_${i};"
> This would consistently fail stating that it could not create a db, which is a funny
> error to have when trying to drop a db "if exists". On digging in, it turned out that the
> error was a mistaken report, coming instead from an attempt by the embedded metastore
> to create a "default" db when it did not exist. The funny thing being that the default db
> did exist, and the getDatabase call would return empty, rather than returning an error or
> returning a result. We could disable hive.metastore.checkForDefaultDb and the number of these
> reports would drastically fall, but that only moved the problem, and we'd get phantom reports
> from time to time of various other databases that existed being reported as non-existent.
> On digging further, parallelism seemed to be an important factor in whether or not hive
> was able to perform getDatabases without error. With about 20 simultaneous processes, there
> seemed to be no errors whatsoever. At about 40 simultaneous processes, at least 1 would
> consistently fail. At about 200, about 15-20 would consistently fail, in addition to taking
> a long time to run.
> I wrote a sample JDBC ping (actually a get_database mimic) utility to see whether the
> issue was with connecting from that machine to the database server, and this had no errors
> whatsoever up to 400 simultaneous processes. The mysql server in question was configured to
> serve up to 650 connections, and it seemed to be serving responses quickly and did not seem
> overloaded. We also disabled connection pooling in case that was exacerbating a connection
> availability issue with that many concurrent processes, each with an embedded metastore. That,
> especially in conjunction with disabling schema checking and specifying
> "datanucleus.connectionPool.testSQL=SELECT 1", did a fair amount for performance in these
> scenarios, but the errors (or rather, the null-result successes when there shouldn't have
> been any) continued.
> On checking through hive again, if we modified hive to have datanucleus simply return
> a connection, with which we did a direct sql get database, there would not be any error, but
> if we tried to use jdo on datanucleus to construct a db object, we would get an empty result,
> so the issue seems to crop up in the jdo mapping.
> One of the biggest issues with this investigation, for me, was the difficulty of
> reproducibility. When trying to reproduce in a lab, we were unable to create a similar enough
> environment that caused the issue. Even in the client's environment, moving from RHEL5 to
> RHEL6 made the issue go away.
> Thus, we still have work to do on determining the underlying issue. I'm logging this
> issue to collect information on similar issues we discover so we can work towards nailing
> down the issue and then fixing it (in DN if need be).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
