atlas-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ATLAS-503) Not all Hive tables are not imported into Atlas when interrupted with search queries while importing.
Date Fri, 27 May 2016 16:07:12 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304252#comment-15304252 ]

Hemanth Yamijala commented on ATLAS-503:
----------------------------------------

An update on what I’ve investigated so far:

*tl;dr*

In the interim, I am thinking of a retry-based solution to balance the concurrency requirements
driven by performance against the correctness requirements uncovered by this bug. The longer-term
fix will likely come either from ATLAS-496 or from a deeper understanding of the Titan graph model.

*Longer read, with excuses*

When using HBase as the storage backend, we have observed two specific scenarios in which we got
lock-related exceptions:
* Creating traits concurrently.
* Ingesting data from Hive with more than one topic partition and consumer thread.

The exceptions are triggered when a transaction is committed - which makes sense, because Titan
enforces consistency constraints only on commit, as described [here|http://s3.thinkaurelius.com/docs/titan/0.5.4/eventual-consistency.html].
The commits happen from two specific places, each corresponding to one of the use cases above,
respectively (a rough sketch of both paths follows the list):
* {{ManagementSystem.commit}}
* {{TitanGraph.commit}}
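
To make the two paths concrete, here is a rough sketch of what each commit site looks like with
the Titan 0.5.x API. This is purely illustrative - the property file name and the schema/entity
details are made up, not Atlas's actual code:

{code:java}
// Illustrative only: the two places where lock exceptions surface on commit.
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanTransaction;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import com.tinkerpop.blueprints.Vertex;

public class CommitPaths {
    public static void main(String[] args) {
        TitanGraph graph = TitanFactory.open("titan-hbase.properties"); // hypothetical config file

        // 1. Schema changes (e.g. creating trait types concurrently) commit via the management system.
        TitanManagement mgmt = graph.getManagementSystem();
        mgmt.makePropertyKey("exampleTrait").dataType(String.class).make(); // made-up property key
        mgmt.commit(); // ManagementSystem.commit - consistency checks run here

        // 2. Data ingestion (e.g. the hive hook consumer threads) commits regular transactions.
        TitanTransaction tx = graph.newTransaction();
        Vertex v = tx.addVertex(null);
        v.setProperty("name", "example_table"); // hypothetical property
        tx.commit(); // TitanGraph transaction commit - the second place the exceptions show up

        graph.shutdown();
    }
}
{code}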

Also, with [~suma.shivaprasad]'s help, I understood that we have changed the HBase store
manager configuration in Atlas (from the Titan defaults) to indicate that we will take
care of locking ourselves. This was done because Titan's own locking implementation was
otherwise found to heavily degrade performance (I have confirmed this with tests at my
end as well).

To handle this locking ourselves, we have implemented a pessimistic locking mechanism
in {{HBaseKeyColumnValueStore.acquireLock}}. Further, if there is a lock conflict, we immediately
throw a {{PermanentLockingException}} and the transaction fails. The granularity of the lock
is a (store, key, column) combination. In the tests above, we are running into exactly this
scenario: multiple threads concurrently trying to acquire a lock at the same granularity.
Specifically, for the two scenarios above, I've observed, respectively:
* lock on the edgestore database (that stores the adjacency graph of Titan)
* lock on the graph index database (that maps from property value to vertex)

What has been difficult is identifying the specific key and column on which the lock is being
acquired. The key and column values are heavily encoded and, except for some printable characters,
it has not been easy to decipher them.
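
For reference, here is a much-simplified sketch of the pessimistic behaviour described above -
not the actual {{HBaseKeyColumnValueStore}} code, just a model of "first transaction wins, a
conflicting transaction fails immediately" at (store, key, column) granularity:

{code:java}
// Simplified model only - NOT the real HBaseKeyColumnValueStore.acquireLock implementation.
import java.util.concurrent.ConcurrentHashMap;

public class PessimisticLockSketch {

    // Stand-in for Titan's PermanentLockingException.
    static class PermanentLockingException extends Exception {
        PermanentLockingException(String message) { super(message); }
    }

    // Lock granularity observed in the tests: store + key + column.
    private final ConcurrentHashMap<String, String> heldLocks = new ConcurrentHashMap<>();

    void acquireLock(String store, String key, String column, String txId)
            throws PermanentLockingException {
        String lockId = store + "|" + key + "|" + column;
        String owner = heldLocks.putIfAbsent(lockId, txId);
        if (owner != null && !owner.equals(txId)) {
            // No waiting, no retry: the conflicting transaction fails straight away,
            // which is what the concurrent hive import runs into.
            throw new PermanentLockingException(
                    "Lock on (" + store + ", " + key + ", " + column + ") held by " + owner);
        }
    }

    // Called when the owning transaction commits or rolls back.
    void releaseLocks(String txId) {
        heldLocks.values().removeIf(txId::equals);
    }
}
{code}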

The general fixes for locking problems, as described [here by Stephen Mallette|https://groups.google.com/d/msg/aureliusgraphs/LbOx0wKhULc/u6q63GQrkg0J],
include:
* Retry transactions (a sketch of what this could look like follows this list)
* Keep committing transactions regularly - which I think we already do for the most part.
* Change the schema to eliminate the need for locking
In my mind, these are in increasing order of complexity.
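
For the first option, the retry would have to wrap the whole transaction: mutate, commit, and if
the commit fails, roll back and redo the work. A rough sketch of what that could look like with
the Titan API, assuming the work being retried is safe to re-apply:

{code:java}
// Sketch only: retry the whole transaction when the commit fails on a lock conflict.
import com.thinkaurelius.titan.core.TitanException;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanTransaction;

public class RetryingCommit {

    static void runWithRetries(TitanGraph graph, int maxAttempts) throws InterruptedException {
        TitanException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            TitanTransaction tx = graph.newTransaction();
            try {
                // ... re-apply the graph mutations here (entities, traits, edges) ...
                tx.commit();                   // lock conflicts surface here, on commit
                return;                        // success
            } catch (TitanException e) {
                if (tx.isOpen()) {
                    tx.rollback();             // discard this attempt
                }
                last = e;
                Thread.sleep(100L * attempt);  // simple backoff before retrying
            }
        }
        throw last;                            // exhausted all attempts
    }
}
{code}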

(Also note ATLAS-496, which [~dkantor] opened.)

In the interim, I tried two experiments to fix this issue of concurrent updates:
* *Synchronize the commits* - within a single JVM instance this will clearly work, but it will most
likely impact performance. However, my experiments show that it is still faster than letting
Titan manage the locking. A slightly more sophisticated fix would be to synchronize on
the specific (store, key, column) values to minimize contention. This has the risk of
causing deadlocks though, as I don't know whether we can assume a uniform locking order
across threads.
* *Add retries to {{HBaseKeyColumnValueStore.acquireLock}}* - this worked too, to an extent:
the number of retries should equal the concurrency expected in the worst case of all
concurrent threads trying to lock the same store, key and column. This is configurable
via the option {{atlas.graph.storage.lock.retries}} (a sketch of this follows the list).
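
For completeness, a rough sketch of the second experiment, building on the simplified
{{acquireLock}} model above (the wait time between retries is a made-up knob here):

{code:java}
// Sketch of the second experiment: bounded retries inside acquireLock, driven by
// atlas.graph.storage.lock.retries, before giving up with a PermanentLockingException.
void acquireLockWithRetries(String store, String key, String column, String txId,
                            int maxRetries, long waitMillis)
        throws PermanentLockingException, InterruptedException {
    String lockId = store + "|" + key + "|" + column;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
        String owner = heldLocks.putIfAbsent(lockId, txId);
        if (owner == null || owner.equals(txId)) {
            return;                 // lock acquired (or already held by this transaction)
        }
        Thread.sleep(waitMillis);   // another transaction holds it; wait and try again
    }
    throw new PermanentLockingException(
            "Could not acquire lock on (" + store + ", " + key + ", " + column + ") after "
            + maxRetries + " retries");
}
{code}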

The *right* solution - changing the schema to eliminate locking - requires us to understand
*when* Titan tries to lock. I find this difficult to pin down currently (for example, it
doesn't seem to happen just for enforcing uniqueness constraints). I will try to get an answer
to this, but it could take a while - any input others have will help here, of course.
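
For what it's worth, the one trigger I can clearly point to in the eventual-consistency doc
linked above is a schema element whose consistency is set to {{ConsistencyModifier.LOCK}} -
for example, a unique composite index. A sketch with made-up names (not Atlas's actual type
definitions):

{code:java}
// Example (per the Titan 0.5.4 docs) of a schema definition that makes Titan take locks:
// a unique composite index whose consistency is set to ConsistencyModifier.LOCK.
import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.ConsistencyModifier;
import com.thinkaurelius.titan.core.schema.TitanGraphIndex;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import com.tinkerpop.blueprints.Vertex;

public class LockingSchemaExample {
    static void defineUniqueProperty(TitanGraph graph) {
        TitanManagement mgmt = graph.getManagementSystem();
        PropertyKey name = mgmt.makePropertyKey("exampleQualifiedName")  // hypothetical property
                .dataType(String.class).make();
        TitanGraphIndex byName = mgmt.buildIndex("byExampleQualifiedName", Vertex.class)
                .addKey(name).unique().buildCompositeIndex();
        // On an eventually consistent store, this is what causes Titan to acquire a lock
        // (in the graph index) at commit time when two transactions write the same value.
        mgmt.setConsistency(byName, ConsistencyModifier.LOCK);
        mgmt.commit();
    }
}
{code}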

To move forward, I am thinking of implementing the safer second option of retries, while I
continue to investigate whether eliminating locking is possible from a model perspective. Any
other ideas are welcome - just please keep the short-term perspective in mind.

> Not all Hive tables are not imported into Atlas when interrupted with search queries while importing.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-503
>                 URL: https://issues.apache.org/jira/browse/ATLAS-503
>             Project: Atlas
>          Issue Type: Bug
>            Reporter: Sharmadha Sainath
>            Assignee: Hemanth Yamijala
>            Priority: Critical
>             Fix For: 0.7-incubating
>
>         Attachments: hiv2atlaslogs.rtf
>
>
> On running a file containing 100 table creation commands using beeline -f, all Hive tables are created. But only 81 of them are imported into Atlas (HiveHook enabled) when queries like "hive_table" are searched frequently while the import process for the tables is going on.
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
