atlas-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Graham Wallis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ATLAS-1868) Highly inefficient DSL-queries
Date Wed, 14 Jun 2017 13:26:00 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049176#comment-16049176
] 

Graham Wallis commented on ATLAS-1868:
--------------------------------------

Hi Christian,

This is a really interesting observation. It clearly makes sense to start the traversal at
the indexed vertex type.

I decided to experiment with Atlas 0.9 and then did some experimentation with Titan 1.0 (directly,
without Atlas). I then experimented with declarative-style gremlin queries and finally explored
the Titan index APIs to see if I could think of a way to implement an automated optimization.
The results were as follows:

h2. Experiment with Atlas 0.9
I tried the DSL query below with Atlas 0.9 and it produced the following gremlin - which I
think is similar in structure to Christian's example (from Atlas 0.7). It uses a local variable
to collate results but it still starts the traversal at the non-indexed end of the edge, instead
of reversing it.

DSL query: 
Table where columns.name="customer_id"

Translated Gremlin Query: 
{
  def r=(([]) as Set);
  def f1={
    GremlinPipeline x->x.as('a0').out('__Table.columns').has('Column.name',T.'eq','customer_id').
      back('a0').as('__res') [0..<25].select(['a0', '__res']).fill(r)
    };
  f1(g.V().has('__typeName','Table'));
  f1(g.V().has('__superTypeNames','Table'));
  r._().as('__tmp').
    transform({((Row)it).getColumn('a0')}).as('a0').back('__tmp').
    transform({((Row)it).getColumn('__res')}).as('__res') [0..<25].toList()
}

This translation was on Atlas 0.9 using Titan 0.5.4. I did not try it with Titan 1.0 because
there are parts of Atlas that either do not support Java 8 yet or have hard dependencies on
Tinkerpop 2.

h2. Experiments directly on Titan 1.0
I decided to do some exploration on Titan 1.0 - because I wanted to see if Titan's gremlin
optimization would handle this automatically. The answer is it doesn't seem to. I populated
a Titan 1.0 graph with 100K 'Person' vertices that each have one 'owns' edge to one of 100K
'Asset' vertices. i.e. there are 100K triples of Person->owns->Asset. I created a composite
index on Asset.assetid, with indexOnly() of vertices with label 'Asset' on the basis that
different vertex types may have properties with similar keys, so we would probably need a
label-specific index).

conf=new BaseConfiguration()
graph = TitanFactory.open('../conf/titan-berkeleyje.properties')

mgmt=graph.openManagement()
assetid = mgmt.makePropertyKey('assetid').dataType(String.class).make()
asset=mgmt.makeVertexLabel('Asset').make()
mgmt.buildIndex('assetById',Vertex.class).addKey(assetid).indexOnly(asset).buildCompositeIndex()
mgmt.commit()


for (i in 1..100000) {
  p=graph.addVertex(label,'Person','name',i)
  a=graph.addVertex(label,'Asset','assetid','asset-num-'+i)
  p.addEdge('owns',a)
}
graph.tx().commit()

I could then issue queries like the following:

// 'Forward' query
t=System.currentTimeMillis();g.V().hasLabel('Person').out('owns').hasLabel('Asset').has('assetid','asset-num-7').iterate();System.currentTimeMillis()-t

// 'Reversed' query
t=System.currentTimeMillis();g.V().hasLabel('Asset').has('assetid','asset-num-7').in('owns').hasLabel('Person').iterate();System.currentTimeMillis()-t


Results for DSL-style gremlin versus reversed traversal: 2773ms versus 2ms, i.e. ~ x1300 faster
when reversed. This result compares closely with Christian's results in which he saw ~ x1100
faster performance.

h2. Experiments with declarative queries
I experimented to see if I could get Titan's gremlin optimizer to translate the query into
the obviously more efficient traversal, following Christian's example. Hoewever, even a declarative
query, with the sub-traversals in any order, is not optimized:

gremlin> g.V().match(
......1> __.as('b').hasLabel('Asset').has('assetid','asset-num-38'),
......2>  __.as('a').hasLabel('Person'),
......3> __.as('a').out('owns').as('b')).
......4>  select('a','b').by('name').by('assetid')
04:20:08,513  WARN StandardTitanTx:1262 - Query requires iterating over all vertices [(~label
= Person)]. For better performance, use indexes
==>[a:38,b:asset-num-38]

You have to explicitly reverse the edge traversal to get good performance, as in the following:

gremlin> g.V().match(
......1> __.as('b').hasLabel('Asset').has('assetid','asset-num-38'),
......2>  __.as('a').hasLabel('Person'),
......3> __.as('b').in('owns').as('a')).
......4>  select('a','b').by('name').by('assetid')
==>[a:38,b:asset-num-38]


h2. Exploration of Titan's index APIs
If Titan cannot perform this optimization, it looks as if Atlas would have to implement it,
so I looked at Titan's TitanGraphIndex APIs to see how much Atlas would be able to derive
generically, but I'm currently of the opinion that the Titan APIs do not provide enough to
support a fully programmatic query optimization in Atlas. The alternative appears to be Atlas
capitalizing on its own knowledge of the graph schema, indexes and stats. I have not explored
that area yet.



> Highly inefficient DSL-queries
> ------------------------------
>
>                 Key: ATLAS-1868
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1868
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 0.7-incubating
>         Environment: linux, hbase + solr configuration.
>            Reporter: Christian R
>              Labels: dsl, gremlin
>
> The DSL query 'mytype where property.id = "id1"' appears to be rewritten as a gremlin
query that resembles:
> g.V.has(typename, 'mytype'ยจ).as(x).out('property').has('id', 'id1').back('x')
> On our system this query takes 6-7 minutes. The query
> g.V.has('id', 'id1').in('property').has('typename', 'mytype')
> takes 350 milliseconds.
> Our graph:
> g.V.count() = 1359151
> We have atlas 0.7 installed. I've compiled the latest 0.9 code and looked at the generated
gremlin query as reported in the logs for the same DSL-query, and I think 0.9 has the same
performance issues. Unfortunately I don't have a big graph on a 0.9 installation to test performance.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message