jackrabbit-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Damiano Albani <damiano.alb...@gmail.com>
Subject [Oak] Property index doesn't seem to be used in JOIN query
Date Wed, 23 May 2018 12:29:28 GMT
Hello,

I'm totally new to Apache Jackrabbit/Oak, so I hope my question hasn't been
answered previously.
In order to learn how it works, I am trying to set up a simple JCR
repository which can store information about, let's say, books, to be
combined with a list of users who can mark books as read or starred.
The goal is then to be able to let a user list the books:
 - she hasn't already read
 - she has previously starred

Here's the model I first came up with:

/users
  /user1
    name (STRING)
/books
  /book1
    title (STRING)
    readBy (REFERENCE) MULTIPLE
    starredBy (REFERENCE) MULTIPLE

The "readBy" and "starredBy" properties would then contain references to
the user node UUID.

Query for unread books:
  SELECT book.*
  FROM [book:book] AS book
  WHERE book.[book:readBy] <> '<user1-uuid>'

Query for starred books:
  SELECT book.*
  FROM [book:book] AS book
  WHERE book.[book:starredBy] = '<user1-uuid>'


I don't really like mixing book information and user information on the
same node, so I thought of this 2nd model:

/users
  /user1
    name (STRING)
/books
  /book1
    title (STRING)
    /userStatus
      readBy (REFERENCE) MULTIPLE
      starredBy (REFERENCE) MULTIPLE

That means introducing a JOIN in the queries.

Query for unread books:
  SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] <> '<user1-uuid>'

Query for starred books:
  SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE book.[book:starredBy] = '<user1-uuid>'


I then thought of a 3rd solution, using a LEFT OUTER JOIN for queries:

/users
  /user1
    name (STRING)
    /userStatuses
      /userStatus1
        book (REFERENCE)
        read (BOOLEAN)
        starred (BOOLEAN)
/books
  /book1
    title (STRING)

Query for unread books:
  SELECT book.*, userStatus.*
  FROM [book:book] AS book
  LEFT OUTER JOIN [book:userStatus] AS userStatus
  ON userStatus.[book:book] = book.[jcr:uuid]
  WHERE (ISCHILDNODE(userStatus, '/users/$userUuid/userStatuses') AND
userStatus.[book:read] = FALSE) OR userStatus.[book:read] IS NULL

Query for starred books:
  SELECT book.*, userStatus.*
  FROM [book:book] AS book
  LEFT OUTER JOIN [book:userStatus] AS userStatus
  ON userStatus.[book:book] = book.[jcr:uuid]
  WHERE ISCHILDNODE(userStatus, '/users/$userUuid/userStatuses') AND
userStatus.[book:starred] = TRUE

But this model introduces an additional complexity due to userStatus nodes
being present in the tree only when a book is first marked as read or
starred by the user.


I have implemented model #2 with Oak 1.9.1 (i.e. unstable version, but I
thought that it was applicable for an experiment).
My goal is support 100k+ books and several hundred users, so I looked at
indexes in Oak.
Lucene indexes appeared more complicated to set up that property indexes,
and the synchronous aspect of the property indexes appealed to me actually.

Here's the Kotlin pseudocode that I used to initialize the repository:

  @Bean
  fun contentRepository() : ContentRepository {
    val mongoDocumentNodeStoreBuilder =
MongoDocumentNodeStoreBuilder.newMongoDocumentNodeStoreBuilder()
            .setMongoDB("mongodb://localhost:27017/", "books", 16)

    return Jcr(Oak(mongoDocumentNodeStoreBuilder.build())
            .with(RepositoryInitializer {
                val oakIndexNodeBuilder = IndexUtils.getOrCreateOakIndex(it)

                run {
                    val indexName = "book-userStatus_book-readBy"
                    if (!oakIndexNodeBuilder.hasChildNode(indexName)) {

IndexUtils.createIndexDefinition(oakIndexNodeBuilder, indexName, true,
false,
                                  listOf("book:readBy"),
listOf("book:userStatus"))
                    }
                }

                it.nodeState
            })).createRepository()
  }

I see that Oak is building the index at startup, traversing almost 2
million nodes in my test case (the actual model is more extended than what
I presented here).
But when I run a query, it doesn't seem to use the index, given how long it
takes (and that it eventually times out):

  2018-05-23 13:59:35.656 DEBUG 12937 --- [nio-8080-exec-8]
o.a.j.oak.query.QueryEngineImpl          : Parsing JCR-SQL2 statement:
SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc'
  2018-05-23 13:59:35.657 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost using filter
Filter(query=SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc',
path=*)
  2018-05-23 13:59:35.657 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for reference is Infinity
  2018-05-23 13:59:35.657 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for property is Infinity
  2018-05-23 13:59:35.658 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for nodeType is 204704.0
  2018-05-23 13:59:35.658 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for traverse is 1.0E8
  2018-05-23 13:59:35.658 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost using filter
Filter(query=SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc',
path=*, property=[book:readBy=[60b02015-f09b-42a0-8a6a-d47d40194cfc]])
  2018-05-23 13:59:35.658 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for reference is Infinity
  2018-05-23 13:59:35.663 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for property is 3.0
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for nodeType is 102304.0
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for traverse is 1.0E8
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost using filter
Filter(query=SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc',
path=* && //parent/of/join)
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for reference is Infinity
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for property is Infinity
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for nodeType is 204704.0
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : cost for traverse is 1.0E8
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : query execute SELECT book.*,
userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc'
  2018-05-23 13:59:35.664 DEBUG 12937 --- [nio-8080-exec-8]
o.apache.jackrabbit.oak.query.QueryImpl  : query plan [book:userStatus] as
[userStatus] /* property book-userStatus_book-readBy =
60b02015-f09b-42a0-8a6a-d47d40194cfc where [userStatus].[book:readBy] =
'60b02015-f09b-42a0-8a6a-d47d40194cfc' */ inner join [book:book] as [book]
/* nodeType Filter(query=SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc',
path=* && //parent/of/join) */ on ischildnode([userStatus], [book])
  2018-05-23 13:59:39.563  WARN 12937 --- [nio-8080-exec-8]
a.j.o.p.i.p.s.ContentMirrorStoreStrategy : Index-Traversed 10000 nodes
(2285 index entries) using index jcr:primaryType with filter
Filter(query=SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc',
path=/books/book1)
  2018-05-23 13:59:43.467  WARN 12937 --- [nio-8080-exec-8]
a.j.o.p.i.p.s.ContentMirrorStoreStrategy : Index-Traversed 20000 nodes
(4259 index entries) using index jcr:primaryType with filter
Filter(query=SELECT book.*, userStatus.*
  FROM [book:book] AS book
  INNER JOIN [book:userStatus] AS userStatus
  ON ISCHILDNODE(userStatus, book)
  WHERE userStatus.[book:readBy] = '60b02015-f09b-42a0-8a6a-d47d40194cfc',
path=/books/book1)
  ...

So what am I doing wrong here?
Did I miss something obvious (in relation with JOINs in particular)?
Should I look into Lucene indexes? What will be the benefits / drawbacks in
my use case?
And what about my model by the way, does it make any sense to organize the
repository as I did?

Thanks a lot for your help!

-- 
Damiano Albani

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message