jackrabbit-oak-dev mailing list archives

From Lukas Kahwe Smith <...@pooteeweet.org>
Subject full text search improvements
Date Sat, 24 Mar 2012 14:12:48 GMT
Hi,

I am not a Jackrabbit developer but a very interested user and co-lead of the PHPCR [1] initiative.
I wanted to expand a bit on what Ard said about potentially looking into hooking in Solr/ElasticSearch
[2], and also on some other issues I see with full text search in Jackrabbit 2.x.

1) scaling

First up, I am overall quite happy with the scalability of Jackrabbit 2.x.
There are two places, though, where at some point we need to support sharding:
the persistence manager (which seems to be covered in the current Oak plans) and the
Lucene index (which doesn't seem to be covered). Imho there are already two perfectly fine
projects working on this: Solr (the more natural choice since it's also an Apache project)
and ElasticSearch (imho it provides a much better API).

Moreover, (optionally) leveraging these has several other advantages:
- mature products (ElasticSearch in particular is very mature when it comes to sharding); supporting
them might also attract new users to Jackrabbit
- handle much larger data sets via sharding
- provide many more full text search specific features
- less pressure on Jackrabbit to support these features [3] [4]
- as both are Lucene-based, the amount of code needed (for example to convert QOM to
Solr/ElasticSearch) will be minimal
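
To give a rough idea of what such a QOM conversion could look like, here is a minimal sketch.
The classes FullTextSearch, Comparison and And are hypothetical stand-ins for JCR QOM
constraints (not the real javax.jcr.query.qom interfaces), only equality comparisons are
handled, and the output is a plain dict shaped like an ElasticSearch query DSL body
("match", "term", "bool"/"must"). It is an illustration, not a proposed implementation.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for QOM constraint types (assumption, for illustration only).
@dataclass
class FullTextSearch:
    property_name: str
    expression: str


@dataclass
class Comparison:
    property_name: str
    value: str  # only equality is sketched here


@dataclass
class And:
    left: object
    right: object


def to_query_dsl(constraint):
    """Translate a constraint tree into an ElasticSearch-style query dict."""
    if isinstance(constraint, FullTextSearch):
        return {"match": {constraint.property_name: constraint.expression}}
    if isinstance(constraint, Comparison):
        return {"term": {constraint.property_name: constraint.value}}
    if isinstance(constraint, And):
        return {
            "bool": {
                "must": [
                    to_query_dsl(constraint.left),
                    to_query_dsl(constraint.right),
                ]
            }
        }
    raise TypeError(f"unsupported constraint: {constraint!r}")


query = to_query_dsl(
    And(FullTextSearch("title", "oak"), Comparison("jcr:primaryType", "nt:unstructured"))
)
```

A real mapping would also have to cover selectors, joins, ordering and the remaining operators.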

---

2) faceting

I mentioned faceting [4] above. Right now Jackrabbit does not even support COUNT() [5],
which I find very painful and a major oversight. But really, what people have come to expect
from full text search is faceting. Imho it's so important that it should even be part of JCR
2.1 [6], and as you can see in this link, it seems that the HippoCMS developers agree that it's a
very useful feature to have inside Jackrabbit.
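
For readers unfamiliar with the term, a minimal sketch of what a facet is: for a given
property, the number of matching results per distinct value. The rows and property names
below are made-up sample data, and doing this client side is exactly the expensive
workaround that native support would avoid.

```python
from collections import Counter


def facet_counts(rows, prop):
    """Count how many result rows carry each value of `prop`.

    Multi-valued properties contribute one count per distinct value per row.
    """
    counts = Counter()
    for row in rows:
        value = row.get(prop)
        if value is None:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        counts.update(set(values))
    return dict(counts)


# Made-up sample result rows (assumption, for illustration only).
rows = [
    {"path": "/news/a", "tags": ["php", "jcr"]},
    {"path": "/news/b", "tags": ["jcr"]},
    {"path": "/news/c", "tags": ["php", "jcr", "solr"]},
]

total = len(rows)                    # what COUNT() would report
facets = facet_counts(rows, "tags")  # {"php": 2, "jcr": 3, "solr": 1}
```

Without COUNT() or facets in the query engine, every matching row has to be fetched just
to compute these numbers.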

---

3) "cleaner" data in results

This is actually a fairly trivial issue, but with severe implications for scalability. As Ard
explained, in many cases "a document" will span many nodes. When dealing with such a "document"
(especially when building overview pages for a collection of documents) it's not always necessary
to get the entire tree of nodes. All that is needed are some fields. For this the full text
search API could provide a much faster retrieval mechanism. However, we have found that the
data stored inside the Lucene index is not the original data. It probably makes sense to only
store the tokenized version to limit the impact of the issue noted in 1), but the fact that
the same separator is used for spaces and multi value fields [7] makes it needlessly hard
in many cases to simply leverage the full text search API to fetch subsets of data from a
tree of nodes.
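
To illustrate the separator problem, here is a minimal sketch. It is not Jackrabbit's
actual indexing code; the function name, the sample values and the alternative separator
are assumptions chosen for illustration.

```python
def flatten_for_index(values, separator=" "):
    """Join the values of a multi value property into a single stored string."""
    return separator.join(values)


authors = ["Jane Doe", "John Smith"]

# Same separator as the spaces inside the values: the boundaries are lost.
flat = flatten_for_index(authors)
recovered = flat.split(" ")  # ['Jane', 'Doe', 'John', 'Smith']

# A separator that cannot occur in the values (here the ASCII unit separator)
# keeps the round trip lossless.
UNIT_SEPARATOR = "\x1f"
lossless = flatten_for_index(authors, UNIT_SEPARATOR).split(UNIT_SEPARATOR)
```

With the shared separator there is no way to tell from the stored string whether the
property had two values or four.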

---

4) cover more SQL2 functions

This is a comparatively minor topic and might just be beyond the scope of this mailing list,
which seems to be more about designing the future architecture than "minor" feature requests.
But it would be great to also support PATH(), DEPTH() etc. [8].
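
As a small illustration of what DEPTH() would expose, here is a sketch that derives the
depth from an absolute path, using the JCR convention that the root node has depth 0.
The helper name is hypothetical and the sketch ignores same-name sibling indexes.

```python
def depth(path):
    """Return the depth of an absolute node path ("/" has depth 0)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    return len([segment for segment in path.split("/") if segment])


root_depth = depth("/")                 # 0
article_depth = depth("/content/news")  # 2
```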

---

One last comment: I hope that all of you see the potential for growing Jackrabbit's user
base through the existence of PHPCR. Suddenly it becomes a highly scalable database for the entire
PHP CMS community. As a matter of fact, at DrupalCon Denver this week Drupal tentatively agreed
to migrate their storage API to PHPCR. This doesn't necessarily need to be limited to PHP
either; PHPCR just proved that JCR isn't as language specific as many proponents of CMIS make
it out to be. Heck, there is even someone who started to port JCR to Node.js [9] (well, it's
not very active, but hey).

My point here is: when thinking about Oak, please also think about the performance for users
talking to Jackrabbit via HTTP. The PHPCR team has done its best in trying to solve quite
a few performance issues with the current HTTP API, but it would be great if this were really
on everyone's mind.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

[1] http://phpcr.github.com
[2] http://www.mail-archive.com/oak-dev@jackrabbit.apache.org/msg00337.html
[3] https://issues.apache.org/jira/browse/JCR-3204
[4] https://issues.apache.org/jira/browse/JCR-3134
[5] https://issues.apache.org/jira/browse/JCR-2605
[6] http://java.net/projects/jsr-333/lists/dev/archive/2011-12/message/3
[7] https://issues.apache.org/jira/browse/JCR-3028
[8] https://issues.apache.org/jira/browse/JCR-3145
[9] https://github.com/NoCR/NoCR