jackrabbit-users mailing list archives

From Enrique Medina Montenegro <e.medin...@gmail.com>
Subject Re: Jackrabbit's reliability and performance
Date Thu, 14 Nov 2013 14:28:59 GMT
Hi Tarun,

Let me share my findings with you :-)

At my work we are evaluating Jackrabbit to build a JCR repository to store
the register of marks (intellectual property) as documents composed
basically of an ID, some metadata (who created it, when, etc.) and the XML
and JSON representations of the mark itself. Currently, all that
information is spread across several relational DBs, and we would like to
take advantage of the versioning and observation features of the JCR
repository.

During our initial evaluation, mostly focused on performance, we noticed
serious issues when adding the 1 million marks we currently have in our DBs
under the same "parent" node. It turned out this is a known limitation of
Jackrabbit, whose documentation clearly states that no more than 10K child
nodes should be added to the same "parent" node:

http://wiki.apache.org/jackrabbit/Performance

However, we were still more or less forced to follow that path, because we
were required to perform an initial dump of all the data in the DBs, and
simply adding each mark as a sub-node of the same parent proved to be the
fastest way to export all the data within an acceptable time window.
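
To make that concrete, here is a rough sketch of what our flat import loop
looks like (just a sketch; the node/property names and the batch size are
illustrative, not our actual code):

import javax.jcr.Node;
import javax.jcr.Session;

public class FlatImport {

    private static final int BATCH_SIZE = 1000; // keep the transient space small

    // marks: each record is { id, xml, json }
    public static void importMarks(Session session, Iterable<String[]> marks) throws Exception {
        Node parent = session.getNode("/marks");
        int count = 0;
        for (String[] mark : marks) {
            Node node = parent.addNode(mark[0], "markType");
            node.setProperty("id", mark[0]);
            node.setProperty("xml", mark[1]);
            node.setProperty("json", mark[2]);
            if (++count % BATCH_SIZE == 0) {
                session.save(); // persist the batch and flush pending changes
            }
        }
        session.save(); // persist the last partial batch
    }
}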

Nevertheless, we also tried sharding the nodes as a tree, basically
splitting the 9-digit ID of our marks into 3-digit groups, so that each
node could have at most 1K sub-nodes. For example, the mark with ID =
000342865 would be saved under root (node) -> marks (node) -> 000 (node) ->
342 (node) -> 000342865 (node). Theoretically, this performs much better
than our original approach, but as a downside it dramatically slowed down
the export of the 1M marks from the DBs, pushing us further outside our
acceptable time window: for each mark, the import first had to look up the
exact node under which to store it, and the bigger the JCR repository grew,
the slower those node lookups became, which impacted the overall export
process.
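
For reference, a rough sketch of how the sharded path is built (names are
illustrative; the leaf keeps the full ID, as in the example above):

import javax.jcr.Node;
import javax.jcr.Session;

public class ShardedStore {

    // 000342865 -> /marks/000/342/000342865
    public static Node storeMark(Session session, String id) throws Exception {
        Node current = session.getNode("/marks");
        for (String group : new String[] { id.substring(0, 3), id.substring(3, 6) }) {
            // descend through the first two 3-digit groups, creating them if needed
            current = current.hasNode(group)
                    ? current.getNode(group)
                    : current.addNode(group, "nt:unstructured");
        }
        Node mark = current.addNode(id, "markType");
        mark.setProperty("id", id);
        session.save();
        return mark;
    }
}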

We also took a look at the BTreeManager, but we just couldn't make it work
due to the issue I describe here (which BTW has not been answered yet):

http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E

So getting back to the original approach of storing everything under the
same node, how did we manage to get acceptable read times? It boils down to
using Lucene indexing (configured to index only the "id" property, and not
all the XML and JSON content, via the IndexingConfiguration in the Search
section of the repository config file) to actually perform the
search/retrieval of marks. So, for instance, instead of:

session.getNode("/marks/000342865") --> takes ~2.4 seconds with 1M marks
under the same node

we run this query with SQL2:

SELECT * FROM markType WHERE id = '000342865' --> takes tens of ms with 1M
marks under the same node thanks to Lucene's indexes

(notice that "markType" is a custom node type that we have created to model
our domain, in this case the marks)
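
In case it helps, this is roughly how we issue that SQL2 lookup through the
standard JCR query API (simplified sketch; in real code the ID should be
bound or escaped rather than concatenated):

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class MarkLookup {

    public static Node findMark(Session session, String id) throws Exception {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "SELECT * FROM [markType] WHERE id = '" + id + "'",
                Query.JCR_SQL2);
        NodeIterator nodes = query.execute().getNodes();
        // resolved through the Lucene index on "id", not by path traversal
        return nodes.hasNext() ? nodes.nextNode() : null;
    }
}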

LESSONS LEARNED: you need to clearly define the scope of your project in
terms of the functionality you want to use from Jackrabbit, and then plan
detailed performance workshops to prove your approach. There are always
trade-offs. For instance, in my case, when I want to get a specific version
of a mark I cannot use the "official" API through VersionManager, because
it resolves the node by its direct path before fetching the revision:

session.getWorkspace().getVersionManager().getVersionHistory("/marks/000342865").getVersionByLabel("v.6.0")

Instead, I have to use the "deprecated" API method on the node itself, once
I've fetched it with the SQL2 statement mentioned above:

markNode.getVersionHistory().getVersionByLabel("v.6.0")

with the uncertainty of when that deprecated API will be removed...
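
For completeness, the two variants side by side (same path, label and node
as in the example above; just a sketch):

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.version.Version;
import javax.jcr.version.VersionManager;

public class VersionLookup {

    // "Official" API: needs the absolute path, which is slow with 1M siblings.
    public static Version byPath(Session session) throws Exception {
        VersionManager vm = session.getWorkspace().getVersionManager();
        return vm.getVersionHistory("/marks/000342865").getVersionByLabel("v.6.0");
    }

    // Deprecated API: works directly on the node already fetched via the SQL2 query.
    @SuppressWarnings("deprecation")
    public static Version byNode(Node markNode) throws Exception {
        return markNode.getVersionHistory().getVersionByLabel("v.6.0");
    }
}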

Please share your findings in the list as you make progress :-)

Regards,
Enrique Medina.


On Thu, Nov 14, 2013 at 10:40 AM, Tarun Dogra <Tarun.Dogra@orioncro.com> wrote:

> Respected Sir/Madam,
>
> In the next couple of months, we (ORION Clinical Services Ltd., UK) are
> about to release a clinical trial management system as a product to be used
> in-house by all our employees. We have bought this product off the shelf
> from a third-party vendor. As suggested by our vendor, we would implement
> Jackrabbit as the central repository system within this main product. But
> we are still not sure whether Jackrabbit is an ideal solution to integrate
> with our product, and this is where we need your help; we would appreciate
> it if you could share your expertise.
>
> Just to give you an overview of our organisation, we will have around 7500
> documents (each approximately 250 KB on average) per "study" within our
> clinical trial management framework. We usually take on around 7-8 such
> studies per year. So, on the basis of 8 studies per year, the total size of
> all the documents will grow by roughly 7500 x 250 KB x 8 ≈ 15 GB per year.
> So we just wanted to ask you a couple of things:
>
> 1. Is Jackrabbit reliable enough as a system to cater to our
> above-mentioned needs?
>
> 2. Will the management of so many documents have any adverse effect on
> Jackrabbit's performance, considering that Jackrabbit will reside on one of
> our own hosted servers with the following spec:
>
> Poweredge R710
>
> CPU: 2 x Intel X5550
>
> Memory: 16GB
>
> Operating System: Windows 2008 R2 64bit SP1
>
> Disk capacity: C: 142 GB and D: 1.22 TB
>
>
> Sorry if you are not the correct department to consult regarding our
> above-mentioned concern; if that is the case, it would be much appreciated
> if you could direct us to the right department/person. Many thanks.
>
> Look forward to hearing from you.
>
> Regards,
> Tarun
>
>
