jackrabbit-users mailing list archives

From Enrique Medina Montenegro <e.medin...@gmail.com>
Subject Re: Jackrabbit & Performance
Date Wed, 20 Nov 2013 18:39:34 GMT
Thanks for the feedback, Peter. Much appreciated.

Over the past few days I've also tried to segment the "marks" node into a
deep tree structure by taking the ID in groups of 3 digits. For example, as
my IDs have 9 digits, I can take the first 3 digits for the first level in
the tree, then the next 3 digits for the second level, and then the last 3
digits for the last level where the "mark" would actually be saved (as the
leaf). An example is worth a thousand words:

mark --> ID = 003672897

JCR --> root (node) --> marks (node) --> 003 (node) --> 672 (node) -->
003672897 (node)
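
As an illustration only, the mapping from an ID to its location in the tree
could be sketched in plain Java as below. The class and method names
(MarkPaths, pathFor) are hypothetical and not part of Jackrabbit, and the
sketch assumes, following the example above, that the leaf node is named
with the full 9-digit ID. It only builds the path string; it does not touch
the repository.

```java
public final class MarkPaths {

    private MarkPaths() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Builds the bucketed path for a 9-digit mark ID:
     * first 3 digits, then next 3 digits, then the full ID as the leaf.
     */
    public static String pathFor(String id) {
        if (id == null || !id.matches("\\d{9}")) {
            throw new IllegalArgumentException("Expected a 9-digit ID, got: " + id);
        }
        String firstLevel = id.substring(0, 3);   // e.g. "003"
        String secondLevel = id.substring(3, 6);  // e.g. "672"
        return "/marks/" + firstLevel + "/" + secondLevel + "/" + id;
    }

    public static void main(String[] args) {
        System.out.println(pathFor("003672897"));
    }
}
```

With 3 digits per level, each intermediate node has at most 1000 children,
which stays well below the ~10K child nodes figure discussed in this thread.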

This approach is valid in theory, but in practice, when I dump the 1M marks
from the DB into JCR, each and every "mark" requires a lookup of the path in
the tree where it will ultimately be stored, and this lookup starts to take
on the order of seconds as the tree structure grows, making the full
extraction process from the DB too slow for our requirements.

That's why I still need to stick to the "flat" structure, taking advantage
of the Lucene index, while still being worried about the use of the
deprecated API, as I mentioned in my previous email.

Salu2,
Quique.


On Wed, Nov 20, 2013 at 7:27 PM, Peter Harrison <
peter.harrison@team.orcon.net.nz> wrote:

> I am by no means an expert, but I have been developing for three or four
> months with JackRabbit. The approach I've taken is not to include the base
> records under one node.
>
> For example, you may have classes of patent, such as medical, chemical
> process etc, and so you could break down the mark into subnodes for each
> class of patent. Finding a particular mark by its ID is still quite easy,
> but not as trivial as simply having a path like /mark/<patentid>.
>
> I have put a REST interface in front of JackRabbit that handles simple IDs
> - running the appropriate query, and then returning the object which
> contains the full path.
>
> This idea - that the path itself contains information about a node - takes
> a little getting used to, but it allows you to do some very quick reporting
> on specific classes, as searches can be scoped to specific trees.
>
> What I'm learning is that JackRabbit isn't just another kind of DB - so
> you should not treat it as just another kind of flat table. You should be
> creating a deep tree structure rather than a shallow structure. Doing this
> allows you to utilise the path to limit the scope of queries.
>
> PS: I have also modified the Java OCM to allow lists of primitives to be
> stored as properties of a single subnode. I've been making changes to OCM
> on my local system, but am not really sure how to contribute back.
>
>
> On 20/11/13 23:39, Enrique Medina Montenegro wrote:
>
>> Hi list,
>>
>>
>> I’ve been evaluating Jackrabbit for several weeks, performing all sorts of
>> performance testing due to the nature of the repository we need to create
>> here at OHIM. Not sure if you’re aware of us, but we’re the European Office
>> where you have to come to protect the intellectual property of your marks
>> and designs in the whole European Community. Currently, we are storing all
>> our marks and designs information in a relational DB, and besides serious
>> performance issues (it’s an old DB, not Oracle unfortunately) we don’t have
>> functionality such as versioning or observation, and the fact that our
>> information is perfectly suitable to be modelled into an XML document, led
>> us to think about storing it in a JCR repository.
>>
>>
>>
>> I went through David’s model and decided to create a single node called
>> “marks” and then add one child node for each existing mark in our system
>> (~1 million marks where each mark would have ~50 versions/revisions), but
>> then I found that adding more than 10K child nodes could lead to potential
>> performance issues. However, after some testing, I also found that indexing
>> the mark nodes allowed us to query them extremely fast using SQL2, so we
>> could overcome the issue with the 10K child nodes.
>>
>>
>>
>> For example, instead of doing --> session.getNode("/marks/000345123") <-- we
>> could query --> SELECT * FROM [iptool:markType] WHERE [iptool:id] =
>> '000345123' <-- (notice that we defined our own custom node types and also
>> told Lucene just to index the [iptool:id] property through the use of the
>> IndexConfiguration configuration).
>>
>>
>>
>> Everything was then progressing smoothly, but then we realized that in
>> order to fetch a specific version or even the base version of a particular
>> mark, the API recommended using the VersionManager:
>>
>>
>>
>> VersionHistory history = session.getWorkspace().getVersionManager()
>>         .getVersionHistory(markNode.getPath());
>>
>>
>>
>> Unfortunately, this API makes use of the direct path access to the node
>> being versioned, which in our case was killing our performance due to the
>> 10K child nodes limitation (sort of). Although there’s the possibility to
>> access the versions directly from the node itself using
>> --> markNode.getBaseVersion() or markNode.getVersionHistory() <--
>> these methods are deprecated and we are not quite sure whether they will
>> be removed in the near future or left there as an alternative way of
>> retrieving the version history from a node.
>>
>>
>>
>> Therefore, could I possibly get some answers from you to help us out in
>> making our final decision on whether to use Jackrabbit as our official JCR
>> repository implementation?
>>
>>
>>
>> *  Is the direct retrieval of the version history through the node itself
>> (now deprecated) going to be eventually removed or not? If so, when is it
>> planned to be removed? If not, will it be kept as a “valid” alternative to
>> the current VersionManager approach?
>>
>> *  Using the Lucene indexes gives very fast read times (on the order of
>> tens of ms), but do you foresee other hidden issues or side effects of
>> maintaining ~1M child nodes underneath the same parent “marks” node?
>>
>> *  We also played around with the BTreeManager, but we couldn’t make it
>> work with custom node types. I even posted this issue on the users mailing
>> list, but so far I haven’t got any response:
>>
>> http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E
>>
>>
>> Thanks so much in advance for helping us out to choose Jackrabbit as our
>> JCR technology, hopefully!!! :-)
>>
>
