jackrabbit-users mailing list archives

From Santiago Gala <santiago.g...@gmail.com>
Subject Re: Jackrabbit & Performance
Date Wed, 20 Nov 2013 19:00:58 GMT
On Wed, Nov 20, 2013 at 7:39 PM, Enrique Medina Montenegro <
e.medina.m@gmail.com> wrote:

> Thanks for the feedback, Peter. Much appreciated.
>
> During these days I've also tried to segment the "marks" node into a deep
> tree structure by taking the ID in groups of 3 digits. So for example, as
> my IDs have 9 numbers, I can take the first 3 digits for the first level in
> the tree, then the next 3 digits for the second level, and then the last 3
> digits for the last level where the "mark" would actually be saved (as the
> leaf). An example is worth a thousand words:
>
>
Depending on your access patterns, you might also use the date, as in
/YYYY/MM/DD/003672897, for segmentation.


> mark --> ID = 003672897
>
> JCR --> root (node) --> marks (node) --> 003 (node) --> 672 (node) -->
> 003672897 (node)
>
> This is a valid approach in theory, but in practice, when I dump the 1M
> marks from the DB into JCR, for each and every "mark" it has to look up the
> path in the tree where the "mark" will ultimately be stored, and this lookup
> starts to take on the order of seconds as the tree structure grows, making
> the full extraction process from the DB too slow for our requirements.
>
>
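
The two segmentation schemes discussed above (ID digits in groups of three,
or a date prefix) can be sketched in plain Java. This is only an
illustration: the class and method names are hypothetical, the layout follows
the quoted example (/marks/003/672/003672897, with the full ID as the leaf
name), and the date-based variant assumes each mark has some date to
segment on.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that computes a segmented JCR path from a 9-digit mark ID. */
final class MarkPaths {
    // Renders a date as YYYY/MM/DD, e.g. 2013/11/20.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuu/MM/dd");

    private MarkPaths() {}

    /** First three digits, next three digits, then the full ID as the leaf node name. */
    static String byId(String id) {
        requireNineDigits(id);
        return "/marks/" + id.substring(0, 3) + "/" + id.substring(3, 6) + "/" + id;
    }

    /** Date-based variant: /YYYY/MM/DD/<id>. */
    static String byDate(LocalDate date, String id) {
        requireNineDigits(id);
        return "/" + date.format(DAY) + "/" + id;
    }

    // Rejects anything that is not exactly nine ASCII digits.
    private static void requireNineDigits(String id) {
        if (id == null || !id.matches("[0-9]{9}")) {
            throw new IllegalArgumentException("expected a 9-digit ID, got: " + id);
        }
    }
}
```

For instance, MarkPaths.byId("003672897") yields /marks/003/672/003672897, the
path from the quoted example. The path is derived from the ID alone, without
querying the repository.
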
I did an evaluation of Jackrabbit recently, and I found that using Apache
Sling instead of "pure" Jackrabbit made a number of things very convenient.
While I'm not sure whether it would be faster, using the Sling REST API
would enable you to create each document using something like

$ curl -u admin:admin -Fname=@binary.pdf -Fother_field=test ... http://<server>:<port>/marks/<nnn>/<mmm>/<ppp>/

or its equivalent request using any HTTPClient framework. This might show a
flatter response time, and will create the intermediate nodes if needed.
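
As a rough Java counterpart to the curl call above, here is a minimal sketch
using the JDK's java.net.http client (Java 11 or later). Several things in it
are assumptions rather than anything stated in this thread: the class name
SlingMarkRequest, the base URL passed in, and the three-by-three-digit path
layout are illustrative only. It sends simple form fields, so the binary
upload (-Fname=@binary.pdf) would need a multipart body, which is not shown.
It also posts to the exact node path without the trailing slash used in the
curl URL.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical builder for a form POST that targets a segmented mark path. */
final class SlingMarkRequest {
    private SlingMarkRequest() {}

    static HttpRequest build(String baseUrl, String id, String user, String password,
                             Map<String, String> fields) {
        if (id == null || !id.matches("[0-9]{9}")) {
            throw new IllegalArgumentException("expected a 9-digit ID, got: " + id);
        }
        // /marks/<nnn>/<mmm>/<ppp>, three digits per level (assumed layout).
        String path = "/marks/" + id.substring(0, 3) + "/" + id.substring(3, 6)
                + "/" + id.substring(6);
        // Encode the fields as application/x-www-form-urlencoded.
        String body = fields.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        // HTTP Basic credentials, the equivalent of curl's -u user:password.
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Authorization", "Basic " + credentials)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

The request only goes out once it is handed to a client, for example
HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).
Whether this gives a flatter response time than direct JCR calls would have
to be measured.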

Regards
Santiago


> That's why I still need to stick to the "flat" structure, taking advantage
> of the Lucene index, while still being worried about the use of the
> deprecated API, as I mentioned in my previous email.
>
> Salu2,
> Quique.
>
>
> On Wed, Nov 20, 2013 at 7:27 PM, Peter Harrison <
> peter.harrison@team.orcon.net.nz> wrote:
>
> > I am by no means an expert, but I have been developing for three or four
> > months with Jackrabbit. The approach I've taken is not to include the
> > base records under one node.
> >
> > For example, you may have classes of patent, such as medical, chemical
> > process etc, and so you could break down the mark into subnodes for each
> > class of patent. Finding a particular mark by its ID is still quite easy,
> > but not as trivial as simply having a path like /mark/<patentid>.
> >
> > I have put a REST interface in front of Jackrabbit that handles simple
> > IDs by running the appropriate query and then returning the object, which
> > contains the full path.
> >
> > This idea, that the path itself contains information about a node, takes
> > a little getting used to, but it allows you to do some very quick
> > reporting on specific classes, as searches can be scoped to specific
> > trees.
> >
> > What I'm learning is that Jackrabbit isn't just another kind of DB - so
> > you should not treat it as just another kind of flat table. You should be
> > creating a deep tree structure rather than a shallow structure. Doing
> > this allows you to utilise the path to limit the scope of queries.
> >
> > PS: I have also modified the Java OCM to allow lists of primitives to be
> > stored as properties of a single subnode. I've been making changes to OCM
> > on my local system, but am not really sure how to contribute back.
> >
> >
> > On 20/11/13 23:39, Enrique Medina Montenegro wrote:
> >
> >> Hi list,
> >>
> >>
> >> I’ve been evaluating Jackrabbit for several weeks, performing all sorts
> >> of performance testing due to the nature of the repository we need to
> >> create here at OHIM. Not sure if you’re aware of us, but we’re the
> >> European Office where you have to come to protect the intellectual
> >> property of your marks and designs in the whole European Community.
> >> Currently, we are storing all our marks and designs information in a
> >> relational DB, and besides serious performance issues (it’s an old DB,
> >> not Oracle unfortunately) we don’t have functionality such as versioning
> >> or observation. That, together with the fact that our information is
> >> perfectly suited to being modelled as an XML document, led us to think
> >> about storing it in a JCR repository.
> >>
> >>
> >>
> >> I went through David’s model and decided to create a single node called
> >> “marks” and then add one child node for each existing mark in our system
> >> (~1 million marks where each mark would have ~50 versions/revisions),
> >> but then I found that adding more than 10K child nodes could lead to
> >> potential performance issues. However, after some testing, I also found
> >> that indexing the mark nodes allowed us to query them extremely fast
> >> using SQL2, so we could overcome the issue with the 10K child nodes.
> >>
> >>
> >>
> >> For example, instead of calling
> >>
> >>   session.getNode("/marks/000345123")
> >>
> >> we could query
> >>
> >>   SELECT * FROM [iptool:markType] WHERE [iptool:id] = '000345123'
> >>
> >> (notice that we defined our own custom node types and also told Lucene
> >> just to index the [iptool:id] property through the use of the
> >> IndexConfiguration configuration).
> >>
> >>
> >>
> >> Everything was then progressing smoothly, but then we realized that in
> >> order to fetch a specific version or even the base version of a
> >> particular mark, the API recommended using the VersionManager:
> >>
> >>
> >>
> >> VersionHistory history = session.getWorkspace().getVersionManager()
> >>         .getVersionHistory(markNode.getPath());
> >>
> >>
> >>
> >> Unfortunately, this API makes use of direct path access to the node
> >> being versioned, which in our case was killing our performance due to
> >> the 10K child nodes limitation (sort of). Although it is possible to
> >> access the versions directly from the node itself using
> >> markNode.getBaseVersion() or markNode.getVersionHistory(), these methods
> >> are deprecated and we are not quite sure whether they will be removed in
> >> the near future or left there as an alternative way of retrieving the
> >> version history from a node.
> >>
> >>
> >>
> >> Therefore, could I possibly get some answers from you to help us out in
> >> making our final decision on whether to use Jackrabbit as our official
> >> JCR repository implementation?
> >>
> >>
> >>
> >> - Is the direct retrieval of the version history through the node itself
> >>   (now deprecated) going to be eventually removed or not? If so, when is
> >>   it planned to be removed? If not, will it be kept as a “valid”
> >>   alternative to the current VersionManager approach?
> >>
> >> - Using the Lucene indexes gives very fast read times (on the order of
> >>   tens of ms), but do you foresee other hidden issues or side effects
> >>   from maintaining ~1M child nodes underneath the same parent “marks”
> >>   node?
> >>
> >> - We also played around with the BTreeManager, but we couldn’t make it
> >>   work with custom node types. I even posted this issue to the users
> >>   mailing list, but so far I haven’t got any response:
> >>
> >> http://mail-archives.apache.org/mod_mbox/jackrabbit-users/201311.mbox/ajax/%3CCA%2BdeSP_weUQ0mtSBjoQGy3jq60jZEo7LtmF9kJZkvF1eyNvu-A%40mail.gmail.com%3E
> >>
> >>
> >> Thanks so much in advance for helping us choose Jackrabbit as our JCR
> >> technology, hopefully!!! :)
> >>
> >>
> >>
> >>
> >>
> >
>
