Date: Sat, 14 Nov 2015 10:02:39 +0200
From: Robert Munteanu
To: users@jackrabbit.apache.org
Subject: Re: Node Retrieval Performance
References: <5646657A.4080502@butterdev.com>

On Nov 14, 2015 2:21 AM, "Clay Ferguson" wrote:
>
> In my opinion this one issue is the single most crippling Achilles' heel
> of the entire JCR. Very likely to drive away many potential users of this
> API. It's touted as an enterprise-scale API, yet it chokes on just a few
> tens of thousands of nodes. This, IMO, urgently needs to be addressed. I
> know it's a technical limitation, and not a design decision, but to me
> that just means it's an 'unsolved' problem. I'm not complaining or
> criticizing developers, I'm just saying that as a community we need to
> solve this. In an ideal situation I should be able to have 50 million
> nodes without it being a problem. RDBMSs solved these issues years ago
> with a "never load everything all at once" rule. However, somehow the
> "It's ok to load all children in memory" mentality caught on in the JCR
> and we are now stuck with the results.

Note that this usually applies to direct child nodes, i.e. 50k nodes with
the same parent. Such a number spread throughout the repository is not an
issue.

Robert

>
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Fri, Nov 13, 2015 at 4:47 PM, Dirk Rudolph wrote:
>
> > Did I understand you right, you have thousands of child nodes below the
> > root node?
> >
> > You should avoid this because it is considered bad practice in terms of
> > write performance and, depending on your concurrent access, it might
> > also block read access.
> >
> > http://wiki.apache.org/jackrabbit/Performance
> >
> > Try to introduce a structure to your content using BTreeManager:
> >
> > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
> >
> > Cheers, D
> >
> >
> > On Friday, 13 November 2015, David Marginian wrote:
> >
> > > Thanks Clay. I am not trying to load that many records at once. The
> > > application is crawling a directory.
> > > It places the files from that directory into JackRabbit one at a
> > > time, and puts a content id onto a queue which is picked up by
> > > consumers on different servers. Those consumers then use the content
> > > id to retrieve the file from JackRabbit. Each piece of content is
> > > saved in a node under the root node. The performance slowdown is
> > > coming from calling session.getRootNode(); from what I can gather
> > > from the docs, I need the root node in order to add a child node.
> > > Note the slowdown is pretty significant and I don't need to have
> > > close to 50k to start seeing it (I start seeing it within a few
> > > minutes of running my app). I don't need orderable nodes; how do I
> > > disable that?
> > >
> > >
> > > On 11/13/2015 03:10 PM, Clay Ferguson wrote:
> > >
> > >> Please let us know more about your use case. Why are you even
> > >> "trying" to load that many records all at once, or at least scan
> > >> them one by one? In most use cases you wouldn't need to do this
> > >> kind of thing, unless it's some kind of backup or replication. I
> > >> say "most" cases... I'm not saying you don't need to, just asking
> > >> for a bit more background. BTW: If you don't need 'orderable' nodes
> > >> try to avoid them. That type of node does not work at 'scale'...
> > >> and 50K is probably pushing it.
> > >>
> > >> Best regards,
> > >> Clay Ferguson
> > >> wclayf@gmail.com
> > >>
> > >>
> > >> On Fri, Nov 13, 2015 at 3:33 PM, wrote:
> > >>
> > >>> Hi,
> > >>> I am new to JackRabbit and using version 2.11.2. I am using
> > >>> JackRabbit to store documents in a multi-threaded environment. I
> > >>> noticed that the time it takes to retrieve the root node is
> > >>> inconsistent and slow (several seconds +) and degrades over time
> > >>> (after 50K plus child nodes retrieval is taking ~15 seconds).
> > >>>
> > >>> Originally, I was using code as follows to obtain a repository:
> > >>>
> > >>> public Repository getRepository() throws ClassNotFoundException,
> > >>>         RepositoryException {
> > >>>     ServiceLoader.load(Class.forName(
> > >>>         "org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > >>>     return JcrUtils.getRepository(jackabbitServerUrl);
> > >>> }
> > >>>
> > >>> Then I came across the following thread:
> > >>>
> > >>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
> > >>>
> > >>> This thread had some useful information (BatchReadConfig), but I am
> > >>> not certain how to use the API to take advantage of it. I have
> > >>> changed my code to the following but it doesn't appear that node
> > >>> retrieval performance has improved, is there something I am
> > >>> missing/doing wrong?
> > >>>
> > >>> 1) Repository Factory
> > >>> public Repository getRepository(@SuppressWarnings("rawtypes") Map
> > >>>         parameters) throws RepositoryException {
> > >>>     String repositoryFactoryName = parameters != null && (
> > >>>             parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
> > >>>             parameters.containsKey(PARAM_REPOSITORY_CONFIG))
> > >>>         ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
> > >>>         : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
> > >>>
> > >>>     Object repositoryFactory;
> > >>>     try {
> > >>>         Class repositoryFactoryClass =
> > >>>             Class.forName(repositoryFactoryName, true,
> > >>>                 Thread.currentThread().getContextClassLoader());
> > >>>
> > >>>         repositoryFactory = repositoryFactoryClass.newInstance();
> > >>>     }
> > >>>     catch (Exception e) {
> > >>>         throw new RepositoryException(e);
> > >>>     }
> > >>>
> > >>>     if (repositoryFactory instanceof RepositoryFactory) {
> > >>>         return ((RepositoryFactory)
> > >>>             repositoryFactory).getRepository(parameters);
> > >>>     }
> > >>>     else {
> > >>>         throw new RepositoryException(repositoryFactory
> > >>>             + " is not a RepositoryFactory");
> > >>>     }
> > >>> }
> > >>>
> > >>> 2) Use the factory to get a repo:
> > >>> public Repository getRepository() throws ClassNotFoundException,
> > >>>         RepositoryException {
> > >>>     Map parameters =
> > >>>         Collections.singletonMap(
> > >>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
> > >>>             (RepositoryConfig) new
> > >>>                 RepositoryConfigImpl(jackabbitServerUrl));
> > >>>
> > >>>     return getRepository(parameters);
> > >>> }
> > >>>
> > >>> 3) Repository Config:
> > >>> private static final class RepositoryConfigImpl implements
> > >>>         RepositoryConfig {
> > >>>
> > >>>     private String jackabbitServerUrl;
> > >>>
> > >>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
> > >>>         super();
> > >>>         this.jackabbitServerUrl = jackabbitServerUrl;
> > >>>     }
> > >>>
> > >>>     public CacheBehaviour getCacheBehaviour() {
> > >>>         return CacheBehaviour.INVALIDATE;
> > >>>     }
> > >>>
> > >>>     public int getItemCacheSize() {
> > >>>         return 100;
> > >>>     }
> > >>>
> > >>>     public int getPollTimeout() {
> > >>>         return 5000;
> > >>>     }
> > >>>
> > >>>     public RepositoryService getRepositoryService() throws
> > >>>             RepositoryException {
> > >>>         BatchReadConfig brc = new BatchReadConfig() {
> > >>>             public int getDepth(Path path, PathResolver resolver)
> > >>>                     throws NamespaceException {
> > >>>                 return 1;
> > >>>             }
> > >>>         };
> > >>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
> > >>>     }
> > >>>
> > >>> }
> > >>>
> > >>> Thanks for your time.
> > >>>
> > >>> David
> > >>>
> > >
> >
> > --
> >
> > Dirk Rudolph | Senior Software Engineer
> >
> > Netcentric AG
> >
> > M: +41 79 642 37 11
> > D: +49 174 966 84 34
> >
> > dirk.rudolph@netcentric.biz | www.netcentric.biz
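
The advice in this thread (avoid tens of thousands of direct children under
one parent, and introduce some structure instead) can be illustrated with a
minimal sketch. It is not taken from the thread and is not Jackrabbit's
BTreeManager; BucketedPath and forContentId are hypothetical names, and the
sketch assumes the content id is already a valid JCR node name. It only
computes a relative path whose leading segments are hash buckets, so that
no single parent collects every node:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of Jackrabbit): maps a content id to a bucketed relative path. */
final class BucketedPath {

    private BucketedPath() {
    }

    /**
     * Returns a relative path such as "3f/a2/<contentId>". Each bucket
     * segment is one byte (two hex digits) of the SHA-1 hash of the id,
     * so every bucket level has at most 256 children.
     */
    static String forContentId(String contentId, int levels) {
        if (levels < 0 || levels > 20) { // a SHA-1 hash has 20 bytes
            throw new IllegalArgumentException("levels must be between 0 and 20");
        }
        byte[] hash;
        try {
            hash = MessageDigest.getInstance("SHA-1")
                    .digest(contentId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            // Mask to an unsigned value so each byte prints as two hex digits.
            path.append(String.format("%02x", hash[i] & 0xff)).append('/');
        }
        return path.append(contentId).toString();
    }
}
```

With two levels there are 65,536 leaf buckets, so even 50 million nodes
would average roughly 760 children per bucket. Because the path is derived
from the content id alone, a consumer that only receives the id from the
queue can recompute the same path. Creating the intermediate nodes is left
out here; JcrUtils.getOrCreateByPath in jackrabbit-jcr-commons looks like a
fit, but check the Javadoc for your version.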