From: Clay Ferguson <wclayf@gmail.com>
To: users@jackrabbit.apache.org
Date: Tue, 24 Nov 2015 15:47:11 -0600
Subject: Re: Memory usage
Come on Kevin, I just googled it and found it immediately, bro. :)

https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits

Best regards,
Clay Ferguson
wclayf@gmail.com

On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin wrote:

> Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
>
> https://issues.apache.org/jira/browse/SLING-1873
>
> I set resultFetchSize to a very low number and I was still able to iterate
> through a larger result set, although this may have been batched behind
> the scenes. I'm hoping that my new flag-based task will drastically cut
> down the result set size and prevent the runaway memory usage anyway.
>
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 1:35 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> Point #1: in SQL2 you can just build your query string dynamically and
> put in the time of the last replication, so really I don't see the
> limitation there. You would always just build your queries with the
> correct date on them. But, like you said, that is a "weak" solution. I
> actually think the 'dirty flag' or 'needs replication' flag is better,
> because you can do it node by node and at any time, and you can shut down
> and restart and it will always pick up where it left off. With timestamps
> you can run into situations where one cycle only half processed (failure
> for whatever reason), and then your dates get messed up. So if I were you
> I'd do the flag approach; it seems more bulletproof. If you have systems
> A, B, and C, where A needs to replicate out to B and C, then every time
> you modify or create an A node you set B_DIRTY=true and C_DIRTY=true on
> the A node, and that flags it so you know a replication is pending.
> Sounds like you are on the right track; you just need to set a LIMIT on
> your query so that it only grabs 100 or so at a time. I know MySQL has a
> LIMIT; maybe SQL2 does also. You'd just keep running 100 at a time using
> LIMIT until one of the queries comes back empty. It will use hardly any
> memory, be bulletproof, and always be easily restartable/resumable.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
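For reference, a minimal sketch of the batched, flag-driven loop described
above. The needsReplication property name and the replicate() call are
hypothetical, not anything from this thread; JCR-SQL2 itself has no LIMIT
keyword, but the JCR 2.0 Query API exposes setLimit for the same effect:

    import javax.jcr.*;
    import javax.jcr.query.*;

    void replicatePending(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        while (true) {
            Query q = qm.createQuery(
                "SELECT * FROM [nt:base] AS n"
                    + " WHERE n.[needsReplication] = 'pending'",
                Query.JCR_SQL2);
            q.setLimit(100);                     // JCR 2.0 API; bounds each batch
            NodeIterator nodes = q.execute().getNodes();
            if (!nodes.hasNext()) {
                break;                           // query came back empty -> done
            }
            while (nodes.hasNext()) {
                Node node = nodes.nextNode();
                replicate(node);                 // hypothetical application-specific step
                node.getProperty("needsReplication").remove();  // clear the flag
            }
            session.save();                      // persist cleared flags after each batch
        }
    }

Because each processed node has its flag cleared before the next query runs,
no offset bookkeeping is needed; the loop simply repeats until the query
returns nothing.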
> On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin wrote:
>
> > Basically we replicate images and associated metadata to another system.
> > One of the use cases is that the user marks an image as interesting in
> > the local system. This metadata change (or any other) then needs to
> > propagate to the other system. So, I am querying for nodes where
> > jcr:lastModified is greater than another Date, which is the timestamp of
> > the last replication.
> >
> > My understanding is that JCR-SQL2 can only do a comparison where the
> > second operand is static. I am working on a different approach where I
> > set a flag on any node that needs to be replicated. I have event handlers
> > for added and changed nodes - at that moment it is trivial to determine
> > whether the node should be flagged. I realized it is much easier than
> > trying to figure it out later. The "later" case arises because we have
> > the option to switch this replication on and off, and there may be a
> > situation where it gets switched back on and must catch up with a backlog
> > of work. This way I can simply query all nodes with the flag set (I have
> > a scheduled task that looks for nodes needing replication).
> >
> > If there's a date comparison trick it might help as an interim solution
> > until I get this other idea up and running.
> >
> > Thanks!
> >
> > -----Original Message-----
> > From: Clay Ferguson [mailto:wclayf@gmail.com]
> > Sent: Tuesday, November 24, 2015 12:15 PM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > Glad you're gettin' closer.
> >
> > If you want, tell us more about the date range problem, because I may
> > know a solution (or workaround). Remember dates can be treated as
> > integers if you really need to. Integers are also the fastest and most
> > powerful data type for DBs to handle, so there should be a good clean
> > solution unless you have a VERY unusual situation.
> >
> > Best regards,
> > Clay Ferguson
> > wclayf@gmail.com
> >
> > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin wrote:
> >
> > > I think I am hot on the trail. I noticed this morning that the top
> > > objects in the heap dump are not just Lucene; they are classes related
> > > to query results. Due to a limitation in the Jackrabbit query language
> > > (specifically the inability to compare two dynamic dates) I am running
> > > a query that returns a result set proportional to the size of the
> > > repository (in other words, it is unbounded). resultFetchSize is
> > > unlimited by default, so I think I am getting larger and larger query
> > > results until I run out of space.
> > >
> > > I already changed this parameter yesterday, so I will see what happens
> > > with the testing today. In the bigger picture I'm working on a better
> > > way to mark and query the nodes I'm interested in so I don't have to
> > > perform an unbounded query.
> > >
> > > Thanks again for the excellent support.
> > >
> > > P.S. We build and run a standalone Sling jar - it runs separately from
> > > our main application.
> > >
> > > -----Original Message-----
> > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > Just as Clay mentioned above, Jackrabbit does not hold the complete
> > > Lucene index in memory. How it actually works is that there is a
> > > VolatileIndex which is in memory. Any updates to the Lucene index are
> > > first done there and then committed to the file system based on the
> > > threshold parameters. This was obviously implemented for performance
> > > reasons. http://wiki.apache.org/jackrabbit/Search
> > >
> > > Parameters:
> > >
> > > 1. maxVolatileIndexSize (default 1048576): the maximum volatile index
> > >    size in bytes until it is written to disk. The default value is 1 MB.
> > >
> > > 2. volatileIdleTime (default 3): idle time in seconds until the
> > >    volatile index part is moved to a persistent index even though
> > >    minMergeDocs is not reached.
> > >
> > > 1 GB is quite low. My company has run a production instance of
> > > Jackrabbit with 1 GB of memory for over two years and it has not had
> > > any issues. The only time I saw huge spikes in memory consumption was
> > > on large operations such as cloning a node with many descendants or
> > > querying a data set with a 10k+ result size.
> > >
> > > You said you have gathered a heap dump; this should point you in the
> > > direction of which objects are consuming the majority of the heap. This
> > > would be a good start to see if it is Jackrabbit causing the issue or
> > > your application.
> > >
> > > What type of deployment
> > > (http://jackrabbit.apache.org/jcr/deployment-models.html) of Jackrabbit
> > > are you guys running? Is it completely isolated or embedded in your
> > > application?
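For reference, the parameters above (along with the resultFetchSize and
supportHighlighting settings mentioned elsewhere in this thread) are set on
the SearchIndex element of the workspace/repository configuration. A minimal
sketch with illustrative values, not a recommended configuration:

    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
      <!-- flush the in-memory (volatile) index to disk once it exceeds 1 MB -->
      <param name="maxVolatileIndexSize" value="1048576"/>
      <!-- ...or after 3 seconds of idle time -->
      <param name="volatileIdleTime" value="3"/>
      <!-- how many results the query handler initially fetches per query -->
      <param name="resultFetchSize" value="100"/>
      <!-- excerpt/highlighting support costs extra index space; off if unused -->
      <param name="supportHighlighting" value="false"/>
    </SearchIndex>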
> > > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin wrote:
> > >
> > > > Hi, Ben. I was referring to the following page:
> > > >
> > > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > > >
> > > > "The most recent generation of the search index is held completely
> > > > in memory."
> > > >
> > > > Perhaps I am misreading this, or perhaps it is wrong, but I
> > > > interpreted it to mean that the size of the index in memory would be
> > > > proportional to the repository size. I hope this is not true!
> > > >
> > > > I am currently trying to get information from our QA team about the
> > > > approximate number of nodes in the repository. We are not currently
> > > > setting an explicit heap size - in the dumps I've examined it seems
> > > > to run out around 240 MB. I'm pushing to set something explicit, but
> > > > I'm now hearing that older hardware has only 1 GB of memory, which
> > > > gives us practically nowhere to go.
> > > >
> > > > The queries that I'm doing are not very fancy... for example:
> > > > "select * from [nt:resource] where [jcr:mimeType] like 'image%%'".
> > > > I'm actually rewriting that task so the query will be even simpler.
> > > >
> > > > Thanks for the help!
> > > >
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > > Sent: Monday, November 23, 2015 5:21 PM
> > > > To: users@jackrabbit.apache.org
> > > > Subject: Re: Memory usage
> > > >
> > > > It is a good idea to turn off supportHighlighting, especially if you
> > > > aren't using the functionality; it takes up a lot of extra space
> > > > within the index. I am not sure where you heard that the Lucene index
> > > > is kept in memory, but I am pretty certain that is wrong. Can you
> > > > point me to the documentation saying this?
> > > >
> > > > Also, what data set sizes are you querying against (10k nodes? 100k
> > > > nodes? 1 million nodes?)
> > > > What heap size do you have set on the JVM?
> > > > Reducing the resultFetchSize should help reduce the memory footprint
> > > > on queries.
> > > > I am assuming you are using the QueryManager to retrieve nodes. Can
> > > > you give an example query that you are using?
> > > >
> > > > I have developed a patch to improve query performance on large data
> > > > sets with Jackrabbit 2.x. I should be done soon if I can gather
> > > > together a few hours to finish up my work. If you would like, you can
> > > > give that a try once I finish.
> > > >
> > > > Some other repository settings you might want to look at are:
> > > >
> > > >   <PersistenceManager
> > > >       class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > > >     ...
> > > >   </PersistenceManager>
> > > >
> > > >   <ISMLocking
> > > >       class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > > >
> > > > Hope this helps.
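Finally, a minimal sketch of the "flag on write" idea Kevin describes, using
the standard JCR observation API. The needsReplication marker property and
the /content subtree are assumptions for illustration only, and error
handling is reduced to a comment:

    import javax.jcr.*;
    import javax.jcr.observation.*;

    void registerFlagger(Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        EventListener flagger = events -> {
            try {
                while (events.hasNext()) {
                    Event e = events.nextEvent();
                    // For property events the path points at the property;
                    // trim it back to the owning node.
                    String path = e.getPath();
                    if (e.getType() != Event.NODE_ADDED) {
                        path = path.substring(0, path.lastIndexOf('/'));
                    }
                    Node node = session.getNode(path);
                    node.setProperty("needsReplication", "pending"); // hypothetical marker
                }
                session.save();
            } catch (RepositoryException ex) {
                // sketch only: log the failure and move on
            }
        };
        om.addEventListener(flagger,
                Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/content",   // hypothetical subtree to watch
                true,         // isDeep: include descendants
                null, null,
                true);        // noLocal: ignore this session's own flag writes
    }

A scheduled task (such as the batched query loop sketched earlier in this
thread) can then pick up the flagged nodes and clear the marker after each
node is replicated.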