From: Clay Ferguson <wclayf@gmail.com>
To: users@jackrabbit.apache.org
Date: Tue, 24 Nov 2015 15:47:11 -0600
Subject: Re: Memory usage
Come on Kevin, I just googled it and found it immediately, bro. :)

https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits

Best regards,
Clay Ferguson
wclayf@gmail.com

On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin wrote:

> Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
>
> https://issues.apache.org/jira/browse/SLING-1873
>
> I set resultFetchSize to a very low number and I was still able to iterate
> through a larger result set, although this may have been batched behind
> the scenes. I'm hoping that my new flag-based task will drastically cut
> down the result set size and prevent the runaway memory usage anyway.
>
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 1:35 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> Point #1: in SQL2 you can just build your query string dynamically and
> put in the time of the last replication, so really I don't see the
> limitation there. You would always just build your queries with the
> correct date on them. But, like you said, that is a "weak" solution. I
> actually think the 'dirty flag' or 'needs replication' flag is better,
> because you can do it node by node and at any time, and you can shut down
> and restart and it will always pick up where it left off. With timestamps
> you can run into situations where one cycle only half processed (failure
> for whatever reason), and then your dates get messed up. So if I were you
> I'd do the flag approach; it seems more bulletproof. If you have systems
> A, B, and C, where A needs to replicate out to B and C, then every time
> you modify or create an A node you set B_DIRTY=true and C_DIRTY=true on
> the A node, and that flags it so you know a replication is pending.
> Sounds like you are on the right track; you just need to set a LIMIT on
> your query so that it only grabs 100 or so at a time. I know MySQL has a
> LIMIT; maybe SQL2 does also. You'd just keep running 100 at a time using
> LIMIT until one of the queries comes back empty. It will use hardly any
> memory, be bulletproof, and always be easily restartable/resumable.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
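For reference, a minimal sketch of the batched, flag-driven loop described
above. The needsReplication property name and the replicate() call are
hypothetical, not anything from this thread; JCR-SQL2 itself has no LIMIT
keyword, but the JCR 2.0 Query API exposes setLimit for the same effect:

    import javax.jcr.*;
    import javax.jcr.query.*;

    void replicatePending(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        while (true) {
            Query q = qm.createQuery(
                "SELECT * FROM [nt:base] AS n"
                    + " WHERE n.[needsReplication] = 'pending'",
                Query.JCR_SQL2);
            q.setLimit(100);                     // JCR 2.0 API; bounds each batch
            NodeIterator nodes = q.execute().getNodes();
            if (!nodes.hasNext()) {
                break;                           // query came back empty -> done
            }
            while (nodes.hasNext()) {
                Node node = nodes.nextNode();
                replicate(node);                 // hypothetical application-specific step
                node.getProperty("needsReplication").remove();  // clear the flag
            }
            session.save();                      // persist cleared flags after each batch
        }
    }

Because each processed node has its flag cleared before the next query runs,
no offset bookkeeping is needed; the loop simply repeats until the query
returns nothing.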
> On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin wrote:
>
> > Basically we replicate images and associated metadata to another system.
> > One of the use cases is that the user marks an image as interesting in
> > the local system. This metadata change (or any other) then needs to
> > propagate to the other system. So, I am querying for nodes where
> > jcr:lastModified is greater than another Date, which is the timestamp of
> > the last replication.
> >
> > My understanding is that JCR-SQL2 can only do a comparison where the
> > second operand is static. I am working on a different approach where I
> > set a flag on any node that needs to be replicated. I have event handlers
> > for added and changed nodes - at that moment it is trivial to determine
> > whether the node should be flagged. I realized it is much easier than
> > trying to figure it out later. The "later" case arises because we have
> > the option to switch this replication on and off, and there may be a
> > situation where it gets switched back on and must catch up with a backlog
> > of work. This way I can simply query all nodes with the flag set (I have
> > a scheduled task that looks for nodes needing replication).
> >
> > If there's a date comparison trick it might help as an interim solution
> > until I get this other idea up and running.
> >
> > Thanks!
> >
> > -----Original Message-----
> > From: Clay Ferguson [mailto:wclayf@gmail.com]
> > Sent: Tuesday, November 24, 2015 12:15 PM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > Glad you're gettin' closer.
> >
> > If you want, tell us more about the date range problem, because I may
> > know a solution (or workaround). Remember dates can be treated as
> > integers if you really need to. Integers are also the fastest and most
> > powerful data type for DBs to handle, so there should be a good clean
> > solution unless you have a VERY unusual situation.
> >
> > Best regards,
> > Clay Ferguson
> > wclayf@gmail.com
> >
> > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin wrote:
> >
> > > I think I am hot on the trail. I noticed this morning that the top
> > > objects in the heap dump are not just Lucene; they are classes related
> > > to query results. Due to a limitation in the Jackrabbit query language
> > > (specifically the inability to compare two dynamic dates) I am running
> > > a query that returns a result set proportional to the size of the
> > > repository (in other words, it is unbounded). resultFetchSize is
> > > unlimited by default, so I think I am getting larger and larger query
> > > results until I run out of space.
> > >
> > > I already changed this parameter yesterday, so I will see what happens
> > > with the testing today. In the bigger picture I'm working on a better
> > > way to mark and query the nodes I'm interested in so I don't have to
> > > perform an unbounded query.
> > >
> > > Thanks again for the excellent support.
> > >
> > > P.S. We build and run a standalone Sling jar - it runs separately from
> > > our main application.
> > >
> > > -----Original Message-----
> > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > Just as Clay mentioned above, Jackrabbit does not hold the complete
> > > Lucene index in memory. How it actually works is that there is a
> > > VolatileIndex which is in memory. Any updates to the Lucene index are
> > > first done there and then committed to the file system based on the
> > > threshold parameters. This was obviously implemented for performance
> > > reasons. http://wiki.apache.org/jackrabbit/Search
> > >
> > > Parameters:
> > >
> > > 1. maxVolatileIndexSize (default 1048576): the maximum volatile index
> > >    size in bytes until it is written to disk. The default value is 1 MB.
> > >
> > > 2. volatileIdleTime (default 3): idle time in seconds until the
> > >    volatile index part is moved to a persistent index even though
> > >    minMergeDocs is not reached.
> > >
> > > 1 GB is quite low. My company has run a production instance of
> > > Jackrabbit with 1 GB of memory for over two years and it has not had
> > > any issues. The only time I saw huge spikes in memory consumption was
> > > on large operations such as cloning a node with many descendants or
> > > querying a data set with a 10k+ result size.
> > >
> > > You said you have gathered a heap dump; this should point you in the
> > > direction of which objects are consuming the majority of the heap. This
> > > would be a good start to see if it is Jackrabbit causing the issue or
> > > your application.
> > >
> > > What type of deployment
> > > (http://jackrabbit.apache.org/jcr/deployment-models.html) of Jackrabbit
> > > are you guys running? Is it completely isolated or embedded in your
> > > application?
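For reference, the parameters above (along with the resultFetchSize and
supportHighlighting settings mentioned elsewhere in this thread) are set on
the SearchIndex element of the workspace/repository configuration. A minimal
sketch with illustrative values, not a recommended configuration:

    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
      <!-- flush the in-memory (volatile) index to disk once it exceeds 1 MB -->
      <param name="maxVolatileIndexSize" value="1048576"/>
      <!-- ...or after 3 seconds of idle time -->
      <param name="volatileIdleTime" value="3"/>
      <!-- how many results the query handler initially fetches per query -->
      <param name="resultFetchSize" value="100"/>
      <!-- excerpt/highlighting support costs extra index space; off if unused -->
      <param name="supportHighlighting" value="false"/>
    </SearchIndex>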
> > > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin wrote:
> > >
> > > > Hi, Ben. I was referring to the following page:
> > > >
> > > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > > >
> > > > "The most recent generation of the search index is held completely
> > > > in memory."
> > > >
> > > > Perhaps I am misreading this, or perhaps it is wrong, but I
> > > > interpreted it to mean that the size of the index in memory would be
> > > > proportional to the repository size. I hope this is not true!
> > > >
> > > > I am currently trying to get information from our QA team about the
> > > > approximate number of nodes in the repository. We are not currently
> > > > setting an explicit heap size - in the dumps I've examined it seems
> > > > to run out around 240 MB. I'm pushing to set something explicit, but
> > > > I'm now hearing that older hardware has only 1 GB of memory, which
> > > > gives us practically nowhere to go.
> > > >
> > > > The queries that I'm doing are not very fancy... for example:
> > > > "select * from [nt:resource] where [jcr:mimeType] like 'image%%'".
> > > > I'm actually rewriting that task so the query will be even simpler.
> > > >
> > > > Thanks for the help!
> > > >
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > > Sent: Monday, November 23, 2015 5:21 PM
> > > > To: users@jackrabbit.apache.org
> > > > Subject: Re: Memory usage
> > > >
> > > > It is a good idea to turn off supportHighlighting, especially if you
> > > > aren't using the functionality; it takes up a lot of extra space
> > > > within the index. I am not sure where you heard that the Lucene index
> > > > is kept in memory, but I am pretty certain that is wrong. Can you
> > > > point me to the documentation saying this?
> > > >
> > > > Also, what data set sizes are you querying against (10k nodes? 100k
> > > > nodes? 1 million nodes?)
> > > > What heap size do you have set on the JVM?
> > > > Reducing the resultFetchSize should help reduce the memory footprint
> > > > on queries.
> > > > I am assuming you are using the QueryManager to retrieve nodes. Can
> > > > you give an example query that you are using?
> > > >
> > > > I have developed a patch to improve query performance on large data
> > > > sets with Jackrabbit 2.x. I should be done soon if I can gather
> > > > together a few hours to finish up my work. If you would like, you can
> > > > give that a try once I finish.
> > > >
> > > > Some other repository settings you might want to look at are:
> > > >
> > > >   <PersistenceManager
> > > >       class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > > >     ...
> > > >   </PersistenceManager>
> > > >
> > > >   <ISMLocking
> > > >       class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > > >
> > > > Hope this helps.
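Finally, a minimal sketch of the "flag on write" idea Kevin describes, using
the standard JCR observation API. The needsReplication marker property and
the /content subtree are assumptions for illustration only, and error
handling is reduced to a comment:

    import javax.jcr.*;
    import javax.jcr.observation.*;

    void registerFlagger(Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        EventListener flagger = events -> {
            try {
                while (events.hasNext()) {
                    Event e = events.nextEvent();
                    // For property events the path points at the property;
                    // trim it back to the owning node.
                    String path = e.getPath();
                    if (e.getType() != Event.NODE_ADDED) {
                        path = path.substring(0, path.lastIndexOf('/'));
                    }
                    Node node = session.getNode(path);
                    node.setProperty("needsReplication", "pending"); // hypothetical marker
                }
                session.save();
            } catch (RepositoryException ex) {
                // sketch only: log the failure and move on
            }
        };
        om.addEventListener(flagger,
                Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/content",   // hypothetical subtree to watch
                true,         // isDeep: include descendants
                null, null,
                true);        // noLocal: ignore this session's own flag writes
    }

A scheduled task (such as the batched query loop sketched earlier in this
thread) can then pick up the flagged nodes and clear the marker after each
node is replicated.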