jackrabbit-users mailing list archives

From "Roll, Kevin" <Kevin-R...@idexx.com>
Subject RE: Memory usage
Date Tue, 24 Nov 2015 21:30:57 GMT
Unfortunately, the use of 'limit' is not supported via JCR-SQL2 queries:

https://issues.apache.org/jira/browse/SLING-1873

I set resultFetchSize to a very low number and I was still able to iterate through a larger
result set, although this may have been batched behind the scenes. I'm hoping that my new
flag-based task will drastically cut down the result set size and prevent the runaway memory
usage anyway.
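
(If it helps as a stopgap before the flag-based task lands: although a LIMIT
clause in the JCR-SQL2 statement itself isn't supported, the plain JCR 2.0
Query object does expose setLimit()/setOffset(). Whether that is reachable
through the Sling layer in play here is an open question; the sketch below is
minimal and the names are illustrative.)

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class BoundedQuery {
        // Pull back at most pageSize results per execution instead of the whole set.
        public static void fetchPage(Session session, long offset, long pageSize)
                throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query q = qm.createQuery(
                    "SELECT * FROM [nt:resource] WHERE [jcr:mimeType] LIKE 'image%'",
                    Query.JCR_SQL2);
            q.setOffset(offset);   // skip rows handled in earlier pages
            q.setLimit(pageSize);  // bound what this execution loads
            NodeIterator nodes = q.execute().getNodes();
            while (nodes.hasNext()) {
                Node node = nodes.nextNode();
                // process the node ...
            }
        }
    }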


From: Clay Ferguson [mailto:wclayf@gmail.com] 
Sent: Tuesday, November 24, 2015 1:35 PM
To: users@jackrabbit.apache.org
Subject: Re: Memory usage

Point #1: in SQL2 you can just build your query string dynamically and put in
the time of the last replication, so I don't really see a limitation there; you
would always build your queries with the correct date in them. But, like you
said, that is a 'weak' solution. I actually think the 'dirty flag' (or 'needs
replication' flag) approach is better, because you can do it node-by-node and
at any time, and you can shut down and restart and it will always pick up where
it left off. With timestamps you can run into situations where one cycle only
half-processed (failure for whatever reason), and then your dates get messed
up. So if I were you I'd do the flag approach; it seems more bulletproof. If
you have systems A, B, and C, where A needs to replicate out to B and C, then
every time you modify or create an A node you set B_DIRTY=true and C_DIRTY=true
on the A node, and that flags it so you know a replication is pending.

Sounds like you are on the right track; you just need to set a LIMIT on your
query so that it only grabs 100 or so at a time. I know MySQL has a LIMIT;
maybe SQL2 does also. You'd just keep running 100 at a time using LIMIT until
one of the queries comes back empty. That will use hardly any memory, and be
bulletproof AND always easily restartable/resumable.
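
(To make the batching loop concrete, here is a rough sketch of "query flagged
nodes 100 at a time until a query comes back empty", assuming the JCR 2.0
Query.setLimit() is available. B_DIRTY is the flag property described above,
replicateToB() is a placeholder for the real replication step, and the flag is
cleared by removing the property, which keeps the query a simple
property-existence check.)

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class BatchReplicator {
        private static final int BATCH_SIZE = 100;

        public void replicatePending(Session session) throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            while (true) {
                Query q = qm.createQuery(
                        "SELECT * FROM [nt:base] AS n WHERE n.[B_DIRTY] IS NOT NULL",
                        Query.JCR_SQL2);
                q.setLimit(BATCH_SIZE);            // grab ~100 at a time
                NodeIterator batch = q.execute().getNodes();
                if (!batch.hasNext()) {
                    return;                         // nothing left flagged: done
                }
                while (batch.hasNext()) {
                    Node node = batch.nextNode();
                    replicateToB(node);             // placeholder for the real copy
                    node.getProperty("B_DIRTY").remove();  // clear the flag
                }
                session.save();                     // commit cleared flags per batch
            }
        }

        private void replicateToB(Node node) throws RepositoryException {
            // hypothetical: push this node's content to system B
        }
    }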

Best regards,
Clay Ferguson
wclayf@gmail.com


On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <Kevin-Roll@idexx.com> wrote:

> Basically we replicate images and associated metadata to another system.
> One of the use cases is that the user marks an image as interesting in the
> local system. This metadata change (or any other) needs to then propagate
> to the other system. So, I am querying for nodes where jcr:lastModified is
> greater than another Date which is the timestamp of the last replication.
>
> My understanding is that JCR-SQL2 can only do a comparison where the
> second operand is static. I am working on a different approach where I set
> a flag on any node that needs to be replicated. I have event handlers for
> added and changed nodes - at that moment it is trivial to determine whether
> the node should be flagged. I realized it is much easier than trying to
> figure it out later. The "later" case arises because we have the option to
> switch this replication on and off, and there may be a situation where it is
> turned back on and must catch up with a backlog of work. This way I can simply
> query all nodes with the flag set (I have a scheduled task that looks for
> nodes needing replication).
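>
> (For what it's worth, a bare-bones sketch of that kind of flagging listener
> using the standard JCR observation API. The flag name B_DIRTY is illustrative,
> it assumes the node types involved allow the extra property, and it uses a
> dedicated session for the listener.)
>
>   import javax.jcr.Item;
>   import javax.jcr.Node;
>   import javax.jcr.RepositoryException;
>   import javax.jcr.Session;
>   import javax.jcr.observation.Event;
>   import javax.jcr.observation.EventIterator;
>   import javax.jcr.observation.EventListener;
>
>   public class DirtyFlagListener implements EventListener {
>       private final Session session; // dedicated observation session
>
>       public DirtyFlagListener(Session session) throws RepositoryException {
>           this.session = session;
>           session.getWorkspace().getObservationManager().addEventListener(
>                   this,
>                   Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
>                   "/", true, null, null,
>                   true); // noLocal: ignore the listener's own flag writes
>       }
>
>       public void onEvent(EventIterator events) {
>           try {
>               while (events.hasNext()) {
>                   Item item = session.getItem(events.nextEvent().getPath());
>                   Node node = item.isNode() ? (Node) item : item.getParent();
>                   node.setProperty("B_DIRTY", true); // mark as needing replication
>               }
>               session.save();
>           } catch (RepositoryException e) {
>               // log and move on; the scheduled task will catch anything missed
>           }
>       }
>   }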
>
> If there's a date comparison trick it might help as an interim solution
> until I get this other idea up and running.
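>
> (One interim trick: since only the second operand has to be static, the last
> replication time can be baked into the statement as a DATE literal each run,
> e.g. with the ISO8601 helper from jackrabbit-jcr-commons. A sketch, with
> lastRun as a java.util.Calendar kept by the replication task:)
>
>   import java.util.Calendar;
>   import org.apache.jackrabbit.util.ISO8601;
>
>   public class ModifiedSinceQuery {
>       // Embed the timestamp as a static literal so the comparison is allowed.
>       static String statementFor(Calendar lastRun) {
>           return "SELECT * FROM [nt:resource] AS r WHERE r.[jcr:lastModified] > CAST('"
>                   + ISO8601.format(lastRun) + "' AS DATE)";
>       }
>   }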
>
> Thanks!
>
> -----Original Message-----
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 12:15 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> glad you're gettin' closer.
>
> If you want, tell us more about the date range problem, because I may know
> a solution (or workaround). Remember dates can be treated as integers if
> you really need to. Integers are the fastest and most powerful data type
> for dbs to handle too. So there should be a good clean solution unless you
> have a VERY unusual situation.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <Kevin-Roll@idexx.com>
> wrote:
>
> > I think I am hot on the trail. I noticed this morning that the top objects
> > in the heap dump are not just Lucene; they are classes related to query
> > results. Due to a limitation in the Jackrabbit query language (specifically
> > the inability to compare two dynamic dates), I am running a query that
> > returns a result set proportional to the size of the repository (in other
> > words, it is unbounded). resultFetchSize is unlimited by default, so I think
> > I am getting larger and larger query results until I run out of space.
> >
> > I already changed this parameter yesterday, so I will see what happens with
> > the testing today. In the bigger picture I'm working on a better way to mark
> > and query the nodes I'm interested in so I don't have to perform an
> > unbounded query.
> >
> > Thanks again for the excellent support.
> >
> > P.S. We build and run a standalone Sling jar - it runs separately from our
> > main application.
> >
> >
> > -----Original Message-----
> > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > Sent: Tuesday, November 24, 2015 11:05 AM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > So, just as Clay mentioned above, Jackrabbit does not hold the complete
> > Lucene index in memory. How it actually works is that there is a
> > VolatileIndex which is held in memory; any updates to the Lucene index are
> > first applied there and then committed to the file system based on the
> > threshold parameters. This was obviously implemented for performance
> > reasons.
> > http://wiki.apache.org/jackrabbit/Search
> > Parameters:
> >
> >   maxVolatileIndexSize (default: 1048576)
> >     The maximum volatile index size in bytes until it is written to disk.
> >     The default value is 1MB.
> >
> >   volatileIdleTime (default: 3)
> >     Idle time in seconds until the volatile index part is moved to a
> >     persistent index even though minMergeDocs is not reached.
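> >
> > (These, along with resultFetchSize and supportHighlighting, are set as
> > params on the SearchIndex element in each workspace's workspace.xml; a
> > minimal illustrative snippet, values are examples only:)
> >
> >   <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> >     <param name="path" value="${wsp.home}/index"/>
> >     <param name="maxVolatileIndexSize" value="1048576"/>
> >     <param name="volatileIdleTime" value="3"/>
> >     <param name="resultFetchSize" value="100"/>
> >     <param name="supportHighlighting" value="false"/>
> >   </SearchIndex>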
> >
> > 1 GB is quite low. My company has run a production instance of Jackrabbit
> > with 1 GB of memory for over two years and it has not had any issues. The
> > only time I saw huge spikes in memory consumption was on large operations
> > such as cloning a node with many descendants or querying a data set with a
> > 10k+ result size.
> >
> > You said you have gathered a heap dump; this should point you toward the
> > objects that are consuming the majority of the heap. That would be a good
> > start to see whether it is Jackrabbit or your application causing the issue.
> > What type of deployment of Jackrabbit are you running
> > (http://jackrabbit.apache.org/jcr/deployment-models.html)? Is it completely
> > isolated or embedded in your application?
> >
> > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin <Kevin-Roll@idexx.com>
> > wrote:
> >
> > > Hi, Ben. I was referring to the following page:
> > >
> > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > >
> > > "The most recent generation of the search index is held completely in
> > > memory."
> > >
> > > Perhaps I am misreading this, or perhaps it is wrong, but I interpreted it
> > > to mean that the size of the index in memory would be proportional to the
> > > repository size. I hope this is not true!
> > >
> > > I am currently trying to get information from our QA team about the
> > > approximate number of nodes in the repository. We are not currently setting
> > > an explicit heap size - in the dumps I've examined it seems to run out at
> > > around 240 MB. I'm pushing to set something explicit, but I'm now hearing
> > > that older hardware has only 1 GB of memory, which gives us practically
> > > nowhere to go.
> > >
> > > The queries that I'm doing are not very fancy... for example: "select *
> > > from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually
> > > rewriting that task so the query will be even simpler.
> > >
> > > Thanks for the help!
> > >
> > >
> > > -----Original Message-----
> > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > Sent: Monday, November 23, 2015 5:21 PM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > It is a good idea to turn off supportHighlighting, especially if you
> > > aren't using the functionality; it takes up a lot of extra space within
> > > the index. I am not sure where you heard that the Lucene index is kept in
> > > memory, but I am pretty certain that is wrong. Can you point me to the
> > > documentation that says this?
> > >
> > > Also, what data set sizes are you querying against (10k nodes? 100k nodes?
> > > 1 million nodes?)
> > > What heap size do you have set on the JVM?
> > > Reducing the resultFetchSize should help reduce the memory footprint of
> > > queries.
> > > I am assuming you are using the QueryManager to retrieve nodes. Can you
> > > give an example query that you are using?
> > >
> > > I have developed a patch to improve query performance on large data sets
> > > with Jackrabbit 2.x. I should be done soon if I can gather together a few
> > > hours to finish up my work. If you would like, you can give it a try once
> > > I finish.
> > >
> > > Some other repository settings you might want to look at are:
> > >
> > >  <PersistenceManager
> > >      class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > >    <param name="bundleCacheSize" value="256"/>
> > >  </PersistenceManager>
> > >
> > >  <ISMLocking
> > >      class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > >
> > >
> > > Hope this helps.
> > >
> > >
> >
>