jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: How are developers using jackrabbit
Date Tue, 31 Jul 2007 20:15:44 GMT
Hello Vikas,

apparently nobody has had time yet to react to your little survey, so I will just try to give
my two cents. IMO your questions are strongly intertwined with how you set up your content modelling,
what kind of data you have (binary data vs. XML), what kind of usage you expect (searches
vs. iterating nodes), etc. etc., and are therefore hard (or impossible) to judge in the abstract.

Though I am by no means in a position to back my remarks with code, proper examples, or
benchmarks, I do think you have a use case that demands "the best of all worlds"
regarding storing, indexing, iterating nodes, and searching (with sorting).

I am not yet aware of the ins and outs of many parts of JR, but storing 10K child
nodes per node is AFAIK currently not an option. And regarding your use case: around 36,000,000
documents after one year in one single workspace, with terabytes of data, so 100,000,000 docs within
three years... Well, I think you will at least have to tune some settings :-)

Still, just to grasp the complexity of your requirements, I'll take the searching
part as an example: many millions of documents, terabytes of data, and you want
fast searching, right? Well, there is this Apache project out there, Hadoop, a Lucene
subproject built on the MapReduce algorithm [1], that could enable your fast searching. Obviously, though,
this is a bleeding-edge Apache project, and obviously not (yet...) available in JR.
But as a next requirement you might also need fast faceted navigation... then
you need the bleeding-edge Solr [2] technology, so you somehow need the best of Solr
and Hadoop combined. Since, of course, we also want authorisation, we need to add some bleeding-edge,
not-yet-existing project that combines the best of those two bleeding-edge projects
to include authorisation on searches. And for all of these projects we need to know exactly how
to tune the settings, because OOMs can occur in any project if you do not know the ins
and outs of its configuration. I think you grasp what I am trying to say: with 100,000,000
docs and many terabytes of data, searching becomes much more complex than the current JR Lucene
implementation can handle, IMO.

For the other parts of JR, similar arguments probably hold regarding the requirements you have
to deal with, but I think *any* system, open or closed source, will face them (though
others might disagree a little here, since my knowledge is shallow).

I am not aware of available benchmarks or JR performance numbers, but perhaps others are.

Regards Ard

[1] http://lucene.apache.org/hadoop/
[2] http://lucene.apache.org/solr/

> We are concerned regarding Jackrabbit and its ability to handle really
> heavy load requirements. We are looking to use jackrabbit to push
> approximately 300-500 nodes a minute ranging to 100K nodes a day. The
> live repository could easily go to be a few terabytes all using a
> single workspace.
> We wanted to ask the community how is jackrabbit actually being used
> in production environments. So here is an email poll, if you will.
> . How much of data are you pushing into jackrabbit at a time?
> . Are you using burst modes or continuous data feed?
> . What is the biggest repository (in size) that you have used or heard
> of being used with jackrabbit?
> . Are you satisfied with the response times of your queries?
> . Have you refrained from having more than 10K child nodes per node?
> . What caching mechanism are you using? Are you modifying the default
> caching that comes with jackrabbit?
> . Are you using the default data store mechanisms such as file PMs and
> db PMs or have you built a custom PM or used one from Day systems?
> I hope these answers would help us and the community on the whole.
> Thanks.
