jackrabbit-users mailing list archives

From Enis Soztutar <enis.soz.nu...@gmail.com>
Subject Re: How are developers using jackrabbit
Date Thu, 09 Aug 2007 06:30:24 GMT

I came across this message by chance, but I would like to share my
opinion on it.

Ard Schrijvers wrote:
> Hello Vikas,
> Apparently nobody has had time to react to your little survey yet, so I will just try to
> give my 2 cents. IMO your questions are strongly intertwined with how you set up your content
> modelling, which kind of data you have (binary data vs xml), what kind of usage you expect
> (searches vs iterating nodes), etc etc, and therefore hard (impossible) to judge.
> Though I am by far not yet in a position to back my remarks with code, proper examples
> or benchmarking, I do think you have a use case that kind of "needs
> the best of all worlds" regarding storing / indexing / iterating nodes / searching (with
> sorting) etc.
> I am not yet aware of the ins and outs on many parts of the JR, but at least storing
> 10K child nodes per node is AFAIK currently not an option. Regarding your usecase, having
> around 36.000.000 documents after one year in one single ws with terabytes of data...so 100.000.000
> docs within three years...Well, I think you at least have to tune some settings :-)
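
As a rough sanity check on these figures, here is a small sketch. It
assumes the 100K nodes a day from the original question and a 365-day
year; the function names are made up for illustration only.

```python
MINUTES_PER_DAY = 24 * 60


def nodes_per_year(nodes_per_day, days=365):
    """Total nodes added in one year at a constant daily rate."""
    return nodes_per_day * days


def sustained_nodes_per_day(nodes_per_minute):
    """Daily total if a per-minute rate were kept up around the clock."""
    return nodes_per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    daily = 100_000  # assumption: the "100K nodes a day" from the question
    print(nodes_per_year(daily))          # 36500000
    print(3 * nodes_per_year(daily))      # 109500000
    print(sustained_nodes_per_day(300))   # 432000
    print(sustained_nodes_per_day(500))   # 720000
```

So 100K nodes a day gives roughly the 36 million per year and 100
million in three years mentioned above. Note that 300-500 nodes a minute
kept up all day would be far more than 100K a day, so the per-minute
figure presumably describes bursts rather than a continuous feed.
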
> Though, something just to grasp the complexity of your requirements, I'll take the searching
> part as an example for it: many millions of documents and terabytes of data, and you want
> fast searching, right? Well, there is just this apache project out there, Hadoop, a lucene
> subproject built on the MapReduce algorithm [1] to enable your fast searching. Though, obviously,
> this is a bleeding edge apache toplevel project, and obviously not (yet...) available in JR.
> But, as a next requirement you might have that you also need fast faceted navigation..then
> you need the bleeding edge Solr[2] technology, so you somehow need to have the best of Solr
> and Hadoop. Since, of course, we also want authorisation, we need to add some bleeding edge
> not yet existing toplevel project, that combines the best of two bleeding edge toplevel projects
> to include authorisation on searches. And, of all projects, we do need to know exactly how
> to tune the settings, because OOMs might occur in any project if you do not know the ins
> and outs of configuration. I think you grasp the idea of what I am trying to say: with 100.000.000
> docs and many terabytes of data, searching becomes much more complex than the current JR lucene
> impl IMO.
Hadoop enables one to deal with millions of files containing TBs of
data. The data is stored in what is called a distributed file system,
and it can be processed in parallel using the map-reduce programming
paradigm. The framework is fault tolerant with regard to both data
storage and computation. Regarding searching, as far as I know JR uses
Lucene to store the index, but Lucene has some issues with write only
indexes, so Solr (built on top of Lucene) can be a high-level solution
to that.
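
To make the map-reduce idea a bit more concrete, here is a minimal
single-machine sketch in plain Python. It is not Hadoop's API: the
function names (map_phase, shuffle, reduce_phase, word_count) are
invented for this example, and a thread pool merely stands in for the
cluster nodes that would run the map tasks in a real deployment.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents, workers=4):
    """Count words across documents with a map, shuffle, reduce pipeline."""
    # Each document is mapped independently, so the map tasks can run
    # in parallel (here on threads, as a stand-in for cluster nodes).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = [
        "jackrabbit stores nodes",
        "hadoop stores files",
        "lucene indexes nodes",
    ]
    print(word_count(docs))
```

The point of the split is that map tasks never share state and reduce
tasks only see the values for their own key, which is what lets a
framework distribute the work and rerun a failed task in isolation.
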

I have been working on WebDAV integration of the filesystem interface
for Hadoop (using JR), and have developed a working patch for Hadoop. I
would be glad if you checked it out
(https://issues.apache.org/jira/browse/HADOOP-496). Any feedback will be
appreciated (since I am not familiar with JR at all, and do not have a
deep understanding of JR's data flow model).

> For any other parts in JR probably similar arguments hold regarding the requirements
> you have to deal with, but I think *any* system out in the open and closed will have these
> (though others might digress a little on this because my knowledge is too shallow).
> I am not aware of available benchmarks or JR performance numbers, but perhaps others are.
> Regards Ard
> [1] http://lucene.apache.org/hadoop/
> [2] http://lucene.apache.org/solr/
>> We are concerned regarding Jackrabbit and its ability to handle really
>> heavy load requirements. We are looking to use jackrabbit to push
>> approximately 300-500 nodes a minute ranging to 100K nodes a day. The
>> live repository could easily go to be a few terabytes all using a
>> single workspace.
>> We wanted to ask the community how is jackrabbit actually being used
>> in production environments. So here is an email poll if you will.
>> . How much data are you pushing into jackrabbit at a time?
>> . Are you using burst modes or continuous data feed?
>> . What is the biggest repository (in size) that you have used or heard
>> of being used with jackrabbit?
>> . Are you satisfied with the response times of your queries?
>> . Have you refrained from having more than 10K child nodes per node?
>> . What caching mechanism are you using? Are you modifying the default
>> caching that comes with jackrabbit?
>> . Are you using the default data store mechanisms such as file PMs and
>> db PMs or have you built a custom PM or used one from Day systems?
>> I hope these answers would help us and the community on the whole.
>> Thanks.
