Subject: Re: How are developers using jackrabbit
From: Enis Soztutar <enis.soz.nutch@gmail.com>
Date: Thu, 09 Aug 2007 09:30:24 +0300
To: users@jackrabbit.apache.org

Hi,

I just encountered this message by chance, but I would like to share my
opinion about it.

Ard Schrijvers wrote:
> Hello Vikas,
>
> Apparently nobody has had time yet to react to your little survey, so I
> will just try to give my 2 cents. IMO your questions are strongly
> intertwined with how you set up your content modelling, which kind of
> data you have (binary data vs XML), what kind of usage you expect
> (searches vs iterating nodes), etc., and are therefore hard (impossible)
> to judge.
>
> Though I am by far not yet in a position to back my remarks with code,
> proper examples, or benchmarking, I do think you have a use case that
> kind of "needs the best of all worlds" regarding storing / indexing /
> iterating nodes / searching (with sorting), etc.
>
> I am not yet aware of the ins and outs of many parts of JR, but at
> least storing 10K child nodes per node is AFAIK currently not an option.
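
(As an aside: a common way around that flat-hierarchy limit is to shard
children under intermediate nodes. Below is a minimal sketch against the
plain JCR 1.0 API; the date-based path, node names, and nt:unstructured
type are illustrative assumptions, not something JR prescribes.)

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class ShardedImport {

        /**
         * Stores a document under /docs/yyyy/mm/dd/... instead of making it
         * one of 10K+ direct children of a single parent node.
         */
        public static Node addDocument(Session session, String year,
                String month, String day, String name)
                throws RepositoryException {
            Node parent = session.getRootNode();
            for (String segment : new String[] {"docs", year, month, day}) {
                parent = parent.hasNode(segment)
                        ? parent.getNode(segment)
                        : parent.addNode(segment, "nt:unstructured");
            }
            Node doc = parent.addNode(name, "nt:unstructured");
            session.save();
            return doc;
        }
    }

Any roughly uniform key (a hash prefix works as well as a date) keeps each
parent's child count small, which is the point of the sharding.
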
> Regarding your use case: having around 36.000.000 documents after one
> year in one single workspace with terabytes of data, so 100.000.000
> docs within three years... Well, I think you at least have to tune some
> settings :-)
>
> Though, just to grasp the complexity of your requirements, I'll take
> the searching part as an example: many millions of documents and
> terabytes of data, and you want fast searching, right? Well, there is
> just this Apache project out there, Hadoop, a Lucene subproject built
> on the MapReduce model [1], to enable your fast searching. Though,
> obviously, this is a bleeding-edge Apache project, and obviously not
> (yet...) available in JR. But as a next requirement you might also need
> fast faceted navigation... then you need the bleeding-edge Solr [2]
> technology, so you somehow need the best of Solr and Hadoop. Since, of
> course, we also want authorisation, we need to add some bleeding-edge,
> not-yet-existing project that combines the best of two bleeding-edge
> projects to include authorisation on searches. And for all of these
> projects we need to know exactly how to tune the settings, because OOMs
> might occur in any project if you do not know the ins and outs of its
> configuration. I think you grasp what I am trying to say: with
> 100.000.000 docs and many terabytes of data, searching becomes much
> more complex than the current JR Lucene implementation handles, IMO.

Hadoop enables one to deal with millions of files containing TBs of
data. The data is stored in what is called a distributed file system,
and it can be processed in parallel using the map-reduce programming
paradigm. The framework is fault tolerant with respect to both data
storage and computation. Regarding searching: as far as I know, JR uses
Lucene to store the index, but Lucene has some issues with write-only
indexes, so Solr (built on top of Lucene) can be a higher-level solution
to that.

I have been working on WebDAV integration of the Hadoop filesystem
interface (using JR), and have developed a working patch for Hadoop. I
would be glad if you checked it out
(https://issues.apache.org/jira/browse/HADOOP-496). Any feedback will be
appreciated (since I am neither familiar with JR at all, nor have a deep
understanding of JR's data flow model).
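
To give a flavour of the Hadoop side, here is a minimal, illustrative
sketch of writing and reading a file through Hadoop's FileSystem API.
The path is a placeholder, and this is not taken from the patch above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name from the Hadoop config on the
            // classpath; without a cluster configured this falls back
            // to the local file system.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/dfs-example.txt"); // illustrative path
            FSDataOutputStream out = fs.create(file, true); // overwrite if present
            out.writeBytes("hello, distributed file system\n");
            out.close();

            BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
        }
    }

The same code runs against local disk or a DFS cluster, which is what
makes the abstraction useful for a WebDAV front end.
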
> For any other parts of JR, probably similar arguments hold regarding
> the requirements you have to deal with, but I think *any* system out
> there, open or closed, will have these (though others might disagree a
> little on this, because my knowledge is too shallow).
>
> I am not aware of available benchmarks or JR performance numbers, but
> perhaps others are.
>
> Regards, Ard
>
> [1] http://lucene.apache.org/hadoop/
> [2] http://lucene.apache.org/solr/
>
>> We are concerned regarding Jackrabbit and its ability to handle
>> really heavy load requirements. We are looking to use Jackrabbit to
>> push approximately 300-500 nodes a minute, ranging up to 100K nodes a
>> day. The live repository could easily grow to a few terabytes, all
>> using a single workspace.
>>
>> We wanted to ask the community how Jackrabbit is actually being used
>> in production environments. So here is an email poll, if you will.
>>
>> . How much data are you pushing into Jackrabbit at a time?
>>
>> . Are you using burst modes or a continuous data feed?
>>
>> . What is the biggest repository (in size) that you have used or
>> heard of being used with Jackrabbit?
>>
>> . Are you satisfied with the response times of your queries?
>>
>> . Have you refrained from having more than 10K child nodes per node?
>>
>> . What caching mechanism are you using? Are you modifying the default
>> caching that comes with Jackrabbit?
>>
>> . Are you using the default data store mechanisms, such as the file
>> PMs and DB PMs, or have you built a custom PM or used one from Day
>> Systems?
>>
>> I hope these answers will help us and the community as a whole.
>>
>> Thanks.
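
P.S. On the question about query response times: a minimal, illustrative
way to time a JCR query from client code, assuming the standard JCR 1.0
API. The XPath query string is just a placeholder:

    import javax.jcr.NodeIterator;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    public class QueryTiming {
        public static void timeQuery(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            // Placeholder query: all nt:file nodes; XPath is the common
            // dialect in JCR 1.0.
            Query query = qm.createQuery("//element(*, nt:file)", Query.XPATH);

            long start = System.currentTimeMillis();
            QueryResult result = query.execute();
            NodeIterator nodes = result.getNodes();
            long hits = 0;
            while (nodes.hasNext()) { // iterating also exercises lazy loading
                nodes.nextNode();
                hits++;
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(hits + " hits in " + elapsed + " ms");
        }
    }

Iterating the full result matters for such a measurement, since execute()
alone may not touch every node the repository has to load.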