Subject: Re: How are developers using jackrabbit
From: Enis Soztutar <enis.soz.nutch@gmail.com>
Date: Thu, 09 Aug 2007 09:30:24 +0300
To: users@jackrabbit.apache.org

Hi,

I just encountered this message by chance, but I would like to share my
opinion about it.

Ard Schrijvers wrote:
> Hello Vikas,
>
> Apparently nobody has had time yet to react to your little survey, so I
> will just try to give my 2 cents. IMO your questions are strongly
> intertwined with how you set up your content modelling, which kind of
> data you have (binary data vs XML), what kind of usage you expect
> (searches vs iterating nodes), etc., and are therefore hard (impossible)
> to judge.
>
> Though I am by far not yet in a position to back my remarks with code,
> proper examples, or benchmarking, I do think you have a use case that
> kind of "needs the best of all worlds" regarding storing / indexing /
> iterating nodes / searching (with sorting), etc.
>
> I am not yet aware of the ins and outs of many parts of JR, but at
> least storing 10K child nodes per node is AFAIK currently not an option.
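
(As an aside: a common way around that flat-hierarchy limit is to shard
children under intermediate nodes. Below is a minimal sketch against the
plain JCR 1.0 API; the date-based path, node names, and nt:unstructured
type are illustrative assumptions, not something JR prescribes.)

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class ShardedImport {

        /**
         * Stores a document under /docs/yyyy/mm/dd/... instead of making it
         * one of 10K+ direct children of a single parent node.
         */
        public static Node addDocument(Session session, String year,
                String month, String day, String name)
                throws RepositoryException {
            Node parent = session.getRootNode();
            for (String segment : new String[] {"docs", year, month, day}) {
                parent = parent.hasNode(segment)
                        ? parent.getNode(segment)
                        : parent.addNode(segment, "nt:unstructured");
            }
            Node doc = parent.addNode(name, "nt:unstructured");
            session.save();
            return doc;
        }
    }

Any roughly uniform key (a hash prefix works as well as a date) keeps each
parent's child count small, which is the point of the sharding.
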
> Regarding your use case: having around 36.000.000 documents after one
> year in one single workspace with terabytes of data, so 100.000.000
> docs within three years... Well, I think you at least have to tune some
> settings :-)
>
> Though, just to grasp the complexity of your requirements, I'll take
> the searching part as an example: many millions of documents and
> terabytes of data, and you want fast searching, right? Well, there is
> just this Apache project out there, Hadoop, a Lucene subproject built
> on the MapReduce model [1], to enable your fast searching. Though,
> obviously, this is a bleeding-edge Apache project, and obviously not
> (yet...) available in JR. But as a next requirement you might also need
> fast faceted navigation... then you need the bleeding-edge Solr [2]
> technology, so you somehow need the best of Solr and Hadoop. Since, of
> course, we also want authorisation, we need to add some bleeding-edge,
> not-yet-existing project that combines the best of two bleeding-edge
> projects to include authorisation on searches. And for all of these
> projects we need to know exactly how to tune the settings, because OOMs
> might occur in any project if you do not know the ins and outs of its
> configuration. I think you grasp what I am trying to say: with
> 100.000.000 docs and many terabytes of data, searching becomes much
> more complex than the current JR Lucene implementation handles, IMO.

Hadoop enables one to deal with millions of files containing TBs of
data. The data is stored in what is called a distributed file system,
and it can be processed in parallel using the map-reduce programming
paradigm. The framework is fault tolerant with respect to both data
storage and computation. Regarding searching: as far as I know, JR uses
Lucene to store the index, but Lucene has some issues with write-only
indexes, so Solr (built on top of Lucene) can be a higher-level solution
to that.

I have been working on WebDAV integration of the Hadoop filesystem
interface (using JR), and have developed a working patch for Hadoop. I
would be glad if you checked it out
(https://issues.apache.org/jira/browse/HADOOP-496). Any feedback will be
appreciated (since I am neither familiar with JR at all, nor have a deep
understanding of JR's data flow model).
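
To give a flavour of the Hadoop side, here is a minimal, illustrative
sketch of writing and reading a file through Hadoop's FileSystem API.
The path is a placeholder, and this is not taken from the patch above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name from the Hadoop config on the
            // classpath; without a cluster configured this falls back
            // to the local file system.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/dfs-example.txt"); // illustrative path
            FSDataOutputStream out = fs.create(file, true); // overwrite if present
            out.writeBytes("hello, distributed file system\n");
            out.close();

            BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
        }
    }

The same code runs against local disk or a DFS cluster, which is what
makes the abstraction useful for a WebDAV front end.
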
> For any other parts of JR, probably similar arguments hold regarding
> the requirements you have to deal with, but I think *any* system out
> there, open or closed, will have these (though others might disagree a
> little on this, because my knowledge is too shallow).
>
> I am not aware of available benchmarks or JR performance numbers, but
> perhaps others are.
>
> Regards, Ard
>
> [1] http://lucene.apache.org/hadoop/
> [2] http://lucene.apache.org/solr/
>
>> We are concerned regarding Jackrabbit and its ability to handle
>> really heavy load requirements. We are looking to use Jackrabbit to
>> push approximately 300-500 nodes a minute, ranging up to 100K nodes a
>> day. The live repository could easily grow to a few terabytes, all
>> using a single workspace.
>>
>> We wanted to ask the community how Jackrabbit is actually being used
>> in production environments. So here is an email poll, if you will.
>>
>> . How much data are you pushing into Jackrabbit at a time?
>>
>> . Are you using burst modes or a continuous data feed?
>>
>> . What is the biggest repository (in size) that you have used or
>> heard of being used with Jackrabbit?
>>
>> . Are you satisfied with the response times of your queries?
>>
>> . Have you refrained from having more than 10K child nodes per node?
>>
>> . What caching mechanism are you using? Are you modifying the default
>> caching that comes with Jackrabbit?
>>
>> . Are you using the default data store mechanisms, such as the file
>> PMs and DB PMs, or have you built a custom PM or used one from Day
>> Systems?
>>
>> I hope these answers will help us and the community as a whole.
>>
>> Thanks.
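
P.S. On the question about query response times: a minimal, illustrative
way to time a JCR query from client code, assuming the standard JCR 1.0
API. The XPath query string is just a placeholder:

    import javax.jcr.NodeIterator;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    public class QueryTiming {
        public static void timeQuery(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            // Placeholder query: all nt:file nodes; XPath is the common
            // dialect in JCR 1.0.
            Query query = qm.createQuery("//element(*, nt:file)", Query.XPATH);

            long start = System.currentTimeMillis();
            QueryResult result = query.execute();
            NodeIterator nodes = result.getNodes();
            long hits = 0;
            while (nodes.hasNext()) { // iterating also exercises lazy loading
                nodes.nextNode();
                hits++;
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(hits + " hits in " + elapsed + " ms");
        }
    }

Iterating the full result matters for such a measurement, since execute()
alone may not touch every node the repository has to load.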