From: Tom Davidson
To: "dev@nutch.apache.org"
CC: "gora-dev@incubator.apache.org"
Subject: RE: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]
Date: Tue, 9 Aug 2011 17:08:42 +0000
Message-ID: <8FC6939DDF1D1440A318713F1E4E94BC026491@NAEXSAN01.semdirector.local>

Hi All,

I have been using Nutch 1.x for the last 9 months or so and it works well for large scale crawls up to around a billion pages. However, the inherent lack of random access in HDFS really starts to become a burden on our Hadoop cluster when going through the whole generate/update/fetch cycle. Being able to circumvent HDFS and store data directly in Cassandra/HBase/SQL via GORA is an exciting development in Nutch 2, so I have an interest in making it succeed.

That said, I too have been frustrated by the state of affairs on Nutch 2. I am willing to help. I see that Nutch is mainly an ant/ivy build process, but there is an attempt at using Maven? IMO, ant/ivy seems a bit dated and I am really much more comfortable working with Maven. Would there be an interest in completely moving to Maven as the build tool of choice?

From: Kirby Bohling [mailto:kirby.bohling@gmail.com]
Sent: Tuesday, August 09, 2011 8:31 AM
To: dev@nutch.apache.org
Cc: gora-dev@incubator.apache.org
Subject: Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

Julien,

On Tue, Aug 9, 2011 at 10:10 AM, Julien Nioche wrote:

Hi Kirby,

Grumble, Grumble. (adding dev@nutch, as that is more than likely where this discussion really belongs)...

am adding gora-dev@incubator.apache.org as well

It'd be really nice if folks could just follow the commands in the nightly build, and get a build pushed out. I've pointed this out previously, and was told this would be fixed "shortly" (right after GORA-0.1 finally got released, but not published in a public Maven repo; as far as I know, it still isn't published, but I stopped checking on it).

I understand and share your frustration; however, you need to bear in mind that things are done only if people volunteer and have time - usually taken from their holiday, weekends, evenings. Chris (who is the de facto release master for Nutch and Gora) has not had the time and nobody else has volunteered to do it.

I don't mean to be a complainer, I'd happily try and contribute fixes on this one, but most of this would likely have to be done on Hudson/Jenkins. I think you're addressing a larger issue than I really meant. My point was, somehow a developer does a build on their desktop, and however that is done should be duplicated on Hudson/Jenkins. If you need the trunk of Gora, then is it possible to check it out, build it and install it to a local repo, and then build Nutch via Hudson/Jenkins? Whatever it takes to get a build should be what the CI server is doing. The repeatable but failing builds are what really confuse and frustrate me. The nightly/CI build should be automating what devs do on their desktops to ensure it'll work on a clean setup. Right now, it just tells you that for the last year, the totally obvious steps will lead to a failure.

I can figure out all of the configuration issues for Hudson/Jenkins to make it work, if somebody can push that into the Apache version. However, I think answering your questions first would be a good idea. My totally non-binding +1 for setting up a CI/Nightly build for the various stable branches too; the only one I found on Apache was for trunk.

As it happens, yesterday was the 1 year anniversary of the last successful Hudson/Jenkins build... If that actually worked, we could point people towards it as a useful recipe for how to get a build working off trunk. I haven't been following Nutch too closely, but it always strikes me as really odd that there's a nightly build and it doesn't bother anybody that it fails all the time (and that there isn't a nightly build for the stable branches).

The real issue behind all this is what we should do with Nutch 2.0. What follows is only my opinion and I would love to hear what others have to say on this subject.

Since we (actually mostly Dogacan) wrote 2.0 and delegated the storage to Gora, the latter hasn't really taken off since incubation. There have been some modest contributions to it but it does not seem to be used much and there is virtually nothing happening on it in terms of development. More worryingly, the people who initially contributed to it are not very active on the project (such is life, new jobs, different projects, etc...) anymore*. As for Nutch 2.0, it hasn't made any progress in the last 12 months: we still have the same bugs, the tests do not work, the build has to be done manually etc...

At the same time, there has been a new lease of life in Nutch as a whole: there is definitely more activity on the mailing lists, new users, new active committers etc... and quite a few bugfixes and improvements - most of them backported from what had been done in the trunk - and people seem fairly happy with what we can do with 1.4.

So the question is: what shall we do with 2.0? Here are a few possibilities:

a) put some effort into it, fix the bugs and make it so that it can be used instead of 1.x
b) shelve it and leave it for enthusiasts to play with + make 1.x the trunk again
c) do nothing: keep 2.0 and 1.x in parallel (but having to maintain two branches is quite a pain)
d) abandon the idea of a neutral storage layer with Gora and hardwire it to e.g. HBase

Option (a) has not happened in the last 12 months and I am not very hopeful about it.

What do you guys think?

I know nothing about the 2.0 branch, and can't really contribute to that conversation (that job issue interferes with all my free time).

Kirby

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com