From: Erik Hatcher
To: "Lucene Developers List"
Date: Mon, 30 Jun 2003 20:41:37 -0400
Subject: Re: Lucene crawler plan

If you are after a pure file system indexing abstraction, check out the
'ant' project in the sandbox. It's got a DocumentHandler abstraction
allowing it to be a bit pluggable. It's not perfect, but it has worked
quite sufficiently for me for quite some time.

	Erik

On Monday, June 30, 2003, at 08:26 PM, Peter Becker wrote:

> Clemens Marschner wrote:
>
>> There's an experimental webcrawler in the lucene-sandbox area called
>> larm-webcrawler (see
>> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),
>> and a project on Sourceforge (http://larm.sf.net) that tries to leverage
>> this on a higher level. I want to encourage you to go on that side and read
>> through the specs in sourceforge's CVS.
>>
> I've done that by now -- my first problem was to identify LARM as the
> relevant project, but then things were reasonably easy to find.
>
>> It concludes pretty much everything that Andy wrote in his proposal, and
>> more. The project only contains conceptual documents at this time, but if
>> you're willing to contribute actively, that's very appreciated.
>>
> In many ways the project aims too high for us. We are interested only
> in the file system part and our time is limited. My hope was that
> someone would say there would be a basic framework somewhere where we
> can put our code, but due to the time limitations we will rather do
> our own thing. But this is maybe not as bad as it sounds since (a) our
> original plan was very close to what you describe in certain parts of
> the system, (b) we have read your documentation and (c) our code will
> be BSD-licensed.
>
> The main ideas we have are:
> - map file types to document processors
> - use the java.io.FileFilter interface as base for the mappings
> - the document processors will probably have a two method interface:
>     DocumentSummary processDocument(URL);
>     String getDisplayName();
> - the DocumentSummary class will model the common attributes like author,
>   title, text, etc. with a Properties object to be extensible. Its main
>   purpose is to separate indexing concerns like stored/unstored and
>   tokenized/untokenized from the document processors
> - the display name will be used in the UI to create lists of
>   FileFilter->DocumentProcessor mappings
> - there will be some crawler code for the file system, but of course
>   that is a lot easier
>
> Many of these things will not extend straightaway into the web
> context, but I think the main work we will do will be in implementing
> the different DocumentProcessors. That part should be reusable. The
> mapping idea should be reusable, although FileFilter would have to be
> replaced with something more abstract, at least a URLFilter. My
> experience with Java networking is not good enough to judge the
> complexity of that.
>
> We expect to have the relevant parts of this done next week. Code will
> be on Sourceforge
> (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/), it
> might be at least useful as inspiration :-) We are also looking into
> alternatives for parsing PDF and other formats. We have a lot of
> problems with PDFBox at the moment, and there might be other
> candidates (http://www.cs.berkeley.edu/~phelps/Multivalent/). And we
> are looking into the option to use the UDK for indexing
> (http://udk.openoffice.org/), although that most likely will
> complicate deployment and increase program size quite a bit.
> One of the problems we have is that we have some interesting test cases
> for the parsing tools, but we can't give them away and don't have the
> time to debug ourselves. We have a file which causes PDFBox to get stuck
> without any feedback and an XLS file which causes POI to loop with
> funny messages for a long time until we run out of memory (with
> -mx500m). But that is something we have to talk to the other projects
> about.
>
> The point of this waffle is: if you think some of our ideas are not as
> good as they should be or there are things that might affect reuse,
> please shout now :-) We start coding this right now.
>
>> Unfortunately I have to stop my efforts regarding LARM. Long story short: My
>> future employer says it's too close to their business. But in contrast to
>> other open source projects, there's already lots of ideas in that document
>> and lots of code in the old crawler. If you wish to contribute, it's now up
>> to you.
>>
> Fair enough. I guess as professional developer you can never be
> completely free from considering IP issues.
>
> Grüsse,
>   Peter
>
>> Clemens
>>
>> ----- Original Message -----
>> From: "Andrew C. Oliver"
>> To: "Peter Becker"
>> Cc: "Lucene Developers List"
>> Sent: Friday, June 27, 2003 2:53 AM
>> Subject: Re: Lucene crawler plan
>>
>>> On 6/26/03 8:33 PM, "Peter Becker" wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> are you the Andy signing this:
>>>> http://jakarta.apache.org/lucene/docs/luceneplan.html? If no -- do you
>>>> know who wrote the page and could you forward this email? Thanks. BTW:
>>>> your website link on http://jakarta.apache.org/lucene/docs/whoweare.html
>>>> is dead.
>>>>
>>> Yes I wrote it.
>>>
>>>> The question is: is there some code already? If yes: can we get it? Can
>>>> we join the effort?
>>>> If no: what are things we should consider doing to
>>>> increase our chances that you guys accept our code in the end? We are
>>>> not really interested in maintaining the crawler bits and pieces, our
>>>> interest is in the visualization. We are happy to get something going as
>>>> part of our little demonstrator, but then we'd give it to you and hope
>>>> someone picks up maintenance.
>>>>
>>> I never wrote any code, but there is code in lucene-contrib which realized
>>> most of what is in this document. I was going to write code, but someone
>>> beat me to the punch and I was like "wow I have things I can do that others
>>> won't do for me" and moved on :-)
>>>
>>> I'm cc'ing lucene developers list. You'll find plenty of folks interested
>>> in working with you on this.
>>>
>>> -Andy
>>>
>>>> Is this all an option anyway? It is ok to say no ;-)
>>>>
>>>> Regards,
>>>>   Peter
>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
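
For readers following the thread, the design Peter lists in the quoted message (FileFilter->DocumentProcessor mappings, a two-method processor interface, and a Properties-backed summary) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the Docco code: ProcessorRegistry, PlainTextProcessor, DoccoSketch and the "title"/"text" attribute keys are hypothetical names, and the sketch uses present-day Java (generics, lambdas, InputStream.readAllBytes) that was not available in 2003.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Small demo: register one mapping, then process a temporary .txt file. */
class DoccoSketch {
    public static void main(String[] args) throws Exception {
        ProcessorRegistry registry = new ProcessorRegistry();
        registry.register(f -> f.getName().endsWith(".txt"), new PlainTextProcessor());

        File sample = File.createTempFile("sample", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(), "hello".getBytes(StandardCharsets.UTF_8));

        DocumentProcessor processor = registry.lookup(sample);
        DocumentSummary summary = processor.processDocument(sample.toURI().toURL());
        System.out.println(processor.getDisplayName() + ": "
                + summary.attributes.getProperty("text"));
    }
}

/** Common attributes (author, title, text, ...) held in an extensible Properties object. */
class DocumentSummary {
    final Properties attributes = new Properties();
}

/** The two-method interface named in the message. */
interface DocumentProcessor {
    DocumentSummary processDocument(URL url) throws IOException;
    String getDisplayName();
}

/** Hypothetical registry: the first registered FileFilter that accepts a file wins. */
class ProcessorRegistry {
    private final Map<FileFilter, DocumentProcessor> mappings = new LinkedHashMap<>();

    void register(FileFilter filter, DocumentProcessor processor) {
        mappings.put(filter, processor);
    }

    /** Returns the matching processor, or null if no filter accepts the file. */
    DocumentProcessor lookup(File file) {
        for (Map.Entry<FileFilter, DocumentProcessor> entry : mappings.entrySet()) {
            if (entry.getKey().accept(file)) {
                return entry.getValue();
            }
        }
        return null;
    }
}

/** Hypothetical processor: stores the URL path as title and the raw content as text. */
class PlainTextProcessor implements DocumentProcessor {
    @Override
    public DocumentSummary processDocument(URL url) throws IOException {
        DocumentSummary summary = new DocumentSummary();
        summary.attributes.setProperty("title", url.getPath());
        try (InputStream in = url.openStream()) {
            summary.attributes.setProperty("text",
                    new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        return summary;
    }

    @Override
    public String getDisplayName() {
        return "Plain text";
    }
}
```

Because the processors only see a URL and return a DocumentSummary, decisions such as stored/unstored or tokenized/untokenized stay with whatever code turns the summary into index fields. Only the registry key type (FileFilter here) would need to change for a URL-based filter.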