From: Peter Becker
Date: Tue, 01 Jul 2003 12:21:58 +1000
To: Lucene Developers List
Subject: Re: Lucene crawler plan
Message-ID: <3F00F046.9090303@dstc.edu.au>

Thanks Erik, this is far closer to what
we are looking for. Using Ant is an interesting idea, although it probably won't help us with the UI tool. But we could try to layer things so that we can use them for both -- we want to get some more sophisticated index management anyway.

The option to create the index in one place and use it somewhere else would be great -- during testing and demoing we ran into the problem that we wanted to demo on a Windows box while using a Unix filesystem mounted via SMB/Samba. Symlinks are no fun in this case :-( To work around this we need to develop some notion of a base URL; then we could easily mount an index created on one machine on another -- even if the underlying OS changes.

To go Enterprise we would still need some security concept, which we probably won't do before someone is willing to pay for it :-) It might be better to go intranet for that one anyway -- we should be able to take it all to the Web.

Two differences between the Ant project and what we do right now:

- the Ant project doesn't have a notion of an explicit file filter. I think this is important if you want to extend the filter options to more than just extensions and if you want some UI to manage the filter mappings. BTW: does anyone know of a Java implementation of file(1) magic?
- the code creates Documents as return values. The reason we moved away from this is that we want to use the same document handler with different index options. One of the core issues here is whether to store the body or not. I don't think there is a single right answer for this one, so it should be configurable somehow. The two options I see are either returning a data object and then turning that into a Document somewhere else, or passing some configuration object around. Neither is really nice: the first one needs to create an additional object all the time, while the second one puts quite some burden on the implementer of the document handler. Ideas on that one would be extremely welcome.
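To make the first option concrete, here is a rough sketch of the "return a data object, build the index document elsewhere" approach. Everything in it is an assumption for illustration: the class names, the field names and the `IndexField` type are hypothetical, and `IndexField` is only a stand-in for a Lucene field, not Lucene's actual API. The point is just that the handler never sees the stored/unstored decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class SummaryIndexingSketch {

    /** Plain data object a document handler would return; it knows nothing about indexing. */
    static final class DocumentSummary {
        final String title;
        final String author;
        final String body;
        /** Extension point for format-specific attributes. */
        final Properties extra = new Properties();

        DocumentSummary(String title, String author, String body) {
            this.title = title;
            this.author = author;
            this.body = body;
        }
    }

    /** Hypothetical stand-in for an index field (NOT the Lucene Field class). */
    static final class IndexField {
        final String name;
        final String value;
        final boolean stored;
        final boolean tokenized;

        IndexField(String name, String value, boolean stored, boolean tokenized) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.tokenized = tokenized;
        }

        @Override
        public String toString() {
            return name + "[stored=" + stored + ", tokenized=" + tokenized + "]";
        }
    }

    /** Index-side policy; only the "store the body or not" switch is modelled here. */
    static final class IndexOptions {
        final boolean storeBody;

        IndexOptions(boolean storeBody) {
            this.storeBody = storeBody;
        }
    }

    /** Turns a summary into index fields; the only place that applies the options. */
    static List<IndexField> toFields(DocumentSummary summary, IndexOptions options) {
        List<IndexField> fields = new ArrayList<>();
        fields.add(new IndexField("title", summary.title, true, true));
        fields.add(new IndexField("author", summary.author, true, true));
        // The body is always tokenized for searching, but stored only on request.
        fields.add(new IndexField("body", summary.body, options.storeBody, true));
        // Extra properties are kept as stored, untokenized keyword-style fields (an assumption).
        for (String key : summary.extra.stringPropertyNames()) {
            fields.add(new IndexField(key, summary.extra.getProperty(key), true, false));
        }
        return fields;
    }

    public static void main(String[] args) {
        DocumentSummary summary = new DocumentSummary("Crawler plan", "Peter", "long body text ...");
        summary.extra.setProperty("keywords", "lucene, crawler");
        // Same summary, two different index configurations.
        System.out.println(toFields(summary, new IndexOptions(true)));
        System.out.println(toFields(summary, new IndexOptions(false)));
    }
}
```

The cost mentioned above is visible here: one extra `DocumentSummary` per file. In exchange, the same handler output can feed indexes with different options.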
Two ideas we will probably pick up from this are:

- use Ant for creating indexes if we go larger than personal document retrieval
- use JTidy for HTML parsing (we missed that one and used Swing instead, which is no good)

So thanks again, that was quite helpful.

  Peter

Erik Hatcher wrote:

> If you are after a pure file system indexing abstraction, check out
> the 'ant' project in the sandbox. It's got a DocumentHandler
> abstraction allowing it to be a bit pluggable. It's not perfect, but
> it has worked for me for quite some time quite sufficiently.
>
>     Erik
>
>
> On Monday, June 30, 2003, at 08:26 PM, Peter Becker wrote:
>
>> Clemens Marschner wrote:
>>
>>> There's an experimental webcrawler in the lucene-sandbox area called
>>> larm-webcrawler (see
>>> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),
>>> and a project on Sourceforge (http://larm.sf.net) that tries to
>>> leverage this on a higher level. I want to encourage you to go on
>>> that side and read through the specs in sourceforge's CVS.
>>>
>> I've done that by now -- my first problem was to identify LARM as
>> the relevant project, but then things were reasonably easy to find.
>>
>>> It concludes pretty much everything that Andy wrote in his
>>> proposal, and more. The project only contains conceptual documents
>>> at this time, but if you're willing to contribute actively, that's
>>> very appreciated.
>>>
>> In many ways the project aims too high for us. We are interested
>> only in the file system part and our time is limited. My hope was
>> that someone would say there would be a basic framework somewhere
>> where we can put our code, but due to the time limitations we will
>> rather do our own thing. But this is maybe not as bad as it sounds
>> since (a) our original plan was very close to what you describe in
>> certain parts of the system, (b) we have read your documentation and
>> (c) our code will be BSD-licensed.
>>
>> The main ideas we have are:
>> - map file types to document processors
>> - use the java.io.FileFilter interface as base for the mappings
>> - the document processors will probably have a two method interface:
>>     DocumentSummary processDocument(URL);
>>     String getDisplayName();
>> - the DocumentSummary class will model the common attributes like
>>   author, title, text, etc. with a Properties object to be
>>   extensible. Its main purpose is to separate indexing concerns like
>>   stored/unstored and tokenized/untokenized from the document
>>   processors
>> - the display name will be used in the UI to create lists of
>>   FileFilter->DocumentProcessor mappings
>> - there will be some crawler code for the file system, but of course
>>   that is a lot easier
>>
>> Many of these things will not extend straightaway into the web
>> context, but I think the main work we will do will be in
>> implementing the different DocumentProcessors. That part should be
>> reusable. The mapping idea should be reusable, although FileFilter
>> would have to be replaced with something more abstract, at least a
>> URLFilter. My experience with Java networking is not good enough to
>> judge the complexity of that.
>>
>> We expect to have the relevant parts of this done next week. Code
>> will be on Sourceforge
>> (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/), it
>> might be at least useful as inspiration :-) We are also looking into
>> alternatives for parsing PDF and other formats. We have a lot of
>> problems with PDFBox at the moment, and there might be other
>> candidates (http://www.cs.berkeley.edu/~phelps/Multivalent/). And we
>> are looking into the option to use the UDK for indexing
>> (http://udk.openoffice.org/), although that most likely will
>> complicate deployment and increase program size quite a bit.
>> One of the problems we have is that we have some interesting test
>> cases for the parsing tools, but we can't give them away and don't
>> have the time to debug ourselves. We have a file which causes PDFBox
>> to get stuck without any feedback and an XLS file which causes POI
>> to loop with funny messages for a long time until we run out of
>> memory (with -mx500m). But that is something we have to talk to the
>> other projects about.
>>
>> The point of this waffle is: if you think some of our ideas are not
>> as good as they should be or there are things that might affect
>> reuse, please shout now :-) We start coding this right now.
>>
>>> Unfortunately I have to stop my efforts regarding LARM. Long story
>>> short: My future employer says it's too close to their business.
>>> But in contrast to other open source projects, there's already lots
>>> of ideas in that document and lots of code in the old crawler. If
>>> you wish to contribute, it's now up to you.
>>>
>> Fair enough. I guess as professional developer you can never be
>> completely free from considering IP issues.
>>
>> Grüsse,
>>    Peter
>>
>>
>>> Clemens
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Andrew C. Oliver"
>>> To: "Peter Becker"
>>> Cc: "Lucene Developers List"
>>> Sent: Friday, June 27, 2003 2:53 AM
>>> Subject: Re: Lucene crawler plan
>>>
>>>
>>>
>>>> On 6/26/03 8:33 PM, "Peter Becker" wrote:
>>>>
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> are you the Andy signing this:
>>>>> http://jakarta.apache.org/lucene/docs/luceneplan.html? If no --
>>>>> do you know who wrote the page and could you forward this email?
>>>>> Thanks. BTW: your website link on
>>>>> http://jakarta.apache.org/lucene/docs/whoweare.html
>>>>> is dead.
>>>>>
>>>>>
>>>> Yes I wrote it.
>>>>
>>>>
>>>>> The question is: is there some code already? If yes: can we get
>>>>> it? Can we join the effort?
>>>>> If no: what are things we should consider doing to increase our
>>>>> chances that you guys accept our code in the end? We are not
>>>>> really interested in maintaining the crawler bits and pieces, our
>>>>> interest is in the visualization. We are happy to get something
>>>>> going as part of our little demonstrator, but then we'd give it
>>>>> to you and hope someone picks up maintenance.
>>>>>
>>>>>
>>>> I never wrote any code, but there is code in lucene-contrib which
>>>> realized most of what is in this document. I was going to write
>>>> code, but someone beat me to the punch and I was like "wow I have
>>>> things I can do that others won't do for me" and moved on :-)
>>>>
>>>> I'm cc'ing lucene developers list. You'll find plenty of folks
>>>> interested in working with you on this.
>>>>
>>>> -Andy
>>>>
>>>>> Is this all an option anyway? It is ok to say no ;-)
>>>>>
>>>>> Regards,
>>>>>    Peter
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
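
The FileFilter->DocumentProcessor mapping from the plan quoted above could look roughly like the sketch below. Only `java.io.FileFilter` and the two method names `processDocument(URL)` and `getDisplayName()` come from the thread. The registry class, the "first matching filter wins" rule, the extension-based filter and the placeholder processor are assumptions made for illustration.

```java
import java.io.File;
import java.io.FileFilter;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ProcessorMappingSketch {

    /** Minimal stand-in for the DocumentSummary data object mentioned in the plan. */
    static final class DocumentSummary {
        final String title;
        final String text;

        DocumentSummary(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** The two-method interface outlined in the plan. */
    interface DocumentProcessor {
        DocumentSummary processDocument(URL url) throws Exception;

        String getDisplayName();
    }

    /** Ordered filter-to-processor mappings; the first matching filter wins (an assumption). */
    static final class ProcessorRegistry {
        private final Map<FileFilter, DocumentProcessor> mappings = new LinkedHashMap<>();

        void register(FileFilter filter, DocumentProcessor processor) {
            mappings.put(filter, processor);
        }

        /** Returns the processor for the file, or null if no filter accepts it. */
        DocumentProcessor lookup(File file) {
            for (Map.Entry<FileFilter, DocumentProcessor> entry : mappings.entrySet()) {
                if (entry.getKey().accept(file)) {
                    return entry.getValue();
                }
            }
            return null;
        }
    }

    /** Simple case-insensitive extension filter; richer filters could inspect file content. */
    static FileFilter extensionFilter(String extension) {
        String suffix = "." + extension.toLowerCase(Locale.ROOT);
        return file -> file.getName().toLowerCase(Locale.ROOT).endsWith(suffix);
    }

    /** Placeholder processor: uses the URL path as title and extracts no text. */
    static DocumentProcessor namedProcessor(String displayName) {
        return new DocumentProcessor() {
            @Override
            public DocumentSummary processDocument(URL url) {
                return new DocumentSummary(url.getPath(), "");
            }

            @Override
            public String getDisplayName() {
                return displayName;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ProcessorRegistry registry = new ProcessorRegistry();
        registry.register(extensionFilter("pdf"), namedProcessor("PDF documents"));
        registry.register(extensionFilter("html"), namedProcessor("HTML pages"));

        File file = new File("report.PDF");
        DocumentProcessor processor = registry.lookup(file);
        if (processor != null) {
            DocumentSummary summary = processor.processDocument(file.toURI().toURL());
            System.out.println(processor.getDisplayName() + " -> " + summary.title);
        }
    }
}
```

Because the registry only depends on the `accept(File)` contract, swapping `FileFilter` for a more abstract URL-based filter later would mostly mean changing the key type and the `lookup` argument.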