lucene-dev mailing list archives

From Erik Hatcher <>
Subject Re: Lucene crawler plan
Date Tue, 01 Jul 2003 00:41:37 GMT
If you are after a pure file system indexing abstraction, check out the  
'ant' project in the sandbox.  It's got a DocumentHandler abstraction  
allowing it to be a bit pluggable.  It's not perfect, but it has worked  
well enough for me for quite some time.


On Monday, June 30, 2003, at 08:26  PM, Peter Becker wrote:

> Clemens Marschner wrote:
>> There's an experimental webcrawler in the lucene-sandbox area called
>> larm-webcrawler (see
>> overview.html),
>> and a project on Sourceforge ( that tries to  
>> leverage
>> this on a higher level. I want to encourage you to go to that site  
>> and read
>> through the specs in sourceforge's CVS.
> I've done that by now -- my first problem was to identify LARM as the  
> relevant project, but then things were reasonably easy to find.
>> It covers pretty much everything that Andy wrote in his proposal,  
>> and
>> more. The project only contains conceptual documents at this time,  
>> but if
>> you're willing to contribute actively, that would be much appreciated.
> In many ways the project aims too high for us. We are interested only  
> in the file system part and our time is limited. My hope was that  
> someone would say there would be a basic framework somewhere where we  
> can put our code, but due to the time limitations we will rather do  
> our own thing. But this is maybe not as bad as it sounds since (a) our  
> original plan was very close to what you describe in certain parts of  
> the system, (b) we have read your documentation and (c) our code will  
> be BSD-licensed.
> The main ideas we have are:
> - map file types to document processors
> - use the interface as base for the mappings
> - the document processors will probably have a two method interface:
>    DocumentSummary processDocument(URL);
>    String getDisplayName();
> - the DocSummary class will model the common attributes like author,  
> title, text, etc. with a Properties object to be extensible. Its main  
> purpose is to separate indexing concerns like stored/unstored and  
> tokenized/untokenized from the document processors
> - the display name will be used in the UI to create lists of  
> FileFilter->DocumentProcessor mappings
> - there will be some crawler code for the file system, but of course  
> that is a lot easier
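
The ideas listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code Peter's team wrote: the two method signatures come from the message, while the fields of DocumentSummary, the `throws IOException` clause, and the PlainTextProcessor and ProcessorRegistry classes are hypothetical. It assumes Java 9 or later for `InputStream.readAllBytes`.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Common attributes of a processed document; field names are assumed. */
final class DocumentSummary {
    String title;
    String author;
    String text;
    /** Extension point for format-specific attributes. */
    final Properties extras = new Properties();
}

/** The two-method interface from the message (the throws clause is an assumption). */
interface DocumentProcessor {
    DocumentSummary processDocument(URL url) throws IOException;
    String getDisplayName();
}

/** Hypothetical processor for plain-text files. */
final class PlainTextProcessor implements DocumentProcessor {
    @Override
    public DocumentSummary processDocument(URL url) throws IOException {
        DocumentSummary summary = new DocumentSummary();
        try (InputStream in = url.openStream()) {
            summary.text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        // Use the last path segment as a stand-in title.
        summary.title = new File(url.getPath()).getName();
        return summary;
    }

    @Override
    public String getDisplayName() {
        return "Plain text";
    }
}

/** Hypothetical FileFilter -> DocumentProcessor mapping; first match wins. */
final class ProcessorRegistry {
    private final Map<FileFilter, DocumentProcessor> mappings = new LinkedHashMap<>();

    void register(FileFilter filter, DocumentProcessor processor) {
        mappings.put(filter, processor);
    }

    /** Returns the first processor whose filter accepts the file, or null. */
    DocumentProcessor lookup(File file) {
        for (Map.Entry<FileFilter, DocumentProcessor> entry : mappings.entrySet()) {
            if (entry.getKey().accept(file)) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

Note that nothing in the processor refers to Lucene: decisions such as stored/unstored or tokenized/untokenized would be made by whatever code turns a DocumentSummary into index fields.
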
> Many of these things will not extend straightaway into the web  
> context, but I think the main work we will do will be in implementing  
> the different DocumentProcessors. That part should be reusable. The  
> mapping idea should be reusable, although FileFilter would have to be  
> replaced with something more abstract, at least a URLFilter. My  
> experience with Java networking is not good enough to judge the  
> complexity of that.
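
One way the FileFilter could be replaced with something more abstract is sketched below. The URLFilter interface and the URLFilters helper are hypothetical names introduced only for illustration. The message names a URLFilter but does not define one.

```java
import java.io.File;
import java.io.FileFilter;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical URL-based counterpart of java.io.FileFilter. */
interface URLFilter {
    boolean accept(URL url);
}

/** Hypothetical helpers for building URLFilter instances. */
final class URLFilters {
    private URLFilters() {}

    /** Reuses an existing FileFilter for file: URLs; rejects all other protocols. */
    static URLFilter fromFileFilter(FileFilter delegate) {
        return url -> {
            if (!"file".equals(url.getProtocol())) {
                return false;
            }
            try {
                return delegate.accept(new File(url.toURI()));
            } catch (URISyntaxException | IllegalArgumentException e) {
                // The URL cannot be expressed as a local file.
                return false;
            }
        };
    }

    /** Matches on the URL path suffix, ignoring case; works for any protocol. */
    static URLFilter bySuffix(String suffix) {
        String lower = suffix.toLowerCase();
        return url -> url.getPath().toLowerCase().endsWith(lower);
    }
}
```

With an adapter like this, existing file system mappings would keep working while web-oriented filters are added alongside them.
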
> We expect to have the relevant parts of this done next week. Code will  
> be on Sourceforge  
> (, it  
> might be at least useful as inspiration :-) We are also looking into  
> alternatives for parsing PDF and other formats. We have a lot of  
> problems with PDFBox at the moment, and there might be other  
> candidates ( And we  
> are looking into the option to use the UDK for indexing  
> (, although that most likely will  
> complicate deployment and increase program size quite a bit. One of  
> the problems we have is that we have some interesting test cases for  
> the parsing tools, but we can't give them away and don't have the time  
> to debug them ourselves. We have a file which causes PDFBox to get stuck  
> without any feedback and an XLS file which causes POI to loop with  
> funny messages for a long time until we run out of memory (with  
> -mx500m). But that is something we have to talk to the other projects  
> about.
> The point of this waffle is: if you think some of our ideas are not as  
> good as they should be or there are things that might affect reuse,  
> please shout now :-) We start coding this right now.
>> Unfortunately I have to stop my efforts regarding LARM. Long story  
>> short: My
>> future employer says it's too close to their business. But in  
>> contrast to
>> other open source projects, there's already lots of ideas in that  
>> document
>> and lots of code in the old crawler. If you wish to contribute, it's  
>> now up
>> to you.
> Fair enough. I guess as a professional developer you can never be  
> completely free from considering IP issues.
> Grüsse,
>    Peter
>> Clemens
>> ----- Original Message ----- From: "Andrew C. Oliver"  
>> <>
>> To: "Peter Becker" <>
>> Cc: "Lucene Developers List" <>
>> Sent: Friday, June 27, 2003 2:53 AM
>> Subject: Re: Lucene crawler plan
>>> On 6/26/03 8:33 PM, "Peter Becker" <> wrote:
>>>> Hi Andrew,
>>>> are you the Andy signing this:
>>>> If no -- do  
>>>> you
>>>> know who wrote the page and could you forward this email? Thanks.  
>>>> BTW:
>>>> your website link on  
>>>> is dead.
>>> Yes I wrote it.
>>>> The question is: is there some code already? If yes: can we get it?  
>>>> Can
>>>> we join the effort? If no: what are things we should consider doing  
>>>> to
>>>> increase our chances that you guys accept our code in the end? We  
>>>> are
>>>> not really interested in maintaining the crawler bits and pieces,  
>>>> our
>>>> interest is in the visualization. We are happy to get something  
>>>> going as
>>>> part of our little demonstrator, but then we'd give it to you and  
>>>> hope
>>>> someone picks up maintenance.
>>> I never wrote any code, but there is code in lucene-contrib which  
>>> realized
>>> most of what is in this document.  I was going to write code, but  
>>> someone
>>> beat me to the punch and I was like "wow I have things I can do that  
>>> others won't do for me" and moved on :-)
>>> I'm cc'ing lucene developers list.  You'll find plenty of folks  
>>> interested
>>> in working with you on this.
>>> -Andy
>>>> Is this all an option anyway? It is ok to say no ;-)
>>>> Regards,
>>>> Peter
