lucene-dev mailing list archives

From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Lucene crawler plan
Date Tue, 01 Jul 2003 00:26:18 GMT
Clemens Marschner wrote:

>There's an experimental webcrawler in the lucene-sandbox area called
>larm-webcrawler (see
>http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),
>
>and a project on Sourceforge (http://larm.sf.net) that tries to leverage
>this on a higher level. I want to encourage you to go to that site and read
>through the specs in sourceforge's CVS.
>
I've done that by now -- my first problem was to identify LARM as the 
relevant project, but then things were reasonably easy to find.

>It includes pretty much everything that Andy wrote in his proposal, and
>more. The project only contains conceptual documents at this time, but if
>you're willing to contribute actively, that's much appreciated.
>
In many ways the project aims too high for us. We are interested only in
the file system part and our time is limited. I had hoped someone would
point us to a basic framework into which we could put our code, but given
the time constraints we will do our own thing instead. This may not be as
bad as it sounds, since (a) our original plan was very close to what you
describe in certain parts of the system, (b) we have read your
documentation and (c) our code will be BSD-licensed.

The main ideas we have are:
- map file types to document processors
- use the java.io.FileFilter interface as base for the mappings
- the document processors will probably have a two method interface:
    DocumentSummary processDocument(URL url);
    String getDisplayName();
- the DocumentSummary class will model the common attributes like author,
title, text, etc. with a Properties object to be extensible. Its main
purpose is to separate indexing concerns like stored/unstored and
tokenized/untokenized from the document processors
- the display name will be used in the UI to create lists of 
FileFilter->DocumentProcessor mappings
- there will be some crawler code for the file system, but of course 
that is a lot easier
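
To make the list above concrete, here is a minimal sketch of how the pieces might fit together. Only processDocument, getDisplayName, DocumentSummary and the use of java.io.FileFilter come from the plan; ProcessorRegistry, PlainTextProcessor, the set/get accessors and the "text"/"title" attribute names are assumptions made up for illustration, and the code is written in current Java (9 or later) rather than what we would actually check in.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Common attributes (author, title, text, ...) kept in a Properties object
// so new attributes can be added without changing the class.
// The set/get accessors are hypothetical.
class DocumentSummary {
    private final Properties attributes = new Properties();

    void set(String name, String value) { attributes.setProperty(name, value); }
    String get(String name) { return attributes.getProperty(name); }
}

// The two-method interface from the plan.
interface DocumentProcessor {
    DocumentSummary processDocument(URL url) throws IOException;
    String getDisplayName();
}

// Hypothetical registry: the first FileFilter that accepts a file
// decides which processor handles it (registration order is kept).
class ProcessorRegistry {
    private final Map<FileFilter, DocumentProcessor> mappings = new LinkedHashMap<>();

    void register(FileFilter filter, DocumentProcessor processor) {
        mappings.put(filter, processor);
    }

    // Returns null when no registered filter accepts the file.
    DocumentProcessor findProcessor(File file) {
        for (Map.Entry<FileFilter, DocumentProcessor> entry : mappings.entrySet()) {
            if (entry.getKey().accept(file)) {
                return entry.getValue();
            }
        }
        return null;
    }
}

// Hypothetical processor for plain text files, assuming UTF-8 content.
class PlainTextProcessor implements DocumentProcessor {
    public DocumentSummary processDocument(URL url) throws IOException {
        DocumentSummary summary = new DocumentSummary();
        try (InputStream in = url.openStream()) {
            summary.set("text", new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        summary.set("title", url.getPath());
        return summary;
    }

    public String getDisplayName() { return "Plain text"; }
}
```

A file system crawler would then call findProcessor for each file it visits and hand the resulting DocumentSummary to the indexing code, which alone decides which attributes are stored or tokenized.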

Many of these things will not extend straightaway into the web context, 
but I think the main work we will do will be in implementing the 
different DocumentProcessors. That part should be reusable. The mapping 
idea should be reusable, although FileFilter would have to be replaced 
with something more abstract, at least a URLFilter. My experience with 
Java networking is not good enough to judge the complexity of that.
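
As a rough idea of what "something more abstract" could look like, the sketch below defines a URLFilter next to java.io.FileFilter. Everything in it is an assumption for illustration: the URLFilter interface, the URLFilters helper class and its method names are hypothetical, and it says nothing about the harder networking issues of a real web crawler.

```java
import java.io.File;
import java.io.FileFilter;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Locale;

// Hypothetical counterpart to java.io.FileFilter, keyed on URLs instead of files.
interface URLFilter {
    boolean accept(URL url);
}

final class URLFilters {
    private URLFilters() {}

    // Reuses an existing FileFilter for file: URLs; rejects every other protocol.
    static URLFilter fromFileFilter(FileFilter fileFilter) {
        return url -> {
            if (!"file".equals(url.getProtocol())) {
                return false;
            }
            try {
                return fileFilter.accept(new File(url.toURI()));
            } catch (URISyntaxException | IllegalArgumentException e) {
                return false; // URL cannot be converted to a local file
            }
        };
    }

    // Protocol-independent filter that only looks at the extension of the path.
    static URLFilter byExtension(String extension) {
        String suffix = extension.toLowerCase(Locale.ROOT);
        return url -> url.getPath().toLowerCase(Locale.ROOT).endsWith(suffix);
    }

    // Convenience helper: wraps the checked MalformedURLException.
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With an adapter like fromFileFilter, mappings written for the file system case would keep working unchanged, while web-specific mappings could use filters such as byExtension or, more realistically, the content type reported by the server.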

We expect to have the relevant parts of this done next week. Code will 
be on Sourceforge 
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/), it 
might at least be useful as inspiration :-) We are also looking into 
alternatives for parsing PDF and other formats. We have a lot of 
problems with PDFBox at the moment, and there might be other candidates 
(http://www.cs.berkeley.edu/~phelps/Multivalent/). And we are looking 
into the option to use the UDK for indexing 
(http://udk.openoffice.org/), although that most likely will complicate 
deployment and increase program size quite a bit. One of the problems we 
have is that we have some interesting test cases for the parsing tools, 
but we can't give them away and don't have the time to debug them ourselves. We 
have a file which causes PDFBox to get stuck without any feedback and an 
XLS file which causes POI to loop with funny messages for a long time 
until we run out of memory (with -mx500m). But that is something we have 
to talk to the other projects about.

The point of this waffle is: if you think some of our ideas are not as 
good as they should be or there are things that might affect reuse, 
please shout now :-) We start coding this right now.

>Unfortunately I have to stop my efforts regarding LARM. Long story short: My
>future employer says it's too close to their business. But in contrast to
>other open source projects, there's already lots of ideas in that document
>and lots of code in the old crawler. If you wish to contribute, it's now up
>to you.
>
Fair enough. I guess as a professional developer you can never be 
completely free from considering IP issues.

Grüsse,
    Peter


>Clemens
>
>
>
>----- Original Message ----- 
>From: "Andrew C. Oliver" <acoliver@apache.org>
>To: "Peter Becker" <pbecker@dstc.edu.au>
>Cc: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
>Sent: Friday, June 27, 2003 2:53 AM
>Subject: Re: Lucene crawler plan
>
>
>  
>
>>On 6/26/03 8:33 PM, "Peter Becker" <pbecker@dstc.edu.au> wrote:
>>
>>    
>>
>>>Hi Andrew,
>>>
>>>are you the Andy signing this:
>>>http://jakarta.apache.org/lucene/docs/luceneplan.html? If no -- do you
>>>know who wrote the page and could you forward this email? Thanks. BTW:
>>>your website link on http://jakarta.apache.org/lucene/docs/whoweare.html
>>>is dead.
>>>
>>>      
>>>
>>Yes I wrote it.
>>
>>    
>>
>>>The question is: is there some code already? If yes: can we get it? Can
>>>we join the effort? If no: what are things we should consider doing to
>>>increase our chances that you guys accept our code in the end? We are
>>>not really interested in maintaining the crawler bits and pieces, our
>>>interest is in the visualization. We are happy to get something going as
>>>part of our little demonstrator, but then we'd give it to you and hope
>>>someone picks up maintenance.
>>>
>>>      
>>>
>>I never wrote any code, but there is code in lucene-contrib which realized
>>most of what is in this document.  I was going to write code, but someone
>>beat me to the punch and I was like "wow I have things I can do that others
>>won't do for me" and moved on :-)
>>
>>I'm cc'ing lucene developers list.  You'll find plenty of folks interested
>>in working with you on this.
>>
>>-Andy
>>    
>>
>>>Is this all an option anyway? It is ok to say no ;-)
>>>
>>>Regards,
>>> Peter
>>>      
>>>




