lucene-dev mailing list archives

From Peter Becker <>
Subject Re: Lucene crawler plan
Date Tue, 01 Jul 2003 22:36:04 GMT
Erik Hatcher wrote:

> On Monday, June 30, 2003, at 10:21  PM, Peter Becker wrote:
>> this is far closer to what we are looking for. Using Ant is an 
>> interesting idea, although it probably won't help us for the UI tool. 
>> But we could try to layer things so we could use them for both.
> Yes, I'm sure a more generalized method could be developed that 
> accommodates both.  It's pretty decoupled even within the Ant project 
> with a DocumentHandler interface and all. 

And frankly -- these little code pieces are easy to port. The trick is 
knowing which library to use and how.

>> Two differences between the Ant project and what we do right now:
>> - the Ant project doesn't have a notion of an explicit file filter. I 
>> think this is important if you want to extend the filter options to 
>> more than just extensions and if you want some UI to manage the 
>> filter mappings. BTW: does anyone know of a Java implementation for 
>> file(1) magic?
> Ah, but Ant *does* have more sophisticated filtering mechanisms!  :)  
> The <fileset>'s that the <index> task can take can leverage any of 
> Ant's built-in capabilities, such as (new in Ant 1.5) Selector 
> capability.  So you could easily filter on file size, file date, etc, 
> and custom Selectors can be written and plugged in. 

Ant does. What I meant by the Ant project was the code in the Lucene 
CVS for Ant. The decision between the two DocumentHandlers seems to be 
made based on the file extension. But maybe I didn't read the code properly.
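
On the file(1) magic question: I don't know of a Java port either, but 
a basic magic-number check is small enough to do in plain JDK code. A 
rough, hypothetical sketch (MagicNumberFilter is an invented name, not 
an existing library class):

```java
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

// Illustrative sketch: a FileFilter that matches files whose first
// bytes equal a given signature (e.g. "%PDF" for PDF files).
class MagicNumberFilter implements FileFilter {
    private final byte[] magic;

    MagicNumberFilter(byte[] magic) {
        this.magic = magic;
    }

    public boolean accept(File file) {
        byte[] head = new byte[magic.length];
        try (FileInputStream in = new FileInputStream(file)) {
            // If the file is shorter than the signature, it can't match.
            if (in.read(head) != magic.length) return false;
        } catch (IOException e) {
            return false;  // unreadable files simply don't match
        }
        return Arrays.equals(head, magic);
    }
}
```

A real file(1) replacement would of course need a table of signatures 
with offsets and masks, but the per-file check stays this simple.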

What I want to see is a user-defined mapping from some kinds of 
FileFilters (extension, wildcard, regexp, magic numbers, whatever) to 
the DocumentHandlers. They should be applied in order, and whenever one 
matches, the iteration stops unless an exception gets thrown by the 
DocumentHandler. Additional DocumentHandlers could be mixed in to 
provide extra information. I am thinking of file system information and 
metadata stores here. These would be an independent dimension of data 
about the documents.
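
The iteration rule could be sketched roughly like this (DocumentHandler 
and HandlerMapping are illustrative names here, not the existing 
sandbox classes):

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

// Illustrative handler interface: extracts text from a file, or
// throws if the file cannot be parsed.
interface DocumentHandler {
    String parse(File file) throws Exception;
}

// An ordered list of (filter, handler) pairs, tried in registration
// order; the first matching filter wins unless its handler throws,
// in which case iteration continues with later matches.
class HandlerMapping {
    private final List<FileFilter> filters = new ArrayList<>();
    private final List<DocumentHandler> handlers = new ArrayList<>();

    void register(FileFilter filter, DocumentHandler handler) {
        filters.add(filter);
        handlers.add(handler);
    }

    String handle(File file) throws Exception {
        Exception last = null;
        for (int i = 0; i < filters.size(); i++) {
            if (filters.get(i).accept(file)) {
                try {
                    return handlers.get(i).parse(file);
                } catch (Exception e) {
                    last = e;  // remember, try the next matching filter
                }
            }
        }
        if (last != null) throw last;  // every matching handler failed
        return null;                   // no filter matched at all
    }
}
```

The "extra information" handlers would then be a second list that is 
always applied, independent of this first-match chain.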

>> - the code creates Documents as return values. The reason we went 
>> away from this is that we want to use the same document handler with 
>> different index options. One of the core issues here is storing the 
>> body or not. I don't think there is any true answer for this one, so 
>> it should be configurable somehow.
> Agreed.  It was a toss-up when I went to implement it, as to who is 
> actually in control of the Document instantiation and population.
>>  The two options I see are either returning a data object and then 
>> turning that into a Document somewhere else, or passing some 
>> configuration object around. Neither is really nice: the first one 
>> needs to create an additional object all the time, while the second 
>> one puts quite some burden on the implementer of the document 
>> handler. Ideas on that one would be extremely welcome.
> If you invert what I have done then the "controller" needs to know 
> more information about the fields, more than you could convey in a 
> String/String Map - is a field indexed or not?  Is a field tokenized 
> or not?  Is it stored or not?  Who decides on the field names?  Who 
> decides all of these are the questions we have to answer to do this 
> type of stuff. 

Exactly. Somehow these issues should be separated from the issue of 
finding the data. Our current idea is to collect everything in a data 
object and then have some other code turn it into a Lucene Document. 
Another option would be a wrapper/factory/strategy around the Lucene 
Document doing the mapping.

The field name question would be separated this way, but one question 
would be left: what are the fields? The idea of having an extra 
Properties field doesn't really help much, since then we are back to 
where we started. Providing a big range of default fields (along the 
lines of Dublin Core?) would help, but would be overkill. It could be 
expensive in terms of object creation, too -- the wrapper approach would 
probably be better here.
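
To make the separation concrete, a hypothetical sketch (the names 
ExtractedDocument, FieldOptions and DocumentMapper are invented; the 
actual construction of Lucene Fields would live behind the mapper):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// The handler only fills a plain data object with name/value pairs;
// it never decides how fields are indexed.
class ExtractedDocument {
    final Map<String, String> fields = new LinkedHashMap<>();
    void put(String name, String value) { fields.put(name, value); }
}

// Per-field index options, decided elsewhere.
enum FieldOptions { STORED_INDEXED, INDEXED_ONLY, STORED_ONLY }

// The mapper answers the "is it stored, is it tokenized" questions
// for each field name when building the Lucene Document.
interface DocumentMapper {
    FieldOptions optionsFor(String fieldName);
}

// Example policy: everything is stored and indexed, except the body,
// whose storage is configurable -- the handler never needs to know.
class DefaultMapper implements DocumentMapper {
    private final boolean storeBody;

    DefaultMapper(boolean storeBody) { this.storeBody = storeBody; }

    public FieldOptions optionsFor(String fieldName) {
        if ("body".equals(fieldName)) {
            return storeBody ? FieldOptions.STORED_INDEXED
                             : FieldOptions.INDEXED_ONLY;
        }
        return FieldOptions.STORED_INDEXED;
    }
}
```

This way the same handler can feed indexes with different options, and 
only the mapper changes.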

>> Two ideas we will probably pick up from this are:
>> - use Ant for creating indexes if we scale beyond personal document 
>> retrieval
> Keep in mind you could also launch Ant via the API from a GUI as well, 
> or just leverage the IndexTask itself and call it via the API and its 
> execute() method. 

I'll investigate this. Thanks.

>> - use JTidy for HTML parsing (we missed that one and used Swing 
>> instead, which is no good)
> I think there are probably some better options out there than using 
> JTidy these days, but I have not had time to investigate them.  JTidy 
> does the job reasonably well though. 

We are looking into some alternatives. We have a few tens of thousands 
of documents to test on :-) I suspect we will just implement whatever 
comes along and let them run, collecting exceptions and the time 
consumed. Checking whether they really got all the interesting content 
will be too much work, though.
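
The run-them-all idea could look something like this (HtmlParser and 
ParserBenchmark are made-up names; each candidate library would sit 
behind the interface):

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative adapter interface: one implementation per candidate
// HTML parsing library.
interface HtmlParser {
    String extractText(File html) throws Exception;
}

// Runs every candidate parser over the whole corpus, collecting
// exceptions instead of aborting, and records the total time taken.
class ParserBenchmark {
    static Map<String, String> run(Map<String, HtmlParser> candidates,
                                   File[] corpus) {
        Map<String, String> report = new LinkedHashMap<>();
        for (Map.Entry<String, HtmlParser> e : candidates.entrySet()) {
            long start = System.nanoTime();
            int failures = 0;
            for (File f : corpus) {
                try {
                    e.getValue().extractText(f);
                } catch (Exception ex) {
                    failures++;  // collect, don't abort: we want totals
                }
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            report.put(e.getKey(), failures + " failures in " + ms + " ms");
        }
        return report;
    }
}
```

It won't tell us whether a parser silently dropped content, but it 
cheaply weeds out the ones that crash or crawl.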

What are the issues with JTidy?

