lucene-dev mailing list archives

From Erik Hatcher <li...@ehatchersolutions.com>
Subject Re: Lucene crawler plan
Date Wed, 02 Jul 2003 00:41:56 GMT
On Tuesday, July 1, 2003, at 06:36  PM, Peter Becker wrote:
>> Ah, but Ant *does* have more sophisticated filtering mechanisms!  :)  
>> The <fileset>'s that the <index> task can take can leverage any of 
>> Ant's built-in capabilities, such as (new in Ant 1.5) Selector 
>> capability.  So you could easily filter on file size, file date, etc, 
>> and custom Selectors can be written and plugged in.
>
> Ant does. What I meant with the Ant project was the code in the Lucene 
> CVS for Ant. The decision between the two DocumentHandlers seems to be 
> made based on the extension. But maybe I didn't read the code properly.

But, look at the setters on IndexTask.  The document handler is 
pluggable.  The one that is provided is definitely dumb, no question, 
and was only meant as an example.  I have my own BlogDocumentHandler 
for indexing my blog entries, for example (they are text files, but get 
indexed differently than plain ol' .txt).
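
To make the "pluggable handler" point concrete, here is a minimal sketch of two handlers that treat the same kind of text file differently. It is not the BlogDocumentHandler mentioned above: the FileHandler interface, both class names, the field names, and the "title on the first line" layout are all assumptions made for illustration, and a plain Map stands in for a Lucene Document so the sketch compiles without Lucene or Ant.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the handler contract: one file in, named fields out.
interface FileHandler {
    Map<String, String> handle(Path file) throws IOException;
}

// Plain text: the whole file becomes a single "contents" field.
class PlainTextHandler implements FileHandler {
    public Map<String, String> handle(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.toString());
        fields.put("contents",
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        return fields;
    }
}

// Blog entry: ASSUMED layout is a title on the first line and the body after it.
class BlogEntryHandler implements FileHandler {
    public Map<String, String> handle(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.toString());
        fields.put("title", lines.isEmpty() ? "" : lines.get(0).trim());
        fields.put("body", lines.size() > 1
                ? String.join("\n", lines.subList(1, lines.size())).trim()
                : "");
        return fields;
    }
}
```

Both handlers accept the same .txt input; which one runs is a configuration choice, not something derived from the extension.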

> What I want to see is a user-defined mapping from some kinds of 
> FileFilters (extension, wildcard, regexp, magic numbers, whatever) to 
> the DocumentHandlers. They should be applied in order and whenever one 
> hits the iteration stops unless an exception gets thrown by the 
> DocumentHandler. Additional DocumentHandlers could be mixed in to 
> provide extra information. I am thinking of file system information 
> and metadata stores here. These would be an independent dimension of 
> data about the documents.
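
The ordered mapping described in the quoted paragraph could be sketched roughly as follows. This is only one reading of the proposal, not code from the Lucene CVS: Handler, HandlerChain, map, mixIn, and index are hypothetical names, a Map stands in for the indexed document, and "stops unless an exception gets thrown" is interpreted as "fall through to the next matching rule when a handler fails".

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document handler: file in, named fields out.
interface Handler {
    Map<String, String> handle(File file) throws Exception;
}

// Ordered (filter, handler) rules; the first rule that matches AND succeeds wins.
class HandlerChain {
    private static final class Rule {
        final FileFilter filter;
        final Handler handler;
        Rule(FileFilter filter, Handler handler) {
            this.filter = filter;
            this.handler = handler;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final List<Handler> mixins = new ArrayList<>();

    HandlerChain map(FileFilter filter, Handler handler) {
        rules.add(new Rule(filter, handler));
        return this;
    }

    // Mix-ins contribute extra fields (file system info, metadata stores, ...).
    HandlerChain mixIn(Handler extra) {
        mixins.add(extra);
        return this;
    }

    // Returns null when no rule produced a document for the file.
    Map<String, String> index(File file) {
        Map<String, String> fields = null;
        for (Rule rule : rules) {
            if (!rule.filter.accept(file)) {
                continue;
            }
            try {
                fields = new LinkedHashMap<>(rule.handler.handle(file));
                break; // first hit stops the iteration
            } catch (Exception e) {
                // a failing handler does not stop the iteration; try the next rule
            }
        }
        if (fields == null) {
            return null;
        }
        for (Handler extra : mixins) {
            try {
                for (Map.Entry<String, String> e : extra.handle(file).entrySet()) {
                    fields.putIfAbsent(e.getKey(), e.getValue()); // primary fields win
                }
            } catch (Exception e) {
                // a failing mix-in is skipped; the primary document is kept
            }
        }
        return fields;
    }
}
```

Because FileFilter has a single method, rules can be registered with lambdas, for example `chain.map(f -> f.getName().endsWith(".html"), htmlHandler)`, where htmlHandler is whatever handler the user supplies. Extension, wildcard, regexp, or magic-number checks all fit behind the same filter interface.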

Also note that the code could easily be modified to allow dynamic 
properties to be passed to document handlers (see Ant's 
DynamicConfigurator interface).  I experimented with this some myself, 
but didn't need it so didn't keep the code around.
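
For illustration only (this is not the experiment mentioned above, which was not kept), a handler that accepts arbitrary attributes might look like the sketch below. Ant's DynamicConfigurator declares setDynamicAttribute(String, String) and createDynamicElement(String); the local DynamicAttributes interface here mirrors only the attribute half so that the sketch compiles without Ant on the classpath. ConfigurableHandler and its property method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Local stand-in mirroring the attribute half of Ant's DynamicConfigurator.
interface DynamicAttributes {
    void setDynamicAttribute(String name, String value);
}

// Hypothetical handler that stores whatever attributes the build file supplies.
class ConfigurableHandler implements DynamicAttributes {
    private final Map<String, String> properties = new LinkedHashMap<>();

    // Called once per attribute that has no dedicated setter.
    public void setDynamicAttribute(String name, String value) {
        properties.put(name.toLowerCase(Locale.ROOT), value);
    }

    // Case-insensitive lookup with a fallback for attributes that were not set.
    String property(String name, String fallback) {
        return properties.getOrDefault(name.toLowerCase(Locale.ROOT), fallback);
    }
}
```

With something along these lines, settings specific to one handler could be written as attributes in the build file without IndexTask needing a setter for each of them.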

>> I think there are probably some better options out there than using 
>> JTidy these days, but I have not had time to investigate them.  JTidy 
>> does the job reasonably well though.
>
> We are looking into some alternatives. We have a few ten thousand 
> documents to test on :-) I suspect we will just implement whatever 
> comes along and let them run, collecting exceptions and time eaten. 
> Checking if they really got all interesting content will be too much 
> work, though.
>
> What are the issues with JTidy?

The version number!  It's ancient.  It does a decent job with even 
mangled HTML though - I just suspect something better surely is out 
there by now.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

