jackrabbit-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcel Reutegger <marcel.reuteg...@day.com>
Subject Re: Redesign TextFilter?
Date Mon, 03 Jul 2006 12:01:51 GMT
I agree with all your comments and I would keep the getContentTypes() 
method in the TextExtractor interface. I think that way it requires less 
  configuration because a TextExtractor knows the mime types it can 
handle anyway.

How should we handle backward compatibility? Do you think we should keep 
the existing TextFilter interface?

regards
  marcel

Jukka Zitting wrote:
> Hi,
> 
> While resolving JCR-470, I started thinking about the design of the
> TextFilter interface. There are a number of issues with the interface
> that I believe could do with some refactoring:
> 
> 1) The name of the interface. The name suggests that it's a filter for
> an existing text stream. A better name would probably be TextExtractor
> or something similar.
> 
> 2) The canFilter() method. The design forces Jackrabbit to always
> iterate through the whole list of configured index filters asking each
> whether it can filter a given content type. Wouldn't it be better to
> have a getContentTypes() method that would return the supported
> content types, so Jackrabbit could just do a map lookup to get to the
> correct filter. Alternatively we could completely do away with the
> method and use a separate configuration file that maps MIME types to
> implementation classes (or even configured instances).
> 
> 3) The doFilter() method. See below for a breakdown:
> 
> 3.1) Name of the method. As discussed in point 1 above, extractText()
> could be a better name.
> 
> 3.2) Return value. Why do we return a Map from the method? I see that
> it would be possible to return a set of fields for indexing, but none
> of the existing filter classes do this and in any case such a feature
> feels a bit confusing (what if there's a field name conflict). The
> design also forces the implementations to know the FieldNames.FULLTEXT
> constant, a small extra dependency. Would it make sense to return just
> a Reader instance?
> 
> 3.3) The PropertyState argument. All of the implementation classes
> just extract the binary stream from the first value of the givne
> PropertyState instance. I don't see any benefit of passing the entire
> state, it just adds complexity and an extra dependency. How about
> passing just the InputStream? It would probably also make sense to
> pass the actual MIME type along with the possible character encoding.
> 
> Put together, my proposal for a replacement interface would be:
> 
>    public interface TextExtractor {
> 
>        String[] getContentTypes(); // if any
> 
>        Reader extractText(InputStream stream, String type, String 
> encoding);
> 
>    }
> 
> Comments?
> 
> BR,
> 
> Jukka Zitting
> 


Mime
View raw message