jackrabbit-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (JCR-415) Enhance indexing of binary content
Date Tue, 11 Jul 2006 09:25:31 GMT
    [ http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12420256 ] 

Jukka Zitting commented on JCR-415:
-----------------------------------

Marcel:
> NodeIndexer.addBinaryValue() is protected to allow subclasses to override it but it uses
> the private method getValue(). Thus getValue() should be protected final in order to be
> usable for a subclass.

OK.
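
A minimal sketch of the visibility point, for illustration only. BaseIndexer, CustomIndexer, index() and the String-based signatures are hypothetical stand-ins, not the actual NodeIndexer code; the sketch only shows why an overridable protected method needs its helper to be at least protected.

```java
// Hypothetical stand-in for NodeIndexer; names and signatures are assumptions.
class BaseIndexer {

    // protected final: visible to subclasses, but its behaviour cannot be changed.
    protected final String getValue(String raw) {
        return raw.trim();
    }

    // Overridable hook that relies on getValue().
    protected String addBinaryValue(String raw) {
        return "binary:" + getValue(raw);
    }

    // Public entry point so callers outside the hierarchy can trigger indexing.
    public String index(String raw) {
        return addBinaryValue(raw);
    }
}

// A custom indexer that overrides the hook.
class CustomIndexer extends BaseIndexer {

    protected String addBinaryValue(String raw) {
        // This call compiles only because getValue() is visible to subclasses;
        // if getValue() were private, it would be a compile-time error.
        return "custom:" + getValue(raw);
    }
}
```

Making the helper final keeps the value lookup fixed while still letting subclasses reuse it from their own addBinaryValue().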

> Extracting text should be deferred to the time when the lucene Document actually requests
> characters from the Reader that is assigned to a Field. See http://issues.apache.org/jira/browse/JCR-264.

I think it would make more design sense to try to postpone the creation of the Document instances
instead of delaying text extraction. But I'm not too familiar with the details, so I'm OK
with adding lazy reading to the mix. In any case I think it's best to layer the lazy reading
on top of the TextExtractor interface instead of below it. A utility class like the following
could achieve this as long as the given InputStream remains valid until the document has been
read.

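    import java.io.IOException;
    import java.io.InputStream;
    import java.io.Reader;

    /**
     * Reader that defers text extraction until the first read() call.
     * TextExtractor is the interface discussed above.
     */
    class TextExtractorReader extends Reader {

        private final TextExtractor extractor;
        private final InputStream stream;
        private final String type;
        private final String encoding;

        /** Extracted text; null until the first read() call. */
        private Reader reader;

        public TextExtractorReader(
                TextExtractor extractor, InputStream stream,
                String type, String encoding) {
            this.extractor = extractor;
            this.stream = stream;
            this.type = type;
            this.encoding = encoding;
            this.reader = null;
        }

        public int read(char[] buffer, int offset, int length) throws IOException {
            // Run the extractor only when characters are actually requested.
            if (reader == null) {
                reader = extractor.extractText(stream, type, encoding);
            }
            return reader.read(buffer, offset, length);
        }

        public void close() throws IOException {
            if (reader != null) {
                reader.close();
            } else {
                // Nothing was ever read, so release the unused stream directly.
                stream.close();
            }
        }

    }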

I can update the query patch accordingly.
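
To make the lazy behaviour concrete, here is a small self-contained sketch. It repeats the utility class from above and adds three assumptions that are not part of the patch: a one-method TextExtractor interface matching the extractText(stream, type, encoding) call used in the class, a CountingExtractor that simply decodes the stream and counts how often it is invoked, and a LazyExtractionDemo driver. It is not the Jackrabbit implementation, only a demonstration that extraction is not triggered until the first read().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Assumed shape of the TextExtractor interface (hypothetical stand-in).
interface TextExtractor {
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

// Hypothetical extractor: decodes the stream as text and counts invocations.
class CountingExtractor implements TextExtractor {
    int calls = 0;

    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        calls++;
        return new InputStreamReader(stream, encoding);
    }
}

// Same lazy reader as in the message above.
class TextExtractorReader extends Reader {
    private final TextExtractor extractor;
    private final InputStream stream;
    private final String type;
    private final String encoding;
    private Reader reader;

    public TextExtractorReader(
            TextExtractor extractor, InputStream stream,
            String type, String encoding) {
        this.extractor = extractor;
        this.stream = stream;
        this.type = type;
        this.encoding = encoding;
    }

    public int read(char[] buffer, int offset, int length) throws IOException {
        if (reader == null) {
            reader = extractor.extractText(stream, type, encoding);
        }
        return reader.read(buffer, offset, length);
    }

    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        } else {
            stream.close();
        }
    }
}

class LazyExtractionDemo {
    // Drains a reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[64];
        int n;
        while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        CountingExtractor extractor = new CountingExtractor();
        InputStream stream =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Reader reader =
                new TextExtractorReader(extractor, stream, "text/plain", "UTF-8");

        // Constructing the reader has not invoked the extractor yet.
        System.out.println("calls before read: " + extractor.calls);

        // The first read() triggers extraction; later reads reuse the same reader.
        String text = readAll(reader);
        System.out.println("text: " + text + ", calls after read: " + extractor.calls);
        reader.close();
    }
}
```

If such a reader is handed to an index field and the field is never read, the extractor is never run and close() just releases the underlying stream.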


> Enhance indexing of binary content
> ----------------------------------
>
>          Key: JCR-415
>          URL: http://issues.apache.org/jira/browse/JCR-415
>      Project: Jackrabbit
>         Type: Improvement

>   Components: indexing
>     Versions: 1.0, 1.0.1, 0.9
>     Reporter: Marcel Reutegger
>     Priority: Minor
>      Fix For: 1.1
>  Attachments: jackrabbit-extractor-r420472.patch, jackrabbit-query-r420472.patch,
>               org.apache.jackrabbit.core.query-extractor.jpg,
>               org.apache.jackrabbit.core.query.lucene-extractor.jpg,
>               org.apache.jackrabbit.extractor.jpg
>
> Indexing of binary content should be enhanced in order to allow either configuration
> of which fields are indexed or better support for custom NodeIndexer implementations.
> The current design has a couple of flaws that should be addressed at the same time:
> - Reader instances are requested from the text filters even though the reader might
>   never be used
> - only jcr:data properties of nt:resource nodes are fulltext indexed
> - It is up to the text filter implementation to decide the lucene field name for the
>   text representation; responsibility should be moved to the NodeIndexer. A text filter
>   should only provide a Reader instance.
> With those changes a custom NodeIndexer can then decide if a binary property has one
> or more representations in the index.

