jackrabbit-users mailing list archives

From Zhou Wu <zwu...@yahoo.com>
Subject Metadata, TextExtractor
Date Tue, 04 Aug 2009 06:16:25 GMT


1. It looks as if anyone who wants to store a document's metadata in the
repository has to do it on their own. Why can't the metadata be published
(configurably) during the text extraction stage? If I do it myself, I have to
process the document a second time just for the metadata, which hurts
performance badly. Note that in v2.0 the metadata object is in fact obtained
by Tika during that stage, but it is then discarded. Without the metadata in
place, we miss too much searchable information in the
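
To make the request in point 1 concrete, here is a minimal sketch of an extractor that returns the text and the metadata from one pass over the stream. Everything in it is hypothetical: `ExtractionResult`, `MetadataAwareExtractor` and `HeaderLineExtractor` are illustration names, not Jackrabbit or Tika API, and the toy input format (leading "Key: value" lines followed by the body) only stands in for a real parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical result type: the extracted text plus the metadata seen in the same pass. */
final class ExtractionResult {
    final Reader text;
    final Map<String, String> metadata;

    ExtractionResult(Reader text, Map<String, String> metadata) {
        this.text = text;
        this.metadata = metadata;
    }
}

/** Hypothetical interface (an assumption, not part of Jackrabbit). */
interface MetadataAwareExtractor {
    ExtractionResult extract(InputStream stream, String type, String encoding)
            throws IOException;
}

/** Toy implementation: leading "Key: value" lines are metadata, the rest is body text. */
final class HeaderLineExtractor implements MetadataAwareExtractor {
    @Override
    public ExtractionResult extract(InputStream stream, String type, String encoding)
            throws IOException {
        // Fall back to UTF-8 when the caller does not know the encoding.
        String charset = (encoding != null) ? encoding : "UTF-8";
        BufferedReader in = new BufferedReader(new InputStreamReader(stream, charset));

        Map<String, String> metadata = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        boolean inHeader = true;
        String line;
        while ((line = in.readLine()) != null) {
            int colon = line.indexOf(": ");
            if (inHeader && colon > 0) {
                // Still in the header block: record the pair instead of discarding it.
                metadata.put(line.substring(0, colon), line.substring(colon + 2).trim());
            } else {
                inHeader = false;
                // Skip blank separator lines before the body starts.
                if (!line.isEmpty() || body.length() > 0) {
                    body.append(line).append('\n');
                }
            }
        }
        return new ExtractionResult(new StringReader(body.toString()), metadata);
    }
}
```

A caller could index `result.text` for full-text search and copy `result.metadata` into node properties, so the document is read only once.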

2. The TextExtractor interface has only one method:

 Reader extractText(InputStream stream, String type, String encoding)
        throws IOException;

If I want to implement my own extractor, for example one that automates MS
Office, I first have to write the stream to a file and then let MS Office open
that file and do the processing. Why can't we have a method that takes a URL?

Reader extractText(URL url)
        throws IOException;
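
Until such a method exists, the stream-first workaround described above can be wrapped in an adapter. The following is only a sketch under assumptions: `UrlTextExtractor` mirrors the proposed URL signature and is not a Jackrabbit interface, and `SpoolingExtractorAdapter` is a hypothetical name. It spools the stream to a temporary file, hands the file's URL to the delegate, and buffers the result so the file can be deleted before returning.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical URL-based extractor, mirroring the signature proposed above. */
interface UrlTextExtractor {
    Reader extractText(URL url) throws IOException;
}

/** Adapter that lets a URL-based extractor serve a stream-based call. */
final class SpoolingExtractorAdapter {
    private final UrlTextExtractor delegate;

    SpoolingExtractorAdapter(UrlTextExtractor delegate) {
        this.delegate = delegate;
    }

    /** Same parameters as the stream-based method; type and encoding are unused here. */
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        Path tmp = Files.createTempFile("extract-", ".tmp");
        try {
            // Write the stream to disk so an external tool could open it by location.
            Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);

            // Read the delegate's output eagerly; the temp file is removed afterwards.
            StringBuilder text = new StringBuilder();
            try (Reader r = delegate.extractText(tmp.toUri().toURL())) {
                char[] buf = new char[4096];
                int n;
                while ((n = r.read(buf)) != -1) {
                    text.append(buf, 0, n);
                }
            }
            return new StringReader(text.toString());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Buffering the whole text in memory keeps the sketch short; for large documents one would more likely return a reader that deletes the temporary file when it is closed.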

One might wonder why I would go to MS Office to get the text. There are
certainly drawbacks, as many of you will point out, but it can extract text
from any MS Office document (which is what I need), and it is very accurate
for these documents, particularly for the newer versions and for documents
that mix right-to-left and left-to-right languages. Note that the extracted
text can include RTL or LTR markers; I have not seen any parser that can do
this.
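
For readers unfamiliar with direction markers, here is a small illustration of what marked-up text can look like. It is an assumption about the kind of output meant above, not a description of what MS Office emits: the hypothetical `DirectionMarker` class inserts the Unicode left-to-right mark (U+200E) or right-to-left mark (U+200F) wherever the strong direction of the characters changes.

```java
/** Hypothetical helper: inserts Unicode direction marks at direction changes. */
final class DirectionMarker {
    static final char LRM = '\u200E'; // LEFT-TO-RIGHT MARK
    static final char RLM = '\u200F'; // RIGHT-TO-LEFT MARK

    private static final int NONE = 0, LTR = 1, RTL = 2;

    static String markDirectionChanges(String text) {
        StringBuilder out = new StringBuilder();
        int current = NONE;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);

            // Only strongly directional characters count; spaces, digits and
            // punctuation keep the direction of the run they appear in.
            int dir = NONE;
            byte d = Character.getDirectionality(cp);
            if (d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                dir = LTR;
            } else if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                dir = RTL;
            }

            // Emit a mark in front of the first character of each new run.
            if (dir != NONE && dir != current) {
                out.append(dir == LTR ? LRM : RLM);
                current = dir;
            }
            out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

This is much simpler than the full Unicode bidirectional algorithm; it only shows how markers can preserve direction information in plain extracted text.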

Just some thoughts -- I think for now I have to do this by myself.
