tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Multiple documents per input stream
Date Sun, 27 Sep 2009 12:59:19 GMT
Hi Jukka,

> On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler
> <kkrugler_lists@transpac.com> wrote:
>> Longer term it would be great to not have to worry about handling two
>> different cases - e.g. by being able to call
>>
>> while (parser.parse(is, handler, metadata, context)) {
>>        <process the doc>
>> }
>>
>> Though I think this would also require passing in metadata like
>> RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to
>> avoid having to worry about selectively clearing out metadata. But I
>> think that would be better anyway, versus the co-mingling of input &
>> output data in the metadata container.
>
> The second option I gave in my earlier message is now a bit more
> straightforward with the parsing context option introduced recently in
> Tika trunk. You can now explicitly pass a delegate parser to be used
> to process any component documents:
>
>    Parser myComponentParser = new Parser() {
>        public void parse(...) throws ... {
>            // Process the component document stream
>            // in any way you like, optionally passing the
>            // extracted text also to the top level parser
>            // through the given ContentHandler
>        }
>    };
>
>    Map<String, Object> context = new HashMap<String, Object>();
>    context.put(Parser.class.getName(), myComponentParser);
>    parser.parse(stream, handler, metadata, context);
>
> In this example myComponentParser.parse() would get called once for
> each component document inside a package.

OK, thanks.

Though I don't think this would address the fundamental question of  
how to generically extract metadata like the title from compound  
documents, right?

You'd still have to know something about how the delegate parser  
embeds this information in the actual XHTML output.
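
To make the concern concrete, here is a rough sketch in Java. It uses
stand-in types, not the real Tika API: SketchParser, CollectingParser and
PackageParser are all hypothetical, plain maps replace Metadata, and a
StringBuilder replaces the ContentHandler. It shows one possible side
channel, where the delegate keeps a copy of each component's metadata so
the caller doesn't have to dig the title back out of the XHTML. Whether a
real package parser would hand the delegate usable per-component metadata
is an assumption on my part.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DelegateMetadataSketch {

    /** Stand-in for a parser interface; NOT the real Tika Parser. */
    interface SketchParser {
        void parse(InputStream stream, StringBuilder handler,
                   Map<String, String> metadata, Map<String, Object> context)
                throws IOException;
    }

    /** Hypothetical delegate that keeps a copy of each component's metadata. */
    static class CollectingParser implements SketchParser {
        final List<Map<String, String>> collected = new ArrayList<>();

        @Override
        public void parse(InputStream stream, StringBuilder handler,
                          Map<String, String> metadata, Map<String, Object> context)
                throws IOException {
            String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
            // Assumption for this sketch: the first line of a component is its title.
            metadata.put("title", text.split("\n", 2)[0]);
            // Side channel: remember the metadata instead of encoding it in the output.
            collected.add(new HashMap<>(metadata));
            handler.append("<p>").append(text.replace("\n", " ")).append("</p>");
        }
    }

    /** Hypothetical package parser: calls the delegate once per component. */
    static class PackageParser implements SketchParser {
        @Override
        public void parse(InputStream stream, StringBuilder handler,
                          Map<String, String> metadata, Map<String, Object> context)
                throws IOException {
            // Look up the delegate under the interface name, as in the quoted example.
            SketchParser delegate =
                    (SketchParser) context.get(SketchParser.class.getName());
            String all = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
            int index = 0;
            // Components are separated by a "---" line in this toy format.
            for (String part : all.split("\n---\n")) {
                Map<String, String> componentMetadata = new HashMap<>();
                componentMetadata.put("resourceName", "part-" + index++);
                delegate.parse(
                        new ByteArrayInputStream(part.getBytes(StandardCharsets.UTF_8)),
                        handler, componentMetadata, context);
            }
        }
    }

    /** Parses a toy package and returns the metadata the delegate collected. */
    public static List<Map<String, String>> run(String packageText) throws IOException {
        CollectingParser delegate = new CollectingParser();
        Map<String, Object> context = new HashMap<>();
        context.put(SketchParser.class.getName(), delegate);
        new PackageParser().parse(
                new ByteArrayInputStream(packageText.getBytes(StandardCharsets.UTF_8)),
                new StringBuilder(), new HashMap<>(), context);
        return delegate.collected;
    }

    public static void main(String[] args) throws IOException {
        for (Map<String, String> m : run("First doc\nbody one\n---\nSecond doc\nbody two")) {
            System.out.println(m.get("resourceName") + " -> " + m.get("title"));
        }
    }
}
```

That works only because the caller wrote the delegate. It still isn't
generic, since the top-level handler sees nothing but the concatenated
XHTML.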

Thanks,

-- Ken


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378

