tika-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Custom parser error
Date Wed, 01 Aug 2012 06:55:15 GMT
Hi,

> Hi Nick, sorry to bother again but I'm not quite sure of what you have
> said.
> 
> 
> Nick Burch-2 wrote
> >
> > On Tue, 31 Jul 2012, 122jxgcn wrote:
> > If your TikaInputStream lacks a file, and getFile is called, one will
> > automatically be created for you. (That's part of the point!)
> >
> I believe created file will be empty. Then how can I process the input
> file without its data?

It will not be empty. It seems there is some misunderstanding here. Of
course an InputStream obtained from getResourceAsStream() has no backing file
(or the file is not easily reachable). The main idea behind TikaInputStream is
to provide the file on request. If hasFile() returns false, TikaInputStream
will do the following when you call getFile():
- create temporary file
- copy the whole stream to the temporary file
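
The two steps above can be sketched with plain java.nio calls. This is only
an illustration of the idea, not Tika's actual implementation; the class name
SpoolSketch and the method spoolToTempFile are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of "create temporary file, copy the whole stream".
// It does not use Tika and is not Tika's actual code.
class SpoolSketch {

    // Copies everything remaining in the stream into a new temporary file.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-sketch-", ".tmp");
        // createTempFile already created an empty file, so the copy must overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "not empty".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(data)) {
            Path tmp = spoolToTempFile(in);
            // The temporary file holds the stream's bytes, so its size is non-zero.
            System.out.println(Files.size(tmp) + " bytes copied");
            Files.delete(tmp);
        }
    }
}
```

The point of the sketch is that the temporary file receives the full contents
of the stream, which is why it is not empty.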

After that you can process the contents. If the InputStream passed to
TikaInputStream is able to supply its backing file, getFile() will return it
directly, but in most cases it will create a temporary one and copy the
contents into it. Because of this it's always better to make your parser work
on an InputStream and only use a file if the parser cannot (e.g. because it
needs random access).

> So basically, my file is converted to InputStream by
> 
> InputStream stream = HWPParserTest.class.getResourceAsStream(
>                 "/test-documents/testHWP.hwp");
> 
> After that, InputStream stream is passed to parser() of HWPParser and it
> should be converted to TikaInputStream tstream without the loss of input
> file data.
> I'm currently doing
> 
> TikaInputStream tstream = TikaInputStream.get(stream);
> 
> right now.
> I believe tstream.hasFile() should true right away in order to my parser
> class to work.

No, hasFile() only tells you whether the wrapped InputStream has a backing
file; for resource streams this is not the case. If you call getFile(), it
will emulate a backing file by copying to a temporary one. After that the
stream is exhausted.
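
To illustrate what "exhausted" means for a plain java.io stream (again only a
sketch without Tika; ExhaustedStreamSketch and readAfterSpool are hypothetical
names): once all bytes have been copied to the temporary file, a further
read() on the original stream reports end of stream, so the data has to be
read from the file from then on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration; it shows plain java.io behaviour, not Tika internals.
class ExhaustedStreamSketch {

    // Copies the stream to a temporary file, then tries to read one more byte
    // from the original stream and returns what read() reports.
    static int readAfterSpool(byte[] data) throws IOException {
        InputStream in = new ByteArrayInputStream(data);
        Path tmp = Files.createTempFile("exhausted-sketch-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // All bytes were consumed by the copy, so this is end of stream (-1).
            return in.read();
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAfterSpool(new byte[] {1, 2, 3}));
    }
}
```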

> Thanks a lot.
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Custom-parser-error-tp3998302p3998536.html
> Sent from the Apache Tika - Development mailing list archive at
> Nabble.com.

