manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: Repository document stream empty after Tika Transformation
Date Fri, 17 Jul 2015 13:11:16 GMT
I don't see this here.

I set up the following:
- file system repository connection
- null output connection
- tika extractor
- a job using all three

Running the job and looking at the simple history, I see null output
connection ingestion records that have proper document sizes.

Can you repeat the same setup there, and tell me what you get?

Thanks,
Karl

Sent from my Windows Phone
------------------------------
From: chalitha udara Perera
Sent: 7/17/2015 8:46 AM
To: Karl Wright
Cc: dev@manifoldcf.apache.org
Subject: Re: Repository document stream empty after Tika Transformation

Hi Karl,

I'm using the 2.1 release, and I am using only the Solr output connector.
If you look at the input stream size (document.getBinaryLength()) after the
Tika connector, it is zero.

Thanks,
Chalitha

On Fri, Jul 17, 2015 at 6:08 PM, Karl Wright <daddywri@gmail.com> wrote:

> The document stream contains what tika extracts.  If it can't extract
> anything then you will have an empty stream.
>
> It is also possible that if the stream is split, you are tripping over a
> bug that was fixed some time ago.  What mcf version is this, and do you
> have more than one output?
>
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: chalitha udara Perera
> Sent: 7/17/2015 7:25 AM
> To: dev@manifoldcf.apache.org
> Subject: Repository document stream empty after Tika Transformation
>
> Hi All,
>
> I'm writing a transformation connector to extract low-level features from
> images. At first I used that connector without the Tika extractor and it
> worked fine. But when I use it together with the Tika connector (placed
> after Tika), it fails to extract features. After debugging, I found that
> the stream is empty after the Tika transformation.
> Inside the Tika connector, a new in-memory or file-backed output stream is
> created, but the original input stream is never copied to it. The
> connector should reset the binary stream after using it to get metadata,
> so that the original input stream remains available from connector to
> connector.
>
> I have attached a simple stream copy-and-reset solution that worked for
> me.
>
> Thanks,
> Chalitha
>
> --
> J.M Chalitha Udara Perera
>
> *Department of Computer Science and Engineering,*
> *University of Moratuwa,*
> *Sri Lanka*
>



-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
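
The copy-and-reset idea described in the original message can be sketched in plain Java. This is not the attached patch and it does not use any ManifoldCF or Tika classes; the class name StreamCopySketch and the method buffer are hypothetical names chosen for illustration. It only shows the general technique: read the incoming stream into a buffer once, then open a fresh stream over that buffer for each consumer, so a downstream connector does not see an already-drained stream. It assumes the document is small enough to hold in memory; a large document would need a file-backed copy instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of ManifoldCF.
class StreamCopySketch {

    // Drains the given stream into a byte array so the content can be replayed.
    static byte[] buffer(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the repository document's binary stream.
        InputStream original =
            new ByteArrayInputStream("image bytes".getBytes(StandardCharsets.UTF_8));

        // Copy once; the original stream is exhausted after this call.
        byte[] content = buffer(original);

        // First consumer (for example, metadata extraction) reads its own stream.
        byte[] seenByExtractor = buffer(new ByteArrayInputStream(content));

        // A later consumer gets a fresh stream over the same bytes.
        byte[] seenDownstream = buffer(new ByteArrayInputStream(content));

        System.out.println("extractor saw " + seenByExtractor.length + " bytes");
        System.out.println("downstream saw " + seenDownstream.length + " bytes");
    }
}
```

In a real connector the buffered length would also have to be reported to the next stage along with the new stream; how that is wired into the pipeline is not shown here.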
