manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Repository document stream empty after Tika Transformation
Date Fri, 17 Jul 2015 21:17:43 GMT
Hi Chalitha,

The only documents I see here are ones that Tika cannot extract
content from, namely JPGs and similar files.

Karl


On Fri, Jul 17, 2015 at 12:09 PM, chalitha udara Perera <
chalithaudara@gmail.com> wrote:

> Hi Karl,
>
> I have attached the result from File System -> Tika Transform -> Null
> Output.
>
> Thank you,
> Chalitha
>
> On Fri, Jul 17, 2015 at 6:41 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I don't see this here.
>>
>> I set up the following:
>> - file system repository connection
>> - null output connection
>> - tika extractor
>> - a job using all three
>>
>> Running the job and looking at the simple history, I see null output
>> connection ingestion records that have proper document sizes.
>>
>> Can you repeat the same setup there, and tell me what you get?
>>
>> Thanks,
>> Karl
>>
>> Sent from my Windows Phone
>> ------------------------------
>> From: chalitha udara Perera
>> Sent: 7/17/2015 8:46 AM
>> To: Karl Wright
>> Cc: dev@manifoldcf.apache.org
>> Subject: Re: Repository document stream empty after Tika Transformation
>>
>> Hi Karl,
>>
>> I'm using the 2.1 release, with only the Solr output connector. If
>> you look at the input stream size (document.getBinaryLength()) after
>> the Tika connector, it is zero.
>>
>> Thanks,
>> Chalitha
>>
>> On Fri, Jul 17, 2015 at 6:08 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> The document stream contains what tika extracts.  If it can't extract
>>> anything then you will have an empty stream.
>>>
>>> It is also possible that, if the stream is split, you are tripping over a
>>> bug that was fixed some time ago.  Which ManifoldCF version is this, and do
>>> you have more than one output?
>>>
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> ------------------------------
>>> From: chalitha udara Perera
>>> Sent: 7/17/2015 7:25 AM
>>> To: dev@manifoldcf.apache.org
>>> Subject: Repository document stream empty after Tika Transformation
>>>
>>> Hi All,
>>>
>>> I'm writing a transformation connector to extract low-level features
>>> from images. At first I used that connector without the Tika extractor and
>>> it worked fine. But when I use it after the Tika connector, it fails
>>> to extract features. After debugging, I found that the stream is empty
>>> after the Tika transformation.
>>> Inside the Tika connector, a new in-memory or file-backed output stream is
>>> created, but the original input stream is never copied to it. The connector
>>> should reset the binary stream after using it to get metadata, so that
>>> the original input stream remains available from connector to connector.
>>>
>>> I have attached a simple stream copy and reset solution that
>>> worked for me.
>>>
>>> Thanks,
>>> Chalitha
>>>
>>> --
>>> J.M Chalitha Udara Perera
>>>
>>> *Department of Computer Science and Engineering,*
>>> *University of Moratuwa,*
>>> *Sri Lanka*
>>>
>>
>>
>>
>> --
>> J.M Chalitha Udara Perera
>>
>> *Department of Computer Science and Engineering,*
>> *University of Moratuwa,*
>> *Sri Lanka*
>>
>
>
>
> --
> J.M Chalitha Udara Perera
>
> *Department of Computer Science and Engineering,*
> *University of Moratuwa,*
> *Sri Lanka*
>
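
The stream copy and reset idea described in the original message can be
sketched roughly as follows. This is not the attached patch, which is not
part of this archive, and it does not use any ManifoldCF or Tika classes.
It only assumes plain java.io streams; the class name StreamCopySketch and
the methods readFully and consume are hypothetical names chosen for
illustration, and consume merely stands in for a metadata extraction step
that reads the stream to its end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamCopySketch {

    /** Reads the whole stream into memory so it can be replayed later. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /**
     * Hypothetical stand-in for metadata extraction: it drains the
     * stream and returns the number of bytes it consumed.
     */
    public static long consume(InputStream in) throws IOException {
        long total = 0;
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample data in place of a real image file.
        byte[] original = "fake image bytes".getBytes(StandardCharsets.UTF_8);
        InputStream upstream = new ByteArrayInputStream(original);

        // Copy once, then give each reader its own fresh stream over the copy.
        byte[] copy = readFully(upstream);
        long seen = consume(new ByteArrayInputStream(copy));
        InputStream downstream = new ByteArrayInputStream(copy);

        System.out.println("extractor consumed " + seen + " bytes");
        System.out.println("downstream still has " + downstream.available() + " bytes");
    }
}
```

An in-memory copy like this only suits small documents. For large
files, a temporary file would be the more cautious choice, in line with the
"in memory or file stream" distinction mentioned in the original message.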
