manifoldcf-dev mailing list archives

From chalitha udara Perera <chalithaud...@gmail.com>
Subject Re: Repository document stream empty after Tika Transformation
Date Mon, 20 Jul 2015 10:36:24 GMT
Hi Karl,

Maybe it is best not to send image streams, as that would index the binary
content. I can put multimedia extraction connectors before the Tika
connector in the pipeline. The only catch is that I will have to use Tika
internally to detect media types. No problem.

Thanks,
Chalitha
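
For the media-type check mentioned above, here is a minimal sketch. It is only a stand-in: it uses the JDK's URLConnection.guessContentTypeFromStream rather than Tika, which recognizes far more formats. The class and method names (MediaTypeSniffer, guessType, isImage) are made up for illustration and are not part of ManifoldCF or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

/** Hypothetical helper: guesses a MIME type from the leading bytes of a document. */
final class MediaTypeSniffer {
  private MediaTypeSniffer() {}

  /** Returns the guessed MIME type, or null when the JDK does not recognize the bytes. */
  static String guessType(byte[] head) throws IOException {
    // ByteArrayInputStream supports mark/reset, which guessContentTypeFromStream requires.
    try (InputStream in = new ByteArrayInputStream(head)) {
      return URLConnection.guessContentTypeFromStream(in);
    }
  }

  /** True when the guessed type is some image type (image/png, image/gif, ...). */
  static boolean isImage(byte[] head) throws IOException {
    String type = guessType(head);
    return type != null && type.startsWith("image/");
  }
}
```

Under these assumptions, a connector placed before Tika could call isImage on the first few bytes of a document and send only images on to feature extraction.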

On Sat, Jul 18, 2015 at 12:54 PM, Karl Wright <daddywri@gmail.com> wrote:

> MCF will transmit metadata for images, but since there is no other content,
> the main content stream will have zero length.  This seems perfectly
> correct to me; I cannot see that any changes are needed or even desirable
> here.
>
> Thanks
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: chalitha udara Perera
> Sent: 7/18/2015 1:20 AM
>
> To: Karl Wright
> Cc: dev@manifoldcf.apache.org
> Subject: Re: Repository document stream empty after Tika Transformation
>
> Hi Karl,
>
> I mainly work with images. Tika does extract EXIF metadata from images,
> and I have attached the ManifoldCF log containing the image metadata
> extracted by Tika. I would like to use a separate connector after that to
> extract low-level features such as SIFT to provide image search. Currently
> I cannot do that because the stream is empty for images.
>
> But I tried with some PDF documents and, as you said, I can see output
> connection ingestion records with correct document sizes.
>
> Thank you,
> Chalitha
>
> On Sat, Jul 18, 2015 at 2:47 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Chalitha,
>>
>> The only documents I see here are documents that Tika cannot extract
>> content from, namely JPGs etc.
>>
>> Karl
>>
>>
>> On Fri, Jul 17, 2015 at 12:09 PM, chalitha udara Perera <
>> chalithaudara@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> Here I have attached the result from File System -> Tika Transform ->
>>> Null Output.
>>> Please find the attachment.
>>>
>>> Thank you,
>>> Chalitha
>>>
>>> On Fri, Jul 17, 2015 at 6:41 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I don't see this here.
>>>>
>>>> I set up the following:
>>>> - file system repository connection
>>>> - null output connection
>>>> - tika extractor
>>>> - a job using all three
>>>>
>>>> Running the job and looking at the simple history, I see null output
>>>> connection ingestion records that have proper document sizes.
>>>>
>>>> Can you repeat the same setup there, and tell me what you get?
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: chalitha udara Perera
>>>> Sent: 7/17/2015 8:46 AM
>>>> To: Karl Wright
>>>> Cc: dev@manifoldcf.apache.org
>>>> Subject: Re: Repository document stream empty after Tika Transformation
>>>>
>>>> Hi Karl,
>>>>
>>>> I'm using the 2.1 release, and I am using only the Solr output connector.
>>>> If you look at the input stream size (document.getBinaryLength()) after
>>>> the Tika connector, it is zero.
>>>>
>>>> Thanks,
>>>> Chalitha
>>>>
>>>> On Fri, Jul 17, 2015 at 6:08 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> The document stream contains what tika extracts.  If it can't extract
>>>>> anything then you will have an empty stream.
>>>>>
>>>>> It is also possible that if the stream is split, you are tripping over
>>>>> a bug that was fixed some time ago.  What MCF version is this, and do
>>>>> you have more than one output?
>>>>>
>>>>> Karl
>>>>>
>>>>> Sent from my Windows Phone
>>>>> ------------------------------
>>>>> From: chalitha udara Perera
>>>>> Sent: 7/17/2015 7:25 AM
>>>>> To: dev@manifoldcf.apache.org
>>>>> Subject: Repository document stream empty after Tika Transformation
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm writing a transformation connector to extract low-level features
>>>>> from images. First I used that connector without the Tika extractor
>>>>> and it worked fine. But when I use it after the Tika connector, it
>>>>> fails to extract features. After debugging, I found that the stream is
>>>>> empty after the Tika transformation.
>>>>> Inside the Tika connector, a new in-memory or file stream output is
>>>>> created, but the original input stream is never copied to it. The
>>>>> connector should reset the binary stream after using it to get
>>>>> metadata, so that the original input stream stays available from
>>>>> connector to connector.
>>>>>
>>>>> Here I have attached a simple solution of stream copy and reset that
>>>>> worked for me.
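
This is not the attached patch. It is only a rough sketch of the general copy-and-reset idea described above, using plain java.io and assuming the document fits in memory (large files would need a temporary file instead). ReplayableStream and its methods are hypothetical names; how such a buffer would be wired into a ManifoldCF connector is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of ManifoldCF: buffers a one-shot stream so it can be re-read. */
final class ReplayableStream {
  private final byte[] data;

  private ReplayableStream(byte[] data) {
    this.data = data;
  }

  /** Drains the given input stream into memory. The caller still owns and closes it. */
  static ReplayableStream buffer(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
      out.write(chunk, 0, n);
    }
    return new ReplayableStream(out.toByteArray());
  }

  /** Returns a fresh stream positioned at the first byte; each consumer gets its own. */
  InputStream open() {
    return new ByteArrayInputStream(data);
  }

  /** Number of buffered bytes. */
  long length() {
    return data.length;
  }
}
```

With a buffer like this, one consumer can read open() to pull metadata and a later one can call open() again and still see the original bytes.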
>>>>>
>>>>> Thanks,
>>>>> Chalitha
>>>>>
>>>>> --
>>>>> J.M Chalitha Udara Perera
>>>>>
>>>>> *Department of Computer Science and Engineering,*
>>>>> *University of Moratuwa,*
>>>>> *Sri Lanka*
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>



-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
