manifoldcf-dev mailing list archives

From chalitha udara Perera <chalithaud...@gmail.com>
Subject Re: Repository document stream empty after Tika Transformation
Date Sat, 18 Jul 2015 05:20:29 GMT
Hi Karl,

I mainly work with images. Tika does extract EXIF metadata from images; I
have attached the ManifoldCF log containing the image metadata it extracted.
After Tika, I would like to use a separate connector to extract low-level
features such as SIFT for image search. Currently I cannot do that because,
for images, the stream is empty (zero length).

But I tried some PDF documents and, as you said, I can see output
connection ingestion records with correct document sizes.
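
For illustration only, here is a minimal sketch of the copy-and-reset idea from the first message in this thread (quoted below). It uses only java.io and is not the attached patch or the Tika connector's code; the class ReplayableContent and its methods are hypothetical names. It assumes the document is small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/**
 * Hypothetical helper (not part of ManifoldCF or Tika): buffers a one-shot
 * stream so that several consumers can each read it from the first byte.
 */
final class ReplayableContent {
    private final byte[] bytes;

    private ReplayableContent(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Drains the source once and keeps a copy of its bytes in memory. */
    static ReplayableContent copyOf(InputStream source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = source.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new ReplayableContent(buffer.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Each call returns a fresh stream positioned at the first byte. */
    ByteArrayInputStream openStream() {
        return new ByteArrayInputStream(bytes);
    }

    /** Number of bytes every consumer will see. */
    long length() {
        return bytes.length;
    }

    public static void main(String[] args) {
        // Stand-in for image bytes; a real connector would receive a stream.
        byte[] fakeImage = {(byte) 0xFF, (byte) 0xD8, 1, 2, 3};
        ReplayableContent content =
            ReplayableContent.copyOf(new ByteArrayInputStream(fakeImage));

        // First consumer (e.g. metadata extraction) reads the stream to the end.
        InputStream forMetadata = content.openStream();
        try {
            while (forMetadata.read() != -1) { /* consume */ }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Second consumer (e.g. feature extraction) still sees every byte.
        System.out.println(content.openStream().available()
            + " of " + content.length() + " bytes available");
    }
}
```

For large files an in-memory copy may not be acceptable; the same pattern could spool to a temporary file instead.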

Thank you,
Chalitha

On Sat, Jul 18, 2015 at 2:47 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Chalitha,
>
> The only documents I see here are documents that Tika cannot extract
> content from, namely JPGs etc.
>
> Karl
>
>
> On Fri, Jul 17, 2015 at 12:09 PM, chalitha udara Perera <
> chalithaudara@gmail.com> wrote:
>
>> Hi Karl,
>>
>> I have attached the result of a File System -> Tika Transform ->
>> Null Output job.
>>
>> Thank you,
>> Chalitha
>>
>> On Fri, Jul 17, 2015 at 6:41 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> I don't see this here.
>>>
>>> I set up the following:
>>> - file system repository connection
>>> - null output connection
>>> - tika extractor
>>> - a job using all three
>>>
>>> Running the job and looking at the simple history, I see null output
>>> connection ingestion records that have proper document sizes.
>>>
>>> Can you repeat the same setup there, and tell me what you get?
>>>
>>> Thanks,
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> ------------------------------
>>> From: chalitha udara Perera
>>> Sent: 7/17/2015 8:46 AM
>>> To: Karl Wright
>>> Cc: dev@manifoldcf.apache.org
>>> Subject: Re: Repository document stream empty after Tika Transformation
>>>
>>> Hi Karl,
>>>
>>> I'm using the 2.1 release, and I am using only the Solr output connector.
>>> If you look at the input stream size (document.getBinaryLength()) after the
>>> Tika connector, it is zero.
>>>
>>> Thanks,
>>> Chalitha
>>>
>>> On Fri, Jul 17, 2015 at 6:08 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> The document stream contains what Tika extracts.  If it can't extract
>>>> anything then you will have an empty stream.
>>>>
>>>> It is also possible that, if the stream is split, you are tripping over
>>>> a bug that was fixed some time ago.  What ManifoldCF version is this, and
>>>> do you have more than one output?
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: chalitha udara Perera
>>>> Sent: 7/17/2015 7:25 AM
>>>> To: dev@manifoldcf.apache.org
>>>> Subject: Repository document stream empty after Tika Transformation
>>>>
>>>> Hi All,
>>>>
>>>> I'm writing a transformation connector to extract low-level features
>>>> from images. First I used that connector without the Tika extractor and
>>>> it worked fine. But when I used it after the Tika connector, it fails to
>>>> extract features. After debugging I found out that the stream is empty
>>>> after the Tika transformation.
>>>> Inside the Tika connector, a new in-memory or file-backed output stream
>>>> is created, but the original input stream is never copied to it. The
>>>> connector should reset the binary stream after using it to get metadata,
>>>> so that the original input stream is passed on from connector to
>>>> connector.
>>>>
>>>> I have attached a simple solution, a stream copy and reset, that
>>>> worked for me.
>>>>
>>>> Thanks,
>>>> Chalitha
>>>>
>>>> --
>>>> J.M Chalitha Udara Perera
>>>>
>>>> *Department of Computer Science and Engineering,*
>>>> *University of Moratuwa,*
>>>> *Sri Lanka*
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
