manifoldcf-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Unnecessary Newline Characters and Metadata at Content
Date Fri, 25 Nov 2016 10:24:00 GMT
Hi Karl,

I used the default values for Solr. In my Solr output connector, "Use the
Extract Update Handler" is checked, and the update handler is defined as
"/update/extract". No Tika content extractor is defined in the job
pipeline.

I have Wireshark captures and logs from both ManifoldCF and Solr. I can
share them if you want.

Kind Regards,
Furkan KAMACI

On Fri, Nov 25, 2016 at 12:02 AM, Karl Wright <daddywri@gmail.com> wrote:

> Is this being indexed via the extracting update handler?  What does your
> pipeline look like?  Is the tika extractor in the pipeline?
>
>
> Karl
>
>
> On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <furkankamaci@gmail.com>
> wrote:
>
>> I've indexed a file via ManifoldCF to Solr whose content starts
>> with:
>>
>> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
>> by Elia Kazan, 1951*
>>
>> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>> Elia Kazan, 1951*
>>
>> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
>> Elia Kazan, 1951*
>>
>> However, when I check Solr, I see this in the content field:
>>
>> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
>> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>> directed by Elia Kazan \n"*
>>
>> There are two problems here:
>>
>> 1) There are unnecessary newline characters.
>>
>> 2) Metadata is prepended to the content field, which should not happen.
>>
>> So the problem may be in Solr or in ManifoldCF (related to
>> Tika). When I index the same document to Solr via cURL, no newline
>> characters or metadata are prepended.
>>
>> What do you think would be a solution?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>>
>
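
The two symptoms reported in the thread (stray newlines and metadata keys in the content field) can be checked for with a short script. The following is a minimal Python sketch, not a fix for the underlying cause. The function names are hypothetical, the key list is copied from the content value quoted above, and the sample string is a shortened version of that value.

```python
import re

# Metadata keys copied from the Solr content value quoted in this thread.
# Assumption: other documents may leak different keys.
LEAKED_KEYS = (
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
)


def leaked_metadata_keys(content):
    """Return the known metadata keys that occur as whole tokens in content."""
    tokens = content.split()
    return [key for key in LEAKED_KEYS if key in tokens]


def collapse_whitespace(content):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", content).strip()


# Shortened version of the content value reported in the thread.
sample = (
    " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type "
    "application/rtf   \nstream_size 13580   \n  \n \n  "
    "1. Vivien Leigh and Marlon Brando \n"
)

print(leaked_metadata_keys(sample))
print(collapse_whitespace(sample))
```

Running the same check against the document indexed via cURL and the one indexed via ManifoldCF would show whether only the ManifoldCF path produces the extra keys.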
