manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Unnecessary Newline Characters and Metadata at Content
Date Thu, 24 Nov 2016 22:02:53 GMT
Is this being indexed via the extracting update handler?  What does your
pipeline look like?  Is the Tika extractor in the pipeline?


Karl


On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <furkankamaci@gmail.com>
wrote:

> I've indexed a file via ManifoldCF to Solr whose content starts with:
>
> *1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
> by Elia Kazan, 1951*
>
> *2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
> Elia Kazan, 1951*
>
> *3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
> Elia Kazan, 1951*
>
> However, when I check Solr, I see this in the content field:
>
> * " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
> application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
> \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
> \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> directed by Elia Kazan \n"*
>
> There are two problems here.
>
> 1) There are unnecessary newline characters.
>
> 2) Metadata is prepended to the content field, which should not happen.
>
> So one might think the problem is in either Solr or ManifoldCF (related to
> Tika). When I index the same document to Solr via cURL, no newline
> characters or metadata are prepended.
>
> What do you think would be a solution?
>
> Kind Regards,
> Furkan KAMACI
>
>
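
For illustration only, the two symptoms described in the quoted message (stray newlines and prepended metadata lines) can be sketched as a client-side cleanup in Python. This is a workaround sketch under assumptions, not a fix for the pipeline itself: the function name `clean_content` and the `METADATA_KEYS` list are hypothetical, and the keys are simply the ones visible in the sample content above.

```python
import re

# Metadata keys seen prepended to the content in the report above.
# Assumption: this list is illustrative, not an exhaustive set.
METADATA_KEYS = (
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
)


def clean_content(raw: str) -> str:
    """Drop blank lines and lines that start with a known metadata key,
    then collapse runs of whitespace inside the remaining lines."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # unnecessary newline / whitespace-only line
        if stripped.split(" ", 1)[0] in METADATA_KEYS:
            continue  # prepended "key value" metadata line
        kept.append(re.sub(r"\s+", " ", stripped))
    return "\n".join(kept)
```

Note that this would also drop a genuine body line that happens to begin with one of the listed keys, so it is only suitable for inspecting the problem, not as a general solution.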
