Is this being indexed via the extracting update handler? What does your pipeline look like? Is the Tika extractor in the pipeline?


Karl


On Thu, Nov 24, 2016 at 12:52 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
I've indexed a file into Solr via ManifoldCF. Its content starts with:

1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed by Elia Kazan, 1951

2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by Elia Kazan, 1951

3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by Elia Kazan, 1951

However, when I check Solr, I see this in the content field:

 " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf   \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\" directed by Elia Kazan \n"

There are two problems here.

1) There are unnecessary newline characters.

2) Metadata is prepended to the content field, which should not happen.
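
To make the two problems concrete, here is a minimal Python sketch that separates the metadata lines from the body in the sample value above. The function name `clean_content` and the key list are assumptions taken from this one sample, not from Solr or ManifoldCF, and this only illustrates what the expected content would look like; it is not a fix for the pipeline itself.

```python
import re

# Metadata keys seen in the sample value above. This list is an assumption
# drawn from that one document, not an exhaustive Solr/ManifoldCF list.
METADATA_KEYS = {
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
}


def clean_content(raw: str) -> str:
    """Drop metadata lines and collapse whitespace in a content value."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # problem 1: blank lines left by the extra newlines
        if stripped.split(" ", 1)[0] in METADATA_KEYS:
            continue  # problem 2: metadata prepended to the body
        kept.append(re.sub(r"\s+", " ", stripped))
    return "\n".join(kept)


sample = (
    " \n \nstream_source_info MARLON BRANDO.rtf   \n"
    "stream_content_type application/rtf   \nstream_size 13580   \n"
    "stream_name MARLON BRANDO.rtf   \nContent-Type application/rtf   \n"
    "resourceName MARLON BRANDO.rtf   \n  \n \n"
    '  1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" '
    "directed by Elia Kazan \n"
)
print(clean_content(sample))
```

Note that a body line whose first word happens to be one of these keys would also be dropped, so this is only suitable for inspecting the output, not as a general filter.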

So the problem may be in Solr or in ManifoldCF (related to Tika). When I index the same document into Solr via cURL, there are no newline characters and no prepended metadata.

What would you suggest as a solution?

Kind Regards,
Furkan KAMACI