manifoldcf-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Unnecessary Newline Characters and Metadata at Content
Date Thu, 24 Nov 2016 17:52:26 GMT
I've indexed a file into Solr via ManifoldCF. Its content starts with:

*1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" directed
by Elia Kazan, 1951*

*2. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
Elia Kazan, 1951*

*3. Portrait of Marlon Brando for "A Streetcar Named Desire" directed by
Elia Kazan, 1951*

However, when I check Solr, the content field looks like this:

* " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
\nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n  \n
\n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
directed by Elia Kazan \n"*

There are two problems here:

1) The content contains unnecessary newline characters.

2) Metadata is prepended to the content field, which should not happen.
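
As an illustration only, the two problems can be made concrete with a small
Python sketch of the cleanup that the field value would otherwise need. This
is not a fix in ManifoldCF, Tika, or Solr. The function name clean_content
and the METADATA_KEYS set are hypothetical; the keys are simply copied from
the output quoted above, and the exact spacing of the sample string is an
assumption because the quoted output is line-wrapped.

```python
import re

# Hypothetical list of metadata keys, copied from the Solr output above.
METADATA_KEYS = {
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
}


def clean_content(raw: str) -> str:
    """Drop leading blank/metadata lines, then collapse whitespace runs."""
    kept = []
    in_header = True
    for line in (part.strip() for part in raw.split("\n")):
        if in_header:
            first_token = line.split(" ", 1)[0] if line else ""
            # Skip blank lines and "key value" metadata lines at the top.
            if not line or first_token in METADATA_KEYS:
                continue
            in_header = False
        kept.append(line)
    # Newlines and repeated spaces become single spaces.
    return re.sub(r"\s+", " ", " ".join(kept)).strip()


# Sample value reconstructed from the output above (spacing is assumed).
sample = (
    " \n \nstream_source_info MARLON BRANDO.rtf   \n"
    "stream_content_type application/rtf   \n"
    "stream_size 13580   \nstream_name MARLON BRANDO.rtf \n"
    "Content-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n"
    "  \n \n  1. Vivien Leigh and Marlon Brando in "
    "\"A Streetcar Named Desire\" directed by Elia Kazan \n"
)

print(clean_content(sample))
```

Only the leading block of metadata lines is dropped; a metadata key that
happens to appear later in the body text is left alone.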

So the problem may be in Solr or in ManifoldCF (related to Tika). However,
when I index the same document into Solr directly via cURL, there are no
extra newline characters and no metadata is prepended.

What would you suggest as a solution?

Kind Regards,
Furkan KAMACI
