lucene-solr-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Metadata and Newline Characters at Content
Date Thu, 24 Nov 2016 16:43:20 GMT
Hi Erick,

1) I am looking at the stored data via the Solr Admin UI. I send the query
and check what is in the content field.

2) I can debug the Tika settings if you think that having such metadata
fields combined into the content field is not the desired behaviour.
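
On the newline question, a small sketch of why the Admin UI shows "\n": the
UI prints the JSON response, and JSON writes a newline character as the
two-character escape \n. The response fragment below is hypothetical and
abridged from the sample quoted further down in this thread; only the
decoding behaviour of the standard json module is being shown.

```python
import json

# Hypothetical, abridged fragment of a Solr JSON response (an assumption,
# not copied from a real server). The raw string keeps "\n" as backslash + n,
# which is how it appears on screen in the Admin UI.
raw_response = r'{"content": " \n \nstream_size 13580   \n  1. Vivien Leigh and Marlon Brando \n"}'

doc = json.loads(raw_response)
content = doc["content"]

# After decoding, each "\n" is a single newline character in the stored value.
print(repr(content))
print("newline characters:", content.count("\n"))
```

So the stored value contains ordinary newlines; the backslashes belong to
the JSON display only.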

PS: Is there any way to get rid of it other than using
PatternCaptureGroupFilterFactory?
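
One option that might be worth trying is cleaning the text on the client
side. What follows is a minimal sketch under that assumption, not something
Solr or Tika provides: the helper name strip_extract_metadata is
hypothetical, the key list is taken from the sample quoted below, and the
sample string is rejoined from the wrapped email lines. Note that an analysis
filter only changes indexed tokens, so the stored value would still show the
metadata.

```python
# Metadata keys seen at the start of the content sample in this thread.
METADATA_KEYS = {
    "stream_source_info",
    "stream_content_type",
    "stream_size",
    "stream_name",
    "Content-Type",
    "resourceName",
}


def strip_extract_metadata(content: str) -> str:
    """Drop blank lines and lines whose first token is a known metadata key."""
    kept = []
    for line in content.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # skip empty or whitespace-only lines
        first_token = stripped.split(None, 1)[0]
        if first_token in METADATA_KEYS:
            continue  # skip "key value" metadata lines
        kept.append(stripped)
    return "\n".join(kept)


# Sample value, rejoined from the wrapped lines quoted in this thread.
sample = (
    " \n \nstream_source_info MARLON BRANDO.rtf   \n"
    "stream_content_type application/rtf   \nstream_size 13580   \n"
    "stream_name MARLON BRANDO.rtf   \nContent-Type application/rtf   \n"
    "resourceName MARLON BRANDO.rtf   \n \n \n"
    '  1. Vivien Leigh and Marlon Brando in "A Streetcar Named Desire" '
    "directed by Elia Kazan \n"
)

cleaned = strip_extract_metadata(sample)
print(cleaned)
```

A limitation of this approach: a genuine body line that happens to begin
with one of these keys would be dropped as well.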

Kind Regards,
Furkan KAMACI

On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> 1> I'm assuming when you "see" this data you're looking at the stored
> data, right? It's a verbatim copy of whatever you sent to the field.
> I'm guessing it's a character-encoding mismatch between the source and
> what you use to display.
>
> 2> How are you extracting this data? There are Tika options I think
> that can/do mush fields together.
>
> Best,
> Erick
>
>
>
> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <furkankamaci@gmail.com>
> wrote:
> > Hi,
> >
> > I'm testing Solr 4.9.1 and have indexed documents with it. The content
> > field in the schema has the text_general field type, which is unmodified
> > from the original. I do not copy any fields to content. When I check the
> > data I see content values like:
> >
> >  " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
> > application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
> > \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n \n
> > \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> > directed by Elia Kazan \n"
> >
> > My questions:
> >
> > 1) Is it usual to have those newline characters?
> > 2) Is it usual to have file metadata at the beginning of the content
> > (i.e. stream_source_info, stream_content_type), or is it related to the
> > tool that I use to post data to Solr?
> >
> > Kind Regards,
> > Furkan KAMACI
>
