lucene-solr-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Metadata and Newline Characters at Content
Date Sat, 26 Nov 2016 19:47:45 GMT
Hi Erick,

I resolved my metadata problem by configuring solrconfig.xml. However, even
when I post data with post.sh, I see content like:

CANADA �1 \n  \n \n   \n Place

I have newline characters shown as \n and some non-ASCII characters. As far
as I understand, such characters are to be expected because the source is a
PDF file and its newline characters are shown as *\n* in Solr. How can I
remove them (\n and non-ASCII characters)?
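
A minimal sketch of the kind of cleanup being asked about, assuming it runs on
the client before the text is sent to Solr. It assumes the \n sequences are
real newline characters (shown escaped in the JSON response) and that dropping
every non-ASCII character is acceptable, which would also remove legitimate
accented letters. The function name clean_extracted_text is hypothetical and
not part of Solr or Tika.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Drop non-ASCII characters and collapse whitespace runs (hypothetical helper)."""
    # Encode to ASCII and discard anything that cannot be represented,
    # including the U+FFFD replacement character left by bad decoding.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse every run of whitespace (spaces, tabs, newlines) into one
    # space, then trim the ends.
    return re.sub(r"\s+", " ", ascii_only).strip()


sample = "CANADA \ufffd1 \n  \n \n   \n Place"
print(clean_extracted_text(sample))  # prints: CANADA 1 Place
```

If the field really contains a literal backslash followed by "n" rather than a
newline character, the whitespace pattern above would not match it and a
separate replacement would be needed.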

Kind Regards,
Furkan KAMACI

On Thu, Nov 24, 2016 at 8:58 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> Not sure. What have you tried?
>
> For production situations or when you want to take total control of
> the indexing process, I strongly recommend that you put the Tika
> parsing on the _client_.
>
> Here's a writeup on this topic:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI <furkankamaci@gmail.com>
> wrote:
> > Hi Erick,
> >
> > When I check the *Solr* documentation I see that [1]:
> >
> > *In addition to Tika's metadata, Solr adds the following metadata
> > (defined in ExtractingMetadataConstants):*
> >
> > *"stream_name" - The name of the ContentStream as uploaded to Solr.
> > Depending on how the file is uploaded, this may or may not be set.*
> > *"stream_source_info" - Any source info about the stream. See
> > ContentStream.*
> > *"stream_size" - The size of the stream in bytes(?)*
> > *"stream_content_type" - The content type of the stream, if available.*
> >
> > So it seems that these may be added by Solr rather than by Tika. Do you
> > know how to enable/disable this feature?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > [1] https://wiki.apache.org/solr/ExtractingRequestHandler
> >
> > On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> about PatternCaptureGroupFilterFactory. This isn't going to help. The
> >> data you see when you return stored data is _before_ any analysis so
> >> the Pattern....Factory won't be applied. You could do this in a
> >> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
> >> the real app deal with it.
> >>
> >> I don't particularly know about the Tika settings, that's largely a
> >> guess.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI <furkankamaci@gmail.com>
> >> wrote:
> >> > Hi Erick,
> >> >
> >> > 1) I am looking at the stored data via the Solr Admin UI. I send the
> >> > query and check what is in the content field.
> >> >
> >> > 2) I can debug the Tika settings if you think that having such
> >> > metadata fields combined into the content field is not the desired
> >> > behaviour.
> >> >
> >> > *PS: *Is there any solution to get rid of it except for
> >> > using PatternCaptureGroupFilterFactory?
> >> >
> >> > Kind Regards,
> >> > Furkan KAMACI
> >> >
> >> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson
> >> > <erickerickson@gmail.com> wrote:
> >> >
> >> >> 1> I'm assuming when you "see" this data you're looking at the stored
> >> >> data, right? It's a verbatim copy of whatever you sent to the field.
> >> >> I'm guessing it's a character-encoding mismatch between the source
> >> >> and what you use to display.
> >> >>
> >> >> 2> How are you extracting this data? There are Tika options I think
> >> >> that can/do mush fields together.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI
> >> >> <furkankamaci@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I'm testing Solr 4.9.1 and have indexed documents with it. The
> >> >> > content field in the schema has the text_general field type, which
> >> >> > is unmodified from the original. I do not copy any fields to
> >> >> > content. When I check the data I see content values like:
> >> >> >
> >> >> >  " \n \nstream_source_info MARLON BRANDO.rtf  \nstream_content_type
> >> >> > application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
> >> >> > \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf  \n
> >> >> > \n \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
> >> >> > directed by Elia Kazan \n"
> >> >> >
> >> >> > My questions:
> >> >> >
> >> >> > 1) Is it usual to have those newline characters?
> >> >> > 2) Is it usual to have file metadata at the beginning of the content
> >> >> > (i.e. stream_source_info, stream_content_type), or is it related to
> >> >> > the tool that I use to post data to Solr?
> >> >> >
> >> >> > Kind Regards,
> >> >> > Furkan KAMACI
> >> >>
> >>
>
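
The point made in the quoted thread, that stored data is captured before any
analysis runs, can be illustrated with a toy model. This is not Solr code: the
names toy_analyzer and toy_add_document are hypothetical, and the "analyzer"
is only a stand-in for a real analysis chain. It shows why a filter in the
analysis chain changes the indexed terms but leaves the stored value, which is
what a query returns, untouched.

```python
import re
from typing import Callable, Dict, List, Union


def toy_analyzer(text: str) -> List[str]:
    """Stand-in for an analysis chain: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def toy_add_document(
    raw: str, analyzer: Callable[[str], List[str]]
) -> Dict[str, Union[str, List[str]]]:
    """Toy model of a stored and indexed field (hypothetical, not Solr's API)."""
    return {
        # The stored value is a verbatim copy of the input.
        "stored": raw,
        # Only the indexed terms pass through the analyzer.
        "indexed_terms": analyzer(raw),
    }


doc = toy_add_document(" \n \n  1. Vivien Leigh and Marlon Brando \n", toy_analyzer)
print(repr(doc["stored"]))   # newlines are still present
print(doc["indexed_terms"])  # ['1', 'vivien', 'leigh', 'and', 'marlon', 'brando']
```

Under this model, cleaning the stored value has to happen before the document
reaches the field, for example on the client or in an update processor.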
