lucene-solr-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Metadata and Newline Characters at Content
Date Sat, 26 Nov 2016 20:59:51 GMT
PS: The \n characters are not shown in the browser, but they break how the
highlighter works. The \n characters are also counted toward fragsize.
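
A minimal sketch of one way to strip these characters on the client before the text is sent to Solr, in line with the client-side parsing advice quoted below. It assumes the extracted text is already available as a Python string holding real newline characters (the \n shown in a JSON response is only the escaped form). The function name clean_extracted_text and the ascii_only flag are made up for illustration and are not part of Solr or Tika.

```python
import re
import unicodedata


def clean_extracted_text(text: str, ascii_only: bool = False) -> str:
    """Collapse whitespace runs and optionally drop non-ASCII characters.

    Hypothetical helper for illustration only; not a Solr or Tika API.
    """
    # Remove the Unicode replacement character left behind by decoding errors.
    text = text.replace("\ufffd", "")
    if ascii_only:
        # Decompose accented letters, then discard whatever is still non-ASCII.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    # Replace every run of whitespace (spaces, tabs, newlines) with one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "CANADA \ufffd1 \n  \n \n   \n Place"
print(clean_extracted_text(sample))
```

Because each run of whitespace collapses to a single space, fewer filler characters should count toward the highlighter's fragment size. Dropping all non-ASCII characters is lossy for non-English text, which is why that step is optional in this sketch.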

On Sat, Nov 26, 2016 at 9:47 PM, Furkan KAMACI <furkankamaci@gmail.com>
wrote:

> Hi Erick,
>
> I resolved my metadata problem by configuring solrconfig.xml. However,
> even when I post data with post.sh, I see content like this:
>
> CANADA �1 \n  \n \n   \n Place
>
> I have newline characters (\n) and some non-ASCII characters. As far as I
> understand, it is usual to have such characters because the source is a PDF
> file and its newline characters are interpreted as *\n* in Solr. How can I
> remove them (\n and non-ASCII characters)?
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Nov 24, 2016 at 8:58 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> Not sure. What have you tried?
>>
>> For production situations or when you want to take total control of
>> the indexing process, I strongly recommend that you put the Tika
>> parsing on the _client_.
>>
>> Here's a writeup on this topic:
>>
>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI <furkankamaci@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > When I check the *Solr* documentation I see that [1]:
>> >
>> > *In addition to Tika's metadata, Solr adds the following metadata
>> > (defined in ExtractingMetadataConstants):*
>> >
>> > *"stream_name" - The name of the ContentStream as uploaded to Solr.
>> > Depending on how the file is uploaded, this may or may not be set.*
>> > *"stream_source_info" - Any source info about the stream. See
>> > ContentStream.*
>> > *"stream_size" - The size of the stream in bytes(?)*
>> > *"stream_content_type" - The content type of the stream, if available.*
>> >
>> > So, it seems that these may not be added by Tika, but by Solr. Do you
>> > know how to enable/disable this feature?
>> >
>> > Kind Regards,
>> > Furkan KAMACI
>> >
>> > [1] https://wiki.apache.org/solr/ExtractingRequestHandler
>> >
>> > On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <erickerickson@gmail.com>
>> > wrote:
>> >
>> >> about PatternCaptureGroupFilterFactory. This isn't going to help. The
>> >> data you see when you return stored data is _before_ any analysis so
>> >> the Pattern....Factory won't be applied. You could do this in a
>> >> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
>> >> the real app deal with it.
>> >>
>> >> I don't particularly know about the Tika settings, that's largely a
>> >> guess.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI <furkankamaci@gmail.com>
>> >> wrote:
>> >> > Hi Erick,
>> >> >
>> >> > 1) I am looking at the stored data via the Solr Admin UI. I send the
>> >> > query and check what is in the content field.
>> >> >
>> >> > 2) I can debug the Tika settings if you think that it is not the
>> >> > desired behaviour to have such metadata fields combined into the
>> >> > content field.
>> >> >
>> >> > *PS: *Is there any solution to get rid of it except for
>> >> > using PatternCaptureGroupFilterFactory?
>> >> >
>> >> > Kind Regards,
>> >> > Furkan KAMACI
>> >> >
>> >> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <erickerickson@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> 1> I'm assuming when you "see" this data you're looking at the
>> >> >> stored data, right? It's a verbatim copy of whatever you sent to
>> >> >> the field. I'm guessing it's a character-encoding mismatch between
>> >> >> the source and what you use to display.
>> >> >>
>> >> >> 2> How are you extracting this data? There are Tika options, I
>> >> >> think, that can/do mush fields together.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <furkankamaci@gmail.com>
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > I'm testing Solr 4.9.1 and have indexed documents with it. The
>> >> >> > content field in the schema has the text_general field type,
>> >> >> > which is unmodified from the original. I do not copy any fields
>> >> >> > to content. When I check the data, I see content values like:
>> >> >> >
>> >> >> >  " \n \nstream_source_info MARLON BRANDO.rtf   \nstream_content_type
>> >> >> > application/rtf   \nstream_size 13580   \nstream_name MARLON BRANDO.rtf
>> >> >> > \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf   \n \n
>> >> >> > \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named Desire\"
>> >> >> > directed by Elia Kazan \n"
>> >> >> >
>> >> >> > My questions:
>> >> >> >
>> >> >> > 1) Is it usual to have those newline characters?
>> >> >> > 2) Is it usual to have file metadata at the beginning of the
>> >> >> > content (i.e. stream source, stream_content_type), or is it
>> >> >> > related to the tool that I use to post data to Solr?
>> >> >> >
>> >> >> > Kind Regards,
>> >> >> > Furkan KAMACI
>> >> >>
>> >>
>>
>
>
