lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: How to index X™ as &#8482; (HTML decimal entity)
Date Thu, 21 Nov 2013 17:04:47 GMT
"there is not really anything special about "special" characters"

Well, the distinction was about "named entities", which are indeed special.

Besides, in general, for more sophisticated text processing, character 
"types" are a valid distinction.

But all of this sidesteps the original question: "I need to store 
the HTML Entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather 
than storing the original value."

Maybe the original poster could clarify the nature of their need.

-- Jack Krupansky

-----Original Message----- 
From: Michael Sokolov
Sent: Thursday, November 21, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

OK - probably I should have said "A", or "&#97;" :)  My point was just
that there is not really anything special about "special" characters.

On 11/21/2013 10:50 AM, Jack Krupansky wrote:
> "Would you store "a" as "&#65;" ?"
>
> No, not in any case.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Michael Sokolov
> Sent: Thursday, November 21, 2013 8:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>
> I have to agree w/Walter.  Use unicode as a storage format.  The entity
> encodings are for transfer/interchange.  Encode/decode on the way in and
> out if you have to.  Would you store "a" as "&#65;" ?  It makes it
> impossible to search for, for one thing.  What if someone wants to
> search for the TM character?
>
> -Mike
>
> On 11/20/13 12:07 PM, Jack Krupansky wrote:
>> AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format 
>> for storing text to be rendered. If you disagree - try explaining 
>> yourself.
>>
>> But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
>> entities.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Walter Underwood
>> Sent: Wednesday, November 20, 2013 11:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>>
>> Again, I'd like to know why this is wanted. It sounds like an X-Y 
>> problem. Storing Unicode characters as XML/HTML encoded character 
>> references is an extremely bad idea.
>>
>> wunder
>>
>> On Nov 20, 2013, at 5:01 AM, "Jack Krupansky" <jack@basetechnology.com> 
>> wrote:
>>
>>> Any analysis filtering affects the indexed value only, but the stored 
>>> value would be unchanged from the original input value. An update 
>>> processor lets you modify the original input value that will be stored.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Uwe Reh
>>> Sent: Wednesday, November 20, 2013 5:43 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>>>
>>> What about having a simple charFilter in the analyzer chain for
>>> indexing *and* searching, e.g.
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™"
>>> replacement="&amp;#8482;" />
>>> or
>>> <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-specials.txt" />
>>>
>>> Uwe
>>>
>>> On 19.11.2013 23:46, Developer wrote:
>>>> I have data coming in to SOLR as below.
>>>>
>>>> <field name="displayName">X™ - Black</field>
>>>>
>>>> I need to store the HTML Entity (decimal) equivalent value (i.e. 
>>>> &#8482;)
>>>> in SOLR rather than storing the original value.
>>>>
>>>> Is there a way to do this?
>>>
>>
>> -- 
>> Walter Underwood
>> wunder@wunderwood.org
>>
>>
>>
> 
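
A minimal sketch of the update processor approach described above, assuming the stock solr.RegexReplaceProcessorFactory and the displayName field from the original post. The chain name "encode-trademark" is hypothetical, chosen only for illustration. Because solrconfig.xml is itself XML, the ampersand in the replacement is written as &amp;amp;. A bare &amp;#8482; would be decoded by the XML parser back into ™ before Solr ever sees it.

```xml
<!-- Hypothetical chain name; goes in solrconfig.xml -->
<updateRequestProcessorChain name="encode-trademark">
  <!-- Rewrites the incoming field value before it is indexed and stored -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">displayName</str>
    <str name="pattern">™</str>
    <!-- &amp; keeps the literal text &#8482; instead of the ™ character -->
    <str name="replacement">&amp;#8482;</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain would still need to be selected for update requests, for example through the update.chain request parameter. Since the value is rewritten before indexing as well as storing, a query for the literal ™ character may no longer match this field, which is the concern raised earlier in the thread.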

