lucene-solr-user mailing list archives

From yc...@club-internet.fr
Subject Re: Re: Problem with html code inside xml
Date Tue, 02 Oct 2007 23:01:26 GMT
Hi!

I'm facing a similar problem. Some HTML docs are indexed correctly and others are simply rejected,
even though I encoded all the problematic HTML tags as Thorsten suggested.

In the following example, "my_doc.xml" is a valid XML file that complies with the fields of my
Solr schema:

$ java -jar post.jar ./my_doc.xml 

SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update

Is there any way to make Solr more verbose than that?
Do I need to go into the Java code to understand what is happening?
I'm looking for a simple solution.
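
One low-effort way to get more detail, sketched here in Python rather than with post.jar: first check locally that the file is well-formed XML (the parser reports the line and column of the first problem), then POST it and read the body of the HTTP 500 response instead of discarding it. The function names check_well_formed and post_update are made up for this sketch, and it assumes the same update URL as above and that the server puts a useful message in the error page. The full stack trace usually also shows up in the console or log of the servlet container running Solr.

```python
import sys
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return None if the bytes parse as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        # The message names the problem and its line and column.
        return str(err)
    return None


def post_update(url, xml_bytes):
    """POST the document and return (status, body), keeping the body on errors."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # HTTPError is itself a readable response, so the 500 page is available.
        return err.code, err.read().decode("utf-8", "replace")


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as handle:
        payload = handle.read()
    problem = check_well_formed(payload)
    if problem:
        sys.exit(sys.argv[1] + ": " + problem)
    # Assumed URL: the one post.jar reports in the output above.
    status, body = post_update("http://localhost:8983/solr/update", payload)
    print(status)
    print(body)
```

Saved under any file name and run with ./my_doc.xml as its only argument, this either stops at the first well-formedness error (an unclosed tag or an HTML-only entity such as &nbsp; are typical in embedded HTML) or prints whatever the server returned.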

Thanks in advance

cheers
Y.

----Original message----
>From: "steve.christin@gmail.com" 
>Subject: Re: Problem with html code inside xml
>Date: Tue, 2 Oct 2007 16:15:26 +0200
>To: solr-user@lucene.apache.org
>
>Thanks
>
>I use this solution:
>
>put  <![CDATA[  Here my html code   ]]> in the XML to be indexed, and
>it works; there is nothing to change in the XSL.
>
>In the schema I use this fieldType:
>
><fieldType name="html" class="solr.TextField"  
>positionIncrementGap="100">
>     	<analyzer>
>         	<tokenizer class="solr.WhitespaceTokenizerFactory"/>
>          	<filter class="solr.WordDelimiterFilterFactory"  
>generateWordParts="1" generateNumberParts="1" catenateWords="1"  
>catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>          	<filter class="solr.LowerCaseFilterFactory"/>
>          	<filter class="solr.StopFilterFactory" ignoreCase="true"  
>words="stopwords.txt"/>
>          	<filter class="solr.ISOLatin1AccentFilterFactory"/>
>          	<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      	</analyzer>
>      </fieldType>
>
>----------
>Now a question:
>I created a field to index only the text of this HTML code.
>
>I created this field type for it:
>
><fieldType name="htmlTxt" class="solr.TextField"  
>positionIncrementGap="100">
>     	<analyzer>
>         	<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>          	<filter class="solr.WordDelimiterFilterFactory"  
>generateWordParts="1" generateNumberParts="1" catenateWords="1"  
>catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>          	<filter class="solr.LowerCaseFilterFactory"/>
>          	<filter class="solr.StopFilterFactory" ignoreCase="true"  
>words="stopwords.txt"/>
>          	<filter class="solr.ISOLatin1AccentFilterFactory"/>
>          	<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      	</analyzer>
>      </fieldType>
>
>Everything works (the div and p tags are removed), but some
><strong>nnn</strong> or <br/> tags are still in the text after
>indexing.
>
>If you have any idea how to solve this problem, that would be great.
>
>Thanks
>
>S. Christin
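
A possible workaround for the leftover <strong> and <br/> tags described above, if the analyzer cannot be made to drop them: strip the markup on the client before posting and send only the plain text to the htmlTxt field. The sketch below uses Python's standard html.parser module; the names TextExtractor and html_to_text are made up for illustration. It treats every tag as a word break, which is an assumption and will split a word that is styled in the middle.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Assumption: any tag separates words, so "a<br/>b" becomes "a b".
        self.chunks.append(" ")

    def handle_endtag(self, tag):
        self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


print(html_to_text('<div class="paragraphText">my para<br/><br/>Vous <strong>nnn</strong> ici</div>'))
```

The result can then be posted as an ordinary text field, so the field type no longer has to deal with markup at all.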
>
>
>
>-------------
>
>
>On 25 Sept. 07 at 13:14, Thorsten Scherler wrote:
>
>> On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
>>> If I understand, you want to keep the raw html code in solr like that
>>> (in your posting xml file):
>>>
>>> <field name="storyFullText">
>>>   <html></html>
>>> </field>
>>>
>>> I think you should encode your content to protect these xml entities:
>>> <  ->  &lt;
>>> >  ->  &gt;
>>> "  ->  &quot;
>>> &  ->  &amp;
>>>
>>> If you use perl, have a look at HTML::Entities.
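
For readers not using Perl, the same escaping can be sketched in Python, next to the CDATA alternative mentioned earlier in this thread. The helper names field_escaped and field_cdata are made up for illustration, and the field name is assumed to be a plain identifier that needs no escaping itself. Both forms should reach the XML parser as the same string; how the tags come back in the response is a separate matter.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def field_escaped(name, html):
    """Build a <field> element with the HTML escaped as character data."""
    # escape() handles &, < and >; the extra mapping adds the double quote.
    return '<field name="%s">%s</field>' % (name, escape(html, {'"': "&quot;"}))


def field_cdata(name, html):
    """Build a <field> element with the HTML wrapped in a CDATA section."""
    # A CDATA section ends at the first "]]>", so split any occurrence in two.
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return '<field name="%s"><![CDATA[%s]]></field>' % (name, safe)


html = '<div class="paragraph">a &amp; b<br/></div>'
for build in (field_escaped, field_cdata):
    doc = "<add><doc>%s</doc></add>" % build("storyFullText", html)
    # Parsing the document back gives the original HTML string in both cases.
    print(ET.fromstring(doc).find("doc/field").text == html)
```
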
>>
>> AFAIR you cannot use tags; they are always transformed to
>> entities. The solution is to run an XSL transformation on the
>> response that transforms the entities back to tags.
>>
>> Have a look at the thread
>> http://marc.info/?t=116775837900001&r=1&w=2
>> and especially at
>> http://marc.info/?l=solr-user&m=116782664828926&w=2
>>
>> HTH
>>
>> salu2
>>
>>>
>>>
>>> On 9/25/07, steve.christin@gmail.com <steve.christin@gmail.com>  
>>> wrote:
>>>> Hello,
>>>>
>>>> I've got a problem with HTML code that is embedded in an XML file:
>>>>
>>>> Sample source:
>>>>
>>>> <content>
>>>>         <stories>
>>>>                 <div class="storyTitle">
>>>>                          Les débats
>>>>                 </div>
>>>>                 <div class="storyIntroductionText">
>>>>                         Le premier tour des élections fédérales  
>>>> se déroulera le 21
>>>> octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
>>>> vous, dont plusieurs grands débats à l'enseigne de Forums.
>>>>                 </div>
>>>>                 <div class="paragraph">
>>>>                         <div class="paragraphTitle"/>
>>>>                         <div class="paragraphText">
>>>>                                 my para textehere
>>>>                                 <br/>
>>>>                                 <br/>
>>>>                                 Vous trouverez sur cette page  
>>>> toutes les dates et les heures de
>>>> ces différents rendez-vous ainsi que le nom et les partis des
>>>> débatteurs. De plus, vous pourrez également écouter ou  
>>>> réécouter
>>>> l'ensemble de ces émissions.
>>>>                         </div>
>>>>                 </div>
>>>> ....
>>>> ---------
>>>> When I make a query on Solr, I get something like this in the
>>>> source code of the XML result:
>>>>
>>>> <td xmlns="http://www.w3.org/1999/xhtml">
>>>> <span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraph"</span>
>>>> <span class="markup">&gt;</span><div class="expander-content">
>>>> <div class="indent"><span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraphTitle"</span>
>>>> <span class="markup">/&gt;</span></div><table><tr>
>>>> <td class="expander">−<div class="spacer"/>
>>>> </td><td><span class="markup">&lt;</span>
>>>> ...
>>>>
>>>> This is not exactly what I want. I want to keep the HTML tags, that is
>>>> all, without any formatting.
>>>>
>>>> The br tags and a tags are well formed in the XML and JSON results, but
>>>> the div tags are not kept.
>>>> ---------
>>>> In schema.xml I've got this for the HTML content:
>>>>
>>>> <fieldType name="html" class="solr.TextField" />
>>>>
>>>>   <field name="storyFullText" type="html" indexed="true"
>>>> stored="true" multiValued="true"/>
>>>>
>>>> ---------
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks in advance.
>>>>
>>>> S. Christin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> -- 
>> Thorsten Scherler                     thorsten.at.apache.org
>> Open Source Java                      consulting, training and solutions
>>
>

