lucene-solr-user mailing list archives

From Jorge Luis Betancourt González <jlbetanco...@uci.cu>
Subject Re: Design optimal Solr Schema
Date Thu, 30 Oct 2014 15:15:44 GMT
Are you going to use the values stored in Solr to display the data in HTML? For searching purposes,
I suggest deleting all the HTML tags and storing only the plain text. For this you could use the
HTMLStripCharFilterFactory char filter, which will "clean" your content and pass on only the actual
text, which in the end is what you're going to use.
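
To illustrate what "cleaning" means here, this is a minimal client-side sketch in Python that drops the tags before the text is sent to Solr. It is not how HTMLStripCharFilterFactory is implemented, only a stand-in using the standard library; the names TextOnly and strip_html are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text that is not part of a tag.
        self.parts.append(data)


def strip_html(raw):
    """Return raw with all tags removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(strip_html("HELLO<br /> HALLO <br /> NO"))
```

Note that in this sketch anything that looks like a tag is dropped, so a transcript token such as <s> would disappear together with the <br /> elements. If such tokens matter for searching, they would need to be escaped or renamed before stripping.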

If you are going to use the Solr result to display the content in an HTML page, then I would
suggest keeping your index clean and indexing only the actual searchable text, no HTML. I actually
use the recommended filter to strip HTML out of crawled HTML pages. Also, what does a Solr document
mean to you? Is an entire conversation modeled as one Solr document? Have you considered putting
each conversation interaction in its own document?
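
As a rough sketch of that last idea, the following Python turns each line of the original transcript format (quoted below) into its own document, so that a word and its times live in the same document. The id scheme ("01.cn#2"), the extra "file" field, and the function name transcript_to_docs are assumptions for illustration, not something from the original schema.

```python
import re

# One transcript line, e.g. "T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0".
LINE = re.compile(
    r"T=(?P<t>\d+)\s+ST=(?P<st>\S+)\s+ET=(?P<et>\S+)\s+"
    r"W=(?P<w>\S+)\s+P=(?P<p>\S+)\s+C=(?P<c>\d+)"
)


def transcript_to_docs(file_id, text):
    """Build one dict per transcript line; lines that do not match are skipped."""
    docs = []
    for n, line in enumerate(text.splitlines()):
        m = LINE.match(line.strip())
        if m is None:
            continue
        docs.append({
            "id": f"{file_id}#{n}",  # hypothetical unique key per line
            "file": file_id,         # hypothetical field pointing back to the file
            "t": int(m["t"]),
            "st": float(m["st"]),
            "et": float(m["et"]),
            "w": m["w"],
            "p": float(m["p"]),
            "c": int(m["c"]),
        })
    return docs


SAMPLE = """\
T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.000000 C=0
T=1 ST=1.54 ET=1.54 W=<s> P=1 C=0
T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0
T=2 ST=1.54 ET=1.57 W=HALLO P=2.06115e-009 C=0
T=3 ST=1.57 ET=1.70 W=_DELETE_ P=1 C=0
"""

docs = transcript_to_docs("01.cn", SAMPLE)
# Same idea as a word-plus-time-range query, evaluated locally:
hits = [d for d in docs if d["w"].lower() == "hello" and d["et"] <= 1.57]
print(hits)
```

With one document per line, a query that combines a word with a range on a time field can only match when both conditions hold for the same word, and the "file" field tells you which conversation it came from. The resulting dicts would still need to be sent to Solr in whatever update format you already use.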


----- Original Message -----
From: "tomas.kalas" <kalanek@email.cz>
To: solr-user@lucene.apache.org
Sent: Thursday, October 30, 2014 10:27:50 AM
Subject: Design optimal Solr Schema

Hello, I have a problem with the design of a schema in Solr. I have a transcript of a
telephone conversation in the format below, which I parse into individual fields. I
have this schema:

<?xml version="1.0"?>
<add>
<doc>
<field name="id">01.cn</field>
<field name="t">0<br /> 1<br /> 2<br /> 2 <br /> 3 <br />
....</field>
<field name="st">0.00<br /> 1.54<br /> 1.54<br /> 1.54 <br />
1.57 <br />
....</field>
<field name="et">1.54<br /> 1.54<br /> 1.57<br /> 1.57 <br />
1.7 <br />
....</field>
<field name="w">_SILENCE_<br /> <s><br /> HELLO<br /> HALLO
<br /> _DELETE_
<br /> ....</field>
<field name="p">0.000000<br /> 1<br /> 1<br /> 2.06115e-009 <br
/> 1 <br />
....</field>
<field name="c">0<br /> 0<br /> 0<br /> 0 <br /> 0 <br />
....</field>
</doc>
</add>

I display it in an HTML document, which is why I used the <br />.

This is the original document:

T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.000000 C=0
T=1 ST=1.54 ET=1.54 W=<s> P=1 C=0
T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0
T=2 ST=1.54 ET=1.57 W=HALLO P=2.06115e-009 C=0
T=3 ST=1.57 ET=1.70 W=_DELETE_ P=1 C=0
T=3 ST=1.57 ET=1.70 W=NO P=2.06115e-009 C=0
T=4 ST=1.70 ET=2.12 W=HOW P=1 C=0
T=5 ST=2.12 ET=2.18 W=ARE_ P=0.25 C=0
T=5 ST=2.12 ET=2.18 W=_DELETE_ P=0.25 C=0
..........................................
..........................................

Id - filename
T = Segment
ST = Start time
ET = End time
W = Word
P = Probability
C = Channel

I want to search, for example, for a word that occurs up to time 1.57: (w:HeLLO) AND (t:[0
TO 1.57]). But since I have all the data for each field in one value (t, st, et ...), it
doesn't work. It also finds all files where hello occurs at a later time than 1.57.

Do you have any ideas how to do this? Thanks a lot for your help.



--
View this message in context: http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632.html
Sent from the Solr - User mailing list archive at Nabble.com.