Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <9AAE3903-2B93-46B6-B71C-F4CDC96B672D@cominvent.com>
References: 
 <CAMySt+FkpbAXEHBnpmuB3bejbDEQd6izHciOAaXrQQv2vgq3Zg@mail.gmail.com>
	<8F5AEE81635C4D8A9B6894BA6C5E5E3A@JackKrupansky>
	<CAMySt+GTxiuKqVTHwM0eQPNSAoJR8W3DC7yGeZ7nCVjCbvjChQ@mail.gmail.com>
	<CAN4YXvfLBbsTAoR_itPnVRC3UThRDU=F9qVMXSmy3WmfOy8U4w@mail.gmail.com>
	<CAMySt+FMJ9wybEFSHEOKb7kEW8RJL-T6YfMW+Gno+q00h-nN3w@mail.gmail.com>
	<9AAE3903-2B93-46B6-B71C-F4CDC96B672D@cominvent.com>
Date: Thu, 7 Mar 2013 14:03:20 -0500
Message-ID: 
 <CAMySt+GvEGMRiHU+trHcxU1PZ7wdrNk9u9xJS=O7yCQJANMRng@mail.gmail.com>
Subject: Re: Ability to specify 2 different query analyzers for same indexed
 field in Solr
From: Tom Burton-West <tburtonw@umich.edu>
To: dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=f46d0442824833379104d75a5cfe

--f46d0442824833379104d75a5cfe
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Thanks Jan,

The blog post is very good; I hadn't realized how many pitfalls there are
with synonyms.

I would still like the ability to specify two different query analysis
chains for one index, rather than having to write a custom parser for each
use case. For example, the Traditional/Simplified Chinese use case in my
previous message could probably be solved with a custom query parser along
the lines of the synonym solution in the blog post. But if there were a way
to specify two different query analysis chains for the same indexed field,
I would not have to write a custom query parser.
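
For comparison, here is a rough sketch of the app-layer workaround for the
Chinese use case, i.e. building the boosted query outside Solr. It is only an
illustration under stated assumptions: the field name "ocr", the tiny T2S/S2T
character tables, and the helper names are all hypothetical, and a real
converter would need a full Traditional/Simplified dictionary. Each variant is
quoted as a phrase purely for simplicity.

```python
# Hypothetical three-character mapping table; a real converter would use a
# complete Traditional/Simplified dictionary.
T2S = {"圖": "图", "書": "书", "館": "馆"}
S2T = {simp: trad for trad, simp in T2S.items()}


def convert(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def escape_phrase(text):
    """Escape backslashes and double quotes so the text can sit inside a phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def boosted_query(field, user_query, entered_boost=10, variant_boost=1):
    """Build: field:"as entered"^10 OR field:"converted variant"^1 ..."""
    variants = []
    for candidate in (convert(user_query, S2T), convert(user_query, T2S)):
        # Skip a conversion that leaves the query unchanged or repeats a variant.
        if candidate != user_query and candidate not in variants:
            variants.append(candidate)
    clauses = [f'{field}:"{escape_phrase(user_query)}"^{entered_boost}']
    clauses += [f'{field}:"{escape_phrase(v)}"^{variant_boost}' for v in variants]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(boosted_query("ocr", "圖書館"))
```

A conversion that leaves the query unchanged is dropped, so a query that is
already entirely Traditional or entirely Simplified yields two clauses rather
than three. This works, but it is exactly the per-use-case code I would like
to avoid writing.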

Tom


On Tue, Mar 5, 2013 at 5:39 PM, Jan Høydahl <jan.asf@cominvent.com> wrote:

> Hi,
>
> Please have a look at
> http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ and a
> working plugin to Solr to deboost the expanded synonyms. The plugin code
> currently lacks ability to configure different dictionaries for each field,
> but that could be added. Also see SOLR-4381 for eventual inclusion in Solr.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 5 March 2013 at 17:26, Tom Burton-West <tburtonw@umich.edu> wrote:
>
> Thanks Erick,
>
> Payloads might work, but I'm looking at a more general problem.
>
> Here is another use case:
>
> We have a mix of Traditional and Simplified Chinese documents indexed in
> the same OCR field.
>  When a user searches using Traditional Chinese, I would like to also
> search in Simplified Chinese, but rank the results matching Traditional
> Chinese higher.   Similarly, if a user enters a query in Simplified
> Chinese, I want to also search in Traditional Chinese but rank matches of
> the Simplified Chinese query terms higher.
>
> Since it is not always possible to determine whether a short query is in
> Simplified or Traditional Chinese, here is what I would like to do:
>
> 1) Convert the query to Traditional Chinese
> 2) Convert the query to Simplified Chinese
> (One of these two steps would not be necessary if I could reliably
> determine the nature of the query)
>
> q1=QueryAsEntered^10 OR QueryTraditional^1 OR QuerySimplified^1
>
> Again, this could be done with copy fields, but that would increase my
> index size too much.  What I really want to be able to do is to query the
> same index (i.e. the documents as originally indexed) with the user's query
> processed/analyzed in 3 different ways.
>
> I could do this myself in the app layer, but I would really like to be
> able to use Solr.
>
>
> Tom
>
>
>
> On Mon, Mar 4, 2013 at 8:19 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
>> Tom:
>>
>> I wonder if you could do something with payloads here. Index all terms
>> with payloads of 10, but synonyms with 1?
>>
>> Random thought off the top of my head.
>>
>> Erick
>>
>>
>>> <fieldType name="plain" class="solr.TextField">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> <fieldType name="syn" class="solr.TextField">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>> <copyField source="plain" dest="syn"/>
>>>
>>> On Mon, Mar 4, 2013 at 4:43 PM, Jack Krupansky <jack@basetechnology.com> wrote:
>>>
>>>>   Please clarify, and try providing a couple more use cases. I mean,
>>>> the case you provided suggests that the contents of the index will be
>>>> different between the two fields, while you told us that you wanted to
>>>> share the same indexed field. In other words, it sounds like you will have
>>>> two copies of similar data anyway.
>>>>
>>>> Maybe you simply want one copy of the stored value for the field and
>>>> then have one or more copyfields that index the same source data
>>>> differently, but don’t re-store the copied source data.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>>  *From:* Tom Burton-West <tburtonw@umich.edu>
>>>> *Sent:* Monday, March 04, 2013 3:57 PM
>>>> *To:* dev@lucene.apache.org
>>>> *Subject:* Ability to specify 2 different query analyzers for same
>>>> indexed field in Solr
>>>>
>>>> Hello,
>>>>
>>>> We would like to be able to specify two different fields that both use
>>>> the same indexed field but use different analyzers.   An example use-case
>>>> for this might be doing query-time synonym expansion with the synonyms
>>>> weighted lower than an exact match.
>>>>
>>>> q=exact_field^10 OR synonyms^1
>>>>
>>>> The normal way to do this in Solr, which is just to set up separate
>>>> analyzer chains and use a copyfield, will not work for us because the field
>>>> in question is huge.  It is about 7 TB of OCR.
>>>>
>>>> Is there a way to do this currently in Solr?   If not,
>>>>
>>>> 1) should I open a JIRA issue?
>>>> 2) can someone point me towards the part of the code I might need to
>>>> modify?
>>>>
>>>> Tom
>>>>
>>>>  Tom Burton-West
>>>> Information Retrieval Programmer
>>>> Digital Library Production Service
>>>> University of Michigan Library
>>>> http://www.hathitrust.org/blogs/large-scale-search
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

--f46d0442824833379104d75a5cfe--