From: Tom Burton-West
To: dev@lucene.apache.org
Date: Thu, 7 Mar 2013 14:03:20 -0500
Subject: Re: Ability to specify 2 different query analyzers for same indexed field in Solr

Thanks Jan,

The blog post is very good; I didn't quite realize all those various pitfalls with synonyms.

I would still like the ability to specify two different query analysis chains with one index, rather than having to write a custom parser for each use case. For example, the Traditional/Simplified Chinese use case in my previous message could probably be solved with a custom query parser along the lines of the synonym solution in the blog post, but if there were a way to specify two different query analysis chains for the same indexed field, I would not have to write a custom query parser.

Tom



On Tue, Mar 5, 2013 at 5:39 PM, Jan Høydahl <jan.asf@cominvent.com> wrote:
Hi,

Please have a look at http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ and a working plugin for Solr that deboosts the expanded synonyms. The plugin code currently lacks the ability to configure different dictionaries for each field, but that could be added. Also see SOLR-4381 for eventual inclusion in Solr.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 5 March 2013 at 17:26, Tom Burton-West <tburtonw@umich.edu> wrote:
Thanks Erick,

Payloads might work, but I'm looking at a more general problem.

Here is another use case:

We have a mix of Traditional and Simplified Chinese documents indexed in the same OCR field.
When a user searches using Traditional Chinese, I would like to also search in Simplified Chinese, but rank the results matching Traditional Chinese higher. Similarly, if a user enters a query in Simplified Chinese, I want to also search in Traditional Chinese but rank matches of the Simplified Chinese query terms higher.

Since it is not always possible to determine whether a short query is in Simplified or Traditional Chinese, here is what I would like to do:

1) Convert the query to Traditional Chinese
2) Convert the query to Simplified Chinese
(One of these two steps would not be necessary if I could reliably determine the nature of the query.)

q1=QueryAsEntered^10 OR QueryTraditional^1 OR QuerySimplified^1

Again, this could be done with copy fields, but that would increase my index size too much. What I really want to be able to do is to query the same index (i.e. the documents as created) with the user's query processed/analyzed in 3 different ways.

I could do this myself in the app layer, but I would really like to be able to use Solr.
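
For comparison, the app-layer version could be sketched roughly as below. This is a minimal sketch under stated assumptions, not code from this thread: `to_traditional` and `to_simplified` are hypothetical placeholder converters (stubbed with a two-character sample map rather than a real conversion table), and `ocr` is an assumed field name.

```python
# Hypothetical sketch: build one boosted Solr query string from three
# variants of the user's input. The character map is a tiny illustrative
# sample, not a real Simplified/Traditional conversion table.
SIMPLIFIED_TO_TRADITIONAL = {"汉": "漢", "语": "語"}
TRADITIONAL_TO_SIMPLIFIED = {v: k for k, v in SIMPLIFIED_TO_TRADITIONAL.items()}


def to_traditional(text: str) -> str:
    """Placeholder converter: map known Simplified characters to Traditional."""
    return "".join(SIMPLIFIED_TO_TRADITIONAL.get(ch, ch) for ch in text)


def to_simplified(text: str) -> str:
    """Placeholder converter: map known Traditional characters to Simplified."""
    return "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)


def escape_phrase(text: str) -> str:
    """Escape backslashes and double quotes for use inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_boosted_query(user_query: str, field: str = "ocr") -> str:
    """Query as entered gets boost 10; each converted form gets boost 1."""
    variants = [
        (user_query, 10),
        (to_traditional(user_query), 1),
        (to_simplified(user_query), 1),
    ]
    clauses = []
    seen = set()
    for text, boost in variants:
        # A conversion that returns the text unchanged adds nothing, so skip it.
        if text in seen:
            continue
        seen.add(text)
        clauses.append(f'{field}:"{escape_phrase(text)}"^{boost}')
    return " OR ".join(clauses)
```

Skipping duplicate variants covers the case where the query is already entirely in one script, so there is no need to detect the script up front. All three variants still hit the same indexed field, so the index does not grow.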


Tom



On Mon, Mar 4, 2013 at 8:19 PM, Erick Erickson <erickerickson@gmail.com> wrote:
Tom:

I wonder if you could do something with payloads here. Index all terms with payloads of 10, but synonyms with 1?

Random thought off the top of my head.
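
To make the payload idea concrete, here is a toy model in plain Python. It is only an illustration under stated assumptions: it does not use Lucene's payload API, the synonym table is hypothetical, and the scoring (sum of the payloads of matching terms) is a simplification chosen for the example.

```python
# Toy model only: this is not Lucene's payload API.
from collections import defaultdict

SYNONYMS = {"car": ["automobile"]}  # hypothetical synonym table


def index_tokens(text):
    """Return (term, payload) pairs: 10 for original terms, 1 for injected synonyms."""
    pairs = []
    for term in text.lower().split():
        pairs.append((term, 10))
        for syn in SYNONYMS.get(term, []):
            pairs.append((syn, 1))
    return pairs


def build_index(docs):
    """Map each term to {doc_id: highest payload seen for that term in the doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, payload in index_tokens(text):
            index[term][doc_id] = max(payload, index[term].get(doc_id, 0))
    return index


def search(index, query):
    """Score each document by summing the payloads of the matching query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, payload in index.get(term, {}).items():
            scores[doc_id] += payload
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With the documents `red automobile` and `red car`, a search for `automobile` ranks the first document above the second, because the second matches only through the injected synonym.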

Erick


<fieldType name="plain">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="syn">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="plain" dest="syn"/>

On Mon, Mar 4, 2013 at 4:43 PM, Jack Krupansky <jack@basetechnology.com> wrote:
Please clarify, and try providing a couple more use cases. I mean, the case you provided suggests that the contents of the index will be different between the two fields, while you told us that you wanted to share the same indexed field. In other words, it sounds like you will have two copies of similar data anyway.

Maybe you simply want one copy of the stored value for the field and then have one or more copyfields that index the same source data differently, but don't re-store the copied source data.

-- Jack Krupansky

From: Tom Burton-West
Sent: Monday, March 04, 2013 3:57 PM
To: dev@lucene.apache.org
Subject: Ability to specify 2 different query analyzers for same indexed field in Solr

Hello,

We would like to be able to specify two different fields that both use the same indexed field but use different analyzers. An example use case for this might be doing query-time synonym expansion with the synonyms weighted lower than an exact match.

q=exact_field^10 OR synonyms^1

The normal way to do this in Solr, which is just to set up separate analyzer chains and use a copyfield, will not work for us because the field in question is huge. It is about 7 TB of OCR.

Is there a way to do this currently in Solr? If not,

1) should I open a JIRA issue?
2) can someone point me towards the part of the code I might need to modify?

Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
http://www.hathitrust.org/blogs/large-scale-search