lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Dilemma - Very Frequent Synonym updates for Huge Index
Date Thu, 01 Jul 2010 11:04:48 GMT


--- On Thu, 7/1/10, Ravi Kiran <ravi.bhaskar@gmail.com> wrote:

> From: Ravi Kiran <ravi.bhaskar@gmail.com>
> Subject: Dilemma - Very Frequent Synonym updates for Huge Index
> To: solr-user@lucene.apache.org
> Date: Thursday, July 1, 2010, 7:57 AM
> Hello,
>         Hoping some Solr guru can help me out here. We are a news
> organization trying to migrate 10 million documents from FAST to Solr. The
> plan is to have our editorial team add/modify synonyms multiple times during
> a day as they deem appropriate. Hence we plan on using query-time synonyms,
> as we cannot reindex every time they modify the synonyms file (for the
> entities extracted by OpenNLP, like locations/organizations/person names from
> the article body). Since the synonyms are for names, I am concerned that the
> multi-phrase issue crops up with the query-time synonyms. For example,
> the synonyms could be as follows:
> 
> The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
> DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
>
> Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
> William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
>
> Virginia, Va., VA
> D.C,Washington D.C, District of Columbia
> 
> I have the following fieldType in schema.xml for the keywords/entities. What
> issues should I be aware of? And is there a better way to achieve it
> without having to reindex a million docs on each synonym change? NOTE that I
> use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> SynonymFilterFactory to keep the words intact without splitting.
>
>     <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
>     <fieldType name="keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.TrimFilterFactory" />
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.TrimFilterFactory" />
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true" />
>         <filter class="solr.SynonymFilterFactory" tokenizerFactory="solr.KeywordTokenizerFactory" synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt" ignoreCase="true" expand="true" />
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
> 

Have you ever used this fieldType? Searching on this field will be troublesome.
You need to search for exactly the same entries as in your synonyms files. Additionally, you
need to use the raw or field query parser, because the query text is split at whitespace
before it reaches KeywordTokenizer.

For example:  q=keywordText:(Washington Post Bill Clinton)&debugQuery=on
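As a rough sketch of the field query parser suggestion, the request parameters could be built like this. The helper name field_query_params is hypothetical, and the phrase "Washington Post" is just one entry taken from the synonym list quoted above. The {!field f=...} local-params prefix asks Solr to pass the whole value to the field's query analyzer as one string instead of splitting it at whitespace first. The raw parser ({!raw f=...}) skips analysis altogether, so it would not apply the query-time synonym filter.

```python
from urllib.parse import urlencode


def field_query_params(field, phrase, debug=True):
    """Build Solr request parameters that hand `phrase` to the field
    query parser as one unsplit value (hypothetical helper)."""
    # {!field f=<name>} selects the field query parser for this query.
    params = {"q": "{!field f=%s}%s" % (field, phrase)}
    if debug:
        # debugQuery=on makes Solr report how the query was parsed.
        params["debugQuery"] = "on"
    return params


# Assumed example values: the field name comes from the schema quoted
# above, the phrase from the quoted synonym list.
params = field_query_params("keywordText", "Washington Post")

# URL-encode the parameters so they can be appended to a /select URL.
query_string = urlencode(params)
print(query_string)
```

Comparing the debugQuery output of this request with that of the example above should show whether the phrase reached the analyzer as a single token.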


      
