lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Dilemma - Very Frequent Synonym updates for Huge Index
Date Thu, 01 Jul 2010 19:43:23 GMT
Hello Mr. Arslan,

In your previous email you said: <<Additionally, you need to use the raw or
field query parser, because the query text is split at white-space before it
reaches KeywordTokenizer>>

But on the analysis page I don't see any splitting on white-space; see my
result below. Did I understand you correctly, or am I barking up the wrong
tree?

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {luceneMatchVersion=LUCENE_24}
  term position     1
  term text         Barack Obama
  term type         word
  source start,end  0,12
  payload

org.apache.solr.analysis.TrimFilterFactory {luceneMatchVersion=LUCENE_24}
  term position     1
  term text         Barack Obama
  term type         word
  source start,end  0,12
  payload

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true, luceneMatchVersion=LUCENE_24}
  term position     1
  term text         Barack Obama
  term type         word
  source start,end  0,12
  payload

org.apache.solr.analysis.SynonymFilterFactory {tokenizerFactory=solr.KeywordTokenizerFactory, synonyms=person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt, expand=true, ignoreCase=true, luceneMatchVersion=LUCENE_24}
  term position     1
  term text         Barack Obama | Barak Obama | Barack H. Obama | Barack Hussein Obama | President Obama
  term type         word | word | word | word | word
  source start,end  0,12 | 0,12 | 0,12 | 0,12 | 0,12
  payload

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {luceneMatchVersion=LUCENE_24}
  term position     1
  term text         Barack Obama | Barak Obama | Barack H. Obama | Barack Hussein Obama | President Obama
  term type         word | word | word | word | word
  source start,end  0,12 | 0,12 | 0,12 | 0,12 | 0,12
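
One possible reading of the difference, sketched as a toy model in Python rather than actual Solr code: the analysis page appears to hand the whole input string to the analyzer chain, while a query parser that splits on white-space first would hand it one chunk at a time. The names `SYNONYMS`, `analyze`, `analysis_page` and `whitespace_splitting_parser` are hypothetical, and the synonym entry is the one from the original message quoted below.

```python
# Toy model only: none of this is Solr code, and all names are hypothetical.
# It mimics a keyword tokenizer + trim + query-time synonym expansion chain.

SYNONYMS = {
    # Key is lower-cased to imitate ignoreCase="true".
    "barack obama": [
        "Barack Obama",
        "Barack H. Obama",
        "Barack Hussein Obama",
        "President Obama",
    ],
}


def analyze(text):
    """Treat the whole input as one token, trim it, then expand synonyms."""
    token = text.strip()
    return SYNONYMS.get(token.lower(), [token])


def analysis_page(text):
    """The whole string reaches the analyzer in a single call."""
    return analyze(text)


def whitespace_splitting_parser(text):
    """Each white-space separated chunk reaches the analyzer separately."""
    terms = []
    for chunk in text.split():
        terms.extend(analyze(chunk))
    return terms


print(analysis_page("Barack Obama"))
print(whitespace_splitting_parser("Barack Obama"))
```

In this model the first call expands to all four names, while the second yields only `Barack` and `Obama`, because neither chunk on its own matches a synonym entry.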


On Thu, Jul 1, 2010 at 7:04 AM, Ahmet Arslan <iorixxx@yahoo.com> wrote:

>
>
> --- On Thu, 7/1/10, Ravi Kiran <ravi.bhaskar@gmail.com> wrote:
>
> > From: Ravi Kiran <ravi.bhaskar@gmail.com>
> > Subject: Dilemma - Very Frequent Synonym updates for Huge Index
> > To: solr-user@lucene.apache.org
> > Date: Thursday, July 1, 2010, 7:57 AM
> > Hello,
> > Hoping some Solr guru can help me out here. We are a news organization
> > trying to migrate 10 million documents from FAST to Solr. The plan is to
> > have our editorial team add/modify synonyms multiple times a day as they
> > deem appropriate. Hence we plan on using query-time synonyms, as we cannot
> > reindex every time they modify the synonyms file (for the entities
> > extracted by OpenNLP, like locations/organizations/person names, from the
> > article body). Since the synonyms are for names, I am concerned that the
> > multi-word synonym issue will crop up with query-time synonyms. For
> > example, the synonyms could be as follows:
> >
> > The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
> > DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> > USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> >
> > Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> > Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
> > William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
> >
> > Virginia, Va., VA
> > D.C,Washington D.C, District of Columbia
> >
> > I have the following fieldType in schema.xml for the keywords/entities.
> > What issues should I be aware of? And is there a better way to achieve
> > this without having to reindex a million docs on each synonym change? NOTE
> > that I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> > SynonymFilterFactory to keep the words intact without splitting.
> >
> >     <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
> >     <fieldType name="keywordText" class="solr.TextField"
> >                sortMissingLast="true" omitNorms="true"
> >                positionIncrementGap="100">
> >       <analyzer type="index">
> >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >         <filter class="solr.TrimFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords.txt,entity-stopwords.txt"
> >                 enablePositionIncrements="true"/>
> >         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >         <filter class="solr.TrimFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords.txt,entity-stopwords.txt"
> >                 enablePositionIncrements="true"/>
> >         <filter class="solr.SynonymFilterFactory"
> >                 tokenizerFactory="solr.KeywordTokenizerFactory"
> >                 synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> >                 ignoreCase="true" expand="true"/>
> >         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >       </analyzer>
> >     </fieldType>
> >
>
> Have you ever used this fieldType? Searching on this field will be
> troublesome. You need to search for exactly the same entries as in your
> synonym.txt. Additionally, you need to use the raw or field query parser,
> because the query text is split at white-space before it reaches
> KeywordTokenizer.
>
> For example:  q=keywordText:(Washington Post Bill Clinton)&debugQuery=on
>
>
>
>
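
A minimal sketch of how a request using the field query parser mentioned above might be built, written in Python. It assumes Solr's local-params syntax `{!field f=...}` for selecting that parser; the helper name `field_parser_query` is hypothetical, and the host and core path a real request would need are deliberately left out. As far as I know the raw parser skips analysis altogether, so query-time synonyms would probably not be applied with it, which is why the sketch uses the field parser.

```python
from urllib.parse import urlencode


def field_parser_query(field, text, debug=True):
    """Build a URL-encoded query string that passes `text` unsplit.

    Hypothetical helper: it only assembles parameters and sends nothing.
    """
    # {!field f=<name>} asks Solr to run the field's analyzer on the whole
    # value instead of letting the default parser split it on white-space.
    params = {"q": "{!field f=%s}%s" % (field, text)}
    if debug:
        # debugQuery=on returns the parsed query, as in the example above.
        params["debugQuery"] = "on"
    return urlencode(params)


print(field_parser_query("keywordText", "Barack Obama"))
```

Appending the resulting string to the select handler URL of your own Solr instance and reading the parsed query in the debug output should show whether the whole name reached the KeywordTokenizer as a single token.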
