lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Dilemma - Very Frequent Synonym updates for Huge Index
Date Sun, 04 Jul 2010 13:33:10 GMT
About reindexing and performance: this is not really a problem, since
you can reindex on a completely different machine, then move the
completed index to your production machines and reopen it there.
Solr has this capability out of the box. Here's a link to get you
started:
http://wiki.apache.org/solr/SolrCollectionDistributionScripts
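
As a rough sketch of how the indexing machine might hook into those
scripts: a listener along these lines in solrconfig.xml can run
snapshooter after each commit, so that the search machines have a
snapshot to pull. The exe and dir values are assumptions about where
the scripts are installed; adjust them to your own layout.

```xml
<!-- solrconfig.xml on the indexing machine, inside <updateHandler>.
     Sketch only: "solr/bin/snapshooter" and "." are assumed paths. -->
<listener event="postCommit" class="solr.RunExecutableListener">
  <!-- script to run after every commit -->
  <str name="exe">solr/bin/snapshooter</str>
  <!-- working directory for the script -->
  <str name="dir">.</str>
  <!-- block until the snapshot has been taken -->
  <bool name="wait">true</bool>
</listener>
```

The search machines would then run snappuller and snapinstaller (from
cron, for example) to fetch and install the new snapshot.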

Your first few queries on a newly-opened index will be a bit slower
unless you do pre-warming. But the reindexing process can be
done without affecting the current searcher in any way. Of course
you'll need the disk space available, but disks are cheap <G>...
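
For the pre-warming, a minimal sketch of what this might look like in
solrconfig.xml: a QuerySenderListener fires a few queries at each new
searcher before it starts serving traffic. The query strings below are
placeholders, not taken from your setup; you would use queries typical
of your real traffic.

```xml
<!-- solrconfig.xml, inside <query>.
     Sketch only: the queries are hypothetical placeholders. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">Washington Post</str><str name="rows">10</str></lst>
    <lst><str name="q">Department of Homeland Security</str><str name="rows">10</str></lst>
  </arr>
</listener>
```

A listener with event="firstSearcher" can be configured the same way
to cover the very first searcher opened after startup.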

HTH
Erick

On Thu, Jul 1, 2010 at 2:06 PM, Ravi Kiran <ravi.bhaskar@gmail.com> wrote:

> Hello Mr. Høydahl,
> I thought of doing it exactly as you have said, and shall try it out and
> see where I land. However, I am still skeptical about that approach from
> the performance point of view: we are an around-the-clock news
> organization, and a huge reindexing might affect the speed of searches.
> Moreover, in the news business "being first" is more important, hence we
> need those synonyms to take effect right away, and that's where we are
> in a quandary.
>
> With regards to the OpenNLP implementation, our design is plain vanilla
> outside of Solr. We generate the XML on the fly with extracted entities
> from OpenNLP and then index it straight into Solr. However, we do some
> sanity checks for locations prior to indexing using WordNet, so that
> false positives are avoided in location names.
>
> Thanks,
>
> Ravi Kiran Bhaskar
>
> On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
> jan.asf@cominvent.com> wrote:
>
> > Hi,
> >
> > I think I would look at a hybrid approach, where you keep adding new
> > synonyms to a query-side synonym dictionary for immediate effect. Then,
> > every now and then or every Nth night, you move those synonyms over to
> > the index-side dictionary and trigger a full reindex.
> >
> > A nice side effect of reindexing now and then could be that if your
> > OpenNLP extraction dictionaries have changed, that will be reflected
> > too.
> >
> > BTW: Could you share details of your OpenNLP integration with us? I'm
> > about to do it on another project.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > Training in Europe - www.solrtraining.com
> >
> > On 1. juli 2010, at 06.57, Ravi Kiran wrote:
> >
> > > Hello,
> > > Hoping some Solr guru can help me out here. We are a news
> > > organization trying to migrate 10 million documents from FAST to
> > > Solr. The plan is to have our Editorial team add/modify synonyms
> > > multiple times a day as they deem appropriate. Hence we plan on
> > > using query-time synonyms, as we cannot reindex every time they
> > > modify the synonyms file (for the entities extracted by OpenNLP,
> > > like locations/organizations/person names from the article body).
> > > Since the synonyms are for names, I am concerned that the
> > > multi-phrase issue crops up with the query-time synonyms. For
> > > example, synonyms could be as follows:
> > >
> > > The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
> > > DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> > > USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> > >
> > > Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> > > Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
> > > William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
> > >
> > > Virginia, Va., VA
> > > D.C,Washington D.C, District of Columbia
> > >
> > > I have the following fieldType in schema.xml for the
> > > keywords/entities. What issues should I be aware of? And is there a
> > > better way to achieve it without having to reindex a million docs on
> > > each synonym change? NOTE that I use
> > > tokenizerFactory="solr.KeywordTokenizerFactory" for the
> > > SynonymFilterFactory to keep the words intact without splitting.
> > >
> > >    <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
> > >    <fieldType name="keywordText" class="solr.TextField"
> > >               sortMissingLast="true" omitNorms="true"
> > >               positionIncrementGap="100">
> > >      <analyzer type="index">
> > >        <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >        <filter class="solr.TrimFilterFactory"/>
> > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                words="stopwords.txt,entity-stopwords.txt"
> > >                enablePositionIncrements="true"/>
> > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >      </analyzer>
> > >      <analyzer type="query">
> > >        <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >        <filter class="solr.TrimFilterFactory"/>
> > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                words="stopwords.txt,entity-stopwords.txt"
> > >                enablePositionIncrements="true"/>
> > >        <filter class="solr.SynonymFilterFactory"
> > >                tokenizerFactory="solr.KeywordTokenizerFactory"
> > >                synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> > >                ignoreCase="true" expand="true"/>
> > >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >      </analyzer>
> > >    </fieldType>
> >
> >
>
