lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Samarendra Pratap <samarz...@gmail.com>
Subject Re: Single filter instance with different searchers
Date Wed, 10 Nov 2010 06:27:34 GMT
Thanks Erick for you insight.

I'd appreciate if someone could throw more light on it.

Thanks

On Tue, Nov 9, 2010 at 11:27 PM, Erick Erickson <erickerickson@gmail.com>wrote:

> I'm going to have to leave answering that to people with more
> familiarity with the underlying code than I have...
>
> That said, I'd #guess# that you'll be OK because I'd #guess# that
> filters are maintained on a per-reader basis and the results
> are synthesized when combined in a MultiSearcher.
>
> But that's all a guess....
>
> Best
> Erick
>
> On Tue, Nov 9, 2010 at 2:48 AM, Samarendra Pratap <samarzone@gmail.com
> >wrote:
>
> > Thanks Erick, you cleared some of my confusions. But I still have a
> doubt.
> >
> >  As you can see in previous example code I am re-creating parallel multi
> > searcher for each search. (This is the actual scenario on production
> > servers)
> > The ParallelMultiSearcher constructor is taking different combination of
> > searchers each time. It means that the same document may be assigned a
> > different docid for next search.
> >
> > So my primary question is - Will the cached results from a filter created
> > with one multi searcher, work fine with another multi searcher?
> (Underlying
> > IndexSearchers are opened only once. It is combination of IndexSearchers
> > which is varying for each search.)
> >
> >  I have tested it with my real code and sample indexes and it gives me a
> > feeling that results are correct, but I am not able to understand how,
> > given
> > my above confusion.
> >
> > Can you suggest me a something with another curiosity - Which option will
> > be
> > more efficient - 1. MultiSearchers  (either recreating for each search or
> > reusing cached ones) with different searchers or 2. having a single index
> > for all last update date criteria and using filters for different
> > combinations of last update dates.
> >  As I wrote in my previous mail we have different physical indexes based
> on
> > different ranges of update dates. We select appropriate indexes based on
> > the
> > user selected options.
> >
> > On Tue, Nov 9, 2010 at 4:25 AM, Erick Erickson <erickerickson@gmail.com
> > >wrote:
> >
> > > Ignore my previous, I thought you were constructing your own filters.
> > What
> > > you're doing should
> > > be OK.
> > >
> > > Here's the source of my confusion.  Each of your indexes has Lucene
> > > document
> > > IDs starting at
> > > 0. In your example, you have two docs/index. So, if you created a
> Filter
> > > via
> > > lower-level
> > > calls, it could not be applied across different indexes. See the
> > discussion
> > > here:
> > > http://www.gossamer-threads.com/lists/lucene/java-user/106376. That
> is,
> > > the bit in your Filter for index0, doc0 would be the same bit as in
> > index1,
> > > doc0.
> > >
> > > But, that's not what you are doing. The (Parallel)MultiSearcher takes
> > > care of mapping these doc IDs appropriately for you so you don't have
> to
> > > worry about
> > > what I was thinking about. Here's a program that illustrates this. It
> > > creates
> > > three RAMDirectories then  dumps the Lucene doc ID from each. Then it
> > > creates
> > > a multisearcher from the same three dirs and walks that, dumping the
> > Lucene
> > > doc ID.
> > > You'll see that the doc IDs change even though the contents are the
> > > same....
> > >
> > > Again, though, this isn't a problem because you are using a
> > MultiSearcher,
> > > which
> > > takes care of this for you.
> > >
> > > Which is yet another reason to never, never, never count on lucene doc
> > IDs
> > > outside their context!
> > >
> > > Output at the end......
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.document.Field;
> > > import org.apache.lucene.index.IndexWriter;
> > > import org.apache.lucene.search.*;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.RAMDirectory;
> > > import org.apache.lucene.util.Version;
> > >
> > > import java.io.IOException;
> > >
> > > import static org.apache.lucene.index.IndexWriter.*;
> > >
> > > public class EoeTest {
> > >    public static void main(String[] args) {
> > >        EoeTest eoe = new EoeTest();
> > >        eoe.doIt();
> > >    }
> > >    private void doIt() {
> > >        try {
> > >            populateIndexes();
> > >            searchAndSpit();
> > >            tryMulti();
> > >         } catch (Exception e) {
> > >            e.printStackTrace();
> > >        }
> > >
> > >    }
> > >
> > >     private Searcher getMulti() throws IOException {
> > >        IndexSearcher[] searchers = new IndexSearcher[3];
> > >        searchers[0] = new IndexSearcher(_ram1, true);
> > >        searchers[1] = new IndexSearcher(_ram2, true);
> > >        searchers[2] = new IndexSearcher(_ram3, true);
> > >        return new MultiSearcher(searchers);
> > >    }
> > >    private void tryMulti() throws IOException {
> > >        searchOne("multi ", getMulti());
> > >    }
> > >
> > >    private void searchAndSpit() throws IOException {
> > >        searchOne("ram1", new IndexSearcher(_ram1, true));
> > >        searchOne("ram2", new IndexSearcher(_ram2, true));
> > >        searchOne("ram3", new IndexSearcher(_ram3, true));
> > >    }
> > >    private void searchOne(String which, Searcher is) throws IOException
> {
> > >        log("dumping " + which);
> > >        TopDocs hits = is.search(new MatchAllDocsQuery(), 100);
> > >        for (int idx = 0; idx < hits.scoreDocs.length; ++idx) {
> > >            ScoreDoc sd = hits.scoreDocs[idx];
> > >            Document doc = is.doc(sd.doc);
> > >            log(String.format("lid: %d, content: %s", sd.doc,
> > > doc.get("content")));
> > >        }
> > >        is.close();
> > >    }
> > >    private void log(String msg) {
> > >        System.out.println(msg);
> > >    }
> > >    private void populateIndexes() throws IOException {
> > >        popOne(_ram1);
> > >        popOne(_ram2);
> > >        popOne(_ram3);
> > >    }
> > >
> > >    private void popOne(Directory dir) throws IOException {
> > >        IndexWriter iw = new IndexWriter(dir, _std,
> > MaxFieldLength.LIMITED);
> > >        Document doc = new Document();
> > >        doc.add(new Field("content", "common " +
> > > Double.toString(Math.random()), Field.Store.YES, Field.Index.ANALYZED,
> > > Field.TermVector.YES));
> > >        iw.addDocument(doc);
> > >
> > >        doc = new Document();
> > >        doc.add(new Field("content", "common " +
> > > Double.toString(Math.random()), Field.Store.YES, Field.Index.ANALYZED,
> > > Field.TermVector.YES));
> > >        iw.addDocument(doc);
> > >
> > >        iw.close();
> > >    }
> > >
> > >
> > >    Directory _ram1 = new RAMDirectory();
> > >    Directory _ram2 = new RAMDirectory();
> > >    Directory _ram3 = new RAMDirectory();
> > >    Analyzer _std = new StandardAnalyzer(Version.LUCENE_29);
> > > }
> > >
> > > ************************************output****************
> > > where lid: ### is the Lucene doc ID returned in scoreDocs
> > > ***********************************************************
> > >
> > > dumping ram1
> > > lid: 0, content: common 0.11100571422470962
> > > lid: 1, content: common 0.31555863707233567
> > > dumping ram2
> > > lid: 0, content: common 0.01235509997022377
> > > lid: 1, content: common 0.7017712652104814
> > > dumping ram3
> > > lid: 0, content: common 0.9472403989314128
> > > lid: 1, content: common 0.7105628402082196
> > > dumping multi
> > > lid: 0, content: common 0.11100571422470962
> > > lid: 1, content: common 0.31555863707233567
> > > lid: 2, content: common 0.01235509997022377
> > > lid: 3, content: common 0.7017712652104814
> > > lid: 4, content: common 0.9472403989314128
> > > lid: 5, content: common 0.7105628402082196
> > >
> > >
> > >
> > >
> > > On Mon, Nov 8, 2010 at 3:33 AM, Samarendra Pratap <samarzone@gmail.com
> > > >wrote:
> > >
> > > > Hi Erick, Thanks for the reply.
> > > >  Your answer have puzzled me more because what I am able to view is
> not
> > > > what you say or I am not able to grasp your meaning.
> > > >  I have written a small program which is exactly what my original
> > > question
> > > > was. Here I am creating a CachingWrapperFilter on one index and
> reusing
> > > it
> > > > on other indexes. This single filter gives me results as expected
> from
> > > each
> > > > of the index. I will appreciate if you can throw some light.
> > > >
> > > > I have given the output after the program ends
> > > >
> > > >
> > > >
> > >
> >
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> > > > // following program is compiled with java6
> > > >
> > > > import org.apache.lucene.index.*;
> > > > import org.apache.lucene.analysis.*;
> > > > import org.apache.lucene.analysis.standard.*;
> > > > import org.apache.lucene.search.*;
> > > > import org.apache.lucene.search.spans.*;
> > > > import org.apache.lucene.store.*;
> > > > import org.apache.lucene.document.*;
> > > > import org.apache.lucene.queryParser.*;
> > > > import org.apache.lucene.util.*;
> > > >
> > > > import java.util.*;
> > > >
> > > > public class FilterTest
> > > > {
> > > > protected Directory[] dirs;
> > > >  protected Analyzer a;
> > > > protected Searcher[] searchers;
> > > > protected QueryParser qp;
> > > >  protected Filter f;
> > > > protected Hashtable<String, Filter> filters;
> > > >
> > > >  public FilterTest()
> > > > {
> > > > // create analyzer
> > > >  a = new StandardAnalyzer(Version.LUCENE_29);
> > > > // create query parser
> > > > qp = new QueryParser(Version.LUCENE_29, "content", a);
> > > >  // initialize "filters" Hashtable
> > > > filters = new Hashtable<String, Filter>();
> > > >  }
> > > >
> > > > protected void createDirectories(int length)
> > > > {
> > > >  // create specified number of RAM directories
> > > > dirs = new Directory[length];
> > > >  for(int i=0;i<length;i++)
> > > > dirs[i] = new RAMDirectory();
> > > > }
> > > >
> > > > protected void createIndexes() throws Exception
> > > > {
> > > > /* create indexes for each directory.
> > > >  each index contains two documents.
> > > > every document contains one term, unique across all indexes, one term
> > > > unique across single index and one term common to all indexes
> > > >  */
> > > > for(int i=0;i<dirs.length;i++)
> > > > {
> > > >  IndexWriter iw = new IndexWriter(dirs[i], a, true,
> > > > IndexWriter.MaxFieldLength.LIMITED);
> > > >
> > > > Document d = new Document();
> > > >  // unique id across all indexes
> > > > d.add(new Field("id", ""+(i*2+1), Field.Store.YES,
> > > > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> > > >  // unique id in a single indexes
> > > > d.add(new Field("docnumber", "1", Field.Store.YES,
> > > > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> > > >  // common word in all indexes
> > > > d.add(new Field("content", "common", Field.Store.YES,
> > > Field.Index.ANALYZED,
> > > > Field.TermVector.YES));
> > > >  iw.addDocument(d);
> > > >
> > > > d = new Document();
> > > > // unique id across all indexes
> > > >  d.add(new Field("id", ""+(i*2+2), Field.Store.YES,
> > > > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> > > > // unique id in a single indexes
> > > >  d.add(new Field("docnumber", "2", Field.Store.YES,
> > > > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> > > > // common word in all indexes
> > > >  d.add(new Field("content", "common", Field.Store.YES,
> > > > Field.Index.ANALYZED, Field.TermVector.YES));
> > > > iw.addDocument(d);
> > > >
> > > > iw.close();
> > > > }
> > > > }
> > > >
> > > > protected void openSearchers() throws Exception
> > > > {
> > > > // open searches for every directory and save it in an array
> > > >  searchers = new Searcher[dirs.length];
> > > > for(int i=0;i<dirs.length;i++)
> > > >  searchers[i] = new IndexSearcher(IndexReader.open(dirs[i], true));
> > > > }
> > > >
> > > >  protected Searcher getSearcher(int[] arr) throws Exception
> > > > {
> > > > // provides a ParallelMultiSearcher instance with the searchers lying
> > at
> > > > index positions provided in the argument
> > > >  Searcher[] s = new Searcher[arr.length];
> > > > for(int i=0;i<arr.length;i++)
> > > >  s[i] = this.searchers[arr[i]];
> > > >
> > > > return new ParallelMultiSearcher(s);
> > > >  }
> > > >
> > > > protected ScoreDoc[] search(String query, String filter, Searcher s)
> > > throws
> > > > Exception
> > > >  {
> > > > Filter f = null;
> > > > if(filter != null)
> > > >  {
> > > > if(filters.containsKey(filter))
> > > > {
> > > >  System.out.println("Reusing filter for - " + filter);
> > > > f = filters.get(filter);
> > > >  }
> > > > else
> > > > {
> > > >  System.out.println("Creating new filter for - " + filter);
> > > > f = new CachingWrapperFilter(new
> QueryWrapperFilter(qp.parse(filter)));
> > > >  filters.put(filter, f);
> > > > }
> > > > }
> > > >  System.out.println("Query:("+query+"), Filter:("+filter+")");
> > > > return s.search(qp.parse(query), f, 1000).scoreDocs;
> > > >  }
> > > >
> > > > public static void main(String[] args) throws Exception
> > > >  {
> > > >  FilterTest ft = new FilterTest();
> > > > ft.startTest();
> > > > }
> > > >
> > > > public void startTest()
> > > > {
> > > > try
> > > >  {
> > > > Query q;
> > > >
> > > > createDirectories(3);
> > > >  createIndexes();
> > > > openSearchers();
> > > > Searcher s;
> > > >  ScoreDoc[] sd;
> > > >
> > > > System.out.println("===================================");
> > > >  System.out.println("Fields of all the documents");
> > > > // creating searcher for all indexes
> > > >  s = getSearcher(new int[]{0,1,2});
> > > > // search all documents and their ids
> > > >  sd = search("+content:common", null, s);
> > > > for(int i=0;i<sd.length;i++)
> > > >  {
> > > > System.out.println("\tid:"+s.doc(sd[i].doc).get("id")+",
> > > > docnumber:"+s.doc(sd[i].doc).get("docnumber"));
> > > >  }
> > > > System.out.println("\n\n");
> > > >
> > > > System.out.println("===================================");
> > > >  System.out.println("Searching for documents in a single index.
> Filter
> > > > will be created and cached");
> > > > s = getSearcher(new int[]{0});
> > > >  sd = search("+content:common", "docnumber:1", s);
> > > > System.out.println("Hits:"+sd.length);
> > > >  for(int i=0;i<sd.length;i++)
> > > > {
> > > > System.out.println("\tid:"+s.doc(sd[i].doc).get("id")+",
> > > > docnumber:"+s.doc(sd[i].doc).get("docnumber"));
> > > >  }
> > > > System.out.println("\n\n");
> > > >
> > > > System.out.println("===================================");
> > > >  System.out.println("Searching for documents in a other indexes other
> > > than
> > > > previous search. Query and filter will be same. Filter will be
> > reused");
> > > >  s = getSearcher(new int[]{1,2});
> > > > sd = search("+content:common", "docnumber:1", s);
> > > >  System.out.println("Hits:"+sd.length);
> > > > for(int i=0;i<sd.length;i++)
> > > >  {
> > > > System.out.println("\tid:"+s.doc(sd[i].doc).get("id")+",
> > > > docnumber:"+s.doc(sd[i].doc).get("docnumber"));
> > > >  }
> > > >
> > > > }
> > > > catch(Exception e)
> > > >  {
> > > > e.printStackTrace();
> > > > }
> > > >  }
> > > > }
> > > >
> > > >
> > >
> >
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> > > > OUTPUT:
> > > > [samar@myserver java]$ java FilterTest
> > > > ===================================
> > > > Fields of all the documents
> > > > Query:(+content:common), Filter:(null)
> > > >         id:1, docid:1
> > > >         id:2, docid:2
> > > >         id:3, docid:1
> > > >         id:4, docid:2
> > > >         id:5, docid:1
> > > >         id:6, docid:2
> > > >
> > > >
> > > >
> > > > ===================================
> > > > Searching for documents in a single index. Filter will be created and
> > > > cached
> > > > Creating new filter for - docid:1
> > > > Query:(+content:common), Filter:(docid:1)
> > > > Hits:1
> > > >         id:1, docid:1
> > > >
> > > >
> > > >
> > > > ===================================
> > > > Searching for documents in indexes other than previous search. Query
> > and
> > > > filter will be same. Filter will be reused
> > > > Reusing filter for - docid:1
> > > > Query:(+content:common), Filter:(docid:1)
> > > > Hits:2
> > > >         id:3, docid:1
> > > >         id:5, docid:1
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Nov 3, 2010 at 7:04 PM, Erick Erickson <
> > erickerickson@gmail.com
> > > >wrote:
> > > >
> > > >> I'm assuming you're down in Lucene land. Unless somehow you've
> > > >> gotten 63 separate filters when you think you only have one, I don't
> > > >> think what you're doing will work. Or I'm failing to understand what
> > > >> you're doing at all.
> > > >>
> > > >> The problem is I expect each of your indexes starts with document
> > > >> 1. So your Filter is really a bit set keyed by Lucene document ID.
> > > >>
> > > >> So applying filter 2 to index 54 will NOT do what you want. What I
> > > >> suspect you're seeing is that applying your filter is producing
> enough
> > > >> results from index 54 (to continue my example) to fool you into
> > > >> thinking it's working.
> > > >>
> > > >> Try running the query with and without the filter on each of your
> > > indexes,
> > > >> perhaps as a control including a restrictive clause in the query
> > > >> to do the same thing your filter is doing. Or construct the filter
> new
> > > >> for comparison.... If the numbers continue to be the same, I clearly
> > > >> don't understand something! <G>....
> > > >>
> > > >> Best
> > > >> Erick
> > > >>
> > > >> On Wed, Nov 3, 2010 at 6:05 AM, Samarendra Pratap <
> > samarzone@gmail.com
> > > >> >wrote:
> > > >>
> > > >> > Hi. We have a large index (~ 28 GB) which is distributed in three
> > > >> different
> > > >> > directories, each representing a country. Each of these country
> wise
> > > >> > indexes
> > > >> > is further distributed on the basis of last update date into
21
> > > smaller
> > > >> > indexes. This index is updated once in a day.
> > > >> >
> > > >> > A user can search into any of one country and can choose last
> update
> > > >> date
> > > >> > plus some other criteria.
> > > >> >
> > > >> > When the server application starts, index readers and hence
> > searchers
> > > >> are
> > > >> > created for each of the small indexes (21 x 3) and put in an
> array.
> > > >> > Depending on the option (country and last update date) chosen
by
> > user
> > > we
> > > >> > pick the searchers of correct date range/country and create a
new
> > > >> > ParallelMultiSearcher instance.
> > > >> >
> > > >> > Now my question is - can I use single filter (caching filter)
> > instance
> > > >> for
> > > >> > every search (may be on different searchers)?
> > > >> >
> > > >> >
> > > >>
> > >
> >
> ===================================================================================
> > > >> >
> > > >> > e.g
> > > >> > for first search i create an filter of experience 4 years and
save
> > it.
> > > >> >
> > > >> > if another search for a different country (and hence difference
> > index)
> > > >> also
> > > >> > has same experience criteria, i.e. 4 years, can i use the same
> > filter
> > > >> > instance for second search too?
> > > >> >
> > > >> > i have tested a little for this and surprisingly i have got
> correct
> > > >> > results.
> > > >> > i was wondering if this is the correct way. or do i need to create
> > > >> > different
> > > >> > filters for each searcher (or index reader) instance?
> > > >> >
> > > >> > Thanks in advance.
> > > >> >
> > > >> > --
> > > >> > Regards,
> > > >> > Samar
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Samar
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Samar
> >
>



-- 
Regards,
Samar

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message