lucene-java-user mailing list archives

From semelak ss <semelak...@yahoo.com>
Subject Re: Exact Phrase Query
Date Sun, 02 Nov 2008 20:28:32 GMT
Hello Erick,

If it weren't for your help and kind response, I would be struggling now with the initial
problem I had. The solution to that problem turned out to be the one you mentioned in your
response (indexwriters/indexreaders both being opened at the same time).

The problem I mentioned in my last response is different from the initial question I posted.
It's really a request for thoughts and people's input on how to improve searching, given the
structure of the data described in my last response.

Again, I appreciate your help (and I am not saying this because I am looking forward to your
response.) 


--- On Sun, 11/2/08, Erick Erickson <erickerickson@gmail.com> wrote:

> From: Erick Erickson <erickerickson@gmail.com>
> Subject: Re: Exact Phrase Query
> To: java-user@lucene.apache.org, semelak_14@yahoo.com
> Date: Sunday, November 2, 2008, 12:11 PM
> Sorry, but I've really run out of patience here. You have consistently stated only
> part of the problem, never posting enough information to allow me to answer
> helpfully. You haven't even taken the time to proofread your posts, which
> has wasted my (limited, volunteer) time.
> 
> In the future, please consider the fact that people trying to help with your
> problem are volunteering their time and respect that fact by making a
> greater effort to make it easy and efficient for us to help with what is,
> after all, *your* problem.
> 
> Best
> Erick
> 
> On Sun, Nov 2, 2008 at 11:03 AM, semelak ss <semelak_14@yahoo.com> wrote:
> 
> > Also, is there a way to pass a null or no tokenizer when writing the field
> > "words" to the index? I have no need for tokenizing the words, and the
> > exact query will always be known.
> >
> > To explain the problem better: we are performing word comparison across a
> > large number of text documents. Each word in each sentence is compared with
> > the words in the other sentences. A similarity score is computed
> > for each pair and stored in the index for fast retrieval in the future
> > (computation of the score is resource intensive). What we used to do was
> > construct a matrix, store the words in alphabetical order (for binary
> > search), and then load the words when the program is launched. Due to the
> > size of the files generated, updating was a real struggle.
> >
> > Thus, we decided to use Lucene and store a score for each pair of words.
> > Updates should be much easier and faster; however, improving the search is
> > something we're looking into. We are new to Lucene and would appreciate any
> > input in this regard.
> >
> > Knowing that the document would contain only two fields, score and words,
> > and that no tokenization is needed, what would be the most efficient way of
> > implementing this index using Lucene?
> >
> >
> > --- On Sun, 11/2/08, semelak ss <semelak_14@yahoo.com> wrote:
> >
> > > From: semelak ss <semelak_14@yahoo.com>
> > > Subject: Re: Exact Phrase Query
> > > To: java-user@lucene.apache.org
> > > Date: Sunday, November 2, 2008, 7:26 AM
> > > > I was in a hurry when copying and pasting the code. What
> > > > I've been using is only writer. RamWriter was never used,
> > > > as it never really worked (thanks to you, I now understand
> > > > the reason).
> > > >
> > > > The above is not really related to the problem I was
> > > > facing. I modified my code so that an
> > > > indexreader/indexwriter is opened right before the words
> > > > comparison takes place and is closed right after (currently
> > > > not using RamDir due to the problems faced earlier).
> > > >
> > > > Considering that the program is basically a loop that does
> > > > thousands and thousands of comparisons, this is definitely
> > > > not the most efficient way of handling things.
> > > >
> > > > I would appreciate any input in this regard on how to
> > > > improve the efficiency.
> > >
> > >
> > >
> > > > --- On Sat, 11/1/08, Erick Erickson <erickerickson@gmail.com> wrote:
> > > >
> > > > > From: Erick Erickson <erickerickson@gmail.com>
> > > > > Subject: Re: Exact Phrase Query
> > > > > To: java-user@lucene.apache.org, semelak_14@yahoo.com
> > > > > Date: Saturday, November 1, 2008, 5:06 PM
> > > > > ahhhh, finally. I'm almost completely sure you can't *write* to a
> > > > > RAMDirectory and expect the underlying FSDir to be updated. The intent
> > > > > of a RAMDirectory is to *read* in an index from disk and keep it in
> > > > > memory. Essentially, I believe that your RAMDirectory constructor is
> > > > > taking a snapshot of the underlying disk index, modifying that in-memory
> > > > > copy, and throwing it away without ever writing it to disk. I wouldn't
> > > > > expect opening the FSDirectory after writing to the RAMDirectory to find
> > > > > anything. Ever.
> > > > >
> > > > > If you really need the RAMDir, I suspect you'll have to open an FS-based
> > > > > writer as well as a RAM-based writer, and write to both when necessary.
> > > > > You'll probably also have to open/search your RAM-based index as the
> > > > > faster alternative to re-opening the FS-based index. Either way, reopening
> > > > > the index is probably expensive; are you sure you need to? Is there a way
> > > > > to keep your information in an internal data structure for some period of
> > > > > time?
> > > >
> > > > Best
> > > > Erick
> > > >
> > > >
> > > >
> > > > On Sat, Nov 1, 2008 at 6:31 PM, semelak ss
> > > > <semelak_14@yahoo.com> wrote:
> > > >
> > > > > > I am not entirely sure if this can be the cause, but here is something I
> > > > > > thought might be related:
> > > > > > The idea is to have an index containing documents where each document has a
> > > > > > combination of two words, word1 and word2, and a score for these two words.
> > > > > > The index would be searched first to see if the two words exist, and if not, the
> > > > > > score would be computed on the fly and then added to the index. This process
> > > > > > would be repeated thousands of times for thousands of words.
> > > > > >
> > > > > > Hence, I have an indexwriter and a searcher:
> > > > > > --------------------
> > > > > > RAMDirectory ramDir = new RAMDirectory(INDEX_DIR);
> > > > > > IndexWriter ramWriter = new IndexWriter(ramDir, new WhitespaceAnalyzer(),
> > > > > >     true, IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > > writer = new IndexWriter(INDEX_DIR, new WhitespaceAnalyzer(),
> > > > > >     true, IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > >
> > > > > > FSDirectory fsdir = FSDirectory.getDirectory(INDEX_DIR);
> > > > > > IndexReader ir = IndexReader.open(fsdir);
> > > > > > _searcher = new IndexSearcher(ir);
> > > > > > --------------------
> > > > > >
> > > > > > The indexWriter is closed near the end of the program (it's open while
> > > > > > searching for word combinations).
> > > > > >
> > > > > > When using Luke, I was able to search successfully for exact phrases. My
> > > > > > guess is that the problem I am facing has something to do with the
> > > > > > indexWriter, but I cannot pinpoint the exact cause of the problem.
> > > > >
> > > > >
> > > > > > --- On Sat, 11/1/08, semelak ss <semelak_14@yahoo.com> wrote:
> > > > > >
> > > > > > > From: semelak ss <semelak_14@yahoo.com>
> > > > > > > Subject: Re: Exact Phrase Query
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Date: Saturday, November 1, 2008, 10:03 AM
> > > > > > > When using Luke, searching for the following gives me hits now:
> > > > > > > "insurer storm"
> > > > > > > The syntax of the query as parsed by Luke is:
> > > > > > > word:"insurer storm"
> > > > > > >
> > > > > > > The code I am using is as follows:
> > > > > > > ----------------------
> > > > > > > _searcher = new IndexSearcher(INDEX_DIR);
> > > > > > > _parser = new QueryParser("word", new WhitespaceAnalyzer());
> > > > > > > Query q = _parser.parse(query);
> > > > > > > System.out.println(q.toString()); // this outputs -> word:"insurer storm"
> > > > > > > TopDocs vv = _searcher.search(q, 1);
> > > > > > > Hits tmph = _searcher.search(q);
> > > > > > > ---------------------------------
> > > > > > >
> > > > > > > Both vv and tmph give no results (their size is 0).
> > > > > >
> > > > > >
> > > > > >
> > > > > > > --- On Fri, 10/31/08, semelak ss <semelak_14@yahoo.com> wrote:
> > > > > > >
> > > > > > > > From: semelak ss <semelak_14@yahoo.com>
> > > > > > > > Subject: Re: Exact Phrase Query
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Date: Friday, October 31, 2008, 9:41 AM
> > > > > > > > For indexing, I use the following:
> > > > > > > > ===========
> > > > > > > > writer = new IndexWriter(INDEX_DIR, new WhitespaceAnalyzer(), true,
> > > > > > > >     IndexWriter.MaxFieldLength.UNLIMITED);
> > > > > > > > Document doc = new Document();
> > > > > > > > String tmpword = this.getProperForm(word1, word2);
> > > > > > > > doc.add(new Field("WORDS", tmpword, Field.Store.YES, Field.Index.TOKENIZED));
> > > > > > > > doc.add(new Field("score", Double.toString(score), Field.Store.YES, Field.Index.NO));
> > > > > > > > writer.addDocument(adoc);
> > > > > > > > ============
> > > > > > > >
> > > > > > > > For searching, I use the following (query = "homeowner work" gives no hits;
> > > > > > > > "homeowner" gives results):
> > > > > > > > ============
> > > > > > > > _searcher = new IndexSearcher(INDEX_DIR);
> > > > > > > > _parser = new QueryParser("WORDS", new WhitespaceAnalyzer());
> > > > > > > > q = _parser.parse(query);
> > > > > > > > Hits tmph = _searcher.search(q);
> > > > > > > > ============
> > > > > > > >
> > > > > > > > A sample document (contained in the index) is the following:
> > > > > > > >
> > > > > > > > field: value
> > > > > > > > -----: -----
> > > > > > > > WORDS: "homeowners work"
> > > > > > > > score: 0.1515417
> > > > > > > >
> > > > > > > > Also, please note that I tried using Luke to browse the index, and the
> > > > > > > > fields seem to be filled out with words just as expected. Searching with
> > > > > > > > exact phrases, however, yields no answer. Searching with single words
> > > > > > > > gives hits.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > --- On Fri, 10/31/08, Erick Erickson <erickerickson@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > From: Erick Erickson <erickerickson@gmail.com>
> > > > > > > > > Subject: Re: Exact Phrase Query
> > > > > > > > > To: java-user@lucene.apache.org, semelak_14@yahoo.com
> > > > > > > > > Date: Friday, October 31, 2008, 5:57 AM
> > > > > > > > > You need to give us more information for meaningful replies, like
> > > > > > > > > the analyzers you use when indexing and searching, the exact
> > > > > > > > > query you use, perhaps snippets of the code, etc.
> > > > > > > > >
> > > > > > > > > That said, things to check:
> > > > > > > > > Get a copy of Luke and examine your index. You can even
> > > > > > > > > run queries through that tool and see what gets sent to the
> > > > > > > > > database and what responses you get with those analyzers.
> > > > > > > > >
> > > > > > > > > Make sure your analyzers at query and index time are doing
> > > > > > > > > what you expect. Query.toString() is your friend. If you don't
> > > > > > > > > take the time to understand analyzers, you'll spend lots of time
> > > > > > > > > spinning your wheels.
> > > > > > > > >
> > > > > > > > > And you really should wait more than 9 minutes before pinging
> > > > > > > > > the list....
> > > > > > > >
> > > > > > > > Best
> > > > > > > > Erick
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Fri, Oct 31, 2008 at 8:44 AM, semelak ss <semelak_14@yahoo.com> wrote:
> > > > > > > >
> > > > > > > > > > I have documents containing multiple words in the field "word".
> > > > > > > > > > For example, one of the documents contains the following in the
> > > > > > > > > > field "word":
> > > > > > > > > > homeowners work
> > > > > > > > > >
> > > > > > > > > > When searching for single words (i.e. homewoners) I get hits.
> > > > > > > > > >
> > > > > > > > > > However, searching for the exact phrase "homeowners work" gives me no
> > > > > > > > > > hits!! I use double quotes when searching for exact phrases.
> > > > > > > > > >
> > > > > > > > > > Any idea why??
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe,
> e-mail:
> > > > > > > >
> > > java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional
> commands,
> > > e-mail:
> > > > > > > >
> java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> > > > > > >
> java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands,
> e-mail:
> > > > > > >
> java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail:
> > > > > >
> java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail:
> > > > > > java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail:
> > > > java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail:
> > > > java-user-help@lucene.apache.org
> > > > >
> > > > >
> > >
> > >
> > >
> > >
> > >
> > >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > > java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail:
> > > java-user-help@lucene.apache.org
> >
> >
> >
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> >
> >
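
A note on the lookup pattern discussed in the quoted thread: each document holds one word pair and its score, the pair is looked up first, and the score is computed only on a miss. Erick's question about keeping the information in an internal data structure for some period could be sketched in plain Java as an in-memory get-or-compute cache. This is only a sketch under assumptions, not code from the thread and not Lucene API: the class PairScoreCache, its methods, and the stand-in scorer are hypothetical; the key format (the two words sorted and joined by one space) is a guess at what getProperForm might produce; and it assumes Java 8 or later and a symmetric score.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Illustrative in-memory get-or-compute cache for word-pair scores (hypothetical). */
final class PairScoreCache {
    private final Map<String, Double> scores = new HashMap<>();
    private final ToDoubleBiFunction<String, String> scorer;
    private int computations = 0;

    PairScoreCache(ToDoubleBiFunction<String, String> scorer) {
        this.scorer = scorer;
    }

    /** Order-independent key: the two words sorted, joined by a single space. */
    static String key(String word1, String word2) {
        return word1.compareTo(word2) <= 0
                ? word1 + " " + word2
                : word2 + " " + word1;
    }

    /** Returns the cached score; computes and stores it on the first request. */
    double score(String word1, String word2) {
        String k = key(word1, word2);
        Double cached = scores.get(k);
        if (cached == null) {
            // Assumes the score is symmetric, so argument order does not matter.
            cached = scorer.applyAsDouble(word1, word2);
            computations++;
            scores.put(k, cached);
        }
        return cached;
    }

    /** Number of times the (expensive) scorer was actually invoked. */
    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        // Stand-in scorer (assumption): the real similarity measure is not shown in the thread.
        PairScoreCache cache = new PairScoreCache((a, b) -> a.length() + b.length());
        System.out.println(cache.score("storm", "insurer"));
        System.out.println(cache.score("insurer", "storm")); // same pair, served from the cache
        System.out.println(cache.computations());
    }
}
```

A cache like this could sit in front of the index, so that the loop consults memory first and touches the index only for pairs it has not seen during the current run. Whether that helps depends on how often pairs repeat, which the thread does not say.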


      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

