lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rafael Rossini" <rafael.ross...@gmail.com>
Subject Re: Extract terms not by reader, but by documents
Date Wed, 05 Sep 2007 14:48:04 GMT
Thank´s for the reply Grant, let me try to explain exactly what I´d like to
do. Take the 2 docs:

Doc1: "Microsoft is a nice software company, and Xbox seems to be a nice
product too."
Doc2: "Nintendo and Sony have been in the game industry for a long time, but
now, Microsoft is trying to enter with Xbox"

Now If I have a query like this, "(Nintendo AND Sony AND Microsoft) OR
(Xbox)" and perform a query.rewrite(reader).extractTerms(set), my set is
going to have all the terms (Nintendo, Sony, Microsoft, Xbox) right? But
when I iterate the docs, I wanted to do something like
query.rewrite(doc).extracTerms(set),  [I know this method does not exist, is
just and example of the funcionality].

So, for Doc1, my set would be populate with (Xbox) only, and for Doc2 the
set would be populated with (Nintendo, Sony, Microsoft, Xbox).
Is this possible? Is it clear now what I´m trying to achieve?

[]s
     Rossini

On 9/4/07, Grant Ingersoll <gsingers@apache.org> wrote:
>
> Not sure if I am understanding what you are trying to do.  I think
> you are trying to find out which terms occurred in a particular
> document, correct?
>
> I also am not sure about your first example.  My understanding of
> extractTerms is that it just gives you back the set of all terms that
> occur in the _query_, not necessarily those that matched in the
> document, although it has this effect for things like WildcardQuery
> and others that get expanded using TermEnum since they are expanded
> based on what is in the index.  I think this is best seen by the
> implementation of extractTerms() in TermQuery.java in which it just
> adds the term from the query into the set.  Likewise for BooleanQuery
> which loops over the clauses and extracts the terms from each clause
> and adds them to the set.  Thus, if you had a boolean query of all
> term queries, you would get back the set of all the terms.
>
> As for the problem it sounds like you are interested in, you could
> use SpanQuery functionality with some post processing analysis or try
> using Term Vectors and the new (unreleased) TermVectorMapper (TVM)
> functionality (or possibly a combination of both).  In this case, you
> will need to write your own implementation of the TVM that takes in
> the query so it knows what terms to identify. If you go the latter
> route, know that it is new functionality and probably doesn't have a
> whole lot of users yet, so there may still be issues with it.  See
> the nightly build or nightly javadocs for info on these.
>
> The other question that might be helpful, is what custom highlighting
> are you doing that isn't covered by the contrib/highlighter?  Perhaps
> you have some suggestions that are generic enough to help improve
> it?  Just a thought.
>
> Hope this helps,
> Grant
>
> On Sep 4, 2007, at 5:01 PM, Rafael Rossini wrote:
>
> > Hi all,
> >
> >     In some custom highlighting, I often write a code like this:
> >
> >        Set<Term> matchedTerms = new HashSet<Term>();
> >        query.rewrite(reader).extractTerms(matchedTerms);
> >
> >     With this code the Term Set gets populated by the matched query
> > in your
> > whole index. Is it possible to this with a document instead of the
> > reader?
> > Something like
> > query.rewrite(documentId).extractTerms(matchedTerms) ?
> >
> > []s
> >      Rossini
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message