lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Filter to retrieve random documents without specific terms ?
Date Wed, 30 Mar 2011 09:44:54 GMT
If your query explicitly excludes certain terms then surely you can be
confident that matched docs will not contain those terms, and if your
random docs are a subset of those matched docs they won't contain them
either.
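
A minimal sketch of that idea in plain Java, assuming the doc ids matched
by the exclusion query have already been collected into a list (for
example from ScoreDoc.doc). The class and method names are hypothetical
and not part of Lucene:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RandomSubset {
    /**
     * Returns up to n doc ids chosen at random, without repeats, from the
     * ids that the exclusion query already matched. Because every input id
     * came from that query, none of the picked docs contains the term.
     */
    static List<Integer> pick(List<Integer> matchedDocIds, int n, Random random) {
        // Copy first so the caller's list is left untouched.
        List<Integer> shuffled = new ArrayList<Integer>(matchedDocIds);
        Collections.shuffle(shuffled, random);
        int count = Math.min(n, shuffled.size());
        return new ArrayList<Integer>(shuffled.subList(0, count));
    }

    public static void main(String[] args) {
        // Stand-in values for the doc ids returned by the search (assumption).
        List<Integer> matched = Arrays.asList(3, 8, 15, 21, 42, 57, 64, 70, 88, 93, 101, 120);
        System.out.println(pick(matched, 10, new Random()));
    }
}
```

No per-document check is needed afterwards: if fewer than n docs come
back, the query simply matched fewer than n docs.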


--
Ian.


On Tue, Mar 29, 2011 at 11:01 PM, Patrick Diviacco
<patrick.diviacco@gmail.com> wrote:
> One last thing, how do I check if the random document does not contain the
> term ?
>
> In other words, I cannot just pass the TermsFilter but I need to check if
> the retrieved random document is valid or not to know if I have enough.
>
> Any code example is appreciated. So far I have this one, to retrieve docs
> without that specific term.
>
> BooleanFilter termsNOTFilter = new BooleanFilter();
> FilterClause notTermClause = new FilterClause(termsFilter,
> org.apache.lucene.search.BooleanClause.Occur.MUST_NOT);
> termsNOTFilter.add(notTermClause);
>
> thanks
>
>
>
>
> On 29 March 2011 22:12, Ian Lea <ian.lea@gmail.com> wrote:
>
>> > Plan A sounds better because I don't want to consider the entire
>> > collection and then remove results from it.
>>
>> Fine, your choice.
>>
>> > However, the same code has to work with 2 different collections. The
>> > first one has 30,000 docs, the other one 90,000.
>>
>> No problem.  The number of docs is irrelevant.
>>
>> > How can I get the total amount of docs from a collection ?
>>
>> IndexReader.numDocs().  See also maxDoc() and numDeletedDocs().
>>
>>
>> --
>> Ian.
>>
>> > On 29 March 2011 21:48, Ian Lea <ian.lea@gmail.com> wrote:
>> >
>> >> Here are a couple of ideas.
>> >>
>> >> Plan A.
>> >>
>> >> Think of a number, say 10, retrieve n * 10 docids in your search and
>> >> then loop round java.util.Random.nextInt(n * 10) until you've got
>> >> enough.
>> >>
>> >> Plan B.
>> >>
>> >> Reverse your MUST NOT search to get a list of docids that you don't
>> >> want, then loop round Random.nextInt(indexreader.maxDoc()), selecting
>> >> those that are not deleted (!indexreader.isDeleted(docid)) and are not
>> >> in your exclusion list.
>> >>
>> >>
>> >> I'm sure there are other ways, probably better.
>> >>
>> >>
>> >> --
>> >> Ian.
>> >>
>> >>
>> >> On Tue, Mar 29, 2011 at 8:00 PM, Patrick Diviacco
>> >> <patrick.diviacco@gmail.com> wrote:
>> >> > Ok I've solved the first part of the problem. I'm now selecting all
>> >> > documents that do not contain a given term with a BooleanFilter
>> >> > and FilterClause, MUST NOT.
>> >> >
>> >> > I still have to understand how to retrieve random documents and limit
>> >> > the number of retrieved docs to N.
>> >> >
>> >> > thanks
>> >> >
>> >> > On 29 March 2011 20:40, Patrick Diviacco <patrick.diviacco@gmail.com> wrote:
>> >> >
>> >> >> Is there a Filter to get a limited number of random collection docs
>> >> >> from the index which DO NOT contain a specific term ?
>> >> >>
>> >> >> i.e. term="pizza"
>> >> >>
>> >> >> I want to run the query against 10 random documents of the collection
>> >> >> that do not contain the term "pizza".
>> >> >>
>> >> >> thanks
>> >> >>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>>
>
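
Plan B from the quoted message could look roughly like the sketch below.
It is plain Java, with two sets standing in for the exclusion list and
for the docs that indexreader.isDeleted(docid) would report. The class
and method names are hypothetical. Ids are drawn from the range below
maxDoc, since deleted docs still occupy id slots:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

class RejectionSampler {
    /**
     * Draws random doc ids in [0, maxDoc) and keeps those that are neither
     * excluded nor deleted, until n distinct ids are found or every
     * eligible id has been used.
     */
    static Set<Integer> sample(int maxDoc, Set<Integer> excluded,
                               Set<Integer> deleted, int n, Random random) {
        Set<Integer> blocked = new HashSet<Integer>(excluded);
        blocked.addAll(deleted);

        // Count blocked ids inside the valid range so the loop below
        // cannot spin forever when fewer than n eligible docs exist.
        int blockedInRange = 0;
        for (int id : blocked) {
            if (id >= 0 && id < maxDoc) {
                blockedInRange++;
            }
        }
        int target = Math.max(0, Math.min(n, maxDoc - blockedInRange));

        Set<Integer> picked = new LinkedHashSet<Integer>();
        while (picked.size() < target) {
            int docId = random.nextInt(maxDoc);
            if (!blocked.contains(docId)) {
                picked.add(docId); // the set ignores repeated draws
            }
        }
        return picked;
    }
}
```

This stays cheap while the excluded docs are a small share of the index.
If most docs contain the term, most draws get rejected and Plan A is
likely the better fit.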
