lucene-java-user mailing list archives

From "anisha@ekkitab" <ani...@ekkitab.com>
Subject Re: how to use DuplicateFilter to get unique documents based on a fieldName
Date Fri, 05 Mar 2010 13:20:37 GMT

OK, sorry for not explaining my problem clearly earlier. We have around 5
fields in each document: ID, ISBN, author, title, and the category that
the book falls under. (You are right about point 3: we are indeed storing
multiple genres against the book, which means 1 book, 1 doc.)

doc.add(new Field("entityId", book.get("id"), Field.Store.YES, Field.Index.NO));
doc.add(new Field("author", book.get("author"), Field.Store.NO, Field.Index.TOKENIZED));
etc etc... and this document is added using the IndexWriter.

When a search is issued, we look for the search term in
title/author/isbn/category, depending on some inputs, and a set of books
is returned. (You are right about point 2 as well: since we search only on
title/author/genre, we were only indexing those.) We wanted to lay these
books out so that the user can navigate through the categories that the
matching books belong to, as a way of narrowing the search.

While the count of books for the given search term under a particular
category was correct, the overall count of books was incorrect because
some books were repeated in various categories. For this reason, we
wanted a duplicate filter on the ID, which would give us only the unique
books. There was something wrong in the way it was implemented: the
ID in the document was not indexed, as you can see in the above code. When
this was fixed it worked as expected, except for some performance issues
caused by the huge index size (3 million books). Anyway, it looks like we
have figured out a solution (moved the filter out of the search and applied
it on the results, or something like that). Thanks so much for your time.
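
For anyone reading this in the archive, here is a rough sketch of what "applying it on the result" could look like. It is not our actual code: the Hit class is a hypothetical stand-in for one search result (the stored entityId plus the category it was found under), and it uses plain Java collections rather than any Lucene API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: dedupe search results by book ID after the search has run.
class UniqueBookCount {

    // Hypothetical stand-in for one search hit.
    static final class Hit {
        final String entityId;
        final String category;

        Hit(String entityId, String category) {
            this.entityId = entityId;
            this.category = category;
        }
    }

    // Unique book IDs in first-seen order, so the result ranking is preserved.
    static Set<String> uniqueIds(List<Hit> hits) {
        Set<String> ids = new LinkedHashSet<>();
        for (Hit hit : hits) {
            ids.add(hit.entityId);
        }
        return ids;
    }

    // Number of distinct books per category; a book listed under two
    // categories is counted once in each of them.
    static Map<String, Integer> countsByCategory(List<Hit> hits) {
        Map<String, Set<String>> idsByCategory = new LinkedHashMap<>();
        for (Hit hit : hits) {
            idsByCategory
                .computeIfAbsent(hit.category, c -> new LinkedHashSet<>())
                .add(hit.entityId);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : idsByCategory.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: b1 and b2 each appear under two categories.
        List<Hit> hits = Arrays.asList(
            new Hit("b1", "fiction"),
            new Hit("b2", "fiction"),
            new Hit("b1", "comics"),
            new Hit("b3", "science"),
            new Hit("b2", "science"));

        System.out.println("raw hits: " + hits.size());
        System.out.println("unique books: " + uniqueIds(hits).size());
        System.out.println("per category: " + countsByCategory(hits));
    }
}
```

With the five made-up hits above, the raw hit count is 5 while the unique book count is 3, which is the difference we were seeing between the per-category counts and the overall count. The trade-off is that every hit has to be read, so this only makes sense when the result set is much smaller than the index.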

-Anisha



Anshum-2 wrote:
> 
> Hi Anish,
> So am I getting something wrong here? You said "I have created a search
> index on book Id , title ,and author from a database of books which fall
> under various categories." so those are 3 fields, right?
> 1. How do you filter the doc types (as in the genres) at search time? Do you
> even need to do that, and if yes, how?
> 2. If you're doing that, I'm assuming you're already indexing the genre
> somehow. Right?
> 3. How about a field for the genre having multi-valued entries (multiple
> field objects going into the same doc with the same field label)? This would
> help you store 1 doc as 1 doc having multiple genres instead of duplicate
> entries.
> 
> I'm still not sure if I've understood the problem correctly, but I hope
> this is of help!
> 
> --
> Anshum Gupta
> Naukri Labs!
> http://ai-cafe.blogspot.com
> 
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
> 
> 
> On Fri, Mar 5, 2010 at 12:07 PM, anisha@ekkitab <anisha@ekkitab.com>
> wrote:
> 
>>
>> Hi Zhangchi
>>
>>
>> Thanks for your reply.
>>
>> We have about 3 million records (different ISBNs) in the database and
>> slightly more documents than that, and we wouldn't want to do the deduping
>> at indexing time, because one book (one ISBN) can be available under 2 or
>> more categories (like fiction, comics & novels, science etc.).
>>
>> We had actually applied the filter on the primary key, i.e. ID, and it
>> wasn't working, so I was hoping for some sample code. But then we found out
>> that the field on which we wanted the duplicate filter to be applied (Id)
>> was not actually indexed when adding it to the document, i.e. Field.Index
>> was set to NO. We changed this, repopulated the documents, and the
>> filtering works now.
>>
>> Thanks for your time.
>>
>>
>>
>>
>> zhangchi wrote:
>> >
>> >
>> > I think you should check the index first, using lukeall to see if there
>> > are duplicate books.
>> >
>> > On Thu, 04 Mar 2010 20:43:26 +0800, anisha@ekkitab <anisha@ekkitab.com>
>> > wrote:
>> >
>> >>
>> >> Hi there, could someone help me with the usage of DuplicateFilter? Here
>> >> is my problem.
>> >>
>> >> I have created a search index on book Id, title, and author from a
>> >> database of books which fall under various categories. Some books fall
>> >> under more than one category. Now, when I issue a search, I get back 'X'
>> >> books matching the search criteria, some of which are repeated, because
>> >> those books are in different documents, and that is the expected
>> >> behaviour.
>> >>
>> >> I use TopFieldDocCollector.getTotalHits() to get the total count, but
>> >> this includes the repeats mentioned above, so it is not the actual
>> >> count. Hence, when I issue a search on title or author, I want to get a
>> >> unique count / list of books. How do I use DuplicateFilter to achieve
>> >> this?
>> >>
>> >> Please help
>> >>
>> >> Regards
>> >> Anish
>> >
>> >
>> > --
>> > Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27790391.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27793771.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

