lucene-java-user mailing list archives

From Dror Matalon <d...@zapatec.com>
Subject Re: merged search of document
Date Wed, 07 Jan 2004 20:57:30 GMT

I can see the problem, but I'm not sure it's something Lucene should
provide. I guess you can try post-processing the Lucene results. For AND
and OR operations it should be quite easy: if any page of a book gets a
hit for a term, the whole book contains that term. The hard part will be
handling "NOT" operations. It seems you'd have to actually do a '+'
search for the excluded term and then rule out every book that contains
it on any page. Yuck.
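
A minimal sketch of that post-processing, assuming you run one search per
term and collect the DocID field of every page that hit. The class and
method names (BookLevelSearch, books, and, or, not) are made up for
illustration and are not part of Lucene; the input lists stand in for
whatever you read from your hits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Lucene class: combines page-level hits per book.
public class BookLevelSearch {

    // Collapse the DocIDs of all hit pages for one term into a set of books.
    public static Set<String> books(List<String> hitPageDocIds) {
        return new TreeSet<>(hitPageDocIds);
    }

    // AND: the book has a hit for both terms, possibly on different pages.
    public static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR: the book has a hit for at least one of the terms.
    public static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT: drop every book in which any page contains the excluded term.
    public static Set<String> not(Set<String> a, Set<String> excluded) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Data from Thomas's example: the DocID of each page matching a term.
        Set<String> foo = books(Arrays.asList("XYZ", "ABC", "GHJ"));
        Set<String> bar = books(Arrays.asList("XYZ", "ABC"));

        System.out.println("+foo +bar -> " + and(foo, bar)); // [ABC, XYZ]
        System.out.println("foo bar   -> " + or(foo, bar));  // [ABC, GHJ, XYZ]
        System.out.println("foo -bar  -> " + not(foo, bar)); // [GHJ]
    }
}
```

Note that the NOT case needs the separate positive search for "bar"
first, which is exactly the ugly part.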

On Wed, Jan 07, 2004 at 09:16:16PM +0100, Thomas Scheffler wrote:
> On Wed, 07 Jan 2004 at 20:10, Dror Matalon wrote:
> > On Wed, Jan 07, 2004 at 07:58:52PM +0100, Thomas Scheffler wrote:
> > > On Wed, 07 Jan 2004 at 19:00, Dror Matalon wrote:
> > > > The solution is simple, but you need to think of it conceptually in a
> > > > different way. Instead of "all documents with the same DocID are the same
> > > > document" think "fetch all the documents where DocID is XYZ."
> > > > 
> > > > Assuming the contents are in a field called contents
> > > > you do 
> > > > +(DocID:XYZ) (contents:foo) (contents:bar)
> > > 
> > > I was already on that track, but think of a search like (foo -bar). With
> > > your solution it will produce a hit, because page 345 (to keep my
> > > example) contains the word "foo" and no "bar". Of course, in my model I
> > > want the book not to get a hit for that query. You see how hard it is
> > > to handle, don't you?
> > 
> > I think I'm starting to understand. So you want to treat several
> > documents as one, and if the match fails for one of the documents, it
> > should fail for all the documents with the same id. OK. This raises the
> > question: why don't you make all these documents with the same id one
> > document, and index them together?
> 
> This would be a functional but not a nice solution. The "pages" are sent
> to my Java class. I cannot change this because it is an API-related
> restriction. To index 1000 pages I have to index the first one; when I
> get the second one I need to fetch the first page again, bind both
> together and send them to the IndexWriter. I must keep track of every
> single page the "book" contains. This procedure is repeated for every
> page and gets uglier as the number of pages grows. Furthermore, my "book"
> allows single pages to be deleted or updated. Every time such an atomic
> task (adding/deleting) is performed, the index for the whole "book" must
> be rebuilt. The mechanism that converts a "page" to a Lucene document is
> very time-consuming, so I want to do that as rarely as possible. As you
> see, it would be great if Lucene could somehow treat a "logical document"
> (consisting of several Lucene documents) like a normal Lucene document.
> 
> > 
> > > 
> > > > 
> > > > For that matter, you can use a standard analyzer on the query and use a
> > > > boolean to tie it to the specific document set.
> > > > 
> > > > This is how we do searching on a specific channel at fastbuzz.com.
> > > > 
> > > > Dror
> > > > 
> > > > 
> > > > On Wed, Jan 07, 2004 at 05:21:43PM +0100, Thomas Scheffler wrote:
> > > > > 
> > > > > Jamie Stallwood said:
> > > > > > +(DocID:XYZ DocID:ABC) +(foo bar)
> > > > > >
> > > > > > will find a document that (MUST have (xyz OR abc)) AND (MUST have
> > > > > > (foo OR bar)).
> > > > > 
> > > > > This is just the solution for the example; in the real world I really
> > > > > don't have any documents containing "foo" or "bar". What I meant was:
> > > > > make Lucene think that all Documents with the same DocID are ONE
> > > > > Document. Imagine you have a big book, say 1000 pages. Instead of
> > > > > putting the whole book in the index, you split it up into single pages
> > > > > and index them. Now, if a page changes or is deleted, it's faster to
> > > > > update your index than to redo it over and over again for all 1000
> > > > > pages. So your problem starts when you're searching on the book. You
> > > > > search for (foo bar); foo is on page 345 while bar is on page 435. You
> > > > > want to get a hit for the book. So I need a solution matching this
> > > > > more generic example.
> > > > > 
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Thomas Scheffler [mailto:thomas.scheffler@uni-jena.de]
> > > > > > Sent: 07 January 2004 11:23
> > > > > > To: lucene-user@jakarta.apache.org
> > > > > > Subject: merged search of document
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I need a tip for implementation. I have several documents, all of
> > > > > > them with a field named DocID. DocID identifies not a single Lucene
> > > > > > Document but a collection of them. When I start a search, it should
> > > > > > handle the search as if these Lucene documents were one.
> > > > > >
> > > > > > example:
> > > > > >
> > > > > > Document 1: DocID:XYZ
> > > > > >
> > > > > > containing: foo
> > > > > >
> > > > > > Document 2: DocID:XYZ
> > > > > >
> > > > > > containing: bar
> > > > > >
> > > > > > Document 3: DocID:ABC
> > > > > >
> > > > > > containing: foo bar
> > > > > >
> > > > > > Document 4: DocID:GHJ
> > > > > >
> > > > > > containing: foo
> > > > > >
> > > > > > As you already guessed, when I'm searching for "+foo +bar" I want
> > > > > > the hits to contain Document 1, Document 2 and Document 3, not
> > > > > > Document 4. Is it clear what I want? How do I implement such a
> > > > > > monster? Is that possible with Lucene? The content is not stored
> > > > > > within Lucene; it's just tokenized and indexed.
> > > > > >
> > > > > > Any help?
> > > > > >
> > > > > > Thanks in advance!
> > > > > >
> > > > > > Thomas Scheffler
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > > > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > > > > >
> > > > > 
> > > > > 
> > > > > -- 
> > > > > 
> > > > > 
> > > > > 
> > > --
> > > Computer science terms - simply explained
> > > =========================================
> > > No. 37 -- Fault-tolerant:
> > > 
> > > The program does not allow any user input.
> > > 
> --
> Computer science terms - simply explained
> =========================================
> No. 385 -- integrates into existing structures:
> 
> Microsoft Passport account required (Henryk Plötz)
> 



-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


