lucene-java-user mailing list archives

From "Jason Polites" <jason.poli...@gmail.com>
Subject Re: Indexing Documents which has Attachments and are Refered many times!!
Date Sat, 19 Aug 2006 07:07:57 GMT
I think you can still achieve your desired outcome, but I'm not sure I fully
understand the use case.  Can you describe more clearly a specific example
of what you need to achieve?

You are correct that "joins" in Lucene aren't really a strong point, but
this is often a by-product of thinking about Lucene in the same way we think
about a database, which is a mistake (in my opinion).

I often have trouble trying to use Lucene in the same way I use a database
(joined queries etc).  Usually I need to re-think the way I am using Lucene
to avoid needing to do this.
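
As a rough illustration of what re-thinking can look like for the "foo on
yyyy/mm/dd" case quoted below: if each refer is indexed as its own small
entry that carries the ID of its parent document, the word and the date are
checked against the same refer, and no join is needed. This is only a sketch
in plain Java, not the Lucene API. The names (ReferIndexSketch, ReferEntry,
findDocs) are made up for the example, and a simple substring check stands in
for a real analyzed term query.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model: one flat entry per refer instead of one big document.
class ReferIndexSketch {

    /** One refer; docId points back to the document that was referred. */
    static final class ReferEntry {
        final String docId;
        final String referredTo;
        final String description;
        final LocalDate date;

        ReferEntry(String docId, String referredTo, String description, LocalDate date) {
            this.docId = docId;
            this.referredTo = referredTo;
            this.description = description;
            this.date = date;
        }
    }

    private final List<ReferEntry> entries = new ArrayList<>();

    void addRefer(String docId, String referredTo, String description, LocalDate date) {
        entries.add(new ReferEntry(docId, referredTo, description, date));
    }

    /** IDs of documents with at least one refer that has BOTH the word and the date. */
    Set<String> findDocs(String word, LocalDate date) {
        String needle = word.toLowerCase(Locale.ROOT);
        Set<String> docIds = new LinkedHashSet<>();
        for (ReferEntry e : entries) {
            // Substring match is a stand-in for a tokenized term match.
            boolean hasWord = e.description.toLowerCase(Locale.ROOT).contains(needle);
            if (hasWord && e.date.equals(date)) {
                docIds.add(e.docId);
            }
        }
        return docIds;
    }

    static Set<String> demo() {
        ReferIndexSketch idx = new ReferIndexSketch();
        idx.addRefer("123", "Mr X", "please check the foo report", LocalDate.of(2006, 7, 6));
        idx.addRefer("123", "Mr Y", "for your information", LocalDate.of(2006, 8, 1));
        // Document 456: the word is in one refer, the date is on another.
        idx.addRefer("456", "Mr X", "foo budget", LocalDate.of(2006, 7, 1));
        idx.addRefer("456", "Mr Y", "second review", LocalDate.of(2006, 7, 6));
        return idx.findDocs("foo", LocalDate.of(2006, 7, 6));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [123]
    }
}
```

Document 456 is the situation described in the quoted message: one refer has
the word and a different refer has the date. A single concatenated "refers"
field would match it; per-refer entries do not.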

On 8/19/06, Shaghayegh Sahebie <shaghayegh_sahebie@yahoo.com> wrote:
>
> Thanks Jason and Steve;
> Maybe I didn't understand your solution well, but in this system a
> document is referred many times (each refer has a description, which we
> must also index). Each time a document is referred I have to update it in
> the Lucene index, so I have to delete it and then index it again. That
> means deleting and indexing this document many times, and I think indexing
> is time consuming.
> As another question, can a Lucene document have several values for one of
> its fields? I refer a doc many times, so I need many refer fields (I don't
> know how many) in one document. I cannot, e.g., concatenate all the refers
> into one field, because I may have other constraints on a refer and on the
> document. For example: give me the documents which have the word "foo" in
> their refers, where the refer that has this word is dated "yyyy/mm/dd".
> If I concatenate the refers, I may find a document where one refer has the
> word in it but another refer of the same document is on the given date. I
> mean I need to know which refer is on which date, and likewise for the
> other fields of a refer. The same is true for attachments.
> on the date of a refer an
>
> I think the main problem I have is that Lucene cannot handle joins, and I
> think I need joins.
>
> regards
> --Shaghayegh
>
> Steven Rowe <sarowe@syr.edu> wrote:
> As Jason says, you can structure each Lucene document with one Field per
> content type, and index all data that way.  The database is not required.
>
> To address your search complexity concern, you can create queries that
> search only those Field(s) the user wants -- there is no need to have a
> Field for each possible combination of content type.
>
> Steve
>
> Jason Polites wrote:
> > Maybe I'm not understanding your requirement, but this should be fairly
> > simple in Lucene.
> >
> > Each document in your document management system would be represented by
> > a single Lucene document in the index.  Each Lucene document will then
> > have several fields, each field representing the values of the "meta data"
> > associated with your document in the document management system.
> >
> > For example:
> >
> > Let's say you have a document which has the following structure:
> >
> > Title: Sample Document
> > Date: 01/01/2006
> > Attachment: Some attachment (this is the content of the attachment)
> > Attachment: Some other attachment
> > Refers: Mr X
> > Refers: Mr Y
> >
> > Your Lucene document would have the same structure.  All of these items
> > in your "real" document would simply be Fields in the Lucene document.
> >
> > In the case of your attachments, you could also consider indexing them
> > separately (as well).  This way users could search for attachments
> > without needing the documents to which they are attached.
> >
> > If your attachments ARE documents (that is, you just have a foreign key
> > style relationship between two documents), then you would simply index
> > each "real" document as a separate document in Lucene and add some sort
> > of reference field which contains the IDs of all related documents.
> >
> > For Example:
> > -------------------
> > ID: 123
> > Title: Sample Document
> > Date: 01/01/2006
> > Attachment: 456
> > Attachment: 789
> > Refers: Mr X
> > Refers: Mr Y
> > --------------------
> > ID: 456
> > Title: Other Document
> > Date:.. etc etc...
> >
> > The one thing to be mindful of is re-indexing existing documents.  If
> > you have a document that is already indexed and you want to make a change
> > (e.g. you want to add a new "refers" value), then you need to re-index
> > the entire document.  This means you need to either "store" all fields
> > you want to keep during re-indexing (which is typically all of them), or
> > you need to re-index the document from its source.  Storing all the data
> > in the index can have adverse effects on the performance of the index,
> > however. (Hope this makes sense.)
> >
> >
> >
> > On 8/12/06, Shaghayegh Sahebie  wrote:
> >>
> >> Hi all;
> >> We have a document management system and we want to build a search on
> >> it. We have three kinds of content in our system: Refers, Documents and
> >> Attachments. A document can have multiple attachments and can be
> >> referred to many users.
> >> Our users want to be able to search on documents, attachments and
> >> refers. For example, they want to search for the Documents which were
> >> created on "2006/07/06", have the word "Lucene" in them or in their
> >> Refers, and are referred to Mr.x.
> >> Our users want to be able to search in all 8 possible selections of
> >> Document, Refer and Attachment; I mean they want to be able to search
> >> just in Refers, in both Refers and Documents, ...
> >> How can we handle it?
> >> I thought of storing the different kinds of Docs in a DB, searching the
> >> DB first, and then searching in Lucene based on the DB results and the
> >> phrases given to search (handling the Document, Refer or Attachment
> >> parts in a DB search). But the DB results may be very big, and I don't
> >> know if a Lucene query can have that many search Terms.
> >> Another way is to index each document, refer and attachment in the
> >> index 8 times (all the possible selections of Refer, Document and
> >> Attachment), but this way has lots of redundancy, even more than 8
> >> times, 'cause each Document is indexed "8 * Refer number of Document"
> >> times.
> >> I really don't know what to do. Any suggestions, please?
> >>
> >> Thanks in advance
>
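
The re-indexing caveat in the earlier quoted reply (to add one "refers" value
you drop the old entry and add the full document again, rebuilt from its
stored fields) can be sketched roughly as follows. This is plain Java that
only models the idea; it is not Lucene code, and ReindexSketch, addField,
reindex and addRefer are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an index whose entries can only be replaced whole.
class ReindexSketch {

    // id -> (field name -> values); a field such as "refers" may hold many values.
    private final Map<String, Map<String, List<String>>> index = new HashMap<>();

    static void addField(Map<String, List<String>> doc, String name, String value) {
        doc.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replace the whole entry: delete the old copy, then add the new one. */
    void reindex(String id, Map<String, List<String>> doc) {
        index.remove(id);
        index.put(id, doc);
    }

    /** Add one "refers" value by rebuilding the document from its stored fields. */
    void addRefer(String id, String referredTo) {
        Map<String, List<String>> old = index.get(id);
        if (old == null) {
            throw new IllegalArgumentException("unknown id: " + id);
        }
        Map<String, List<String>> copy = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> f : old.entrySet()) {
            copy.put(f.getKey(), new ArrayList<>(f.getValue()));
        }
        addField(copy, "refers", referredTo);
        reindex(id, copy);
    }

    List<String> values(String id, String field) {
        Map<String, List<String>> doc = index.get(id);
        if (doc == null) {
            return Collections.emptyList();
        }
        return doc.getOrDefault(field, Collections.<String>emptyList());
    }

    static List<String> demo() {
        ReindexSketch idx = new ReindexSketch();
        Map<String, List<String>> doc = new LinkedHashMap<>();
        addField(doc, "title", "Sample Document");
        addField(doc, "date", "01/01/2006");
        addField(doc, "attachment", "456");
        addField(doc, "attachment", "789");
        addField(doc, "refers", "Mr X");
        addField(doc, "refers", "Mr Y");
        idx.reindex("123", doc);
        idx.addRefer("123", "Mr Z"); // one new refer, full re-index of doc 123
        return idx.values("123", "refers");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Mr X, Mr Y, Mr Z]
    }
}
```

The point of the sketch is the cost model: addRefer touches every field of
the document, which is why all fields either have to be stored or have to be
re-read from the source system.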
