lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Polites" <jason.poli...@gmail.com>
Subject Re: Indexing Documents which has Attachments and are Refered many times!!
Date Sat, 12 Aug 2006 16:02:33 GMT
Maybe I'm not understanding your requirement, but this should be fairly
simple in Lucene.

Each document in your document management system would be represented by a
single Lucene document in the index.  Each lucene document will then have
several fields, each field representing the values of the "meta data"
associated with your document in the document management system.

For example:

Lets say you have a document which has the following structure:

Title: Sample Document
Date: 01/01/2006
Attachment: Some attachment (this is the content of the attachment)
Attachment: Some other attachment
Refers: Mr X
Refers: Mr Y

Your Lucene document would have the same structure.  All of these items in
your "real" document would simply be Fields in the lucene document.

In the case of your attachments, you could also consider indexing them
separately (as well).  This way users could search for attachments without
needing the documents to which they are attached.

If your attachments ARE documents (that is, your just have a foreign key
style relationship between two documents), then you would simple index each
"real" document as a separate document in lucene and add some sort of
reference field which contains the ID of all related documents.

For Example:
-------------------
ID: 123
Title: Sample Document
Date: 01/01/2006
Attachment: 456
Attachment: 789
Refers: Mr X
Refers: Mr Y
--------------------
ID: 456
Title: Other Document
Date:.. etc etc...

The one thing to be mindful of is re-indexing existing documents.  If you
have document that is already indexed and you want to make a change (eg you
want to add a new "refers" value), then you need to re-index the entire
document.  This means you need to either "store" all fields you want to keep
during re-indexing (which is typically all of them), or you need to re-index
the document from its source. Storing all the data in the index can have
adverse effects on the performance of the index however. (hope this makes
sense).



On 8/12/06, Shaghayegh Sahebie <shaghayegh_sahebie@yahoo.com> wrote:
>
> Hi all;
> We have got a Document management system and we want to build a search on
> it. We have tree kind of content in our system: Refers, Documents and
> Attachments. A document can have multiple attachments and can be Refered to
> many users.
> Our users want to be able to search on documents attachments and refers.
> for example they want to search the Documents which are created at
> "2006/07/06" date and have the word "Lucene" in it or their Refers and are
> Refered to Mr.x.
> Our users want to be ale to search in all 8 possible selections of
> Document, Refer and Attachment, I mean they want to be able to search just
> in Refers, in both Refers and Documents, ...
> How can we handle it?
> I thaught to store diferent kinds of Docs in a DB, search in the DB at
> first and search in Lucene based on DB results and phrases given to search
> (Handling Document, Refer or Attachments parts in a DB search). But the DB
> results maybe so big and i don't know if a Lucene query can have these much
> of search Terms.
> Another way is to Index each document, refer and attachment in the index 8
> times(all the possible selections of Refer, Document and Attachment) but
> this way has lots of redundancy even more than 8 times! 'cause each Document
> is indexed "8 * Refer number of Document" times.
> I really don't know what to do, Any suggestions Please?
>
> Thanks in advance
>
>
> ---------------------------------
> Do you Yahoo!?
> Everyone is raving about the  all-new Yahoo! Mail Beta.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message