lucene-java-user mailing list archives

From Shaghayegh Sahebie
Subject Re: Indexing Documents which has Attachments and are Refered many times!!
Date Sat, 19 Aug 2006 05:48:30 GMT
Thanks Jason and Steve;
Maybe I didn't understand your solution well, but in this system a document is referred many
times (each refer has a description, which we should index as well). Each time a document
is referred I have to update it in the Lucene index, which means deleting it and then indexing
it again. So the same document gets deleted and re-indexed many times, and I think indexing is
time consuming.
Another question: can a Lucene document have several values for one of its fields? I refer
a doc many times, so I need many refer fields (I don't know how many in advance) in one document.
I cannot, e.g., concatenate all the refers into one field, because I may have other constraints on
a refer and a document as well. For example: give me the documents which have the word "foo" in one
of their refers, where that same refer is dated "yyyy/mm/dd". If I concatenate the refers, I
may find a document where one refer has the word in it but a different refer of that document is on
the given date. I mean I need to know which refer is on which date, and likewise for the other fields
of a refer. The same is true for attachments.

I think the main problem is that Lucene cannot handle joins, and I think I need joins.
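
The false match described above can be shown with a small in-memory model. This is plain Python, not Lucene API, and the record layout and the names `doc_id`, `text`, `date`, `flattened_search` and `per_refer_search` are assumptions made up for illustration. It contrasts one merged record per document with one record per refer that carries the parent document's ID, which is one possible way to approximate a join.

```python
# Plain-Python model of the problem; nothing here is Lucene API.
# Field names (doc_id, text, date) are hypothetical.

refers = [
    {"doc_id": 1, "text": "please review foo", "date": "2006/07/01"},
    {"doc_id": 1, "text": "forwarded to Mr X", "date": "2006/07/06"},
    {"doc_id": 2, "text": "foo is approved", "date": "2006/07/06"},
]


def flattened_search(refers, word, date):
    """One record per document: all refer texts and dates are merged,
    so the word and the date may come from different refers."""
    merged = {}
    for r in refers:
        entry = merged.setdefault(r["doc_id"], {"words": set(), "dates": set()})
        entry["words"].update(r["text"].split())
        entry["dates"].add(r["date"])
    return sorted(d for d, e in merged.items()
                  if word in e["words"] and date in e["dates"])


def per_refer_search(refers, word, date):
    """One record per refer: both conditions must hold on the same refer,
    then the parent document IDs are collected."""
    return sorted({r["doc_id"] for r in refers
                   if word in r["text"].split() and r["date"] == date})


print(flattened_search(refers, "foo", "2006/07/06"))  # [1, 2]: doc 1 is a false match
print(per_refer_search(refers, "foo", "2006/07/06"))  # [2]
```

In a layout like the second one, a new refer would mean adding one small record instead of re-indexing the parent document. Whether that trade-off fits depends on how the results need to be combined with conditions on the document itself.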


Steven Rowe wrote:
As Jason says, you can structure each Lucene document with one Field per
content type, and index all data that way.  The database is not required.

To address your search complexity concern, you can create queries that
search only those Field(s) the user wants -- there is no need to have a
Field for each possible combination of content type.
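
As a rough sketch of this point: the choice of content types can be applied at query time by picking which fields to look in, so one indexed record per document serves every combination. This is a plain-Python model, not Lucene API; the field names `document`, `refer`, `attachment` and the function `search` are hypothetical.

```python
# Plain-Python model; not Lucene API. Field names are hypothetical.

index = [
    {"id": 123, "document": "quarterly report", "refer": "see lucene notes",
     "attachment": "budget sheet"},
    {"id": 456, "document": "lucene evaluation", "refer": "for Mr X",
     "attachment": "benchmark results"},
]


def search(index, word, fields):
    """Return IDs of records where `word` occurs in any of the chosen fields."""
    return [rec["id"] for rec in index
            if any(word in rec[f].split() for f in fields)]


print(search(index, "lucene", ["refer"]))              # [123]
print(search(index, "lucene", ["document", "refer"]))  # [123, 456]
print(search(index, "lucene", ["attachment"]))         # []
```

Each record is stored once; only the `fields` argument changes between the searches.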


Jason Polites wrote:
> Maybe I'm not understanding your requirement, but this should be fairly
> simple in Lucene.
> Each document in your document management system would be represented by a
> single Lucene document in the index.  Each lucene document will then have
> several fields, each field representing the values of the "meta data"
> associated with your document in the document management system.
> For example:
> Let's say you have a document which has the following structure:
> Title: Sample Document
> Date: 01/01/2006
> Attachment: Some attachment (this is the content of the attachment)
> Attachment: Some other attachment
> Refers: Mr X
> Refers: Mr Y
> Your Lucene document would have the same structure.  All of these items in
> your "real" document would simply be Fields in the lucene document.
> In the case of your attachments, you could also consider indexing them
> separately (as well).  This way users could search for attachments without
> needing the documents to which they are attached.
> If your attachments ARE documents (that is, you just have a foreign key
> style relationship between two documents), then you would simply index each
> "real" document as a separate document in Lucene and add some sort of
> reference field which contains the IDs of all related documents.
> For Example:
> -------------------
> ID: 123
> Title: Sample Document
> Date: 01/01/2006
> Attachment: 456
> Attachment: 789
> Refers: Mr X
> Refers: Mr Y
> --------------------
> ID: 456
> Title: Other Document
> Date:.. etc etc...
> The one thing to be mindful of is re-indexing existing documents.  If you
> have a document that is already indexed and you want to make a change (e.g. you
> want to add a new "refers" value), then you need to re-index the entire
> document.  This means you need to either "store" all fields you want to keep
> during re-indexing (which is typically all of them), or you need to re-index
> the document from its source. Storing all the data in the index can have
> adverse effects on the performance of the index, however. (Hope this makes
> sense.)
> On 8/12/06, Shaghayegh Sahebie  wrote:
>> Hi all;
>> We have a document management system and we want to build a search on
>> it. We have three kinds of content in our system: Refers, Documents and
>> Attachments. A document can have multiple attachments and can be referred to
>> many users.
>> Our users want to be able to search on documents, attachments and refers.
>> For example, they want to search for the Documents which were created on
>> "2006/07/06", have the word "Lucene" in them or in their Refers, and are
>> referred to Mr. X.
>> Our users want to be able to search in all 8 possible selections of
>> Document, Refer and Attachment. I mean they want to be able to search just
>> in Refers, in both Refers and Documents, ...
>> How can we handle it?
>> I thought of storing the different kinds of docs in a DB, searching the DB
>> first, and then searching in Lucene based on the DB results and the phrases
>> given to search (handling the Document, Refer or Attachment parts in a DB
>> search). But the DB results may be very big, and I don't know if a Lucene
>> query can have that many search terms.
>> Another way is to index each document, refer and attachment in the index 8
>> times (all the possible selections of Refer, Document and Attachment), but
>> this way has lots of redundancy, even more than 8 times, because each Document
>> is indexed "8 * Refer number of Document" times.
>> I really don't know what to do. Any suggestions, please?
>> Thanks in advance
