lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris D <bro...@gmail.com>
Subject Indexing document instances and retrieving instance attributes
Date Wed, 10 Aug 2005 17:54:37 GMT
I'm adding files to an index over time, so after some time I'm likely
to see the same file more than once. I would like to be able to search
for the information about that particular instance of the file
(Filename, date etc) For instance I index File1 and then File2 (which
are identical) at different times I want to be able to search for the
contents and retrieve all the Filenames and MIME.

The first way I did it was to add a seperate doc for every instance as follows

DOC   1
FILEID 123
MIME   test/html
CONTENT   blam blam blam etc.

DOC   2
FILEID 123
FILENAME  File1
DATE   090909

DOC   3
FILEID 123
FILENAME  File2
DATE   101010
AUTH   Jim Jones

The problem with this was that if the user needed all of the Filenames
that are associated with content:blam I would have to search for
fileID:123 to retrieve them. This gets slow with several thousand hits
because I have to do a search for every hit.

I solved that by using multiple fields of the same name.

DOC   1
FILEID 123
MIME   test/html
CONTENT   blam blam blam etc.
FILENAME  File1
DATE   090909
FILENAME  File2
DATE   101010
AUTH   Jim Jones

But now I have a problem where I can't retrieve specific information
about an instance of the file. I tried using getFields(String) but if
I wanted the author for instance 2 I have a problem, it should be Jim
jones but in the index it looks like he's the auther for instance 1.

One solution I see would be to fill all of the fields for each
instance with empty strings, but that seems like a bit of a hack.

Another that fell appart fairly quickly was to have a reference table.

DOCID 				 1
FILEID 				   123abd321
MIME/TYPE			text/html
INSTANCE 			uri1 collectiondate1
URI1 				    http://blam.com/
COLLECTIONDATE1		12355
INSTANCE			uri2 collectiondate2 author2
URI2				     http://google.ca/
COLLECTIONDATE2		12356
AUTHOR2				Jim Brown

Now I can't search for URI without having to search for URI1:foo + URI2:foo ...

How can I make specific attributes of an instance of the file
searchable without having to do a search for every hit?

Thanks,
Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message