lucene-java-user mailing list archives

From jt oob <jt2...@yahoo.co.uk>
Subject Multiple document types, same search engine
Date Thu, 17 Jun 2004 14:02:31 GMT
Hi,
I was wondering how other Lucene users handle making multiple
document types available for searching.

I am initially concentrating on newsgroups, and so was planning on
converting all news postings into XML. There would be one document
field for each header field in the posting, one document field for the
whole header, and one document field for the body text.

I would also like to add HTML to the indexes, and later other
document types.

To do this I think I need to identify things which will be common
across all (or most) document types, such as "author" and "topic".
For a news posting I would then have to map the "From" and "Subject"
fields over to "author" and "topic", whereas for HTML I would map
over the "built by" (or similar) string, if it exists, and perhaps the
<TITLE>.
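
A minimal sketch of this mapping in plain Java, with no Lucene calls.
The class name, the "news" and "html" type keys, and the method name
are all made up for illustration; in a real indexer each entry of the
returned map would become one field on the Lucene document. This
version keeps the original fields and adds the generic ones next to
them.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: copies type-specific fields into generic ones.
class FieldMapper {
    // Per document type: source field name -> generic field name.
    // The type keys "news" and "html" are assumptions for this sketch.
    private static final Map<String, Map<String, String>> MAPPINGS = new LinkedHashMap<>();
    static {
        Map<String, String> news = new LinkedHashMap<>();
        news.put("From", "author");
        news.put("Subject", "topic");
        MAPPINGS.put("news", news);

        Map<String, String> html = new LinkedHashMap<>();
        html.put("built by", "author");
        html.put("TITLE", "topic");
        MAPPINGS.put("html", html);
    }

    // Returns a copy of the fields with generic fields added where a
    // mapped source field is present. Original fields are kept as-is.
    static Map<String, String> withGenericFields(String docType, Map<String, String> fields) {
        Map<String, String> result = new LinkedHashMap<>(fields);
        Map<String, String> mapping =
                MAPPINGS.getOrDefault(docType, Collections.<String, String>emptyMap());
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            String value = fields.get(e.getKey());
            if (value != null) {
                // Do not overwrite a generic field the document already has.
                result.putIfAbsent(e.getValue(), value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> posting = new LinkedHashMap<>();
        posting.put("From", "jt");
        posting.put("Subject", "Multiple document types");
        System.out.println(withGenericFields("news", posting));
    }
}
```

An HTML page with no "built by" string simply ends up without an
"author" field, so a search on "author" would not match it.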

My aim is to give users an advanced search capability over multiple
document types. I'm not sure if I am looking at the problem the correct
way, or, if I am, where I should do the mapping from document-specific
fields such as "From" to my generic ones such as "author". I could
duplicate the data at index time, so a news posting would have both
"author" and "From" fields, or I could build it into my query parsing,
so that when the user enters the query "topic: jt" it gets converted to
"Subject: jt | Title: jt" to match both the news and the HTML.
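
The query-side option could be sketched roughly as below: plain Java
string rewriting applied before the query is handed to the query
parser. The class and method names are invented for this sketch, it
only handles single-word "field: term" clauses, and "BuiltBy" is a
placeholder name for whatever field the HTML "built by" string would be
stored in. The output uses "OR" between field:term clauses in place of
the "|" written above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processor: rewrites generic fields in a query string.
class QueryExpander {
    // Generic field name -> type-specific fields to search instead.
    private static final Map<String, List<String>> GENERIC_TO_SPECIFIC = new HashMap<>();
    static {
        GENERIC_TO_SPECIFIC.put("topic", Arrays.asList("Subject", "Title"));
        // "BuiltBy" is an assumed field name, not one from the post.
        GENERIC_TO_SPECIFIC.put("author", Arrays.asList("From", "BuiltBy"));
    }

    // Matches "field:term" or "field: term" where term is a single word.
    private static final Pattern CLAUSE = Pattern.compile("(\\w+):\\s*(\\w+)");

    static String expand(String query) {
        Matcher m = CLAUSE.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String field = m.group(1);
            String term = m.group(2);
            List<String> targets = GENERIC_TO_SPECIFIC.get(field);
            String replacement;
            if (targets == null) {
                // Not a generic field: keep the clause, minus the space.
                replacement = field + ":" + term;
            } else {
                StringJoiner clause = new StringJoiner(" OR ", "(", ")");
                for (String target : targets) {
                    clause.add(target + ":" + term);
                }
                replacement = clause.toString();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("topic: jt"));
    }
}
```

With this approach the index stays smaller, since nothing is
duplicated, but every new document type means another entry in the
expansion table and a longer rewritten query.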

Any comments would be appreciated,

Thanks for reading this far!
jt


	
	
		