lucene-java-user mailing list archives

From "Kyriakos Ktorides" <k.ktori...@varkoume.com>
Subject Web search engine size optimisation problems..
Date Thu, 10 Oct 2002 02:35:45 GMT
Hello, 

I've been trying for a while to build a web search engine that spiders a
small number of websites (around 1000 of them). Before considering
Lucene I used a DBMS and tried crawling a site while storing every
keyword from the HTML files (after filtering out stopwords etc.).
Unfortunately this simplistic approach produced huge amounts of data,
which made the whole project impractical. A friend then suggested
Lucene because it stores indexes of this kind more efficiently.

Most websites nowadays are generated dynamically from templates, so
much of the page content repeats from page to page. The same words get
re-added to the index, making it larger without adding any useful
information. My idea was to find, approximately, which keywords stay
the same across a site and index them only once, in a document I call
the "base". Every page from the same website is then compared to the
base document, and only the differences are stored as a separate
document with a field containing the "link" to the base document. This
works as expected, i.e. it substantially decreases the index size, but
it introduces another problem: how do I search?
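
To make the base/difference scheme concrete, here is a minimal sketch in
plain Java collections. It does not use the Lucene API. The class and
method names (BaseDeltaSketch, baseTerms, deltas) are made up for
illustration, and it assumes the stricter rule that the base holds only
terms found on every page of the site, rather than the approximate
detection described above.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class BaseDeltaSketch {

    /** Terms that occur on every page of one site: the "base" document. */
    static Set<String> baseTerms(Map<String, Set<String>> pages) {
        Set<String> base = null;
        for (Set<String> terms : pages.values()) {
            if (base == null) {
                base = new TreeSet<>(terms);   // start from the first page
            } else {
                base.retainAll(terms);         // keep only terms shared so far
            }
        }
        return base == null ? new TreeSet<>() : base;
    }

    /** Per-page difference: the page's terms minus the base terms. */
    static Map<String, Set<String>> deltas(Map<String, Set<String>> pages,
                                           Set<String> base) {
        Map<String, Set<String>> out = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : pages.entrySet()) {
            Set<String> delta = new TreeSet<>(e.getValue());
            delta.removeAll(base);
            out.put(e.getKey(), delta);        // empty deltas are kept too
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical pages of one site, keyed by URL path.
        Map<String, Set<String>> pages = new TreeMap<>();
        pages.put("/a", new TreeSet<>(Arrays.asList("home", "contact", "test")));
        pages.put("/b", new TreeSet<>(Arrays.asList("home", "contact", "news")));

        Set<String> base = baseTerms(pages);
        System.out.println("base   = " + base);
        System.out.println("deltas = " + deltas(pages, base));
    }
}
```

Each difference set would become one small document carrying a link to
its base, so template words are stored once per site instead of once
per page.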

Say I want to run a query with two terms combined with the AND
operator, for example "home" AND "test". Suppose that "home" is in the
base document and "test" appears in a couple of documents of the same
website but does not exist in the base document. The correct result is
those two documents. How do I get Lucene to do this for me?
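
The behaviour being asked for can be sketched outside Lucene as follows.
This is only an illustration of the intended semantics in plain Java
collections, not a Lucene query. The names (BaseAwareAndSearch,
searchAnd) are hypothetical, and it assumes every base term really is
present on every page of the site and that every page has a difference
entry, even if empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class BaseAwareAndSearch {

    /**
     * Returns the pages of one site that match ALL query terms, where a
     * term counts as present on a page if it is in the site's base or in
     * that page's difference document.
     */
    static Set<String> searchAnd(Set<String> base,
                                 Map<String, Set<String>> deltas,
                                 List<String> query) {
        // Terms found in the base hold for every page, so drop them.
        List<String> remaining = new ArrayList<>();
        for (String term : query) {
            if (!base.contains(term)) {
                remaining.add(term);
            }
        }
        // A page matches if its difference document covers what is left.
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : deltas.entrySet()) {
            if (e.getValue().containsAll(remaining)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical site: "home" is a template word, "test" is not.
        Set<String> base = new TreeSet<>(Arrays.asList("home"));
        Map<String, Set<String>> deltas = new TreeMap<>();
        deltas.put("/a", new TreeSet<>(Arrays.asList("test")));
        deltas.put("/b", new TreeSet<>(Arrays.asList("test", "news")));
        deltas.put("/c", new TreeSet<>(Arrays.asList("other")));

        System.out.println(searchAnd(base, deltas, Arrays.asList("home", "test")));
    }
}
```

Note that if the base is only approximate (a term is on most pages but
not all), this rule would report pages that do not actually contain the
term, so the difference documents would also need to record which base
terms a page lacks.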

I have no previous experience with search engine programming, so I
might be doing it all wrong; I'd be glad if anyone could point me in
the right direction. I look forward to your suggestions or comments.

Thanks in advance,

Kyriakos Ktorides



