lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dr. John Takacs" <taka...@highway61.com>
Subject RE: New Lucene-powered Website
Date Fri, 28 Nov 2003 11:30:12 GMT
Ulrich,

Well done!

I too would love to know how you implemented the summarizer.  If you are
unable to provide the details, would you be able to steer a person in the
right direction?  I've experimented with a few applications that will do it,
some my own, some found via searches, but none are as clear cut and
professional as yours (i.e. most were simply grabbing the first 200 or so
characters of a page....etc etc).

Regards,

John

-----Original Message-----
From: news [mailto:news@sea.gmane.org]On Behalf Of Ulrich Mayring
Sent: Thursday, November 27, 2003 8:30 PM
To: lucene-user@jakarta.apache.org
Subject: New Lucene-powered Website


Hello,

we (DENIC) are the world's second largest domain registry (.de-zone has
almost 6.9 million domains) and are using Lucene to index and search our
website in a high-traffic scenario. Most of our web pages are available
in English in addition to our native language German. If you want to try
our Lucene-based search engine, please start here:

http://www.denic.de/en/special/index.jsp

Use the input field on the page to search our website. Don't use the
input field at the top right, that is only for searching domains in our
domain database, it has nothing to do with Lucene.

The indexes for German and English are seperate, so you should find only
English pages from that page.

A somewhat interesting feature is the summarizer, on the results page
you'll get a short summary of the page. These are not hand-written
blurbs, rather they are generated automatically from the HTML pages at
indexing time. I'd be especially interested in improvement suggestions
in this area.

Naturally, the automatically generated texts don't have the same quality
as hand-written ones. But they're better than nothing and in my eyes
more useful than Google-style excerpts. How many times has it happened
to you that the Google excerpt doesn't really tell you anything, because
it's totally out of context? Summaries tell you what the whole page is
about, irregardless of the context within which your search terms may
appear. After reading the summary you should (hopefully) be able to
decide whether the page contains the info you're looking for. Comments
welcome!

We're using the snowball stemmers/analyzers for German and English,
custom stopword lists and the HTML parser from the Sourceforge
htmlparser project. Apart from that it's vanilla Lucene.

cheers,

Ulrich



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message