lucene-java-user mailing list archives

From Ulrich Mayring <>
Subject New Lucene-powered Website
Date Thu, 27 Nov 2003 11:29:53 GMT

We (DENIC) are the world's second largest domain registry (the .de zone
has almost 6.9 million domains) and use Lucene to index and search our
website in a high-traffic scenario. Most of our web pages are available
in English in addition to our native language, German. If you want to
try our Lucene-based search engine, please start here:

Use the input field on the page to search our website. Don't use the 
input field at the top right; that one only searches our domain 
database and has nothing to do with Lucene.

The indexes for German and English are separate, so you should find only 
English pages from that page.

A somewhat interesting feature is the summarizer: on the results page 
you'll get a short summary of each page. These are not hand-written 
blurbs; they are generated automatically from the HTML pages at 
indexing time. I'd be especially interested in suggestions for 
improvement in this area.

Naturally, the automatically generated texts don't have the same quality 
as hand-written ones. But they're better than nothing and in my eyes 
more useful than Google-style excerpts. How many times has it happened 
to you that the Google excerpt doesn't really tell you anything, because 
it's totally out of context? Summaries tell you what the whole page is 
about, regardless of the context within which your search terms may 
appear. After reading the summary you should (hopefully) be able to 
decide whether the page contains the info you're looking for. Comments 
are welcome.
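
The post does not say how the summaries are produced, so the following 
is only a minimal sketch of one common approach, assuming a 
frequency-based extractive summarizer: each sentence is scored by the 
average page-wide frequency of its non-stopword terms, and the best 
sentences are returned in their original order. The class name 
PageSummarizer, its methods, and the scoring rule are all hypothetical 
and are not DENIC's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical extractive summarizer; an illustration, not DENIC's code. */
class PageSummarizer {

    /** Returns up to maxSentences of the highest-scoring sentences, in document order. */
    static String summarize(String text, Set<String> stopwords, int maxSentences) {
        // Naive sentence split: break at whitespace that follows '.', '!' or '?'.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Count how often each non-stopword term occurs on the whole page.
        Map<String, Integer> freq = new HashMap<>();
        for (String sentence : sentences) {
            for (String term : tokens(sentence, stopwords)) {
                freq.merge(term, 1, Integer::sum);
            }
        }

        // Score a sentence by the average page-wide frequency of its terms.
        double[] scores = new double[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            List<String> terms = tokens(sentences[i], stopwords);
            double sum = 0;
            for (String term : terms) {
                sum += freq.get(term);
            }
            scores[i] = terms.isEmpty() ? 0 : sum / terms.size();
            order.add(i);
        }

        // Stable sort by descending score, so ties keep document order.
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        int keep = Math.max(0, Math.min(maxSentences, order.size()));
        List<Integer> picked = new ArrayList<>(order.subList(0, keep));
        Collections.sort(picked); // restore document order for readability

        StringBuilder summary = new StringBuilder();
        for (int index : picked) {
            if (summary.length() > 0) {
                summary.append(' ');
            }
            summary.append(sentences[index]);
        }
        return summary.toString();
    }

    /** Lowercases, splits on anything that is not a letter or digit, drops stopwords. */
    static List<String> tokens(String sentence, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }
}
```

Because the summary depends only on the page text, it could be computed 
once at indexing time and kept alongside the indexed page, which matches 
the "generated at indexing time" behaviour described above. A real 
summarizer would probably also weight titles and headings.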

We're using the Snowball stemmers/analyzers for German and English, 
custom stopword lists, and the HTML parser from the SourceForge 
htmlparser project. Apart from that it's vanilla Lucene.
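
To make the text-preparation steps concrete, here is a rough sketch in 
plain Java of what happens to a page before it is indexed: markup is 
removed, the text is lowercased and tokenized, and stopwords are 
dropped. It is an assumption-laden stand-in: a crude regex replaces the 
htmlparser library, Snowball stemming is omitted entirely, and the class 
and method names (PageTextPipeline, stripTags, indexTerms) are 
hypothetical rather than Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the indexing-time text pipeline; not Lucene or htmlparser code. */
class PageTextPipeline {

    /**
     * Crude tag removal: drops script/style blocks, then replaces every
     * remaining tag with a space. Character entities are NOT decoded.
     */
    static String stripTags(String html) {
        return html
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
            .replaceAll("(?s)<[^>]+>", " ")
            .replaceAll("\\s+", " ")
            .trim();
    }

    /** Lowercases the visible text, splits it into terms, and removes stopwords. */
    static List<String> indexTerms(String html, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        String text = stripTags(html).toLowerCase(Locale.ROOT);
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

With separate German and English indexes, each language would get its 
own stopword set passed to indexTerms, and a stemmer would run on the 
resulting terms before they are written to the index.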


