lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Lucene in the Humanities
Date Fri, 18 Feb 2005 19:46:36 GMT
It's about time I actually did something real with Lucene....  :)

I have been working with the Applied Research in Patacriticism group at 
the University of Virginia for a few months and am finally ready to 
present what I've been doing.  The primary focus of my group is working 
with the Rossetti Archive - poems, artwork, interpretations, 
collections, and so on of Dante Gabriel Rossetti.  I was initially 
brought on to build a collection and exhibit system, though I got 
detoured a bit as I got involved in applying Lucene to the archive to 
replace their existing search system.  The existing system used an old 
version of Tamino with XPath queries.  Tamino is not at fault here, at 
least not entirely, because our data is in a very complicated set of 
XML files with a lot of non-normalized and legacy metadata - getting at 
things via XPath is challenging and practically impossible in many 
cases.

My work is now presentable at

(rose is for ROsetti SEarch)

This system is implicitly designed for academics who are delving into 
Rossetti's work, so it may not be all that interesting for most of you. 
  Have fun and send me any interesting things you discover, especially 
any issues you may encounter.

Here are some numbers to give you a sense of what is going on 
underneath... There are currently 4,983 XML files, totaling about 110MB. 
  Without getting into a lot of details of the confusing domain, there 
are basically 3 types of XML files (works, pictures, and transcripts).  
It is important that there be both case-sensitive and case-insensitive 
searches.  To accomplish that, a custom analyzer is used in two 
different modes, one applying a LowerCaseFilter and one not, with the 
same documents written to two different indexes.  There is one 
particular type of XML file that gets indexed as two different types of 
documents (a specialized summary/header type).  In this first set of 
indexes, it is basically a one-to-one mapping of XML file to Lucene 
Document (with one type being indexed twice in different ways) - all 
said there are 5539 documents in each of the two main indexes.  The 
transcript type gets sliced into another set of original case and 
lowercased indexes with each document in that index representing a 
document division (a <div> element in the XML).  There are 12326 
documents in each of these <div>-level indexes.   All said, the 4 
indexes built total about 3GB in size - I'm storing several fields in 
order to hit-highlight.  Only one of these indexes is being hit at a 
time - which index is used depends on the parameters you use when 
querying.
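
A minimal sketch of the two-mode idea, written in plain Java rather than against the Lucene API: the toy `CaseModeIndex` class stands in for a real index, and the class and method names are made up for illustration. It only shows the principle that every document goes into both an original-case and a lowercased index, that the same normalization runs at index time and query time, and that a query parameter picks which index is searched.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a Lucene index: maps each token to the ids of the documents containing it.
class CaseModeIndex {
    private final boolean lowercase; // true = case-insensitive mode
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    CaseModeIndex(boolean lowercase) {
        this.lowercase = lowercase;
    }

    // The same normalization must be applied when indexing and when querying.
    private String normalize(String token) {
        return lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }

    void add(int docId, String text) {
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(normalize(token), k -> new TreeSet<>()).add(docId);
        }
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(normalize(term), Collections.<Integer>emptySet());
    }
}

class DualIndexDemo {
    static final CaseModeIndex EXACT = new CaseModeIndex(false);
    static final CaseModeIndex FOLDED = new CaseModeIndex(true);

    // Every document is written to both indexes.
    static void index(int docId, String text) {
        EXACT.add(docId, text);
        FOLDED.add(docId, text);
    }

    // A query parameter decides which single index is hit.
    static Set<Integer> search(String term, boolean caseSensitive) {
        return (caseSensitive ? EXACT : FOLDED).search(term);
    }

    public static void main(String[] args) {
        index(1, "The Blessed Damozel");
        index(2, "a blessed relief");
        System.out.println(search("Blessed", true));  // matches only the capitalized form
        System.out.println(search("Blessed", false)); // matches both documents
    }
}
```

The cost of this design is the one noted above: the index data is duplicated, in exchange for never having to decide about case at query time within a single index.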

Lucene brought the search times into a state that is usable and, to the 
scholars, impressive.  The previous search solution often timed the 
browser out!  Search results now come back in the milliseconds range.

The amount of data is tiny compared to most usages of Lucene, but 
things are getting interesting in other ways.   There has been little 
tuning in terms of ranking quality so far, but this is the next area of 
work.  There is one document type that is more important than the 
others, and it is being boosted during indexing.  There is now a 
growing interest in tinkering with all the new knobs and dials that are 
now possible.  Putting in similar and more-like-this features is 
desired and will be relatively straightforward to implement.  I'm 
currently using a catch-all aggregate field technique for the default 
field for QueryParser searching.  Multi-field expansion would be 
preferable, though.  So, I've got my homework to do and 
catch up on all the goodness that has been mentioned in this list 
recently regarding all of these techniques.
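
To make the two default-field options concrete, here is a rough sketch in plain Java. It does not call Lucene; it only builds strings. The field names `title` and `body`, the class name, and the choice to AND the per-term groups together are all assumptions for illustration, and no escaping of query syntax characters is attempted.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

class DefaultFieldSketch {
    // Option 1: at index time, concatenate all field values into one catch-all
    // field that serves as the default field for free-form queries.
    static String catchAll(Map<String, String> fields) {
        return String.join(" ", fields.values());
    }

    // Option 2: at query time, expand each term across the named fields,
    // producing a fielded query string in the usual field:term notation.
    static String expand(String userQuery, List<String> fieldNames) {
        StringJoiner all = new StringJoiner(" AND ");
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.isEmpty()) continue;
            StringJoiner perTerm = new StringJoiner(" OR ", "(", ")");
            for (String field : fieldNames) {
                perTerm.add(field + ":" + term);
            }
            all.add(perTerm.toString());
        }
        return all.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "The Blessed Damozel");
        doc.put("body", "poem text");
        System.out.println(catchAll(doc));
        System.out.println(expand("blessed damozel", Arrays.asList("title", "body")));
    }
}
```

The catch-all field is simpler but scores every field alike; the expansion keeps the fields separate, which is what would allow per-field weighting later.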

An area where I'd like to solicit more help from the community relates 
to something akin to personalization.  The scholars would like to be 
able to tune results based on the role (such as "art historian") that 
is searching the site.  This would involve some type of training or 
continual learning process so that someone searching feeds back 
preferences implicitly for their queries by visiting the actual 
documents that are of interest.  Now that the scholars have seen what 
is possible (I showed them the cool SearchMorph comparison page 
searching Wikipedia for "rossetti"), they want more and more!
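
One possible starting point for the role-based tuning, sketched in plain Java under heavy assumptions: click-throughs are recorded per role, and a document's base search score is multiplied by a factor that grows with the logarithm of its click count for that role. The class name, the role strings, and the boost formula are all invented for illustration; this is not something the archive does today.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RoleFeedback {
    // role -> (document id -> number of click-throughs recorded for that role)
    private final Map<String, Map<String, Integer>> clicks = new HashMap<>();

    // Called when a user in the given role visits a document from a result list.
    void recordClick(String role, String docId) {
        clicks.computeIfAbsent(role, r -> new HashMap<>()).merge(docId, 1, Integer::sum);
    }

    // Assumed formula: base score times (1 + ln(1 + clicks)); no clicks leaves the score unchanged.
    double adjust(String role, String docId, double baseScore) {
        int n = clicks.getOrDefault(role, Collections.<String, Integer>emptyMap())
                      .getOrDefault(docId, 0);
        return baseScore * (1.0 + Math.log1p(n));
    }

    // Returns document ids ordered by adjusted score, highest first.
    List<String> rerank(String role, Map<String, Double> baseScores) {
        List<String> ids = new ArrayList<>(baseScores.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> -adjust(role, id, baseScores.get(id))));
        return ids;
    }
}
```

With base scores of 1.0 for document `a` and 0.9 for document `b`, two recorded clicks on `b` by the "art historian" role are enough to move `b` ahead of `a` for that role, while other roles still see the original order.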

So - here's where I'm soliciting feedback - who's doing these types of 
things in the realm of Humanities?  Where should we go from here in 
terms of researching and applying the types of features dreamed about 
here?    How would you recommend implementing these types of features?

I'd be happy to share more about what I've done under the covers.  As 
you may be able to tell, the web UI is Tapestry for the search and 
results pages (though you won't be able to tell from the URLs you'll 
see :).  The UI was designed primarily by one of our very graphical/CSS 
savvy post doc research associates, and was designed with the research 
scholar in mind.  I continue to push for the "free form" search box to 
transcend structural search, but there are some good scholarly reasons 
to have focused structural searching also.  The documents beyond the 
links in the search results were another area where I've applied some 
elbow grease - they were originally dynamically generated with hideous 
URLs through Tamino and a Saxon transformation servlet.  These are 
documents that _rarely_ change - so they are now being XSL'd into all 
sorts of HTML views during a build process and statically served with 
Apache with quite clean URLs (and Google friendly, which, 
interestingly, is not something scholars care that much about for these 
types of archives it seems).  The XML files are accessible directly as 
hyperlinks at the bottoms of the document pages too - so you can see 
the parsing fun I've had (thank you JDOM and XPath!).
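
The build-time transformation can be sketched with the standard `javax.xml.transform` API. The `work` and `title` element names and the tiny stylesheet below are hypothetical stand-ins, not the archive's actual schema or XSL; a real build step would read the XML and XSL from files and write each HTML view to disk for Apache to serve.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StaticRender {
    // Applies an XSLT stylesheet to an XML document and returns the result as a string.
    static String render(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input document and stylesheet.
        String xml = "<work><title>The Blessed Damozel</title></work>";
        String xslt =
              "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='html'/>"
            + "<xsl:template match='/work'><h1><xsl:value-of select='title'/></h1></xsl:template>"
            + "</xsl:stylesheet>";
        System.out.println(render(xml, xslt));
    }
}
```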

