lucene-java-user mailing list archives

From "Praveen Peddi" <>
Subject Re: Lucene in the Humanities
Date Fri, 18 Feb 2005 20:00:20 GMT
Good work Erik (even though the UI could be made prettier). We use Lucene, so I 
have some knowledge of it. I could see the features you are using with 
Lucene (like paging, highlighting, and different kinds of phrases). Overall, 
good stuff.

----- Original Message ----- 
From: "Erik Hatcher" <>
To: "Lucene User" <>
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities

> It's about time I actually did something real with Lucene....  :)
> I have been working with the Applied Research in Patacriticism group at 
> the University of Virginia for a few months and am finally ready to present 
> what I've been doing.  The primary focus of my group is working with the 
> Rossetti Archive - poems, artwork, interpretations, collections, and so on 
> of Dante Gabriel Rossetti.  I was initially brought on to build a 
> collection and exhibit system, though I was detoured a bit as I got 
> involved in applying Lucene to the archive to replace their existing 
> search system.  The existing system used an old version of Tamino with 
> XPath queries.  Tamino is not at fault here, at least not entirely, 
> because our data is in a very complicated set of XML files with a lot of 
> non-normalized and legacy metadata - getting at things via XPath is 
> challenging and practically impossible in many cases.
> My work is now presentable at
> (rose is for ROssetti SEarch)
> This system is implicitly designed for academics who are delving into 
> Rossetti's work, so it may not be all that interesting for most of you. 
> Have fun and send me any interesting things you discover, especially any 
> issues you may encounter.
> Here are some numbers to give you a sense of what is going on 
> underneath... There are currently 4,983 XML files, totaling about 110MB. 
> Without getting into a lot of details of the confusing domain, there are 
> basically 3 types of XML files (works, pictures, and transcripts).  It is 
> important that there be both case-sensitive and case-insensitive searches.  To 
> accomplish that, a custom analyzer is used in two different modes, one 
> applying a LowerCaseFilter and one not, with the same documents written to 
> two different indexes.  There is one particular type of XML file that gets 
> indexed as two different types of documents (a specialized summary/header 
> type).  In this first set of indexes, it is basically a one-to-one mapping 
> of XML file to Lucene Document (with one type being indexed twice in 
> different ways) - all told there are 5,539 documents in each of the two 
> main indexes.  The transcript type gets sliced into another set of 
> original case and lowercased indexes with each document in that index 
> representing a document division (a <div> element in the XML).  There are 
> 12,326 documents in each of these <div>-level indexes.  All told, the 4 
> indexes built total about 3GB in size - I'm storing several fields in 
> order to hit-highlight.  Only one of these indexes is hit at a time - the 
> parameters you use when querying determine which index is used.
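> 
> For the curious, the two analyzer modes look roughly like this - a 
> simplified sketch, not the archive's actual code (RoseAnalyzer is a 
> made-up name):
> 
>     import java.io.Reader;
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.LowerCaseFilter;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
> 
>     // One analyzer, two modes: the flag decides whether tokens are
>     // lowercased on their way into the index.
>     public class RoseAnalyzer extends Analyzer {
>         private final boolean lowercase;
> 
>         public RoseAnalyzer(boolean lowercase) {
>             this.lowercase = lowercase;
>         }
> 
>         public TokenStream tokenStream(String fieldName, Reader reader) {
>             TokenStream stream = new StandardTokenizer(reader);
>             if (lowercase) {
>                 stream = new LowerCaseFilter(stream);
>             }
>             return stream;
>         }
>     }
> 
> The same documents then go through two IndexWriters, one constructed with 
> new RoseAnalyzer(false) for the original-case index and one with 
> new RoseAnalyzer(true) for the lowercased index.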
> Lucene brought search times into a usable state - one that impressed the 
> scholars.  The previous search solution often timed the browser out! 
> Search results now come back in milliseconds.
> The amount of data is tiny compared to most usages of Lucene, but things 
> are getting interesting in other ways.  There has been little tuning in 
> terms of ranking quality so far, but this is the next area of work.  There 
> is one document type that is more important than the others, and it is 
> being boosted during indexing.  There is now a growing interest in 
> tinkering with all the knobs and dials that are now available.  Similarity 
> and more-like-this features are desired and will be relatively 
> straightforward to implement.  I'm currently using a catch-all 
> aggregate-field technique for the default field for QueryParser searching, 
> though expanding queries across multiple fields would be preferable.  So, 
> I've got my homework to do to catch up on all the goodness that has been 
> mentioned on this list recently regarding these techniques.
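> 
> To make those two techniques concrete, here is a rough sketch (the field 
> names and values are made up for illustration):
> 
>     import org.apache.lucene.document.Document;
>     import org.apache.lucene.document.Field;
> 
>     // Hypothetical field values for one archive object.
>     String type = "work";
>     String title = "The Blessed Damozel";
>     String body = "...";
> 
>     Document doc = new Document();
>     doc.add(Field.Text("title", title));
>     doc.add(Field.Keyword("archivetype", type));
> 
>     // Catch-all field: aggregate the other fields so QueryParser has a
>     // single sensible default field to search against.
>     doc.add(Field.UnStored("contents", title + " " + body));
> 
>     // Index-time boost for the more important document type.
>     if ("work".equals(type)) {
>         doc.setBoost(2.0f);
>     }
> 
> The multi-field alternative would leave the boost alone and instead expand 
> the query across fields, along the lines of 
> MultiFieldQueryParser.parse(query, new String[] {"title", "contents"}, 
> analyzer).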
> An area where I'd like to solicit more help from the community relates to 
> something akin to personalization.  The scholars would like to be able to 
> tune results based on the role (such as "art historian") of the person 
> searching the site.  This would involve some type of training or continual 
> learning process, so that someone searching implicitly feeds back 
> preferences for their queries by visiting the actual documents that are of 
> interest.  Now 
> that the scholars have seen what is possible (I showed them the cool 
> SearchMorph comparison page searching Wikipedia for "rossetti"), they want 
> more and more!
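> 
> One possible shape for that feedback loop - just a sketch of the idea, not 
> anything built yet (the class and key scheme are invented):
> 
>     import java.io.IOException;
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.lucene.search.Hits;
> 
>     public class RoleFeedback {
>         // "role:docId" -> number of times scholars in that role have
>         // opened the document from a results page.
>         private final Map clicks = new HashMap();
> 
>         public void recordVisit(String role, int docId) {
>             String key = role + ":" + docId;
>             Integer n = (Integer) clicks.get(key);
>             clicks.put(key, new Integer(n == null ? 1 : n.intValue() + 1));
>         }
> 
>         // Blend Lucene's score with the role's accumulated preference;
>         // results would be re-sorted by this adjusted score.
>         public float adjustedScore(Hits hits, int i, String role)
>                 throws IOException {
>             Integer n = (Integer) clicks.get(role + ":" + hits.id(i));
>             float boost = (n == null) ? 0.0f : 0.1f * n.intValue();
>             return hits.score(i) * (1.0f + boost);
>         }
>     }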
> So - here's where I'm soliciting feedback - who's doing these types of 
> things in the realm of the Humanities?  Where should we go from here in terms 
> of researching and applying the types of features dreamed about here? 
> How would you recommend implementing these types of features?
> I'd be happy to share more about what I've done under the covers.  As you 
> may be able to tell, the web UI is Tapestry for the search and results 
> pages (though you won't be able to tell from the URLs you'll see :).  The 
> UI was designed primarily by one of our very graphics/CSS-savvy postdoc 
> research associates, with the research scholar in mind. 
> I continue to push for the "free form" search box to transcend structural 
> search, but there are some good scholarly reasons to have focused 
> structural searching also.  The documents behind the links in the search 
> results were another area where I've applied some elbow grease - they were 
> originally dynamically generated with hideous URLs through Tamino and a 
> Saxon transformation servlet.  These are documents that _rarely_ change - 
> so they are now being XSL'd into all sorts of HTML views during a build 
> process and statically served with Apache with quite clean URLs (and 
> Google-friendly ones at that, which, interestingly, scholars don't seem to 
> care much about for these types of archives).  The XML files are 
> accessible directly as hyperlinks at the bottoms of the document pages 
> too - so you can see the parsing fun I've had (thank you JDOM and XPath!).
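> 
> And to picture the parsing side, a minimal JDOM + XPath sketch (the file 
> name and expression are made up):
> 
>     import java.io.File;
>     import java.util.List;
>     import org.jdom.Document;
>     import org.jdom.input.SAXBuilder;
>     import org.jdom.xpath.XPath;
> 
>     public class DivExtractor {
>         public static void main(String[] args) throws Exception {
>             // Parse one archive file and pull out the <div> elements
>             // that become the per-division Lucene documents.
>             SAXBuilder builder = new SAXBuilder();
>             Document doc = builder.build(new File("transcript.xml"));
>             List divs = XPath.selectNodes(doc, "//div");
>             System.out.println(divs.size() + " divisions found");
>         }
>     }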
> Erik
