uima-user mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: UIMA and Lucene
Date Thu, 30 Nov 2006 13:57:00 GMT
James Montgomery wrote:
> Hello all,
> 
> I'm working on a project with an engineering firm to develop a search tool
> that can find relevant engineering documents and also provide information
> about relationships between documents (for instance, they mention the same
> part). We are currently leaning most strongly towards a combination of
> Lucene for search and UIMA for document analysis. 

Good ;-)

> I see on the Incubator
> Wiki (http://wiki.apache.org/incubator/UimaProposal) that better
> integration or communication between these two products is being 
> considered.
> Here are some questions about this and UIMA:
> 
> - Would others recommend the use of Lucene to search analysis results
> produced by UIMA components?

It depends on what your requirements are.  Lucene is certainly a good 
choice for a search engine.  You may also want to consider Solr 
(http://incubator.apache.org/solr/) which uses Lucene internally.  One 
constraint is that Lucene, like most text search engines, does not 
support span search.  What I mean by that is that you cannot, for 
example, index the internal structure of an XML document.  So suppose 
you have a UIMA analysis pipeline that discovers book descriptions, and 
inside those book descriptions, the author, title, ISBN or what have you 
of that book.  Then you might want to pose queries like: show me all 
instances of books where the author is "Smith" and the title contains 
the word "Lucene".  There is no obvious way to model this kind of search 
in Lucene.

What Lucene does support are fields.  Fields are global to the entire 
document.  So if your application does not really require span support 
and you can model your UIMA data as fields, Lucene is a good choice. 
For example, if your application can discover product names, you can 
create a "product" field in Lucene and for each document index the 
product names you found under that field.  This will allow you to search 
specifically for documents containing product names.
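
To make the field model concrete, here is a toy sketch in plain Java.  It 
is not Lucene's API: the class FieldIndexSketch and its methods are 
hypothetical names made up for illustration, and the index is just a map 
from field to term to document ids.  It assumes one flat set of fields 
per document, as described above.

```java
import java.util.*;

// Toy model of a fielded inverted index. NOT Lucene's API; all names are hypothetical.
public class FieldIndexSketch {
    // field name -> term -> ids of documents containing that term in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Index one value under a field of a document; the value is split into lowercase terms.
    public void add(int docId, String field, String value) {
        for (String term : value.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(field, f -> new HashMap<>())
                    .computeIfAbsent(term, t -> new TreeSet<>())
                    .add(docId);
        }
    }

    // Ids of documents that contain the term in the given field.
    public Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field, Collections.emptyMap())
                       .getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    // Ids of documents matching both field/term pairs (AND at document level).
    public Set<Integer> searchAnd(String f1, String t1, String f2, String t2) {
        Set<Integer> result = new TreeSet<>(search(f1, t1));
        result.retainAll(search(f2, t2));
        return result;
    }

    public static void main(String[] args) {
        FieldIndexSketch index = new FieldIndexSketch();
        // Document 1: the analysis found one product name.
        index.add(1, "product", "WidgetPro");
        // Document 2 describes two books: Smith wrote "Cooking", Jones wrote "Lucene Basics".
        index.add(2, "author", "Smith");
        index.add(2, "title", "Cooking");
        index.add(2, "author", "Jones");
        index.add(2, "title", "Lucene Basics");

        System.out.println(index.search("product", "widgetpro"));                   // prints [1]
        System.out.println(index.searchAnd("author", "smith", "title", "lucene")); // prints [2]
    }
}
```

The first query is the product-name case and works as intended.  The 
second shows the limitation from the book example: document 2 matches 
even though no single book in it has author "Smith" and a title 
containing "Lucene", because the fields belong to the whole document and 
not to a span inside it.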

> - What other search engines and search engine SDKs would others recommend,
> perhaps as being better suited to integration with UIMA?

There is a search engine that comes with the pre-Apache UIMA SDK you can 
download from IBM: http://www.alphaworks.ibm.com/tech/uima
It supports span search, and it is planned to make a version available 
that works with Apache UIMA in the future.  I'm not exactly sure what 
the license conditions are; does anyone else know?

> - Although UIMA has only just entered the Apache Incubator, how soon might
> efforts be made to provide an interface between Lucene and UIMA?

Most of the current UIMA developers will be concentrating on getting the 
first release on Apache out the door, hopefully early next year.  After 
that is done, we hope to have time to look at the Lucene integration. 
On the other hand, you are not the only one interested in such an 
integration, and there's always the possibility that somebody else will 
step up and do it.

>   - Should this question be directed to the developer list?

No, the user list is fine.  All developers read the user list.

> - What sites would others recommend for open source UIMA analysis 
> components
> for different document formats?

There is a general UIMA component repository here: 
http://uima.lti.cs.cmu.edu/

I'm not sure what you mean by "for different document formats".  Formats 
such as html or pdf?  I'm not sure anybody has done any open source 
document format parsing for UIMA yet.  It should not be too difficult to 
wrap existing technology, such as http://www.pdfbox.org/, for use in UIMA.

HTH,
Thilo


