lucene-java-user mailing list archives

Subject Build vs. Buy?
Date Thu, 09 Feb 2006 02:52:03 GMT
I'm trying to upgrade our search functionality (currently, RTF/text
only, and exact phrase match only) at my company, and have run into some
concerns.  Our 4 main formats are:


RTF - javax.swing looks fine, we use those classes already.


MS Word - I know that POI exists, but development on the Word portion
seems to have stopped, and there are a lot of nasty-looking bugs in
their bug database.  Since we deal with contracts, many of our Word
files are large and complicated.  How has everyone's experience with
POI's Word parsing been?


PDF - Looks like PDFBox has memory issues.  Frankly, this would only
affect us during indexing, so it is minor, but still a concern.


WordPerfect - There don't seem to be any converters for this format.


I would hate to have to recommend that my boss shell out $10k to $25k
(or more!) in licensing fees for a commercial search engine just because
the commercial ones can parse our files and I can't.  But that is still
cheaper than dedicating two engineers for 6 months to writing parsers
for Word, PDF and WordPerfect if we go with Lucene.  (Frankly, there's
less risk too, considering how complicated parsing would be.)  I know
that Lucene doesn't deal with file formats, but the basic fact is that
to use Lucene you have to present it with text strings, and there's no
way to get those without dealing with file formats.
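
For the RTF case, here is a minimal sketch of what "getting text
strings" out of a file looks like, assuming the standard
javax.swing.text.rtf.RTFEditorKit is good enough for the documents in
question.  The class name RtfTextExtractor and the sample input are
made up for illustration; the returned string is what would then be
handed to Lucene as a field value.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper (not part of Lucene): turns an RTF stream into plain text.
public class RtfTextExtractor {

    public static String extractText(InputStream in)
            throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = new DefaultStyledDocument();
        // Parse the RTF markup into a Swing document model, starting at offset 0.
        kit.read(in, doc, 0);
        // Pull out the text content only; formatting is discarded.
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up RTF document used only to exercise the helper.
        byte[] rtf = "{\\rtf1\\ansi Hello, Lucene.\\par}"
                .getBytes(StandardCharsets.US_ASCII);
        String text = extractText(new ByteArrayInputStream(rtf));
        System.out.println(text.trim());
    }
}
```

The other three formats would need an equivalent extractText for each,
which is exactly where the parser question comes in.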


What is the experience of people on the list with implementing parsers
for anything more than text, html and xml?


Thanks for any insights,

Jeff Wang

diCarta, Inc.
