How does i2 do it? http://www.i2a.com/websearch/ - they list both an
HTML parser and a PDF parser as part of their solution.

J

--- Doug Cutting wrote:
> > From: William Wong [mailto:keng.wong@verizon.net]
> >
> > How about adding filters for different file types such as
> > -HTML (there is one in the demo already)
> > -XML
> > -PDF
> > -MsWord/RTF
> > -other common file formats
>
> These would be great. Who will implement them?
> I was only listing tasks that I plan to do.
>
> I think the best API for such converters is a method that takes a
> java.io.InputStream and returns a java.io.Reader containing plain text,
> e.g.:
>   public static java.io.Reader getText(java.io.InputStream);
> That way they can easily be used by Lucene analyzers.
>
> Should we put converters in org.apache.lucene.document?
>
> Contributions anyone?
>
> Doug
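
For illustration, here is a minimal sketch of the kind of converter described above: a static method that takes a java.io.InputStream and returns a java.io.Reader over plain text. The class name NaiveHtmlConverter, the readAll helper, the UTF-8 assumption, and the crude tag-stripping logic are all hypothetical; none of this is part of Lucene, and a real HTML converter would need a proper parser (entities, scripts, comments, encodings).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical converter: name and logic are illustrative assumptions only.
class NaiveHtmlConverter {

    // Takes raw document bytes and returns a Reader over plain text.
    // Assumes UTF-8 input; drops everything between '<' and '>'.
    public static Reader getText(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader raw = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            boolean inTag = false;
            int c;
            while ((c = raw.read()) != -1) {
                if (c == '<') {
                    inTag = true;
                } else if (c == '>' && inTag) {
                    inTag = false;
                    text.append(' '); // keep words on either side of a tag apart
                } else if (!inTag) {
                    text.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new StringReader(text.toString());
    }

    // Demo helper: drains a Reader into a String.
    static String readAll(Reader reader) {
        StringBuilder out = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>Hello <b>Lucene</b></p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        String plain = readAll(getText(new ByteArrayInputStream(html)));
        // Collapse the whitespace left behind by removed tags.
        System.out.println(plain.trim().replaceAll("\\s+", " ")); // prints "Hello Lucene"
    }
}
```

Because the result is an ordinary Reader, it could be handed to whatever consumes character streams for indexing, which seems to be the point of returning a Reader rather than a String.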