lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Beginner: Specific indexing
Date Mon, 08 Sep 2008 22:27:03 GMT

: I think I'm getting you. But the files I'm going to parse have many formats:
: PDF, HTML, Word.
: they don't have a particular structure, memos if you will. But the ones I'm
: interested in will have the triplets I described

Ahhhh...  see, this is something I completely didn't realize.  "Lucene" as
a library really doesn't provide any sort of mechanism for doing text
extraction from unknown file formats ... With some small exceptions (like
the HTMLStripTokenizer in Solr), the TokenStream concept is much more about
finding "Tokens" in a stream of plain text -- not about finding "Text"
in arbitrary (possibly binary) files.
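
To make that split concrete, here is a rough sketch in plain Java (no Lucene classes at all) of the two separate jobs: first get plain text out of the raw file, then find tokens in that text. Everything named here (ExtractThenTokenize, TextExtractor, NAIVE_HTML, tokenize) is made up for illustration and is not a Lucene or Solr API; the tag stripper is a deliberately naive stand-in for a real extraction step and only copes with trivial HTML.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ExtractThenTokenize {

    // Hypothetical interface (NOT a Lucene API): raw file bytes in, plain text out.
    interface TextExtractor {
        String extract(byte[] rawBytes);
    }

    // Naive stand-in extractor for simple HTML: drops everything between '<' and '>'.
    // Assumption: UTF-8 input with no scripts, comments, or entities to worry about.
    static final TextExtractor NAIVE_HTML = raw -> {
        String html = new String(raw, StandardCharsets.UTF_8);
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                inTag = true;
                text.append(' '); // keep words on either side of a tag apart
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
            }
        }
        return text.toString();
    };

    // The "tokens from plain text" half: split on anything that is not a
    // letter or digit, and lowercase what remains.
    static List<String> tokenize(String plainText) {
        List<String> tokens = new ArrayList<>();
        for (String piece : plainText.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] raw = "<html><body><p>Memo: call <b>Bob</b></p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        String text = NAIVE_HTML.extract(raw);   // step 1: file -> plain text
        System.out.println(tokenize(text));      // step 2: plain text -> tokens
        // prints [memo, call, bob]
    }
}
```

The point of keeping the extractor behind its own interface is that PDF or Word input would need a different implementation of step 1, while step 2 stays exactly the same.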

You'll probably want to check out the Tika subproject...
    http://incubator.apache.org/tika/
...or some of the various "How do I index _____ documents?" FAQs...
    http://wiki.apache.org/lucene-java/LuceneFAQ


-Hoss



