lucene-java-user mailing list archives

From "Raymond Balm├Ęs" <raymond.bal...@gmail.com>
Subject Re: Beginner: Specific indexing
Date Tue, 09 Sep 2008 13:11:01 GMT
Well, that is explained well in "Lucene in Action": if you want to search
files you have to build a file parser, and there is a good example given. So
that's not really my problem.

But I thought I could go through the token stream only once, whereas I have
to go through it twice: 1. to detect my triplets, 2. to index the text.
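
What I had in mind is roughly the sketch below: a custom TokenFilter that
watches the tokens as they stream past during indexing, so the triplet
detection rides along with the normal analysis in a single pass. This is only
a rough idea against the Lucene 2.3-era TokenStream API (next() /
termBuffer()); isTriplet() is just a stand-in for my real detection logic.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Passes every token through unchanged (so the text still gets indexed
    // normally) while keeping a sliding window of the last three terms to
    // look for triplets.
    public class TripletCapturingFilter extends TokenFilter {

      private final List<String> triplets = new ArrayList<String>();
      private final LinkedList<String> window = new LinkedList<String>();

      public TripletCapturingFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
          window.add(new String(t.termBuffer(), 0, t.termLength()));
          if (window.size() > 3) {
            window.removeFirst();
          }
          if (window.size() == 3 && isTriplet(window)) {
            triplets.add(window.get(0) + " " + window.get(1) + " " + window.get(2));
          }
        }
        return t;
      }

      // Placeholder -- the real triplet test goes here.
      private boolean isTriplet(List<String> w) {
        return false;
      }

      public List<String> getTriplets() {
        return triplets;
      }
    }

Wrapped around whatever the Analyzer's tokenStream() produces (e.g. in a small
custom Analyzer handed to IndexWriter), the triplets could then be read back
via getTriplets() right after addDocument(), without re-tokenizing the text.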

-Raymond-

On Tue, Sep 9, 2008 at 12:27 AM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:

>
> : I think I'm getting you. But the files I'm going to parse have many
> : formats: PDF, HTML, Word.
> : They don't have a particular structure, memos if you will. But the ones
> : I'm interested in will have the triplets I described.
>
> Ahhhh...  see this is something I completely didn't realize.  "Lucene" as
> a library really doesn't provide any sort of mechanism for doing text
> extraction from unknown file formats ... With some small exceptions (like
> the HTMLStripTokenizer in Solr) the TokenStream concept is much more about
> finding "Tokens" from a stream of plain text -- not about finding "Text"
> in arbitrary (possibly binary) files.
>
> You'll probably want to check out the Tika subproject...
>    http://incubator.apache.org/tika/
> ...or some of the various "How do I index _____ documents?" FAQs...
>    http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
> -Hoss
>
>
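
For the extraction side, the Tika route Hoss suggests would presumably look
something like the sketch below: let AutoDetectParser pull plain text out of
whatever the file happens to be (PDF, HTML, Word), then hand that text to
Lucene as an ordinary field. This assumes Tika's AutoDetectParser /
BodyContentHandler classes and Lucene 2.4-style Field constants (older
releases spell them TOKENIZED / UN_TOKENIZED), and that an IndexWriter is
already open.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaIndexer {

      // Extracts plain text from an arbitrary file with Tika, then adds it
      // to the index as a normal analyzed field.
      public static void indexFile(IndexWriter writer, File file) throws Exception {
        InputStream in = new FileInputStream(file);
        try {
          BodyContentHandler handler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          new AutoDetectParser().parse(in, handler, metadata);

          Document doc = new Document();
          doc.add(new Field("path", file.getPath(),
              Field.Store.YES, Field.Index.NOT_ANALYZED));
          // the extracted text goes through the normal Analyzer chain here
          doc.add(new Field("contents", handler.toString(),
              Field.Store.NO, Field.Index.ANALYZED));
          writer.addDocument(doc);
        } finally {
          in.close();
        }
      }
    }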
