lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: pdf indexing problem
Date Wed, 01 Sep 2004 23:38:59 GMT
Siegfried,

My guess is that the '.' is accidental and there is nothing special
about '.'.  I've used PDFBox+Lucene and it worked well.  Are you aware
of Lucene-specific classes included in the PDFBox distribution?  You
may be able to use those classes, or you could at least look at their
source, to make sure you're using both PDFBox and Lucene API correctly.

Also, how long is the extracted text?
See this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#maxFieldLength
Maybe that's the problem.

Otis


--- Siegfried Puchbauer <S.Puchbauer@spp.at> wrote:

> Hi,
> 
>  
> 
> I have a PDF Parser which uses PDFBox libary to parse PDF documents
> into
> plain text.
> 
> I have tried this parser by sending the output directly to the
> commandline and it works, I 
> get the plain text, like I get it with my HTMLParser.
> 
> But there is a problem with the indexing, I think:
> 
> I can only find the document with lucene with words which occur
> before
> the first dot. With
> a word which occurs after the first "." Lucene doesn't find the
> related
> document. So I think
> these words are not indexed. I do not use any special analyzer and I
> cannot understand
> why it does not work. 
> 
>  
> 
> The indexing with html files work, there I can find words after the
> first ".".
> 
> Its also possible that the missing indexing is not related to the "."
> Character, but to
> 
> The first newline (\n). I don't know.
> 
>  
> 
> Have you got any ideas for making pdf-index work?
> 
>  
> 
> Siegfried Puchbauer
> 
>  
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message