lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Siegfried Puchbauer" <>
Subject pdf indexing problem
Date Wed, 01 Sep 2004 14:52:17 GMT


I have a PDF Parser which uses PDFBox libary to parse PDF documents into
plain text.

I have tried this parser by sending the output directly to the
commandline and it works, I 
get the plain text, like I get it with my HTMLParser.

But there is a problem with the indexing, I think:

I can only find the document with lucene with words which occur before
the first dot. With
a word which occurs after the first "." Lucene doesn't find the related
document. So I think
these words are not indexed. I do not use any special analyzer and I
cannot understand
why it does not work. 


The indexing with html files work, there I can find words after the
first ".".

Its also possible that the missing indexing is not related to the "."
Character, but to

The first newline (\n). I don't know.


Have you got any ideas for making pdf-index work?


Siegfried Puchbauer


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message