lucene-java-user mailing list archives

From "Ganesh" <emailg...@yahoo.co.in>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 10:43:55 GMT
The main problem is that I am not getting whitespace and newline characters. This happens only
for PDF documents.

Sample output: "Someofthedifferencesare", but it should be "Some of the differences are".

Regards
Ganesh
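
The sample output above suggests why indexing gives wrong results once whitespace is lost. Below is a minimal sketch in Python, assuming a plain whitespace tokenizer rather than any particular Lucene analyzer. The names `whitespace_tokens` and `build_index` are illustrative only and are not part of Tika or Lucene.

```python
def whitespace_tokens(text):
    """Split text on runs of whitespace and lowercase each token."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in whitespace_tokens(text):
            index.setdefault(token, set()).add(doc_id)
    return index


docs = {
    "expected": "Some of the differences are",
    "extracted": "Someofthedifferencesare",  # sample output quoted in the message
}
index = build_index(docs)

# Only the correctly spaced text matches a search for "differences".
print(sorted(index.get("differences", set())))  # ['expected']

# The run-together text is indexed as a single long token.
print(whitespace_tokens(docs["extracted"]))  # ['someofthedifferencesare']
```

Under this assumption, a query for any individual word would miss the PDF text entirely, because the whole phrase is stored as one token.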

----- Original Message ----- 
From: "Alexander Aristov" <alexander.aristov@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Friday, December 03, 2010 2:39 PM
Subject: Re: PDF text extracted without spaces


> Anyway, even if you get correct whitespace and newlines, this won't affect
> indexing.
> 
> Best Regards
> Alexander Aristov
> 
> 
> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
> 
>> The text should come out as a stream of words with spaces, but without
>> any of the formatting in the PDF. Extraction is only good enough to
>> tell you that a word is somewhere inside a PDF file.  Can you post a
>> short bit of the text that it extracted?
>>
>> Also, you should try this test on different PDF files that were made
>> with different software.
>>
>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
>> > Hello all,
>> >
>> > I know this is not the right group for this question, but I thought some of
>> you might have experienced this.
>> >
>> > I am a newbie with Tika and am using the latest version, 0.8. I extracted
>> text from a PDF document but found spaces and newlines missing. Indexing the
>> data gives wrong results. Could anyone in this group help me? I am
>> using Tika directly to extract the contents, which later get indexed.
>> >
>> > Regards
>> > Ganesh
>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>> Download Now! http://messenger.yahoo.com/download.php
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>>
>>
>

