lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ganesh" <emailg...@yahoo.co.in>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 11:55:52 GMT
I first extract the contents from documents using tika and latter index it with Lucene. The
problem is the extracted text from PDF using tika has no whitespaces. 

Regards
Ganesh


----- Original Message ----- 
From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
To: <java-user@lucene.apache.org>
Sent: Friday, December 03, 2010 4:40 PM
Subject: RE: PDF text extracted without spaces


> Hi Ganesh
> 
> I encountered this same problem last week. I was thinking if it was possible to include
at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am
not sure of how this would be done as I am not familiar with Tika codebase.
> 
> Unfortunately I don't think that the solution to the first part of this problem lies
within the java-user mailing list.
> 
> When were you sending extracted contents to Lucene... at what later stage?
> 
> Thank you
> 
> Lewis
> 
> -----Original Message-----
> From: Ganesh [mailto:emailgane@yahoo.co.in]
> Sent: 03 December 2010 10:44
> To: java-user@lucene.apache.org
> Subject: Re: PDF text extracted without spaces
> 
> The main problem is i am not getting whitespace and newline char. This is happening only
for PDF documents.
> 
> Sample outoput: Someofthedifferencesare but it should be Some of the differences are
> 
> Regards
> Ganesh
> 
> ----- Original Message -----
> From: "Alexander Aristov" <alexander.aristov@gmail.com>
> To: <java-user@lucene.apache.org>
> Sent: Friday, December 03, 2010 2:39 PM
> Subject: Re: PDF text extracted without spaces
> 
> 
>> anyway even if you get correct whitespaces and new lines this won't affect
>> indexing.
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
>>
>>> The text should come out as a stream of words with space, but without
>>> any of the formatting in the PDF. Extraction is only good enough to
>>> tell you that a word is somewhere inside a PDF file.  Can you post a
>>> short bit of the text that it extracted?
>>>
>>> Also, you should try this test on different PDF files that were made
>>> with different software.
>>>
>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
>>> > Hello all,
>>> >
>>> > I know, this is not the right group to ask this question, thought some of
>>> you guys might have experienced.
>>> >
>>> > I newbie with Tika. I am using latest version 0.8 version. I extracted
>>> text from PDF document but found spaces and new line missing. Indexing the
>>> data gives wrong result. Could any one in this group could help me? I am
>>> using tika directly to extract the contents, which later gets indexed.
>>> >
>>> > Regards
>>> > Ganesh
>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>>> Download Now! http://messenger.yahoo.com/download.php
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> Email has been scanned for viruses by Altman Technologies' email management service -
www.altman.co.uk/emailsystems
> 
> Glasgow Caledonian University is a registered Scottish charity, number SC021474
> 
> Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009
and Herald Society’s Education Initiative of the Year 2009
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message