lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Lea <ian....@gmail.com>
Subject Re: PDF text extracted without spaces
Date Fri, 03 Dec 2010 12:08:35 GMT
Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
Have you tried asking on the tika mailing list?
http://tika.apache.org/mail-lists.html.


--
Ian.


On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <emailgane@yahoo.co.in> wrote:
> I first extract the contents from documents using tika and latter index it with Lucene.
The problem is the extracted text from PDF using tika has no whitespaces.
>
> Regards
> Ganesh
>
>
> ----- Original Message -----
> From: "McGibbney, Lewis John" <Lewis.McGibbney@gcu.ac.uk>
> To: <java-user@lucene.apache.org>
> Sent: Friday, December 03, 2010 4:40 PM
> Subject: RE: PDF text extracted without spaces
>
>
>> Hi Ganesh
>>
>> I encountered this same problem last week. I was thinking if it was possible to include
at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am
not sure of how this would be done as I am not familiar with Tika codebase.
>>
>> Unfortunately I don't think that the solution to the first part of this problem lies
within the java-user mailing list.
>>
>> When were you sending extracted contents to Lucene... at what later stage?
>>
>> Thank you
>>
>> Lewis
>>
>> -----Original Message-----
>> From: Ganesh [mailto:emailgane@yahoo.co.in]
>> Sent: 03 December 2010 10:44
>> To: java-user@lucene.apache.org
>> Subject: Re: PDF text extracted without spaces
>>
>> The main problem is i am not getting whitespace and newline char. This is happening
only for PDF documents.
>>
>> Sample outoput: Someofthedifferencesare but it should be Some of the differences
are
>>
>> Regards
>> Ganesh
>>
>> ----- Original Message -----
>> From: "Alexander Aristov" <alexander.aristov@gmail.com>
>> To: <java-user@lucene.apache.org>
>> Sent: Friday, December 03, 2010 2:39 PM
>> Subject: Re: PDF text extracted without spaces
>>
>>
>>> anyway even if you get correct whitespaces and new lines this won't affect
>>> indexing.
>>>
>>> Best Regards
>>> Alexander Aristov
>>>
>>>
>>> On 3 December 2010 10:00, Lance Norskog <goksron@gmail.com> wrote:
>>>
>>>> The text should come out as a stream of words with space, but without
>>>> any of the formatting in the PDF. Extraction is only good enough to
>>>> tell you that a word is somewhere inside a PDF file.  Can you post a
>>>> short bit of the text that it extracted?
>>>>
>>>> Also, you should try this test on different PDF files that were made
>>>> with different software.
>>>>
>>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailgane@yahoo.co.in> wrote:
>>>> > Hello all,
>>>> >
>>>> > I know, this is not the right group to ask this question, thought some
of
>>>> you guys might have experienced.
>>>> >
>>>> > I newbie with Tika. I am using latest version 0.8 version. I extracted
>>>> text from PDF document but found spaces and new line missing. Indexing the
>>>> data gives wrong result. Could any one in this group could help me? I am
>>>> using tika directly to extract the contents, which later gets indexed.
>>>> >
>>>> > Regards
>>>> > Ganesh
>>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>>>> Download Now! http://messenger.yahoo.com/download.php
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now!
http://messenger.yahoo.com/download.php
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>> Email has been scanned for viruses by Altman Technologies' email management service
- www.altman.co.uk/emailsystems
>>
>> Glasgow Caledonian University is a registered Scottish charity, number SC021474
>>
>> Winner: Times Higher Education’s Widening Participation Initiative of the Year
2009 and Herald Society’s Education Initiative of the Year 2009
>> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>>
> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message