lucene-solr-user mailing list archives

From Gora Mohanty <g...@mimirtech.com>
Subject Re: Indexing in Solr: invalid UTF-8
Date Tue, 09 Oct 2012 15:08:04 GMT
On 9 October 2012 17:42, Patrick Oliver Glauner
<patrick.oliver.glauner@cern.ch> wrote:
> Hello everybody
>
> Meanwhile, I checked this issue in detail: we use pdftotext to extract text from our
> PDFs (<http://cds.cern.ch/>). Some generated text files contain \uFFFF and \uD835.
>
> unicode(text, 'utf-8') does not throw any exception for these texts. Subsequently, Solr
> throws an exception when these are sent to the indexer.

Off-topic, but this is because the Unicode escape sequence
'\uxxxx' is not interpreted inside a plain (byte) string literal.
You have to decode it explicitly. Here is an example with
'\u2018', the left single quotation mark (I did not have a font
that covers '\ud835'). Please note the difference between:

    print unicode('\u2018')
    \u2018

and

    print unicode('\u2018').decode('unicode-escape')
    ‘

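The same distinction can be sketched in Python 3 syntax (the two lines above are Python 2). This is only an illustration, assuming the input holds the literal six characters backslash, 'u', '2', '0', '1', '8' rather than the code point itself; the variable names are made up for the example.

```python
# Python 3 sketch; variable names are illustrative only.

# Six ASCII bytes: backslash, 'u', '2', '0', '1', '8'.
raw = b'\\u2018'

# Decoding as UTF-8 succeeds and keeps the sequence as literal text.
as_text = raw.decode('utf-8')

# Decoding with the 'unicode_escape' codec interprets the escape
# and yields the single character U+2018.
interpreted = raw.decode('unicode_escape')

print(len(as_text))             # 6
print(len(interpreted))         # 1
print(interpreted == '\u2018')  # True
```

Note that 'unicode_escape' is only appropriate when the data really contains backslash escapes; applying it to arbitrary UTF-8 text would mangle any non-ASCII bytes.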
Regards,
Gora
