lucene-solr-user mailing list archives

From Binoy Dalal <binoydala...@gmail.com>
Subject Re: Issues when indexing PDF files
Date Thu, 17 Dec 2015 09:15:53 GMT
You can always write an update handler plugin to convert your PDFs to UTF-8
and then push them to Solr.
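
A rough sketch of the "make it UTF-8, then push" step, done on the client
side in Python rather than as a plugin (a real update handler plugin would be
Java code running inside Solr). It assumes you already have correctly
extracted text from some PDF tool; if the extractor itself returns "??????",
re-encoding will not bring the characters back. The core name `mycollection`,
the field names `id` and `content`, the threshold, and both function names are
placeholders for illustration, not anything from this thread.

```python
import json
import urllib.request

# Assumed URL for illustration: a local Solr with a core named "mycollection".
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def looks_garbled(text, threshold=0.5):
    """Return True if text is empty or mostly '?' / U+FFFD placeholders.

    This is only a heuristic for spotting failed extraction, not a real
    encoding detector.
    """
    stripped = "".join(text.split())  # ignore all whitespace
    if not stripped:
        return True
    bad = sum(1 for ch in stripped if ch in ("?", "\ufffd"))
    return bad / len(stripped) >= threshold


def build_update_request(doc_id, text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request with a UTF-8 body."""
    # ensure_ascii=False keeps non-ASCII characters literal in the JSON;
    # encode("utf-8") then produces the actual UTF-8 bytes on the wire.
    body = json.dumps(
        [{"id": doc_id, "content": text}], ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

To send a document you would check `looks_garbled(text)` first and, if it
passes, hand the result of `build_update_request(doc_id, text)` to
`urllib.request.urlopen`.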

On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:

> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to examine the file with
> PDF-specific tools and change its encoding?
> Is there any way to configure this in Solr?
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafalov@gmail.com>
> wrote:
>
> > They could be using custom fonts and non-Unicode characters. That's
> > probably something to explore with PDF-specific tools.
> > On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinyeozl@gmail.com>
> > wrote:
> >
> > > I've used the Tika app to check all the files whose content has
> > > problems in the Solr index. All of them show the same issues that I
> > > see in the Solr index.
> > >
> > > So does the issue lie with the encoding of the file? Are we able to
> > > check the encoding of the file?
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > > wrote:
> > >
> > > > Hi Erik,
> > > >
> > > > I've shared the file on Dropbox, which you can access via this link:
> > > >
> > > > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > > >
> > > > This is what I get from the Tika app after dropping the file in.
> > > >
> > > > Content-Length: 75092
> > > > Content-Type: application/pdf
> > > > Type: COSName{Info}
> > > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > > X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > > access_permission:assemble_document: true
> > > > access_permission:can_modify: true
> > > > access_permission:can_print: true
> > > > access_permission:can_print_degraded: true
> > > > access_permission:extract_content: true
> > > > access_permission:extract_for_accessibility: true
> > > > access_permission:fill_in_form: true
> > > > access_permission:modify_annotations: true
> > > > dc:format: application/pdf; version=1.3
> > > > pdf:PDFVersion: 1.3
> > > > pdf:encrypted: false
> > > > producer: null
> > > > resourceName: Desmophen+670+BAe.pdf
> > > > xmpTPg:NPages: 3
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 17 December 2015 at 00:15, Erik Hatcher <erik.hatcher@gmail.com>
> > > > wrote:
> > > >
> > > >> Edwin - Can you share one of those PDF files?
> > > >>
> > > > >> Also, drop the file into the Tika app and see what it sees
> > > > >> directly - get the tika-app JAR and run that desktop application.
> > > >>
> > > >> Could be an encoding issue?
> > > >>
> > > >>         Erik
> > > >>
> > > >> —
> > > >> Erik Hatcher, Senior Solutions Architect
> > > > >> http://www.lucidworks.com
> > > >>
> > > >>
> > > >>
> > > > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo
> > > > >> > <edwinyeozl@gmail.com> wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > I'm using Solr 5.3.0
> > > >> >
> > > > >> > I'm indexing some PDF documents. Some of the PDF files contain
> > > > >> > Chinese text, but after indexing, the indexed content is either
> > > > >> > a series of "??????" or empty.
> > > >> >
> > > > >> > I'm using the post.jar that ships with Solr.
> > > > >> >
> > > > >> > What could be causing this?
> > > >> >
> > > >> > Regards,
> > > >> > Edwin
> > > >>
> > > >>
> > > >
> > >
> >
>
-- 
Regards,
Binoy Dalal
