any23-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Extraction of structure from non-XML based formats
Date Wed, 03 Jul 2013 19:50:36 GMT
Hi Chris,

BTW I should have said in my mail to Peter, sorry for taking ages to get
back. These emails come through as batches, so sometimes it can be days if
the lists are quiet... which they are.
Anyway,

On Wed, Jul 3, 2013 at 12:34 AM, <dev-digest-help@any23.apache.org> wrote:

>
>
> What about integrating Any23 into Tika -- which has a PDF parser,
> etc.? I'd be happy to try and help out wherever I can.
>

Yeah, I suppose this is the next logical step, Chris. The problem I see here,
though, is that with regard to trivial structured content such as schemas,
namespaces, etc., which I may add are completely useless for my purpose, I
have a feeling that I am kinda beating my head against a wall here.
Any23 extracts structured markup such as DC, LKIFCore, hListings, etc. None
of this structure is/will be available within my PDFs. This creates a
problem for me: it means that I cannot use most of the built-in extraction
implementations from Any23, which leaves me to code the stuff myself...
Thanks for chiming in on this one.
