stanbol-dev mailing list archives

From Stéphane Corlosquet <scorlosq...@gmail.com>
Subject Re: User story: Don't want to lose the semantic information I already have inside my CMS
Date Fri, 09 Nov 2012 21:44:26 GMT
On Fri, Nov 9, 2012 at 7:30 AM, Walter Kasper <kasper@dfki.de> wrote:

> Hi,
>
>
> Rupert Westenthaler wrote:
>
>> Hi Walter, all
>>
>> I already had a look at the htmlextractor and I think it is a nice
>> addition to Stanbol!
>>
>> I would be interested in an Engine that does not only extract embedded
>> knowledge, but also keeps the link to the actual position within the
>> parsed Content. In more detail I would like to link the extracted
>> knowledge with an fise:Enhancement (e.g. a fise:TextAnnotation) that
>> selects the annotated part of the content.
>>
>> This would not only put the extracted knowledge in the
>> metadata of the ContentItem, but also allow EnhancementEngines to
>> process that information in the same way as if it had been
>> extracted by another engine (e.g. linking an RDFa annotation about a
>> Person or Place in the same way as a Person or Place detected by an NER
>> engine).
>>
>
> I think that could be done.
>
>
>
>> Jukka Zitting's presentation "Content extraction with Apache Tika" [1]
>> at ApacheCon included a nice example of how to extract the text of
>> a link. I think this is a nice starting point for such a feature.
>>
>> Generally I think it would be better to add RDFa and Microdata support
>> directly to Tika instead of implementing custom solutions within
>> Stanbol. WDYT?
>>
>
> Tika is currently not suitable for RDFa extraction etc. because its HTML
> parser (TagSoup) throws away all namespace declarations needed for the RDF.
>

You might want to consider Any23 [1], another Apache project, which can
extract RDFa and other semantic markup from HTML. There are also some
independent RDFa parsers you can use in Java, such as [2].

Steph.

[1] http://any23.apache.org/extractors.html
[2] https://github.com/niklasl/clj-rdfa-jena

>
> Best regards,
>
> Walter
>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email: kasper@dfki.de
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>
>


-- 
Steph.
