stanbol-dev mailing list archives

From Rupert Westenthaler <>
Subject Re: User story: Don't want to lose the semantic information I already have inside my CMS
Date Fri, 09 Nov 2012 10:28:53 GMT
Hi Walter, all

I have already had a look at the htmlextractor and I think it is a nice
addition to Stanbol!

I would be interested in an Engine that not only extracts embedded
knowledge, but also keeps the link to the actual position within the
parsed content. In more detail, I would like to link the extracted
knowledge with a fise:Enhancement (e.g. a fise:TextAnnotation) that
selects the annotated part of the content.

This would not only make the extracted knowledge available in the
metadata of the ContentItem, but also allow EnhancementEngines to
process that information in the same way as if it had been
extracted by another engine (e.g. linking an RDFa annotation about a
Person or Place in the same way as a Person or Place detected by an NER
engine).
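
To make the idea more concrete, here is a minimal sketch of the kind of
data such an Engine would need to produce. It is plain Python using only
the standard library, not Stanbol or Tika code, and it assumes well-formed
HTML. The class and function names (RdfaSpanExtractor, extract_rdfa_spans)
and the dictionary keys are made up for illustration; the keys are only
meant to resemble the start, end and selected-text values that a
fise:TextAnnotation carries.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not tracked on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class RdfaSpanExtractor(HTMLParser):
    """Collects plain text and remembers where RDFa-annotated elements fall in it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []   # plain-text chunks seen so far
        self.length = 0    # number of characters extracted so far
        self.stack = []    # open elements as (tag, rdfa attributes, start offset)
        self.spans = []    # finished (rdfa attributes, start, end) triples

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        # Only two RDFa attributes are looked at in this sketch.
        rdfa = {k: attrs[k] for k in ("typeof", "property") if k in attrs}
        self.stack.append((tag, rdfa, self.length))

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        # Assumption: well-formed nesting; unmatched end tags are ignored.
        if not self.stack or self.stack[-1][0] != tag:
            return
        _, rdfa, start = self.stack.pop()
        if rdfa:
            self.spans.append((rdfa, start, self.length))


def extract_rdfa_spans(html):
    """Returns the extracted text plus one annotation dict per RDFa element."""
    parser = RdfaSpanExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    spans = [{"rdfa": rdfa, "start": start, "end": end,
              "selected_text": text[start:end]}
             for rdfa, start, end in parser.spans]
    return text, spans
```

Each returned dictionary selects a part of the extracted text by character
offsets, so a later step could attach the RDFa statements to that selection
and other engines could treat it like any other annotated mention.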

Jukka Zitting's presentation "Content extraction with Apache Tika" [1]
at ApacheCon included a nice example of how to extract the text of
a link. I think this is a nice starting point for such a feature.
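
For readers without the slides at hand, the following is a rough
stand-in for that kind of link extraction. It is not the code from the
presentation and does not use Tika (which is a Java library); it is a
small Python sketch on top of the standard library's html.parser, and
the names LinkTextExtractor and extract_links are invented here.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collects (href, link text) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []    # finished (href, text) pairs
        self.href = None   # href of the link currently open, if any
        self.parts = []    # text chunks seen inside the open link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave self.href as None and are skipped.
            self.href = dict(attrs).get("href")
            self.parts = []

    def handle_data(self, data):
        if self.href is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.parts).strip()))
            self.href = None


def extract_links(html):
    """Returns a list of (href, text) tuples in document order."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Text inside nested inline markup (e.g. a bold word within the link) is
included in the link text, because only the opening and closing of the
a element decide what is collected.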

Generally I think it would be better to add RDFa and Microdata support
directly to Tika instead of implementing custom solutions within
Stanbol. WDYT?


[1] Slide 19

On Thu, Nov 8, 2012 at 12:31 PM, Walter Kasper <> wrote:
> Hi Rüdiger,
> RDFa extraction from HTML is part of the htmlextractor engine in Stanbol.
> I would welcome it if you could test it with your OpenCms docs.
> Best regards,
> Walter
> Rüdiger Kurz wrote:
>> Hi Stanbolers,
>> during ApacheCon in Sinsheim I had some interesting conversations with
>> Fabian, Rupert and Anil. As a result I want to summarize one of the
>> discussions as a user story describing a typical requirement for us as a
>> CMS provider.
>> It is not correct to assume that traditional Content Management Systems
>> don't store semantic information. For example, CMS systems already
>> deliver RDFa-annotated HTML, and nearly all systems provide some
>> tagging/categorizing mechanism. OpenCms in particular provides a
>> generic approach to defining structured content, and therefore we have the
>> information that a specific field/item of a content has a specified type and
>> a defined label. E.g. "A technology event named ApacheCon takes place in
>> Sinsheim from 05. Nov until 08. Nov 2012" is information that is already
>> stored in OpenCms. Moreover, OpenCms is able to connect that event with all
>> speakers/persons that will give a presentation at that event, ...
>> What we would like to achieve is not only plain-text enhancement;
>> we are also interested in telling Stanbol all the information and
>> associations we already know. In other words, we absolutely don't want to
>> lose the semantic information that already exists in OpenCms.
>> A good starting point would be a REST endpoint providing the ability to
>> receive an RDFa-annotated HTML document and then extract the RDFa in order
>> to store it inside the semantic-index/entity-hub/... as I previously
>> suggested on the list under the subject "Extend stanbol content hub for RDFa
>> support". Maybe the content hub is not the right component, but the
>> requirement for RDFa extraction still exists.
> --
> Dr. Walter Kasper
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email:
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------

| Rupert Westenthaler   
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
