stanbol-dev mailing list archives

From Luca Dini <>
Subject CV Mining (Early adopter program)
Date Thu, 01 Mar 2012 14:44:27 GMT
Dear All,
Please let me introduce a new early adopter project in which we will be 
involved. I look forward to a productive and intellectually stimulating 
exchange with you all.
Kind regards,

The project (run by CELI under the umbrella of the IKS early adopter 
program) aims to integrate Stanbol technology into a specific context 
of use, namely CV management via CMS and semantic technologies. The 
crucial challenge of this integration is the parametrization of Stanbol 
to deal with information that has been automatically extracted from CVs. 
Besides the direct integration results, which will be distributed under 
the same conditions as the Stanbol software, the early adopter project 
will produce two additional by-products:

     The provision to Stanbol of classes allowing the connection with 
Linguagrid and possibly LanguageGrid.
     The verification of the extensibility of Stanbol to languages other 
than English (the project will concern CVs written in French).

We envisage two prototypical use cases, which are described below.

Use-Case 1: Human Resources Department

The context is that of the Human Resources department of a large company 
or of a recruitment company. The basic goal is to provide them with an 
open source document management system able to deal intelligently with 
unstructured CVs (or "resumes"), i.e. CVs which come in Microsoft Word, 
PDF, OpenOffice, etc. Each time a new CV arrives, it is inserted into 
the document base. Behind the scenes, this is not just adding a document 
but passing it to a Stanbol server, which enhances it with structured 
information.

This structured information might represent:

     experiences of the candidate
     skills of the candidate
     education level
     reference data (name, address, etc.)
     contact data

Some of these data might be slightly more structured than plain named 
entities, but they are well within the representational power of RDF. 
Some of them could be further enriched semantically by adding external 
information on companies, places, specific technologies, etc.
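
To make the enhancement step more concrete, here is a minimal Python 
sketch of how a CMS might hand the plain text of a CV to a Stanbol 
server over HTTP. The endpoint URL (a local instance at 
http://localhost:8080/enhancer), the requested RDF format and the helper 
name build_enhancement_request are assumptions made for illustration, 
not part of the project description. The request is only built, not 
sent, and the CV text is invented.

```python
from urllib.request import Request

# Assumption: a Stanbol instance reachable locally; adjust to your deployment.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancement_request(cv_text: str, url: str = ENHANCER_URL) -> Request:
    """Build (but do not send) a POST request carrying the plain text of a CV.

    The Accept header asks for the enhancements as RDF/XML; which formats a
    given server offers is an assumption to check against its documentation.
    """
    return Request(
        url,
        data=cv_text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


# Invented sample text standing in for the text extracted from a French CV.
request = build_enhancement_request("Ingénieur Java, Paris, 2009-2011")
print(request.get_method(), request.full_url)
```

Sending the request (for instance with urllib.request.urlopen) would 
return the enhancements, which the CMS could then store alongside the 
original document.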

As a result, personnel at the HR department would be able to formulate 
queries such as the following (examples only):

     All CVs of people living in Paris who are older than 27
     All CVs of people with skills in SQL Server and Java
     All people who have worked in a high-tech company since November 2011.
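
The example queries above can be sketched over a small in-memory model. 
This is only an illustration in Python: the CV and Job classes, their 
field names and the sample records are invented, and "since November 
2011" is read here as "held a high-tech job at some point on or after 
1 November 2011", which is one possible interpretation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set


@dataclass
class Job:
    company: str
    sector: str
    start: date
    end: Optional[date] = None  # None means the job is ongoing


@dataclass
class CV:
    name: str
    city: str
    birth_date: date
    skills: Set[str] = field(default_factory=set)
    jobs: List[Job] = field(default_factory=list)


def age(cv: CV, today: date) -> int:
    """Age in full years on the given day."""
    had_birthday = (today.month, today.day) >= (cv.birth_date.month, cv.birth_date.day)
    return today.year - cv.birth_date.year - (0 if had_birthday else 1)


def worked_in_sector_since(cv: CV, sector: str, since: date) -> bool:
    """True if any job in the sector was still running on or after `since`."""
    return any(
        job.sector == sector and (job.end is None or job.end >= since)
        for job in cv.jobs
    )


# Invented sample data.
cvs = [
    CV("Alice", "Paris", date(1980, 5, 10), {"Java", "SQL Server"},
       [Job("Acme Tech", "high tech", date(2010, 1, 1))]),
    CV("Bruno", "Lyon", date(1975, 2, 2), {"Java"},
       [Job("Banque X", "finance", date(2005, 3, 1), date(2011, 6, 30))]),
    CV("Chloé", "Paris", date(1988, 9, 1), {"Java", "SQL Server", "Python"},
       [Job("Nano Systems", "high tech", date(2008, 1, 1), date(2011, 10, 15))]),
]

today = date(2012, 3, 1)  # fixed reference date so the results are reproducible

paris_over_27 = [cv for cv in cvs if cv.city == "Paris" and age(cv, today) > 27]
java_and_sql = [cv for cv in cvs if {"Java", "SQL Server"} <= cv.skills]
high_tech_recent = [
    cv for cv in cvs if worked_in_sector_since(cv, "high tech", date(2011, 11, 1))
]

print([cv.name for cv in paris_over_27])
print([cv.name for cv in java_and_sql])
print([cv.name for cv in high_tech_recent])
```

In the envisaged system the same questions would be asked against the 
structured data produced by the enhancement step rather than against 
hand-built objects.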


In terms of GUI, the user will be presented with a system that allows 
easy search and easy population of CV data.

Use-Case 2: Employment Administration

In the second use case we take into account the needs of public 
agencies with the institutional role of reintegrating into the labor 
market people who have lost their job or who are looking for their first 
job. In particular we are considering institutions such as the French 
Pôle emploi. This institution is in charge of matching supply and demand 
on the labor market, in particular by directing candidates to the right 
potential employer, suggesting possible training, helping them shape 
their skills, etc. In many cases these agencies are managed at a local 
rather than a national level, as the labor market is affected by 
regional constraints. In this use case the parametrized CMS has a 
twofold goal:

     Much like in the previous case, to allow the fast and intelligent 
retrieval of CVs from the document base in order to answer the needs of 
potential employers.
     To be able to perform Business Intelligence-like tasks over the 
structured information provided by the mass of analyzed CVs. Of course 
performing BI analysis is outside the scope of this proposal, but the 
structuring of CV information into ontology-based classes is definitely 
the first step in this direction.


 From a technical point of view, the most interesting challenge consists 
in integrating the set of Stanbol enhancers with the semantic web 
services mentioned above. In principle, this integration should not 
differ from what has already been done with the OpenCalais and Zemanta 
web services. However, there are at least three major challenges:

     Multilinguality. The extraction will consider French documents 
rather than English ones. Moreover, in a second phase (not covered by 
the present project), the whole system could be extended to Italian and 
     Ontological extension. While CVs typically contain quite a lot of 
named entities which are already covered by Stanbol (e.g. geographical 
names, time expressions, company names, person names), there are 
entities, such as skills and education, which will need some ontology 
extension.
     Structural complexity. In a CV, instances of entities are linked to 
each other in a structurally complex way. For instance, places are not 
just a flat list of geographical entities: they are likely to be 
connected with periods, with job types, with companies, etc. Handling 
this structural complexity represents an important challenge.
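
The structural complexity point can be illustrated with a small Python 
sketch in which each work experience is a node tying together a place, a 
period, a job type and a company, instead of those values sitting in 
separate flat lists. The identifiers (cv:1, exp:1), the property names 
(hasExperience, atCompany, inPlace, ...) and the sample values are 
hypothetical and do not come from any existing ontology.

```python
from collections import defaultdict

# Hypothetical subject-predicate-object triples describing one CV.
triples = [
    ("cv:1", "hasExperience", "exp:1"),
    ("exp:1", "atCompany", "Exemple SA"),
    ("exp:1", "inPlace", "Grenoble"),
    ("exp:1", "jobType", "software engineer"),
    ("exp:1", "startDate", "2008-01"),
    ("exp:1", "endDate", "2010-06"),
    ("cv:1", "hasExperience", "exp:2"),
    ("exp:2", "atCompany", "Autre SARL"),
    ("exp:2", "inPlace", "Paris"),
    ("exp:2", "jobType", "project manager"),
    ("exp:2", "startDate", "2010-07"),
]


def describe(subject, triples):
    """Collect every property of one node as {predicate: [objects]}."""
    props = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            props[p].append(o)
    return dict(props)


def experiences(cv_id, triples):
    """Return one description per experience node attached to the CV."""
    return [
        describe(o, triples)
        for s, p, o in triples
        if s == cv_id and p == "hasExperience"
    ]


for exp in experiences("cv:1", triples):
    print(exp["inPlace"], exp["atCompany"], exp.get("endDate", ["ongoing"]))
```

A flat list of places would only say that the CV mentions Grenoble and 
Paris; keeping the links makes it possible to say which job, company and 
period each place belongs to.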

Luca Dini

12-14 rue Claude Genin
38000 Grenoble

33 Avenue Philippe Auguste
75011 Paris

tel: 00 33 476 24 23 80

