stanbol-dev mailing list archives

From Luca Dini <>
Subject Re: CV Mining (Early adopter program)
Date Fri, 02 Mar 2012 09:53:36 GMT
Hi Rupert,
thanks for all your observations. My comments in the body of the message.
On 01/03/2012 23:14, Rupert Westenthaler wrote:
> Hi Luca
> A really interesting scenario.
> On Thu, Mar 1, 2012 at 3:44 PM, Luca Dini<>  wrote:
>>     The provision to Stanbol of classes allowing the connection with
>> Linguagrid ( and possibly LanguageGrid
>> (
>>     The verification of the extensibility of Stanbol to languages other than
>> English (the project will concern CVs written in French).
> Ok, this answers my question from the other email. Can you maybe provide
> some additional information (links) about these services? What is the
> license of Language Grid? I was not able to find information related
> to that.
The reason is that licensing varies according to the service provider. As 
you have seen, we are not the only providers via Linguagrid. As far as 
our services are concerned, they are open access but not open source. In 
short, this means:
1) unlimited access for research/educational purposes, with support for 
integration etc.;
2) free access for "commercial purposes", with no service-level guarantee;
3) paying access (subscription or pay-per-use) if some service-level 
guarantee is needed. Prices vary, of course, depending on volumes, 
constraints, response time, etc.

Concerning Stanbol, as IKS is a research project, we are willing to 
give unlimited access to all Stanbol instances. Of course, the 
limitation is the computational power of the Amazon WS 
instances where Linguagrid and the related services are hosted. In case 
of massive adoption and the need to activate many instances (they 
have a cost), we would be forced to impose some kind of fee. But this is 
a future scenario, as currently Linguagrid seems to scale rather well.

>> The basic goal is to provide them with an open
>> source document management system able to deal in an intelligent way with
>> non-structured CVs (or "resumes"), i.e. CVs which come in Microsoft Word,
>> PDF, Open Office, etc.
> Apache Stanbol now has two EnhancementEngines for processing non-plain-text
> documents:
> * MetaxaEngine (mainly based on
> * TikaEngine (Apache Tika)
> Therefore the kinds of documents you mentioned should be supported by Stanbol.
>> This might represent:
>>     experiences of the candidate
>>     skills of the candidate
>>     Education level
>>     reference data (name, address etc.)
>>     contact data
>> Some of these data might be slightly more structured than just named
>> entities, but definitely within the representational power of RDF. Some of
>> them could be even more semantically enriched by providing external
>> information on companies, places, specific technologies, etc.
> It is very easy to import data that are available as RDF into Stanbol
> and use them for Entity Extraction and Linking. There is also support
> for importing existing vCard files. Such data are converted to RDF by
> using the schema.
As Oliver said, I think that the crucial thing will be to identify the 
right reference schema. In some cases (e.g. skills) I guess that we will 
be forced to take a mixed approach, as there is no standard vocabulary 
for representing them.
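The vCard import Rupert mentions can be illustrated with a minimal sketch: a flat vCard-like record mapped to RDF-style (subject, predicate, object) triples. The namespace, field names, and URIs below are purely hypothetical, not Stanbol's actual mapping.

```python
# Minimal sketch: mapping a flat vCard-like record to RDF-style triples.
# The namespace and predicate names are hypothetical, not Stanbol's mapping.

def vcard_to_triples(subject_uri, record):
    """Turn {field: value} pairs into (subject, predicate, object) triples."""
    ns = "http://example.org/vcard#"  # hypothetical namespace
    return [(subject_uri, ns + field, value) for field, value in record.items()]

triples = vcard_to_triples(
    "http://example.org/cv/1",
    {"fn": "Jane Doe", "locality": "Paris"},
)
# Each triple could then feed an RDF store used for entity linking.
```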

>> As a result, personnel at the HR department would be able to
>> formulate queries such as (just an exemplification):
>>     All CVs of people living in Paris older than 27 years
>>     All CVs of people with skills in SQL Server and Java
>>     All people who have worked in a high-tech company since November 2011.
> Do you plan to use the Apache Contenthub for Semantic Search, or does
> the CMS you use already support such kinds of searches?
On this matter, I will write a separate email. Basically, the answer is 
that we are open to suggestions.
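The example queries above can be read as filters over structured CV metadata. A minimal sketch in Python, with purely hypothetical field names and values:

```python
# Illustrative only: two of the HR queries expressed as filters over
# structured CV records. Field names and values are hypothetical.

cvs = [
    {"name": "A", "city": "Paris", "age": 31, "skills": {"SQL Server", "Java"}},
    {"name": "B", "city": "Lyon", "age": 25, "skills": {"Java"}},
]

# "All CVs of people living in Paris older than 27 years"
q1 = [cv for cv in cvs if cv["city"] == "Paris" and cv["age"] > 27]

# "All CVs of people with skills in SQL Server and Java"
q2 = [cv for cv in cvs if {"SQL Server", "Java"} <= cv["skills"]]
```

In practice such filters would be translated into queries against a semantic index rather than run over in-memory records.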
>> Challenges
>>  From a technical point of view, the most interesting challenge consists in
>> integrating the set of Stanbol enhancers with the semantic web services
>> provided at In principle, it should not be a different
>> integration than what has already been done with the OpenCalais WS and
>> Zemanta WS. However, there are at least two major challenges:
>>     Multilinguality. The extraction will consider French documents rather
>> than English ones. Moreover, in a second phase (not covered by the present
>> project), the whole system could be extended to Italian and French.
> Stanbol already supports multilingual scenarios nicely. The LangId
> engine can be used to detect the language of a document (internally it
> uses Apache Tika) and stores the detected language in the metadata.
> Other engines can use this language for further processing.
That's great: probably my consideration of multilinguality as a 
challenge was due to the fact that most of the integrated linguistic 
engines were dealing with English. I was also wondering if the 
strategies for matching a given named entity with, e.g., a DBpedia URL 
are completely language independent.
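The detect-then-route behaviour Rupert describes can be sketched roughly as follows. The detector here is a naive stub (Stanbol's LangId engine relies on Apache Tika for the real thing), and the engine names are invented:

```python
# Rough sketch of language-based routing: a detected language tag selects
# language-specific processing. The detector is a naive stand-in for a
# real identifier; engine names are hypothetical.

def detect_language(text):
    """Toy detector: looks for a few common French function words."""
    french_markers = {"le", "la", "les", "des", "une", "et"}
    return "fr" if set(text.lower().split()) & french_markers else "en"

ENGINES_BY_LANG = {  # hypothetical per-language engine configuration
    "en": "ner-english",
    "fr": "ner-french",
}

lang = detect_language("le candidat a une solide expérience en Java")
engine = ENGINES_BY_LANG[lang]
```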
> When dealing with French, you might want to update the configuration of
> the SolrCore used to store the controlled vocabulary with French-specific
> configurations such as stop words, stemmers, etc. This will
> improve the results of the NamedEntityTaggingEngine and the
> KeywordLinkingEngine.
I understand this for the KeywordLinkingEngine, but not completely for 
the NamedEntityTaggingEngine. In our view, we will have to integrate a 
new French/Italian NamedEntityTaggingEngine which will handle stop words 
and all other language-related aspects internally. But this belief 
might just be due to the fact that our knowledge of the whole system is 
still limited.

>>     Ontological extension. While CVs typically contain quite a lot of named
>> entities which are already covered by Stanbol (e.g. geographical names, time
>> expressions, company names, person names), there are entities which will
>> need some ontology extension, such as skills and education.
>>     Structural complexity. In a CV, instances of entities are linked to each
>> other in a structurally complex way. For instance, places are not just a flat
>> list of geographical entities; they are likely to be connected with
>> periods, with job types, with companies, etc. Handling this structural
>> complexity represents an important challenge.
> This might indeed be a challenge. I would start by splitting up the
> content into smaller pieces (e.g. sentences) and trying to group entities
> extracted from such parts.
> If you then build a semantic index that stores such pieces as their own
> documents, even searches for a job type at a specific company could
> work quite nicely.
We will follow the approach you describe: if I understand correctly, you 
propose to use atomic pieces of information (e.g. an experienceLine) 
as a kind of document, in such a way that it is possible to formulate 
queries such as "all documents of type experienceLine which contain a job 
X and a company Y", right?
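A minimal sketch of this fragment-as-document idea, with hypothetical field names: each experienceLine becomes its own small record, so a query can require a job and a company within the same fragment.

```python
# Sketch: each "experienceLine" of a CV indexed as its own small document,
# so a search can require a job and a company in the same fragment.
# Field names and values are hypothetical.

fragments = [
    {"type": "experienceLine", "job": "developer",
     "company": "Acme", "period": "2009-2011"},
    {"type": "experienceLine", "job": "project manager",
     "company": "Globex", "period": "2011-2012"},
]

def search(fragments, job, company):
    """All documents of type experienceLine containing job X and company Y."""
    return [f for f in fragments
            if f["type"] == "experienceLine"
            and f["job"] == job
            and f["company"] == company]

hits = search(fragments, "developer", "Acme")
```

In a real deployment each fragment would be a document in the semantic index (e.g. a Solr core) rather than an in-memory dict.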

> Such a system would not really "understand" the structural complexity,
> but it should still be able to present users with good search results.
> best
> Rupert

Luca Dini

12-14 rue Claude Genin
38000 Grenoble

33 Avenue Philippe Auguste
75011 Paris

tel: 00 33 476 24 23 80

