stanbol-dev mailing list archives

From Rupert Westenthaler <rupert.westentha...@gmail.com>
Subject Re: Entityhub : get all composed terms
Date Mon, 13 Jun 2011 14:15:49 GMT
On Mon, Jun 13, 2011 at 2:14 PM, Florent André <florent@apache.org> wrote:
> Thanks for this detailed explanation.
>
> Both usecases (with or without tokeniser) have justifications and usages
> depending on the situation.
>
> As a workaround,
> - I first tried a regex request like "^Apache$", but this doesn't work
> because, if I remember correctly, SolrYard doesn't accept regex requests. Is
> that true?

That's true. I think there are some RegexSearchers for Lucene, but I do
not know how to use them from Solr. I will have another look after
switching to the newest version of Solr.
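
Until then, an anchored pattern can be applied on the client side to the
labels returned by the query. A minimal sketch, assuming the labels have
already been extracted from the results as plain strings; ExactLabelFilter
and filter are hypothetical names, not part of the Entityhub API, and the
case-insensitive matching is an assumption about the desired behaviour:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanbol: keeps only the labels that
// match a regular expression such as "^Apache$" in full.
class ExactLabelFilter {

    static List<String> filter(List<String> labels, String regex) {
        // Assumption: matching should ignore case, also for non-ASCII letters
        Pattern pattern = Pattern.compile(regex,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<String> matches = new ArrayList<>();
        for (String label : labels) {
            // matches() requires the whole label to match the pattern
            if (pattern.matcher(label).matches()) {
                matches.add(label);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("Apache", "Apache fondation", "Apache bylaw");
        System.out.println(filter(labels, "^Apache$"));
    }
}
```

This only post-filters what the tokenized search already returned, so the
query limit has to be large enough to include the exact match.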

>
> - I set up a loop that tests each retrieved entity and selects just the
> exactly matching one.
>
> IMO, the choice between the untokenised and the tokenised version would be
> better made in the code and not during indexing.
> For example, one could use
> if (getUntokenisedResults("Westenthaler") == null){
> getTokenisedResults("Westenthaler")}

The problem is that I can only search fields that are indexed. So to
allow both tokenized and un-tokenized searches, both versions need
to be indexed in two different fields.
If both variants were available, I would rather add this feature
by adding a new option to the text constraint.
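
On the client side, the fallback could then look roughly like the sketch
below. It assumes both variants are indexed and reachable through two search
functions; `untokenized` and `tokenized` are placeholders, not existing
Entityhub methods. An empty result (not only null) is treated as "nothing
found".

```java
import java.util.List;
import java.util.function.Function;

// Sketch only: the two functions stand in for queries against an
// un-tokenized and a tokenized field; neither exists in the Entityhub today.
class FallbackSearch {

    static List<String> search(String term,
            Function<String, List<String>> untokenized,
            Function<String, List<String>> tokenized) {
        List<String> exact = untokenized.apply(term);
        // Prefer exact matches; use token matches only if there are none
        if (exact != null && !exact.isEmpty()) {
            return exact;
        }
        return tokenized.apply(term);
    }
}
```

With this, "Apache" would return only the exact label when one exists, while
"Westenthaler" would still fall through to "Rupert Westenthaler".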

>
> Just an idea...
>
> I'm not sure I understand your last sentence well:
>> However I could imagine that this would require a lot of changes to
>> the current code, because currently the code assumes that only one of
>> language and data type is present at the same time.

This means that adding this feature would require a lot of changes in
the SolrYard implementation.

>
> For now, the SKOS I use is only in FR, but we plan to add EN information
> to it.

If you do not pass a language to the TextConstraint, it will search all
languages. If you pass languages, it will limit the search to
prefLabels in those languages.
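
The intended behaviour can be modelled as follows. This is a self-contained
sketch of the semantics only: LabelIndex and find are made-up names, and the
real filtering is done by the SolrYard, not by code like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of one concept's prefLabels, keyed by language tag.
class LabelIndex {

    private final Map<String, String> labels = new LinkedHashMap<>();

    LabelIndex add(String lang, String label) {
        labels.put(lang, label);
        return this;
    }

    // No languages given: labels of every language are searched.
    // Languages given: only labels in those languages are considered.
    List<String> find(String text, String... languages) {
        List<String> allowed = Arrays.asList(languages);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : labels.entrySet()) {
            boolean languageOk = allowed.isEmpty() || allowed.contains(entry.getKey());
            if (languageOk && entry.getValue().equalsIgnoreCase(text)) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }
}
```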

> Would it be possible to index and retrieve entities for both languages,
> or not?

If the provided SKOS files define labels in multiple languages, they
will be indexed.

Here is an example taken from the IPTC subject codes:

    <rdf:Description rdf:about="http://cv.iptc.org/newscodes/subjectcode/01002000">
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
        <skos:prefLabel xml:lang="de">Architektur</skos:prefLabel>
        <skos:prefLabel xml:lang="it">Architettura</skos:prefLabel>
        <skos:prefLabel xml:lang="es">arquitectura</skos:prefLabel>
        <skos:prefLabel xml:lang="fr">Architecture</skos:prefLabel>
        <skos:prefLabel xml:lang="en-GB">architecture</skos:prefLabel>
        <skos:definition xml:lang="de">Entwurf von Gebäuden, Denkmälern und deren Umgebung.</skos:definition>
        <skos:definition xml:lang="es">Diseño de edificios, monumentos y espáciosalrededor de ellos</skos:definition>
        <skos:definition xml:lang="it">Ideazione e progettazione di edifici, monumenti e degli spazi loro circostanti</skos:definition>
        <skos:definition xml:lang="en-GB">Designing of buildings, monuments and the spaces around them</skos:definition>
        <skos:definition xml:lang="fr">Conception des immeubles, des monuments et des espaces qui les entourent</skos:definition>

        <skos:broaderTransitive>
            <rdf:Description rdf:about="http://cv.iptc.org/newscodes/subjectcode/01000000">
                <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
            </rdf:Description>
        </skos:broaderTransitive>

    </rdf:Description>
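
To see which language tags such a file carries, the prefLabels can be read
with the JDK's DOM parser. This is only an illustrative sketch: it is not how
the Entityhub indexer works, PrefLabelReader is a hypothetical name, and it
assumes the snippet is wrapped in an rdf:RDF root element that declares the
rdf and skos namespaces. It keeps one label per language, which is enough for
a single concept.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects skos:prefLabel values by xml:lang.
class PrefLabelReader {

    static final String SKOS = "http://www.w3.org/2004/02/skos/core#";

    static Map<String, String> prefLabels(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Needed so that skos:prefLabel and xml:lang resolve by namespace
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(SKOS, "prefLabel");
            Map<String, String> byLanguage = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element label = (Element) nodes.item(i);
                String lang = label.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                byLanguage.put(lang, label.getTextContent());
            }
            return byLanguage;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RDF/XML", e);
        }
    }
}
```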

>
> Thanks for this really great enhancement.
> Using SolrYard is so much faster than using a D2RQ link. For the same (pretty
> big) document:
> - 22 seconds for SolrYard
> - 2/3 min for D2RQ

great to hear!

best
Rupert Westenthaler
>
> ++
>
> On 06/10/2011 07:44 PM, Rupert Westenthaler wrote:
>>
>> Hi
>>
>> Good Question ...
>>
>> Text Fields are indexed by using tokenizers in Solr. Therefore a
>> search for "Apache" will find all documents (entities) that have this
>> token for the skos:prefLabel field. This is the reason why you also
>> get "Apache fondation", "Apache bylaw", etc... even if the PatternType
>> is set to "none".
>> As far as I know the only way around this is to deactivate any
>> tokenizers for such fields. However, without a tokenizer a query for
>> "Westenthaler" would not return "Rupert Westenthaler", which would also
>> be seen as strange by a lot of users.
>>
>> To deactivate Tokenizers for a natural language field one needs to
>> modify the solr schema (schema.xml). Having both (tokenized and
>> un-tokenized) versions is currently not possible.
>>
>> Here are the necessary additions to the schema.xml to deactivate
>> tokenizing for the skos:prefLabel
>>
>> To get this you would need to add
>>    <!-- one field for each language -->
>>    <field name="@en/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
>>    <field name="@de/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
>>    <field name="@it/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
>>    <field name="@fr/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
>>    <field name="@/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
>>    <!-- used for multi lingual searches -->
>>    <field name="_!@/skos:prefLabel/" type="lowercase" indexed="true" stored="false" multiValued="true"/>
>>
>> If this is a frequently needed feature, I could modify the SolrYard to use
>> suffixes for languages. This would allow indexing multiple versions of
>> natural language texts with different prefixes. The prefixes would
>> then indicate whether a tokenizer should be used or not.
>>
>> However I could imagine that this would require a lot of changes to
>> the current code, because currently the code assumes that only one of
>> language and data type is present at the same time.
>>
>> best
>> Rupert Westenthaler
>>
>> On Fri, Jun 10, 2011 at 11:07 AM, florent andré
>> <florent.andre-dev@4sengines.com>  wrote:
>>>
>>> Hi Rupert, *,
>>>
>>> As promised in Berlin, I have a question for you! :)
>>>
>>> I have this query :
>>>
>>> FieldQuery query = site.getQueryFactory().createFieldQuery();
>>>
>>>                query.setConstraint(NamespaceEnum.skos + "prefLabel",
>>>                                new TextConstraint(signToFind));
>>>
>>>                query.addSelectedField(NamespaceEnum.skos + "related");
>>>                query.addSelectedField(NamespaceEnum.skos + "narrower");
>>>                query.addSelectedField(NamespaceEnum.skos + "broader");
>>>                query.addSelectedField(NamespaceEnum.skos + "inScheme");
>>>
>>>                query.setLimit(this.numSuggestions);
>>>
>>>
>>> When signToFind is a one-word term, e.g. "Apache", I get all composed
>>> terms that contain this word, e.g. "Apache fondation", "Apache bylaw", etc.
>>>
>>> That could be interesting in some cases, but not always.
>>>
>>> As I read in your documentation, there is:
>>> - patternType: one of "wildcard", "regex" or "none" (default is "none")
>>>
>>> As I don't define a pattern type, it should be "none", so it should be
>>> strict matching, right?
>>>
>>> So in this case I should get only the entity matching the one-word term,
>>> or am I missing something?
>>>
>>>
>>> Thanks
>>> ++
>>>
>>
>>
>>
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
