stanbol-dev mailing list archives

From Florent André <flor...@apache.org>
Subject Re: Entityhub : get all composed terms
Date Mon, 13 Jun 2011 12:14:25 GMT
Thanks for this detailed explanation.

Both use cases (with or without a tokenizer) are justified, depending on 
the situation.

To work around this:
- I first tried a regex request such as "^Apache$", but this did not work 
because, if I remember correctly, SolrYard does not accept regex 
requests. Is that true?

- I then set up a loop that tests each retrieved entity and keeps only 
the exact match.
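
A minimal sketch of that loop, assuming the prefLabel values have already been read from the query results into plain strings. The class and method names are made up for illustration and are not Entityhub API; comparing case-insensitively is also an assumption, chosen to mirror the "lowercase" field type in the schema quoted below.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Entityhub API.
final class ExactLabelFilter {

    // Keeps only the labels equal to the search term, ignoring case and
    // surrounding whitespace (assumption: labels are plain strings).
    static List<String> exactMatches(String term, List<String> labels) {
        String wanted = term.trim();
        List<String> matches = new ArrayList<>();
        for (String label : labels) {
            if (label != null && label.trim().equalsIgnoreCase(wanted)) {
                matches.add(label);
            }
        }
        return matches;
    }
}
```

With the labels "Apache", "Apache fondation" and "Apache bylaw", a search for "Apache" keeps only the first one. The cost is that every tokenised hit still has to be fetched before it can be discarded.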

IMO, the choice between the untokenised and tokenised versions would be 
better made in the code rather than at indexing time.
For example, one could use:
if (getUntokenisedResults("Westenthaler") == null) {
    getTokenisedResults("Westenthaler");
}

Just an idea...
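
Spelled out a bit more, the fallback could look like the sketch below. It is only an illustration: neither lookup exists in the current code, so both are passed in as functions, and the names FallbackSearch and search are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: the two lookups are parameters because no such
// pair of methods exists in the current Entityhub code.
final class FallbackSearch {

    // Runs the strict (untokenised) lookup first and falls back to the
    // tokenised lookup only if the strict one returns null or nothing.
    static List<String> search(String term,
                               Function<String, List<String>> untokenised,
                               Function<String, List<String>> tokenised) {
        List<String> strict = untokenised.apply(term);
        if (strict != null && !strict.isEmpty()) {
            return strict;
        }
        return tokenised.apply(term);
    }
}
```

This way "Apache" would return only the exact entity when one exists, while "Westenthaler" would still find "Rupert Westenthaler" through the tokenised lookup.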

I'm not sure I fully understand your last sentence:
 > However I could imagine that this would require a lot of changes to
 > the current code, because currently the code assumes that only one of
 > language and data type is present at the same time.

For now, the SKOS data I use is only in FR, but we plan to add EN 
information to it.
Would it be possible to index and retrieve entities for both languages?

Thanks for this really great enhancement.
Using SolrYard is much faster than using a D2RQ link. For the same 
(pretty big) document:
- 22 seconds for SolrYard
- 2 to 3 minutes for D2RQ

++

On 06/10/2011 07:44 PM, Rupert Westenthaler wrote:
> Hi
>
> Good Question ...
>
> Text Fields are indexed by using tokenizers in Solr. Therefore a
> search for "Apache" will find all documents (entities) that have this
> token for the skos:prefLabel field. This is the reason why you also
> get "Apache fondation", "Apache bylaw", etc... even if the PatternType
> is set to "none".
> As far as I know the only way around this is to deactivate any
> tokenizers for such a field. However, without a tokenizer a query for
> "Westenthaler" would not return "Rupert Westenthaler", which would also
> be seen as strange by a lot of users.
>
> To deactivate Tokenizers for a natural language field one needs to
> modify the solr schema (schema.xml). Having both (tokenized and
> un-tokenized) versions is currently not possible.
>
> Here are the additions to schema.xml needed to deactivate tokenizing
> for the skos:prefLabel field:
>
>     <!-- one field for each language -->
>     <field name="@en/skos:prefLabel/" type="lowercase" indexed="true"
>            stored="true" multiValued="true"/>
>     <field name="@de/skos:prefLabel/" type="lowercase" indexed="true"
>            stored="true" multiValued="true"/>
>     <field name="@it/skos:prefLabel/" type="lowercase" indexed="true"
>            stored="true" multiValued="true"/>
>     <field name="@fr/skos:prefLabel/" type="lowercase" indexed="true"
>            stored="true" multiValued="true"/>
>     <field name="@/skos:prefLabel/" type="lowercase" indexed="true"
>            stored="true" multiValued="true"/>
>     <!-- used for multi lingual searches -->
>     <field name="_!@/skos:prefLabel/" type="lowercase" indexed="true"
>            stored="false" multiValued="true"/>
>
> If this is a frequently needed feature I could modify the SolrYard to
> use suffixes for languages. This would allow indexing multiple versions
> of natural language texts with different prefixes. The prefixes would
> then indicate whether a tokenizer should be used or not.
>
> However I could imagine that this would require a lot of changes to
> the current code, because currently the code assumes that only one of
> language and data type is present at the same time.
>
> best
> Rupert Westenthaler
>
> On Fri, Jun 10, 2011 at 11:07 AM, florent andré
> <florent.andre-dev@4sengines.com>  wrote:
>> Hi Rupert, *,
>>
>> As promised in Berlin, I have a question for you! :)
>>
>> I have this query :
>>
>> FieldQuery query = site.getQueryFactory().createFieldQuery();
>>
>> query.setConstraint(NamespaceEnum.skos + "prefLabel",
>>         new TextConstraint(signToFind));
>>
>> query.addSelectedField(NamespaceEnum.skos + "related");
>> query.addSelectedField(NamespaceEnum.skos + "narrower");
>> query.addSelectedField(NamespaceEnum.skos + "broader");
>> query.addSelectedField(NamespaceEnum.skos + "inScheme");
>>
>> query.setLimit(this.numSuggestions);
>>
>>
>> When signToFind is a one-word term, e.g. "Apache", I get all composed
>> terms that contain this word, e.g. "Apache fondation", "Apache bylaw", etc.
>>
>> That could be interesting in some cases, but not always.
>>
>> As I read in your documentation, there is:
>> - patternType: one of "wildcard", "regex" or "none" (default is "none")
>>
>> As I don't define a pattern type, it should be "none", so the matching
>> should be strict, right?
>>
>> So in this case I should get only entities matching the one-word term,
>> or am I missing something?
>>
>>
>> Thanks
>> ++
>>
>
>
>
