stanbol-dev mailing list archives

From Rupert Westenthaler <rupert.westentha...@gmail.com>
Subject Re: Entityhub : get all composed terms
Date Fri, 10 Jun 2011 17:44:46 GMT
Hi

Good Question ...

Text fields are indexed by using tokenizers in Solr. Therefore a
search for "Apache" will find all documents (entities) that have this
token in the skos:prefLabel field. This is the reason why you also
get "Apache fondation", "Apache bylaw", etc., even if the PatternType
is set to "none".
As far as I know, the only way to work around this is to deactivate any
tokenizers for such a field. However, without a tokenizer a query for
"Westenthaler" would not return "Rupert Westenthaler", which would also
be seen as strange by a lot of users.
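
To make the difference described above concrete, here is a minimal
sketch in plain Java. It does not use Solr or the Entityhub at all: it
only imitates a whitespace tokenizer with lowercasing versus a field
whose whole value is kept as a single lowercase term. The class and
method names (TokenizedMatchSketch, matchesTokenized,
matchesUntokenized, search) are hypothetical and exist only for this
illustration; real Solr analyzers do considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: imitates tokenized vs. untokenized label matching.
// None of these names are part of Solr or the Stanbol Entityhub.
final class TokenizedMatchSketch {

    // Tokenized field: the label is split on whitespace into lowercase
    // tokens, and the query matches if it equals any single token.
    static boolean matchesTokenized(String label, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        String[] tokens = label.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return Arrays.asList(tokens).contains(q);
    }

    // Untokenized field: the whole label is one lowercase term, so the
    // query only matches if it equals the complete label.
    static boolean matchesUntokenized(String label, String query) {
        return label.toLowerCase(Locale.ROOT)
                .equals(query.toLowerCase(Locale.ROOT));
    }

    // Returns the labels that match the query under the chosen mode.
    static List<String> search(List<String> labels, String query,
                               boolean tokenized) {
        List<String> hits = new ArrayList<>();
        for (String label : labels) {
            boolean match = tokenized
                    ? matchesTokenized(label, query)
                    : matchesUntokenized(label, query);
            if (match) {
                hits.add(label);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList(
                "Apache", "Apache fondation", "Apache bylaw",
                "Rupert Westenthaler");

        // prints [Apache, Apache fondation, Apache bylaw]
        System.out.println(search(labels, "Apache", true));
        // prints [Apache]
        System.out.println(search(labels, "Apache", false));
        // prints [Rupert Westenthaler]
        System.out.println(search(labels, "Westenthaler", true));
        // prints []
        System.out.println(search(labels, "Westenthaler", false));
    }
}
```

The last two calls show the trade-off: once the field is no longer
tokenized, "Apache" matches only the exact label, but "Westenthaler"
no longer finds "Rupert Westenthaler".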

To deactivate Tokenizers for a natural language field one needs to
modify the solr schema (schema.xml). Having both (tokenized and
un-tokenized) versions is currently not possible.

Here are the necessary additions to the schema.xml to deactivate
tokenizing for skos:prefLabel. You would need to add:

   <!-- one field for each language -->
   <field name="@en/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@de/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@it/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@fr/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <!-- used for multi lingual searches -->
   <field name="_!@/skos:prefLabel/" type="lowercase" indexed="true" stored="false" multiValued="true"/>

If this is a frequently requested feature, I could modify the SolrYard
to use suffixes for languages. This would allow indexing multiple
versions of natural language texts with different prefixes. The
prefixes would then indicate whether a tokenizer should be used or not.

However I could imagine that this would require a lot of changes to
the current code, because currently the code assumes that only one of
language and data type is present at the same time.

best
Rupert Westenthaler

On Fri, Jun 10, 2011 at 11:07 AM, florent andré
<florent.andre-dev@4sengines.com> wrote:
> Hi Rupert, *,
>
> As promised in Berlin, I have a question for you! :)
>
> I have this query :
>
> FieldQuery query = site.getQueryFactory().createFieldQuery();
>
> query.setConstraint(NamespaceEnum.skos + "prefLabel",
>         new TextConstraint(signToFind));
>
> query.addSelectedField(NamespaceEnum.skos + "related");
> query.addSelectedField(NamespaceEnum.skos + "narrower");
> query.addSelectedField(NamespaceEnum.skos + "broader");
> query.addSelectedField(NamespaceEnum.skos + "inScheme");
>
> query.setLimit(this.numSuggestions);
>
>
> When signToFind is a one-word term, e.g. "Apache", I get all composed terms
> that contain this word, e.g. "Apache fondation", "Apache bylaw", etc.
>
> That could be interesting in some cases, but not always.
>
> As I read in your documentation, there is :
> - patternType: one of "wildcard", "regex" or "none" (default is "none")
>
> As I don't define a pattern type, it should default to "none", so it should
> be strict matching, right?
>
> So in this case I should only get entities matching the one-word term, or am
> I missing something?
>
>
> Thanks
> ++
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
