lucene-solr-dev mailing list archives

From "Graham Poulter (JIRA)" <j...@apache.org>
Subject [jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema
Date Wed, 25 Nov 2009 06:51:44 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

    Description: 
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in
an index.  This introduces relevance problems when using a single schema to store multiple
entity types, for example to support "search for tracks" and "search for artists".   The ranking
for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF
for the name field does not include counts from _artist_ entities.  The effect on ranking
would be most pronounced for query terms that have a low document frequency for _track_ entities
but a high frequency for _artist_ entities, or vice versa.

The current work-around to make the IDF be entity-specific is to use a separate Solr core
for each entity type sharing the schema - and repeating the process of copying solrconfig.xml
and schema.xml to all the cores.  This would be more complicated with replication, and even
more complicated with index distribution, because you must now maintain a core for _artists_
and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he
suggests calculating _numDocs_ after the application of filters.  He recognises, however, that
the document frequency (DF_t) for each query term in a _track_ search would also need to
exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t
must be calculated at index time, when Solr does not know what filters will be applied.
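
To make the effect concrete, here is a minimal sketch that evaluates the
simplified formula IDF_t=log(N/DF_t) quoted above. All counts are hypothetical
and chosen only for illustration, and the formula is the simplified one from
this description, not necessarily the exact one Lucene's Similarity uses.

```python
import math


def idf(num_docs, doc_freq):
    """Simplified IDF_t = log(N / DF_t), as written in the description."""
    return math.log(num_docs / doc_freq)


# Hypothetical counts for the term "foo" in the name field:
# rare among tracks, common among artists.
tracks = {"N": 1_000_000, "DF": 10}
artists = {"N": 10_000, "DF": 5_000}

# Single shared index: both N and DF_t are pooled over all entity types.
pooled_idf = idf(tracks["N"] + artists["N"], tracks["DF"] + artists["DF"])

# Filtered numDocs only: N counts tracks, but DF_t still includes artists.
filtered_n_only_idf = idf(tracks["N"], tracks["DF"] + artists["DF"])

# Entity-specific statistics: both N and DF_t count tracks only.
track_idf = idf(tracks["N"], tracks["DF"])

print(f"pooled:          {pooled_idf:.3f}")
print(f"filtered N only: {filtered_n_only_idf:.3f}")
print(f"track only:      {track_idf:.3f}")
```

With these made-up numbers, correcting only N leaves the IDF of "foo" close to
the pooled value, because the pooled DF_t is dominated by artist documents. The
IDF changes substantially only when DF_t is restricted to tracks as well.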

I suggest having a metadata field _entitytype_ specified on submitting a batch of documents.
The schema would specify a list of allowed entity types and a default entity type. For
example, a document could say either entitytype="track" or entitytype="artist".  Each entity
type has an independent set of document frequencies, so the term "foo" will have a DF for
entitytype="artist" and a different DF for entitytype="track".   This might be implemented
by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist"
would be implemented by searching only the _artist_ index, analogous to searching only on
the _artist_ core in the multi-core workaround.
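
The bookkeeping this implies can be sketched as a toy in-memory model. This is
not Solr or Lucene code: the class EntityTypeStats and its methods are
hypothetical names, used only to show one set of document frequencies per
entity type, a list of allowed types, and a default type.

```python
import math
from collections import Counter, defaultdict


class EntityTypeStats:
    """Toy model: independent document frequencies per entity type."""

    def __init__(self, allowed_types, default_type):
        if default_type not in allowed_types:
            raise ValueError("default type must be one of the allowed types")
        self.allowed_types = set(allowed_types)
        self.default_type = default_type
        self.num_docs = Counter()             # entity type -> N
        self.doc_freq = defaultdict(Counter)  # entity type -> term -> DF_t

    def add(self, terms, entitytype=None):
        """Index one document; fall back to the default entity type."""
        etype = entitytype or self.default_type
        if etype not in self.allowed_types:
            raise ValueError(f"unknown entity type: {etype!r}")
        self.num_docs[etype] += 1
        for term in set(terms):  # DF counts documents, not occurrences
            self.doc_freq[etype][term] += 1

    def idf(self, term, entitytype):
        """Simplified IDF_t = log(N / DF_t) within one entity type."""
        df = self.doc_freq[entitytype][term]
        if df == 0:
            return None  # term does not occur for this entity type
        return math.log(self.num_docs[entitytype] / df)


stats = EntityTypeStats({"track", "artist"}, default_type="track")
stats.add(["foo", "bar"], entitytype="artist")
stats.add(["foo"], entitytype="artist")
stats.add(["foo", "baz"], entitytype="track")
stats.add(["qux"])  # no entity type given, so it is counted as a track

artist_idf = stats.idf("foo", "artist")
track_idf = stats.idf("foo", "track")
```

Here "foo" occurs in every artist document but in only one of the two track
documents, so the two entity types give it different IDF values, which is the
behaviour the proposal asks for.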

With this solution (entity type metadata field implemented with separate Lucene indices) a
single Solr core can support many different entity types that share a common schema but use
partially overlapping subsets of fields, instead of having to configure, maintain, replicate
and distribute separate Solr cores for every entity type.

  was:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in
an index.  This introduces relevance problems when using a single schema to store multiple
entity types, for example to support "search for tracks" and "search for artists".   The ranking
for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF
for the name field does not include counts from _artist_ entities.  The effect on ranking
would be most pronounced for query terms that have a low document frequency for _track_ entities
but a high frequency for _artist_ entities.

The current work-around to make the IDF be entity-specific is to use a separate Solr core
for each entity type sharing the schema - and repeating the process of copying solrconfig.xml
and schema.xml to all the cores.  This would be more complicated with replication, and even
more complicated with index distribution, because you must now maintain a core for _artists_
and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he
suggests calculating _numDocs_ after the application of filters.  He recognises, however, that
the document frequency (DF_t) for each query term in a _track_ search would also need to
exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t
must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified on submitting a batch of documents.
The schema would specify a list of allowed entity types and a default entity type. For
example, a document could say either entitytype="track" or entitytype="artist".  Each entity
type has an independent set of document frequencies, so the term "foo" will have a DF for
entitytype="artist" and a different DF for entitytype="track".   This might be implemented
by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist"
would be implemented by searching only the _artist_ index, analogous to searching only on
the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a
single Solr core can support many different entity types that share a common schema but use
partially overlapping subsets of fields, instead of having to configure, maintain, replicate
and distribute separate Solr cores for every entity type.


> Improve IDF and relevance by separately indexing different entity types sharing a common schema
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1599
>                 URL: https://issues.apache.org/jira/browse/SOLR-1599
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Graham Poulter
>   Original Estimate: 504h
>  Remaining Estimate: 504h

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

