lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
Date Mon, 23 Feb 2009 19:58:02 GMT

    [ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676041#action_12676041 ]

Hoss Man commented on SOLR-799:
-------------------------------

The separation of concerns between schema.xml and solrconfig.xml has always been...

 * schema.xml: what is the data, what is its nature, what are its intrinsic properties?
 * solrconfig.xml: what can people do with your data, how can they use it?

fields, fieldTypes, analyzers, copyFields go in the schema.xml because they are (in theory)
intrinsic to the nature of your data regardless of where a given document comes from: 
 * documents should only have one author
 * categoryName should always be tokenized in a particular way
 * prices need to sort numerically, not lexicographically
 * any text indexed in the shortSummary field should also be indexed in the searchableAbstract field
 * etc...

request handlers that dictate how people can use the data are specified in solrconfig.xml
-- when searching data, request handlers (which may leverage search components) dictate what
a user is allowed to get/see; when modifying an index, request handlers (which may leverage
update processors) dictate what data is allowed to come from various sources and in what formats.

In short: as far as document indexing goes, the options configured in solrconfig.xml specify
how to "build up" a Document object from user input, while the options in schema.xml specify
how to "tear it down" into its individual terms and values for indexing.

With the near duplicate detection code, it is the schema's job to say which fields can exist
in the input documents, including a signature field --  but it is the solrconfig's job to
decide how to compute that signature field ... after all: the computation might be different
depending on the source of the data (ie: different processor chains could be configured for
different request handlers)
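
To make that split concrete, here is a minimal Python sketch (not Solr code, and not what the patch does). It assumes a hypothetical schema that only lists the allowed fields, and hypothetical per-handler processor chains that each compute the signature from a different set of fields. The handler names, field names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical stand-in for schema.xml: which fields a document may contain,
# including the signature field itself.
SCHEMA_FIELDS = {"id", "title", "body", "signature"}


def make_signature_processor(source_fields):
    """Return a processor that hashes the given fields into 'signature'."""
    def processor(doc):
        # Join the chosen field values with a separator unlikely to occur in text.
        joined = "\x1f".join(str(doc.get(f, "")) for f in source_fields)
        out = dict(doc)
        out["signature"] = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        return out
    return processor


# Hypothetical stand-in for solrconfig.xml: one processor chain per request
# handler, so the signature can be computed differently per data source.
CHAINS = {
    "/update/web": [make_signature_processor(["title", "body"])],
    "/update/feed": [make_signature_processor(["body"])],
}


def build_document(handler, raw):
    """'Build up' a document with the handler's chain, then check it against the schema."""
    doc = dict(raw)
    for processor in CHAINS[handler]:
        doc = processor(doc)
    unknown = set(doc) - SCHEMA_FIELDS
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")
    return doc
```

In this sketch, two documents with the same body but different titles get the same signature through "/update/feed" and different signatures through "/update/web", while the schema stand-in only knows that a signature field exists.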

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Assignee: Yonik Seeley
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking as well
> as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

