lucene-dev mailing list archives

From "Hoss Man (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2802) Toolkit of UpdateProcessors for modifying document values
Date Thu, 08 Dec 2011 02:24:40 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2802:
---------------------------

    Attachment: SOLR-2802_update_processor_toolkit.patch

I had some time to revisit this issue again today.

Improvements in this patch:

* exclude options - you can now specify one or more sets of "exclude" lists which are parsed
just like the main list of field specifiers (examples below)
* improved defaults for ConcatFieldUpdateProcessorFactory - default behavior is now to only
concat values for fields that the schema says are multiValued=false and (StrField or TextField)
* new RemoveBlankFieldUpdateProcessorFactory - removes any 0 length CharSequence values it
finds, by default looks at all fields
* new FieldLengthUpdateProcessorFactory - replaces any CharSequence values it finds with their
length, by default it looks at no fields

As part of this work, I tweaked the abstract classes so that the assumption about which
fields a subclass should match by default is still "all fields", but it's easy for
subclasses to override this -- the user still has the final say, and the abstract class handles
that, but if the user doesn't configure anything, the subclass can easily say "my default
should be ___"
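
As a rough illustration of that arrangement (not the code in the attached patch), the Java sketch below uses hypothetical class and method names: SelectorDefaultsSketch, defaultSelector() and matches() are all assumptions made for this example. The base class defaults to "all fields", a subclass can swap in its own default, and anything the user configures wins over either.

```java
import java.util.function.Predicate;

// Hypothetical base class: names here are illustrative, not taken from the patch.
abstract class SelectorDefaultsSketch {
    private final Predicate<String> userInclude; // null when the user configured no selector
    private final Predicate<String> userExclude; // null when the user configured no exclusions

    SelectorDefaultsSketch(Predicate<String> userInclude, Predicate<String> userExclude) {
        this.userInclude = userInclude;
        this.userExclude = userExclude;
    }

    // Base-class default: match all fields. Subclasses override this to pick their own default.
    protected Predicate<String> defaultSelector() {
        return fieldName -> true;
    }

    // An explicit user selector replaces the default; exclusions apply on top of either one.
    final boolean matches(String fieldName) {
        Predicate<String> include = (userInclude != null) ? userInclude : defaultSelector();
        boolean excluded = (userExclude != null) && userExclude.test(fieldName);
        return include.test(fieldName) && !excluded;
    }
}

// Keeps the inherited "all fields" default.
class TrimSelectorSketch extends SelectorDefaultsSketch {
    TrimSelectorSketch(Predicate<String> include, Predicate<String> exclude) {
        super(include, exclude);
    }
}

// Says "my default should be no fields".
class LengthSelectorSketch extends SelectorDefaultsSketch {
    LengthSelectorSketch(Predicate<String> include, Predicate<String> exclude) {
        super(include, exclude);
    }

    @Override
    protected Predicate<String> defaultSelector() {
        return fieldName -> false;
    }
}
```

With nothing configured, TrimSelectorSketch matches every field and LengthSelectorSketch matches none; passing an include or exclude predicate changes the result for both.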

bq. I think I don't completely follow the explicit ruling

I explained myself really terribly before - I was conflating what should really be two orthogonal
things:

1) the *field names* that a processor looks at -- the user should have lots of options for
configuring the field selector explicitly, and if they don't, then a sensible default based
on the specifics of the processor should be applied, and the user should still have the ability
to configure exclusion rules on top of that default

2) the *value types* that a processor will deal with -- regardless of what field names a processor
is configured with, it should be logical about the types of values it finds in those fields.
 The FieldLengthUpdateProcessorFactory I just added, for example, only pays attention to values
that are CharSequence; if, for example, the SolrInputField already contained an Integer, it wouldn't
make sense to toString() that and then find the length of that String value.
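
A minimal sketch of that value-type check in plain Java, with no Solr classes involved. FieldLengthValueSketch and its methods are hypothetical names used only for this illustration; the patch itself may structure this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: illustrates the idea only, not the class in the patch.
class FieldLengthValueSketch {
    // Replace a CharSequence with its length; pass any other value through untouched.
    static Object mutateValue(Object value) {
        if (value instanceof CharSequence) {
            return ((CharSequence) value).length();
        }
        return value;
    }

    // Apply the same rule to every value of a (possibly multi-valued) field.
    static List<Object> mutateAll(List<Object> values) {
        List<Object> out = new ArrayList<>();
        for (Object v : values) {
            out.add(mutateValue(v));
        }
        return out;
    }
}
```

Under this rule the String "hello" becomes 5, while an Integer 12345 stays 12345 rather than turning into the length of its toString() form.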

bq. I think Date/Number parsing should only be done on compatible fields only. I think if
a subsequent parser moves / renames fields, then this processor should have been configured
before the processor that does the Date/Number parsing.

But that could easily lead to a chicken-vs-egg problem.  I think ideally you should be able
to have field names in your SolrInputDocuments (and in your processor configurations) that
don't exist in your schema at all, so you can have "transitory" names that exist purely for
passing info around.

Imagine a situation where you want to let clients submit documents containing a "publishDate"
field, but you want to be able to cleanly accept real Date objects (from java clients) or
Strings in a variety of formats, and then you want the final index to contain two versions
of that date: one indexed TrieDateField called "pubDate", and one non-indexed StrField called
"prettyDate" -- i.e., there is no "publishDate" in your schema at all.  You could then configure
some "ParseDateFieldUpdateProcessor" on the "publishDate" field even though that field name isn't
in your schema, so that you have consistent Date objects, and then use a CloneFieldUpdateProcessor
and/or RenameFieldUpdateProcessor to get that Date object into both your "pubDate" and "prettyDate"
fields, and then use some sort of FormatDateFieldUpdateProcessor on the "prettyDate" field.
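
To make that chain concrete, here is a sketch that models the document as a plain Map and each processor as a static method. None of this is Solr API: the method names, the accepted input formats, and the "MMMM d, yyyy" output pattern are all assumptions chosen for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical stand-ins for the parse / clone / rename / format processors.
class PublishDateChainSketch {
    private static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.ENGLISH);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        f.setLenient(false);
        return f;
    }

    // Parse Strings into Dates; real Date objects (or anything else) pass through untouched.
    static void parseDate(Map<String, Object> doc, String field) {
        Object v = doc.get(field);
        if (!(v instanceof CharSequence)) {
            return;
        }
        // Assumed input formats, most specific first.
        String[] patterns = {"yyyy-MM-dd'T'HH:mm:ss'Z'", "yyyy-MM-dd", "MM/dd/yyyy"};
        for (String pattern : patterns) {
            try {
                doc.put(field, utc(pattern).parse(v.toString()));
                return;
            } catch (ParseException e) {
                // Not this format; try the next one.
            }
        }
    }

    static void cloneField(Map<String, Object> doc, String source, String dest) {
        if (doc.containsKey(source)) {
            doc.put(dest, doc.get(source));
        }
    }

    static void renameField(Map<String, Object> doc, String source, String dest) {
        if (doc.containsKey(source)) {
            doc.put(dest, doc.remove(source));
        }
    }

    // Replace a Date value with its formatted String form.
    static void formatDate(Map<String, Object> doc, String field, String pattern) {
        Object v = doc.get(field);
        if (v instanceof Date) {
            doc.put(field, utc(pattern).format((Date) v));
        }
    }

    // Run the whole chain on a document holding only the transitory "publishDate" field.
    static Map<String, Object> process(Object publishDate) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("publishDate", publishDate);
        parseDate(doc, "publishDate");
        cloneField(doc, "publishDate", "prettyDate");
        renameField(doc, "publishDate", "pubDate");
        formatDate(doc, "prettyDate", "MMMM d, yyyy");
        return doc;
    }
}
```

Whether the client sends a Date or a String in one of the assumed formats, the resulting document has a Date in "pubDate", a String in "prettyDate", and no "publishDate" at all. The first three steps operate on a field name the schema never needs to know about.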

There may be other solutions to that type of problem, but I guess the bottom line from my
perspective is: why bother making a processor deliberately fail when the user configures it to
do something unexpected but still viable?  If they want to parse Strings -> Dates on a
TrieIntField, why not just let them do it?  Maybe they've got another processor later that
is going to convert that Date to "days since epoch" as an integer?


{panel}
Examples of the exclude configuration...

{code}
<updateRequestProcessorChain name="trim-few">
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">foo.*</str>
    <str name="fieldRegex">bar.*</str>
    <!-- each set of exclusions is checked independently -->
    <lst name="exclude">
      <str name="typeClass">solr.DateField</str>
    </lst>
    <lst name="exclude">
      <str name="fieldRegex">.*HOSS.*</str>
    </lst>
  </processor>
</updateRequestProcessorChain>
<updateRequestProcessorChain name="trim-some">
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">foo.*</str>
    <str name="fieldRegex">bar.*</str>
    <!-- only excluded if it matches all in set -->
    <lst name="exclude">
      <str name="typeClass">solr.DateField</str>
      <str name="fieldRegex">.*HOSS.*</str>
    </lst>
  </processor>
</updateRequestProcessorChain>
{code}

In the "trim-few" case, field names will be excluded if they are DateFields _or_ match the
"HOSS" regex.  In the "trim-some" case, field names will be excluded only if they are _both_
a DateField _and_ match the "HOSS" regex.
{panel}
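
The OR-across-sets, AND-within-a-set rule can be sketched in a few lines of Java. This is an illustration only, with hypothetical names (ExcludeSketch, Field, isExcluded); it assumes regexes must match the whole field name and that an empty exclude set matches nothing, neither of which is confirmed by the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical model of the exclude semantics; not the implementation in the patch.
class ExcludeSketch {
    // Minimal stand-in for a schema field: its name and the class name of its type.
    static final class Field {
        final String name;
        final String typeClass;

        Field(String name, String typeClass) {
            this.name = name;
            this.typeClass = typeClass;
        }
    }

    static Predicate<Field> typeClass(String className) {
        return f -> f.typeClass.equals(className);
    }

    // Assumption: the regex has to match the entire field name.
    static Predicate<Field> fieldRegex(String regex) {
        Pattern p = Pattern.compile(regex);
        return f -> p.matcher(f.name).matches();
    }

    @SafeVarargs
    static List<Predicate<Field>> set(Predicate<Field>... conditions) {
        return Arrays.asList(conditions);
    }

    // Excluded if ANY set matches; a set matches only if ALL of its conditions match.
    static boolean isExcluded(Field f, List<List<Predicate<Field>>> excludeSets) {
        for (List<Predicate<Field>> conditions : excludeSets) {
            boolean all = !conditions.isEmpty(); // assumption: an empty set excludes nothing
            for (Predicate<Field> c : conditions) {
                if (!c.test(f)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    // Mirrors "trim-few": two independent exclude sets.
    static List<List<Predicate<Field>>> trimFew() {
        List<List<Predicate<Field>>> sets = new ArrayList<>();
        sets.add(set(typeClass("solr.DateField")));
        sets.add(set(fieldRegex(".*HOSS.*")));
        return sets;
    }

    // Mirrors "trim-some": one exclude set holding both conditions.
    static List<List<Predicate<Field>>> trimSome() {
        List<List<Predicate<Field>>> sets = new ArrayList<>();
        sets.add(set(typeClass("solr.DateField"), fieldRegex(".*HOSS.*")));
        return sets;
    }
}
```

A DateField named "foo_date" is excluded under trimFew() but not under trimSome(); a DateField named "foo_HOSS_date" is excluded under both.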
                
> Toolkit of UpdateProcessors for modifying document values
> ---------------------------------------------------------
>
>                 Key: SOLR-2802
>                 URL: https://issues.apache.org/jira/browse/SOLR-2802
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: SOLR-2802_update_processor_toolkit.patch, SOLR-2802_update_processor_toolkit.patch
>
>
> Frequently users ask questions about things where the answer is "you could do it
> with an UpdateProcessor" but the number of out of the box UpdateProcessors is generally lacking
> and there aren't even very good base classes for the common case of manipulating field values
> when adding documents

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

