lucene-dev mailing list archives

From "Greg Fodor (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-2202) Money FieldType
Date Wed, 03 Nov 2010 20:50:26 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927990#action_12927990
] 

Greg Fodor edited comment on SOLR-2202 at 11/3/10 4:49 PM:
-----------------------------------------------------------

This update to the patch includes a number of performance enhancements and is the version
of the patch we will likely push to production.

First, this patch introduces the defaultCurrency parameter, which defaults to USD. The default
currency allows you to omit the currency code in the field value (i.e., "5000" instead of "5000,USD").
However, it plays a more pivotal role in improving performance.
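
As a rough illustration of the fallback described above, here is a minimal sketch of parsing a
field value with an optional currency code. The class and method names are hypothetical and
are not taken from the patch; only the "amount,CODE" form and the USD default come from the
text above.

```java
import java.math.BigDecimal;

// Hypothetical sketch: not the patch's MoneyType, just the defaulting rule.
final class MoneyValueSketch {
    final BigDecimal amount;
    final String currencyCode;

    MoneyValueSketch(BigDecimal amount, String currencyCode) {
        this.amount = amount;
        this.currencyCode = currencyCode;
    }

    // Accepts "5000,USD" or "5000"; a missing code falls back to defaultCurrency.
    static MoneyValueSketch parse(String raw, String defaultCurrency) {
        String[] parts = raw.trim().split(",", 2);
        BigDecimal amount = new BigDecimal(parts[0].trim());
        boolean hasCode = parts.length == 2 && !parts[1].trim().isEmpty();
        String code = hasCode ? parts[1].trim() : defaultCurrency;
        return new MoneyValueSketch(amount, code);
    }

    public static void main(String[] args) {
        MoneyValueSketch implicit = parse("5000", "USD");
        MoneyValueSketch explicit = parse("5000,EUR", "USD");
        System.out.println(implicit.amount + " " + implicit.currencyCode);
        System.out.println(explicit.amount + " " + explicit.currencyCode);
    }
}
```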

The previous patches took a naive approach to constructing the trie bounding range: they
used the current max and min currency exchange rates to the target currency. This proved
to be minimally useful since the relative magnitudes of currency units vary wildly, and hence
the bounding range often spanned the full document set.

The solution I took in this patch is to compute the bounding range by taking into account
the "currency drift." Before getting to that, though, the indexing process was updated to
include a new dynamic field that indexes the value of the field in the default currency, exchanged
at the current rate at indexing time. (Additionally, a stored field is optionally created
if the money field is marked as stored.)

The historical max and min exchange rates (the "drift") are now tracked by Solr in a properties
file. The properties file is named after the currency config file. For example, if the config
file is "currency.xml", the properties file is "currency.xml.drift.properties". This file
is designed to work correctly with replication, and is updated by Solr whenever the currency
config file is loaded.
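
To make the drift bookkeeping concrete, here is a minimal sketch that keeps a running min and
max per currency pair in a java.util.Properties object. The key layout ("USD.EUR.min") and the
class name are assumptions made for illustration; the actual format of
currency.xml.drift.properties is not described in this comment.

```java
import java.util.Properties;

// Hypothetical sketch of drift tracking; the key naming is an assumption.
final class DriftTrackerSketch {
    private final Properties drift = new Properties();

    // Called once per exchange rate each time the currency config is (re)loaded.
    void observe(String from, String to, double rate) {
        String minKey = from + "." + to + ".min";
        String maxKey = from + "." + to + ".max";
        // A pair seen for the first time starts with min == max == rate.
        double min = Double.parseDouble(drift.getProperty(minKey, Double.toString(rate)));
        double max = Double.parseDouble(drift.getProperty(maxKey, Double.toString(rate)));
        drift.setProperty(minKey, Double.toString(Math.min(min, rate)));
        drift.setProperty(maxKey, Double.toString(Math.max(max, rate)));
    }

    // Both accessors assume the pair has been observed at least once.
    double min(String from, String to) {
        return Double.parseDouble(drift.getProperty(from + "." + to + ".min"));
    }

    double max(String from, String to) {
        return Double.parseDouble(drift.getProperty(from + "." + to + ".max"));
    }

    public static void main(String[] args) {
        DriftTrackerSketch tracker = new DriftTrackerSketch();
        tracker.observe("USD", "EUR", 0.70);
        tracker.observe("USD", "EUR", 0.75);
        tracker.observe("USD", "EUR", 0.68);
        System.out.println(tracker.min("USD", "EUR") + " .. " + tracker.max("USD", "EUR"));
    }
}
```

Persisting the Properties object (for example with its store method) is left out here, since
where the file may be written is exactly the ResourceLoader limitation discussed further down.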

To compute an accurate bounding range, it is necessary to compute the max and min "historical
composite exchange rates". The "historical" refers to the fact that the historical max/min
exchange rates are used instead of the current exchange rate. The "composite" refers to the
fact that the max/min exchange rate is computed by taking the max/min of a composition of
the max/min exchange rates between the source currency S, the target currency T, and all intermediate
currencies Z. For example, to compute the max historical composite exchange rate between USD
and EUR, take the max value of x*y, where x is the max historical exchange rate
between USD->Z, and y is the max historical exchange rate between Z->EUR, for all currencies
Z.
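
The composite computation can be sketched as follows. This is an illustration under stated
assumptions, not the patch's code: the rate tables are made-up example data, every currency is
assumed to have a historical rate to every other currency, and a currency's rate to itself is
taken to be 1.0 so that the direct S->T rate is one of the candidates.

```java
import java.util.Map;

// Hypothetical sketch of the "historical composite exchange rate".
final class CompositeRateSketch {
    // Made-up historical max and min rates, for illustration only.
    static final Map<String, Map<String, Double>> EXAMPLE_MAX = Map.of(
        "USD", Map.of("EUR", 0.80, "GBP", 0.70),
        "EUR", Map.of("USD", 1.40, "GBP", 0.90),
        "GBP", Map.of("USD", 1.65, "EUR", 1.20));
    static final Map<String, Map<String, Double>> EXAMPLE_MIN = Map.of(
        "USD", Map.of("EUR", 0.68, "GBP", 0.60),
        "EUR", Map.of("USD", 1.20, "GBP", 0.80),
        "GBP", Map.of("USD", 1.40, "EUR", 1.10));

    // Assumes a complete table; the identity rate is 1.0 by assumption.
    static double rate(Map<String, Map<String, Double>> table, String from, String to) {
        if (from.equals(to)) {
            return 1.0;
        }
        return table.get(from).get(to);
    }

    // takeMax = true:  max over Z of maxRate(S->Z) * maxRate(Z->T)
    // takeMax = false: min over Z of minRate(S->Z) * minRate(Z->T)
    static double composite(Map<String, Map<String, Double>> table,
                            String source, String target, boolean takeMax) {
        double best = takeMax ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        for (String z : table.keySet()) {
            double product = rate(table, source, z) * rate(table, z, target);
            best = takeMax ? Math.max(best, product) : Math.min(best, product);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.printf("max USD->EUR: %.4f%n", composite(EXAMPLE_MAX, "USD", "EUR", true));
        System.out.printf("min USD->EUR: %.4f%n", composite(EXAMPLE_MIN, "USD", "EUR", false));
    }
}
```

With this example data the route through GBP determines both bounds (0.70 * 1.20 for the max,
0.60 * 1.10 for the min), which is the situation the composite rate is meant to cover: the
direct USD->EUR history alone would give a range that is too narrow.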

I made an attempt at proving mathematically that this historical composite exchange rate approach
computes a minimal upper bound and maximal lower bound for the trie query. If necessary I
can attach this proof.

Beyond this, I added some additional intra-query caching and changed the query construction
from the FilteredQuery approach (which seemed to be inefficient in leveraging the trie query)
to the BooleanQuery. You'll note that I rely upon the second clause in the BooleanQuery being
scored first, which eliminates the expensive exchange rate conversions from happening for
documents that fall outside the trie range.
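
The effect of checking the trie range first can be pictured with a plain-Java analogy that uses
no Lucene classes. Everything here is hypothetical: documents are rows of
{value indexed in the default currency, native amount, current rate to the target currency},
and the bounding range is passed in as already computed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical analogy for "cheap range check first, exact conversion second".
final class TwoPhaseMatchSketch {
    // Number of exact conversions performed by the last call to match().
    static int conversions = 0;

    static List<Integer> match(double[][] docs, double boundLo, double boundHi,
                               double lo, double hi) {
        conversions = 0;
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            double indexedDefault = docs[i][0];
            // Phase 1: cheap check against the precomputed bounding range.
            if (indexedDefault < boundLo || indexedDefault > boundHi) {
                continue;
            }
            // Phase 2: exact conversion, only for documents that survived phase 1.
            conversions++;
            double converted = docs[i][1] * docs[i][2];
            if (converted >= lo && converted <= hi) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {100.0, 100.0, 1.0},
            {5000.0, 5000.0, 1.0},
            {120.0, 90.0, 1.3},
        };
        List<Integer> hits = match(docs, 50.0, 200.0, 95.0, 118.0);
        System.out.println(hits + " after " + conversions + " conversions");
    }
}
```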

I ran into a limitation of the current resource loader API, however, in that it does not allow
access to creating or writing new resources, which is needed to maintain the drift properties
file. For now, I only support SolrResourceLoader which writes to the local filesystem by extracting
the config directory. However, the new ZkResourceLoader is not supported, for example. A non-fatal
warning is emitted to the log when this occurs. The side effect of this is that currency exchange
rate drift will not be tracked, resulting in incorrect range and point queries if the currency.xml
file is updated. It would be nice if it were possible to ask the ResourceLoader for an OutputStream
to a new resource for this purpose.

Some limitations:

    * The default currency cannot be changed after the initial index, otherwise the index
effectively is corrupt since the value for the trie bound is indexed in the default currency.
    * Loss or corruption of the drift file will cause erroneous range and point queries (documents
will be omitted from the results, though no incorrect documents will appear.)
    * As mentioned above, the only ResourceLoaders supported are SolrResourceLoaders that respond
to getConfigDir(). Please let me know if there is a safer, more canonical way to store and
load Solr-maintained metadata that lives with the index.

Also note that this has been tested with replication. The only thing necessary for replication
to work is that the currency.xml and currency.xml.drift.properties files be included as part
of the replication. A limitation here is that if no documents are updated but the currency
exchange rates change, the file will not be replicated due to Solr's policy of not replicating
files without index changes. It would be useful to allow this behavior to be overridden. In
our case this isn't a problem since our index churn is high enough that replication events
happen regularly.

In the end these changes result in accurate currency range queries that perform nearly as
fast as their non-currency counterparts. 

  
> Money FieldType
> ---------------
>
>                 Key: SOLR-2202
>                 URL: https://issues.apache.org/jira/browse/SOLR-2202
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.5
>            Reporter: Greg Fodor
>         Attachments: SOLR-2022-solr-3.patch, SOLR-2202-lucene-1.patch, SOLR-2202-solr-1.patch,
SOLR-2202-solr-2.patch, SOLR-2202-solr-4.patch, SOLR-2202-solr-5.patch, SOLR-2202-solr-6.patch,
SOLR-2202-solr-7.patch
>
>
> Attached please find patches to add support for monetary values to Solr/Lucene with query-time
currency conversion. The following features are supported:
> - Point queries (ex: "price:4.00USD")
> - Range queries (ex: "price:[$5.00 TO $10.00]")
> - Sorting.
> - Currency parsing by either currency code or symbol.
> - Symmetric & Asymmetric exchange rates. (Asymmetric exchange rates are useful if
there are fees associated with exchanging the currency.)
> At indexing time, money fields can be indexed in a native currency. For example, if a
product on an e-commerce site is listed in Euros, indexing the price field as "10.00EUR" will
index it appropriately. By altering the currency.xml file, the sorting and querying against
Solr can take into account fluctuations in currency exchange rates without having to re-index
the documents.
> The new "money" field type is a polyfield which indexes two fields, one which contains
the amount of the value and another which contains the currency code or symbol. The currency
metadata (names, symbols, codes, and exchange rates) are expected to be in an xml file which
is pointed to by the field type declaration in the schema.xml.
> The current patch is factored such that Money utility functions and configuration metadata
lie in Lucene (see MoneyUtil and CurrencyConfig), while the MoneyType and MoneyValueSource
lie in Solr. This was meant to mirror the work being done on the spatial field types.
> This patch has not yet been deployed to production but will be getting used to power
the international search capabilities of the search engine at Etsy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

