lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)
Date Fri, 10 Jun 2011 01:37:58 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2583:
------------------------------

    Attachment: patch.txt

{quote}
What do you mean with a two-stage table, can you clarify this please?
{quote}

See: http://www.strchr.com/multi-stage_tables

I attached a patch with a (not great) implementation that I was trying to clean up for
other reasons... maybe you can use it.

In the sparse case, blocks that contain only the default value are folded into one shared
block (in this patch blocksize=256, but maybe it should be configurable).
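
To make the idea concrete, here is a minimal sketch of a two-stage table that shares only the all-default block. It is not the code in patch.txt: the class name, method names, and the copy-on-write approach are assumptions made for illustration, and values are assumed to be stored as one byte each (for example after a SmallFloat-style encoding).

```java
import java.util.Arrays;

/** Hypothetical two-stage table of byte-encoded scores (illustrative names, not from patch.txt). */
public class TwoStageByteTable {
    private static final int BLOCK_SHIFT = 8;               // blocksize = 256
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final int[] index; // stage 1: block number -> offset of that block in 'data'
    private byte[] data;       // stage 2: block contents; data[0..255] is the shared default block
    private int used;          // number of bytes of 'data' currently in use

    public TwoStageByteTable(int size, byte defaultValue) {
        int numBlocks = (size + BLOCK_MASK) >>> BLOCK_SHIFT;
        index = new int[numBlocks];          // all zeros: every block points at the default block
        data = new byte[BLOCK_SIZE];
        Arrays.fill(data, defaultValue);
        used = BLOCK_SIZE;
    }

    /** Two array reads, no hashing and no boxing. */
    public byte get(int doc) {
        return data[index[doc >>> BLOCK_SHIFT] + (doc & BLOCK_MASK)];
    }

    public void set(int doc, byte value) {
        int block = doc >>> BLOCK_SHIFT;
        if (index[block] == 0) {             // block still shares the default block
            if (value == data[0]) {
                return;                      // writing the default value changes nothing
            }
            // Copy-on-write: give this block its own copy of the default block.
            if (used + BLOCK_SIZE > data.length) {
                data = Arrays.copyOf(data, Math.max(data.length * 2, used + BLOCK_SIZE));
            }
            System.arraycopy(data, 0, data, used, BLOCK_SIZE);
            index[block] = used;
            used += BLOCK_SIZE;
        }
        data[index[block] + (doc & BLOCK_MASK)] = value;
    }

    /** Approximate payload size: 4 bytes per index entry plus the blocks in use. */
    public long bytesUsed() {
        return 4L * index.length + used;
    }
}
```

This sketch allocates a private block the first time a non-default value lands in it, so it never needs a separate compaction pass; it also never folds a block back if all of its values are later reset to the default.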

For example, in your 4GB case (1 billion floats), if you use this with SmallFloat the absolute
worst case (no sharing) is 1GB + 16MB or so, and the best case (all default values) is 16MB,
but the lookups should be a lot faster than hashtables... all primitive types, etc... and
it could definitely be improved more.
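
The arithmetic behind those figures can be checked with a few lines. This is a rough estimate under stated assumptions, not a measurement of the patch: one byte per value (SmallFloat-style encoding), a 4-byte int per block in the first-stage index, and blocksize=256. The class and method names are made up for the example.

```java
/** Back-of-the-envelope memory estimate for a two-stage table (hypothetical helper). */
public class TwoStageMemoryEstimate {

    /** First-stage index: one 4-byte entry per block (assumed int offsets). */
    static long indexBytes(long numDocs, int blockSize) {
        long numBlocks = (numDocs + blockSize - 1) / blockSize;
        return 4L * numBlocks;
    }

    /** No sharing at all: every block is stored, one byte per document. */
    static long worstCaseBytes(long numDocs, int blockSize) {
        long numBlocks = (numDocs + blockSize - 1) / blockSize;
        return indexBytes(numDocs, blockSize) + numBlocks * blockSize;
    }

    /** Every document has the default value: the index plus a single shared block. */
    static long bestCaseBytes(long numDocs, int blockSize) {
        return indexBytes(numDocs, blockSize) + blockSize;
    }

    public static void main(String[] args) {
        long docs = 1_000_000_000L;
        System.out.println("index: " + indexBytes(docs, 256) + " bytes");     // 15625000
        System.out.println("worst: " + worstCaseBytes(docs, 256) + " bytes"); // 1015625000
        System.out.println("best:  " + bestCaseBytes(docs, 256) + " bytes");  // 15625256
    }
}
```

So the index costs roughly 16MB regardless of sparsity, and the second stage ranges from a single 256-byte block up to about 1GB.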

Really this is still probably overkill, as the data structure is intended to share blocks with
the same values in general, when in reality it's probably enough to just share the ones that
have only the default value set...

I didn't look at the Solr side to see if it's possible to build it incrementally (this would
be better than building and then compact()ing, but I wonder if this is possible due to
lucenedocid/solr id, etc).


> Make external scoring more efficient (ExternalFileField, FileFloatSource)
> -------------------------------------------------------------------------
>
>                 Key: SOLR-2583
>                 URL: https://issues.apache.org/jira/browse/SOLR-2583
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Martin Grotzke
>            Priority: Minor
>         Attachments: FileFloatSource.java.patch, patch.txt
>
>
> External scoring uses a lot of memory, depending on the number of documents in the index.
> The ExternalFileField (used for external scoring) uses FileFloatSource, where one FileFloatSource
> is created per external scoring file. FileFloatSource creates a float array whose size is
> the number of docs (this is also done if the file to load is not found). If there are far
> fewer entries in the scoring file than there are docs in total, the big float array
> wastes a lot of memory.
> This could be optimized by using a map of doc -> score, so that the map contains only as
> many entries as there are scoring entries in the external file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

