lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-6888) Decompressing documents on first-pass distributed queries to get docId is inefficient, use indexed values instead?
Date Mon, 29 Dec 2014 17:42:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260257#comment-14260257 ]

Erick Erickson commented on SOLR-6888:
--------------------------------------

Right, thanks!

> Decompressing documents on first-pass distributed queries to get docId is inefficient, use indexed values instead?
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6888
>                 URL: https://issues.apache.org/jira/browse/SOLR-6888
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 5.0, Trunk
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: SOLR-6888-hacktiming.patch
>
>
> Assigning this to myself just so I don't lose track of it, but I won't be working on this
in the near term; anyone feeling ambitious should feel free to grab it.
> Note, docId used here is whatever is defined for <uniqueKey>...
> Since Solr 4.1, the compression/decompression process is based on 16K blocks and is automatic
and not configurable. So, to get a single stored value one must decompress at least one entire
16K block.
> For SolrCloud (and distributed processing in general), we make two trips, one to get
the doc id and score (or other sort criteria) and one to return the actual data.
> The first pass here requires that we return the top N docIDs and sort criteria, which
means that each and every sub-request has to unpack at least one 16K block (and sometimes
more) to get just the doc ID. So if we have 20 shards and only want 20 rows, 95% of the decompression
cycles will be wasted. Not to mention all the disk reads.
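The 95% figure above can be checked with a little arithmetic. This is only a sketch: the class and method names are made up for illustration, and it assumes every shard returns a full set of `rows` IDs, each read from stored fields.

```java
// Hypothetical helper, not part of Solr: counts how many first-pass IDs
// are fetched across all shards versus how many survive the merge.
class FirstPassWaste {

    // Each shard returns its own top `rows` IDs.
    static long idsFetched(int shards, int rows) {
        return (long) shards * rows;
    }

    // Only `rows` IDs survive the merge; the rest were fetched for nothing.
    static long idsDiscarded(int shards, int rows) {
        return idsFetched(shards, rows) - rows;
    }

    // Whole-number percentage of fetched IDs that are thrown away.
    static long percentDiscarded(int shards, int rows) {
        return 100 * idsDiscarded(shards, rows) / idsFetched(shards, rows);
    }

    public static void main(String[] args) {
        int shards = 20;
        int rows = 20;
        System.out.println("fetched   = " + idsFetched(shards, rows));      // 400
        System.out.println("discarded = " + idsDiscarded(shards, rows));    // 380
        System.out.println("wasted %  = " + percentDiscarded(shards, rows)); // 95
    }
}
```

With 20 shards and 20 rows, 400 IDs are read and 380 are discarded, so 95% of the reads (and the block decompressions behind them) buy nothing.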
> It seems like we should be able to do better than that. Can we argue that doc ids are
'special' and should be cached somehow? Let's discuss what this would look like. I can think
of a couple of approaches:
> 1> Since doc IDs are "special", can we say that for this purpose returning the indexed
version is OK? We'd need to return the actual stored value when the full doc was requested,
but for the sub-request only what about returning the indexed value instead of the stored
one? On the surface I don't see a problem here, but what do I know? Storing these as DocValues
seems useful in this case.
> 1a> A variant is treating numeric docIds specially since the indexed value and the
stored value should be identical. And DocValues here would be useful it seems. But this seems
an unnecessary specialization if <1> is implemented well.
> 2> We could cache individual doc IDs, although I'm not sure what use that really is.
Would maintaining the cache overwhelm the savings of not decompressing? I really don't like
this idea, but am throwing it out there. Populating the cache from stored data up front would
essentially mean decompressing every doc, so that seems untenable.
> 3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily initializing
it. I'm not particularly a fan of this either, doesn't seem like a Good Thing. I can see lazy
loading being almost, but not quite totally, useless, i.e. a hit ratio near 0, especially
since it'd be thrown out on every openSearcher.
> Really, the only one of these that seems viable is <1>/<1a>. The others would
all involve decompressing the docs anyway to get the ID, and I suspect that caching would
be of very limited usefulness. I guess <1>'s viability hinges on whether, for internal
use, the indexed form of DocId is interchangeable with the stored value.
> Or are there other ways to approach this? Or isn't it something to really worry about?
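To make the trade-off behind approach <1> concrete, here is a toy model, not Solr or Lucene code. All names are hypothetical, DEFLATE from java.util.zip stands in for whatever codec the stored-fields format really uses, and the block holds 8 docs rather than 16K bytes. It contrasts reading an ID out of a compressed block with reading it from a per-document column, which is roughly what a DocValues-style lookup gives you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy model only. Rows are "id<TAB>body<NEWLINE>", so ids and bodies
// must not contain tabs or newlines.
class IdLookupSketch {
    static final int DOCS_PER_BLOCK = 8;                      // assumed block size
    final List<byte[]> blocks = new ArrayList<>();            // "stored fields"
    final List<String> idColumn = new ArrayList<>();          // "DocValues-like" column
    long bytesInflated = 0;                                   // work done by stored lookups

    static IdLookupSketch build(int numDocs) {
        char[] filler = new char[200];
        Arrays.fill(filler, 'x');
        String body = new String(filler);
        IdLookupSketch s = new IdLookupSketch();
        StringBuilder pending = new StringBuilder();
        for (int i = 0; i < numDocs; i++) {
            String id = "doc-" + i;
            s.idColumn.add(id);
            pending.append(id).append('\t').append(body).append('\n');
            if ((i + 1) % DOCS_PER_BLOCK == 0 || i == numDocs - 1) {
                s.blocks.add(deflate(pending.toString().getBytes(StandardCharsets.UTF_8)));
                pending.setLength(0);
            }
        }
        return s;
    }

    static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] packed) {
        Inflater inf = new Inflater();
        inf.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && inf.needsInput()) break;        // truncated input
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    // Stored path: the whole block is inflated to read one small field.
    String idFromStored(int doc) {
        byte[] raw = inflate(blocks.get(doc / DOCS_PER_BLOCK));
        bytesInflated += raw.length;
        String row = new String(raw, StandardCharsets.UTF_8).split("\n")[doc % DOCS_PER_BLOCK];
        return row.substring(0, row.indexOf('\t'));
    }

    // Column path: a direct per-document lookup, no decompression.
    String idFromColumn(int doc) {
        return idColumn.get(doc);
    }

    public static void main(String[] args) {
        IdLookupSketch s = build(64);
        System.out.println(s.idFromStored(37) + " vs " + s.idFromColumn(37));
        System.out.println("bytes inflated for one id: " + s.bytesInflated);
    }
}
```

Both paths return the same ID here because the column was filled with the same string that was stored. Whether that holds for a real <uniqueKey> field is exactly the open question in <1>: it depends on the indexed form being interchangeable with the stored one.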



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
