lucene-solr-user mailing list archives

From "Mathias Walter" <mathias.wal...@gmx.net>
Subject AW: FieldCache
Date Mon, 25 Oct 2010 07:41:53 GMT
I don't think it is an XY problem.

I indexed about 90 million sentences and the PAS (predicate argument
structures) they consist of (about 500 million in total). Then I try to do
NER (named entity recognition) by searching for about 5 million entities.
For each entity I need all search results, not just the top X. Since about
10 percent of the entities are highly frequent (e.g. there are more than
5 million hits for "human"), it takes very long to obtain the data from the
index. "Very long" means about a day with 15 distributed Katta nodes. Katta
is just a distribution and shard balancing solution on top of Lucene.

Initially, I tried distributed search with Solr, but it was too slow at
retrieving a large set of documents. Then I switched to Lucene and made some
improvements. I enabled the field cache for my ID field and another
single-char field (PAS type) to get the benefit of accessing the fields
through an array. Unfortunately, the IDs are too large to fit in memory.
I gave 12 GB of RAM to each node and also tried MMapDirectory and/or
CompressedOops. Lucene always runs out of memory.

Then I investigated the storage of the fields. String fields are stored in
UTF-8 encoding, but my ID will never contain non-ASCII characters. It follows
a numeric scheme but does not fit into a single long. I encoded it into a
byte array of 11 bytes (compared to 30 bytes in UTF-8 encoding). Then I
changed the field type in schema.xml to binary. I still use the
EmbeddedSolrServer to create the indices.
Also, I had to remove the uniqueKey element because binary fields cannot be
indexed, and being indexed is a requirement for the unique key.
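
As a rough illustration of the 11-byte encoding described above, here is a
minimal sketch in Java. It assumes the ID is a string of at most 26 decimal
digits (10^26 is below 2^88, so such a value fits into 11 bytes). The actual
ID scheme is not spelled out in this message, and the class and method names
are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Lucene or Solr.
final class IdPacking {
    static final int PACKED_LENGTH = 11; // 88 bits

    // Packs a string of decimal digits into a fixed-width big-endian byte array.
    static byte[] pack(String digits) {
        if (!digits.matches("[0-9]+")) {
            throw new IllegalArgumentException("digits only: " + digits);
        }
        BigInteger value = new BigInteger(digits);
        if (value.bitLength() > PACKED_LENGTH * 8) {
            throw new IllegalArgumentException("does not fit into 11 bytes: " + digits);
        }
        byte[] raw = value.toByteArray(); // may carry an extra leading sign byte
        byte[] packed = new byte[PACKED_LENGTH];
        int n = Math.min(raw.length, PACKED_LENGTH);
        // Right-align the value so that every ID has the same width.
        System.arraycopy(raw, raw.length - n, packed, PACKED_LENGTH - n, n);
        return packed;
    }

    // Restores the digit string; digitCount re-adds leading zeros lost in packing.
    static String unpack(byte[] packed, int digitCount) {
        StringBuilder sb = new StringBuilder(new BigInteger(1, packed).toString());
        while (sb.length() < digitCount) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }
}
```

A fixed width keeps all IDs the same size, at the cost of having to know the
original digit count when unpacking.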

After reindexing I discovered that non-indexed or binary fields cannot be
used with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array.
The size increased to 7 characters (= 14 bytes), which is still a gain of
more than 50 percent compared to the UTF-8 encoding. BTW: I found no example
of how to use the IndexableBinaryStringTools class except in the unit tests.
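
The 7-character figure can be checked with plain arithmetic. Assuming the
class packs 15 bits of input into each char and appends one extra char that
describes the padding of the last one (which matches the 7 characters
reported above), the encoded length works out as in the sketch below. It
reproduces only that calculation with the JDK and does not call Lucene; the
class and method names are hypothetical.

```java
// Hypothetical helper reproducing the length arithmetic only; no Lucene calls.
final class EncodedLength {
    // Chars needed for byteCount input bytes: 15 payload bits per char,
    // rounded up, plus one trailing char (assumed to record the padding).
    static int encodedChars(int byteCount) {
        return (int) ((8L * byteCount + 14L) / 15L) + 1;
    }
}
```

For the 11-byte ID, encodedChars(11) gives 7 chars, i.e. 14 bytes when held as
UTF-16 in memory.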

Unfortunately, I was not able to use it with the EmbeddedSolrServer and the
Lucene client. The IDs in the search results never matched the IDs used to
create the SolrInputDocument.

I assume that the char[] returned from IndexableBinaryStringTools.encode is
encoded in UTF-8 again and then stored. At some point the information is lost
and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from
FieldCache.DEFAULT.getTerms directly. But the bytes are encoded in a form
unknown to me and cannot be decoded with IndexableBinaryStringTools.decode.
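
One possible explanation, offered as an assumption that has not been verified
against trunk: if terms are held as UTF-8 bytes, the bytes returned by
getTerms would be the UTF-8 encoding of the encoded chars rather than the
chars themselves, and would have to be converted back to UTF-16 before
IndexableBinaryStringTools.decode could work. The sketch below uses only the
JDK and shows that 15-bit char values survive such a UTF-8 round trip. The
class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper using only the JDK; no Lucene calls.
final class Utf8RoundTrip {
    // What an index that stores terms as UTF-8 would hold for the encoded chars.
    static byte[] toUtf8(char[] encoded) {
        return new String(encoded).getBytes(StandardCharsets.UTF_8);
    }

    // Turns a UTF-8 slice (bytes, offset, length) back into UTF-16 chars.
    static char[] fromUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8).toCharArray();
    }
}
```

Char values up to 0x7FFF never fall into the surrogate range, so the
conversion is lossless in both directions. The UTF-8 form is longer than one
byte per char, though, because values from 0x0800 upwards take three bytes
each.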

The question now is: how can I increase the performance of binary field
retrieval without exploding the memory?

I also read some comments which suggest using payloads, but I have never
tried this approach. Also, the column-stride fields approach (LUCENE-2186)
looks promising but is not released yet.

BTW: I made some tests with a smaller index and the ID encoded as string.
Using the field cache improves the hit retrieval dramatically (from
18 seconds down to 2 seconds per query, with a large number of results).

--
Kind regards,
Mathias

> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Saturday, 23 October 2010 21:40
> To: solr-user@lucene.apache.org
> Subject: Re: FieldCache
> 
> Why do you want to? Basically, the caches are there to improve
> #searching#. To search something, you must index it. Retrieving
> it is usually a rare enough operation that caching is irrelevant.
> 
> This smells like an XY problem, see:
> http://people.apache.org/~hossman/#xyproblem
> 
> If this seems like gibberish, could you explain your problem
> a little more?
> 
> Best
> Erick
> 
> On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter <mathias.walter@gmx.net> wrote:
> 
> > Hi,
> >
> > does a field which should be cached need to be indexed?
> >
> > I have a binary field which is just stored. Retrieving it via
> > FieldCache.DEFAULT.getTerms returns empty BytesRefs.
> >
> > Then I found the following post:
> > http://www.mail-archive.com/dev@lucene.apache.org/msg05403.html
> >
> > How can I use the FieldCache with a binary field?
> >
> > --
> > Kind regards,
> > Mathias
> >
> >

