lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: deleting large amount data from solr cloud
Date Thu, 17 Apr 2014 15:35:44 GMT
bq: Will it get split at any point later?

"Split" is a little ambiguous here. Will it be copied into two or more
segments? No. Will it disappear? Possibly. Eventually this segment
will be merged if you add enough documents to the system. Consider
this scenario:
you add 1M docs to your system and it results in 10 segments (numbers
made up). Then you optimize, and you have 1M docs in 1 segment. Fine
so far.

Now you add 750K of those docs over again, which will delete them from
the 1 big segment. Your merge policy will, at some point, select this
segment to merge and it'll disappear...
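If you want to see the knobs that govern "at some point", the merge policy section of
solrconfig.xml (Solr 4.x style) looks roughly like this. The values are just the Lucene
defaults, shown for illustration rather than as a recommendation:

  <indexConfig>
    <!-- TieredMergePolicy is the default; spelled out here only to expose its settings -->
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
      <!-- segments near or above this size tend to be left alone until they
           accumulate enough deleted docs to be worth rewriting -->
      <double name="maxMergedSegmentMB">5120.0</double>
      <!-- higher values bias merge selection toward segments with many deletes -->
      <double name="reclaimDeletesWeight">2.0</double>
    </mergePolicy>
  </indexConfig>

Roughly speaking, an optimized segment far above maxMergedSegmentMB is only revisited
once it carries a lot of deletes, which is the behavior described above.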

FWIW,
Erick@Pedantic.com

On Thu, Apr 17, 2014 at 7:24 AM, Vinay Pothnis <pothnis@gmail.com> wrote:
> Thanks a lot Shalin!
>
>
> On 16 April 2014 21:26, Shalin Shekhar Mangar <shalinmangar@gmail.com> wrote:
>
>> You can specify the maxSegments parameter, e.g. maxSegments=5, while optimizing.
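In case a copy/paste example helps, that is just an extra parameter on the update
handler; either form below should do it (placeholder host/collection, not tested here
as written):

  curl 'http://host:port/solr/coll-name1/update?optimize=true&maxSegments=5'

  curl -H 'Content-Type: text/xml' --data '<optimize maxSegments="5"/>' \
    'http://host:port/solr/coll-name1/update'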
>>
>>
>> On Thu, Apr 17, 2014 at 6:46 AM, Vinay Pothnis <pothnis@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > Couple of follow up questions:
>> >
>> > * When the optimize command is run, looks like it creates one big segment
>> > (forceMerge = 1). Will it get split at any point later? Or will that big
>> > segment remain?
>> >
>> > * Is there any way to maintain the number of segments - but still merge to
>> > reclaim the space held by deleted documents? In other words, can I issue
>> > "forceMerge=20"? If so, what would the command look like? Any examples for
>> > this?
>> >
>> > Thanks
>> > Vinay
>> >
>> >
>> >
>> > On 16 April 2014 07:59, Vinay Pothnis <pothnis@gmail.com> wrote:
>> >
>> > > Thank you Erick!
>> > > Yes - I am using the expunge deletes option.
>> > >
>> > > Thanks for the note on disk space for the optimize command. I should have
>> > > enough space for that. What about the heap space requirement? I hope it can
>> > > do the optimize with the memory that is allocated to it.
>> > >
>> > > Thanks
>> > > Vinay
>> > >
>> > >
>> > > On 16 April 2014 04:52, Erick Erickson <erickerickson@gmail.com> wrote:
>> > >
>> > >> The optimize should, indeed, reduce the index size. Be aware that it
>> > >> may consume 2x the disk space. You may also try expungeDeletes, see
>> > >> here: https://wiki.apache.org/solr/UpdateXmlMessages
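In case an example helps, expungeDeletes rides along on a commit; either of these
should do it (placeholder host/collection, untested here):

  curl -H 'Content-Type: text/xml' --data '<commit expungeDeletes="true"/>' \
    'http://host:port/solr/coll-name1/update'

  curl 'http://host:port/solr/coll-name1/update?commit=true&expungeDeletes=true'

It only merges away segments that actually contain deleted docs, so it is usually
lighter than a full optimize, though it can still rewrite a good chunk of the index.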
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Wed, Apr 16, 2014 at 12:47 AM, Vinay Pothnis <pothnis@gmail.com> wrote:
>> > >> > Another update:
>> > >> >
>> > >> > I removed the replicas - to avoid the replication doing a full copy. I am
>> > >> > now able to delete sizeable chunks of data. But the overall index size
>> > >> > remains the same even after the deletes. It does not seem to go down.
>> > >> >
>> > >> > I understand that Solr would do this in the background - but I don't see
>> > >> > a decrease in overall index size even after 1-2 hours. I can see a bunch
>> > >> > of ".del" files in the index directory, but they do not seem to get cleaned
>> > >> > up. Is there any way to monitor/follow the progress of index compaction?
>> > >> >
>> > >> > Also, does triggering "optimize" from the admin UI help to compact the
>> > >> > index size on disk?
>> > >> >
>> > >> > Thanks
>> > >> > Vinay
>> > >> >
>> > >> >
>> > >> > On 14 April 2014 12:19, Vinay Pothnis <pothnis@gmail.com> wrote:
>> > >> >
>> > >> >> Some update:
>> > >> >>
>> > >> >> I removed the auto warm configurations for the various caches and reduced
>> > >> >> the cache sizes. I then issued a call to delete a day's worth of data
>> > >> >> (800K documents).
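For reference, "removing the auto warm configurations" boils down to something like
this in solrconfig.xml - a sketch with placeholder sizes, not the exact values in use:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>

With autowarmCount=0, the new searcher opened by the delete's commit no longer
re-populates the caches from the old one, which is one common source of post-commit
heap pressure.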
>> > >> >>
>> > >> >> There was no out of memory this time - but some of the nodes went into
>> > >> >> recovery mode. I was able to catch some logs this time around and this is
>> > >> >> what I see:
>> > >> >>
>> > >> >> ****************
>> > >> >> WARN  [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
>> > >> >> PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr too many
>> > >> >> updates received since start - startingUpdates no longer overlaps with
>> > >> >> our currentUpdates
>> > >> >> INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
>> > >> >> PeerSync Recovery was not successful - trying replication.
>> > >> >> core=core1_shard1_replica2
>> > >> >> INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
>> > >> >> Starting Replication Recovery. core=core1_shard1_replica2
>> > >> >> INFO  [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
>> > >> >> Begin buffering updates. core=core1_shard1_replica2
>> > >> >> INFO  [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
>> > >> >> Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/.
>> > >> >> core=core1_shard1_replica2
>> > >> >> INFO  [2014-04-14 18:11:00.536] [org.apache.solr.client.solrj.impl.HttpClientUtil]
>> > >> >> Creating new http client,
>> > >> >> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
>> > >> >> INFO  [2014-04-14 18:11:01.964] [org.apache.solr.client.solrj.impl.HttpClientUtil]
>> > >> >> Creating new http client,
>> > >> >> config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
>> > >> >> INFO  [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller] No
>> > >> >> value set for 'pollInterval'. Timer Task not started.
>> > >> >> INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> > >> >> Master's generation: 1108645
>> > >> >> INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> > >> >> Slave's generation: 1108627
>> > >> >> INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> > >> >> Starting replication process
>> > >> >> INFO  [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
>> > >> >> Number of files in latest index in master: 814
>> > >> >> INFO  [2014-04-14 18:11:02.007] [org.apache.solr.core.CachingDirectoryFactory]
>> > >> >> return new directory for
>> > >> >> /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
>> > >> >> INFO  [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
>> > >> >> Starting download to
>> > >> >> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
>> > >> >> lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
>> > >> >> maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
>> > >> >>
>> > >> >> ****************
>> > >> >>
>> > >> >>
>> > >> >> So, it looks like the number of updates is too large for the regular peer
>> > >> >> sync, and recovery then falls back to a full copy of the index. And since
>> > >> >> our index size is huge (350G), this is causing the cluster to go into
>> > >> >> recovery mode forever - trying to copy that huge index.
>> > >> >>
>> > >> >> I also read in some thread
>> > >> >> (http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html)
>> > >> >> that there is a limit of 100 documents.
>> > >> >>
>> > >> >> I wonder if this has been updated to make that configurable since that
>> > >> >> thread. If not, the only option I see is to do a "trickle" delete of 100
>> > >> >> documents per second or something.
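Two things that might be worth checking here. First, depending on the Solr version,
the peer-sync window is controlled by the update log, and later 4.x releases expose a
numRecordsToKeep setting for it (check whether your version's solrconfig.xml supports
it) - a sketch:

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- default is 100; keeping more recent updates lets peer sync catch up
         without falling back to a full index copy -->
    <int name="numRecordsToKeep">10000</int>
  </updateLog>

Second, a trickle delete is easy to script against the normal update handler. A rough
sketch that slices the same date range as the query quoted further down into hourly
chunks (add the param1/param2 clauses as needed; slice size and sleep are made up):

  START=1383955200000   # epoch millis, as in the original delete query
  END=1385164800000
  STEP=3600000          # one hour per slice
  for ((t=START; t<END; t+=STEP)); do
    curl -H 'Content-Type: text/xml' \
      --data "<delete><query>date_param:[${t} TO $((t+STEP-1))]</query></delete>" \
      'http://host:port/solr/coll-name1/update?commit=true'
    sleep 30
  done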
>> > >> >>
>> > >> >> Also - the other suggestion of using "distrib=false" might not help,
>> > >> >> because the issue currently is that the replication is going to "full
>> > >> >> copy".
>> > >> >>
>> > >> >> Any thoughts?
>> > >> >>
>> > >> >> Thanks
>> > >> >> Vinay
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >> On 14 April 2014 07:54, Vinay Pothnis <pothnis@gmail.com> wrote:
>> > >> >>
>> > >> >>> Yes, that is our approach. We did try deleting a day's worth of data at
>> > >> >>> a time, and that resulted in OOM as well.
>> > >> >>>
>> > >> >>> Thanks
>> > >> >>> Vinay
>> > >> >>>
>> > >> >>>
>> > >> >>> On 14 April 2014 00:27, Furkan KAMACI <furkankamaci@gmail.com> wrote:
>> > >> >>>
>> > >> >>>> Hi;
>> > >> >>>>
>> > >> >>>> I mean you can divide the range (i.e. one week at each delete instead of
>> > >> >>>> one month) and try to check whether you still get an OOM or not.
>> > >> >>>>
>> > >> >>>> Thanks;
>> > >> >>>> Furkan KAMACI
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis <pothnis@gmail.com>:
>> > >> >>>>
>> > >> >>>> > Aman,
>> > >> >>>> > Yes - Will do!
>> > >> >>>> >
>> > >> >>>> > Furkan,
>> > >> >>>> > How do you mean by 'bulk delete'?
>> > >> >>>> >
>> > >> >>>> > -Thanks
>> > >> >>>> > Vinay
>> > >> >>>> >
>> > >> >>>> >
>> > >> >>>> > On 12 April 2014 14:49, Furkan KAMACI <furkankamaci@gmail.com> wrote:
>> > >> >>>> >
>> > >> >>>> > > Hi;
>> > >> >>>> > >
>> > >> >>>> > > Do you get any problems when you index your data? On the other hand,
>> > >> >>>> > > deleting in bulks and reducing the number of documents per bulk may
>> > >> >>>> > > help you avoid hitting OOM.
>> > >> >>>> > >
>> > >> >>>> > > Thanks;
>> > >> >>>> > > Furkan KAMACI
>> > >> >>>> > >
>> > >> >>>> > >
>> > >> >>>> > > 2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon.10@gmail.com>:
>> > >> >>>> > >
>> > >> >>>> > > > Vinay, please share your experience after trying this solution.
>> > >> >>>> > > >
>> > >> >>>> > > >
>> > >> >>>> > > > On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <pothnis@gmail.com> wrote:
>> > >> >>>> > > >
>> > >> >>>> > > > > The query is something like this:
>> > >> >>>> > > > >
>> > >> >>>> > > > >
>> > >> >>>> > > > > curl -H 'Content-Type: text/xml' --data
>> > >> >>>> > > > > '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4)
>> > >> >>>> > > > > AND date_param:[1383955200000 TO 1385164800000]</query></delete>'
>> > >> >>>> > > > > 'http://host:port/solr/coll-name1/update?commit=true'
>> > >> >>>> > > > >
>> > >> >>>> > > > > Trying to restrict the number of documents deleted via the date
>> > >> >>>> > > > > parameter.
>> > >> >>>> > > > >
>> > >> >>>> > > > > Had not tried the "distrib=false" option. I could give that a
>> > >> >>>> > > > > try. Thanks for the link! I will check on the cache sizes and
>> > >> >>>> > > > > autowarm values, and will try disabling the caches while I am
>> > >> >>>> > > > > deleting.
>> > >> >>>> > > > >
>> > >> >>>> > > > > Thanks Erick and Shawn for your inputs!
>> > >> >>>> > > > >
>> > >> >>>> > > > > -Vinay
>> > >> >>>> > > > >
>> > >> >>>> > > > >
>> > >> >>>> > > > >
>> > >> >>>> > > > > On 11 April 2014 15:28, Shawn Heisey <solr@elyograg.org> wrote:
>> > >> >>>> > > > >
>> > >> >>>> > > > > > On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
>> > >> >>>> > > > > >
>> > >> >>>> > > > > >> We tried to delete the data through a query - say 1 day's or 1
>> > >> >>>> > > > > >> month's worth of data at a time. But after deleting just 1
>> > >> >>>> > > > > >> month's worth of data, the master node is going out of memory -
>> > >> >>>> > > > > >> heap space.
>> > >> >>>> > > > > >>
>> > >> >>>> > > > > >> Wondering if there is any way to incrementally delete the data
>> > >> >>>> > > > > >> without affecting the cluster adversely.
>> > >> >>>> > > > > >>
>> > >> >>>> > > > > >
>> > >> >>>> > > > > > I'm curious about the actual query being used here. Can you
>> > >> >>>> > > > > > share it, or a redacted version of it? Perhaps there might be a
>> > >> >>>> > > > > > clue there?
>> > >> >>>> > > > > >
>> > >> >>>> > > > > > Is this a fully distributed delete request? One thing you might
>> > >> >>>> > > > > > try, assuming Solr even supports it, is sending the same delete
>> > >> >>>> > > > > > request directly to each shard core with distrib=false.
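To make that concrete, the per-shard variant would look roughly like this - core and
host names are placeholders, the query is trimmed to just the date clause, and the same
caveat applies about whether Solr honors distrib=false for updates:

  curl -H 'Content-Type: text/xml' \
    --data '<delete><query>date_param:[1383955200000 TO 1385164800000]</query></delete>' \
    'http://host1:8983/solr/core1_shard1_replica1/update?commit=true&distrib=false'

The idea is to repeat that against each shard's core so no single node has to fan the
delete out across the whole collection.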
>> > >> >>>> > > > > >
>> > >> >>>> > > > > > Here's a very incomplete list about how you can reduce Solr heap
>> > >> >>>> > > > > > requirements:
>> > >> >>>> > > > > >
>> > >> >>>> > > > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
>> > >> >>>> > > > > >
>> > >> >>>> > > > > > Thanks,
>> > >> >>>> > > > > > Shawn
>> > >> >>>> > > > > >
>> > >> >>>> > > > > >
>> > >> >>>> > > > >
>> > >> >>>> > > >
>> > >> >>>> > > >
>> > >> >>>> > > >
>> > >> >>>> > > > --
>> > >> >>>> > > > With Regards
>> > >> >>>> > > > Aman Tandon
>> > >> >>>> > > >
>> > >> >>>> > >
>> > >> >>>> >
>> > >> >>>>
>> > >> >>>
>> > >> >>>
>> > >> >>
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
