lucene-solr-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: Can I rebuild an index and remove some fields?
Date Thu, 16 Feb 2012 03:03:19 GMT
Great. I think you could make it a public tool; maybe others also need such
functionality.

On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart <bstewart.ny@gmail.com> wrote:

> I implemented an index shrinker and it works.  I reduced my test index
> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
> need anymore.  I'm actually using Lucene.Net for this project so code
> is C# using Lucene.Net 2.9.2 API.  But basic idea is:
>
> Create an IndexReader wrapper that only enumerates the terms you want
> to keep, and that removes terms from documents when returning
> documents.
>
> Use the SegmentMerger to re-write each segment (where each segment is
> wrapped by the wrapper class), writing new segment to a new directory.
> Collect the SegmentInfos and do a commit in order to create a new
> segments file in new index directory
>
> Done - you now have a shrunk index with specified terms removed.
>
> Implementation uses separate thread for each segment, so it re-writes
> them in parallel.  Took about 15 minutes to do 770,000 doc index on my
> macbook.
>
>
> On Tue, Feb 14, 2012 at 10:12 PM, Li Li <fancyerii@gmail.com> wrote:
> > I have roughly read the code of 4.0 trunk. Maybe it's feasible.
> >    SegmentMerger.add(IndexReader) adds the readers to be merged
> >    merge() will call
> >      mergeTerms(segmentWriteState);
> >      mergePerDoc(segmentWriteState);
> >
> >   mergeTerms() will construct fields from IndexReaders
> >    for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
> >      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
> >      final Fields f = r.reader.fields();
> >      final int maxDoc = r.reader.maxDoc();
> >      if (f != null) {
> >        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
> >        fields.add(f);
> >      }
> >      docBase += maxDoc;
> >    }
> >    So if you wrap your IndexReader and override its fields() method,
> > maybe it will work for merging terms.
> >
> >    For DocValues, you can also override AtomicReader.docValues(); just
> > return null for fields you want to remove. Maybe you should
> > traverse CompositeReader's getSequentialSubReaders() and wrap each
> > AtomicReader.
> >
> >    Other things like term vectors and norms are similar.
> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart <bstewart.ny@gmail.com> wrote:
> >
> >> I was thinking if I make a wrapper class that aggregates another
> >> IndexReader and filter out terms I don't want anymore it might work.
> >> And then pass that wrapper into SegmentMerger.  I think if I filter out
> >> terms on GetFieldNames(...) and Terms(...) it might work.
> >>
> >> Something like:
> >>
> >> HashSet<string> ignoredTerms=...;
> >>
> >> // FilteringIndexReader is the wrapper class described above, not a Lucene.Net class
> >> FilteringIndexReader wrapper=new FilteringIndexReader(reader, ignoredTerms);
> >>
> >> SegmentMerger merger=new SegmentMerger(writer);
> >>
> >> merger.Add(wrapper);
> >>
> >> merger.Merge();
> >>
> >>
> >>
> >>
> >>
> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
> >>
> >> > For method 2, delete is wrong; we can't delete terms.
> >> >   You would also need to hack the tii and tis files.
> >> >
> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancyerii@gmail.com> wrote:
> >> >
> >> >> method 1, dumping data
> >> >> For stored fields, you can traverse the whole index and save them
> >> >> somewhere else.
> >> >> For indexed but not stored fields, it may be more difficult.
> >> >>    If the indexed, non-stored field is not analyzed (fields such as
> >> >> id), it's easy to get from FieldCache.StringIndex.
> >> >>    But for analyzed fields, though theoretically they can be restored
> >> >> from term vectors and term positions, it's hard to recover them from
> >> >> the index.
> >> >>
> >> >> method 2, hack with metadata
> >> >> 1. indexed fields
> >> >>      delete by query, e.g. field:*
> >> >> 2. stored fields
> >> >>       Because all fields are stored sequentially, it's not easy to
> >> >> delete some fields. This will not affect search speed, but if you
> >> >> want to get stored fields and the useless fields are very long, then
> >> >> retrieval will slow down.
> >> >>       It's also possible to hack this, but it takes more effort to
> >> >> understand the index file format and traverse the fdt/fdx files.
> >> >>
> >> >> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
> >> >>
> >> >> This will give you some insight.
> >> >>
> >> >>
> >> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart.ny@gmail.com> wrote:
> >> >>
> >> >>> Let's say I have a large index (100M docs, 1TB, split up between 10
> >> >>> indexes).  And a bunch of the "stored" and "indexed" fields are not
> >> >>> used in search at all.  In order to save memory and disk, I'd like to
> >> >>> rebuild that index *without* those fields, but I don't have original
> >> >>> documents to rebuild entire index with (don't have the full-text
> >> >>> anymore, etc.).  Is there some way to rebuild or optimize an existing
> >> >>> index with only a sub-set of the existing indexed fields?  Or
> >> >>> alternatively is there a way to avoid loading some indexed fields at
> >> >>> all (to avoid loading term infos and terms index)?
> >> >>>
> >> >>> Thanks
> >> >>> Bob
> >> >>
> >> >>
> >> >>
> >>
> >>
>
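
The wrap-then-merge idea discussed in the thread above (hide the unwanted fields behind a reader wrapper, then let a merge copy only what the wrapper exposes) can be illustrated with a small toy model. This is only a sketch under stated assumptions: SegmentView, MemorySegment, FieldFilteringView and rewrite() are hypothetical stand-ins invented for illustration, not Lucene or Lucene.Net classes, and the sketch drops whole fields and ignores stored fields, norms, term vectors and deletions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: every type here is a hypothetical stand-in, not a Lucene class.
final class IndexShrinkSketch {

    /** Minimal read-only view of one segment: field -> term -> doc ids. */
    interface SegmentView {
        Set<String> fieldNames();
        SortedMap<String, List<Integer>> terms(String field);
    }

    /** A segment held entirely in memory. */
    static final class MemorySegment implements SegmentView {
        private final Map<String, SortedMap<String, List<Integer>>> postings = new TreeMap<>();

        void add(String field, String term, int docId) {
            postings.computeIfAbsent(field, f -> new TreeMap<>())
                    .computeIfAbsent(term, t -> new ArrayList<>())
                    .add(docId);
        }

        @Override
        public Set<String> fieldNames() {
            return postings.keySet();
        }

        @Override
        public SortedMap<String, List<Integer>> terms(String field) {
            return postings.getOrDefault(field, new TreeMap<>());
        }
    }

    /** Wrapper that hides the dropped fields from whoever reads the segment. */
    static final class FieldFilteringView implements SegmentView {
        private final SegmentView in;
        private final Set<String> dropped;

        FieldFilteringView(SegmentView in, Set<String> dropped) {
            this.in = in;
            this.dropped = dropped;
        }

        @Override
        public Set<String> fieldNames() {
            Set<String> kept = new TreeSet<>(in.fieldNames());
            kept.removeAll(dropped);
            return kept;
        }

        @Override
        public SortedMap<String, List<Integer>> terms(String field) {
            return dropped.contains(field) ? new TreeMap<>() : in.terms(field);
        }
    }

    /** Stand-in for the merge step: copies only what the view exposes. */
    static MemorySegment rewrite(SegmentView view) {
        MemorySegment out = new MemorySegment();
        for (String field : view.fieldNames()) {
            for (Map.Entry<String, List<Integer>> e : view.terms(field).entrySet()) {
                for (int docId : e.getValue()) {
                    out.add(field, e.getKey(), docId);
                }
            }
        }
        return out;
    }
}
```

Because the copy step only ever talks to the view, it needs no knowledge of which fields were removed; that is the property that would let an unmodified merger do the rewriting in the approach the thread describes.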
