lucene-dev mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Index without tf, anyone?
Date Fri, 18 Jul 2008 22:45:18 GMT
I have created "https://issues.apache.org/jira/browse/LUCENE-1340" for this, with a patch.
It is not properly tested and is missing asserts and unit tests, but a basic ant test-core
run passed. Released early for feedback.



----- Original Message ----
> From: eks dev <eksdev@yahoo.co.uk>
> To: java-dev@lucene.apache.org
> Sent: Friday, 18 July, 2008 10:40:41 PM
> Subject: Re: Index without tf, anyone?
> 
> For now I will ignore payloads; it is simpler to get some working code this way
> and is neither worse nor better than the other option (in any case this mumbo jumbo
> with options will have to be cleaned up for flexible indexing, or we will have
> problems keeping it under control).
> 
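
A minimal sketch of the rule settled on here, assuming omitTf simply wins over storePayloads when both are set (no exception is thrown, so the same TokenStream can feed both kinds of field). PostingOptions and resolve are hypothetical names used for illustration only; this is not code from the LUCENE-1340 patch.

```java
/** Hypothetical holder for the effective per-field posting options. */
final class PostingOptions {
    final boolean omitTf;
    final boolean storePayloads;

    private PostingOptions(boolean omitTf, boolean storePayloads) {
        this.omitTf = omitTf;
        this.storePayloads = storePayloads;
    }

    /**
     * Resolve the requested flags into the effective ones.
     * Assumption taken from the thread: if both are requested,
     * tf is omitted and payloads are silently ignored.
     */
    static PostingOptions resolve(boolean omitTf, boolean storePayloads) {
        return new PostingOptions(omitTf, storePayloads && !omitTf);
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean omitTf : values) {
            for (boolean storePayloads : values) {
                PostingOptions o = resolve(omitTf, storePayloads);
                System.out.println("requested omitTf=" + omitTf
                        + " storePayloads=" + storePayloads
                        + " -> effective omitTf=" + o.omitTf
                        + " storePayloads=" + o.storePayloads);
            }
        }
    }
}
```

Only the combination with both flags set differs from what was requested; the other three pass through unchanged.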
> 
> 
> ----- Original Message ----
> > From: Michael McCandless 
> > To: java-dev@lucene.apache.org
> > Sent: Friday, 18 July, 2008 10:17:29 PM
> > Subject: Re: Index without tf, anyone?
> > 
> > 
> > Hmm -- maybe ignore payloads?
> > 
> > I was going to say "maybe throw an exception", but, I can imagine  
> > you'd want to index a TokenStream once with a field that's storing tf,  
> > > positions & payloads, and then again as a field that doesn't.
> > 
> > Mike
> > 
> > eks dev wrote:
> > 
> > > also, another one:
> > >
> > > > what should happen with the payloads and omitTf options in the case
> > > > storePayloads==true && omitTf==true
> > > > should we say:
> > > > 1. ignore omitTf and go on with payloads
> > > > or
> > > > 2. disable payloads and omit tf
> > > >
> > > > the other combinations are clear
> > >
> > >
> > >
> > > ----- Original Message ----
> > >> From: eks dev 
> > >> To: java-dev@lucene.apache.org
> > >> Sent: Friday, 18 July, 2008 9:20:09 PM
> > >> Subject: Re: Index without tf, anyone?
> > >>
> > >> Mike,
> > > >> I have started playing with this, holy cow.... it is a lot of code
> > >>
> > >> Question
> > >>
> > > >> SegmentMerger.mergeFields()... there is a big block
> > > >>
> > > >> else {
> > > >>   addIndexed(reader, fieldInfos,
> > > >>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
> > > >>       true, true, true, false);
> > > >>   addIndexed(reader, fieldInfos,
> > > >>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
> > > >>       true, true, false, false);
> > > >>   addIndexed(reader, fieldInfos,
> > > >>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
> > > >>       true, false, true, false);
> > > >>   addIndexed(reader, fieldInfos,
> > > >>       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
> > > >>       true, false, false, false);
> > > >>   addIndexed(reader, fieldInfos,
> > > >>       reader.getFieldNames(IndexReader.FieldOption.STORES_PAYLOADS),
> > > >>       false, false, false, true);
> > > >>   addIndexed(reader, fieldInfos,
> > > >>       reader.getFieldNames(IndexReader.FieldOption.INDEXED),
> > > >>       false, false, false, false);
> > > >>   fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
> > > >> }
> > >>
> > >>
> > > >> I simply do not understand it. I have changed the addIndexed(...)
> > > >> signature to include omitTf, but I am not sure what needs to be
> > > >> done here.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> ----- Original Message ----
> > >>> From: Michael McCandless
> > >>> To: java-dev@lucene.apache.org
> > >>> Sent: Friday, 18 July, 2008 11:48:20 AM
> > >>> Subject: Re: Index without tf, anyone?
> > >>>
> > > >>> I just committed LUCENE-1301, which is a first step (top down) towards
> > > >>> flexible indexing.  I hope I didn't break anything....
> > > >>>
> > > >>> While flexible indexing should make this simpler, it's not too bad to
> > > >>> modify Lucene to do this today, if you want.  I think this is what
> > > >>> you'll need to do (but I haven't tested!):
> > >>>
> > >>>   * Add something to Fieldable/AbstractField/Field that "knows"
> > >>>     whether a field should store the tf.  Also add this to
> > >>>     FieldInfo.java, and make sure that bit is saved to the fnm file.
> > >>>
> > >>>   * In the new oal.index.DocFieldProcessorPerThread, in the
> > >>>     processDocument method, fix the FieldInfos.add call to also pass
> > >>>     in your new "storeTermFreq" bit.  Probably, assert that this
> > >>>     cannot change -- ie a field must be created with
> > >>>     storeTermFreq=true or false and must never change.
> > >>>
> > > >>>   * The new oal.index.FreqProxTermsWriter, in appendPostings, has the
> > > >>>     code that creates a new segment.  Change that to skip writing tf
> > > >>>     if the FieldInfo says so.
> > >>>
> > >>>   * Fix SegmentTermDocs to not read tf if FieldInfo says so.
> > >>>
> > > >>>   * Fix SegmentMerger.appendPostings to not merge/write tf if
> > > >>>     FieldInfo says so.  Likewise assert here that the "storeTermFreq"
> > > >>>     does not change in the merged segments.
> > >>>
> > > >>> It's also possible to fix FreqProxTermsWriterPerField to not even
> > > >>> compute & store the tf in its RawPostingList, per term.  This is an
> > > >>> optimization (saves RAM & CPU) that you can do after first getting the
> > > >>> above working...
> > >>>
> > >>> On the search side, you'll need to fix scoring to be OK with tf=0.
> > >>>
> > >>> I think this would be a useful addition to Lucene (it comes up every
> > >>> so often), even before we fully work out flexible indexing.
> > >>>
> > >>> Mike
> > >>>
> > >>> eks dev wrote:
> > >>>
> > >>>> hi all,
> > > >>>> is there any solution to have pure postings lists without
> > > >>>> interleaved tf? This eats a lot of CPU for VInt decoding on dense
> > > >>>> terms (and also doubles IO) in our case. It can be an untested patch,
> > > >>>> tips on how to do it, or whatever... I know about flexible indexing,
> > > >>>> but cannot wait (I guess it will take some time?).
> > > >>>>
> > > >>>> Does it make sense to start working on it? Can this somehow be
> > > >>>> incorporated into Flexible Indexing later? I would hate to do it and
> > > >>>> then throw it away when Mike does his magic with Flexible Indexing.
> > > >>>>
> > > >>>> We are simply sure this could help performance a lot (some dense
> > > >>>> fields always have a constant tf, so there is no need to read it from
> > > >>>> the index). I am just asking for help in case somebody happens to
> > > >>>> have some Quick 'n Dirty solution/idea.
> > >>>>
> > >>>> thanks, eks
> > >>>>
> > >>>>
> > >>>>
> > >>>>     __________________________________________________________
> > >>>> Not happy with your email address?.
> > >>>> Get the one you really want - millions of new email addresses
> > >>>> available now at Yahoo! http://uk.docs.yahoo.com/ymail/new.html
> > >>>>
> > >>>> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > >>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >>>>
> > >>>
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >>
> > >>
> > >>
> > >>      __________________________________________________________
> > >> Not happy with your email address?.
> > >> Get the one you really want - millions of new email addresses  
> > >> available now at
> > >> Yahoo! http://uk.docs.yahoo.com/ymail/new.html
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >
> > >
> > >
> > >      __________________________________________________________
> > > Not happy with your email address?.
> > > Get the one you really want - millions of new email addresses  
> > > available now at Yahoo! http://uk.docs.yahoo.com/ymail/new.html
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-dev-help@lucene.apache.org
> > >
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 
>       __________________________________________________________
> Not happy with your email address?.
> Get the one you really want - millions of new email addresses available now at 
> Yahoo! http://uk.docs.yahoo.com/ymail/new.html
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



      __________________________________________________________
Not happy with your email address?.
Get the one you really want - millions of new email addresses available now at Yahoo! http://uk.docs.yahoo.com/ymail/new.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

