lucene-java-user mailing list archives

From Sachin Kulkarni <kulk...@hawk.iit.edu>
Subject Re: How does Lucene decides which fields have termvectors stored and which not?
Date Wed, 20 Aug 2014 02:53:55 GMT
Hi Kumaran,

Below are the relevant part of the code and the .alg file.
Here is the setConfig function from DocMaker.java, in the package
org.apache.lucene.benchmark.byTask.feeds:

/** Set the configuration parameters of this doc maker. */
  public void setConfig(Config config, ContentSource source) {
    this.config = config;
    this.source = source;

    boolean stored = config.get("doc.stored", false);
    boolean bodyStored = config.get("doc.body.stored", stored);
    boolean tokenized = config.get("doc.tokenized", true);
    boolean bodyTokenized = config.get("doc.body.tokenized", tokenized);
    boolean norms = config.get("doc.tokenized.norms", false);
    boolean bodyNorms = config.get("doc.body.tokenized.norms", true);
    boolean termVec = config.get("doc.term.vector", false);
    boolean termVecPositions = config.get("doc.term.vector.positions", false);
    boolean termVecOffsets = config.get("doc.term.vector.offsets", false);

    valType = new FieldType(TextField.TYPE_NOT_STORED);
    valType.setStored(stored);
    valType.setTokenized(tokenized);
    valType.setOmitNorms(!norms);
    valType.setStoreTermVectors(termVec);
    valType.setStoreTermVectorPositions(termVecPositions);
    valType.setStoreTermVectorOffsets(termVecOffsets);

    valType.freeze();

    bodyValType = new FieldType(TextField.TYPE_NOT_STORED);
    bodyValType.setStored(bodyStored);
    bodyValType.setTokenized(bodyTokenized);
    bodyValType.setOmitNorms(!bodyNorms);
    bodyValType.setStoreTermVectors(termVec);
    bodyValType.setStoreTermVectorPositions(termVecPositions);
    bodyValType.setStoreTermVectorOffsets(termVecOffsets);
    bodyValType.freeze();

    storeBytes = config.get("doc.store.body.bytes", false);

    reuseFields = config.get("doc.reuse.fields", true);

    // In a multi-rounds run, it is important to reset DocState since settings
    // of fields may change between rounds, and this is the only way to reset
    // the cache of all threads.
    docState = new ThreadLocal<DocState>();

    indexProperties = config.get("doc.index.props", false);

    updateDocIDLimit = config.get("doc.random.id.limit", -1);
    if (updateDocIDLimit != -1) {
      r = new Random(179);
    }
  }



And the following is the .alg file that I set:

### START OF FILE: just an example
content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
content.source.verbose=false
content.source.excludeIteration=true
doc.maker.forever=false
doc.index.props=true
content.source.log.step=2500
docs.dir=PATH_TO_MY_DATASET
doc.term.vector=true
work.dir=work
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
trec.doc.parser=org.apache.lucene.benchmark.byTask.feeds.TrecParserByPath
content.source.forever=false
content.source.encoding=UTF-8
directory=FSDirectory
doc.stored=true
doc.tokenized=true
doc.tokenized.norms=true
doc.body.tokenized.norms=true
content.source.excludeIteration=true
ResetSystemErase
CreateIndex
{ AddDoc } : *
CloseIndex
### END OF FILE
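
To make the interaction between the two concrete, here is a minimal sketch (plain Java, no Lucene dependency) of how the boolean flags read in setConfig would resolve for the properties in this .alg file. The class AlgFlags and the helper getBool are hypothetical stand-ins for the benchmark's config.get(String, boolean); only the property names and their defaults are taken from the code above.

```java
import java.util.Properties;

public class AlgFlags {
    // Hypothetical stand-in for config.get(String, boolean): return the
    // parsed property if it is set, otherwise the supplied default.
    public static boolean getBool(Properties p, String key, boolean dflt) {
        String v = p.getProperty(key);
        return v == null ? dflt : Boolean.parseBoolean(v.trim());
    }

    // The doc.* properties from the .alg file that setConfig reads.
    public static Properties algProps() {
        Properties p = new Properties();
        p.setProperty("doc.stored", "true");
        p.setProperty("doc.tokenized", "true");
        p.setProperty("doc.tokenized.norms", "true");
        p.setProperty("doc.body.tokenized.norms", "true");
        p.setProperty("doc.term.vector", "true");
        p.setProperty("doc.index.props", "true");
        return p;
    }

    public static void main(String[] args) {
        Properties p = algProps();
        boolean stored = getBool(p, "doc.stored", false);
        // doc.body.stored is not set, so it falls back to doc.stored.
        boolean bodyStored = getBool(p, "doc.body.stored", stored);
        boolean termVec = getBool(p, "doc.term.vector", false);
        // Neither of these is set in the .alg file, so both stay at the default.
        boolean termVecPositions = getBool(p, "doc.term.vector.positions", false);
        boolean termVecOffsets = getBool(p, "doc.term.vector.offsets", false);

        System.out.println("stored=" + stored + " bodyStored=" + bodyStored);
        System.out.println("termVec=" + termVec
            + " positions=" + termVecPositions
            + " offsets=" + termVecOffsets);
    }
}
```

Under these assumptions the same termVec flag is applied to both valType and bodyValType, while term vector positions and offsets stay off because the .alg file does not set them.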


Regards,
Sachin Kulkarni

On Tue, Aug 19, 2014 at 1:59 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
wrote:

> Hi Kumaran,
>
> I am using the benchmark utility from Lucene and doing the indexing via an
> .alg file.
> Would you like to see the alg file instead?
>
> Thank you.
>
> Regards,
> Sachin
>
>
> On Tue, Aug 19, 2014 at 9:42 AM, Kumaran Ramasubramanian <
> kums.134@gmail.com> wrote:
>
>> Hi Sachin
>>
>> I would like to look at your indexing code. Please share it.
>>
>> -
>> Kumaran R
>>
>>
>>
>>
>>
>> On Tue, Aug 19, 2014 at 7:18 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>> wrote:
>>
>> > Hi,
>> >
>> > Sorry for all the code; it got sent out accidentally.
>> >
>> > The following code is part of the Benchmark utility in Lucene,
>> > specifically SubmissionReport.java
>> >
>> >
>> > // Here reader is the IndexReader.
>> >
>> >
>> > Iterator itr = docMap.entrySet().iterator();
>> > int totalNumDocuments = reader.numDocs();
>> > ScoreDoc sd[] = td.scoreDocs;
>> > String sep = " \t ";
>> > DocNameExtractor docext = new DocNameExtractor(docNameField);
>> > for (int i = 0; i < sd.length; i++)
>> > {
>> >   String docName = docext.docName(searcher, sd[i].doc);
>> >   // ***** The Map of documents will help us get the docid
>> >   int indexedDocID = docMap.get(docName);
>> >   Fields fields = reader.getTermVectors(indexedDocID);
>> >   Iterator<String> strItr = fields.iterator();
>> >
>> >   // ********** The following while loop prints the field names, which
>> >   // only show 2 fields out of the 5 that I am looking for.
>> >   while (strItr.hasNext())
>> >   {
>> >     String fieldName = strItr.next();
>> >     System.out.println("next field " + fieldName);
>> >   }
>> >   Document DocList = reader.document(indexedDocID);
>> >   List<IndexableField> field_list = DocList.getFields();
>> >
>> >   // ****** The following for loop prints the five fields and their
>> >   // related information.
>> >   for (int j = 0; j < field_list.size(); j++)
>> >   {
>> >     System.out.println("list field is : " + field_list.get(j).name());
>> >     IndexableFieldType IFT = field_list.get(j).fieldType();
>> >     System.out.println(" Field storeTermVectorOffsets : " +
>> >         IFT.storeTermVectorOffsets());
>> >     System.out.println(" Field stored :" + IFT.stored());
>> >   }
>> >   // ***************************** //
>> > }
>> >
>> >
>> >  /**** THE OUTPUT for this section of code is
>> > fields size : 2
>> > next field body
>> > next field docname
>> >
>> > list field is : docid
>> >  Field storeTermVectorOffsets : false
>> > list field is : docname
>> >  Field storeTermVectorOffsets : false
>> > list field is : docdate
>> >  Field storeTermVectorOffsets : false
>> > list field is : doctitle
>> >  Field storeTermVectorOffsets : false
>> > list field is : body
>> >  Field storeTermVectorOffsets : false
>> >
>> > *******/
>> >
>> > Hope this code comes out legible in the email.
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Sachin Kulkarni
>> >
>> >
>> > On Tue, Aug 19, 2014 at 8:39 AM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>> > wrote:
>> >
>> > > Hi Kumaran,
>> > >
>> > >
>> > >
>> > > The following code is part of the Benchmark utility in Lucene,
>> > > specifically SubmissionReport.java
>> > >
>> > >
>> > > Iterator itr = docMap.entrySet().iterator();
>> > >  int totalNumDocuments = reader.numDocs();
>> > > ScoreDoc sd[] = td.scoreDocs;
>> > >  String sep = " \t ";
>> > > DocNameExtractor docext = new DocNameExtractor(docNameField);
>> > >  for (int i=0; i<sd.length; i++)
>> > > {
>> > > System.out.println("i = " + i);
>> > >   String docName = docext.docName(searcher,sd[i].doc);
>> > >   System.out.println("docName : " + docName + "\t map size " +
>> > > docMap.size());
>> > >  // ***** The Map will help us get the docid and
>> > > int indexedDocID = docMap.get(docName);
>> > >  System.out.println("indexed doc id : " + indexedDocID
>> > >      + "\t docname : " + docName);
>> > >  // ******** GET THE tf-idf data now ************ //
>> > > Fields fields = reader.getTermVectors(indexedDocID);
>> > >  System.out.println("fields size : " + fields.size());
>> > >  // **** Print log output for testing **** //
>> > >  Iterator<String> strItr=fields.iterator();
>> > > while(strItr.hasNext())
>> > > {
>> > >  String fieldName = strItr.next();
>> > > System.out.println("next field " + fieldName);
>> > > }
>> > >  Document DocList= reader.document(indexedDocID);
>> > > List<IndexableField> field_list = DocList.getFields();
>> > >  for(int j=0; j < field_list.size(); j++)
>> > > {
>> > > System.out.println ( "list field is : " + field_list.get(j).name() );
>> > >  IndexableFieldType IFT = field_list.get(j).fieldType();
>> > > System.out.println(" Field storeTermVectorOffsets : " +
>> > > IFT.storeTermVectorOffsets());
>> > >  //System.out.println(" Field stored :" + IFT.stored());
>> > > //for (FieldInfo.IndexOptions c : IFT.indexOptions().values())
>> > >  // System.out.println(c);
>> > > }
>> > > // ***************************** //
>> > >
>> > >
>> > > On Tue, Aug 19, 2014 at 2:04 AM, Kumaran Ramasubramanian <
>> > > kums.134@gmail.com> wrote:
>> > >
>> > >> Hi Sachin Kulkarni,
>> > >>
>> > >>     If possible, please share your code.
>> > >>
>> > >>
>> > >> -
>> > >> Kumaran R
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > I am using Lucene 4.6.0.
>> > >> >
>> > >> > I have been storing 5 fields for my documents in the index, namely
>> > >> > body, title, docname, docdate and docid.
>> > >> >
>> > >> > But when I get the fields using
>> > >> > IndexReader.getTermVectors(indexedDocID) I only get the docname and
>> > >> > body fields and can retrieve the term vectors for those fields, but
>> > >> > not the others.
>> > >> >
>> > >> > I checked whether all five fields are stored using
>> > >> > IndexableFieldType.stored() and all return true. I also checked
>> > >> > that all the fields are indexed, and they are, but still when I
>> > >> > call getTermVectors I only receive two fields back.
>> > >> >
>> > >> > Is there any other config setting that I am missing while indexing
>> > >> > that is causing this behavior?
>> > >> >
>> > >> > Thanks to Kumaran and Ian for their answers to my previous questions,
>> > >> > but I have not been able to figure out the above one yet.
>> > >> >
>> > >> > Thank you very much.
>> > >> >
>> > >> > Regards,
>> > >> > Sachin
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>
