lucene-java-user mailing list archives

From Sachin Kulkarni <kulk...@hawk.iit.edu>
Subject Re: How does Lucene decides which fields have termvectors stored and which not?
Date Fri, 22 Aug 2014 13:54:37 GMT
Hi,

I was able to finally figure this out.
Lucene's Benchmark utility ships with some default parsers for TREC datasets.
I noticed that the parser was not extracting the title correctly for my
dataset and ended up setting it to null.
As a result, the title was not getting indexed even though I was asking for it.

It works well now that I have fixed the parser.
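
For anyone hitting the same symptom, one quick check is to compare the field
names you expect against the names that actually come back for a document.
Below is a minimal sketch in plain Java. The class TermVectorFieldCheck and
the method missingFields are hypothetical names, not part of Lucene. In
Lucene 4.x the Fields object returned by IndexReader.getTermVectors is an
Iterable<String> of field names, and the call can return null when a document
has no term vectors, so it should be possible to pass it as the second
argument.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Lucene): lists expected fields that are absent. */
final class TermVectorFieldCheck {

    private TermVectorFieldCheck() {}

    /**
     * @param expected field names that should carry term vectors
     * @param actual   field names actually returned for one document, for example by
     *                 iterating the result of reader.getTermVectors(docID); may be null
     * @return the expected names that do not occur in actual, in the order given
     */
    static Set<String> missingFields(Iterable<String> expected, Iterable<String> actual) {
        Set<String> missing = new LinkedHashSet<>();
        for (String name : expected) {
            missing.add(name);
        }
        if (actual != null) {
            for (String name : actual) {
                missing.remove(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Field names as they appear in the output quoted later in this thread.
        List<String> expected = Arrays.asList("body", "doctitle", "docname", "docdate", "docid");
        List<String> seen = Arrays.asList("body", "docname");
        System.out.println("fields without term vectors: " + missingFields(expected, seen));
    }
}
```

With the two field names from the output quoted below, this reports doctitle,
docdate and docid as missing. Any field that shows up here is a candidate for
checking what the parser actually produced for it.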

Regards,
Sachin Kulkarni


On Tue, Aug 19, 2014 at 9:53 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
wrote:

> Hi Kumaran,
>
> See below some part of the code and the .alg file.
> Here is the setConfig function from DocMaker.java in the package
> org.apache.lucene.benchmark.byTask.feeds:
>
> /** Set the configuration parameters of this doc maker. */
>   public void setConfig(Config config, ContentSource source) {
>     this.config = config;
>     this.source = source;
>
>     boolean stored = config.get("doc.stored", false);
>     boolean bodyStored = config.get("doc.body.stored", stored);
>     boolean tokenized = config.get("doc.tokenized", true);
>     boolean bodyTokenized = config.get("doc.body.tokenized", tokenized);
>     boolean norms = config.get("doc.tokenized.norms", false);
>     boolean bodyNorms = config.get("doc.body.tokenized.norms", true);
>     boolean termVec = config.get("doc.term.vector", false);
>     boolean termVecPositions = config.get("doc.term.vector.positions", false);
>     boolean termVecOffsets = config.get("doc.term.vector.offsets", false);
>
>     valType = new FieldType(TextField.TYPE_NOT_STORED);
>     valType.setStored(stored);
>     valType.setTokenized(tokenized);
>     valType.setOmitNorms(!norms);
>     valType.setStoreTermVectors(termVec);
>     valType.setStoreTermVectorPositions(termVecPositions);
>     valType.setStoreTermVectorOffsets(termVecOffsets);
>
>     valType.freeze();
>
>     bodyValType = new FieldType(TextField.TYPE_NOT_STORED);
>     bodyValType.setStored(bodyStored);
>     bodyValType.setTokenized(bodyTokenized);
>     bodyValType.setOmitNorms(!bodyNorms);
>     bodyValType.setStoreTermVectors(termVec);
>     bodyValType.setStoreTermVectorPositions(termVecPositions);
>     bodyValType.setStoreTermVectorOffsets(termVecOffsets);
>     bodyValType.freeze();
>
>     storeBytes = config.get("doc.store.body.bytes", false);
>
>     reuseFields = config.get("doc.reuse.fields", true);
>
>     // In a multi-rounds run, it is important to reset DocState since settings
>     // of fields may change between rounds, and this is the only way to reset
>     // the cache of all threads.
>     docState = new ThreadLocal<DocState>();
>
>     indexProperties = config.get("doc.index.props", false);
>
>     updateDocIDLimit = config.get("doc.random.id.limit", -1);
>     if (updateDocIDLimit != -1) {
>       r = new Random(179);
>     }
>   }
>
>
>
> And the following is the .alg file that I set:
>
> ### START OF FILE: just an example
> content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
> content.source.verbose=false
> content.source.excludeIteration=true
> doc.maker.forever=false
> doc.index.props=true
> content.source.log.step=2500
> docs.dir=PATH_TO_MY_DATASET
> doc.term.vector=true
> work.dir=work
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> trec.doc.parser=org.apache.lucene.benchmark.byTask.feeds.TrecParserByPath
> content.source.forever=false
> content.source.encoding=UTF-8
> directory=FSDirectory
> doc.stored=true
> doc.tokenized=true
> doc.tokenized.norms=true
> doc.body.tokenized.norms=true
> content.source.excludeIteration=true
> ResetSystemErase
> CreateIndex
> { AddDoc } : *
> CloseIndex
> ### END OF FILE
>
>
> Regards,
> Sachin Kulkarni
>
> On Tue, Aug 19, 2014 at 1:59 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
> wrote:
>
>> Hi Kumaran,
>>
>> I am using the benchmark utility from Lucene and doing the indexing via
>> an .alg file.
>> Would you like to see the alg file instead?
>>
>> Thank you.
>>
>> Regards,
>> Sachin
>>
>>
>> On Tue, Aug 19, 2014 at 9:42 AM, Kumaran Ramasubramanian <
>> kums.134@gmail.com> wrote:
>>
>>> Hi Sachin
>>>
>>>         I want to look at your indexing code. Please share it.
>>>
>>> -
>>> Kumaran R
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Aug 19, 2014 at 7:18 PM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Sorry for all the code; it got sent out accidentally.
>>> >
>>> > The following code is part of the Benchmark utility in Lucene,
>>> > specifically SubmissionReport.java
>>> >
>>> >
>>> > // Here reader is the IndexReader.
>>> >
>>> >
>>> > Iterator itr = docMap.entrySet().iterator();
>>> > int totalNumDocuments = reader.numDocs();
>>> > ScoreDoc sd[] = td.scoreDocs;
>>> > String sep = " \t ";
>>> > DocNameExtractor docext = new DocNameExtractor(docNameField);
>>> > for (int i = 0; i < sd.length; i++)
>>> > {
>>> >   String docName = docext.docName(searcher, sd[i].doc);
>>> >   // ***** The Map of documents will help us get the docid
>>> >   int indexedDocID = docMap.get(docName);
>>> >   Fields fields = reader.getTermVectors(indexedDocID);
>>> >   Iterator<String> strItr = fields.iterator();
>>> >
>>> >   // ********** The following while loop prints the field names, which
>>> >   // show only 2 fields out of the 5 that I am looking for.
>>> >   while (strItr.hasNext())
>>> >   {
>>> >     String fieldName = strItr.next();
>>> >     System.out.println("next field " + fieldName);
>>> >   }
>>> >   Document DocList = reader.document(indexedDocID);
>>> >   List<IndexableField> field_list = DocList.getFields();
>>> >
>>> >   // ****** The following for loop prints the five fields and their
>>> >   // related information.
>>> >   for (int j = 0; j < field_list.size(); j++)
>>> >   {
>>> >     System.out.println("list field is : " + field_list.get(j).name());
>>> >     IndexableFieldType IFT = field_list.get(j).fieldType();
>>> >     System.out.println(" Field storeTermVectorOffsets : " +
>>> >         IFT.storeTermVectorOffsets());
>>> >     System.out.println(" Field stored :" + IFT.stored());
>>> >   }
>>> >   // ***************************** //
>>> > }
>>> >
>>> >
>>> >  /**** THE OUTPUT for this section of code is
>>> > fields size : 2
>>> > next field body
>>> > next field docname
>>> >
>>> > list field is : docid
>>> >  Field storeTermVectorOffsets : false
>>> > list field is : docname
>>> >  Field storeTermVectorOffsets : false
>>> > list field is : docdate
>>> >  Field storeTermVectorOffsets : false
>>> > list field is : doctitle
>>> >  Field storeTermVectorOffsets : false
>>> > list field is : body
>>> >  Field storeTermVectorOffsets : false
>>> >
>>> > *******/
>>> >
>>> > Hope this code comes out legible in the email.
>>> >
>>> > Thank you.
>>> >
>>> > Regards,
>>> > Sachin Kulkarni
>>> >
>>> >
>>> > On Tue, Aug 19, 2014 at 8:39 AM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>>> > wrote:
>>> >
>>> > > Hi Kumaran,
>>> > >
>>> > >
>>> > >
>>> > > The following code is part of the Benchmark utility in Lucene,
>>> > > specifically SubmissionReport.java
>>> > >
>>> > >
>>> > > Iterator itr = docMap.entrySet().iterator();
>>> > > int totalNumDocuments = reader.numDocs();
>>> > > ScoreDoc sd[] = td.scoreDocs;
>>> > > String sep = " \t ";
>>> > > DocNameExtractor docext = new DocNameExtractor(docNameField);
>>> > > for (int i = 0; i < sd.length; i++)
>>> > > {
>>> > >   System.out.println("i = " + i);
>>> > >   String docName = docext.docName(searcher, sd[i].doc);
>>> > >   System.out.println("docName : " + docName + "\t map size " +
>>> > >       docMap.size());
>>> > >   // ***** The Map will help us get the docid and
>>> > >   int indexedDocID = docMap.get(docName);
>>> > >   System.out.println("indexed doc id : " + indexedDocID + "\t docname : "
>>> > >       + docName);
>>> > >   // ******** GET THE tf-idf data now ************ //
>>> > >   Fields fields = reader.getTermVectors(indexedDocID);
>>> > >   System.out.println("fields size : " + fields.size());
>>> > >   // **** Print log output for testing **** //
>>> > >   Iterator<String> strItr = fields.iterator();
>>> > >   while (strItr.hasNext())
>>> > >   {
>>> > >     String fieldName = strItr.next();
>>> > >     System.out.println("next field " + fieldName);
>>> > >   }
>>> > >   Document DocList = reader.document(indexedDocID);
>>> > >   List<IndexableField> field_list = DocList.getFields();
>>> > >   for (int j = 0; j < field_list.size(); j++)
>>> > >   {
>>> > >     System.out.println("list field is : " + field_list.get(j).name());
>>> > >     IndexableFieldType IFT = field_list.get(j).fieldType();
>>> > >     System.out.println(" Field storeTermVectorOffsets : " +
>>> > >         IFT.storeTermVectorOffsets());
>>> > >     //System.out.println(" Field stored :" + IFT.stored());
>>> > >     //for (FieldInfo.IndexOptions c : IFT.indexOptions().values())
>>> > >     //  System.out.println(c);
>>> > >   }
>>> > >   // ***************************** //
>>> > >
>>> > >
>>> > > On Tue, Aug 19, 2014 at 2:04 AM, Kumaran Ramasubramanian <
>>> > > kums.134@gmail.com> wrote:
>>> > >
>>> > >> Hi Sachin Kulkarni,
>>> > >>
>>> > >>     If possible, Please share your code.
>>> > >>
>>> > >>
>>> > >> -
>>> > >> Kumaran R
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni <kulksac@hawk.iit.edu>
>>> > >> wrote:
>>> > >>
>>> > >> > Hi,
>>> > >> >
>>> > >> > I am using Lucene 4.6.0.
>>> > >> >
>>> > >> > I have been storing 5 fields for my documents in the index, namely
>>> > >> > body, title, docname, docdate and docid.
>>> > >> >
>>> > >> > But when I get the fields using
>>> > >> > IndexReader.getTermVectors(indexedDocID), I only get the docname and
>>> > >> > body fields and can retrieve the term vectors for those fields, but
>>> > >> > not the others.
>>> > >> >
>>> > >> > I checked whether all five fields are stored using
>>> > >> > IndexableFieldType.stored(), and all return true. I also checked that
>>> > >> > all the fields are indexed, and they are, but when I call
>>> > >> > getTermVectors I still receive only two fields back.
>>> > >> >
>>> > >> > Is there any other config setting that I am missing while indexing
>>> > >> > that is causing this behavior?
>>> > >> >
>>> > >> > Thanks to Kumaran and Ian for their answers to my previous questions,
>>> > >> > but I have not been able to figure out the above one yet.
>>> > >> >
>>> > >> > Thank you very much.
>>> > >> >
>>> > >> > Regards,
>>> > >> > Sachin
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>
