lucene-java-user mailing list archives

From Manjula Wijewickrema <manjul...@gmail.com>
Subject ShingleAnalyzerWrapper question
Date Wed, 11 Jun 2014 06:23:43 GMT
Hi,

In my program, I can index and search documents based on unigrams. I
modified the code as follows so that the index would be built from bigrams
instead. However, it did not give me the desired output.

*****************

public static void createIndex() throws CorruptIndexException,
        LockObtainFailedException, IOException {

    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually",
            "after", "allow", "almost", "already", "also", "although",
            "always", "am", "an", "and", "any", "anybody"};  // only a portion

    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);

    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);

    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true,
            IndexWriter.MaxFieldLength.UNLIMITED);

    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();

    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.UN_TOKENIZED, Field.TermVector.YES));

        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }

    w.optimize();
    w.close();
}


****************

The output still contains only unigrams:


{contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3,
name/1, sabaragamuwa/1, univers/1}

*******************
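
To make clear what I mean by bigram output: I expected each term to be a pair of adjacent tokens rather than a single token. Below is a minimal sketch of that idea in plain Java, without Lucene. The class name `BigramSketch`, the method `bigrams`, and the token order in `main` are hypothetical and only for illustration; this is not how ShingleAnalyzerWrapper is implemented, just the shape of the terms I am hoping to see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: builds word bigrams from a token list.
public class BigramSketch {

    // Joins each token with its successor, separated by a single space.
    // A list with fewer than two tokens yields no bigrams.
    static List<String> bigrams(List<String> tokens) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed token order, using stemmed terms from the output above.
        List<String> tokens = Arrays.asList("main", "librari", "univers");
        System.out.println(bigrams(tokens)); // prints [main librari, librari univers]
    }
}
```

So instead of single terms such as librari/1, I was expecting two-word terms of this kind in the term vector.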


If anybody can, please help me to obtain the correct output.


Thanks,


Manjula.
