lucene-java-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Optimal way to index
Date Tue, 12 Feb 2013 13:29:56 GMT
Hello Ian,

I started as directed by you and created the index. Here is a small
piece of code which I have written. Please have a look at it:

public static void main(String[] args) throws IOException, ParseException {

    // Specify the analyzer for tokenizing text. The same analyzer should
    // be used for indexing and searching.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

    // 1. create the index
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mappings");
    Scan s = new Scan();
    ResultScanner rs = table.getScanner(s);
    int count = 0;
    String[] localnames;
    for (Result r : rs) {
        count++;
        localnames = Bytes.toString(r.getValue(Bytes.toBytes("cf"),
                Bytes.toBytes("LOC"))).trim().split(",");
        for (String str : localnames) {
            addDoc(w, "." + str + ".",
                    Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("CON"))),
                    Bytes.toString(r.getRow()));
        }
    }
    System.out.println("COUNT : " + count);
    table.close();
    w.close();

    // 2. query
    String term = "";
    // BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    // System.out.println("Enter the term you want to search...");
    // term = br.readLine();
    term = "Vacuolated Lymphocytes";
    TermQuery tq = new TermQuery(new Term("localname", "." + term + "."));

    // 3. search
    int hitsPerPage = 10;
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(tq, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println("ControlID -> " + d.get("controlid") + "\t"
                + "Localnames -> " + d.get("localname") + "\t"
                + "Controlname -> " + d.get("controlname"));
    }
    // reader can only be closed when there
    // is no need to access the documents any more.
    reader.close();
}

private static void addDoc(IndexWriter w, String local, String control,
        String rowkey) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("localname", local, Field.Store.YES));
    doc.add(new StringField("controlname", control, Field.Store.YES));
    doc.add(new StringField("controlid", rowkey, Field.Store.YES));
    w.addDocument(doc);
}

Does it look fine to you? Or can I make it better by adding or removing
something? Although it shows just a primitive usage of Lucene, it is always
better to have some able guidance.

One more question. Does the index live only as long as the application
when we use RAMDirectory? At the moment I have to run the entire process
every time I want to search for something.

Also, I have added a dot (.) before and after each word before adding
it to the document so that I can do an exact match search. Is my approach
correct, or is there any other OOTB feature available in Lucene which I can
use for this?
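
For comparison, here is a minimal sketch of the same exact-match lookup in plain Java, without Lucene. It only illustrates the idea that a key stored as one unit already matches exactly, with no delimiter characters needed; Lucene's StringField likewise indexes its value verbatim as a single token. The class, method, and field names below (SynonymLookupSketch, addRow, lookup, Entry) are hypothetical and not part of the original code, and the sample row is the one from the quoted message further down.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Lucene): an in-memory, exact-match synonym lookup.
public class SynonymLookupSketch {

    /** The base word plus its ID (the HBase row key in this thread). */
    public static final class Entry {
        public final String controlId;
        public final String controlName;

        Entry(String controlId, String controlName) {
            this.controlId = controlId;
            this.controlName = controlName;
        }
    }

    // One map key per synonym; each key is matched as a whole string.
    private final Map<String, Entry> byLocalName = new HashMap<>();

    /** Registers every comma-separated synonym of one row as its own key. */
    public void addRow(String controlId, String controlName, String synonymsCsv) {
        Entry entry = new Entry(controlId, controlName);
        for (String synonym : synonymsCsv.split(",")) {
            String key = synonym.trim(); // assumption: surrounding spaces are not significant
            if (!key.isEmpty()) {
                byLocalName.put(key, entry);
            }
        }
    }

    /** Returns the entry for an exact, case-sensitive match, or null if there is none. */
    public Entry lookup(String localName) {
        return byLocalName.get(localName);
    }

    public static void main(String[] args) {
        SynonymLookupSketch index = new SynonymLookupSketch();
        index.addRow("13991", "A", "a, A, Aa, aa, AA");

        Entry hit = index.lookup("aa");
        System.out.println(hit.controlName + " " + hit.controlId); // prints "A 13991"
        System.out.println(index.lookup("aaa"));                   // prints "null"
    }
}
```

Unlike the code above, this sketch trims each synonym individually after splitting, so " Aa" and "Aa" end up as the same key; whether that is desirable depends on the data.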

I am sorry to pester you with questions, and thank you so much for your time.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Feb 11, 2013 at 10:09 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Hey Ian. Thank you so much for the quick reply. I'll definitely give
> Lucene a shot. I'll start off with it and get back to you in case of any
> problem.
>
> Many thanks.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Mon, Feb 11, 2013 at 10:03 PM, Ian Lea <ian.lea@gmail.com> wrote:
>
>> You can certainly use lucene for this, and it will be blindingly fast
>> even if you use a disk based index.
>>
>> Just index documents as you've laid it out, with the field you want to
>> search on added as indexable and the others stored.
>>
>> I've never used Guava Table so can't comment on that, but with only a
>> few thousand words it would certainly be feasible to use something
>> like that.  Better?  I don't know.
>>
>> Personally I'd probably go with lucene as I'd be positive it would a)
>> work and b) be fast even if the thousands end up being tens of
>> thousands, or more.
>>
>>
>>
>>
>> --
>> Ian.
>>
>> On Mon, Feb 11, 2013 at 3:14 PM, Mohammad Tariq <dontariq@gmail.com>
>> wrote:
>> > Hello list,
>> >
>> >          I have a scenario wherein I need an in-memory index as I need
>> > faster search. The problem goes like this :
>> >
>> > I have a list which contains a couple of thousand words. Each word has a
>> > corresponding ID and a list of synonyms. The actual word is a column in
>> > my Hbase table. I get files which contain values for this column and I
>> > have to extract values from these files and put them into the
>> > appropriate column. But sometimes files may contain the synonym instead
>> > of the actual word. Now, this is the place where the index comes into
>> > the picture. I should have an index that contains all the words along
>> > with their IDs and all the synonyms, and it should always be in memory
>> > so that inserts into Hbase are quick. Something like this :
>> >
>> >  ID       WORD   SYNONYMS
>> >  13991    A      a, A, Aa, aa, AA
>> >
>> > Then the index should be something like this :
>> > a     A   13991
>> > A     A   13991
>> > Aa    A   13991
>> > aa    A   13991
>> > AA    A   13991
>> >
>> > So that if I get 'a' in the file, I should be able to do a lookup and
>> > the index should give me 'A' along with '13991'. I need both the base
>> > name and the ID. The names could even be strings of 4 to 5 words.
>> >
>> > I have all this information stored in a Hbase table having two columns
>> > where the first column contains the actual word and the second column
>> > contains the entire list of synonyms. And the rowkey is the ID.
>> >
>> > Now, I am not sure whether it is feasible to use Lucene for this, or
>> > whether I should go with something like 'Guava Table' or something
>> > else. I need some guidance because, being new to Lucene, I am not able
>> > to think in the right direction. If it is feasible to use Lucene to
>> > achieve this, how do I do it efficiently?
>> >
>> > I am using Hbase filters right now to do the fetch which is slowing down
>> > the process.
>> >
>> > I am sorry if my questions sound too childish or senseless, as I am not
>> > very good at Lucene. Thank you so much for your valuable time.
>> >
>> > Warm Regards,
>> > Tariq
>> > https://mtariq.jux.com/
>> > cloudfront.blogspot.com
