lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Optimal way to index
Date Tue, 12 Feb 2013 20:14:19 GMT
"AA-" indexed as a StringField was matched by a TermQuery for "AA"?
Sounds surprising.
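
The expectation behind that question can be modelled in a few lines of plain Java. This is only a sketch of exact-match semantics, not Lucene code: ExactTermIndex and its methods are made-up names, and a HashMap stands in for an untokenized field.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for an untokenized, exact-match field.
// NOT Lucene code; ExactTermIndex is a hypothetical name for illustration.
class ExactTermIndex {
    // Exact indexed string -> stored value.
    private final Map<String, String> postings = new HashMap<>();

    void add(String term, String storedValue) {
        postings.put(term, storedValue);
    }

    // Returns the stored value only if the query equals an indexed term exactly.
    String lookup(String term) {
        return postings.get(term);
    }

    public static void main(String[] args) {
        ExactTermIndex index = new ExactTermIndex();
        index.add("AA-", "doc for AA-");
        System.out.println(index.lookup("AA"));   // prints null: "AA" and "AA-" are different terms
        System.out.println(index.lookup("AA-"));  // prints doc for AA-
    }
}
```

Under this model no wrapping characters are needed to keep "AA" apart from "AA-". If a real index behaves differently, the first thing to check is probably which exact string was indexed and queried.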


--
Ian.


On Tue, Feb 12, 2013 at 6:32 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> Thanks again, Ian. I'll make the changes you suggested. I am using dots
> because when I searched for 'AA' it was returning 'AA-' as well.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Tue, Feb 12, 2013 at 9:50 PM, Ian Lea <ian.lea@gmail.com> wrote:
>
>> From a glance it looks fine.  I don't see what you gain by adding dots
>> - you are using a TermQuery which will only do exact matches.  Since
>> you're using StringField your text won't be tokenized but stored as
>> is.  I see you're searching on a mixed case term - that's fine as long
>> as you don't expect "aaa" to match "AAA".  I tend to just downcase
>> everything because I've wasted so much time over the years on silly
>> case sensitive bugs.
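
The downcasing habit mentioned above can be sketched in plain Java. This is an illustration only, not Lucene code: CaseFoldingLookup and normalize are hypothetical names, and the trim() call is an extra assumption about stray whitespace in the input. The point is that the same normalization runs on both the indexing side and the query side.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: applies one normalization to keys and to queries.
class CaseFoldingLookup {
    private final Map<String, String> byKey = new HashMap<>();

    // Locale.ROOT keeps the result independent of the JVM's default locale.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    void add(String localName, String controlName) {
        byKey.put(normalize(localName), controlName);
    }

    String find(String query) {
        return byKey.get(normalize(query));
    }

    public static void main(String[] args) {
        CaseFoldingLookup lookup = new CaseFoldingLookup();
        lookup.add("Aa", "A");
        System.out.println(lookup.find("AA"));   // prints A
        System.out.println(lookup.find("AA-"));  // prints null
    }
}
```

Folding case merges entries that differ only by case. That is harmless for the a/A/Aa example in this thread, but it would lose information if two such spellings mapped to different words.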
>>
>> RAMDirectory instances will disappear when the application ends so
>> yes, you'll need to reload on startup.  You don't have to recreate for
>> each search though - create and populate the RAMDirectory on startup
>> and create an IndexSearcher and use that for all searches.
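
The "populate once, reuse for every search" lifecycle could look roughly like this. The sketch uses a plain map as a stand-in for the RAMDirectory and IndexSearcher pair; SharedLookup and the loader argument are hypothetical names, not Lucene API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stand-in for "build the in-memory index at startup, then share one searcher".
// Not Lucene code: SharedLookup and the loader are hypothetical.
final class SharedLookup {
    private final Map<String, String> index;

    // The loader runs exactly once, here, rather than once per search.
    SharedLookup(Supplier<Map<String, String>> loader) {
        this.index = Collections.unmodifiableMap(new HashMap<>(loader.get()));
    }

    // Read-only lookup; the underlying map never changes after construction.
    String search(String term) {
        return index.get(term);
    }

    public static void main(String[] args) {
        final int[] loads = {0};
        SharedLookup lookup = new SharedLookup(() -> {
            loads[0]++; // in the real application this would be the HBase scan
            Map<String, String> m = new HashMap<>();
            m.put("aa", "A");
            return m;
        });
        System.out.println(lookup.search("aa"));
        System.out.println(lookup.search("zz"));
        System.out.println("loads = " + loads[0]);
    }
}
```

However many searches follow, the expensive load happens only in the constructor, which is the property the advice above is after.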
>>
>> Depending on your app it might be easier to use a normal disk based
>> index.  It will probably be fast enough.
>>
>>
>> --
>> Ian.
>>
>>
>> On Tue, Feb 12, 2013 at 1:29 PM, Mohammad Tariq <dontariq@gmail.com>
>> wrote:
>> > Hello Ian,
>> >
>> > I started as you directed and created the index. Here is a small piece
>> > of code I have written. Please have a look at it:
>> >
>> > import java.io.IOException;
>> >
>> > import org.apache.hadoop.conf.Configuration;
>> > import org.apache.hadoop.hbase.HBaseConfiguration;
>> > import org.apache.hadoop.hbase.client.HTable;
>> > import org.apache.hadoop.hbase.client.Result;
>> > import org.apache.hadoop.hbase.client.ResultScanner;
>> > import org.apache.hadoop.hbase.client.Scan;
>> > import org.apache.hadoop.hbase.util.Bytes;
>> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> > import org.apache.lucene.document.Document;
>> > import org.apache.lucene.document.Field;
>> > import org.apache.lucene.document.StringField;
>> > import org.apache.lucene.index.DirectoryReader;
>> > import org.apache.lucene.index.IndexReader;
>> > import org.apache.lucene.index.IndexWriter;
>> > import org.apache.lucene.index.IndexWriterConfig;
>> > import org.apache.lucene.index.Term;
>> > import org.apache.lucene.search.IndexSearcher;
>> > import org.apache.lucene.search.ScoreDoc;
>> > import org.apache.lucene.search.TermQuery;
>> > import org.apache.lucene.search.TopScoreDocCollector;
>> > import org.apache.lucene.store.Directory;
>> > import org.apache.lucene.store.RAMDirectory;
>> > import org.apache.lucene.util.Version;
>> >
>> > // Only the two methods appeared in the original post; the imports and the
>> > // class wrapper (placeholder name) are added so the snippet compiles.
>> > public class LocalNameSearch {
>> >
>> >   public static void main(String[] args) throws IOException {
>> >     // Specify the analyzer for tokenizing text. The same analyzer should
>> >     // be used for indexing and searching. (StringField values are not
>> >     // tokenized, so it does not alter the fields added below.)
>> >     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>> >
>> >     // 1. create the index
>> >     Directory index = new RAMDirectory();
>> >     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
>> >     IndexWriter w = new IndexWriter(index, config);
>> >
>> >     // Scan the HBase "mappings" table; add one document per local name.
>> >     Configuration conf = HBaseConfiguration.create();
>> >     HTable table = new HTable(conf, "mappings");
>> >     Scan s = new Scan();
>> >     ResultScanner rs = table.getScanner(s);
>> >     int count = 0;
>> >     for (Result r : rs) {
>> >       count++;
>> >       String control = Bytes.toString(
>> >           r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("CON")));
>> >       String[] localnames = Bytes.toString(
>> >           r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("LOC"))).trim().split(",");
>> >       for (String str : localnames) {
>> >         addDoc(w, "." + str + ".", control, Bytes.toString(r.getRow()));
>> >       }
>> >     }
>> >     System.out.println("COUNT : " + count);
>> >     rs.close(); // release the scanner before closing the table
>> >     table.close();
>> >     w.close();
>> >
>> >     // 2. query
>> >     String term = "";
>> >     // BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
>> >     // System.out.println("Enter the term you want to search...");
>> >     // term = br.readLine();
>> >     term = "Vacuolated Lymphocytes";
>> >     TermQuery tq = new TermQuery(new Term("localname", "." + term + "."));
>> >
>> >     // 3. search
>> >     int hitsPerPage = 10;
>> >     IndexReader reader = DirectoryReader.open(index);
>> >     IndexSearcher searcher = new IndexSearcher(reader);
>> >     TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
>> >     searcher.search(tq, collector);
>> >     ScoreDoc[] hits = collector.topDocs().scoreDocs;
>> >
>> >     // 4. display results
>> >     System.out.println("Found " + hits.length + " hits.");
>> >     for (int i = 0; i < hits.length; ++i) {
>> >       int docId = hits[i].doc;
>> >       Document d = searcher.doc(docId);
>> >       System.out.println("ControlID -> " + d.get("controlid") + "\t"
>> >           + "Localnames -> " + d.get("localname") + "\t"
>> >           + "Controlname -> " + d.get("controlname"));
>> >     }
>> >     // reader can only be closed when there
>> >     // is no need to access the documents any more.
>> >     reader.close();
>> >   }
>> >
>> >   // Adds one document with three untokenized, stored fields.
>> >   private static void addDoc(IndexWriter w, String local, String control,
>> >       String rowkey) throws IOException {
>> >     Document doc = new Document();
>> >     doc.add(new StringField("localname", local, Field.Store.YES));
>> >     doc.add(new StringField("controlname", control, Field.Store.YES));
>> >     doc.add(new StringField("controlid", rowkey, Field.Store.YES));
>> >     w.addDocument(doc);
>> >   }
>> > }
>> >
>> > Does it look fine to you? Or can I make it better by adding or removing
>> > something? Although it shows just a primitive usage of Lucene, it is
>> > always better to have some able guidance.
>> >
>> > One more question. Does the index stay alive only for the lifetime of
>> > the application if we use RAMDirectory? I have to run the entire
>> > process every time I want to search for something.
>> >
>> > Also, I have added a dot (.) before and after each word before adding
>> > it to the document so that I can do an exact-match search. Is my approach
>> > correct, or is there an OOTB feature in Lucene that I can use for this?
>> >
>> > I am sorry to pester you with questions, and thank you so much for your time.
>> >
>> > Warm Regards,
>> > Tariq
>> > https://mtariq.jux.com/
>> > cloudfront.blogspot.com
>> >
>> >
>> > On Mon, Feb 11, 2013 at 10:09 PM, Mohammad Tariq <dontariq@gmail.com>
>> wrote:
>> >
>> >> Hey Ian. Thank you so much for the quick reply. I'll definitely give
>> >> Lucene a shot. I'll start off with it and get back to you in case of any
>> >> problem.
>> >>
>> >> Many thanks.
>> >>
>> >> Warm Regards,
>> >> Tariq
>> >> https://mtariq.jux.com/
>> >> cloudfront.blogspot.com
>> >>
>> >>
>> >> On Mon, Feb 11, 2013 at 10:03 PM, Ian Lea <ian.lea@gmail.com> wrote:
>> >>
>> >>> You can certainly use lucene for this, and it will be blindingly fast
>> >>> even if you use a disk based index.
>> >>>
>> >>> Just index documents as you've laid it out, with the field you want to
>> >>> search on added as indexable and the others stored.
>> >>>
>> >>> I've never used Guava Table so can't comment on that, but with only a
>> >>> few thousand words it would certainly be feasible to use something
>> >>> like that.  Better?  I don't know.
>> >>>
>> >>> Personally I'd probably go with lucene as I'd be positive it would a)
>> >>> work and b) be fast even if the thousands end up being tens of
>> >>> thousands, or more.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Ian.
>> >>>
>> >>> On Mon, Feb 11, 2013 at 3:14 PM, Mohammad Tariq <dontariq@gmail.com>
>> >>> wrote:
>> >>> > Hello list,
>> >>> >
>> >>> > I have a scenario wherein I need an in-memory index because I need
>> >>> > faster search. The problem goes like this:
>> >>> >
>> >>> > I have a list which contains a couple of thousand words. Each word has
>> >>> > a corresponding ID and a list of synonyms. The actual word is a column
>> >>> > in my HBase table. I get files which contain values for this column,
>> >>> > and I have to extract values from these files and put them into the
>> >>> > appropriate column. But sometimes files may contain a synonym instead
>> >>> > of the actual word. This is where the index comes into the picture. I
>> >>> > should have an index that contains all the words along with their IDs
>> >>> > and all their synonyms, and it should always be in memory so that
>> >>> > inserts into HBase are quick. Something like this:
>> >>> >
>> >>> >  ID       WORD   SYNONYMS
>> >>> >  13991    A      a, A, Aa, aa, AA
>> >>> >
>> >>> > Then the index should be something like this:
>> >>> >
>> >>> >  a     A   13991
>> >>> >  A     A   13991
>> >>> >  Aa    A   13991
>> >>> >  aa    A   13991
>> >>> >  AA    A   13991
>> >>> >
>> >>> > So if I get 'a' in the file, I should be able to do a lookup and the
>> >>> > index should give me 'A' along with '13991'. I need both the base name
>> >>> > and the ID. The names could even be strings of 4 to 5 words.
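
The flattened layout described above amounts to a map from each synonym to its (word, ID) pair. A minimal plain-Java sketch, assuming the synonyms arrive as one comma-separated string as in the example row; SynonymTable, Match and addRow are hypothetical names, and this is neither Lucene nor Guava code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory lookup: synonym -> (base word, ID).
class SynonymTable {
    // What a lookup returns: the base word and its ID.
    static final class Match {
        final String word;
        final String id;

        Match(String word, String id) {
            this.word = word;
            this.id = id;
        }
    }

    private final Map<String, Match> bySynonym = new HashMap<>();

    // Registers one source row: every synonym becomes a key pointing at (word, id).
    void addRow(String id, String word, String synonymsCsv) {
        Match match = new Match(word, id);
        for (String synonym : synonymsCsv.split(",")) {
            bySynonym.put(synonym.trim(), match); // trim the space left after each comma
        }
    }

    // Exact, case-sensitive lookup; returns null when the text is unknown.
    Match lookup(String text) {
        return bySynonym.get(text);
    }

    public static void main(String[] args) {
        SynonymTable table = new SynonymTable();
        table.addRow("13991", "A", "a, A, Aa, aa, AA");
        Match hit = table.lookup("aa");
        System.out.println(hit.word + " " + hit.id); // prints A 13991
    }
}
```

Multi-word names work the same way, since the whole string is the key.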
>> >>> >
>> >>> > I have all this information stored in an HBase table with two columns,
>> >>> > where the first column contains the actual word and the second column
>> >>> > contains the entire list of synonyms. The rowkey is the ID.
>> >>> >
>> >>> > Now, I am not sure whether it is feasible to use Lucene for this or
>> >>> > whether I should go with something like 'Guava Table' or something
>> >>> > else. I need some guidance because, being new to Lucene, I am not able
>> >>> > to think in the right direction. If it is feasible to use Lucene to
>> >>> > achieve this, how do I do it efficiently?
>> >>> >
>> >>> > I am using HBase filters right now to do the fetch, which is slowing
>> >>> > down the process.
>> >>> >
>> >>> > I am sorry if my questions sound too childish or senseless, as I am not
>> >>> > very good at Lucene. Thank you so much for your valuable time.
>> >>> >
>> >>> > Warm Regards,
>> >>> > Tariq
>> >>> > https://mtariq.jux.com/
>> >>> > cloudfront.blogspot.com
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>>
>> >>>
>> >>
>>