From: Noopur Julka
Date: Sun, 26 Aug 2012 12:09:37 +0530
Subject: Re: Efficient string lookup using Lucene
To: java-user@lucene.apache.org

I haven't yet found an answer to my original question, which was how to
make search work for Japanese characters.

Regards,
Noopur Julka


On Sun, Aug 26, 2012 at 9:17 AM, Devon H. O'Dell wrote:

> Seems worth mentioning in partial response to this thread's topics that
> (almost) regardless of index strategy, Lucene performance hinges on number
> of matched documents per query, not total docs in index. There are other
> mitigating factors (disk type, RAM size, etc), but worst case performance
> analysis can generally be modeled in terms of matched documents as opposed
> to index size.
>
> Apologies for any spelling / grammatical errors; this is sent from my
> phone.
>
> --dho
> On Aug 25, 2012 11:02 PM, "Noopur Julka" wrote:
>
> > Index being very large can be ruled out as Luke returned few results and
> > the app is capable of returning approx 200 results.
> >
> > Regards,
> > Noopur Julka
> >
> >
> >
> > On Sun, Aug 26, 2012 at 6:40 AM, Ilya Zavorin wrote:
> >
> > > Does Lucene support this type of structure, or do I need to somehow
> > > implement it outside Lucene?
> > >
> > > By the way, I need this to run on an Android phone so size of memory
> > > might be an issue...
> > >
> > > Thanks,
> > >
> > >
> > > Ilya Zavorin
> > >
> > >
> > > -----Original Message-----
> > > From: Dawid Weiss [mailto:dawid.weiss@gmail.com]
> > > Sent: Friday, August 24, 2012 4:50 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Efficient string lookup using Lucene
> > >
> > > What you need is a suffix tree or a suffix array. Both data structures
> > > will allow you to perform constant-time searches for existence/
> > > occurrence of any input pattern. Depending on how much text you have on
> > > the input it may either be a simple task -- see here:
> > >
> > > http://labs.carrotsearch.com/jsuffixarrays.html
> > >
> > > or a complicated task if your input size is larger (larger than memory).
> > > Google search for suffix trees/suffix arrays though, it's the data
> > > structure to use here.
> > >
> > > Dawid
> > >
> > > On Fri, Aug 24, 2012 at 9:48 PM, Ilya Zavorin wrote:
> > > > Hi Everyone,
> > > >
> > > > I have the following task. I have a set of documents in multiple
> > > > languages. I don't know what these languages are. Any given doc may
> > > > contain text in several languages mixed up. So to me these are just a
> > > > bunch of Unicode text files.
> > > >
> > > > What I need is to implement an efficient EXACT string lookup. That is,
> > > > I need to be able to find ANY Unicode string exactly as it appears. I
> > > > do not care about language-specific modifications of the string. That
> > > > is, if I search for a string "run", I do not need to find "ran" but I
> > > > do want to find it in all of these strings below:
> > > >
> > > > Fox is running fast
> > > > !%#^&$run!$!%@&$#
> > > > run,run
> > > >
> > > > Is there a way of using StandardAnalyzer or any other analyzer and the
> > > > corresponding query parser to find these? Again, my queries might be
> > > > more or less random Unicode sequences and I need to find all their
> > > > occurrences in the text.
> > > >
> > > > Essentially, what I am trying to do is implement substring matching
> > > > more efficiently than using Java's standard substring matching
> > > > methods.
> > > >
> > > > Thanks!
> > > >
> > > > Ilya Zavorin
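
The suffix-array approach suggested in the thread can be sketched in plain Java. This is only a minimal illustration under stated assumptions: it is not the jsuffixarrays API and not something Lucene provides, the class name SuffixArraySketch and its methods are hypothetical, and the construction is a naive sort suited to small inputs. It compares UTF-16 code units, so it matches any Unicode string exactly as written (including Japanese text) with no analyzer or tokenizer involved. Lookup in this sketch is a binary search costing O(m log n) character comparisons for a pattern of length m over a text of length n, plus the number of hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a naive suffix array over UTF-16 code units. */
class SuffixArraySketch {
    private final String text;
    private final Integer[] suffixes; // start offsets, sorted by the suffix they begin

    SuffixArraySketch(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction by comparison sort; acceptable for small inputs only.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    /** Lexicographic comparison of the suffixes starting at offsets a and b. */
    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int d = text.charAt(a) - text.charAt(b);
            if (d != 0) return d;
            a++;
            b++;
        }
        // The suffix that ran out first is a prefix of the other and sorts first.
        return (a < n ? 1 : 0) - (b < n ? 1 : 0);
    }

    /** Compares the suffix at start, truncated to the pattern length, with the pattern. */
    private int comparePrefix(int start, String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            if (start + i >= text.length()) return -1; // suffix shorter than pattern
            int d = text.charAt(start + i) - pattern.charAt(i);
            if (d != 0) return d;
        }
        return 0; // the suffix starts with the pattern
    }

    /** Returns every offset at which pattern occurs, in ascending order. */
    List<Integer> occurrences(String pattern) {
        int lo = 0, hi = suffixes.length;
        while (lo < hi) { // binary search for the first suffix >= pattern
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(suffixes[mid], pattern) < 0) lo = mid + 1; else hi = mid;
        }
        List<Integer> hits = new ArrayList<>();
        // All suffixes that start with the pattern are adjacent in sorted order.
        for (int i = lo; i < suffixes.length && comparePrefix(suffixes[i], pattern) == 0; i++) {
            hits.add(suffixes[i]);
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(new SuffixArraySketch("Fox is running fast").occurrences("run")); // [7]
        System.out.println(new SuffixArraySketch("!%#^&$run!$!%@&$#").occurrences("run"));   // [6]
        System.out.println(new SuffixArraySketch("run,run").occurrences("run"));             // [0, 4]
        System.out.println(new SuffixArraySketch("run,run").occurrences("ran"));             // []
        System.out.println(new SuffixArraySketch("東京都に住む").occurrences("京都"));          // [1]
    }
}
```

The boxed Integer[] keeps the sketch short but costs far more memory than the text itself; on a memory-constrained device such as an Android phone, an int[] with a hand-written sort, or a dedicated suffix-array library, would likely be a better fit.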