From: Erik Hatcher
To: general@lucene.apache.org
Cc: java-user@lucene.apache.org
Subject: Re: Indexing of virtual "made up" documents
Date: Tue, 26 Apr 2005 09:00:05 -0400
In-Reply-To: <426DEBE1.1060309@gmx.net>
Message-Id: <816ce3ffeb6daf26ceddfadb287fba6c@ehatchersolutions.com>
On Apr 26, 2005, at 3:21 AM, Daniel Stephan wrote:

> let's see if somebody listens on this list :-D

I doubt many are on this list yet. But your question is probably best asked on the java-user@lucene list rather than here. I'll CC java-user this time to loop those folks in.

> I wonder if the following is possible with Lucene.

Yes, it is!

> I would like to add documents to the index, which aren't real
> documents. :-) Meaning: there is no text to parse and tokenize. What
> I have is a number of features, some simple words, some combinations
> of words. Those features classify an entity in my database.
>
> I also have my own parser/analyzer/tokenizer which can take a text
> and extract those features from it. (Possibly a query.)
>
> So, I want to do something like (pseudo code):
>
> Lucene.index(myEntity.getId, myEntity.getDescriptors)
>
> and then, when a query is issued:
>
> List entityIds = Lucene.query(myQuery.convertToLuceneQueryLanguage)
>
> I was looking at the source and couldn't find a way to skip the
> analyzing stage and hand the features to Lucene myself.

You have two good options here. The first is to add each token individually as a Field.Keyword() with the same field name - these will not be analyzed.

> One possibility would be to use an analyzer which only considers
> whitespace as a delimiter and set all descriptors as one string.
> This feels suboptimal: I have them as single tokens already, so
> concatenating them first, only to let Lucene tokenize them again,
> should not be necessary.

You're right, it's not necessary. The second option is to create a custom Analyzer that returns the tokens you've already established.

> Also, the neighbour information isn't applicable in my scenario.
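[A sketch of both options, assuming the Lucene 1.4-era API that was current for this thread. The class names and the "id"/"feature" field names are made up for illustration; features are plain strings already extracted by your own tokenizer.]

```java
import java.io.Reader;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FeatureIndexer {

    // Option 1: add each pre-extracted feature as an unanalyzed
    // Field.Keyword under the same field name.
    public static void indexEntity(IndexWriter writer, String id, List features)
            throws Exception {
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));  // stored, indexed, not tokenized
        for (Iterator it = features.iterator(); it.hasNext();) {
            doc.add(Field.Keyword("feature", (String) it.next()));
        }
        writer.addDocument(doc);
    }
}

// Option 2: a custom Analyzer that simply emits tokens you have
// already established, ignoring the Reader entirely.
class PrecomputedAnalyzer extends Analyzer {
    private final List tokens;  // feature strings extracted elsewhere

    PrecomputedAnalyzer(List tokens) {
        this.tokens = tokens;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenStream() {
            private final Iterator it = tokens.iterator();
            private int offset = 0;

            public Token next() {
                if (!it.hasNext()) {
                    return null;  // end of stream
                }
                String text = (String) it.next();
                // Offsets are synthetic; positions default to increment 1.
                Token token = new Token(text, offset, offset + text.length());
                offset += text.length() + 1;
                return token;
            }
        };
    }
}
```

With option 1 no analyzer runs at all on those fields, which is the simplest fit when the features are already atomic tokens.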
> It seems you use placement of terms somehow. I don't have placement
> information. Would that hurt Lucene?

Placement of terms is used in phrase queries, but it certainly isn't something you need to concern yourself with. You can simply emit tokens in whatever positions you like (leaving the default position increment at 1 is what I'd recommend).

> I am not sure how Lucene uses the placement information, but in the
> described case where I concatenate all my features into a
> whitespace-delimited text, I fear that Lucene uses the placement of
> features in this made-up text and comes to some wrong conclusions
> (after all, the placement is arbitrary in the "made-up" text).

What wrong conclusions do you fear here? Again, position information is used for phrase queries, and in your situation you wouldn't be using phrase queries, so there's no need to concern yourself with positions at all.

> Also 2, I am not sure yet what the converter would have to look
> like. After all, the terms in the query have to be of the same form
> as those in the index, otherwise they wouldn't match. Can I inject
> my own analyzer only for the query part, so that Lucene hands it
> phrases and lets it build features from those phrases?

Sure - you can use any analyzer you like for query parsing. It sounds like you aren't going to use QueryParser, though, so you may not need an analyzer at all. You definitely have to ensure that the terms in the query match the terms you indexed in order to find documents. How you do that is really up to you.

> Any info is appreciated. I could maybe build my own simple index -
> the analyzer is already there - but I would prefer to use a
> professional solution with a good query language and some additional
> nice-to-have features.

If you could give us something more concrete, we could help in more detail. But from the scenario you've described, Lucene fits fine - in fact, it describes the way I use it in some cases.

> May I use Lucene? :-)

Yes, you may!

	Erik
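[Skipping QueryParser, as suggested above, might look like the following sketch, again assuming the Lucene 1.4-era API. The "feature" field name is hypothetical, and the features passed in are assumed to have the same form as the indexed ones.]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class FeatureSearch {

    // Build the query directly from your own extracted features -
    // no QueryParser and no analyzer at search time.
    public static Hits search(IndexSearcher searcher, String[] features)
            throws Exception {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < features.length; i++) {
            // required=false, prohibited=false: an optional (SHOULD)
            // clause in the Lucene 1.4 BooleanQuery API.
            query.add(new TermQuery(new Term("feature", features[i])),
                      false, false);
        }
        return searcher.search(query);
    }
}
```

Because each TermQuery matches the term exactly as indexed, this only works if the query-side feature extractor produces the same strings as the index-side one, which is precisely the matching constraint discussed above.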