From: Omri Suissa
Date: Wed, 22 Aug 2012 18:35:48 +0300
Subject: Re: Lucene.net Nested Documents support (lucene version 3.4)
To: lucene-net-user@lucene.apache.org
In-Reply-To: <50349CB0.40800@devhost.se>
References: <5034856C.3090204@devhost.se> <50348642.8080807@devhost.se> <50349CB0.40800@devhost.se>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf3079ba4875b94604c7dc8072

--20cf3079ba4875b94604c7dc8072
Content-Type: text/plain; charset=ISO-8859-1

Hi Simon,
I think I will try your solution and see if it works fine on my data
(performance mainly).
Thanks a lot!
Omri

On Wed, Aug 22, 2012 at 11:47 AM, Simon Svensson wrote:

> The size of the index will grow, but not to any extreme values. If all
> values can be represented as normal 4-byte integers, then a type and two
> users would be 12 bytes [ typeId, firstUserId, secondUserId ]. You could
> choose other encodings based on your internal knowledge of the most common
> types, the size (and generation) of user ids, etc. Perhaps the same VInt
> (variable-length) integers that Lucene uses internally.
>
> Assuming 12 bytes per document, that would be an increase in index size of
> about 12 megabytes per million documents. Lucene can handle far larger
> indexes than that. This is perhaps a workaround to proceed until true
> nested documents are introduced?
>
> You could use the PositiveScoresOnlyCollector (which wraps another
> collector) to ignore hits scored with a zero value.
>
> Perhaps index every tag twice, the second time with type information
> embedded. That would make it possible to search for tag+type if that's a
> common search, but you would still need to verify permissions.
>
> // Simon
>
>
> On 2012-08-22 10:23, Omri Suissa wrote:
>
>> Thank you,
>> I can understand how this can work on a small number of documents, but
>> what if I have millions of documents?
>> Then there could be a situation where a lot of documents are returned by
>> the query and only then do we set the score to 0.
>> I would like to find a way for the documents not to be returned from the
>> query in the first place... (as far as I understand, that would be much
>> more efficient).
>>
>> Omri
>>
>> On Wed, Aug 22, 2012 at 10:12 AM, Simon Svensson wrote:
>>
>>> Hi,
>>>
>>> First off, storing this data in the index would mean that you would
>>> store the permissions at index time, not query time. Any changed
>>> permissions would require a reindexing of the documents affected.
>>>
>>> You can accomplish this using payloads. I'm not sure of the technical
>>> details regarding how they are read into memory, caching and such. I'm
>>> using payloads for a small index (a few thousand documents) to have a
>>> timestamp on indexed values (a valid-until date) so that documents no
>>> longer match a specific token after a set date. You could do something
>>> similar where type and permission information is encoded as a payload,
>>> a byte array, and verified at query time.
>>>
>>> The score is calculated using a custom similarity, specified with
>>> indexSearcher.SetSimilarity(new ValiditySimilarity());
>>>
>>>     public class ValiditySimilarity : DefaultSimilarity {
>>>         // The payload holds the valid-until timestamp as ticks.
>>>         public override Single ScorePayload(Int32 docId, String fieldName,
>>>             Int32 start, Int32 end, Byte[] payload, Int32 offset, Int32 length) {
>>>             var validTo = BitConverter.ToInt64(payload, offset);
>>>             if (DateTime.Now.Ticks < validTo)
>>>                 return 1;
>>>
>>>             return 0;
>>>         }
>>>     }
>>>
>>> The actual payload is generated by a custom token stream when indexing
>>> the document.
>>>
>>>     document.Add(new Field("FieldName", GetTokenStream("value1 value2",
>>>         DateTime.Now.AddDays(1))));
>>>
>>>     private static TokenStream GetTokenStream(String value, DateTime validTo) {
>>>         var valueReader = new StringReader(value);
>>>         // Declared as TokenStream so the filters below can be assigned to it.
>>>         TokenStream stream = new StandardTokenizer(V.LUCENE_29, valueReader);
>>>         stream = new LowerCaseFilter(stream);
>>>         stream = new ValidityPayloadFilter(stream, validTo);
>>>         return stream;
>>>     }
>>>
>>>     public class ValidityPayloadFilter : TokenFilter {
>>>         private readonly DateTime _validTo;
>>>         private readonly PayloadAttribute _payloadAttribute;
>>>
>>>         public ValidityPayloadFilter(TokenStream stream, DateTime validTo)
>>>             : base(stream) {
>>>             _validTo = validTo;
>>>             _payloadAttribute =
>>>                 (PayloadAttribute)AddAttribute(typeof(PayloadAttribute));
>>>         }
>>>
>>>         // Attaches the valid-until ticks as a payload to every token.
>>>         public override Boolean IncrementToken() {
>>>             if (!input.IncrementToken())
>>>                 return false;
>>>
>>>             var bytes = BitConverter.GetBytes(_validTo.Ticks);
>>>             var payload = new Payload(bytes);
>>>             _payloadAttribute.SetPayload(payload);
>>>
>>>             return true;
>>>         }
>>>     }
>>>
>>> // Simon
>>>
>>>
>>> On 2012-08-22 08:13, Omri Suissa wrote:
>>>
>>>> Hi Simon,
>>>> Thanks for the help.
>>>> This is my scenario:
>>>> My search application allows users to add manual tags to each document;
>>>> each tag has a name, type and permissions.
>>>> When searching I would like to have the following options:
>>>> 1) get all the documents that contain a specific tag (with any type)
>>>>    that I have permission to view
>>>> 2) get all the documents that contain a specific tag with a specific
>>>>    type that I have permission to view
>>>>
>>>> For example, if I have 2 documents:
>>>> Doc A with tags:
>>>> X (type 1, permissions: everyone)
>>>> Y (type 1, permissions: User1, User2)
>>>> Z (type 2, permissions: User1)
>>>>
>>>> Doc B with tags:
>>>> X (type 2, permissions: everyone)
>>>> Y (type 4, permissions: everyone)
>>>> Z (type 2, permissions: User1)
>>>>
>>>> I'll be able to find A and B when searching for all documents with tag
>>>> X, only A if X with type 1, and none of them if searching for tag Z as
>>>> User2 (and so on...).
>>>>
>>>> So nested documents could really help me, where each tag is a
>>>> sub-document (like an SQL JOIN operation).
>>>>
>>>> What can I do using the current capabilities?
>>>>
>>>> Thank you for the help,
>>>> Omri
>>>>
>>>> On Tue, Aug 21, 2012 at 8:02 PM, Simon Svensson wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I do not have an answer to your explicit question, but this mail group
>>>>> could perhaps help you with workarounds using the current
>>>>> functionality.
>>>>> Are you after the search functionality (field1:a and field2:b) with
>>>>> child documents? Or grouping of the results (the SQL equivalent of
>>>>> GROUP BY)? Returning the first 5 entries of every group (like a Google
>>>>> search does per site)?
>>>>>
>>>>> // Simon
>>>>>
>>>>>
>>>>> On 2012-08-21 16:00, Omri Suissa wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We are currently implementing Lucene.Net in our solution and we need
>>>>>> to use the Lucene Nested Documents support that was introduced in
>>>>>> Lucene version 3.4.
>>>>>> If I understand correctly, the current version of Lucene.Net does not
>>>>>> support this feature (and other 3.4 features). Is there a timeline
>>>>>> for porting 3.4 to .NET?
>>>>>>
>>>>>> Thank you,
>>>>>> Omri
>>>>>>

--20cf3079ba4875b94604c7dc8072--