Mailing-List: contact user-help@lucenenet.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@lucenenet.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
Sender: omri@diffdoof.com
In-Reply-To: <50349CB0.40800@devhost.se>
References: <5034856C.3090204@devhost.se> <50348642.8080807@devhost.se>
 <CAAd_LB+0uDmw4Y5rpMd1N-Lo189=Tw8PKiJc=9dhNxgBApH7aw@mail.gmail.com>
 <50349CB0.40800@devhost.se>
From: Omri Suissa <omri.suissa@diffdoof.com>
Date: Wed, 22 Aug 2012 18:35:48 +0300
Message-ID: 
 <CAAd_LBKO04_hbUCJ=B+tcpY7hMbEbHRcdhm=yM=-0HV5-TSWRw@mail.gmail.com>
Subject: Re: Lucene.net Nested Documents support (lucene version 3.4)
To: lucene-net-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=20cf3079ba4875b94604c7dc8072

--20cf3079ba4875b94604c7dc8072
Content-Type: text/plain; charset=ISO-8859-1

Hi Simon,
I think I will try your solution and see if it works fine on my data
(performance mainly). Thanks a lot!

Omri

On Wed, Aug 22, 2012 at 11:47 AM, Simon Svensson <sisve@devhost.se> wrote:

> The size of the index will grow, but not dramatically. If all
> values can be represented as normal 4-byte integers, then a type and two
> users would be 12 bytes [ typeId, firstUserId, secondUserId ]. You could
> choose other encodings based on your internal knowledge of the most common
> types, the size (and generation) of user ids, etc. Perhaps the same VInt
> (variable-length) integers that Lucene uses internally.
>
> Assuming 12 bytes per document, that would increase the index size by
> about 12 megabytes per million documents. Lucene can handle far larger
> indexes than that. This is perhaps a workaround to proceed until true
> nested documents are introduced?
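
To make the 12-byte layout concrete, here is a minimal sketch of how such a payload could be packed and checked. The class name TagPayload, the method names, and the convention that user id 0 means "everyone" are all assumptions for illustration; none of them come from Lucene.Net or from the messages in this thread.

```csharp
using System;

public static class TagPayload
{
    // Assumption: user id 0 is reserved to mean "everyone".
    public const int Everyone = 0;

    // Packs [typeId, firstUserId, secondUserId] as three 4-byte integers (12 bytes).
    public static byte[] Encode(int typeId, int firstUserId, int secondUserId)
    {
        var bytes = new byte[12];
        BitConverter.GetBytes(typeId).CopyTo(bytes, 0);
        BitConverter.GetBytes(firstUserId).CopyTo(bytes, 4);
        BitConverter.GetBytes(secondUserId).CopyTo(bytes, 8);
        return bytes;
    }

    // Returns true when the payload's type matches (or no type is required)
    // and the user is "everyone"-permitted or one of the two listed users.
    public static bool IsVisible(byte[] payload, int offset, int userId, int? requiredTypeId)
    {
        int typeId = BitConverter.ToInt32(payload, offset);
        int first = BitConverter.ToInt32(payload, offset + 4);
        int second = BitConverter.ToInt32(payload, offset + 8);

        if (requiredTypeId.HasValue && requiredTypeId.Value != typeId)
            return false;

        return first == Everyone || userId == first || userId == second;
    }
}
```

A check like IsVisible could be called from a ScorePayload override with the offset Lucene passes in, returning 1 or 0 as in the validity example further down the thread.
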
>
> You could use the PositiveScoresOnlyCollector (which wraps another
> collector) to ignore hits scored with zero value.
>
> Perhaps index every tag twice, the second time with type-information
> embedded. That would make it possible to search for tag+type if that's a
> common search, but you would still need to verify permissions.
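
As a rough sketch of the "index every tag twice" idea, a helper could emit both the plain tag and a tag+type term. The term format "name#typeN" and the class name TagTerms are hypothetical, and this assumes the field is indexed in a way that does not split on '#' (for example as non-analyzed terms).

```csharp
using System;
using System.Collections.Generic;

public static class TagTerms
{
    // Hypothetical term format: "x" for the plain tag, "x#type1" for tag + type.
    public static IEnumerable<string> Expand(string tagName, int typeId)
    {
        var name = tagName.ToLowerInvariant();
        yield return name;                     // matches "any type" searches
        yield return name + "#type" + typeId;  // matches tag + type searches
    }
}
```

Searching for tag X with any type would then query the term "x", and tag X with type 1 would query "x#type1". Permissions would still have to be verified separately.
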
>
> // Simon
>
>
> On 2012-08-22 10:23, Omri Suissa wrote:
>
>> Thank you,
>> I can understand how this can work on a small number of documents, but what
>> if I have millions of documents?
>> Then there could be a situation where a lot of documents are returned by
>> the query and only then do we set the score to 0.
>> I would like to find a way for the documents not to be returned from the
>> query in the first place (as far as I understand, that would be
>> much more efficient).
>>
>> Omri
>>
>> On Wed, Aug 22, 2012 at 10:12 AM, Simon Svensson <sisve@devhost.se>
>> wrote:
>>
>>  Hi,
>>>
>>> First off, storing this data in the index would mean that you
>>> store the permissions at index time, not query time. Any changed
>>> permissions would require reindexing the affected documents.
>>>
>>> You can accomplish this using payloads. I'm not sure about the technical
>>> details of how they are read into memory, cached and so on. I'm
>>> using payloads for a small index (a few thousand documents) to put a
>>> timestamp on indexed values (a valid-until date) so that a document no
>>> longer matches a specific token after a set date. You could do something
>>> similar where type and permission information is encoded as a payload,
>>> a byte array, and verified at query time.
>>>
>>> The score is calculated using a custom similarity, specified with
>>> indexSearcher.SetSimilarity(new ValiditySimilarity());
>>>
>>>
>>>      public class ValiditySimilarity : DefaultSimilarity {
>>>          // Scores a token 1 while its valid-until timestamp is in the future, else 0.
>>>          public override Single ScorePayload(Int32 docId, String fieldName,
>>>              Int32 start, Int32 end, Byte[] payload, Int32 offset, Int32 length) {
>>>              // The payload holds the valid-until date as DateTime ticks (Int64).
>>>              var validTo = BitConverter.ToInt64(payload, offset);
>>>              if (DateTime.Now.Ticks < validTo)
>>>                  return 1;
>>>
>>>              return 0;
>>>          }
>>>      }
>>>
>>> The actual payload is generated by a custom token stream when indexing
>>> the document.
>>>
>>>      document.Add(new Field("FieldName",
>>>          GetTokenStream("value1 value2", DateTime.Now.AddDays(1))));
>>>
>>>      private static TokenStream GetTokenStream(String value, DateTime validTo) {
>>>          var valueReader = new StringReader(value);
>>>          // Declared as TokenStream so the filters below can be assigned to it.
>>>          TokenStream stream = new StandardTokenizer(V.LUCENE_29, valueReader);
>>>          stream = new LowerCaseFilter(stream);
>>>          stream = new ValidityPayloadFilter(stream, validTo);
>>>          return stream;
>>>      }
>>>
>>>      public class ValidityPayloadFilter : TokenFilter {
>>>          private readonly DateTime _validTo;
>>>          private readonly PayloadAttribute _payloadAttribute;
>>>
>>>          public ValidityPayloadFilter(TokenStream stream, DateTime validTo)
>>>              : base(stream) {
>>>              _validTo = validTo;
>>>              _payloadAttribute =
>>>                  (PayloadAttribute)AddAttribute(typeof(PayloadAttribute));
>>>          }
>>>
>>>          public override Boolean IncrementToken() {
>>>              if (!input.IncrementToken())
>>>                  return false;
>>>
>>>              var bytes = BitConverter.GetBytes(_validTo.Ticks);
>>>              var payload = new Payload(bytes);
>>>              _payloadAttribute.SetPayload(payload);
>>>
>>>              return true;
>>>          }
>>>      }
>>>
>>> // Simon
>>>
>>>
>>>
>>> On 2012-08-22 08:13, Omri Suissa wrote:
>>>
>>>  Hi Simon,
>>>> Thanks for the help.
>>>> This is my scenario:
>>>> My search application allows users to add manual tags to each document.
>>>> Each tag has a name, type and permissions.
>>>> When searching I would like to have the following options:
>>>> 1) get all the documents that contain a specific tag (with any type) that
>>>> I have permission to view
>>>> 2) get all the documents that contain a specific tag with a specific type
>>>> that I have permission to view
>>>>
>>>> For example if I have 2 documents:
>>>> Doc A with tags:
>>>>            X (type 1, permissions: everyone)
>>>>            Y (type 1, permissions: User1, User2)
>>>>            Z (type 2, permissions: User1)
>>>>
>>>> Doc B with tags:
>>>>            X (type 2, permissions: everyone)
>>>>            Y (type 4, permissions: everyone)
>>>>            Z (type 2, permissions: User1)
>>>>
>>>> I'll be able to find A and B when searching for all documents with tag X,
>>>> only A if X has type 1, and none of them if the tag is Z and I'm User2
>>>> (and so on...).
>>>>
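
The expected results above can be sanity-checked with a small in-memory model before involving Lucene at all. This is only a sketch of the matching rules, not an indexing approach; the Tag and TagSearch names are hypothetical, and a null user list standing for "everyone" is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Tag
{
    public string Name { get; set; }
    public int Type { get; set; }
    public string[] Users { get; set; } // null means "everyone" (an assumption)
}

public static class TagSearch
{
    // The two example documents from the message, keyed by document name.
    public static readonly Dictionary<string, Tag[]> Docs = new Dictionary<string, Tag[]>
    {
        { "A", new[] {
            new Tag { Name = "X", Type = 1, Users = null },
            new Tag { Name = "Y", Type = 1, Users = new[] { "User1", "User2" } },
            new Tag { Name = "Z", Type = 2, Users = new[] { "User1" } } } },
        { "B", new[] {
            new Tag { Name = "X", Type = 2, Users = null },
            new Tag { Name = "Y", Type = 4, Users = null },
            new Tag { Name = "Z", Type = 2, Users = new[] { "User1" } } } }
    };

    // Returns the documents with at least one tag that matches the name,
    // the optional type, and is visible to the given user.
    public static List<string> Find(string tag, int? type, string user)
    {
        return Docs
            .Where(d => d.Value.Any(t => t.Name == tag
                && (type == null || t.Type == type)
                && (t.Users == null || t.Users.Contains(user))))
            .Select(d => d.Key)
            .OrderBy(k => k)
            .ToList();
    }
}
```

The important detail is that name, type and permission must all match on the same tag, which is exactly what a flat set of independent fields per document cannot express.
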
>>>> So nested documents could really help me, where each tag is a
>>>> sub-document (like an SQL JOIN operation).
>>>>
>>>> What can I do using the current capabilities?
>>>>
>>>> Thank you for the help,
>>>> Omri
>>>>
>>>> On Tue, Aug 21, 2012 at 8:02 PM, Simon Svensson <sisve@devhost.se>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>
>>>>> I do not have an answer to your explicit question, but this mail group
>>>>> could perhaps help you with workarounds using the current
>>>>> functionality.
>>>>> Are you after the search functionality (field1:a and field2:b) with
>>>>> child
>>>>> documents? Or grouping of the results (the sql equivalent of group by)?
>>>>> Return the first 5 entries of every group (like a Google search does
>>>>> per
>>>>> site)?
>>>>>
>>>>> // Simon
>>>>>
>>>>>
>>>>> On 2012-08-21 16:00, Omri Suissa wrote:
>>>>>
>>>>>   Hi everyone,
>>>>>
>>>>>> We are currently implementing Lucene.Net in our solution and we need to
>>>>>> use the Lucene Nested Documents support that was introduced in Lucene
>>>>>> version 3.4.
>>>>>> If I understand correctly, the current version of Lucene.Net does not
>>>>>> support this feature (or other 3.4 features). Is there a timeline for
>>>>>> porting 3.4 to .NET?
>>>>>>
>>>>>> Thank you,
>>>>>> Omri
>>>>>>

--20cf3079ba4875b94604c7dc8072--