river-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wade Chandler <hwadechandler-apa...@yahoo.com>
Subject Re: datastructure classes
Date Thu, 16 Dec 2010 21:02:21 GMT
Yes, unless an in-memory and local DB, then you have another layer of network 
abstraction plus the conversion factor on lookup unless talking about some kind 
of an OO DB. But, when needing to scale to a number of instances, then it may 
very well be unavoidable to go to disk. Also, without such a scheme (some form 
of persistence) there is no way to have long term fail safe persistence which I 
believe is a general requirement for a distributed system, and one which it 
seems JavaSpaces perhaps implies since it uses transactions. What happens to the 
distributed system when the shared memory goes up and down even after a 
transaction has completed successfully and not take operation has occurred?

part thinking out loud, and part solution...perhaps anyways

I'm thinking if some things are done a bit differently, then perhaps certain 
patterns do lend to a DB perhaps or better to a hashmap system. Though I think 
the issue with the DB, extra network, etc, are still an issue which a local file 
or memory system works around.

Thinking about the conversion a bit, and considering JavaSpaces implementations 
do not run or access logic in the instances they are housing, it really seems a 
waste of time to serialize to the Java serialized format and then instantiate 
those same instances to place them into some list, hash, other, etc. The type of 
entry and its fields are what are used to find something and per equality.

Some in memory implementation, in-memory from the client perspective where the 
whole of JavaSpaces is never transmitted over the air or wire, could just skip 
serialization all the way around perhaps ...except for the issue with it being a 
copy... Serialization is probably always a good idea in that case; of course 
depending on the requirements, but for some default implementation for sure.

Now, the one place that doesn't seem a waste of time is using those unmarshalled 
Entries and fields for lookup. Seems having those realized in memory already is 
much faster. A way to speed up the overall effort would be to store the 
serialized form along with the realized/unmarshalled form in the list or what 
ever data structure since these held instances are immutable per the 
write/read/take operations; no reserialization. Then the serialized form can be 
returned immediately. More memory of course, but that could be limited with a 
logical archiving mechanism for memory which hasn't been accessed for a period 
of time (paging essentially of the serialized form). A hash, depending on size, 
would of course work for that whether on disk or in memory.

Dwelling further, it doesn't seem probable to just take the Entry fields and 
serialize them associate them with their hash (hashCode) and then use something 
to compare those things because there is a possibility that hashCode hasn't been 
overridden and thus equals and hashCode won't necessarily both represent 
equality with the same meaning. Were there some guarantee that is the case, then 
that would be possible to do. 

However, and contrary to what I just wrote, since here I'm not talking about 
Object.hashCode and equals, it may be enough to just serialize each field of an 
Entry and take the whole (serialized field that is) and store it as a hash for 
comparison while the serialized form of the overall Entry can be stored for 
reads/takes; no reserialization needed this way when sending an object back to a 
caller. Perhaps those serialized fields can be taken from the Java serialization 
format directly. The work flow of a distributed spaces implementation may have 
the following work flow:

1) A JavaSpaces service is looked up.
2) An Entry is written to the space.
3) The client impl serializes the object with regular Java serialization and 
sends it over the wire/air
4) The server impl takes the serialized instance and scans it; while scanning 
the fields of the Entry (those matching spec requirements of course) and while 
streaming them out is generating a hash or checksum for them. In other words, 
checks the format while at the same time generating hashes which can be used to 
check for equality since the specification states only equality will be used for 
5) The serialized form is stored in memory or to disk along with information 
about the types this class extends or implements
6) The field names and hashes are stored and reference some key such as the hash 
for the overall serialized form which is used in a hash map for memory or a 
filename if on disk. Equal hashes should be OK if a reference counter is used to 
save memory and keep from storing the exact same Entries in more than one file 
if write is used more than once.
7) The serialized form is stored along with information about the types this 
class extends or implements

read/take (will leave out removal part of take here...simplest part):
1) A JavaSpaces service is looked up.
2) An Entry is requested with a template
3) The client impl serializes the template and sends it
4) The server impl performs roughly the same operations as 4 of course ignoring 
the null fields of the Entry (wild cards)
5) The generated hashes are used to look at only the required types and fields
6) The first match is returned

That seems like a good base of something which can start to incorporate 
clustering for the contained data as well as support disk or memory based 
contents. A single data structure would not support everything needed to be 
accomplished perhaps, but locking could be reduced to a couple locations I 
believe if memory is segmented into multiple containers. For disk, a structure 
versus a single file could work well.

For instance, the above lends itself to hashmaps. HashMaps can hold their 
contents in memory or in regular random access files since we are just talking 
buckets. The files can hold the buckets where the keys and base file names can 
be held. The key is to having segments and those being of a good size to not by 
default consume large amounts of memory yet provide as few steps as possible for 
a system which is some where in between. By using a hash, versus just an array 
or list, we can move around information per its key or hash, and this same key 
can be placed into any given segment and still function just the same if the 
size of the hashmaps are controlled and set to always be the same size. Too, 
locking can then be done away with on the hashmap segments all together since we 
won't actually resize and the hash and buckets don't need to be recalculated.

Now, there may be times when memory is needed to be regained from unused maps, 
so some locking would be needed for that, but the time to lookup a value would 
be drastically reduced, and for the most part less the field comparisons a 
lookup would take 1n where n is the number of hashmaps which would be 
considerably smaller than where n is the number of Entries as we have now. Too, 
where we're generating hashes while we're scanning the serialized Java class 
that should help in the cost savings even though we have added a new process. 
Luckily hashes are additive and can be made part of the process overall. If 
during the scanning and hash generation there are any errors with the serialized 
format, an error can be thrown as before where unmarshalling and object would 
have taken place.

Clustering, using something like Shoal, could then be possible with only the 
need to send the hashes, field names, types, and serialized format or the pieces 
required to do the job. This way each node in the cluster doesn't have to 
perform the heavier operations and instead just store them using the hash 
mechanisms. At least in a replicating cluster that is.

Anyways, this email is way too long as it is, but that seems like a good start. 

The other parts to figure out are the exact places locking would need to take 
place, how exactly the field references would be structured and stored, how much 
memory that then entails, but something like class_field_hashofvalue (this 
becomes a hashmap key) and the value pointed to is the hash used in the other 
hashmap for actual object lookup. That final hash can be used to pull out the 
serialized form from disk or directly from an in memory hashmap; too that hash 
is a hash of the overall serialized format and can be used to determine if the 
final value really needs to be stored again or not. With that there are some 
counters which need to be incorporated into some formal design.

So that's part of my idea. I'm still dwelling on the overall data structure and 
is why I haven't really commented more yet.


Wade Chandler
Software Engineer and Developer
NetBeans Dream Team Member and Contributor


----- Original Message ----
> From: Mike McGrady <mmcgrady@topiatechnology.com>
> To: "river-dev@incubator.apache.org" <river-dev@incubator.apache.org>
> Cc: "river-dev@incubator.apache.org" <river-dev@incubator.apache.org>
> Sent: Thu, December 16, 2010 10:04:55 AM
> Subject: Re: datastructure classes
> Intense network systems like our clients cannot work with database calls.   
>They are too slow and do not scale.  They either reach a cost or a  performance 
> Sent from my iPhone
> Michael  McGrady
> Principal investigator AF081_028 SBIR
> Chief Architect
> Topia  Technology, Inc
> Work 1.253.572.9712
> Cel 1.253.720.3365
> On Dec 16,  2010, at 5:55 AM, Patricia Shanahan <pats@acm.org> wrote:
> >> Patricia,
> >> If you don't mind, I am going  to argue for sticking to the machines,
> >> but answer your question  roughly at the end.
> > 
> > I don't mind machines, as long as they get  somewhat quantified. It was
> > "machines a minute" that really gave me  trouble. However, there are
> > machines and machines, so I would rather  think in terms of transaction
> > rates.
> > 
> >> The questions  about either memory use or connections or transactions
> >> are stressors  leading to the question and the urgency whether or not
> >> something  must scale, but they have nothing or very little to do with
> >> whether  it will scale.
> >> If we try to involve multiple machines, then we will  discover
> >> frequently that, even if we do not need to scale, we cannot  because,
> >> for example, of Amdahl's Theorem.  If we cannot go to  multiple
> >> machines as a practical matter, we cannot scale no matter  what
> >> (ceteris paribus).  If we can, then we will find, it is  our
> >> experience, that we can scale no matter what (ceteris  paribus).
> > 
> > Scaling is exactly the reason why I think the current  FastList-based
> > design is seriously limited, and may need to be replaced  by something
> > that indexes its data.
> > 
> > Outrigger turns a  JavaSpace read into a linked
> > list scan of all the entries in the space  for the required type. Not too
> > bad if the list rarely changes and is  always small enough to fit in
> > cache. If it is big or changing, that is  going to cause a lot of memory
> > traffic.
> > 
> > As implemented,  it cannot be spread across machines very
> > effectively because the  additional scanning due to one machine not knowing 
>that another has already  found the entry would use up the added scan power.
> > 
> > I think to  scale it is going to need some sort of indexing, and as I
> > think about  indexing for the sort of ad-hoc query that JavaSpaces face,
> > combined  with frequent change, it starts looking very like maintaining
> > an indexed  database table.
> > 
> >> However, we should be able to do, say,  hundreds of millions of
> >> transactions in a day in real-time critical  systems such as the FAA
> >> or the stock market with data affinity and  integrity and all the
> >> other "ilities".  If Outrigger cannot do  this, it is of no interest
> >> to us.
> > 
> > The current  record for a relational database doing simple transactions
> > is 30 million  transactions per minute (Oracle/SPARC TPC-C). Your mileage
> > may vary, but  there is no inherent limit on relational database scaling
> > that puts a  few hundred thousand transactions per minute out of reach.
> > 
> > 
> >> MG
> >> On Dec 14, 2010, at 9:48 PM, Patricia Shanahan  wrote:
> >>> Please clarify "machines a minute". Can you express it  e.g. in
> >>> transactions per minute?
> >>>  Patricia
> >>> Mike McGrady wrote:
> >>>> Linear - 7000  machines a minute, or more. Sent from my iPhone Michael

>McGrady Principal  investigator AF081_028 SBIR Chief
> >>>> Architect Topia  Technology, Inc Work 1.253.572.9712 Cel
> >>>> 1.253.720.3365 On  Dec 14, 2010, at 2:04 PM, Patricia Shanahan
> >>>> <pats@acm.org>  wrote:
> >>>>> On 12/14/2010 1:49 PM, MICHAEL MCGRADY  wrote:
> >>>>>> On Dec 14, 2010, at 1:40 PM, Patricia  Shanahan wrote:
> >>>>> ...
> >>>>>>> If  we made a persistent version use a relational  database
> >>>>>>> to represent the  space,
> >>>>>> This would not be usable for us.  This  is too slow and does
> >>>>>> not have the correct QCC  features, especially scalability.
> >>>>>  ...
> >>>>> Could you put some numbers on what sort of  scalability you
> >>>>> require?
> >>>>>  Patricia
> >> Michael McGrady Chief Architect Topia Technology, Inc.  Cel
> >> 1.253.720.3365 Work 1.253.572.9712 extension 2037 
> > 
> > 
> > 

View raw message