Network Working Group
A Commentary on Procedure Calling as a Network Protocol
Richard Schantz BBN-TENEX
This RFC is being issued as a first step in an attempt to stimulate a dialog on some issues in designing a distributed computing system. In particular, it considers the approach taken in a design set forth in RFC #674, commonly known as the "Procedure Call Protocol" (PCP). In the present document, the concentration is on what we believe to be the shortcomings of such a design approach.
Note at the outset that this is not the first time we have provided a critical commentary on PCP. During the earlier PCP design stages, we met with the PCP designers for a brief period and suggested several changes, many of which became part of PCP Version 2. We hasten to add, however, that the nature of those suggestions stems from an entirely different point of view than the one presented here. Our original suggestions, and also some subsequent ones, mainly addressed details of implementation. In this note the concern is more with the concepts underlying the PCP design than with the PCP implementation.
This note is being distributed because we feel that it raises certain issues which have not yet been adequately addressed. The PCP designers are to be congratulated for providing a detailed written description of their ideas, thereby creating a natural starting point for a discussion of distributed system design concepts. It is the intent of this note to stimulate an interaction among individuals involved with distributed computing, which could perhaps result in systems whose designs don't preclude their use in projects other than the one for which they were originally conceived.
While the Procedure Call Protocol (PCP) and its use within the National Software Works (NSW) context attack many of the problems associated with integrating independent computing systems to handle a distributed computation, it is our feeling that its design contains flaws which should prevent its widespread use and which, in our view, limit its overall utility. We are not voicing an objection to the use of PCP, in its current definition, as the base level implementation vehicle for the NSW project. It is already too late for any such objection, and PCP may, in fact, be very effective for the NSW implementation, since they are proceeding in parallel and have probably influenced each other. Rather, we are voicing an objection to the "PCP philosophy", in the hope of preventing this type of protocol from becoming the de facto network standard for distributed computation, and in the hope of influencing the future direction of this and similar efforts.
Some of the objectionable aspects of PCP, it can be argued, are differences of individual preference, and philosophers have often indicated that you cannot argue about tastes. We have tried to avoid such arguments in this document. Rather, we consider PCP in light of our experience in developing distributed systems. Considered in this way, we feel that PCP and its underlying philosophy have flaws which make it inappropriate as a general purpose protocol and virtual programming system for the construction of distributed software systems.
It is our opinion that PCP is probably complete, in the sense that one can probably do anything that is required using its primitives. A key issue, then, is not whether this function or that function can be supported. Rather, to us an important question is how easy it is to do the things which experience has indicated are important to distributed computing.
In addition, a programming discipline dedicated to network applications should pay particular attention to coercing its users away from actions which systems programming in general and network programming in particular have shown to be pitfalls in system implementation.
A Point of View
Rather, we take exception to PCP's underlying premise: that the procedure calling discipline is the starting point for building multi-computer systems. This premise leads to a model which has a central point for the entire algorithm control, rather than a more natural (in network situations) distributed control accomplished by cooperating independent entities interacting through common communication paths. While the procedure call may be an appropriate basis for certain applications, we believe that it can neither directly nor accurately model the interactions and control structures that occur in many distributed multi-computer systems. Much of what follows may seem to be a pedagogic argument, and PCP supporters may take the position of "who cares what you call it, it's doing the same thing". Our reply is that it is very important to achieve a clear and concise model of distributed computation, and while the PCP model does not require "poor implementation" of distributed systems, neither does it make "good implementation" any easier, nor does it prohibit ill-advised programming practices. A model stressing the dynamic interconnection of somewhat independent computing entities, we feel, adheres more to the notions of defensive programming, which we have found to be fundamental to building usable multi-machine implementations.
The rest of this RFC discusses what we feel to be some of the
shortcomings of a procedure call protocol.
Limitations of Procedure Calling Across Machines
First and foremost, it is our contention that procedure calling should not be the basis for multi-machine interactions. We feel that a request and reply protocol, along with suitably manipulated communication paths between processes, forms a model better suited to the situation in which the network places us. In a network environment one has autonomous computing entities which have agreed on their cooperation, rather than a master process forcing execution of a certain body of code to fulfill its computing needs. In such a configuration, actions required of a process are best accommodated indirectly (by request) rather than directly (by procedure call), in order to maintain the integrity of the constituent processes.
A procedure call protocol blurs the distinction between a local (inexpensive in time and effort) and a remote procedure call, which obviously requires a great deal of effort by the "PCP system", if not by the PCP user. It also seems to be the case that the cost of blurring the local/remote distinction at the procedure call level will be found in the more frequent use of a less efficient local procedure calling mechanism. Interprocess communication, on the other hand (at least with regard to stream or message oriented channels and not just interrupt signals), is generally regarded as having a significant cost associated with it. Message sending is always an interprocess action, and always requires system intervention. There is not as substantial a difference between the IPC of local processes and the IPC of remote processes as there is between local and remote procedure calling.
PCP is suggestive of a model in which processes exist that span machine boundaries to provide inter-machine subroutine calling. Yet the PCP documentation has not advocated the notion of a process that spans machine boundaries, and rightfully so, since such a creation would cause innumerable problems.
Since procedure calling is more suitable as an intra-process notion, it seems to be a better idea to take the interprocess communication framework and extend it to have a uniform interpretation locally and remotely, rather than to extend the procedure calling model. It is also our contention that a model which relies on procedure calling for its basis does not take into account the special nature of the network environment, and that such an environment can be more suitably handled in a message passing model. Furthermore, we feel that programming as a whole, even purely local computing, will benefit from paying more attention to such areas as reliability and robustness, which have been brought to the forefront through experience with an oftentimes unreliable network and collection of hosts. An IPC model, by emphasizing the connections between disjoint processes, seems to reinforce the idea that distributed computing is accomplished by joining separate entities, and that defensive programming and error handling techniques are appropriate. Since PCP is, we think, for distributed system builders, and not for the end user (e.g. an RSEXEC user), avoiding the network, interconnection issues, and relative costs, may be counter-productive if the goal is to achieve usable network systems. The unmistakable trend in systems programming is toward inviolable (protected) process structures with external synchronization as a means of coping with complex debugging tasks and the difficulty of making system changes. This trend is better supported, we feel, by a message passing rather than a procedural model of computation. Furthermore, we feel that network programming techniques should be applied to local computation, not the other way around.
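The request and reply model argued for above can be illustrated with a minimal sketch. It is written in Python and assumes that threads and in-memory queues stand in for independent processes and network communication paths; the names (server, request, demo, the "add" operation) are hypothetical and are not taken from RFC 674 or from any existing system. The serving process owns its state, decides for itself how to answer each message, and may refuse; the requester treats a missing reply as an ordinary outcome rather than an exceptional one.

```python
import queue
import threading

def server(inbox):
    """Autonomous server: owns its state and decides how to answer each request."""
    counter = 0                      # private state, never touched by the requester
    while True:
        msg = inbox.get()
        if msg is None:              # shutdown sentinel (an assumption of this sketch)
            break
        op, arg, reply_to = msg
        if op == "add":
            counter += arg
            reply_to.put(("ok", counter))
        else:
            # The server may refuse a request; nothing is forced on it.
            reply_to.put(("error", "unknown request: " + op))

def request(inbox, op, arg, timeout=1.0):
    """Send a request and wait for a reply, treating silence as a failure."""
    reply_to = queue.Queue()
    inbox.put((op, arg, reply_to))
    try:
        return reply_to.get(timeout=timeout)
    except queue.Empty:
        return ("timeout", None)

def demo():
    inbox = queue.Queue()
    t = threading.Thread(target=server, args=(inbox,), daemon=True)
    t.start()
    results = [
        request(inbox, "add", 2),
        request(inbox, "add", 3),
        request(inbox, "no-such-operation", 0),
    ]
    inbox.put(None)
    t.join()
    # A request sent to a server that has stopped simply gets no reply.
    results.append(request(inbox, "add", 1, timeout=0.1))
    return results
```

Calling demo() exercises the three cases a defensive requester has to expect: a normal reply, a refusal, and no reply at all.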
Some Particulars
In the following list, we try to be more specific with respect to particular situations where we think the PCP concept may be weak as the basis for a network programming system. For some of these examples to be meaningful, the reader should be fairly familiar with the PCP documents issued as RFC 674.
1. Recovery from component malfunction may be very difficult to handle by a process that is not the central control (i.e. a process which is being manipulated by having its procedures executed). Is the situation where there is network trouble, for example, to be modeled by a forced procedure call to some error recovery routine? It is precisely such situations where distributed control serves as a better model. Consider the act of introducing an inferior to another acquaintance and then supplying the new handle as a parameter of a subsequent procedure call in the inferior. The inferior's blind use of the parameter to interact with the other process illustrates the manipulative aspects of a superior. The inferior never really is aware of a new communication path to a new process. The inferior environment (as maintained by the PCP "system") has been changed by the superior, with no active notification of the inferior. Certainly this makes user coded error recovery somewhat awkward.
2. Such process manipulation may at times violate the principles of modular programming. In this vein, it seems beneficial to be able to debug separately the pieces of a computation and then worry only about their synchronization to achieve a totally debugged system. With PCP in its fullest sense, the danger of error propagation seems greater because of the power of a process to cause execution of an arbitrary procedure and to read/write remote data stores without the active participation of the remote process.
conveniently apply protective standards to the code and data
4. PCP doesn't model long term parallel activity in a convenient fashion, as is required to handle various asynchronous producer/consumer process relationships. The synchronization is geared more to a one-to-one call and return, rather than to the asynchronous nature and multiple returns for a single request, as exhibited by many network services. In addition, low priority, preemptable background tasks are hard (impossible?) to model in a procedure call environment.
5. Communication paths are not treated as abstract objects which are independent from the actual entities they connect, and hence they cannot be utilized in some useful ways (e.g. to carry non-PCP messages). Also with respect to treating communication paths as objects, there is no concept of passing a communication path to an inferior (or an acquaintance) without having to create a new "connection" (whether or not this turns out to be a physical channel). The ability to pass communication paths is often useful in subcontracting requests to inferior processes. To do this within PCP requires the cooperation of the calling process (i.e. to use the new connection handle), which again seems to violate the concepts of modular programming. The alternative approach in PCP is to have the superior relay the subsequent communications to its created inferior, but the effort involved would probably prohibit the use of this technique for subcontracting.
6. PCP seems too complicated to be used for the type of processing which requires periodic but short (i.e. a few words exchanged) interactions. An example of such interactions is the way the TIP uses the TENEX accounting servers (see RFC #672). Furthermore, PCP is probably much too complex for implementation on a small host. In that regard, there does not seem to be a definition of what might constitute a minimum implementation for a host/process which did/could not handle all of what has been developed.
7. In the PCP model, it may become awkward or resource consuming for a service program to do such things as queue operations for execution at a later time (persistence) or at a more opportune time (priority servicing mechanism). Such implementations may require dummy returns and modification of the controlling fork concept, or maintenance of processing forks over long periods of inactivity.
to it and then introducing it to an arbitrary number of processes, thereby overflowing the table space in that process. All of that could have been done without ever executing a single instruction of user written code. This difficulty is a consequence of the PCP notion of having one process manipulate the environment of another without its active participation in such actions.
9. Doesn't the fact that the network PCP process implementation is so much neater than the TENEX PCP process implementation (since TENEX doesn't have a general IPC facility) suggest that message passing and communication facilities supported by the "system" provide a sound basis for multi-process implementations, and that perhaps such facilities should be primitively available to the distributed system builders who will use PCP?
10. There is a question of whether PCP is an implementation virtual machine (language) or an application virtual machine (language). That is, is PCP intended to be used to implement systems which manage distributed resources, or as an end product which makes the network resources themselves easier to use for the everyday, ordinary programmer (e.g. makes the network itself transparent to users)? One gets the feeling that the designers had both goals, and that neither one is completely satisfied. If the former goal is taken, we believe that most of the complexities (e.g. network trouble, broken connections, etc.) and possibilities (e.g. redundant implementation, broadcast request, etc.) of network implementations are not provided for adequately.
In this view, the NSW framework (Works Manager, FE) is the distributed system that utilizes the PCP implementation language. We do not see how the use of PCP in this context provides for either an extra-reliable system through component redundancy, or a persistent system which can tolerate temporary malfunctions. If one subscribes to this view, then it doesn't seem right that the objects that run under the created system (i.e. the tools that run under the PCP implemented Front End, Works Manager, and TBH monitor) should also be aware of or use PCP. If one considers the latter goal, that PCP implements a virtual machine to be presented to all programmers for making distributed resources easy to use, then it is clear that PCP with its manifest concern for object location does not provide for the desirable properties of network transparency.
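Items 4 and 5 in the list above (multiple returns for a single request, and communication paths treated as objects that can be passed along) can be pictured with a small sketch. It is written in Python and assumes that threads and in-memory queues stand in for processes and communication paths; every name in it (service, worker, collect, the END marker) is hypothetical and is not part of PCP or RFC 674. The service subcontracts a request by handing the requester's reply path to an inferior, which then sends several replies directly, with no relaying by the service and no special cooperation from the requester.

```python
import queue
import threading

END = object()  # hypothetical end-of-replies marker

def worker(job, reply_path):
    """Inferior: answers directly on the path it was handed, several times."""
    for i, word in enumerate(job.split()):
        reply_path.put((i, word.upper()))   # one of many replies to one request
    reply_path.put(END)

def service(inbox):
    """Superior: subcontracts each request by passing the reply path along."""
    while True:
        msg = inbox.get()
        if msg is None:                     # shutdown sentinel (assumption)
            break
        job, reply_path = msg
        # The path itself is handed over; the service relays nothing.
        threading.Thread(target=worker, args=(job, reply_path)).start()

def collect(reply_path, timeout=1.0):
    """Requester side: gather replies until the end marker arrives."""
    replies = []
    while True:
        item = reply_path.get(timeout=timeout)
        if item is END:
            return replies
        replies.append(item)

def demo():
    inbox = queue.Queue()
    svc = threading.Thread(target=service, args=(inbox,))
    svc.start()
    reply_path = queue.Queue()
    inbox.put(("pass the path", reply_path))
    replies = collect(reply_path)
    inbox.put(None)
    svc.join()
    return replies
```

In this sketch the requester never learns which process produced the replies; it only reads from the path it created, which is the property item 5 finds missing when a new connection handle must be adopted by the caller.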