IEN: 168

VAX-UNIX Networking Support Project
Implementation Description

Robert F. Gurwitz
Computer Systems Division
Bolt Beranek and Newman, Inc.
Cambridge, MA 02138

January, 1981

VAX-UNIX Networking                                 January, 1981
Support Project                                           IEN 168

1 Introduction

The purpose of this report is to describe the implementation of network software for the VAX-11/780* running UNIX.** This is being done as part of the VAX-UNIX Networking Support Project. The overall purpose of this effort is to provide the capability for the VAX to communicate with other computers via packet-switching networks, such as the ARPANET. Specifically, the project centers around an implementation of the DoD standard host-host protocol, the Transmission Control Protocol (TCP) [4]. TCP allows communication with ARPANET hosts, as well as hosts on networks outside the ARPANET, by its use of the DoD standard Internet Protocol (IP) [3]. The implementation is designed for the VAX, running VM/UNIX, the modified version of UNIX 32/V developed at the University of California, Berkeley [1]. This version of UNIX includes virtual paging capabilities.

In the following paragraphs, we will discuss some features and design goals of the implementation, and its organization.

2 Features of the Implementation

2.1 Protocol Dependent Features

2.1.1 Separation of Protocol Layers

The TCP software that we are developing for the VAX incorporates several important features. First, the implementation provides for separation of the various protocol layers so that they can be accessed independently by various applications. (1) Thus, there is a capability for access to the TCP level, which will provide complete, reliable, multiplexed, host-host communications connections. In addition, the IP level is also accessible for applications other than TCP, which require its internet addressing and data fragmentation/reassembly services.
Finally, the implementation also allows independent access to the local network interface (in this case, to the ARPANET, whose host interface is defined in BBN Report No. 1822 [2]) in a "raw" fashion, for software which wishes to communicate with hosts on the local network and do its own higher level protocol processing.

_______________
* VAX is a trademark of Digital Equipment Corporation.
** UNIX is a trademark of Bell Laboratories.
(1) In this context, the terms application and user refer to any software that is a user of lower level networking services. Thus, programs such as FTP and TELNET can be considered applications when viewed from the TCP level, and TCP itself may be viewed as an application from the IP level.

2.1.2 Protocol Functions

Another feature of the implementation is to provide the full functionality of each level of protocol (TCP and IP), as described in their specifications [3,4]. Thus, on the TCP level, features such as the flow control mechanism (windows), precedence, and security levels will be supported. On the IP level, datagram fragmentation and reassembly will be supported, as well as IP option processing, gateway-host flow control (source-quenching) and routing updates. However, it is anticipated that some of these features (such as handling IP gateway-host routing updates, and IP option processing) will be implemented in later stages of development, after more basic features (such as TCP flow control and IP fragmentation/reassembly) are debugged.

2.2 Operating System Dependent Features

2.2.1 Kernel Resident Networking Software

There are several features of the implementation which are operating system dependent. The most important of these is the fact that the networking software is being implemented in the UNIX kernel as a permanently resident system process, rather than a swappable user level process. This organization has several implications which bear on performance.
The most obvious effect is that since the networking software is always resident, it can more efficiently respond to network and user initiated events, as it is always available to service such events and need not be swapped in. In addition, residence in the kernel removes the burden of the use of potentially inefficient interprocess communication mechanisms, such as pipes and ports, since simpler data structures, such as globally available queues, can be used to transmit data between the network and user processes. Kernel provided services (e.g., timers and memory allocation) also become much easier and more efficient to use.

The large address space of the VAX makes this organization practical and allows the avoidance of expedients, like the NCP split kernel/user process implementation, that have been necessary in previous UNIX networking software on machines with limited address space, like the PDP 11/70. It is hoped that the kernel resident approach will contribute to the speed and efficiency of this TCP.

2.2.2 User Interface

Use of the "traditional" UNIX file oriented user interface is another operating system dependent feature of this implementation. The user will access the network software by means of standard system file I/O calls: open, close, read, and write. This entails modification of certain of these calls to accommodate the extra information needed to open and maintain a connection. In addition, the communication of exceptional conditions to the user (such as the foreign host going down) must also be accommodated by extension of the standard system calls. In the case of open, for example, use of the call's mode field will be extended to accommodate a pointer to a parameter structure. In the case of exceptional conditions, the return code for reads and writes will be used to signal the presence of such a condition, much like an error.
An additional status call (ioctl) will be provided for the user to determine detailed information about the nature of the condition, and the general status of the connection. In this way, the additional information needed to maintain network communications will be supported, while still allowing the use of the functionality that the UNIX file interface provides, such as the pipe mechanism.

In the initial versions, this interface will be the standard UNIX blocking I/O mechanism. Thus, outstanding reads for data which has not been accepted from the foreign host, and writes which exceed the buffering resources of a connection, will block. It is expected that the await/capacity mechanism, currently available for Version 6 systems, will be added to the VM/UNIX kernel in the near future. These non-blocking I/O modifications will be supported by the network software, relieving the blocking restriction.

3 Design Goals

Several design goals have been formulated for this implementation. Among these goals are efficiency and low operating system overhead, promoted by a kernel resident network process, which allows for reduced process and interprocess communication overhead.

Another goal of the implementation is to reduce the amount of extraneous data copying in handling network traffic. To achieve this, a buffer data structure has been adopted which has the following characteristics: intermediate size (128 bytes); low overhead (10 bytes of control information per buffer); and flexibility in data handling through the use of data offset and length fields, which reduce the amount of data copying required for operations like IP fragment reassembly and TCP sequence space manipulations.

The use of queueing between the various software levels has been limited in the implementation by processing incoming network data to the highest level possible as soon as possible.
Thus, an unfragmented message coming from the network is passed to the IP and TCP levels, with queueing taking place at the device driver only until the message has been fully read from the network. Similarly, on the output side, data transmission is only attempted when the software is reasonably certain that the data will be accepted by the network.

Finally, it is planned that the inclusion of the network software will entail relatively little modification of the basic kernel code beyond that provided by Berkeley. The only modifications to kernel code outside the network software will be slight changes to the file I/O system calls to support the user interface described above. In addition, an extension to the virtual page map data structure in low core will be necessary to support the memory allocation scheme, which makes use of the kernel's page frame allocation mechanisms.

4 Organization

4.1 Control Flow

4.1.1 Local Network Interface

The network software can be viewed as a kernel resident system process, much like the scheduler and page daemon of Berkeley VM/UNIX. This process is initiated as part of network initialization. A diagram of its control and data flow is shown in Figure 1.
[Figure 1 is a block diagram. Recoverable labels: on the input side, NETWORK -> LOCAL NET INPUT I/F -> input queue -> LOCAL NET INPUT -> IP INPUT -> TCP INPUT, with the interface signalling LOCAL NET INPUT (wakeup) after an input interrupt, and a frag queue at the IP level; in the center, the TCP MACH with its unack, snd, and rcv queues, a timer input, and a USER I/F between the TCP MACH and the USER; on the output side, TCP OUTPT -> IP OUTPT -> LOCAL NET OUTPT -> outpt queue -> LOCAL NET OUTPT I/F -> NETWORK, with an output interrupt at the interface. The span from LOCAL NET INPUT through TCP is marked "TCP PROCESS".]

Figure 1 . Network Software Organization

The main flow of control of the network process is an input loop which is activated (via wakeup) by the network interface device driver when an incoming message has been completely read from the network. (It can also be awakened by TCP user or timer events, described below.) The message is then taken from an input queue and dispatched on the basis of local network format (e.g., 1822 leader link number). ARPANET imp-host messages (RFNMs, incompletes, imp/host status) are handled at this level. For other types of messages, the local network level input handler calls higher level "message handlers." The "standard message handler" is the IP input routine.

Handlers for other protocols at this level (such as UNIX NCP) may be accommodated in either of two ways. First, a "raw message" service is available which simply queues data on specified links to/from the local network.
By reading or writing on a connection opened for this service, a user process may handle its own higher level protocol communication. Alternatively, for frequently used protocols, a new handler may be defined in the kernel and called directly.

4.1.2 Internet Protocol

At the IP level, the fragment reassembly algorithm is executed. Unfragmented messages with valid IP leaders are passed to the higher level protocol handler in a manner similar to the lower level dispatch, but on the basis of IP protocol number. The "standard handler" is TCP. Another protocol handler interprets IP gateway-host flow control and routing update messages.

Fragmented messages are placed on a fragment reassembly queue, where incoming fragments are separated by source and destination address, protocol number, and IP identification field. For each "connection" (as defined by these fields), a linked list of fragments is maintained, tagged by fragment offset start and end byte numbers. As fragments are received, the proper list is found (or a new one is created), and the new fragment is merged in by comparing start and end byte numbers with those of fragments already on the list. Duplicate data is thrown away. A timer is associated with this queue, and incomplete messages which remain after timeout are dropped and their storage is freed. Completed messages are passed to the next level.

4.1.3 TCP Level

At the TCP level, incoming datagrams are processed via calls to a "TCP machine." This is the TCP itself, which is organized as a finite state machine whose states are roughly the various states of the protocol as defined in [4], and whose inputs include incoming data from the network, user open/close/read/write requests, and timer events. Input from the network is handled directly, passing through the above described levels. User requests and timer events are handled through a work queue.
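The fragment merging described in Section 4.1.2 can be illustrated with a small model. This is a minimal sketch in Python, not the kernel code: the names Fragment, FragmentList, and ReassemblyQueue are hypothetical, the host addresses in the key are placeholders, byte numbers are used directly as in the text, and the reassembly timer is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Fragment:
    start: int    # byte number of the first data byte within the datagram
    data: bytes

    @property
    def end(self) -> int:
        return self.start + len(self.data)    # one past the last byte


@dataclass
class FragmentList:
    frags: List[Fragment] = field(default_factory=list)  # sorted, non-overlapping
    total: Optional[int] = None   # datagram length, known once the last fragment arrives

    def merge(self, start: int, data: bytes, last: bool) -> None:
        """Keep only the bytes of the new fragment that are not already held."""
        end = start + len(data)
        if last:
            self.total = end
        pos, pieces = start, []
        for f in self.frags:
            if f.end <= pos:
                continue              # entirely before the uncovered part
            if f.start >= end:
                break                 # entirely after the new fragment
            if f.start > pos:         # gap before f is new data
                pieces.append(Fragment(pos, data[pos - start:f.start - start]))
            pos = max(pos, f.end)     # bytes overlapping f are duplicates
        if pos < end:
            pieces.append(Fragment(pos, data[pos - start:]))
        self.frags = sorted(self.frags + pieces, key=lambda f: f.start)

    def complete(self) -> bool:
        pos = 0
        for f in self.frags:
            if f.start != pos:
                return False          # hole in the byte sequence
            pos = f.end
        return self.total is not None and pos == self.total


Key = Tuple[str, str, int, int]   # (source, destination, protocol, identification)


class ReassemblyQueue:
    def __init__(self) -> None:
        self.lists: Dict[Key, FragmentList] = {}

    def receive(self, key: Key, start: int, data: bytes,
                last: bool = False) -> Optional[bytes]:
        """Merge one fragment; return the whole datagram once it is complete."""
        fl = self.lists.setdefault(key, FragmentList())
        fl.merge(start, data, last)
        if not fl.complete():
            return None
        del self.lists[key]           # storage is released on completion
        return b"".join(f.data for f in fl.frags)
```

Fragments may arrive in any order and may overlap; only the bytes not already on the list are retained, which corresponds to the statement that duplicate data is thrown away.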
When a user process executes a network request via system call, the relevant data (on a read or write) is copied from user to kernel space (or vice versa), a work entry is enqueued, and the network process is awakened. Similarly, when timers associated with TCP (such as the retransmission timer) go off, timer work requests are enqueued and the network input process is awakened. Once awakened, it checks for the presence of completed messages from the network interface and processes them. After these inputs are processed, the TCP machine is called to handle any outstanding requests on the work queue. The network process then sleeps, waiting for more network input or work requests. Thus, the TCP machine may be called directly with network input, or awakened indirectly to check its work queue for user and timer requests.

After reset processing and sequence and acknowledgement number validation, acceptable received data is sequenced and placed on the receive queue. This sequencing process is similar to the IP fragment reassembly algorithm described above. Data placed on this queue is acknowledged to the foreign host. Received data whose sequence numbers lie outside the current receive window are not processed, but are placed on an unacknowledged message queue. The advertised receive window is determined on the basis of the remaining amount of buffering allocated to the connection (see below). When buffering becomes available, data on the unacknowledged message queue is then processed and placed on the receive data queue.

On the output side, TCP requests for data transmission result in calls to the IP level output routine. This routine does fragmentation, if necessary, and makes calls on the local network output routine. Outgoing messages are then placed on a buffering queue, for transmission to the network interface by the device driver.
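The wakeup cycle of the network process described above (completed network input first, then the work queue, then sleep) can be modelled roughly as follows. This is only an illustrative sketch: NetworkProcess and its method names are hypothetical, the two handlers are stand-ins passed in by the caller, and sleeping is reduced to clearing a flag.

```python
from collections import deque


class NetworkProcess:
    """Toy model of one pass of the kernel resident network process."""

    def __init__(self, message_handler, tcp_machine):
        self.input_queue = deque()    # completed messages from the device driver
        self.work_queue = deque()     # user and timer requests for the TCP machine
        self.message_handler = message_handler
        self.tcp_machine = tcp_machine
        self.awake = False

    def driver_input_complete(self, message):
        # Called when a message has been fully read from the network.
        self.input_queue.append(message)
        self.awake = True

    def post_work(self, kind, connection, data=None):
        # Called on user system calls and on timer expiry.
        self.work_queue.append((kind, connection, data))
        self.awake = True

    def run_once(self):
        # 1. Process completed messages from the network interface.
        while self.input_queue:
            self.message_handler(self.input_queue.popleft())
        # 2. Let the TCP machine handle outstanding work requests.
        while self.work_queue:
            self.tcp_machine(*self.work_queue.popleft())
        # 3. Nothing is pending; the real process would sleep here.
        self.awake = False


log = []
net = NetworkProcess(lambda msg: log.append(("net", msg)),
                     lambda kind, conn, data: log.append((kind, conn)))
net.post_work("user-write", 1, b"hello")
net.post_work("timer-rexmt", 1)
net.driver_input_complete(b"segment")
net.run_once()
print(log)
```

In this model, network input is always drained before the work queue, regardless of the order in which the events were posted.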
In data transmission, an attempt is made to ensure that data moving from the highest level (TCP) will not be sent unless there is reasonable certainty that the lower levels will have the necessary resources to accept the message for transmission to the network.

All data to be sent is maintained on a single send queue, where data is added on user writes, and removed when proper acknowledgement is received. Whenever the TCP machine sends data, a retransmission timer is set, and the sequence number of the first data byte on the queue is saved. After initial transmission, the sequence number of the next data to send is advanced beyond what was first sent. If the retransmission timer goes off before that data is acknowledged, the sequence number of the next data to send is backed up, and the contents of the send buffer (for the length determined by the current send window) are retransmitted, with the ACK and window fields set appropriately. As long as the saved sequence number does not advance, the retransmission timer is set with increasingly higher values, from 3 to 30 seconds. A persistence timer is also set when data is sent. This allows communication to be maintained if the foreign process advertises a zero length window. When the persistence timer goes off, one byte of data is forced out of the TCP.

4.2 Buffering Strategy

As mentioned earlier, all data is passed from the network to the various protocol software layers in intermediate sized (128 byte) buffers. The buffers have two chain pointers, a data offset, and a data length field (see Figure 2). As data is read from the network or copied from the user, multiple buffers are chained together. Protocol headers are also held in these buffers. As messages are passed between the various software levels, the offset is modified to point at the appropriate header. The length field gives the end of data in a particular buffer.
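The way the offset and length fields avoid copying can be shown with a small model of a buffer chain. This is a sketch under stated assumptions rather than the kernel structure itself: the names NetBuf, chain_from_bytes, and chain_bytes are hypothetical; the 10 control bytes are assumed to be counted within the 128-byte buffer, leaving 118 bytes of data; and the length field is taken to be the count of valid bytes starting at the offset.

```python
BUF_BYTES = 128                            # total buffer size (from the text)
CONTROL_BYTES = 10                         # control information per buffer (from the text)
DATA_BYTES = BUF_BYTES - CONTROL_BYTES     # assumption: 118 bytes of data area


class NetBuf:
    def __init__(self, payload=b""):
        assert len(payload) <= DATA_BYTES
        self.next = None                   # next buffer of the same message
        self.qlink = None                  # link to the next message on a queue
        self.storage = bytearray(DATA_BYTES)
        self.storage[:len(payload)] = payload
        self.offset = 0                    # where valid data begins
        self.length = len(payload)         # number of valid bytes

    def data(self):
        return bytes(self.storage[self.offset:self.offset + self.length])

    def strip(self, n):
        """Skip n leading bytes (e.g. a header) without moving any data."""
        assert n <= self.length
        self.offset += n
        self.length -= n


def chain_from_bytes(data):
    """Spread data over as many chained buffers as needed."""
    head = tail = None
    for i in range(0, len(data), DATA_BYTES):
        buf = NetBuf(data[i:i + DATA_BYTES])
        if head is None:
            head = buf
        else:
            tail.next = buf
        tail = buf
    return head


def chain_bytes(head):
    """Collect the valid bytes of every buffer in the chain."""
    out = bytearray()
    buf = head
    while buf is not None:
        out += buf.data()
        buf = buf.next
    return bytes(out)


message = bytes(range(256)) + bytes(44)    # a 300-byte message
head = chain_from_bytes(message)
head.strip(20)                             # pass the message up past a 20-byte header
print(len(chain_bytes(head)))
```

Passing a message to the next level changes only two small fields in the first buffer; the stored bytes, including the header, stay where they were read in.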
The offset/length pair facilitates merging of messages in IP fragment reassembly and TCP sequencing.

The allocation of these buffers is handled by the network software. Buffers are obtained by "stealing" page frames from the kernel's free memory map (CMAP). In VM/UNIX, these page frames are 1024 bytes long, and thus have room for eight 128 byte buffers. The advantage of using kernel paging memory as a source of network buffers is that their allocation can be done totally dynamically, with little effect on the operation of the overall system. Buffers are allocated from a cache of free page frames, maintained on a circular free list by the network memory allocator. As the demand for buffers increases, new page frames are stolen from the paging freelist and added to the network buffer cache. Similarly, as the need for pages decreases, free pages are returned to the system. To minimize fragmentation in buffer allocation within the page frames, the free list is sorted. When no more pages are available for allocation, data on the IP reassembly and TCP unacknowledged data queues are dropped, and their buffers are recycled.

      ^  |------------------------|  ^
      |  |    -> NEXT BUFFER      |  |
     10  |------------------------|  |
   BYTES |      QUEUE LINK        |  |
      |  |-----------|------------|  |
      V  |  OFFSET   |   LENGTH   |  |
         |-----------|------------|  |
         |                        | 128
         |                        | BYTES
         |                        |  |
         |        D A T A         |  |
         |                        |  |
         |                        |  |
         |                        |  |
         |                        |  |
         |------------------------|  V

Figure 2 . Layout of a Network Buffer

The number of pages that can be stolen from the system is limited to a moderate number (in practice 64-256, depending on network utilization in a particular system). To enforce fairness of network resource utilization between connections, the number of buffers that can be dedicated to a particular connection at any time is limited. This limit can be varied to some small degree by the user when a connection is opened.
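The page-frame arithmetic and the grow/shrink behavior of the buffer cache described above can be sketched as follows. This is a simplified model, not the allocator itself: BufferCache and pool_capacity are hypothetical names, page frames are represented by small integers, and the sorted circular free list is reduced to "take the lowest free slot in the lowest page."

```python
PAGE_FRAME_BYTES = 1024
BUF_BYTES = 128
BUFS_PER_PAGE = PAGE_FRAME_BYTES // BUF_BYTES    # eight buffers per page frame


def pool_capacity(pages):
    """Return (buffers, bytes) available from a given number of stolen pages."""
    bufs = pages * BUFS_PER_PAGE
    return bufs, bufs * BUF_BYTES


class BufferCache:
    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.pages = {}        # page number -> set of free slots in that page
        self.next_page = 0

    def alloc(self):
        # Prefer a free slot in a page already held, lowest page first,
        # so that allocations stay packed and whole pages can be returned.
        for page in sorted(self.pages):
            free = self.pages[page]
            if free:
                slot = min(free)
                free.remove(slot)
                return (page, slot)
        if len(self.pages) >= self.max_pages:
            return None        # caller would recycle reassembly/unacknowledged data
        page = self.next_page  # "steal" a new page frame from the paging system
        self.next_page += 1
        self.pages[page] = set(range(1, BUFS_PER_PAGE))
        return (page, 0)

    def free(self, buf):
        page, slot = buf
        self.pages[page].add(slot)
        if len(self.pages[page]) == BUFS_PER_PAGE:
            del self.pages[page]   # whole page unused: return it to the system


for pages in (64, 256):
    print(pages, pool_capacity(pages))
```

With the limits quoted in the text, 64 pages give 512 buffers (64K bytes) and 256 pages give 2048 buffers (256K bytes), counting the control information in each buffer.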
For example, a TELNET user may open a connection with the minimum 1K bytes of send and receive buffering, while an FTP user, anticipating larger transfers, might desire up to 4K of buffering. The effect of this connection buffering allocation is to place a limit on the amount of data that the TCP may accept from the user for sending before blocking, and the amount of input from the network that the TCP may acknowledge. Note that in receiving, the network software may allocate available buffers beyond the user's connection limit for incoming data. However, this data is considered volatile, and may be dropped when buffer demands go higher. Incoming data is acknowledged by TCP only until the user's connection buffer limit is exhausted. The advertised TCP flow control window for a connection is set on the basis of the remaining amount of this buffering.

Thus, the network software must ensure that it has enough buffering for 1) its own internal use in processing data on the IP and local network levels; 2) retaining acknowledged TCP data that have not been copied to user space; and 3) retaining data accepted by the TCP for transmission which have not yet been acknowledged by the foreign host TCP. Other data, such as unacknowledged TCP input from the network and fragments on the IP reassembly queue, are vulnerable to being dropped when demand for more buffers makes necessary the recycling of buffers on these queues. Since there is an absolute limit on the number of page frames that may be stolen from the paging system, and hence on the total number of buffers available, there is a resultant limit on the total number of simultaneous connections.

Several data structures are required for stealing page frames from the kernel and maintaining the buffer free list.
These include enough page table entries for mapping the maximum number of page frames which can be stolen from the system, an allocation map for allocating these page table entries, and the free page list itself. For a 256 page maximum, this requires 2K bytes of page tables, 1K bytes for page frame allocation mapping, and another 1K bytes for the network freelist. The maximum page parameter and others, including the minimum and maximum amount of buffering that the user may specify, are modifiable constants of the implementation.

4.3 Data Structures

Along with the data structures needed to support the buffer management system, there are several others used in the network software (see Figure 3). The focus of activity is the user connection block (UCB), and the TCP control block (TCB). The UCB is allocated from a table on a per connection basis. It holds non-protocol specific information to maintain a connection. This includes a pointer to the UNIX process structure of the opener of a connection, (2) a pointer to the foreign host entry for the peer process's host, a pointer to the protocol-specific connection control block (for TCP, the TCB), pointers to the user's send and receive data buffer chain, and miscellaneous flags and status information. When a network connection is opened, an entry in the user's open file table is allocated, which holds a pointer to the UCB.

_______________
(2) For details on data structures specific to UNIX, see [5].

For TCP connections, a TCB is allocated. All TCBs are chained together to facilitate buffer recycling. The TCB contains a pointer to the corresponding UCB, a block of sequence number variables and state flags used by the TCP finite state machine, pointers to the various TCP data queues, and flags and state variables. Protocols other than TCP would have their own control blocks instead of the TCB. For the "raw" local network and IP handlers, all necessary information is kept in the UCB.

[Figure 3 is a diagram of four linked tables. Recoverable fields: Per User File Table entry (Flags, ->UCB); Network Conn Table entry, i.e. the UCB (->Proc, ->Host, ->TCB, ->S Buf, ->R Buf, Flags and Status); Foreign Host Table entry (Host Adr, #RFNM, Status); TCB (->next, ->UCB, FSM Sequence Vars, ->Snd Q, ->Rcv Q, ->UnackQ, Flags and Status). The file table entry points to the UCB, the UCB points to its foreign host entry and its TCB, and the TCB points back to the UCB.]

Figure 3 . Network Data Structures

Finally, there is a foreign host table, where entries are allocated for each host that is part of a connection. The entry contains the foreign host's internet address, the number of outstanding RFNMs for 1822 level host-imp communication, and the status of the foreign host. Entries in this table are hashed on the foreign host address.

5 References

[1] Babaoglu, O., W. Joy, and J. Porcar, "Design and Implementation of the Berkeley Virtual Memory Extensions to the UNIX Operating System," Computer Science Division, Dept. of Electrical Engineering and Computer Science, University of California, Berkeley, December, 1979.

[2] Bolt Beranek and Newman, "Specification for the Interconnection of a Host and an IMP," Bolt Beranek and Newman Inc., Report No. 1822, May 1978 (Revised).

[3] Postel, J. (ed.), "DoD Standard Internet Protocol," Defense Advanced Research Projects Agency, Information Processing Techniques Office, RFC 760, IEN 128, January, 1980.
[4] Postel, J. (ed.), "DoD Standard Transmission Control Protocol," Defense Advanced Research Projects Agency, Information Processing Techniques Office, RFC 761, IEN 129, January, 1980.

[5] Thompson, K., "UNIX Implementation," The Bell System Technical Journal, 57 (6), July-August, 1978, pp. 1931-1946.

Table of Contents

1 Introduction
2 Features of the Implementation
2.1 Protocol Dependent Features
2.1.1 Separation of Protocol Layers
2.1.2 Protocol Functions
2.2 Operating System Dependent Features
2.2.1 Kernel Resident Networking Software
2.2.2 User Interface
3 Design Goals
4 Organization
4.1 Control Flow
4.1.1 Local Network Interface
4.1.2 Internet Protocol
4.1.3 TCP Level
4.2 Buffering Strategy
4.3 Data Structures
5 References