SMP_Paper
-< TOPS-10/20 discussions held here >-
================================================================================
Note 286.2 History of DEC's 36-bit Machines 2 of 3
VINO::FLEMMING "Have XDELTA, will travel" 13 lines 27-DEC-1991 06:58
-< TOPS-10 SMP >-
--------------------------------------------------------------------------------
The next note is the chapter that Allan Wilson - an old-time 36-bitter
now with Intergraph - and I wrote for the "book" Dan mentioned. There are 8 or
10 figures referenced in the text. The folks producing the "book" said they
would provide a graphic artist to do the pictures so they never existed in
soft copy form. I of course couldn't conveniently post them here even if they
did. For the most part, they just depicted various MP configurations so it
doesn't take much imagination to figure out what they were about. The only
possible exceptions to this are the figures depicting the flow of input and
output but unless someone were insane enough to want to try to do this again,
the general idea of how it worked should be more or less obvious from the
text.
JMF
================================================================================
Note 286.3 History of DEC's 36-bit Machines 3 of 3
VINO::FLEMMING "Have XDELTA, will travel" 1736 lines 27-DEC-1991 06:59
--------------------------------------------------------------------------------
TOPS-10 Symmetric Multiprocessing
Allan B. Wilson and James M. Flemming
Copyright (c) 1989 - A. Wilson and J. Flemming
Copyright (c) 1991 - A. Wilson and J. Flemming
Permission for electronic forwarding and reposting hereby granted.
Rights to commercial and/or hardcopy publication reserved. Contact
the author at VINO::FLEMMING (flemming@vino.dec.com).
[It is with sadness I note the passing away of JMF February 1995 -smw]
- 1 -
TOPS-10 SYMMETRIC MULTIPROCESSING -- SMP
INTRODUCTION
In the latter half of the 1970s, Digital Equipment Corporation
began marketing two operating systems for its 36-bit family of
computers. TOPS-10 had evolved as the original internal product,
whose initial development began during the mid-1960s. TOPS-20
had its roots in an external product, TENEX, rights to which were
acquired by DEC to provide the software foundation for a new
system series.
Although TENEX had run on KA-10 and KI-10 processors at several
research and university sites, TOPS-20 was released as a product
only on the KL-10 processor. Therefore TOPS-10 and TOPS-20 share
the same underlying hardware architecture of the KL-10, including
the same basic instruction set. However, because the software
architectures of the two systems differ, the mechanisms and
details of operating-system calls are different. Furthermore,
each system has unique requirements for paging and virtual
memory. Both common and system-specific functions are
implemented in cpu microcode.
DEC introduced TOPS-20 to increase the size of its 36-bit
customer base by offering a powerful but easy-to-use product
which represented the latest in operating system technology.
TOPS-20 was first released as part of a low-end system, the 2040,
which was sold with only 64K words of memory. In contrast, most
TOPS-10 users already had very large systems (many of them
multiprocessors) and, though TOPS-20 had some very attractive
features, there was seldom justification for a TOPS-10 customer
to change to TOPS-20. Regardless of personal preference for
TOPS-10 or TOPS-20, and there were many people strongly in one
camp or the other, the TOPS-10 user-base was large, which meant
significant pressure on DEC for an on-going commitment to TOPS-
10. Better multiprocessing support was viewed favorably by both
internal and external TOPS-10 supporters: it was interesting
technically but, most important, would help sites with heavy
system loads until the planned successor to the KL cpu was
available.
The remainder of this chapter goes into historical and
implementation details of TOPS-10 multiprocessing, focusing on
the latest and most general version, Symmetric Multiprocessing,
or SMP. SMP was developed as the next logical extension of
TOPS-10, but at a time when TOPS-20 was being heavily emphasized
by DEC's Large Computer Group ("LCG"). It's probably accurate to
state that if Jim Flemming and Tony Wachs, long-time TOPS-10
developers and key designers and implementors of SMP, had not
been persistent with John Leng (LCG Vice President) and other
senior LCG managers, SMP development would never have been
allowed to proceed.
MULTIPROCESSING
Multiprocessing is a computer organization in which multiple
central processing units ("cpus") are interconnected.
Multiprocessing can be used to address several different system
objectives, for example, to provide higher system availability:
in case of a cpu failure, the other cpu(s) in the system continue
operating.
Multiprocessing is also used to achieve increased computing
power. A uniprocessor site could exchange its cpu for a newer,
faster model in the same family. If none is available, however,
the site could add another cpu of the same type as the existing
cpu, and still ensure hardware and software compatibility.
Certainly the uniprocessor site could acquire another complete
system, that is, one with its own cpu, memory, and peripherals.
This would offer the same advantages as a multiprocessing
configuration for commonality of hardware, spare parts, software,
and training. However, there would be economic inefficiencies
resulting from multiple copies of the operating system and
supporting software, and additional disk storage would be
required. Furthermore, multiple independent systems are more
difficult to administer. The user community must be divided
among systems, so load balancing becomes a problem, and all
systems must be kept current with software updates; a bug may be
fixed on one system, but missed on another. File system backup
also becomes a bigger problem. Additional support staff may be
required to deal with multiple independent systems.
There are two basic multiprocessing organizations: loosely
coupled and tightly coupled. In a loosely coupled system, two or
more cpus are connected by means of communications links or
high-speed buses, or share a common file system through special
hardware, as represented by Digital's VAXclusters (Figure 1).
In a tightly coupled multiprocessing system (Figure 2), there is
a single shared memory, only one copy of the operating system and
supporting software, and a single file system.
Multiprocessing software determines the degree to which cpus
share the workload. Newer architectures strive for high
parallelism, so that a single job or program can be worked on by
multiple processors to reduce its overall execution time. Since
multiuser (timesharing) systems have multiple jobs (programs or
processes) competing for computing resources, the traditional
multiprocessing approach does not deal with parallelism within an
individual job, but instead simply has each cpu work on a
different job. While this doesn't help a single job run faster
even though there are multiple cpus available (and in fact if
there is only one job to run, all cpus except one stay idle),
overall system throughput and availability are improved.
MASTER/SLAVE MULTIPROCESSING
All TOPS-10 multiprocessing is tightly coupled. The first
multiprocessing implementation for TOPS-10 used an approach known
as master/slave, and was offered for the first time in the early
1970s on KA-10 and KI-10 processors, the 1055 and 1077 systems,
respectively. When the KL-10 was introduced, master/slave
multiprocessing was also available in the 1088.
TOPS-10 multiprocessing was developed to meet customer demand.
The KA-10 (the first PDP-10 processor, which followed the earlier
PDP-6) simply couldn't satisfy the computational needs of some
TOPS-10 customers. First, DEC's Peter Hurley worked with MIT to
create a master/slave system with a PDP-6 and a KA-10; MIT had
just added the KA to their existing PDP-6 system, and wanted to
use the KA to provide additional computing power. Once that
system was running, Peter modified the code to allow KAs as both
master and slave, then set about turning it into an official
product to satisfy commitments to the University of Pittsburgh.
The resulting product was very successful, because the
master/slave implementation matched the university's workload
well -- a heavy interactive load, but with sufficient compute-
bound jobs to keep the slave busy; the University of Pittsburgh
and DEC worked closely together improving multiprocessing
performance (and TOPS-10 in general) through KI and KL cpu
releases.
In TOPS-10 master/slave multiprocessing (Figure 3), there are two
cpus. The master is the general-purpose cpu: it does both
computation and all system input/output ("i/o"). The slave has
no i/o devices except a console terminal and is therefore
dedicated to computation.
In the TOPS-10 master/slave organization, the slave does not take
orders from the master. Both cpus execute the TOPS-10 scheduling
routine to find a job to run; there is code to prevent both cpus
from selecting the same job. Jobs are processed by the master
just as in a uniprocessor configuration. The slave behaves the
same way for user-mode computation and some non-i/o monitor
calls. When
the job the slave is running requires i/o, however, the slave
cannot proceed. It marks the job as needing the master's
attention, then enters the scheduler and selects another job to
run. The master, in the meantime, is working on other jobs, and
when it schedules again, will find and run jobs marked as "run-
on-master" by the slave. Thus the slave is a slave by virtue of
the fact that it cannot perform all system duties and therefore
relies on the master for many services.
If there are compute-bound jobs to work on, performance of a
master/slave system can be good. As long as jobs do few master-
only monitor calls, the slave can stay busy running them.
Availability of a master/slave system should also be good. If
the slave fails, only some cpu power is lost. When the master
fails, devices can be switched to the former slave, which is then
restarted as a conventional uniprocessor system; therefore system
services are interrupted only briefly. In either case, work
continues to be processed, and corrective maintenance can be
scheduled for a convenient time.
THE PROBLEM WITH MASTER/SLAVE MULTIPROCESSING
Performance of i/o devices is limited by mechanical constraints
-- the time it takes to move a disk head across the recording
surface ("seeking"), or waiting for the desired data to come
under the head once it's on cylinder ("latency"), for example.
Caching of disk data is a common technique to avoid some disk i/o
altogether; read-ahead for sequential input is often done to
reduce disk i/o transfer setup as well as seek and latency times.
Nevertheless, even with advances in peripheral performance and
software optimizations, i/o speeds remain relatively constant
while cpu speeds increase. Therefore newer computer systems tend
to be more i/o bound. For example, if a program does i/o every
10,000 instructions, a faster cpu will execute the 10,000
instructions more quickly and reach the i/o requests sooner than
a slower cpu. Thus the faster cpu spends a larger percentage of
its time waiting for i/o, and therefore the definitions of
compute-bound and i/o-bound change with the introduction of
faster cpus. This means that job mixes well-suited to
master/slave processing on older cpus are not necessarily well
balanced on later cpu models, since master/slave efficiency
depends on having adequate compute-bound jobs in the mix to work
on. Therefore, the underlying problem with master/slave
multiprocessing is its asymmetry with respect to i/o -- the mix
has to be just right, or the activities of the slave may actually
hurt system performance, reducing throughput by the additional
overhead of memory contention, waiting for code interlocks within
the monitor, and extra context switching. The performance factor
which characterized a successful master/slave system was
typically about 1.8 times a uniprocessor configuration, though as
described, the job mix, which is usually very dynamic, determined
actual performance.
SYMMETRIC MULTIPROCESSING (SMP)
The 7.01 release of TOPS-10 eliminated master/slave restrictions
by reorganization to Symmetric Multiprocessing, extending full
functional capabilities to all cpus. (SMP officially supported
only two processors at initial release, but the SMP software was
designed and written to support up to six.) Notice (Figures 3,
4) that SMP hardware configurations are quite similar to
master/slave configurations except that with SMP i/o devices can
be connected to any cpu, and, with availability in mind, dual-
ported disks connected to two cpus may be desirable. Memory is
still shared between processors, and there is still a single copy
of the operating system. In SMP, however, the entire monitor is
reentrant and all monitor calls can be executed on all cpus.
SMP -- AN EXAMPLE: ORNL
Oak Ridge National Laboratory ("ORNL") started using TOPS-10 on a
KA-10 in 1971. The system was acquired for general timesharing
and linear accelerator data reduction (but not real-time data
acquisition). To meet their growing computational workload, ORNL
acquired a KL-10 in 1977. (The KA-10 was kept for the data
reduction work, and when it was finally retired in 1985 the
service meter showed 128,000 hours -- 14.6 years!)
In late 1977 ORNL started using the (now CompuServe Data
Technologies) System 1022 database system. Because of user
enthusiasm for the package, system demand increased dramatically,
and ORNL acquired a second KL-10 processor in 1979. They began
running SMP in 1980, the same year they acquired a third KL-10.
The ORNL system then became the largest TOPS-10 SMP configuration
documented, with 5 cpus; the fourth and fifth KL-10s were
acquired in 1982 and 1984, respectively. The ORNL system
configuration diagram is shown in Figure 5.
At peak, there would be about 200 jobs logged in, of which about
25 were system jobs; system demand was such that local software
was written to queue users if all available login slots were in
use. A typical day would see around 5000 logins and logouts.
Fortran was the dominant language from the beginning (including
many 1022 database applications written in Fortran), followed by
Cobol.
The ORNL SMP installation continues to be very successful and
productive.
INTERLOCKS
When multiple cpus are executing the same code, interlocks are
necessary to guarantee exclusive access to critical sections of
the code; a critical section is any sequence of instructions
which must be executed atomically, that is, as an uninterrupted
unit. A section of code is critical only if global data are
being changed and an inconsistency could result if more than one
processor tried to change the same data simultaneously. As an
example, assume that the system maintains a linked-list of free
memory pages. If two cpus each need to allocate a page of memory
both will consult the free-list to find an available page.
Without an interlock, both cpus could determine that the first
page in the list is free and both allocate the page for
themselves. An interlock ensures that such accesses are
correctly coordinated.
The lowest level where atomic execution is required is at the
machine instruction-set level. Certain instructions that read
data from memory, modify the data, and write the modified data
back to memory must be uninterruptible, that is, the
read/modify/write cycle must be atomic. Given such instructions,
they can be used to "lock" and "unlock" critical sections of
code. The SMP lock/unlock mechanism is an implementation of
semaphores, with appropriate queueing of requests when a resource
is busy.
Granularity of locking is important -- locking larger sections of
code means that other cpus are more likely to collide trying to
gain access to a resource. The smaller the section of code that
is locked, the lower the contention for that section and the
better the overall system performance. In SMP, the code was
instrumented for performance measurement to ensure proper locking
granularity.
SMP LOCKS: A CLOSER LOOK
In a uniprocessor configuration, the processor's interrupt system
can be used as a synchronization mechanism: interrupts can be
turned off -- as a group or selectively -- to prevent routines at
other interrupt levels from modifying shared data.
Many of the synchronization mechanisms employed in uniprocessor
systems simply require generalization to provide synchronization
in a multiprocessor system. However, caution must be exercised
since locks which are not a bottleneck in uniprocessor systems
may cause serious performance problems in a multiprocessing
environment (perhaps due to lock granularity). A few examples of
locks that normally exist in a uniprocessor operating system and
how they can be generalized for multiprocessing will serve to
illustrate SMP synchronization techniques.
Before describing these mechanisms, it is necessary to introduce
a type of lock not usually encountered in a uniprocessor system.
This lock is known as a "spin lock". It is normally obtained and
held only for a short time relative to a context switch (or when
a context switch is not possible). A spin lock is implemented as
follows:
1. Using a read/modify/write instruction, attempt to test-and-
set a memory location (the lock) associated with the
data/critical section to be accessed.
2. If the test indicates that the set succeeded, proceed to
access/modify the data associated with the lock.
3. If the test indicates that the set failed, that is, the lock
is already owned by another processor, then loop back and execute
the test-and-set until the set succeeds and the lock is obtained.
This lock is known as a spin lock because the processor "spins"
in a tight loop waiting to acquire the lock.
Since the processor owning a lock should spend little time in the
critical section, a spin lock should not be too expensive in lost
time for a cpu that has to wait for the lock. Unfortunately,
however, a test-and-set instruction may consume significant
memory bandwidth since the read/modify/write operation locks out
all other processors (including i/o channels) until it completes.
A useful variation of the "spin lock" mechanism is therefore to
first simply test the lock; if it is potentially available (not
currently set), then try to obtain it using the read/modify/write
test-and-set instruction. Thus the loop becomes:
1. Test the lock for availability.
2. If potentially available, try to obtain it using the test-
and-set instruction.
3. If obtained, proceed.
4. If not obtained, loop back to the potential-availability
test.
This modification eliminates the read/modify/write instruction
from the main loop and thus substantially reduces the demand on
memory bandwidth.
The most common, and usually most easily recognized, form of
synchronization that must occur in uniprocessor systems is
between process level and interrupt level when dealing with a
device and its associated data structures. This is normally
achieved by disabling interrupts from a device when dealing with
device-specific data at process level which could be modified at
interrupt level should a device interrupt occur. This usually
takes the form of the following sequence of instructions:
disable interrupts from the device
...
test and perhaps modify device data/state
...
start i/o (if appropriate)
...
ensure device data is in a consistent state
...
enable interrupts from the device
In a multiprocessing environment, the above sequence is
insufficient because the device must be prevented from
interrupting any processor. This can be achieved within the
above instruction sequence by introducing a spin lock. The lock
must be obtained after interrupts are disabled and released
before interrupts are reenabled; furthermore, the interrupt-level
processing code must also use the same spin lock before executing
interrupt-level code associated with the device. In summary,
disabling interrupts will keep the interrupt-level code from
being executed on the current processor while the device data is
being modified, and the spin lock will keep the interrupt-level
code from being executed on any other processor while the device
data is inconsistent.
The introduction of spin-lock synchronization in this situation
is simply a generalization of the synchronization that is already
necessary in a single processor system. Thus, a first step in
the implementation of a multiprocessor system is to identify
those places where the interrupt system is used to disallow
conflicting execution of code paths, and to augment these with
spin locks to prevent concurrent execution of the code paths by
multiple cpus.
Cpu scheduling is another area where spin locks can be applied.
Locking could be implemented implicitly rather than explicitly by
not allowing a processor to be rescheduled while in privileged
mode, except as the direct result of a call to the cpu scheduler.
If it happens to be true that the operating system can reschedule
at any time, the locks necessary to guarantee data consistency
will already exist and all that is required is that the locks be
generalized for multiprocessing. However, as is often the case,
and was the case in TOPS-10, data consistency on a single cpu
system was maintained by not allowing the cpu to be rescheduled
at any arbitrary point when it was executing in privileged mode,
but instead requiring an explicit call to the scheduler when data
structures were consistent and rescheduling was proper.
It is this implicit interlock on data structures and the
difficulty in maintaining consistency during concurrent execution
by multiple processors which has led to the master/slave
approach; the implicit interlock is maintained by simply
rescheduling the slave cpu whenever a job enters privileged mode.
This maintains the single-threaded execution of jobs in
privileged mode.
Allowing multiple cpus concurrent execution in privileged mode
requires that those areas where data is inconsistent be
identified and locked to bracket those critical sections.
Fortunately, this turns out to be simpler than it might initially
appear. Since a job cannot be run simultaneously on multiple
processors, internal process data need only be consistent when
rescheduling, and thus individual process data need not be
interlocked. Also, much data is local to an individual
processor, so such per-processor data likewise requires no
interlock. The
only data which must be interlocked are true system-global data.
An excellent example of such data is the scheduler database:
scanning, adding jobs to, and removing jobs from the run-queues
must be single-threaded and is therefore protected in SMP by the
scheduler spin lock.
Many other data structures fall into this category and usually
include system-wide linked-list manipulation, changing process
state with respect to the scheduler when the process isn't
running, allocation of system-wide resources, etc. Whether these
data structures are protected by spin locks or a scheduler
semaphore is usually determined by the length of time required to
access the particular data structure and, if modifying it, to
leave it in a consistent state. If the time is short compared to
the time required to do a context switch, a spin lock is probably
the best mechanism. If the time is longer but the likelihood of
collision on an interlock is low, a spin lock may still be the
best choice. Independent of the duration, if nothing can be done
without access to the critical section (again, the scheduler
database and the execution of an interrupt routine are good
examples) a spin lock is required. The best choice can usually
only be determined by performance analysis, so good lock
instrumentation is critical to achieving the best possible system
performance.
Mechanisms for managing other system resources which usually
require the long-term ownership of a lock, e.g., across i/o,
already exist in uniprocessor operating systems and usually don't
require modification for an SMP environment. Such mechanisms are
implemented by trying to obtain a lock and, if it's not
available, calling the scheduler rather than spinning. Once the
lock is obtained data access proceeds normally. When done, in
addition to relinquishing the lock, a check is made to see if any
jobs are waiting for it. If so, the scheduler is notified that
it should run a particular job, or allow all jobs waiting for the
lock to run to compete for it. As mentioned, these types of
locks don't usually require modification for a multiprocessing
environment (although the scheduler interlock may come into play
when making these resources visible to the scheduler when such a
lock is relinquished), but it may be advantageous to introduce
additional locks of this type when implementing symmetric
multiprocessing.
This type of lock would be introduced into the system where a
spin lock would be acceptable except for the fact that the
average waiting time to obtain the lock if it is not available is
potentially long compared to the time required to do a context
switch. Examples of these types of locks in TOPS-10 include the
CB resource (owned to search the file system's in-memory file
cache -- no i/o is done while owning this lock), and the DC
resource which is used to search the physical disk-block cache.
Neither of these resource locks is necessary in a uniprocessor
TOPS-10 system.
Finally, there are two additional locks in TOPS-10 that should be
mentioned. The first is the memory management lock, which is
essentially a hybrid of the locks described above. Its purpose
is to interlock the data structures used for memory management.
The memory management data structures are accessed in process
context for initial memory allocation, for memory expansion, and
for paging. They are accessed in system-context by the swapper
(which runs in system-context rather than in process-context or
as a user-mode process), and, perhaps even at interrupt level,
for physical memory fault recovery.
Since all of these accessors must see and leave the data in a
consistent state, and since the method of access depends on the
context of the access, the memory management lock is actually
dealt with in three different ways. If an attempt to obtain the
lock occurs in process context and the lock isn't available, the
scheduler is called and the job is put into a memory-management
resource wait state. If the swapper fails to obtain the lock, it
just tries again later. Finally, if the error recovery code must
obtain the lock and it's not available, it spins waiting for the
lock to be released. This means the memory management (MM) lock
is unlike sharable-resource locks (which are only owned by
processes) and spin locks (which are only owned by processors) in
that the MM lock may at various times be owned either by a process or
a processor.
The other unusual lock is a spin lock which, unlike the other
spin locks described earlier, is held for however long it is
needed. This is the lock associated with panic situations such
as bugchecks, memory parity errors, etc. This lock is
appropriately called the DIE lock even though the end result may
not be a reload if recovery is possible. Its main purpose is to
guarantee that the hardware and software are frozen in a state as
close as possible to the state at the time of the fault so that a
dump will yield the maximum possible information.
As already mentioned, proper lock instrumentation is necessary to
understand and perhaps improve system performance. In TOPS-10,
all spin locks have associated with them the current owner of the
lock. This is useful for debugging both software and hardware
failures and can be useful in studying performance. In addition,
many locks also maintain counts on how many times they were
obtained and how many times they "spun" waiting to be freed.
To study lock granularity, a user mode monitoring program was
developed to determine the propriety of lock collisions.
Basically the program noted where a spinning cpu tried to obtain
a lock and what the cpu which owned the lock was actually doing.
This made it possible to determine whether the spinning cpu
really needed the lock in question before proceeding. For
example, if the cpu owning the scheduler interlock was running
the scheduler, and the cpu spinning on the lock wanted to run the
scheduler, this was a proper collision. If on the other hand,
the cpu owning the scheduler interlock was allocating disk space
by updating a bit-table, and the cpu spinning on the lock wanted
to run the scheduler, this was an improper collision which could
be avoided by creating another lock for allocating from the bit-
table.
I'M OKAY, YOU'RE OKAY
In SMP, though all cpus are equal in that functionally they are
able to handle both computation and i/o, certain activities are
allocated to only a single cpu, known as the policy cpu. For
example, the policy cpu maintains the system date and time-of-
day, and system up-time. The cpu on which TOPS-10 is originally
booted automatically becomes the policy cpu, and thus is also
referred to as the boot cpu.
All cpus within SMP are checked periodically, so if a cpu goes
down, clean-up action is taken. In TOPS-10, many events take
place at "clock ticks", which occur at the main system power
frequency, either 50 or 60 times per second. One SMP clock-tick
event is cpu checking. Each cpu has a "keep-alive" counter,
which it reinitializes at each clock tick. A non-boot cpu sets
its keep-alive counter to -N, while the boot cpu initializes its
keep-alive word to -CPUN*N ("CPUN" is the number of cpus in the
configuration).
After reinitializing its own counter, a non-boot cpu increments
the boot cpu's keep-alive counter; the boot cpu increments the
keep-alive counters of all other cpus. If the resulting value of
a keep-alive counter is still negative, the corresponding cpu is
considered okay; if the count reaches zero, however, it means the
corresponding cpu wasn't able to reinitialize its counter
recently, and the cpu is declared dead. After a cpu is declared
dead, its keep-alive word continues to be incremented and is
therefore positive. A positive keep-alive word indicates that
devices connected only to that cpu are no longer accessible, and
programs set to run only on that cpu (by monitor call or console
command) can no longer be run; these problems are reported to the
system operator and user as appropriate.
If the keep-alive counter incremented to zero belongs to the
current boot cpu, the cpu which performed the increment to zero
assumes the role of boot cpu. This is called a "role switch",
and is accomplished by changing a few variables which determine
which cpu is currently the policy cpu. These include the SKPCPU
instruction, which skips when executing on the policy cpu, the
location of the system console terminal ("CTY"), and the variable
which contains the cpu number of the current policy cpu
("BOOTCP"). The fact that a role switch has occurred is reported
to the operator.
SCHEDULING
The TOPS-10 scheduler runs both periodically (at each clock tick)
and when the currently executing job is finished or temporarily
unable to continue (waiting for i/o completion, for example). In
selecting the next job to run, single-cpu TOPS-10 gives
designated special jobs and interactive jobs higher priority to
use the cpu than normal and compute-bound work. Basically, the
attempt is made to assign the cpu to high priority and
interactive jobs when they want to run; other jobs are run
round-robin in the background, that is, when there are no higher
priority jobs needing the cpu. Optionally, the system
administrator can partition background jobs into classes, and
allocate percentages of cpu time to individual classes.
If the scheduler selects the same job to run that was running
during the previous interval, the job can be resumed immediately.
Otherwise a context switch is necessary. Context switching
entails saving the status ("context") of the former job, then
setting up and starting the selected job. Context switching can
take from several microseconds to several milliseconds, depending
on hardware characteristics and the amount of context to save and
restore.
In SMP, multiple cpus execute the scheduling routine to find jobs
to run, which typically results in the same job being run at
different times by different cpus throughout the course of its
processing. It is possible, however, to assign an individual job
to execute exclusively on a particular cpu. This is known as
"cpu affinity", and is useful for running user-mode diagnostics
on a suspect cpu, or servicing a real-time device connected to a
particular cpu.
In general, the scheduling technique of multiple cpus working on
a single queue of jobs is efficient. Queueing theory shows that
multiple servers working from a single queue give better response
than multiple servers and multiple queues, which would be the
case in a loosely coupled multiprocessing organization or with
multiple independent systems, where each cpu has its own copy of
the operating system and thus its own scheduler and scheduler
queue. Due to its architecture, SMP offers automatic and dynamic
load balancing.
Scheduling in a multiprocessor environment offers additional
opportunities for trying to optimize overall system performance
and throughput. In SMP, the policy cpu processes the workload
using normal single-cpu scheduling priorities. Other cpus can do
the same, or schedule "asymmetrically", looking for other work
first. This way, interactive jobs are serviced by a non-policy
cpu only when there is no designated high-priority or background
work to do. This happens to be the default scheduling policy in
SMP as released and has the basic effect that one cpu works on
interactive jobs, while the others run compute-bound jobs. If
there are sufficient compute-bound jobs in the mix, the non-
policy cpus process them with little context switching overhead,
even if there is also a heavy interactive load. The disadvantage
is that if the mix is predominantly interactive, the non-policy
cpus waste time looking for compute-bound work before they get to
the interactive jobs.
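The default search order might be sketched as follows; the queue names and the symmetric/asymmetric switch are invented for the example:

```python
def pick_next_job(hipri, interactive, background, policy_cpu, symmetric=False):
    """Return the next job a cpu should run, or None to idle.

    The policy cpu (and any cpu scheduling symmetrically) uses the normal
    single-cpu priority order; an asymmetric non-policy cpu looks for
    high-priority and compute-bound background work before interactive jobs.
    """
    if policy_cpu or symmetric:
        order = [hipri, interactive, background]
    else:
        order = [hipri, background, interactive]
    for queue in order:
        if queue:
            return queue.pop(0)
    return None
```

With an interactive job and a compute-bound job both waiting, the policy cpu picks the interactive job while an asymmetric non-policy cpu picks the compute-bound one, which is the division of labor described above.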
It is possible for the system administrator to dynamically
specify symmetric or asymmetric scheduling to reflect current
operating demands. Having the system alter its own scheduling
based on mix characteristics was also considered.
An important aspect of scheduling in a multiprocessor environment
is inter-cpu interference. For example, if both cpus enter the
scheduling routine simultaneously they will compete for accesses
to instructions and data, and, potentially even more harmful,
cause each other to wait for various interlocks (such as the one
to prevent cpus from selecting the same job to run). While
studies show that most scheduling is done as a result of jobs
blocking for i/o or other events, and not because of clock tick
interrupts, SMP skews the clocks on all cpus to ensure that their
clock interrupts occur at different times. Each cpu has the same
frequency of timer interrupts, but none occur at the same time as
interrupts on other cpus. Therefore SMP prevents periodic
simultaneous scheduling and attendant overhead.
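A minimal sketch of the skewing, assuming a 60 Hz tick (the period and microsecond units are assumptions for the example):

```python
def skewed_tick_offsets(cpun, tick_period_us=16667):
    """Spread each cpu's clock interrupt evenly across one tick period
    (here assumed to be 60 Hz) so that every cpu ticks at the same
    frequency but no two tick at the same instant."""
    return [cpu * tick_period_us // cpun for cpu in range(cpun)]
```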
QUEUED I/O
In TOPS-10 SMP the cpu running a job is called the executing cpu;
the cpu connected to devices used by the job is called the owning
cpu. Because monitor calls can be executed by any cpu, the
executing cpu can continue to run a job that requests i/o on
devices connected to another cpu, as long as the job need not
wait for the i/o request to be satisfied. This is vital
to reduce context switching overhead; otherwise, the executing
cpu would need to reschedule and context switch away from the
job, and the owning cpu would have to context switch to the job
to process the monitor call and i/o, suffering the same overhead
as in a master/slave implementation when the slave encounters an
i/o request. A queued i/o mechanism was created to eliminate
this problem.
If a job requests i/o to devices on the executing cpu, the
request is entered into that cpu's device i/o queue immediately.
If a job requires i/o on a different cpu, then the executing cpu
makes a request of the owning cpu for action by making an entry
in a global i/o work queue. At its next clock tick the owning
cpu will scan the global i/o queue and move i/o requests for
itself to its private queues for normal processing. Once the
global i/o queue entry is made by the executing cpu, however, if
the job is still runnable, i.e., there are still buffers
available for the job to fill or empty, the executing cpu will
complete the monitor call and resume the job. Thus no context
switch is necessary and maximum advantage is provided by
symmetrical i/o support.
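The queued-i/o flow above might be sketched as follows; the queue structure and routine names are invented for the example (the real monitor's queues and interlocks are more elaborate):

```python
from collections import deque
from threading import Lock

# A shared global i/o work queue, protected by an interlock.
global_io_queue = deque()
global_io_lock = Lock()

def request_io(executing_cpu, owning_cpu, private_queues, request):
    """Issue an i/o request from the executing cpu."""
    if owning_cpu == executing_cpu:
        # Device is local: enter the request in our own queue at once.
        private_queues[executing_cpu].append(request)
    else:
        # Device is on another cpu: post an entry in the global queue.
        with global_io_lock:
            global_io_queue.append((owning_cpu, request))
    # Either way the executing cpu completes the monitor call and
    # resumes the job -- no context switch.

def scan_global_queue(cpu, private_queues):
    """At its clock tick, a cpu moves its own entries from the global
    queue to its private queue for normal processing."""
    with global_io_lock:
        mine = [(o, r) for (o, r) in global_io_queue if o == cpu]
        for entry in mine:
            global_io_queue.remove(entry)
    for _, req in mine:
        private_queues[cpu].append(req)
```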
Also implemented in SMP is i/o load balancing using dual ported
disks (Figure 6). Either a single cpu can be connected to both
ports of a disk, or two cpus can each access a port. Load
balancing is therefore dynamic and automatic on dual-ported
disks, and yields higher availability and throughput.
Multiporting disks was quite attractive during initial SMP
implementation because of available technology; however, later
hardware offered some alternatives. The KLIPA, officially known
as the CI20, is a physical port driver interface to the CI -- the
70-Megabit per second Computer Interconnect bus which, with the
HSC disk controller, is the hardware basis for VAXclusters.
Having a KLIPA on each cpu is like having a disk port per cpu.
Therefore, queued protocol is unnecessary for disk accesses in
this configuration, and there is no 8-ms average latency waiting
for a clock tick to start disk i/o on another cpu.
Magnetic tape i/o is queued under SMP, and dual-ported magtape
support across cpus is provided. Jobs accessing unit-record
equipment (e.g., card readers and line printers) have "device
affinity", that is, any unit-record i/o must be done while the
job is running on the owning cpu; therefore either such a job
must be assigned to the owning cpu, or context switching to the
owning cpu is necessary for unit-record i/o.
CACHE
Cache memory is commonly used in computer systems to provide
better overall system performance by improving effective memory
access times. To understand some of the complexities of SMP
implementation, a look at the KL-10 cache implementation is
necessary.
The memory controller portion of the KL-10 cpu (Figure 7)
coordinates memory access requests from the cpu and from
peripheral devices such as disks and magnetic tapes connected via
internal channels/controllers (RH20s). The standard KL-10 cache
is a 2048 (36-bit) word semiconductor memory that serves as a
buffer for primary memory. Read references to primary memory by
the cpu result in the memory controller checking to see if the
referenced words are in the cache. If so, the memory controller
supplies the cpu with the data directly from the cache; this
results in higher performance because the much slower primary
memory need not be accessed.
If the requested data are not in the cache, the memory controller
gets the data from primary memory, supplies them to the cpu, and
places them in the cache, where they will be found on subsequent
references. (Actually, accesses to primary memory and cache-
fills are normally done in "quad-words", so that the referenced
word and three adjacent words are fetched; thus speed of
sequential accesses to instructions and data is improved by pre-
loading cache locations.)
KL-10 memory write operations are done only in the cache, not in
both cache and primary memory as in a write-through cache
organization. The cpu has instructions to explicitly validate
memory with updated cache contents ("sweep the cache"); memory
validation is automatic if the memory controller needs in-use
cache locations for new data. The memory controller also deals
with memory accesses by the internal channels/controllers and
will supply updated data from the cache on "write-from-memory"
(output) transfers, and invalidate the data found in the cache on
input operations so that a subsequent read will fetch the data
from primary memory.
Typical program characteristics and system operation result in
using data in the cache 90%-95% of the time, thus improving
effective memory access times and thereby increasing system
throughput. Of equal importance in a multiprocessing
configuration, 90%-95% of cpu primary memory references are
avoided, reducing contention for primary memory and eliminating
many memory interference problems.
While the KL-10 cache organization is efficient with respect to
primary memory accesses and improved system performance, it does
cause two problems. The first and most significant is that when
a cpu runs a job or does anything which causes job data to be
modified in its cache, the cpu must sweep its cache to validate
primary memory before another cpu can run the same job.
Otherwise, the new cpu could use "stale" data in primary memory
because updated values would be in the other cpu's cache. The
new cpu cannot recognize the data as stale, and thus would not be
working with the job's proper context. Such operation is
incorrect and can result in subtle bugs that are difficult to
track down.
The second problem is important in terms of availability. It may
be the case that a cpu has modified data in its cache for several
different jobs before a sweep completes. If the cpu suffers a
failure before the sweep completes, other system cpus cannot
select any of these jobs since they are unrunnable with respect
to the cache. Therefore these jobs are effectively lost, and
must be manually restarted from the beginning or the last
checkpoint.
Thus, because of the KL-10 cache implementation, an SMP cpu
failure may cause from zero to several jobs to be lost.
Nevertheless, losing a few jobs is still preferable to losing 80
to 100 jobs, as would be the case if a loosely coupled or
completely independent system cpu failed.
A cache sweep serial number scheme is used to keep track of
primary memory currency with respect to cache. Every time a cpu
completes a sweep of its cache, it increments its cache sweep
serial number. When a job is stopped or requests i/o, the
current cpu and cache sweep serial numbers are saved. The system
can tell if a cache sweep has completed for a job by comparing
the current cache sweep serial number for the cpu with the saved
value. If the current cache sweep serial number for the cpu is
greater than the saved value, at least one sweep (and a single
one is sufficient) has completed. Thus the system can safely
manipulate the job, knowing that its image in primary memory is
up-to-date.
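The serial-number test can be sketched as follows; names are illustrative, not the monitor's actual symbols:

```python
class CpuCache:
    def __init__(self):
        self.sweep_serial = 0
    def sweep(self):
        # ...write all modified cache lines back to primary memory...
        self.sweep_serial += 1   # incremented each time a sweep completes

def save_sweep_state(cpu_cache):
    """Record the cpu's serial number when a job stops or requests i/o."""
    return cpu_cache.sweep_serial

def job_swept(cpu_cache, saved_serial):
    """True once at least one complete sweep has finished since the job
    last ran on this cpu, i.e., its primary-memory image is up to date."""
    return cpu_cache.sweep_serial > saved_serial
```

A job stopped at serial 5 becomes safe to run elsewhere as soon as the cpu's serial reaches 6; any single completed sweep suffices.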
If during a cpu's scheduling cycle, jobs are available to run but
cannot be run because cache has not been swept for them yet, the
scheduling cpu keeps track of this "cache lost time" as a cpu
operational statistic. High cache lost time is bad, because it
means that a cpu is available to run jobs but has to remain idle
since jobs cannot be run with respect to the cache. To minimize
cache lost time, a cpu can request that another cpu sweep its
cache, thereby updating primary memory with cache data from jobs
the other cpu has processed. Every scheduling cycle each cpu
honors any sweep requests from other cpus (one sweep suffices for
all requests). To further reduce cache lost time, if a cpu
selects a job to run and is context-switching from another job,
it starts a cache sweep so that the "old" job will become
runnable with respect to cache on other cpus when the sweep
completes (in about 250 microseconds).
The KL-10 cache implementation does permit cache to be enabled or
disabled for each page in memory. This facility allows the
monitor to "uncache", i.e., selectively disable cache for pages
containing certain monitor data for which sweeps would be
impractical. Terminal i/o buffers, per-cpu data (cpu cache sweep
serial number, cpu uptime, etc.), and lock databases are examples
of data stored in uncached pages. Accesses to such data are
relatively infrequent, so no large cpu speed or primary memory
access penalties are incurred.
CACHE MANAGEMENT FOR I/O
To overlap computation and input/output, TOPS-10 implements
producer/consumer buffering for i/o; Figure 8 shows the basic
buffer structure. If a device is inactive when a user program
makes an input monitor call, the monitor starts reading into the
buffer ring at the buffer specified by the buffer control block.
As soon as at least one buffer is full, the user program is
allowed to proceed.
The monitor continues reading into subsequent buffers, following
the ring structure, until it advances to a buffer which the user
program has yet to empty; at this point, the monitor marks the
device as inactive. Each time the user program does another
input, the buffer control block is simply advanced to the next
buffer; if the buffer is full, control returns to the user
program for processing.
Similarly, on an output monitor call, if the device is inactive,
the monitor starts writing from the buffer specified by the
buffer control block and control returns to the user program.
When an output interrupt occurs, the monitor advances to the next
buffer in the output ring and, if the user has filled that
buffer, the monitor continues writing. Each time the user
program does another output, the buffer control block is simply
advanced to the next buffer; if it is empty, control returns to
the user program to fill it.
Unless non-blocking i/o was specified, a user program is blocked
on input only when the next buffer is empty, i.e., has not been
filled by the monitor yet. On output, the user program is
blocked only if the next buffer that the user would fill still
contains data, i.e., the monitor has not yet emptied it. For an
input ring, the monitor makes the device inactive only when
the entire ring has been filled. For an output ring, the monitor
makes the device inactive when it advances to a buffer that the
user program has yet to fill. The coordination between the
monitor and the user program is accomplished by a bit in each
buffer itself. This bit is called the "use-bit" and is set by
the monitor whenever a buffer is available to the emptier. That
is, if the use-bit is on in an input buffer, it contains data
which the user program may process. If the use-bit is on in an
output buffer, the buffer contains data which the monitor must
write to the output device.
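The use-bit handshake on an input ring might be sketched as follows; the structure is simplified for the example (in the real monitor the use-bit lives in a header word of the buffer itself):

```python
class Buffer:
    def __init__(self):
        self.use_bit = False   # set when the buffer is available to the emptier
        self.data = None

def monitor_fill_input(buf, data):
    """Monitor reads device data into a buffer and hands it to the user."""
    buf.data = data
    buf.use_bit = True

def user_next_input(ring, pos):
    """Advance the user program to the next buffer in the ring.
    Returns (data, new_pos) if the buffer is full, or (None, pos)
    meaning the user program would block."""
    nxt = (pos + 1) % len(ring)
    buf = ring[nxt]
    if not buf.use_bit:
        return None, pos        # buffer empty: block until the monitor fills it
    buf.use_bit = False         # consumed; buffer is the monitor's again
    return buf.data, nxt
```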
This basic scheme is simple and optimizes i/o throughput as seen
by both the device and the user program. However, it becomes
more complicated in an SMP environment where the caches in each
cpu must be taken into account. Basically, there are two
problems which must be addressed: data integrity and
coordination of the use-bits among cpus. For example, if a user
program fills a buffer and does an output call on one cpu, the
data in that buffer will be in the cache of the cpu that the
program was running on. The data will not necessarily be current
in main memory, and if the device which the data is destined to
is owned by another cpu, the write operation cannot be initiated
until the cpu which the program is running on has swept its cache
(making the data visible in main memory to all cpus and their
attached i/o devices). Likewise, if the monitor filled an input
buffer or emptied an output buffer on a cpu and set the use-bit
appropriately, the use-bit would be in the cache of the cpu which
the operation was completed on and might not be current in main
memory.
Actually, the problem is even more serious. If the cpu the
program is running on examines the word containing the use-bit,
then there will be a valid copy of that word in its cache.
Therefore, even if the cpu which turns on the use-bit causes the
word containing the use-bit to be written into primary memory,
the cpu that the program was running on would use the copy from
its cache rather than refetching the word from memory.
Basically, the implication of these two aspects of the i/o
subsystem is that for proper data integrity, the data must be in
primary memory if the read or write operation will actually occur
on some cpu other than the one that the user program was running
on when a buffer was filled. Furthermore, the use-bits must be
visible from all cpus, i.e., they must be read/written from/to
primary memory and must not be valid only in an individual cpu's
cache.
As stated earlier, the KL-10's standard cache is organized in
four-word chunks (called "cache lines"). If the cache refill
algorithm (which is programmable) is specified correctly, when a
word is referenced, the other three words in the same cache line
are read from primary memory and placed in the cache. For
example, if a word were read from an address which ended in a 2,
adjacent word addresses ending in 0, 1, 2, and 3 would all be
read from primary memory and placed in the cache line. Since
there are 2048 words in the standard cache (512 lines), a
reference to a virtual address which is different from a virtual
address whose contents are already in the cache, but which has
the same low-order nine bits, will cause the data in the cache to
be displaced; in addition, the cache-line contents are written
back to primary memory before the cache-line is reused for the
new data if the cache-slot data is fresher than the corresponding
data in primary memory. This can be viewed as a one-line cache
sweep.
A routine called OUCHE implements this one-line sweep. Whenever
a use-bit is read from or written into a buffer, the monitor
calls OUCHE with the address of the word containing the use-bit.
OUCHE then uses the low-order nine bits of the virtual address of
the buffer to form a different cached virtual address which it
reads, thus forcing the word containing the use-bit out of its
cache and, if necessary, updating primary memory. Thus, the word
containing the use-bit will only remain valid, and perhaps
modified, in the cache of the cpu making the reference until
OUCHE is called. OUCHE is the mechanism used in SMP to solve the
use-bit problem.
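The aliasing trick can be sketched as follows, using the text's simplified description of a nine-bit line index; the alias base address is invented for the example (the real OUCHE must pick an address that is itself cached):

```python
CACHE_LINES = 512           # 2048 words / 4 words per line

def cache_index(addr):
    """Direct-mapped line index: the low-order nine bits, per the text."""
    return addr & (CACHE_LINES - 1)

def ouche_alias(use_bit_addr, alias_base=0o700000):
    """Form a different cached virtual address with the same cache index.
    Reading it displaces the word holding the use-bit from the cache,
    writing it back to primary memory if it was modified."""
    return alias_base | cache_index(use_bit_addr)
```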
Solving the data integrity problem requires another approach.
Consider a user program filling an output buffer and then doing
an output monitor call to cause the monitor to write the buffer
to the device. If the device is directly accessible by the cpu
where the user program is currently running, the output operation
can begin immediately since the i/o channel on that cpu will
fetch its data from that cpu's cache. However, if the device is
on a different cpu from where the program is running, the i/o
request must be queued for processing on the cpu that owns the
device. The data in the user program's buffer is not visible to
the channel connected to the owning cpu unless the executing cpu
first sweeps its cache.
The problem is even more complicated in the case of an input
operation. Once again, if the monitor call to begin the input
operation and the actual i/o both occur on the same cpu, there is
no problem since the i/o channel will write data read from the
device into primary memory and invalidate the cpu's cache if the
cache references an affected address. If, however, the i/o must
be done on another cpu, when the input operation is complete, the
executing cpu must sweep its cache to ensure that previous
references to the input buffer are invalidated so that data read
from the device will be fetched from primary memory when the user
program accesses the buffer.
The cache-sweep serial number scheme described earlier is used to
guarantee that data is visible to the device on whatever cpu owns
the device on output, and is visible to the user program on input
regardless of the executing cpu. Whenever a user program does an
output monitor call indicating it has filled an output buffer,
that buffer is said to have been swept when one complete cache
sweep has occurred on the executing cpu. When an output buffer
has been swept, it may be written to a device connected to any
cpu in the SMP system. Likewise, when an input buffer has been
filled by some cpu in the SMP system, it is said to have been
swept when the executing cpu has done a cache sweep after
completion of the input operation. The basic algorithms which
maintain the number of buffers which are correct with respect to
the cache, that is, have been swept, for both input and output
are shown in Figure 9.
CACHE AND DEBUGGING
Debugging can be very challenging in a multiprocessor
environment, especially when cache-related bugs are involved.
For example, one particular SMP bug which was very difficult to
track down would variously result in minor confusion, or complete
chaos. A variable ("USRHCU") was being interrogated at interrupt
level to determine how virtual-to-physical address translation
was to be done. During some monitor source-code edits, several
new variables were added to the data structure containing USRHCU.
As a result of these changes, USRHCU happened to wind up in the
same cache-line as the process program-counter ("USRPC", the
address of the next instruction to execute, which must be saved
when a process is suspended during a context switch).
Because of the standard KL-10 cache-fill mechanism wherein the
referenced word plus the words at the three adjacent addresses
are all loaded into the appropriate four-word cache-line, the
reference to USRHCU also caused an apparently valid copy of USRPC
to be loaded into the cache of the cpu making the interrupt level
access to USRHCU. Now, if the process was running on some other
cpu at the time of the reference, when that cpu context switched
away from the process, it would properly store the current
process program-counter ("PC") in USRPC. However, if the cpu
which made the interrupt-level reference to USRHCU then selected
the process to run, the PC at which the process would be resumed
was the stale copy of USRPC which was loaded into the cache by
the reference to USRHCU. Thus, the process would be resumed with
all of its proper context except the PC, which was set to an old
value. The resulting behavior came to be described as the PC
running backward.
If the correct PC was actually a user-mode PC, the result was
usually improper results from the user process, or the user
process just crashed. However, if the PC turned out to be an
exec-mode PC, nearly anything was possible. The fix for the
problem was simply redefining the data structures so that USRPC
and USRHCU could never fall in the same cache-line.
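The essence of the fix can be sketched as a layout check; the offsets are invented for the example:

```python
LINE_WORDS = 4   # a KL-10 cache line holds four 36-bit words

def cache_line(word_offset):
    """Which four-word cache line a given word offset falls in."""
    return word_offset // LINE_WORDS

def layout_ok(usrpc_offset, usrhcu_offset):
    """The fix: the two words must fall in different cache lines, so an
    interrupt-level reference to USRHCU can never cache-fill a stale
    copy of USRPC as a side effect."""
    return cache_line(usrpc_offset) != cache_line(usrhcu_offset)
```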
CPU DOORBELLS
Cpus in SMP communicate continually since a single copy of the
operating system is shared among all cpus. Accessing and
modifying global values such as job status information is a
common form of communication. Reading another cpu's cache sweep
serial number is a typical example of one cpu needing specific
information about another cpu. However, there is no cpu hardware
such as a "doorbell" for one cpu to interrupt another cpu or
otherwise get its immediate attention. The design and
implementation of SMP revealed that for scheduling or cache
management no such doorbell was necessary. In fact, a hardware
doorbell would only be useful during emergencies ("I'm dying" or
"get out of my way").
Rather than implement a hardware doorbell, a software doorbell
was chosen for SMP. The basic mechanism allows a cpu to ring
another cpu's doorbell, or the doorbells of all other cpus, on a
"significant event", such as cache sweep done (jobs can be run
with respect to cache), context switch from a runnable job (which
is eligible to be run by another cpu), i/o done (job can be run
because i/o request has been satisfied), and queued-i/o request
made (i/o to do for job run on another cpu).
A cpu has to "listen" for a doorbell; a doorbell will not
interrupt or otherwise disturb a cpu. The only time a cpu pays
attention to doorbells is when it is idle, that is, when it has
scheduled, found nothing to do, and is running the "null job"
until something happens to make work available.
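The mechanism might be sketched as a pending-event bitmask; the event names and bit assignments are invented for the example:

```python
# Significant-event bits (names invented for the example).
SWEEP_DONE, JOB_FREED, IO_DONE, IO_QUEUED = 1, 2, 4, 8

class Doorbell:
    def __init__(self):
        self.pending = 0

    def ring(self, event_bits):
        # Any cpu may set bits at any time; the owning cpu is never
        # interrupted by this.
        self.pending |= event_bits

    def listen(self):
        """Called only from the null job: collect and clear pending
        events, telling an idle cpu it is worth rescheduling."""
        events, self.pending = self.pending, 0
        return events
```

A busy cpu simply never calls listen(), so doorbells accumulate harmlessly until it goes idle.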
This software doorbell implementation is good in that a cpu is
not taken away from useful work with interruptions. A negative
aspect is that a cpu may look for work to do as the result of a
doorbell yet find nothing to do. While looking for work to do,
it holds interlocks and increases memory contention, and thereby
possibly interferes with other cpus that do have work to do.
HARDWARE FAILURES AND SYSTEM AVAILABILITY
Redundancy, or duplication of critical components, increases
availability by providing additional units to handle a particular
function should one unit fail: the failing unit can be repaired
or replaced while the backup units assume the workload.
Therefore there may be some performance degradation, but no
service outage. The master/slave system provides cpu redundancy
but requires additional operator action for switching peripherals
and reloading the monitor after certain failures.
SMP provides better inherent availability than master/slave. In
SMP, all devices can be duplicated and placed on multiple cpus.
Thus any device in an SMP system can fail ("single point
failure") and the system can still provide all critical functions
and services. With dual-ported disks, failure of a cpu, channel,
controller, or disk port will not prevent the system from
accessing data through the other path. Such operation is
automatic and the operator is notified of the failure so
corrective maintenance can be scheduled.
Memory parity errors are rare; when one occurs, the access is
automatically retried up to three times per word. A hard error,
that is, an access
unsuccessful on all retries, causes the associated job to be
stopped and an error message issued to the user if the access is
to a private page. A hard error in a shared page causes the
system to get a new copy from the disk area used for shared pages
and continue automatically. Parity errors within the monitor
itself are also handled. If the situation warrants it, the
operating system moves itself into better memory.
There is also a TOPS-10 Ethernet interface, known as the KLNI.
This option may be included with each SMP cpu for higher
availability. Ideally there would be one such interface active
per cpu, but because of the Ethernet specifications, only one
active interface per SMP system is allowed; the others are only
present for hot-standby failover. In SMP, if the active one
fails, or the cpu that the active one is connected to goes down,
the system automatically switches over to an alternate, and
notifies the operator.
Of course the operating system logs all system errors and
failures so that the system administrator and maintenance
personnel have a history of system operation and can detect
trends. Components configured into and out of the system are
also logged.
SMP DEVELOPMENT TIME
The original estimate for implementing SMP was six months, and
the basic "SMP part" of the project was actually working on the
KI-10 in somewhat less than six months. However, other issues
stretched the overall time to release to about two years. This
section summarizes some of those items and events.
Software development within DEC's Large Systems Group was always
heavily influenced by the needs of its existing customers.
Because the user base was made up of both commercial and
university sites (those that regularly had funds available for
computer equipment, as well as those that had to stay many years
with their original configurations), software compatibility
between old and new versions of the operating system and the
support of new features and functions across the entire family of
processors were always given high priority. Because of the
special requirements of SMP, several important decisions were
necessary.
The first was getting management agreement that SMP did not have
to run on the KA-10. The development team wanted SMP to be a
virtual memory system and to use memory mapping to take care of
addressing matters, such as referring to data in CDBs (Cpu Data
Blocks); the KA-10 addressed a CDB through an index register.
Once the decision was made to support KIs and KLs only, a major
reorganization of the entire cpu database and system address
space was undertaken.
A limitation in pre-SMP TOPS-10 was a maximum of 128 jobs. It
didn't seem proper to have massive cpu power available in SMP and
not be able to exploit it fully. Unfortunately, supporting more
jobs would require more job-related data structures and therefore
would consume more of the monitor's address space, which was
starting to get very crowded. To solve these problems, changes
were made to the way TOPS-10 managed its address space, and to
allow many data structures which previously were always memory
resident to be swapped to disk.
For the system to be truly symmetric, all devices had to be
supported on every cpu, which meant redesigning the device
databases and rewriting all device service routines to make them
reentrant; for completeness, real-time support also had to be
provided for all SMP cpus. As described earlier, queued protocol
was implemented for disks and tapes. It was also implemented for
terminals and networks; this involved a total rewrite of SCNSER,
the monitor module responsible for terminal support.
As described, cache support and debugging took a major effort,
and, though not technically difficult, it turned out to be hard
to make dual-ported disks work properly across cpus. Getting
proper locking granularity was not difficult, but was just one
more thing to work on.
Because SMP had to support many cpus and data channels, an
initial worry was that memory bandwidth might not be sufficient
to satisfy demand. The developers were prepared to implement
memory bandwidth scheduling, but fortunately that turned out to
be unnecessary. However, memory overruns and related problems
were troublesome enough to require the availability-test
described in the section on locks. With this change, no further
memory bandwidth problems were observed.
SMP developers believed that all these tasks were necessary to
provide basic SMP functionality and meet performance
requirements. The tasks were completed and the developers began
preparing to release the product, but then LCG management decided
that before shipment SMP had to be very maintainable and provide
very high availability. To meet these new goals, additional
development was undertaken.
All of the hardware error recovery algorithms were reexamined and
most were rewritten. Full and completely dynamic system
reconfiguration under software control was implemented. It was
then possible to take essentially any system component down for
preventive or corrective maintenance without interfering with
timesharing other than to degrade system performance. System
sleep was also implemented. This allows the operator to have the
system suspend itself and save complete system status to disk.
Then the operator can perform any special reconfiguration (such
as memory interleaving), and subsequently bring the system back
up where it left off.
DECnet was in its infancy during this time, but TOPS-10 already
had a comprehensive network providing task-to-task
communications, terminal concentration, and remote stations. To
meet the new availability requirements, network support was
redone to provide for complex topologies, dynamic
reconfiguration, and down-line loading. As an example of overall
system availability, there was no interruption in service for a
terminal connected to the front-end PDP-11 of a cpu that failed
as long as a network link existed to a PDP-11 front-end on
another cpu that was still running.
Finally, delays resulted from the normal process of releasing any
complex hardware/software combination, such as designing and
manufacturing ROMs for booting front-ends through various
connections, diagnostics, documentation, testing, and performance
analysis. Interestingly, after initial dual-cpu SMP was
released, it took another year to formalize the version
supporting additional cpus, but the process didn't involve
writing a single line of code -- just logistics.
To simplify SMP development with respect to cpu coordination, cpu
byte manipulation instructions were made read/modify/write. This
was accommodated on the KL-10 by a microcode change; an official
hardware engineering change was implemented for the KI-10.
Initial SMP development was actually carried out on the KI: the
equipment was readily available and the developers preferred
debugging on the KI. Before SMP release, however, the decision
was made to restrict SMP support to the KL. The KI was already a
relatively old product by that time, and extra resources would be
required to continue to test and maintain SMP for both
configurations. Because this decision came late in SMP
development, only two items were missing for comprehensive SMP
support on the KI. The DL-10, a memory sharing interface for
front-end PDP-11 processors, worked only on KI CPU0. More
important, the KI never acquired Phase IV DECnet support.
Initially TOPS-10 developers believed DECnet support would
require KL-only extended addressing. When it became clear that
DECnet could be implemented on the KS (a compact and inexpensive
single-cpu-only TOPS-10/TOPS-20 system), retrofitting DECnet to
the KI would have been possible, but by then the TOPS-10 group no
longer had a KI for development and testing.
Contributing to the rapid pace of SMP development, and definitely
part of TOPS-10 folklore, were the working hours of Jim Flemming
and Tony Wachs. Though most of the staff tended to keep more
standard business hours, ordinarily Jim and Tony preferred to
start work around 2 a.m. This allowed them dedicated access to
hardware for development and debugging, and largely kept them
from being bothered by typical office distractions. Because
their hours did overlap with the beginning of the conventional
workday, they were not completely isolated from other personnel,
so communication was not impacted.
ON-GOING DEVELOPMENT
Useful KL-10 hardware development did not cease after SMP was
released, partly because development of the planned KL follow-on
processor was cancelled in 1983. Existing customers needed
greater KL performance while they evaluated eventual replacement
options for their TOPS-10 systems, so the MCA25 was introduced in
1984 to double the size of the KL-10 cache and paging memory. It
also increased the size of a cache line from four to eight words
and implemented a "keep me" bit in the page tables. The keep-me
bit is a suggestion to the microcode that when the paging memory
is flushed, certain data which are likely to be needed soon
should be kept around. (The most interesting problem regarding
SMP support for this option is what to do when some cpus have it
and some don't; in TOPS-10, the solution was to make the keep-me
bit part of the per-cpu database and simply OR it in for
appropriate pages.)
Certainly the introduction of a new product such as the MCA25,
with SMP support, shows that SMP development didn't stop in 1980.
Furthermore, it demonstrates that if the basic SMP software
architecture is correct, incremental hardware additions can
enhance SMP system performance with only relatively minor
software changes.
SMP'S RELATION TO OTHER PRODUCTS
SMP was not developed in response to multiprocessing competitors.
It was simply the best way to extend the utility of TOPS-10 while
waiting for a successor to the KL cpu (though no such product
ever materialized). Neither was SMP implemented according to any
externally published specification or design; it just grew
normally out of the existing TOPS-10 design and philosophy.
SMP did have an important influence on later DEC hardware
designs, however: in new products hardware, not software, is
responsible for cache coherency. The design, implementation, and
debugging problems associated with cache in TOPS-10 SMP clearly
demonstrated the need for hardware to maintain cache coherency.
This is particularly important with the increasing interest in
parallel processing, where sharable, writable user data is
fundamental.
In spite of TOPS-10's success with SMP, TOPS-20 never gained such
capability. It should not be technically difficult because
TOPS-20 already has the basic mutual exclusion support needed for
SMP. Also, because of TOPS-20's "mapped" i/o, cache problems
should be nearly non-existent. Finally, from the standpoint of
cpu scheduling, TOPS-20 SMP should present relatively few
problems.
TOPS-20 SMP was never seriously considered simply due to lack of
suitable shared memory. The DECSYSTEM-20 had only private
internal memory. A shared internal memory product, the MX20, was
planned but cancelled (had it appeared, however, it would have
supported only two cpus). Furthermore, in the final
specification for the "Jupiter" (planned follow-on to the KL
cpu), memory-sharing capability was eliminated for higher
performance and lower cost. Therefore the anticipated future
hardware platform for TOPS-20 precluded tightly-coupled
multiprocessing support.
As far as VAX multiprocessing is concerned, there has been
significant contact between TOPS-10 SMP and VMS developers.
Principally Jim Flemming has been conveying TOPS-10 SMP ideas and
experience to other groups. Jim has also been involved in
implementing an SMP version of UNIX (a trademark of AT&T)
System V for VAX computers.
SUMMARY
TOPS-10 is a very successful operating system, which benefited
greatly from user input. DECUS (Digital Equipment Computer Users
Society) meetings were always a valuable source of suggestions
for new features and enhancements, and from the early days
TOPS-10 developers were very accessible to users, thereby contributing
to a constant exchange of ideas. Both internal and external
users always looked forward to new releases of TOPS-10, because
each release brought more power and capability. Initial support
for multiprocessing was provided in the early 1970s.
Tightly-coupled multiprocessing can offer configuration and
processing flexibility over a conventional uniprocessing
architecture. TOPS-10 multiprocessing moved from an early
master/slave approach to Symmetric Multiprocessing, which is more
general and provides especially attractive benefits in terms of
system performance and availability, and also has economic and
administrative advantages. In an interesting parallel, initial
multiprocessing support in VAX/VMS was also a master/slave
implementation. VAX/VMS support for symmetric multiprocessing
was not provided until version 5.
The first TOPS-10 SMP release was in 1980, and represented a
major milestone in Digital's 36-bit system history. TOPS-10
continues to be a workhorse in sites around the world, which is
quite amazing considering it is already in its third decade.
ACKNOWLEDGEMENTS
Some of the material presented herein appeared in the June, 1980,
Datamation article "More Power to You", by Allan B. Wilson. Tony
Wachs (one of the key designers and implementors of TOPS-10 and
SMP) and Barbara Huizenga (a member of the TOPS-10/SMP team)
reviewed several drafts of this chapter. Duane Winkler of Oak
Ridge National Laboratory (operated by Martin Marietta Energy
Systems, Inc.), Oak Ridge, TN, furnished the information on the
ORNL SMP system. TOPS-10 and SMP represent the work of many more
contributors than can be given proper credit individually.
THE AUTHORS
Allan Wilson worked for Digital Equipment Corporation for eight
years, and held software development, technical consulting, and
marketing positions there; he participated in the development of
SMP, and also independently developed TOPS-10 support for the
original DECSYSTEM-20-only hardware, thus creating the product
later released as the DECsystem 1091. He currently is an
Executive Vice President of Intergraph Corporation.
Jim Flemming is a Consulting Software Engineer for Digital
Equipment Corporation. He was one of the principal architects
and implementors of TOPS-10 SMP. A recent project involved
implementation of an SMP version of System V UNIX for VAX
computers.
----------------
The End.