WO1992003784A1 - Scheduling method for a multiprocessing operating system - Google Patents


Info

Publication number
WO1992003784A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
priority
dispatcher
scheduling
run queue
Prior art date
Application number
PCT/US1991/004068
Other languages
French (fr)
Inventor
Gregory G. Gaertner
George A. Spix
Diane M. Wengelski
Keith J. Thompson
Original Assignee
Supercomputer Systems Limited Partnership
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Supercomputer Systems Limited Partnership filed Critical Supercomputer Systems Limited Partnership
Publication of WO1992003784A1 publication Critical patent/WO1992003784A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

An integrated dispatcher (1112) schedules non-homogeneous processes in a tightly-coupled multiprocessor system. Self-scheduling processes (20, 30, 40, 60) may be scheduled on any available processor; the only criterion for scheduling is priority (0-99). The integrated dispatcher (1112) has a mechanism (10) to efficiently multithread scheduling.

Description

SCHEDULING METHOD FOR A MULTIPROCESSING OPERATING SYSTEM
RELATED APPLICATIONS
This application is a continuation-in-part of an application filed in the United States Patent and Trademark Office on June 11, 1990, entitled INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM, Serial No. 07/537,466, and assigned to the assignee of the present invention, the disclosure of which is hereby incorporated by reference in the present application. The application is also related to copending applications entitled DISTRIBUTED ARCHITECTURE FOR INPUT/OUTPUT FOR A MULTIPROCESSOR SYSTEM, Serial No. 07/536,182, METHOD AND APPARATUS FOR A LOAD AND FLAG INSTRUCTION, Serial No. 07/536,217, and SIGNALING MECHANISM FOR A MULTIPROCESSOR SYSTEM, Serial No. 07/536,192. The application is also related to copending applications filed concurrently herewith, entitled DUAL LEVEL SCHEDULING OF PROCESSES TO MULTIPLE PARALLEL REGIONS OF A MULTITHREADED PROGRAM ON A TIGHTLY COUPLED MULTIPROCESSOR COMPUTER SYSTEM, METHOD OF EFFICIENT COMMUNICATION BETWEEN COPROCESSORS OF UNEQUAL SPEEDS, and METHOD OF IMPLEMENTING KERNEL FUNCTIONS USING MINIMAL CONTEXT PROCESSES, all of which are assigned to the assignee of the present invention, the disclosures of which are hereby incorporated by reference in the present application.
TECHNICAL FIELD
The present invention relates generally to multiprocessor computer systems and specifically to allocating processors in a tightly-coupled configuration to execute the threads of one or more multithreaded programs that are running on the system simultaneously.
BACKGROUND ART
Processes are entities that are scheduled by the operating system to run on processors. In a multithreaded program, different threads may execute simultaneously on different processors. If the processes executing the different threads of a program are scheduled to execute simultaneously on different processors, then multiprocessing of the multithreaded program is achieved. In addition, if multiple system processes are scheduled to run simultaneously in multiple processors, the operating system has achieved multiprocessing.
Generally, in all process scheduling at least four types of contenders compete for processor access:
1) Processes waking up after waiting for an event.
2) Work needing to be done after an interrupt.
3) Multiple threads in the operating system.
4) Multiple threads in user processes.
One problem with existing implementations of multithreaded systems is that a bottleneck occurs when multiple threads must wait at a final, central point to be dispatched to a processor. Such a scheme may use a lock manager to schedule processes, with the result that every process must wait in line for a processor. Inefficient scheduling may occur if a lower priority process is in the queue ahead of a higher priority process. Thus, the effect of the lock manager is to reduce a multithreaded process to a single thread of execution at the point where processes are dispatched.
Another problem related to existing implementations is that efficiency is reduced because of overhead associated with processes. In a classical Unix implementation, only one kind of process entity can be created or processed. A process consists of system-side context and user-side context. Because a classical Unix implementation has only one creating entity, the fork, the system processes contain more context information than is actually required. The user context (i.e., user block and various process table fields) is ignored by the system and not used. However, this context has overhead associated with memory and switching which is not ignored and thus consumes unnecessary resources.
Another problem with existing implementations is that when an interrupt occurs, the processor which receives the interrupt stops processing to handle the interrupt, regardless of what the processor was doing. This can result in delaying a high priority task by making the processor service a lower priority interrupt.
Another problem can occur when an implementation has multiple schedulers in a tightly coupled multiprocessing environment. Each of the schedulers controls a type of process and as such all schedulers are in contention for access to processors. Decentralizing the run queue function has overhead penalties for the complexity in managing locally scheduled processes.
SUMMARY OF THE INVENTION
This invention relates to the scheduling of multiple computing resources between multiple categories of contenders and the efficient use of these computing resources. In particular, it relates to the scheduling of multiple, closely-coupled computing processors. (Unix is a trademark of AT&T Bell Laboratories.)
The scheduling method, hereinafter referred to as an integrated dispatcher, improves the efficiency of scheduling multiple processors by providing a single focal point for scheduling. That is, there are not independent schedulers for each category of processes to be scheduled. This ensures that the highest priority process will be scheduled regardless of its type. For example, if the schedulers were independent, a processor running the interrupt process scheduler would choose the highest priority process from the interrupt process run queue, even though there may be higher priority processes waiting on a different scheduler's run queue. The integrated dispatcher provides a mechanism which allows processors to be self-scheduling. The integrated dispatching process runs in a processor to choose which process will run next in that processor. That is, there is not a supreme, supervisory dispatcher which allocates processes to specific processors.
The entities scheduled by the integrated dispatcher need not be homogeneous. That is, the integrated dispatcher chooses what entity will run next in this processor based on the common criterion of priority, regardless of type. In the preferred embodiment, the types of entities which can be scheduled are the iprocs, mprocs, and procs as described hereinafter.
The integrated dispatcher provides the method to efficiently multithread scheduling. That is, mechanisms have been added which will allow all processors to be simultaneously running the integrated dispatcher with limited chance of conflict. If each processor has selected a process of a different priority, each passes through the scheduler unaware that other processors are also dispatching. If two processes have the same priority, they are processed in a pipeline fashion as described hereinafter.
No new hardware is needed to support this invention. The mechanism referred to above is preferably the software synchronization mechanisms employed to allow simultaneous execution of specific pieces of code without endangering the integrity of the system.
The present invention is best suited for scheduling tightly coupled processors and works best if the hardware provides convenient locking mechanisms on memory access as described in the copending patent application, DISTRIBUTED ARCHITECTURE FOR INPUT/OUTPUT FOR A MULTIPROCESSOR SYSTEM, Serial No. 07/536,182. However, the application of this scheduling method is not restricted to this situation, and the method can be used, in whole or in part, in any facility requiring efficient processor scheduling.
These and other objects of the present invention will become apparent with reference to the drawings, the detailed descriptions of the preferred embodiment and the appended claims.
Those having normal skill in the art will recognize the foregoing and other objects, features, advantages and applications of the present invention from the following more detailed description of the preferred embodiments as illustrated in the accompanying drawings.
DESCRIPTION OF THE DRAWINGS
Figure 1 is a schematic diagram showing the relationship of the integrated dispatcher to other parts of the operating system kernel.
Figure 2 is a diagram of the integrated dispatcher, run queue, wake queue, and process entity scheme.
Figure 3 is a diagram of the run queue with pipelined priority.
Figure 4 is a comparative diagram between a full process and several microprocesses.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The process control subsystem 1110 of the preferred embodiment is shown within the context of the Unix kernel in Figure 1. The integrated dispatcher 1112 integrates the scheduling of microprocesses, interrupts, processes, and standard Unix processes.
In the preferred embodiment as illustrated in Figure 2, the integrated dispatcher 1112 runs on a system of tightly coupled multiprocessors with hardware support for locking memory on a per word basis. This hardware mechanism is described in copending application METHOD AND APPARATUS FOR A LOAD AND FLAG INSTRUCTION. The integrated dispatcher is multithreaded. This means that all processors can be executing the integrated dispatcher simultaneously.
The integrated dispatcher is multithreaded to allow processors to simultaneously execute the dispatching code with minimal blocking. That is, all other processors should not have to block while one processor completes dispatching; such blocking is a bottleneck and extremely inefficient. Rather, the present invention uses software synchronization methods to protect common data structures while allowing all processors to continue. Until processors contend for a specific common data structure, they will continue unimpeded. If a processor does attempt to access a protected structure, it will block until the processor that locked it is finished with it. The data structures are set up to minimize these instances of blockage. For instance, if every processor simultaneously attempts to dispatch at a different priority, each will access a unique protectable structure, allowing each to continue without blocking.
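As a rough illustration of such per-structure locking, the following C sketch uses a C11 atomic test-and-set as a software stand-in for the load-and-flag hardware described in the copending application; the names word_lock and word_unlock are illustrative, not taken from the patent.

    #include <stdatomic.h>

    /* One lock word per protected structure (e.g., per run queue entry).
     * Initialize the flag with ATOMIC_FLAG_INIT. */
    typedef struct {
        atomic_flag flag;   /* set while some processor holds the lock */
    } word_lock_t;

    static inline void word_lock(word_lock_t *l)
    {
        /* test-and-set stands in for a hardware load-and-flag:
         * read the word and mark it in one indivisible step. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;   /* spin: block until the locking processor is finished */
    }

    static inline void word_unlock(word_lock_t *l)
    {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

A processor dispatching at a priority nobody else is touching takes its lock word without ever spinning, which is exactly the "continue unimpeded" case described above.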
If a processor has nothing to do, the processor runs the integrated dispatcher, which selects the highest priority process from the run queue 10 and starts the process running in that processor. Contention for run queue entries is reduced because the dispatcher locks only the entries it currently is looking at. Entries on the run queue are also locked when new processes are put on the run queue 10 by mproc create 20, by consume entry 30, by Zero-level interrupt 50, by inewproc 40, and by the usual Unix scheduling means (waking, switching, and creating routines) 60.
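A minimal C sketch of that self-scheduling loop follows; every name is hypothetical, since the patent gives no source. Each idle processor calls this routine itself; nothing assigns work to it from above.

    struct proc;                                   /* any schedulable entity      */
    extern struct proc *runq_pop_highest(void);    /* NULL if the queue is empty  */
    extern void run(struct proc *p);               /* restore context; returns    */
                                                   /* when the process yields     */

    /* Hypothetical entry point run by a processor with nothing to do. */
    void integrated_dispatch(void)
    {
        for (;;) {
            struct proc *next = runq_pop_highest();
            if (next)
                run(next);  /* start the highest priority process here */
            /* else: nothing runnable; poll the run queue again */
        }
    }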
An mproc is a kernel representation of a microprocess. Microprocesses are minimal context user processes designed to run very specific tasks very efficiently. That is, they are expected to be short lived. Microprocesses share a process image and user area making them very lean and efficient to create, but doing so also prevents them from switching because they do not have the unique context save areas that full processes do.
Each full process has a unique user area and shared image process entry. The shared image process table entry contains information relevant to maintaining the segments which comprise the image. When a full process switches out (yields the processor to another process), it must save the state of the processor so that the same state can be restored before this process is allowed to execute again. Microprocesses do not have unique context save areas (full process fields and user area structure) in which to store the processor state, so they are expected to run to completion rather than yield the processor on the assumption they will resume at a later time. Figure 4 illustrates an organizational comparison between a full process and microprocesses.
Refer to the copending application DUAL LEVEL SCHEDULING OF PROCESSES TO MULTIPLE PARALLEL REGIONS OF A MULTITHREADED PROGRAM ON A TIGHTLY COUPLED MULTIPROCESSOR COMPUTER SYSTEM for a description of the mproc method. The application METHOD OF EFFICIENT COMMUNICATION BETWEEN COPROCESSORS OF UNEQUAL SPEEDS describes the consume entry method.
An iproc is a minimal context process designed to efficiently execute kernel functions. The kernel procedure that creates these minimal context processes is called inewproc. It is a very simplified version of newproc, which creates full processes.
A full process has both system context and user context. Kernel functions do not need user context, so allocating a full-context process to execute these functions is inefficient. Instead, a subset of a full process is allocated which allows the kernel function to execute. It does not have a user area or shared image process entry since it will not be executing user code. These functions are then very efficiently switched in and out since they depend on very little context. What little context they do depend on is saved in the iproc table entry. The application METHOD OF IMPLEMENTING KERNEL FUNCTIONS USING MINIMAL CONTEXT PROCESSES describes the inewproc method.
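The contrast might be pictured with representative C structures; none of these field names come from the patent, and real tables would hold far more state.

    struct user;      /* user block, as in classical Unix  */
    struct shimage;   /* shared image process table entry  */
    struct cpu_state { unsigned long regs[32]; };   /* representative only */

    /* Full process: system context plus user context. */
    struct full_proc {
        struct cpu_state  sys_ctx;   /* kernel-side context save area */
        struct user      *u_area;    /* user block (swappable)        */
        struct shimage   *image;     /* shared image process entry    */
        /* ... many more scheduling and signal fields ...             */
    };

    /* iproc: the kernel-only subset. No user area, no shared image;
     * what little context a kernel function depends on is saved in
     * the iproc table entry itself. */
    struct iproc {
        struct cpu_state  sys_ctx;
        void (*func)(void *);        /* kernel function to run        */
        void *arg;
    };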
The run queue 10 resides in main memory and is equally accessible by all processors. Each processor runs the integrated dispatcher when the current process exits, when the current process waits for an event, or when the current process allows itself to be preempted. The integrated dispatcher can then cause a new process to be scheduled into the processor.
Figure 3 shows the run queue data structure 10. The integrated dispatcher schedules all entities through the run queue. Entities with the highest priority on the run queue 11 are dispatched first. A semaphore counter on each entry 12 prevents multiple processes from accessing the same run queue entry simultaneously. If more than one entity has the same priority 13, each remaining entity stays in the run queue until its turn to be processed. The result is a first-in-first-out queue for each priority. The semaphore is released after the processor pulls off the process, allowing another processor to pull an entry off that queue. This pipeline for each priority keeps processes flowing quickly to processors.
Counting semaphores are software mechanisms for synchronization. The semaphore consists of a count of the available resources to be managed and a list associated with entities waiting for that resource. To implement a lock, this count is set to one so that only one resource, the lock, exists. If the semaphore is going to govern multiple resources, it is set to the number of resources available. As a resource is taken, this count is decremented. When the semaphore count goes to zero, no more resources are available, so the requester is put to sleep to wait for one to become available. As a process frees a resource, it increments the semaphore counter and wakes up a waiting process.
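That behavior maps directly onto a classic counting semaphore. Here is a user-space C sketch with pthreads standing in for the kernel's sleep and wakeup primitives; a kernel implementation would use its own.

    #include <pthread.h>

    /* A counting semaphore: a count of available resources plus a
     * list of waiters. A count initialized to 1 implements a lock. */
    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t  waiters;   /* entities sleeping for a resource */
        int             count;     /* resources currently available    */
    } csem_t;

    void csem_init(csem_t *s, int resources)
    {
        pthread_mutex_init(&s->mtx, NULL);
        pthread_cond_init(&s->waiters, NULL);
        s->count = resources;      /* 1 => the semaphore is a lock */
    }

    /* Take a resource; sleep while none are available. */
    void csem_wait(csem_t *s)
    {
        pthread_mutex_lock(&s->mtx);
        while (s->count == 0)
            pthread_cond_wait(&s->waiters, &s->mtx);
        s->count--;
        pthread_mutex_unlock(&s->mtx);
    }

    /* Free a resource; increment the count and wake one waiter. */
    void csem_post(csem_t *s)
    {
        pthread_mutex_lock(&s->mtx);
        s->count++;
        pthread_cond_signal(&s->waiters);
        pthread_mutex_unlock(&s->mtx);
    }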
In the case of the run queue, a process acquires the 'lock' on the entry associated with a specific priority. It is then free to access the processes queued at this priority. All other processes will block at the lock and are prevented from accessing the queued processes until the first process is finished with them and frees the lock. Once the lock is acquired, the dispatching process can pop a process off of the run queue for execution. If multiple processes are queued at this priority, the rest remain on the queue for an ensuing dispatching process to dequeue once it obtains the lock. Note that as long as dispatching processes are accessing the run queue at different priorities, they will not block each other. They can run concurrently and share the dispatcher.
There is only one 'priority vector': it is the run queue itself. That is, the priority is actually the index into the run queue array. Each entry on the run queue array consists of a semaphore and the head of the list of processes awaiting execution at this priority. The list of processes is actually a linked list of the actual [im]proc table entries. That is, the run queue threads through fields in the [im]proc table entries.
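Taken literally, that layout suggests a C structure along these lines (all names are invented for illustration; the lock is the csem_t sketched above, initialized to a count of 1):

    #define NPRI 100                 /* priorities 0-99, per the abstract */

    struct proc {
        int          pri;            /* index into run_queue[]            */
        struct proc *rq_next;        /* the run queue threads through the */
    };                               /* [im]proc table entries themselves */

    struct runq_entry {
        csem_t       lock;           /* counting semaphore, count = 1     */
        struct proc *head, *tail;    /* FIFO of entities at this priority */
    };

    struct runq_entry run_queue[NPRI];   /* the single 'priority vector'  */

Enqueuing and dequeuing at a given priority then bracket the FIFO manipulation with csem_wait(&run_queue[pri].lock) and csem_post(&run_queue[pri].lock), so dispatchers working at different priorities never touch the same lock.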
In the preferred embodiment, the integrated dispatcher schedules all activities which require a processor. These activities include iprocs, mprocs, processes, and events from external coprocessors via the wake queue 70. In the preferred embodiment, the integrated dispatcher provides a single focal point for scheduling multiple process types to multiple processors. The integrated dispatcher provides a means for all processors to have access to all processes. As each processor accesses the run queue 10, the dispatcher schedules the highest priority item on the queue independent of its category.
The integrated dispatcher uses a common run queue 10 designed as a vector of first-in-first-out-queues. This scheme facilitates implementing the integrated dispatcher as a pipeline to eliminate the bottleneck for dispatching processes and is thus designed for efficient multiprocessor scheduling. The integrated dispatcher is completely symmetric with no preferential processors or processes and is therefore completely multithreaded.
Completely symmetric refers to the equality of all processors and schedulable entities. That is, no processors are given preference, nor is any process type ([im]proc) given preference. Each processor has an equal chance of obtaining the entry lock and therefore being allowed to pop a process to schedule. Processes are queued onto a run queue entry based solely upon their priority, without biasing on account of their type ([im]proc), and are ordered on the queue in first-in, first-out order.
This makes the dispatcher completely multithreaded because each processor is 'self-scheduling'. That is, there is not a processor dedicated to supervising the dispatching of processes to processors. Each processor flows through the dispatching code when it needs to dispatch a process to run on this processor.
This organization also maintains symmetry by having a single focal point for dispatching rather than having multiple dispatchers per schedulable entity type ([im]proc). If there were multiple dispatchers per type, this symmetry could not be maintained since one dispatcher would have to be checked before another thereby giving that process type preference. That is, given an iproc dispatcher, an mproc dispatcher and a proc dispatcher and given entries of equal priority on each, the dispatcher that is checked first actually has an inflated priority value since its entity is always chosen to run prior to the others also at that same priority but on the queue of another dispatcher.
As the dispatching mechanism for the run queue, the integrated dispatcher handles the various system process entities (iproc, mproc, proc, and so on) that are implemented in the preferred embodiment using standard multithreading methods. The run queue is protected by means well known in the art of creating multithreaded code and described in the application INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM.
In the preferred embodiment, the conditional switch capability of the invention enables lower priority processes to handle interrupts rather than delaying high priority processes.
Switching means 'yielding the processor'. The current context of the processor must be saved somewhere so that it can be restored before the process is allowed to execute again. After this processor context is tucked away (usually in the process user area), the context of the process that was chosen to run next in this processor is restored. When the switched-out process is chosen to run again, its processor context is restored so that the processor looks just as it did before the process switched out. Note that a process does not have to resume on the processor from which it was switched out. Any processor running the dispatching code and seeing this process as the highest priority process will restore this process's context and allow it to run.
Switching causes dispatcher code execution to choose the next process which should run on this processor. Therefore, the normal flow is: save the outgoing process's context; run the dispatcher to choose the next process to run in this processor; restore the context of the chosen process; allow the chosen process to resume execution.
Conditional switch is a more efficient version of this scheme that takes into account that the process may be experiencing preemption rather than voluntarily giving up the processor because it knows it is going to be waiting for a certain event to happen (sleep). Conditional switch delays saving the current process's context until it is sure that the process is going to be switched out. That is, if the process which was just switched out is the highest priority process on the run queue, it would simply be chosen as the 'next process' to run on this processor, resulting in an unnecessary save and restore of context. Therefore, the flow of conditional switch is: determine the highest priority process on the run queue; compare its priority with the priority of the current process; if the current process is at an equal or higher priority, do not continue with the switch but allow the current process to continue to execute; if the current process is at a lower priority, continue with the context switch, save the current process's context, restore the higher priority process's context, and allow that process to resume execution.
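Expressed as C, the difference between the two flows might look like this; all helpers are hypothetical, and larger numbers are assumed to mean higher priority, which the patent does not specify.

    struct proc { int pri; /* ... saved context ... */ };
    extern struct proc *curproc;               /* process on this processor */
    extern int  runq_highest_pri(void);        /* peek without dequeuing    */
    extern void save_context(struct proc *p);  /* tuck state into user area */
    extern void dispatch(void);                /* choose, restore, resume   */

    /* Ordinary switch: always save, then dispatch. */
    void do_switch(void)
    {
        save_context(curproc);
        dispatch();
    }

    /* Conditional switch, used on preemption: skip the save/restore
     * entirely when the current process is still the best candidate. */
    void cond_switch(void)
    {
        if (runq_highest_pri() <= curproc->pri)
            return;         /* keep executing; no context was saved */
        do_switch();        /* a higher priority process is waiting  */
    }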
A 'daemon' is a process that does not terminate. Once created, it performs a specific task and then goes to sleep waiting for that task to be needed again. These tasks are frequently needed, making it much more efficient to leave the process around rather than creating it each time its function is needed. An example of a daemon in UNIX is the buffer cache daemon, which is created at system boot time and woken whenever the buffer cache needs to be flushed out to disk.
There may exist a minimal-context system daemon which any processor can wake to handle the interrupt, or any processor may spawn a new minimal-context system process to handle it. Either way, the interrupted processor need not be delayed by handling the interrupt itself. When the integrated dispatcher runs, it will schedule the highest priority process. Therefore, if the newly created or awakened interrupt process is the highest priority process, it is chosen to run. Note that the concept of a lightweight system process is essential to this scheme because interrupt handling cannot be delayed while a full context process is created to handle the interrupt. This allows a processor doing work at a high priority to cause a processor running at a lower priority to handle the interrupt. A lightweight system process typically is an iproc, in contrast to a microprocess, which is an mproc; both were discussed hereinabove.
If it is imperative that this interrupt be handled immediately, the current processor can cause another processor to be preempted by sending that processor a signal (see application SIGNALING MECHANISM FOR A MULTIPROCESSOR SYSTEM). The signalled processor preempts when it receives the signal. The process running on the signalled processor can then decide whether or not it will handle the interrupt; if the iproc assigned to handle the interrupt is at a higher priority, the current process will be switched out and the iproc run.
A system variable holds the number of the processor that is currently executing at the lowest priority. Given this identification, any processor can signal any other processor as provided by the hardware outlined in the aforementioned copending patent application. Assuming this machine hardware exists, a processor receiving an interrupt can spawn an iproc to handle that interrupt and then signal the lowest priority processor. The interrupted processor then resumes execution of its work without servicing the interrupt. Receipt of the signal by the lowest priority processor forces it into the interrupt code and a conditional switch. Note that this 'lowest priority' process may not actually be at a lower priority than the newly spawned iproc which is servicing the interrupt. That is, if all the processors were executing very important tasks at the time of the interrupt, the 'lowest priority' process currently executing may still be at a higher priority than the iproc. In this case, the conditional switch allows this processor to continue to execute the current process rather than servicing the interrupt. The interrupt is serviced when a processor drops to a priority below that of the iproc.
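Under those assumptions, the hand-off might be sketched as follows; every name here is illustrative, and the signalling primitive itself is the subject of the copending application.

    extern int  lowest_pri_cpu;                 /* system variable from the text */
    extern void spawn_iproc(void (*fn)(void *), void *arg);  /* via inewproc     */
    extern void signal_processor(int cpu);      /* hardware-assisted signal      */
    extern void service_device(void *dev);      /* the actual interrupt work     */

    /* Run briefly by whichever processor happens to receive the interrupt. */
    void on_interrupt(void *dev)
    {
        spawn_iproc(service_device, dev);  /* iproc joins the run queue        */
        signal_processor(lowest_pri_cpu);  /* nudge the cheapest victim; its   */
                                           /* conditional switch runs the      */
                                           /* iproc only if the iproc outranks */
                                           /* the victim's current process     */
        /* this processor now resumes its own work untouched */
    }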
The problem of overhead on system processes has been eliminated, or at least substantially minimized, by the method provided in the preferred embodiment to create these special process entities which are nearly context free.
The dispatcher has a mechanism for generating minimal context system processes as described in the copending application METHOD OF IMPLEMENTING KERNEL FUNCTIONS USING MINIMAL CONTEXT PROCESSES.
A 'user block' or 'user area' is a data structure used by the kernel to save information related to a process that is not needed until the process is ready to run. That is, the kernel keeps data related to a process in process table entries and in user areas. All information needed regardless of the state of the process is in the process table. The user area can actually be swapped and must therefore not contain any information the system may need while it is not in memory. In the preferred embodiment, the user block is eliminated completely, and the information contained in the process table is minimized. The overall result is that overhead is minimized.
The integrated dispatcher addresses yet another limitation of prior implementations. The dispatcher runs in conjunction with a mechanism which allows slower coprocessors to schedule work in the multiprocessor, as described below as well as in the copending application METHOD OF EFFICIENT COMMUNICATION BETWEEN COPROCESSORS OF UNEQUAL SPEEDS.
The integrated dispatcher contains the intelligence for managing the wake queue 70. The wake queue is a method of communication between slow and fast processes which prevents bottlenecking the fast processes. Access to the wake queue occurs in such a way as to permit slower coprocessors to put entries on the wake queue without holding up the faster processors. Thus, coprocessors of unequal speed can efficiently schedule tasks for a fast processor to run without locking the run queue from which the fast processors are scheduled. For example, when a peripheral device which is slow requests action from a processor (which is fast), the request is held in the wake queue until a processor is available to handle the request.
Thus, faster processors are not held up by slower coprocessors.
The intelligence for servicing the wake queue has been added to the integrated dispatcher. Rather than having a slow coprocessor interrupt a fast processor to schedule a task or to indicate that some tasks are complete, the wake queue concept provides an intermediate queue upon which the coprocessor can queue information. When the integrated dispatcher is run, it checks to see if any entries are queued upon the wake queues and, if so, processes them. This processing varies widely depending upon the data queued. Any information can be queued and processed in this way as long as the coprocessor knows to queue it and the intelligence is added to the integrated dispatcher to process it. For example, if the information queued is simply the address of a process waiting for some coprocessor task to complete, the integrated dispatcher can now wake that process. This results in that process being queued on the run queue. Specific details of the wake queue are described in the copending application METHOD OF EFFICIENT COMMUNICATION BETWEEN COPROCESSORS OF UNEQUAL SPEEDS.
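For the process-address case in that example, the dispatcher-side check might look like this minimal sketch; the entry format and all names are assumptions, not the patent's.

    struct proc;
    struct wq_entry { struct proc *waiter; };        /* illustrative payload     */
    extern int  wakeq_consume(struct wq_entry *out); /* nonzero if one was taken */
    extern void wakeup(struct proc *p);              /* puts p on the run queue  */

    /* Called each time the integrated dispatcher runs, before dispatching. */
    static void service_wake_queue(void)
    {
        struct wq_entry e;
        while (wakeq_consume(&e))   /* slow coprocessors enqueue; we drain */
            wakeup(e.waiter);       /* woken process joins the run queue  */
    }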
Although the description of the preferred embodiment has been presented, it is contemplated that various changes could be made without deviating from the spirit of the present invention.
While the exemplary preferred embodiments of the present invention are described herein with particularity, those having normal skill in the art will recognize various changes, modifications, additions and applications other than those specifically mentioned herein without departing from the spirit of this invention.
What is claimed is:

Claims

1. In a multiprocessor environment wherein each processor is capable of accessing one or more storage areas, a method for self scheduling of service requests from a plurality of sources having potentially different priority levels for handling comprising the steps of:
establishing a run queue in a first storage area having storage positions therein arrayed in sequential order corresponding to priority levels;
monitoring service requests from the plurality of sources;
assigning a priority level to each service request for entry thereof into the corresponding position in said run queue;
creating an integrated dispatcher including a multiplicity of instructions for retention in a second storage area;
causing each processor upon becoming available to handle a service request to retrieve said integrated dispatcher;
responding to execution of said integrated dispatcher in said processor by inspecting said run queue to identify the service request having the highest priority; and
configuring said processor to handle the service request thus assigned to it.
2. The method in accordance with claim 1 wherein the service requests can include a demand for microprocess scheduling, a demand for interrupt process scheduling or a demand for process scheduling, and wherein the method includes the step of preventing any other processor from accessing said first storage area at the same priority level as that which said processor is handling pursuant to said configuring step until the processor having the preceding run queue entry has completed its loading step.
3. The method in accordance with claim 1 which includes the steps of associating all service requests of given priority level with a particular position in the run queue, and maintaining a given priority level in the run queue active until all service requests of that priority level are handled.
4. The method in accordance with claim 3 wherein said associating step includes the step of establishing a process identifying area and a semaphore counter function area corresponding to each said process identifying area.
PCT/US1991/004068 1990-08-23 1991-06-10 Scheduling method for a multiprocessing operating system WO1992003784A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57195290A 1990-08-23 1990-08-23
US571,952 1990-08-23

Publications (1)

Publication Number Publication Date
WO1992003784A1 true WO1992003784A1 (en) 1992-03-05

Family

ID=24285734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/004068 WO1992003784A1 (en) 1990-08-23 1991-06-10 Scheduling method for a multiprocessing operating system

Country Status (1)

Country Link
WO (1) WO1992003784A1 (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4316245A (en) * 1973-11-30 1982-02-16 Compagnie Honeywell Bull Apparatus and method for semaphore initialization in a multiprocessing computer system for process synchronization
US4286322A (en) * 1979-07-03 1981-08-25 International Business Machines Corporation Task handling apparatus
US4333144A (en) * 1980-02-05 1982-06-01 The Bendix Corporation Task communicator for multiple computer system
US4394727A (en) * 1981-05-04 1983-07-19 International Business Machines Corporation Multi-processor task dispatching apparatus
US4475156A (en) * 1982-09-21 1984-10-02 Xerox Corporation Virtual machine control
US4642758A (en) * 1984-07-16 1987-02-10 At&T Bell Laboratories File transfer scheduling arrangement
US4727487A (en) * 1984-07-31 1988-02-23 Hitachi, Ltd. Resource allocation method in a computer system
US4897780A (en) * 1984-10-09 1990-01-30 Wang Laboratories, Inc. Document manager system for allocating storage locations and generating corresponding control blocks for active documents in response to requests from active tasks
US4642756A (en) * 1985-03-15 1987-02-10 S & H Computer Systems, Inc. Method and apparatus for scheduling the execution of multiple processing tasks in a computer system
US4805107A (en) * 1987-04-15 1989-02-14 Allied-Signal Inc. Task scheduler for a fault tolerant multiple node processing system
US4985831A (en) * 1988-10-31 1991-01-15 Evans & Sutherland Computer Corp. Multiprocessor task scheduling system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010032205A1 (en) * 2008-09-17 2010-03-25 Nxp B.V. Electronic circuit comprising a plurality of processing devices
CN112749006A (en) * 2019-10-31 2021-05-04 爱思开海力士有限公司 Data storage device and operation method thereof
CN112749006B (en) * 2019-10-31 2024-04-16 爱思开海力士有限公司 Data storage device and method of operating the same

Similar Documents

Publication Publication Date Title
US5452452A (en) System having integrated dispatcher for self scheduling processors to execute multiple types of processes
US5390329A (en) Responding to service requests using minimal system-side context in a multiprocessor environment
EP0806730B1 (en) Real time dispatcher
US8612986B2 (en) Computer program product for scheduling ready threads in a multiprocessor computer based on an interrupt mask flag value associated with a thread and a current processor priority register value
US5274823A (en) Interrupt handling serialization for process level programming
US5202988A (en) System for communicating among processors having different speeds
US5469571A (en) Operating system architecture using multiple priority light weight kernel task based interrupt handling
US5630128A (en) Controlled scheduling of program threads in a multitasking operating system
US5247675A (en) Preemptive and non-preemptive scheduling and execution of program threads in a multitasking operating system
US6505229B1 (en) Method for allowing multiple processing threads and tasks to execute on one or more processor units for embedded real-time processor systems
US5745778A (en) Apparatus and method for improved CPU affinity in a multiprocessor system
JP2866241B2 (en) Computer system and scheduling method
US5333319A (en) Virtual storage data processor with enhanced dispatching priority allocation of CPU resources
US6633897B1 (en) Method and system for scheduling threads within a multiprocessor data processing system using an affinity scheduler
US5010482A (en) Multi-event mechanism for queuing happened events for a large data processing system
US20040172631A1 (en) Concurrent-multitasking processor
US5257375A (en) Method and apparatus for dispatching tasks requiring short-duration processor affinity
US20060130062A1 (en) Scheduling threads in a multi-threaded computer
EP0052713B1 (en) A process management system for scheduling work requests in a data processing system
Horowitz A run-time execution model for referential integrity maintenance
WO1992003784A1 (en) Scheduling method for a multiprocessing operating system
EP0544822B1 (en) Dual level scheduling of processes
WO2002046887A2 (en) Concurrent-multitasking processor
WO1992003779A1 (en) Method of efficient communication between coprocessors
Rothberg Interrupt handling in Linux

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE