Chip Multithreading Systems Need a New Operating System Scheduler

Authors: Alexandra Fedorova, Christopher Small, Daniel Nussbaum, and Margo Seltzer (Harvard University, Sun Microsystems)

Published in: USENIX 2005


Question–Answer Form

1. What is your take-away message from this paper?

At the time the paper was written, operating system schedulers did not take advantage of hardware multithreading, leaving potential performance untapped, especially for OLTP workloads in which threads contend for CPU resources. The paper's new metric, instruction delay latency, gives useful insight into a workload's demand on the processor pipeline, but it does not account for other shared CPU resources such as caches.


2. What is the motivation for this work?


3. What is the proposed solution (hypothesis, idea, design)?


4. What is the author's evaluation of the solution?


5. What is your analysis of the identified problem, idea, and evaluation?


6. What are the paper's contributions?


7. What are future directions for this research?


8. What questions are you left with?


NOTES

p.1

frequent branches and control transfers, can result in processor pipeline utilization as low as 19%

common web-service workloads result in poor CPU utilization

p.1

MT-savvy operating system scheduler could improve application performance by a factor of two

a scheduler that exploits hardware multithreading can improve performance

p.1

application servers, web services, and on-line transaction processing systems, are notorious for poor utilization of CPU pipeline

examples of common workloads

p.1

short stretches of integer operations, with frequent dynamic branches. This negatively affects cache locality and branch prediction and causes frequent processor stalls

explanation for poor CPU performance

p.1

do little for transaction processing-like workloads.

CPU not designed for common workload (works best for scientific applications)

p.1

improve processor utilization for transaction-processing-like workloads by offering better support for thread-level parallelism (TLP)

How MT improves performance

p.1

A CMP processor includes multiple processor cores on a single chip, which allows more than one thread to be active at a time and improves utilization of chip resources

CMP processor

p.1

An MT processor has multiple sets of registers and other thread state and interleaves execution of instructions from different threads

MT processor

p.1

Hardware vendors are proposing architectures that combine CMP and MT. We will refer to such systems as chip multithreading (CMT) systems

systems that combine MT and CMP are called CMT

p.1

Our experiments have shown that the potential for performance gain from a specialized scheduler on CMT systems is even greater, and can be as large as a factor of two.

a new scheduler is needed to exploit the new hardware

p.2

Scheduling on single-processor MT systems has been studied before [10-12]. The scheduling algorithms for single-processor MT systems discussed in the literature worked as follows: they ran all combinations of threads that could be co-scheduled, determined which combination(s) yielded the best performance,

the approach taken by related work

p.2

An OLTP workload may involve a hundred threads; on a CMT system with 16 hardware contexts, there are 10^27 combinations to evaluate.

not a trivial problem
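
To get a feel for the size of that search space, here is a back-of-the-envelope sketch in Python. The counting convention (4 cores x 4 contexts, unordered groups per core) is my assumption; the paper's figure presumably counts differently, but the order of intractability is the same.

    from math import comb, factorial

    threads = 100                  # runnable threads in an OLTP workload
    cores, per_core = 4, 4
    contexts = cores * per_core    # 16 hardware contexts

    # Ways to pick which 16 of the 100 threads run in a given interval.
    choices = comb(threads, contexts)

    # Ways to split those 16 threads into 4 unordered groups of 4
    # (threads in a group share one core's pipeline and caches).
    groupings = factorial(contexts) // (factorial(per_core) ** cores * factorial(cores))

    print(f"{choices * groupings:.2e} candidate co-schedules")
    # ~3.5e+24 under this convention -- far too many to measure one at a
    # time, so the exhaustive approach from prior SMT work cannot scale.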

p.1

minimizes resource contention and maximizes system throughput,

ideas for new CMT scheduler

p.1

The scheduler must understand how its scheduling decisions will affect resource contention, because resource contention ultimately determines performance

design choices for the new scheduler, what the scheduler must be aware of

p.2

Our proposal involves building a scheduler that would model relative resource contention resulting from different potential schedules

design proposal for the new scheduler
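
The paper only sketches this proposal, so as an illustration of what "modeling relative resource contention" could mean, here is a toy scoring function. The 1/(1+d) issue-rate model and all names are my assumptions, not the authors' design.

    def core_throughput(delay_latencies):
        """Modeled instructions/cycle for one core: a thread whose average
        instruction delay latency is d issues about 1/(1+d) instructions
        per cycle; the core's single pipeline caps the total at 1.0."""
        demand = sum(1.0 / (1.0 + d) for d in delay_latencies)
        return min(1.0, demand)

    def schedule_score(schedule):
        """schedule: list of per-core lists of average delay latencies."""
        return sum(core_throughput(core) for core in schedule)

    # Two placements of the same four threads on two 2-context cores:
    mixed      = [[0.2, 6.0], [0.4, 5.0]]  # pair CPU-bound with memory-bound
    segregated = [[0.2, 0.4], [6.0, 5.0]]  # CPU-bound threads share a core
    print(schedule_score(mixed), schedule_score(segregated))  # ~1.86 vs ~1.31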

p.2

Our toolkit models systems with multiple multithreaded CPU cores

experiments are designed around the new hardware architecture, where each CPU core has multiple hardware contexts

p.2

An MT core has multiple hardware contexts (usually one, two, four or eight), where each context consists of a set of registers and other thread state.
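
As a mental model of "a set of registers and other thread state" (the fields below are illustrative, not a real hardware specification):

    from dataclasses import dataclass, field

    @dataclass
    class HardwareContext:
        """Illustrative per-thread state that an MT core keeps on chip so
        it can switch threads each cycle without a software context switch."""
        pc: int = 0                        # program counter
        regs: list = field(default_factory=lambda: [0] * 32)  # register file
        blocked_until_cycle: int = 0       # set when the thread stalls,
                                           # e.g. while a cache miss is serviced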

p.2

switching between contexts on each cycle. A thread may become blocked when it encounters a long latency operation, such as servicing a cache miss

hides cache-miss latency by switching to other threads while the blocked thread waits for data from memory

p.2

This latency-hiding property is at the heart of hardware multithreading

p.2

SMT systems are more complex and require more chip real estate. We have instead taken the approach of leveraging a simple, classical RISC core in order to allow space for more cores on each chip. This allows for a higher degree of multithreading in the system, resulting in higher throughput for multithreaded and multiprogrammed workloads.

design tradeoff: many simple RISC cores instead of fewer, more complex SMT cores

p.2

When assigning threads to hardware contexts, the scheduler has to decide which threads should be run on the same processor, and which threads should be run separately

OS scheduler policy for CPU optimization. Optimal thread assignment results in high utilization of the CPU

p.2

If we are to design a scheduler that can find good thread assignments, we must understand the causes and effects of contention among the threads that share a processor.

Contention seems to be the heart of the problem

p.2

The key to understanding why this is the case is the concept of instruction delay latency

p.2

When a thread performs a long-latency operation, it is blocked; subsequent instructions to be issued by that thread are delayed until the operation completes. We term the duration of this delay the instruction delay latency. ALU instructions have 0 delay latency. A load that hits in the L1 cache has a latency of four cycles. A branch delays the subsequent instruction by two cycles

instruction delay latency is an important unit to understand the impact of CPU instructions
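
Given the latencies quoted above (ALU 0, L1-hit load 4, branch 2 cycles), a workload's average instruction delay latency is a weighted mean over its instruction mix. A minimal sketch; the example mixes are made up:

    # Delay latencies (cycles) quoted for the simulated processor.
    DELAY = {"alu": 0, "load_l1_hit": 4, "branch": 2}

    def avg_delay_latency(mix):
        """mix: {instruction type: fraction of instructions}."""
        return sum(frac * DELAY[kind] for kind, frac in mix.items())

    cpu_bound = {"alu": 0.9, "branch": 0.1}
    mem_bound = {"alu": 0.3, "load_l1_hit": 0.6, "branch": 0.1}
    print(avg_delay_latency(cpu_bound))  # 0.2 cycles per instruction
    print(avg_delay_latency(mem_bound))  # 2.6 cycles per instruction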

p.2

Processor pipeline contention depends on the latencies of the instructions that the workload executes

memory loads -> high latency. ALU operations -> no latency

p.3

A, a single-threaded processor; B, a multithreaded processor with four hardware contexts; and C, a four-way multiprocessor

comparison profiles

p.3

A and B will perform comparably; C will have a throughput four times greater than the other two systems, because it has four times as many functional units.

B has four hardware contexts but only a single set of functional units, which leads to resource contention for the CPU-bound workload, so A and B perform the same

p.3

When running the memory-bound workload, System B and System C will perform comparably, outperforming System A by a factor of four.

B and C perform comparably because their multiple hardware contexts hide memory latency (see the sketch below)
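
Both comparisons can be reproduced with simple arithmetic. Assuming each instruction delays its thread by d cycles before the next can issue, one thread sustains 1/(1+d) instructions per cycle; the workload numbers below are my own round figures:

    def throughput(n_threads, n_pipelines, d):
        """Idealized instructions/cycle: each thread issues 1/(1+d),
        capped at one instruction per pipeline per cycle."""
        return min(n_pipelines, n_threads / (1 + d))

    for name, d in [("CPU-bound", 0), ("memory-bound", 3)]:
        a = throughput(1, 1, d)  # A: single-threaded processor
        b = throughput(4, 1, d)  # B: 4 hardware contexts, 1 pipeline
        c = throughput(4, 4, d)  # C: 4-way multiprocessor
        print(f"{name}: A={a:.2f}  B={b:.2f}  C={c:.2f}")
    # CPU-bound:    A=1.00  B=1.00  C=4.00  (C has 4x the functional units)
    # memory-bound: A=0.25  B=1.00  C=1.00  (B's contexts hide the stalls)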

p.4

This simple experiment demonstrates that the instruction mix, and, more precisely, the average instruction delay latency, can be used as a heuristic for approximating the processor pipeline requirements for a workload.

the proposed metric provides useful information for the scheduler to make scheduling decisions.

p.4

A thread with an instruction mix dominated by long-latency instructions can leave functional units underutilized. Therefore, it is logical to co-schedule it with a thread that is running a lot of short-latency instructions and has high demand for functional units

co-scheduling strategy that balances instruction delay latency across a core's contexts (see the sketch below)
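
A toy version of that co-scheduling rule; the greedy pairing policy is my guess at a concrete heuristic, not the paper's algorithm. Sort runnable threads by average delay latency and pair the extremes, so each two-context core gets one pipeline-hungry thread and one that mostly stalls:

    def pair_by_delay_latency(threads):
        """threads: list of (thread_id, avg_delay_latency) tuples.
        Greedily pairs the shortest-latency (pipeline-hungry) thread
        with the longest-latency (mostly stalled) one."""
        ordered = sorted(threads, key=lambda t: t[1])
        pairs = []
        while len(ordered) >= 2:
            pairs.append((ordered.pop(0), ordered.pop(-1)))
        return pairs, ordered  # leftover thread, if any

    runnable = [("T1", 0.2), ("T2", 5.1), ("T3", 0.4), ("T4", 3.8)]
    print(pair_by_delay_latency(runnable))
    # -> ([(('T1', 0.2), ('T2', 5.1)), (('T3', 0.4), ('T4', 3.8))], [])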

p.4

this technique does not take into account effects of cache contention that surface when threads with large working sets are running on the same processor.

limitations of the proposed model

p.5

We envision some limitations of this approach. CPI does not give precise information on the types of instructions that the workload is executing.

limitations

Architecture-Aware OS Kernel Paper, p.1

CMT systems need new schedulers: a naïve scheduler may squander up to half of available application performance, and existing SMT scheduling algorithms do not scale to dozens of threads.