Modern CPUs keep getting more powerful, yet they spend much of their time
with nothing to do. When there is no job to run, a CPU enters the idle
state. During an idle period we can cut off part of its power and put it
into a low-power state, provided that we know when new work arrives so
that we can re-activate the CPU and have it do its job again. The process
looks like this:
                  no job               cut off power
    CPU active ----------> CPU idle ---------------> low-power state
        ^                                                  |
        |                                                  |
        |                    re-power up                   |
        +<-------------------------------------------------+
To achieve this, we need to answer the following questions:

    1) How do we know the CPU is idle, so that we can cut off power?
    2) How do we cut off power?
    3) When and how do we power the CPU up again?
1. When CPU is idle
-----------------
The answer to the first question is actually very simple: when a CPU is
idle, it runs the swapper process (process ID 0; it should probably be
called the idle thread, but "swapper" is the legacy name and all the
textbooks use it). So a CPU must be idle whenever it runs the swapper.
Traditionally, the swapper process does nothing. In an endless loop, it
checks whether there is another task to run. If not, it delays for a
while and checks again; otherwise it tells the process scheduler to
schedule the other task. The code looks like this:
    while (1)
    {
        while (no_job_to_do)
        {
            delay for a while;    <------- a halt instruction, in fact
        }
        schedule_other_process;
    }
So, to cut CPU power, we change the above code to:
    while (1)
    {
        while (no_job_to_do)
        {
            cut_off_cpu_power;    <------- done in pm_idle() on Linux
            ...
        }
        schedule_other_process;
    }
2. How to Cut Off Power
-----------------------
Note that a CPU consists of many units: besides the core logic, it has
caches, a BIU (Bus Interface Unit) and a Local APIC. When a CPU is idle,
we can cut the clock signal and power from some of these units. The more
units are stopped, the more power is saved.
There is a side effect to consider, though: each unit takes some time to
power up again. So the more units are stopped, the longer it takes for
the CPU to be re-activated (woken up). We call this time the entry/exit
latency.
2.1 C-State
-------------
To balance power saving against entry/exit latency, Intel CPUs provide
several low-power states called C-states, or sleeping states. Depending
on the CPU model, Intel CPUs support C1, C2, C3, C4, C5, C6, ... (C0 is
the active state). While in a sleeping state (C1 or above), the CPU does
not execute any instructions and consumes less power.
    C0 - The CPU is fully powered and executes instructions.
    C1 - Stops the main internal core clocks.
    C2 - Has two sub-modes: Stop-Grant and Stop-Clock.
         While in C1/C2, the CPU still processes bus snoops and snoops
         from other cores. That is, the CPU automatically exits C1/C2,
         handles the snoop and then returns to C1/C2.
    C3 - Flushes the cache, so the CPU does not need to exit C3 to
         handle snoops.
    C4 - For multi-core processors. For example, on Core 2 Duo, if both
         cores are in C4, the package enters a deeper sleep state.
    C5 - I don't know :)
    C6 - On Intel Core i7, the package enters an even deeper sleep state
         if all cores are in C6, with some additional power saving from
         the QPI link.
    Cn - ... (sigh)
Besides Cx, some Intel CPUs have enhanced CxE states. For example, Intel
Core 2 Duo introduced the enhanced C-states C1E, C2E, C3E and C4E. The
enhanced states add one feature on top of the plain Cx states: they
reduce the CPU voltage before entering the Cx state (in fact, the voltage
reduction is implemented on top of EIST/T-states).
2.2 HLT, P_LVLx and MWait
---------------------------
How, then, do we enter a particular C-state? Intel provides three
methods.
2.2.1 HLT instruction
----------------------
As we know, Intel x86 has a HLT (halt) instruction. Since the 486DX4,
this instruction causes the CPU to enter the C1 or C1E state: if the BIOS
has enabled the C1E feature, the CPU enters C1E; otherwise it enters C1.
The BIOS enables C1E via an MSR. For example, on Intel Xeon 7000 the BIOS
can set bit 25 of IA32_MISC_ENABLE_MSR (MSR 0x1A0).
Note that HLT can be used for C1 entry only. That is, you cannot make the
CPU enter C2 or above with HLT.
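The C1E check described above can be sketched in C as follows. This is
only an illustration: it assumes the raw 64-bit MSR value has already
been read by some other means (reading an MSR needs ring 0), the names
c1e_enabled and MISC_ENABLE_C1E_BIT are made up for this sketch, and the
bit position is simply the one quoted above for Xeon 7000.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position quoted in the text for Intel Xeon 7000 (MSR 0x1A0).
 * Other CPU models may differ; treat this as an assumption. */
#define MISC_ENABLE_C1E_BIT 25

/* Hypothetical helper: returns true if the C1E bit is set in a raw
 * IA32_MISC_ENABLE value that the caller has already read. */
bool c1e_enabled(uint64_t misc_enable)
{
    return ((misc_enable >> MISC_ENABLE_C1E_BIT) & 1u) != 0;
}
```

With such a helper, c1e_enabled(value) tells whether HLT would land in
C1E rather than C1 on that model.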
2.2.2 P_LVLx I/O registers
----------------------------
Intel also defines the P_LVLx I/O registers (x is 2 to 5). An I/O read
from a P_LVLx register causes the CPU to enter a C-state. Generally,
P_LVL2 means C2, but which state the other registers map to depends on
the CPU model: P_LVL3 enters C6 on Core i7, while it enters C3 on Core 2
Duo.
2.2.3 Monitor/MWait instruction
--------------------------------
Besides the HLT instruction and the P_LVLx registers, Intel provides
another way to put the CPU into a C-state: MWait. This instruction is
used together with Monitor. Normally, we use the Monitor instruction to
watch a range of memory, and then use MWait with some hints to make the
CPU enter a Cx state.
Without these instructions, when a CPU is in a sleeping state and other
CPUs want to wake it up, the only way is to send an IPI. However, an IPI
is an expensive operation that takes much more time than Monitor/MWait.
With the Monitor/MWait pair, other CPUs can wake up a sleeping CPU by
modifying the memory watched (monitored) by the sleeping CPU.
3. Re-activate CPU
-----------------------------
When a CPU runs the swapper process, there may be processes sitting in
various wait queues of this CPU. Once their wait conditions change, those
processes become runnable again. Because they have already been assigned
to this CPU, the CPU must, before sleeping, make sure it can run the
waiting processes in the near future.
What conditions can a process wait for, then? Time and/or interrupts. A
process can wait on a timer, on an interrupt, or on some event that is
triggered during interrupt handling.
An Intel CPU returns to C0 from a sleeping state as soon as it receives
an interrupt, and timers are implemented via a hardware timer interrupt.
So the processes in the wait queues get executed once they become
runnable (we skip the tickless kernel and the LAPIC timer stopping in C3
for the time being).
In addition, other CPUs can assign jobs to an idle CPU and wake it up via
an interrupt or via the method provided by Monitor/MWait.
4. ACPI & C-State
-------------------
ACPI defines two methods (control interfaces) to control CPU C-states,
and the ACPI specification defines 3 C-states. Note that ACPI C-states
are not the same as Intel CPU C-states. For example, we can map Intel
C1/C1E to ACPI C1, Intel C2/C2E to ACPI C2, and Intel C3, C4, C5 and C6
to ACPI C3.
4.1. P_LVLx registers in P_BLK
-------------------------------
In the DSDT table, each processor can optionally have a P_BLK register
block. For example,

    Processor (
        \_PR.CPU0,    // Namespace name
        1,
        0x120,        // P_BLK system I/O address
        6             // size of P_BLK
    )
    {...}

    P_LVL2: P_BLK + 4, 1 byte, system I/O space;
    P_LVL3: P_BLK + 5, 1 byte, system I/O space;

Reading P_LVL2 causes the CPU to enter the C2 state; reading P_LVL3
causes the CPU to enter the C3 state.
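The register layout above boils down to simple address arithmetic. The
sketch below only computes the port numbers; the helper names are
hypothetical, and the actual one-byte port read that triggers the
C-state entry needs I/O privilege and is deliberately left out.

```c
#include <stdint.h>

/* Offsets of the C-state registers inside P_BLK, as listed above. */
#define P_LVL2_OFFSET 4u
#define P_LVL3_OFFSET 5u

/* Hypothetical helpers: compute the system I/O port of P_LVL2/P_LVL3
 * from the P_BLK base address declared in the Processor object. */
uint32_t p_lvl2_port(uint32_t p_blk)
{
    return p_blk + P_LVL2_OFFSET;
}

uint32_t p_lvl3_port(uint32_t p_blk)
{
    return p_blk + P_LVL3_OFFSET;
}
```

For the Processor object shown above (P_BLK at 0x120), this gives 0x124
for P_LVL2 and 0x125 for P_LVL3.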
In the FADT table, two fields give the C2 and C3 entry/exit latency
respectively:

    FADT.P_LVL2_LAT    The worst-case hardware latency to enter/exit a
                       C2 state. A value > 100 indicates the system does
                       not support a C2 state.
    FADT.P_LVL3_LAT    The worst-case hardware latency to enter/exit a
                       C3 state. A value > 1000 indicates the system does
                       not support a C3 state.
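The two "not supported" thresholds can be checked as in the following
sketch. The function and macro names are assumptions made for
illustration, and the latency values are taken to have been read from
the FADT beforehand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds quoted above: a larger value means "not supported". */
#define P_LVL2_LAT_MAX 100u
#define P_LVL3_LAT_MAX 1000u

/* Hypothetical helper: true if the FADT advertises a usable C2. */
bool fadt_c2_supported(uint16_t p_lvl2_lat)
{
    return p_lvl2_lat <= P_LVL2_LAT_MAX;
}

/* Hypothetical helper: true if the FADT advertises a usable C3. */
bool fadt_c3_supported(uint16_t p_lvl3_lat)
{
    return p_lvl3_lat <= P_LVL3_LAT_MAX;
}
```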
Based on the entry/exit latency, the OS can select which C-state to enter
when a CPU is idle. The OS should select the deepest sleeping state
possible, so as to save the most power. In fact, the hardware entry/exit
latency is only used as a reference point, and the OS adjusts the
entry/exit latency of each C-state at runtime.
When a CPU is idle, the OS looks at the nearest pending timer, compares
the time remaining until it expires with the C-state latencies, and
selects one of the C-states to enter.
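The selection rule just described might be sketched like this. It is a
simplified illustration, not the kernel's algorithm: the struct and the
function name are hypothetical, the table is assumed to be sorted from
the shallowest to the deepest state, and real governors take more inputs
into account than the next timer alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one C-state (not the kernel's struct). */
struct cstate {
    const char *name;
    uint32_t latency;    /* entry/exit latency, same unit as the timer */
};

/* Pick the deepest state whose latency fits into the time left until
 * the next timer. Index 0 is always allowed as the fallback. */
size_t select_cstate(const struct cstate *states, size_t n,
                     uint32_t time_to_next_timer)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++) {
        if (states[i].latency <= time_to_next_timer)
            best = i;
    }
    return best;
}
```

With latencies of 1, 40 and 100 for C1, C2 and C3, a timer due in 50
units would select C2, while a timer far in the future would select C3.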
4.2. _CST & _CSD ACPI objects
-----------------------------
4.2.1 _PDC
----------
The OS uses _PDC to inform the platform of the level of CPU power
management support provided by the OS.
Note that the OS must use the _PDC/_OSC method to inform the platform of
the level of power management it can handle. Based on this information,
the ACPI firmware can return different values (packages) for _CST and
_CSD.
4.2.2 _CST
----------
With _CST, the platform declares the supported C-states. ACPI can define
a _CST object for a processor like this:

    Name (_CST, Package() {Count, CState, …, CState})

where

    CState: Package (Register, Type, Latency, Power)
For example,

    Processor (\_PR.CPU0, 1, 0x120, 6)
    {
        ...
        Name (_CST, Package()
        {
            4,    // the number of supported C-states
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1,  20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2,  40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3,  60,  500},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x163)}, 3, 100,  250}
        })
        ...
    }
In this example, CPU0 has 4 C-states: C1, C2 and two C3 states with
different latency and average power consumption.

    C1: FFixedHW, which means using the "halt" or "mwait" instruction to
        enter C1;
    C2: SystemIO, 8-bit size, so a byte read from I/O address 0x161
        enters C2;

If a Cx state uses FFixedHW, we check whether the CPU supports the mwait
instruction. Executing CPUID with EAX = 0x05 returns, in the EDX
register, which C-states are supported by the mwait instruction
(including the number of sub-states of each C-state).
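Decoding that EDX value might look like the sketch below. It assumes the
usual layout of CPUID leaf 0x05, where EDX holds one 4-bit field per
C-state (bits 3:0 for C0, bits 7:4 for C1, and so on), and it assumes
the caller has already executed CPUID. The helper name is hypothetical.

```c
#include <stdint.h>

/* One 4-bit field per C-state in EDX of CPUID leaf 0x05. */
#define MWAIT_SUBSTATE_BITS 4u
#define MWAIT_SUBSTATE_MASK 0xFu

/* Hypothetical helper: number of MWait sub-states reported for
 * C<cstate>. A result of 0 means MWait cannot enter that C-state. */
unsigned int mwait_substates(uint32_t edx, unsigned int cstate)
{
    if (cstate > 7)    /* only eight 4-bit fields fit into EDX */
        return 0;
    return (edx >> (cstate * MWAIT_SUBSTATE_BITS)) & MWAIT_SUBSTATE_MASK;
}
```

For an illustrative EDX value of 0x00001120, this reports two sub-states
for C1, one each for C2 and C3, and none for C4.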
4.2.3 _CSD
------------
With _CSD, the platform provides C-state control cross-logical-processor
dependency information to the OS:

    CSDPackage: Package (CStateDep, …, CStateDep)

where

    CStateDep: Package (NumberOfEntries, Revision, Domain, CoordType,
                        NumProcessors, Index)
For example,

    Processor (\_SB.CPU0, 1, 0x120, 6)
    {
        Name (_CST, Package()
        {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60,  500}
        })
        Name (_CSD, Package()
        {
            Package(){6, 0, 0, 0xFD, 2, 1},
                // 6 entries, Revision 0, Domain 0, OSPM Coordinate
                // Initiate on Any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 2}
                // 6 entries, Revision 0, Domain 0, OSPM Coordinate
                // Initiate on Any Proc, 2 Procs, Index 2 (C3-type)
        })
    }
    Processor (\_SB.CPU1, 2, 0x130, 6)
    {
        Name (_CST, Package()
        {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60,  500}
        })
        Name (_CSD, Package()
        {
            Package(){6, 0, 0, 0xFD, 2, 1},
                // 6 entries (fields in this package), Revision 0,
                // Domain 0, OSPM Coordinate
                // Initiate on any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 2}
                // 6 entries, Revision 0, Domain 0, OSPM Coordinate
                // Initiate on any Proc, 2 Procs, Index 2 (C3-type)
        })
    }
The following is quoted from the ACPI spec:

    OSPM can coordinate the transitions between logical processors,
    choosing to initiate the transition when doing so does not lead to
    incorrect or non-optimal system behavior. This OSPM coordination is
    referred to as Software Coordination. Alternately, it might be
    possible for the underlying hardware to coordinate the state
    transition requests on multiple logical processors, causing the
    processors to transition to the target state when the transition is
    guaranteed to not lead to incorrect or non-optimal system behavior.
    This scenario is referred to as Hardware (HW) coordination.
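To make the CoordType field more concrete, here is a small sketch that
classifies a _CSD entry as software or hardware coordinated. The struct
and function names are made up for illustration; the numeric values
(0xFC, 0xFD, 0xFE) are the coordination types defined by the ACPI
specification, of which only 0xFD appears in the example above.

```c
#include <stdbool.h>
#include <stdint.h>

/* One _CSD dependency entry, mirroring the six fields shown above. */
struct csd_entry {
    uint32_t num_entries;
    uint32_t revision;
    uint32_t domain;
    uint32_t coord_type;
    uint32_t num_processors;
    uint32_t index;
};

/* Coordination type values from the ACPI specification. */
enum {
    CSD_SW_ALL = 0xFC,    /* OSPM coordinates, initiate on all procs */
    CSD_SW_ANY = 0xFD,    /* OSPM coordinates, initiate on any proc  */
    CSD_HW_ALL = 0xFE     /* hardware coordinates                    */
};

/* Hypothetical helper: true if OSPM (software) must coordinate. */
bool csd_is_sw_coordinated(const struct csd_entry *e)
{
    return e->coord_type == CSD_SW_ALL || e->coord_type == CSD_SW_ANY;
}
```

Both entries in the example above use 0xFD, so they would be reported as
software coordinated.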
5. Linux C-State Related Code
--------------------------
Linux has a global function pointer, pm_idle. If nobody changes it, it is
set to default_idle(). The routine default_idle() just executes the HLT
instruction to put the CPU into the halt state. If the CPU supports
C-states, this causes the CPU to enter C1, or C1E if the BIOS has enabled
the C1E feature.
In fact, many modules try to make pm_idle point to their own routine. For
example,

    APM                       apm_cpu_idle()         // legacy APM power management
    cpuidle                   cpuidle_idle_call()
    AMD CPU                   c1e_idle()             // AMD C1E acts like Intel C3
    CPU supporting MWait      mwait_idle()           // C1 only
    idle=poll kernel param    poll_idle()            // no-op, no power saving
    idle=halt kernel param    default_idle()
    ...
The priority of the swapper process is very low: it executes only when
there is no other runnable process, and any runnable process can preempt
it. In an endless loop, the swapper process executes cpu_idle(), like
this:
    void cpu_idle(void)
    {
        ...
        while (1) {
            while (!need_resched()) {    <---- no runnable process yet
                local_irq_disable();
                pm_idle();
            }
            ...
            schedule();    <------- select a new process to execute
            ...
        }
    }
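The pm_idle hook can be imitated in userland with a plain function
pointer. The sketch below is not kernel code: the routines are stand-ins
that only record which one ran, and the names idle_once and last_idle
are made up for illustration.

```c
/* Which stand-in idle routine ran last (illustration only). */
enum idle_kind { IDLE_NONE, IDLE_DEFAULT, IDLE_MWAIT };
enum idle_kind last_idle = IDLE_NONE;

/* Userland stand-ins for the kernel routines named above; they do
 * not touch the hardware. */
void default_idle(void) { last_idle = IDLE_DEFAULT; }
void mwait_idle(void)   { last_idle = IDLE_MWAIT; }

/* Global hook, defaulting to default_idle() as described above. */
void (*pm_idle)(void) = default_idle;

/* One pass of the inner idle loop: call whatever pm_idle points to. */
void idle_once(void)
{
    pm_idle();
}
```

A module that wants to take over idling just assigns its own routine,
for example pm_idle = mwait_idle, and the loop picks it up on its next
pass.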
5.1 Architecture Overview
--------------------------
The Linux modules/drivers related to CPU C-states are organized as
follows:
                                 ---------------
                                 |    sysfs    |
                                 ---------------
                                        |
        ----------   --------           |
        | ladder |   | menu |           |
        ----------   --------           |
             |          |               |
        ----------------------------------------
        |        cpuidle infrastructure        |
        ----------------------------------------
                           |
                           |
                -----------------------
                | acpi-cpuidle driver |
                -----------------------
                           |
                           |
             -----------------------------
             | ACPI processor bus driver |
             -----------------------------
5.1.1 Driver Register
-----------------------
In acpi_processor_init(), which is a module initialization routine called
by do_initcalls(), two related drivers are registered: the ACPI processor
bus driver and acpi_idle_driver. If you really want to look into it,
follow this call path:

    kernel_init()
      ==> do_basic_setup()
        ==> do_initcalls()
          ==> ... acpi_processor_init();
            ==> cpuidle_register_driver(&acpi_idle_driver);
                acpi_bus_register_driver(&acpi_processor_driver);

The drivers are registered in drivers/acpi/processor_core.c.
Notes:

    a) The cpuidle infrastructure is NOT a driver, and it is initialized
       by core_initcall(). It provides:
         I)   a sysfs interface through which userland apps/users can
              check/switch the cpuidle governor:
              /sys/devices/system/cpu/(cpuX)/cpuidle/
         II)  interfaces for registering governors;
         III) interfaces for cpuidle devices and the cpuidle driver;
         IV)  the setting of the global pm_idle pointer to
              cpuidle_idle_call();
    b) acpi_idle_driver is registered with the cpuidle infrastructure,
       while acpi_processor_driver is registered with the ACPI subsystem
       as an ACPI bus driver;
    c) The cpuidle infrastructure allows only one driver to register; it
       uses a global pointer to the registered acpi_idle_driver. Refer to
       cpuidle_register_driver(), provided by the cpuidle infrastructure
       in drivers/cpuidle/driver.c;
    d) The ACPI processor driver registers a callback for CPU hotplug, so
       it gets notified when a CPU goes online/offline.
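The "only one driver" rule from note c) can be illustrated with a single
global slot. This is a userland sketch with hypothetical names and a
trimmed-down driver struct; the error codes are an assumption and the
real cpuidle_register_driver() may behave differently in detail.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down driver descriptor. */
struct idle_driver {
    const char *name;
};

/* The single global slot: only one driver may be registered. */
static struct idle_driver *curr_driver = NULL;

/* Returns 0 on success, -EINVAL for a NULL driver, and -EBUSY if
 * another driver already occupies the slot. */
int register_idle_driver(struct idle_driver *drv)
{
    if (drv == NULL)
        return -EINVAL;
    if (curr_driver != NULL)
        return -EBUSY;
    curr_driver = drv;
    return 0;
}

/* Frees the slot, but only if drv is the registered driver. */
void unregister_idle_driver(struct idle_driver *drv)
{
    if (curr_driver == drv)
        curr_driver = NULL;
}
```

A second registration attempt fails until the first driver has been
unregistered.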
5.1.2 Device Discovery & Register
---------------------------------
The ACPI subsystem parses the ACPI tables, and for each ACPI processor
object it calls the ACPI processor bus driver's add entry point,
acpi_processor_add(), to add an ACPI processor device.
After adding an ACPI processor device, the ACPI subsystem calls the
processor driver's start entry point, acpi_processor_start().
In acpi_processor_start(), the routine acpi_processor_power_init() is
called to evaluate _PDC, to read and parse _CST and _CSD (or use
FADT/MADT info) in order to initialize the processor's power state
information, and then to call cpuidle_register_device() to register a
cpuidle device with the cpuidle infrastructure.
For hotplug CPUs, acpi_processor_install_hotplug_notify() is called
during acpi_processor_init() to register a CPU hotplug callback. When a
CPU comes online, acpi_processor_start() gets executed.
Please note that besides the cpuidle driver, there are other
processor-related drivers that all operate on the same physical CPUs,
such as the T-state driver, the P-state driver and the CPU hotplug
infrastructure. The ACPI processor driver acts as a bridge/coordinator
among those drivers.
5.1.3 Driver/Device attach
-----------------------
The ACPI subsystem registers the processors with acpi_processor_driver.
If/when a registered CPU is online, the start entry point,
acpi_processor_start(), is called. This entry function does a lot of
initialization work for T-states, P-states and C-states. Here we just
look at C-states, for which it calls

    acpi_processor_power_init();
      ==> acpi_processor_get_power_info();
      ==> acpi_processor_setup_cpuidle();

The first routine called evaluates _CST, or reads the FADT if _CST
fails, to get the C-state description from the ACPI tables. Refer to
sections 4.1/4.2 to see how the C-state information is handled.
The second one sets up some information for each valid C-state. Note that
in most cases (without special kernel parameters, bus master
considerations, etc.),

    C1, state->enter = acpi_idle_enter_c1;
    C2, state->enter = acpi_idle_enter_simple;
    C3, state->enter = acpi_idle_enter_bm;

The enter routine is used to enter the corresponding C-state.
5.1.4 Governor
-----------------
The cpuidle governors are simple to read and understand. A governor
provides a rating and 3 main callbacks to the cpuidle infrastructure:

    rating       <-- menu is 20, ladder is 10
    enable()
    select()
    reflect()

Each governor has a rating in its structure. When governors are
registered with the cpuidle infrastructure through
cpuidle_register_governor(), cpuidle selects the one with the highest
rating unless the user has specified one via the sysfs interface. The
cpuidle_curr_governor pointer points to the selected one.
Only one governor can be used at a time. When the OS decides to put a CPU
into a C-state, it calls the select entry point of the current governor,
and the governor chooses a C-state according to its policy:

    cpuidle_idle_call()
    {
        next_state = cpuidle_curr_governor->select();
        target_state = &dev->states[next_state];
        dev->last_state = target_state;
        dev->last_residency = target_state->enter(dev, target_state);
        cpuidle_curr_governor->reflect();
    }
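The rating-based choice between governors can be sketched as a simple
maximum search. The struct below keeps only a name and a rating and is
not the kernel's governor structure; pick_governor is a hypothetical
name, and the sysfs override mentioned above is not modelled.

```c
#include <stddef.h>

/* Hypothetical, reduced governor descriptor: name and rating only. */
struct governor {
    const char *name;
    unsigned int rating;
};

/* Return the governor with the highest rating, or NULL if the list
 * is empty. On a tie, the earlier entry wins. */
const struct governor *pick_governor(const struct governor *govs, size_t n)
{
    const struct governor *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (best == NULL || govs[i].rating > best->rating)
            best = &govs[i];
    }
    return best;
}
```

With ladder rated 10 and menu rated 20, as quoted above, the search
returns menu.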
6. Linux Files related to C-States
----------------------------------
    drivers/acpi/processor_core.c
    drivers/acpi/processor_idle.c
    drivers/cpuidle/cpuidle.c
    drivers/cpuidle/driver.c
    drivers/cpuidle/governor.c
    drivers/cpuidle/sysfs.c
    drivers/cpuidle/governors/ladder.c
    drivers/cpuidle/governors/menu.c
7. Some Kernel Parameters
-------------------------------
    idle=poll            polling, always in C0, almost no power saving;
    idle=halt            use the HLT instruction only, only enter C1;
    idle=nomwait         don't use mwait, the P_LVLx method is used;
    idle=mwait           force the OS to use mwait for C-states;
    max_cstate=n         specify the max available C-state, n is a number

Others (which may help locate the issue when C-states don't work):

    nohz=off             don't use dynamic tick/tickless mode
    nolapic_timer        don't use the local APIC timer
    lapic_timer_c2_ok    the local APIC timer is OK in C2
    clocksource=tsc      (or hpet, pit, acpi_pm, jiffies) override the
                         clock source
8. Sysfs & Proc
-----------------
Check C-state statistics & state:

    /proc/acpi/processor/CPUX/

Check governor & driver:

    /sys/devices/system/cpu/cpuidle/         (system-wide)
    /sys/devices/system/cpu/cpuX/cpuidle/    (per CPU)
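Entries in these directories are small text files, so they can be read
with ordinary file I/O. The helper below is a generic sketch with a
hypothetical name; which files exist depends on the kernel version, so
the caller is expected to handle a failed read.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a text file into buf (newline stripped).
 * Returns 0 on success, -1 if the buffer is unusable, the file
 * cannot be opened, or nothing can be read from it. */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (buf == NULL || len < 2)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

On a kernel that exposes it, passing
"/sys/devices/system/cpu/cpuidle/current_driver" as the path would
return the name of the registered cpuidle driver.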
9. TBD
-----------
9.1 Broadcast Timer
------------------
When CPUs enter a deep C-state (C3 or above), their Local APIC timers
stop as well (Linux uses the LAPIC timer as the tick device in most
cases). This issue is handled by the "broadcast timer" scheme.
9.2 Dynamic Tick /Tickless
--------------------------
Linux supports tickless operation, which makes the C-state code more
complex.
9.3 Idle Load balancing
-----------------------
When CPUs enter the idle state, one of the idle CPUs is nominated as the
ILB (Idle Load Balancer). It is responsible for pulling tasks from busy
CPUs, re-assigning them to idle CPUs and having those idle CPUs start up.