build/patch/kernel/archive/sun50iw9-4.9/patch-4.9.175-176.patch

3348 lines
123 KiB
Diff

diff --git a/Documentation/hw-vuln/index.rst b/Documentation/hw-vuln/index.rst
new file mode 100644
index 000000000000..ffc064c1ec68
--- /dev/null
+++ b/Documentation/hw-vuln/index.rst
@@ -0,0 +1,13 @@
+========================
+Hardware vulnerabilities
+========================
+
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+ :maxdepth: 1
+
+ l1tf
+ mds
diff --git a/Documentation/hw-vuln/l1tf.rst b/Documentation/hw-vuln/l1tf.rst
new file mode 100644
index 000000000000..31653a9f0e1b
--- /dev/null
+++ b/Documentation/hw-vuln/l1tf.rst
@@ -0,0 +1,615 @@
+L1TF - L1 Terminal Fault
+========================
+
+L1 Terminal Fault is a hardware vulnerability which allows unprivileged
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual address, which is used
+for the access, has the Present bit cleared or other reserved bits set.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+ - Processors from AMD, Centaur and other non Intel vendors
+
+ - Older processor models, where the CPU family is < 6
+
+ - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
+ Penwell, Pineview, Silvermont, Airmont, Merrifield)
+
+ - The Intel XEON PHI family
+
+ - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
+ by the Meltdown vulnerability either. These CPUs should become
+ available by end of 2018.
+
+Whether a processor is affected or not can be read out from the L1TF
+vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the L1TF vulnerability:
+
+ ============= ================= ==============================
+ CVE-2018-3615 L1 Terminal Fault SGX related aspects
+ CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
+ CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
+ ============= ================= ==============================
+
+Problem
+-------
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the Present bit cleared or other reserved bits set,
+then speculative execution ignores the invalid PTE and loads the referenced
+data if it is present in the Level 1 Data Cache, as if the page referenced
+by the address bits in the PTE was still present and accessible.
+
+While this is a purely speculative mechanism and the instruction will raise
+a page fault when it is retired eventually, the pure act of loading the
+data and making it available to other speculative instructions opens up the
+opportunity for side channel attacks to unprivileged malicious code,
+similar to the Meltdown attack.
+
+While Meltdown breaks the user space to kernel space protection, L1TF
+allows to attack any physical memory address in the system and the attack
+works across all protection domains. It allows an attack of SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+
+Attack scenarios
+----------------
+
+1. Malicious user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+ Operating Systems store arbitrary information in the address bits of a
+ PTE which is marked non present. This allows a malicious user space
+ application to attack the physical memory to which these PTEs resolve.
+ In some cases user-space can maliciously influence the information
+ encoded in the address bits of the PTE, thus making attacks more
+ deterministic and more practical.
+
+ The Linux kernel contains a mitigation for this attack vector, PTE
+ inversion, which is permanently enabled and has no performance
+ impact. The kernel ensures that the address bits of PTEs, which are not
+ marked present, never point to cacheable physical memory space.
+
+ A system with an up to date kernel is protected against attacks from
+ malicious user space applications.
+
+2. Malicious guest in a virtual machine
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The fact that L1TF breaks all domain protections allows malicious guest
+ OSes, which can control the PTEs directly, and malicious guest user
+ space applications, which run on an unprotected guest kernel lacking the
+ PTE inversion mitigation for L1TF, to attack physical host memory.
+
+ A special aspect of L1TF in the context of virtualization is symmetric
+ multi threading (SMT). The Intel implementation of SMT is called
+ HyperThreading. The fact that Hyperthreads on the affected processors
+ share the L1 Data Cache (L1D) is important for this. As the flaw allows
+ only to attack data which is present in L1D, a malicious guest running
+ on one Hyperthread can attack the data which is brought into the L1D by
+ the context which runs on the sibling Hyperthread of the same physical
+ core. This context can be host OS, host user space or a different guest.
+
+ If the processor does not support Extended Page Tables, the attack is
+ only possible, when the hypervisor does not sanitize the content of the
+ effective (shadow) page tables.
+
+ While solutions exist to mitigate these attack vectors fully, these
+ mitigations are not enabled by default in the Linux kernel because they
+ can affect performance significantly. The kernel provides several
+ mechanisms which can be utilized to address the problem depending on the
+ deployment scenario. The mitigations, their protection scope and impact
+ are described in the next sections.
+
+ The default mitigations and the rationale for choosing them are explained
+ at the end of this document. See :ref:`default_mitigations`.
+
+.. _l1tf_sys_info:
+
+L1TF system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current L1TF
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/l1tf
+
+The possible values in this file are:
+
+ =========================== ===============================
+ 'Not affected' The processor is not vulnerable
+ 'Mitigation: PTE Inversion' The host protection is active
+ =========================== ===============================
+
+If KVM/VMX is enabled and the processor is vulnerable then the following
+information is appended to the 'Mitigation: PTE Inversion' part:
+
+ - SMT status:
+
+ ===================== ================
+ 'VMX: SMT vulnerable' SMT is enabled
+ 'VMX: SMT disabled' SMT is disabled
+ ===================== ================
+
+ - L1D Flush mode:
+
+ ================================ ====================================
+ 'L1D vulnerable' L1D flushing is disabled
+
+ 'L1D conditional cache flushes' L1D flush is conditionally enabled
+
+ 'L1D cache flushes' L1D flush is unconditionally enabled
+ ================================ ====================================
+
+The resulting grade of protection is discussed in the following sections.
+
+
+Host mitigation mechanism
+-------------------------
+
+The kernel is unconditionally protected against L1TF attacks from malicious
+user space running on the host.
+
+
+Guest mitigation mechanisms
+---------------------------
+
+.. _l1d_flush:
+
+1. L1D flush on VMENTER
+^^^^^^^^^^^^^^^^^^^^^^^
+
+ To make sure that a guest cannot attack data which is present in the L1D
+ the hypervisor flushes the L1D before entering the guest.
+
+ Flushing the L1D evicts not only the data which should not be accessed
+ by a potentially malicious guest, it also flushes the guest
+ data. Flushing the L1D has a performance impact as the processor has to
+ bring the flushed guest data back into the L1D. Depending on the
+ frequency of VMEXIT/VMENTER and the type of computations in the guest
+ performance degradation in the range of 1% to 50% has been observed. For
+ scenarios where guest VMEXIT/VMENTER are rare the performance impact is
+ minimal. Virtio and mechanisms like posted interrupts are designed to
+ confine the VMEXITs to a bare minimum, but specific configurations and
+ application scenarios might still suffer from a high VMEXIT rate.
+
+ The kernel provides two L1D flush modes:
+ - conditional ('cond')
+ - unconditional ('always')
+
+ The conditional mode avoids L1D flushing after VMEXITs which execute
+ only audited code paths before the corresponding VMENTER. These code
+ paths have been verified that they cannot expose secrets or other
+ interesting data to an attacker, but they can leak information about the
+ address space layout of the hypervisor.
+
+ Unconditional mode flushes L1D on all VMENTER invocations and provides
+ maximum protection. It has a higher overhead than the conditional
+ mode. The overhead cannot be quantified correctly as it depends on the
+ workload scenario and the resulting number of VMEXITs.
+
+ The general recommendation is to enable L1D flush on VMENTER. The kernel
+ defaults to conditional mode on affected processors.
+
+ **Note**, that L1D flush does not prevent the SMT problem because the
+ sibling thread will also bring back its data into the L1D which makes it
+ attackable again.
+
+ L1D flush can be controlled by the administrator via the kernel command
+ line and sysfs control files. See :ref:`mitigation_control_command_line`
+ and :ref:`mitigation_control_kvm`.
+
+.. _guest_confinement:
+
+2. Guest VCPU confinement to dedicated physical cores
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ To address the SMT problem, it is possible to make a guest or a group of
+ guests affine to one or more physical cores. The proper mechanism for
+ that is to utilize exclusive cpusets to ensure that no other guest or
+ host tasks can run on these cores.
+
+ If only a single guest or related guests run on sibling SMT threads on
+ the same physical core then they can only attack their own memory and
+ restricted parts of the host memory.
+
+ Host memory is attackable, when one of the sibling SMT threads runs in
+ host OS (hypervisor) context and the other in guest context. The amount
+ of valuable information from the host OS context depends on the context
+ which the host OS executes, i.e. interrupts, soft interrupts and kernel
+ threads. The amount of valuable data from these contexts cannot be
+ declared as non-interesting for an attacker without deep inspection of
+ the code.
+
+ **Note**, that assigning guests to a fixed set of physical cores affects
+ the ability of the scheduler to do load balancing and might have
+ negative effects on CPU utilization depending on the hosting
+ scenario. Disabling SMT might be a viable alternative for particular
+ scenarios.
+
+ For further information about confining guests to a single or to a group
+ of cores consult the cpusets documentation:
+
+ https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+
+.. _interrupt_isolation:
+
+3. Interrupt affinity
+^^^^^^^^^^^^^^^^^^^^^
+
+ Interrupts can be made affine to logical CPUs. This is not universally
+ true because there are types of interrupts which are truly per CPU
+ interrupts, e.g. the local timer interrupt. Aside of that multi queue
+ devices affine their interrupts to single CPUs or groups of CPUs per
+ queue without allowing the administrator to control the affinities.
+
+ Moving the interrupts, which can be affinity controlled, away from CPUs
+ which run untrusted guests, reduces the attack vector space.
+
+ Whether the interrupts with are affine to CPUs, which run untrusted
+ guests, provide interesting data for an attacker depends on the system
+ configuration and the scenarios which run on the system. While for some
+ of the interrupts it can be assumed that they won't expose interesting
+ information beyond exposing hints about the host OS memory layout, there
+ is no way to make general assumptions.
+
+ Interrupt affinity can be controlled by the administrator via the
+ /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
+ available at:
+
+ https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
+
+.. _smt_control:
+
+4. SMT control
+^^^^^^^^^^^^^^
+
+ To prevent the SMT issues of L1TF it might be necessary to disable SMT
+ completely. Disabling SMT can have a significant performance impact, but
+ the impact depends on the hosting scenario and the type of workloads.
+ The impact of disabling SMT needs also to be weighted against the impact
+ of other mitigation solutions like confining guests to dedicated cores.
+
+ The kernel provides a sysfs interface to retrieve the status of SMT and
+ to control it. It also provides a kernel command line interface to
+ control SMT.
+
+ The kernel command line interface consists of the following options:
+
+ =========== ==========================================================
+ nosmt Affects the bring up of the secondary CPUs during boot. The
+ kernel tries to bring all present CPUs online during the
+ boot process. "nosmt" makes sure that from each physical
+ core only one - the so called primary (hyper) thread is
+ activated. Due to a design flaw of Intel processors related
+ to Machine Check Exceptions the non primary siblings have
+ to be brought up at least partially and are then shut down
+ again. "nosmt" can be undone via the sysfs interface.
+
+ nosmt=force Has the same effect as "nosmt" but it does not allow to
+ undo the SMT disable via the sysfs interface.
+ =========== ==========================================================
+
+ The sysfs interface provides two files:
+
+ - /sys/devices/system/cpu/smt/control
+ - /sys/devices/system/cpu/smt/active
+
+ /sys/devices/system/cpu/smt/control:
+
+ This file allows to read out the SMT control state and provides the
+ ability to disable or (re)enable SMT. The possible states are:
+
+ ============== ===================================================
+ on SMT is supported by the CPU and enabled. All
+ logical CPUs can be onlined and offlined without
+ restrictions.
+
+ off SMT is supported by the CPU and disabled. Only
+ the so called primary SMT threads can be onlined
+ and offlined without restrictions. An attempt to
+ online a non-primary sibling is rejected
+
+ forceoff Same as 'off' but the state cannot be controlled.
+ Attempts to write to the control file are rejected.
+
+ notsupported The processor does not support SMT. It's therefore
+ not affected by the SMT implications of L1TF.
+ Attempts to write to the control file are rejected.
+ ============== ===================================================
+
+ The possible states which can be written into this file to control SMT
+ state are:
+
+ - on
+ - off
+ - forceoff
+
+ /sys/devices/system/cpu/smt/active:
+
+ This file reports whether SMT is enabled and active, i.e. if on any
+ physical core two or more sibling threads are online.
+
+ SMT control is also possible at boot time via the l1tf kernel command
+ line parameter in combination with L1D flush control. See
+ :ref:`mitigation_control_command_line`.
+
+5. Disabling EPT
+^^^^^^^^^^^^^^^^
+
+ Disabling EPT for virtual machines provides full mitigation for L1TF even
+ with SMT enabled, because the effective page tables for guests are
+ managed and sanitized by the hypervisor. Though disabling EPT has a
+ significant performance impact especially when the Meltdown mitigation
+ KPTI is enabled.
+
+ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+There is ongoing research and development for new mitigation mechanisms to
+address the performance impact of disabling SMT or EPT.
+
+.. _mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the L1TF mitigations at boot
+time with the option "l1tf=". The valid arguments for this option are:
+
+ ============ =============================================================
+ full Provides all available mitigations for the L1TF
+ vulnerability. Disables SMT and enables all mitigations in
+ the hypervisors, i.e. unconditional L1D flushing
+
+ SMT control and L1D flush control via the sysfs interface
+ is still possible after boot. Hypervisors will issue a
+ warning when the first VM is started in a potentially
+ insecure configuration, i.e. SMT enabled or L1D flush
+ disabled.
+
+ full,force Same as 'full', but disables SMT and L1D flush runtime
+ control. Implies the 'nosmt=force' command line option.
+ (i.e. sysfs control of SMT is disabled.)
+
+ flush Leaves SMT enabled and enables the default hypervisor
+ mitigation, i.e. conditional L1D flushing
+
+ SMT control and L1D flush control via the sysfs interface
+ is still possible after boot. Hypervisors will issue a
+ warning when the first VM is started in a potentially
+ insecure configuration, i.e. SMT enabled or L1D flush
+ disabled.
+
+ flush,nosmt Disables SMT and enables the default hypervisor mitigation,
+ i.e. conditional L1D flushing.
+
+ SMT control and L1D flush control via the sysfs interface
+ is still possible after boot. Hypervisors will issue a
+ warning when the first VM is started in a potentially
+ insecure configuration, i.e. SMT enabled or L1D flush
+ disabled.
+
+ flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
+ started in a potentially insecure configuration.
+
+ off Disables hypervisor mitigations and doesn't emit any
+ warnings.
+ It also drops the swap size and available RAM limit restrictions
+ on both hypervisor and bare metal.
+
+ ============ =============================================================
+
+The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
+
+
+.. _mitigation_control_kvm:
+
+Mitigation control for KVM - module parameter
+-------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism, flushing the L1D cache when
+entering a guest, can be controlled with a module parameter.
+
+The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
+following arguments:
+
+ ============ ==============================================================
+ always L1D cache flush on every VMENTER.
+
+ cond Flush L1D on VMENTER only when the code between VMEXIT and
+ VMENTER can leak host memory which is considered
+ interesting for an attacker. This still can leak host memory
+ which allows e.g. to determine the hosts address space layout.
+
+ never Disables the mitigation
+ ============ ==============================================================
+
+The parameter can be provided on the kernel command line, as a module
+parameter when loading the modules and at runtime modified via the sysfs
+file:
+
+/sys/module/kvm_intel/parameters/vmentry_l1d_flush
+
+The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
+line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
+module parameter is ignored and writes to the sysfs file are rejected.
+
+.. _mitigation_selection:
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The system is protected by the kernel unconditionally and no further
+ action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the guest comes from a trusted source and the guest OS kernel is
+ guaranteed to have the L1TF mitigations in place the system is fully
+ protected against L1TF and no further action is required.
+
+ To avoid the overhead of the default L1D flushing on VMENTER the
+ administrator can disable the flushing via the kernel command line and
+ sysfs control files. See :ref:`mitigation_control_command_line` and
+ :ref:`mitigation_control_kvm`.
+
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+3.1. SMT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+ If SMT is not supported by the processor or disabled in the BIOS or by
+ the kernel, it's only required to enforce L1D flushing on VMENTER.
+
+ Conditional L1D flushing is the default behaviour and can be tuned. See
+ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+3.2. EPT not supported or disabled
+""""""""""""""""""""""""""""""""""
+
+ If EPT is not supported by the processor or disabled in the hypervisor,
+ the system is fully protected. SMT can stay enabled and L1D flushing on
+ VMENTER is not required.
+
+ EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
+
+3.3. SMT and EPT supported and active
+"""""""""""""""""""""""""""""""""""""
+
+ If SMT and EPT are supported and active then various degrees of
+ mitigations can be employed:
+
+ - L1D flushing on VMENTER:
+
+ L1D flushing on VMENTER is the minimal protection requirement, but it
+ is only potent in combination with other mitigation methods.
+
+ Conditional L1D flushing is the default behaviour and can be tuned. See
+ :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
+
+ - Guest confinement:
+
+ Confinement of guests to a single or a group of physical cores which
+ are not running any other processes, can reduce the attack surface
+ significantly, but interrupts, soft interrupts and kernel threads can
+ still expose valuable data to a potential attacker. See
+ :ref:`guest_confinement`.
+
+ - Interrupt isolation:
+
+ Isolating the guest CPUs from interrupts can reduce the attack surface
+ further, but still allows a malicious guest to explore a limited amount
+ of host physical memory. This can at least be used to gain knowledge
+ about the host address space layout. The interrupts which have a fixed
+ affinity to the CPUs which run the untrusted guests can depending on
+ the scenario still trigger soft interrupts and schedule kernel threads
+ which might expose valuable information. See
+ :ref:`interrupt_isolation`.
+
+The above three mitigation methods combined can provide protection to a
+certain degree, but the risk of the remaining attack surface has to be
+carefully analyzed. For full protection the following methods are
+available:
+
+ - Disabling SMT:
+
+ Disabling SMT and enforcing the L1D flushing provides the maximum
+ amount of protection. This mitigation is not depending on any of the
+ above mitigation methods.
+
+ SMT control and L1D flushing can be tuned by the command line
+ parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
+ time with the matching sysfs control files. See :ref:`smt_control`,
+ :ref:`mitigation_control_command_line` and
+ :ref:`mitigation_control_kvm`.
+
+ - Disabling EPT:
+
+ Disabling EPT provides the maximum amount of protection as well. It is
+ not depending on any of the above mitigation methods. SMT can stay
+ enabled and L1D flushing is not required, but the performance impact is
+ significant.
+
+ EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
+ parameter.
+
+3.4. Nested virtual machines
+""""""""""""""""""""""""""""
+
+When nested virtualization is in use, three operating systems are involved:
+the bare metal hypervisor, the nested hypervisor and the nested virtual
+machine. VMENTER operations from the nested hypervisor into the nested
+guest will always be processed by the bare metal hypervisor. If KVM is the
+bare metal hypervisor it will:
+
+ - Flush the L1D cache on every switch from the nested hypervisor to the
+ nested virtual machine, so that the nested hypervisor's secrets are not
+ exposed to the nested virtual machine;
+
+ - Flush the L1D cache on every switch from the nested virtual machine to
+ the nested hypervisor; this is a complex operation, and flushing the L1D
+ cache avoids that the bare metal hypervisor's secrets are exposed to the
+ nested virtual machine;
+
+ - Instruct the nested hypervisor to not perform any L1D cache flush. This
+ is an optimization to avoid double L1D flushing.
+
+
+.. _default_mitigations:
+
+Default mitigations
+-------------------
+
+ The kernel default mitigations for vulnerable processors are:
+
+ - PTE inversion to protect against malicious user space. This is done
+ unconditionally and cannot be controlled. The swap storage is limited
+ to ~16TB.
+
+ - L1D conditional flushing on VMENTER when EPT is enabled for
+ a guest.
+
+ The kernel does not by default enforce the disabling of SMT, which leaves
+ SMT systems vulnerable when running untrusted guests with EPT enabled.
+
+ The rationale for this choice is:
+
+ - Force disabling SMT can break existing setups, especially with
+ unattended updates.
+
+ - If regular users run untrusted guests on their machine, then L1TF is
+ just an add on to other malware which might be embedded in an untrusted
+ guest, e.g. spam-bots or attacks on the local network.
+
+ There is no technical way to prevent a user from running untrusted code
+ on their machines blindly.
+
+ - It's technically extremely unlikely and from today's knowledge even
+ impossible that L1TF can be exploited via the most popular attack
+ mechanisms like JavaScript because these mechanisms have no way to
+ control PTEs. If this would be possible and not other mitigation would
+ be possible, then the default might be different.
+
+ - The administrators of cloud and hosting setups have to carefully
+ analyze the risk for their scenarios and make the appropriate
+ mitigation choices, which might even vary across their deployed
+ machines and also result in other changes of their overall setup.
+ There is no way for the kernel to provide a sensible default for this
+ kind of scenarios.
diff --git a/Documentation/hw-vuln/mds.rst b/Documentation/hw-vuln/mds.rst
new file mode 100644
index 000000000000..daf6fdac49a3
--- /dev/null
+++ b/Documentation/hw-vuln/mds.rst
@@ -0,0 +1,308 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a hardware vulnerability which allows
+unprivileged speculative access to data which is available in various CPU
+internal buffers.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+ - Processors from AMD, Centaur and other non Intel vendors
+
+ - Older processor models, where the CPU family is < 6
+
+ - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+ - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR.
+
+Whether a processor is affected or not can be read out from the MDS
+vulnerability file in sysfs. See :ref:`mds_sys_info`.
+
+Not all processors are affected by all variants of MDS, but the mitigation
+is identical for all of them so the kernel treats them as a single
+vulnerability.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the MDS vulnerability:
+
+ ============== ===== ===================================================
+ CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling
+ CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling
+ CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling
+ CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory
+ ============== ===== ===================================================
+
+Problem
+-------
+
+When performing store, load, L1 refill operations, processors write data
+into temporary microarchitectural structures (buffers). The data in the
+buffer can be forwarded to load operations as an optimization.
+
+Under certain conditions, usually a fault/assist caused by a load
+operation, data unrelated to the load memory address can be speculatively
+forwarded from the buffers. Because the load operation causes a fault or
+assist and its result will be discarded, the forwarded data will not cause
+incorrect program execution or state changes. But a malicious operation
+may be able to forward this speculative data to a disclosure gadget which
+allows in turn to infer the value via a cache side channel attack.
+
+Because the buffers are potentially shared between Hyper-Threads cross
+Hyper-Thread attacks are possible.
+
+Deeper technical information is available in the MDS specific x86
+architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the MDS vulnerabilities can be mounted from malicious non
+priviledged user space applications running on hosts or guest. Malicious
+guest OSes can obviously mount attacks as well.
+
+Contrary to other speculation based vulnerabilities the MDS vulnerability
+does not allow the attacker to control the memory target address. As a
+consequence the attacks are purely sampling based, but as demonstrated with
+the TLBleed attack samples can be postprocessed successfully.
+
+Web-Browsers
+^^^^^^^^^^^^
+
+ It's unclear whether attacks through Web-Browsers are possible at
+ all. The exploitation through Java-Script is considered very unlikely,
+ but other widely used web technologies like Webassembly could possibly be
+ abused.
+
+
+.. _mds_sys_info:
+
+MDS system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current MDS
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation enabled
+ * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+ - The processor is vulnerable but microcode is not updated.
+
+ The mitigation is enabled on a best effort basis. See :ref:`vmwerv`
+ * - 'Mitigation: Clear CPU buffers'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+
+If the processor is vulnerable then the following information is appended
+to the above information:
+
+ ======================== ============================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT mitigated' SMT is enabled and mitigated
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ============================================
+
+.. _vmwerv:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the processor is vulnerable, but the availability of the microcode based
+ mitigation mechanism is not advertised via CPUID the kernel selects a best
+ effort mitigation mode. This mode invokes the mitigation instructions
+ without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to expose
+ the CPUID to the guest. If the host has updated microcode the protection
+ takes effect otherwise a few cpu cycles are wasted pointlessly.
+
+ The state in the mds sysfs file reflects this situation accordingly.
+
+
+Mitigation mechanism
+-------------------------
+
+The kernel detects the affected CPUs and the presence of the microcode
+which is required.
+
+If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default. The mitigation can be controlled at boot
+time via a kernel command line option. See
+:ref:`mds_mitigation_control_command_line`.
+
+.. _cpu_buffer_clear:
+
+CPU buffer clearing
+^^^^^^^^^^^^^^^^^^^
+
+ The mitigation for MDS clears the affected CPU buffers on return to user
+ space and when entering a guest.
+
+ If SMT is enabled it also clears the buffers on idle entry when the CPU
+ is only affected by MSBDS and not any other MDS variant, because the
+ other variants cannot be protected against cross Hyper-Thread attacks.
+
+ For CPUs which are only affected by MSBDS the user space, guest and idle
+ transition mitigations are sufficient and SMT is not affected.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection for host to guest transition depends on the L1TF
+ vulnerability of the CPU:
+
+ - CPU is affected by L1TF:
+
+ If the L1D flush mitigation is enabled and up to date microcode is
+ available, the L1D flush mitigation is automatically protecting the
+ guest transition.
+
+ If the L1D flush mitigation is disabled then the MDS mitigation is
+ invoked explicit when the host MDS mitigation is enabled.
+
+ For details on L1TF and virtualization see:
+ :ref:`Documentation/hw-vuln//l1tf.rst <mitigation_control_kvm>`.
+
+ - CPU is not affected by L1TF:
+
+ CPU buffers are flushed before entering the guest when the host MDS
+ mitigation is enabled.
+
+ The resulting MDS protection matrix for the host to guest transition:
+
+ ============ ===== ============= ============ =================
+ L1TF MDS VMX-L1FLUSH Host MDS MDS-State
+
+ Don't care No Don't care N/A Not affected
+
+ Yes Yes Disabled Off Vulnerable
+
+ Yes Yes Disabled Full Mitigated
+
+ Yes Yes Enabled Don't care Mitigated
+
+ No Yes N/A Off Vulnerable
+
+ No Yes N/A Full Mitigated
+ ============ ===== ============= ============ =================
+
+ This only covers the host to guest transition, i.e. prevents leakage from
+ host to guest, but does not protect the guest internally. Guests need to
+ have their own protections.
+
+.. _xeon_phi:
+
+XEON PHI specific considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The XEON PHI processor family is affected by MSBDS which can be exploited
+ cross Hyper-Threads when entering idle states. Some XEON PHI variants allow
+ to use MWAIT in user space (Ring 3) which opens an potential attack vector
+ for malicious user space. The exposure can be disabled on the kernel
+ command line with the 'ring3mwait=disable' command line option.
+
+ XEON PHI is not affected by the other MDS variants and MSBDS is mitigated
+ before the CPU enters a idle state. As XEON PHI is not affected by L1TF
+ either disabling SMT is not required for full protection.
+
+.. _mds_smt_control:
+
+SMT control
+^^^^^^^^^^^
+
+ All MDS variants except MSBDS can be attacked cross Hyper-Threads. That
+ means on CPUs which are affected by MFBDS or MLPDS it is necessary to
+ disable SMT for full protection. These are most of the affected CPUs; the
+ exception is XEON PHI, see :ref:`xeon_phi`.
+
+ Disabling SMT can have a significant performance impact, but the impact
+ depends on the type of workloads.
+
+ See the relevant chapter in the L1TF mitigation documentation for details:
+ :ref:`Documentation/hw-vuln/l1tf.rst <smt_control>`.
+
+
+.. _mds_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the MDS mitigations at boot
+time with the option "mds=". The valid arguments for this option are:
+
+ ============ =============================================================
+ full If the CPU is vulnerable, enable all available mitigations
+ for the MDS vulnerability, CPU buffer clearing on exit to
+ userspace and when entering a VM. Idle transitions are
+ protected as well if SMT is enabled.
+
+ It does not automatically disable SMT.
+
+ full,nosmt The same as mds=full, with SMT disabled on vulnerable
+ CPUs. This is the complete mitigation.
+
+ off Disables MDS mitigations completely.
+
+ ============ =============================================================
+
+Not specifying this option is equivalent to "mds=full".
+
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+ If all userspace applications are from a trusted source and do not
+ execute untrusted code which is supplied externally, then the mitigation
+ can be disabled.
+
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The same considerations as above versus trusted user space apply.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection depends on the state of the L1TF mitigations.
+ See :ref:`virt_mechanism`.
+
+ If the MDS mitigation is enabled and SMT is disabled, guest to host and
+ guest to guest attacks are prevented.
+
+.. _mds_default_mitigations:
+
+Default mitigations
+-------------------
+
+ The kernel default mitigations for vulnerable processors are:
+
+ - Enable CPU buffer clearing
+
+ The kernel does not by default enforce the disabling of SMT, which leaves
+ SMT systems vulnerable when running untrusted code. The same rationale as
+ for L1TF applies.
+ See :ref:`Documentation/hw-vuln//l1tf.rst <default_mitigations>`.
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 213399aac757..f95c58dbbbc3 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -12,7 +12,6 @@ Contents:
:maxdepth: 2
kernel-documentation
- l1tf
development-process/index
dev-tools/tools
driver-api/index
@@ -20,6 +19,24 @@ Contents:
gpu/index
80211/index
+This section describes CPU vulnerabilities and their mitigations.
+
+.. toctree::
+ :maxdepth: 1
+
+ hw-vuln/index
+
+Architecture-specific documentation
+-----------------------------------
+
+These books provide programming details about architecture-specific
+implementation.
+
+.. toctree::
+ :maxdepth: 2
+
+ x86/index
+
Indices and tables
==================
diff --git a/Documentation/l1tf.rst b/Documentation/l1tf.rst
deleted file mode 100644
index bae52b845de0..000000000000
--- a/Documentation/l1tf.rst
+++ /dev/null
@@ -1,610 +0,0 @@
-L1TF - L1 Terminal Fault
-========================
-
-L1 Terminal Fault is a hardware vulnerability which allows unprivileged
-speculative access to data which is available in the Level 1 Data Cache
-when the page table entry controlling the virtual address, which is used
-for the access, has the Present bit cleared or other reserved bits set.
-
-Affected processors
--------------------
-
-This vulnerability affects a wide range of Intel processors. The
-vulnerability is not present on:
-
- - Processors from AMD, Centaur and other non Intel vendors
-
- - Older processor models, where the CPU family is < 6
-
- - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
- Penwell, Pineview, Silvermont, Airmont, Merrifield)
-
- - The Intel XEON PHI family
-
- - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
- IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
- by the Meltdown vulnerability either. These CPUs should become
- available by end of 2018.
-
-Whether a processor is affected or not can be read out from the L1TF
-vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
-
-Related CVEs
-------------
-
-The following CVE entries are related to the L1TF vulnerability:
-
- ============= ================= ==============================
- CVE-2018-3615 L1 Terminal Fault SGX related aspects
- CVE-2018-3620 L1 Terminal Fault OS, SMM related aspects
- CVE-2018-3646 L1 Terminal Fault Virtualization related aspects
- ============= ================= ==============================
-
-Problem
--------
-
-If an instruction accesses a virtual address for which the relevant page
-table entry (PTE) has the Present bit cleared or other reserved bits set,
-then speculative execution ignores the invalid PTE and loads the referenced
-data if it is present in the Level 1 Data Cache, as if the page referenced
-by the address bits in the PTE was still present and accessible.
-
-While this is a purely speculative mechanism and the instruction will raise
-a page fault when it is retired eventually, the pure act of loading the
-data and making it available to other speculative instructions opens up the
-opportunity for side channel attacks to unprivileged malicious code,
-similar to the Meltdown attack.
-
-While Meltdown breaks the user space to kernel space protection, L1TF
-allows to attack any physical memory address in the system and the attack
-works across all protection domains. It allows an attack of SGX and also
-works from inside virtual machines because the speculation bypasses the
-extended page table (EPT) protection mechanism.
-
-
-Attack scenarios
-----------------
-
-1. Malicious user space
-^^^^^^^^^^^^^^^^^^^^^^^
-
- Operating Systems store arbitrary information in the address bits of a
- PTE which is marked non present. This allows a malicious user space
- application to attack the physical memory to which these PTEs resolve.
- In some cases user-space can maliciously influence the information
- encoded in the address bits of the PTE, thus making attacks more
- deterministic and more practical.
-
- The Linux kernel contains a mitigation for this attack vector, PTE
- inversion, which is permanently enabled and has no performance
- impact. The kernel ensures that the address bits of PTEs, which are not
- marked present, never point to cacheable physical memory space.
-
- A system with an up to date kernel is protected against attacks from
- malicious user space applications.
-
-2. Malicious guest in a virtual machine
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The fact that L1TF breaks all domain protections allows malicious guest
- OSes, which can control the PTEs directly, and malicious guest user
- space applications, which run on an unprotected guest kernel lacking the
- PTE inversion mitigation for L1TF, to attack physical host memory.
-
- A special aspect of L1TF in the context of virtualization is symmetric
- multi threading (SMT). The Intel implementation of SMT is called
- HyperThreading. The fact that Hyperthreads on the affected processors
- share the L1 Data Cache (L1D) is important for this. As the flaw allows
- only to attack data which is present in L1D, a malicious guest running
- on one Hyperthread can attack the data which is brought into the L1D by
- the context which runs on the sibling Hyperthread of the same physical
- core. This context can be host OS, host user space or a different guest.
-
- If the processor does not support Extended Page Tables, the attack is
- only possible, when the hypervisor does not sanitize the content of the
- effective (shadow) page tables.
-
- While solutions exist to mitigate these attack vectors fully, these
- mitigations are not enabled by default in the Linux kernel because they
- can affect performance significantly. The kernel provides several
- mechanisms which can be utilized to address the problem depending on the
- deployment scenario. The mitigations, their protection scope and impact
- are described in the next sections.
-
- The default mitigations and the rationale for choosing them are explained
- at the end of this document. See :ref:`default_mitigations`.
-
-.. _l1tf_sys_info:
-
-L1TF system information
------------------------
-
-The Linux kernel provides a sysfs interface to enumerate the current L1TF
-status of the system: whether the system is vulnerable, and which
-mitigations are active. The relevant sysfs file is:
-
-/sys/devices/system/cpu/vulnerabilities/l1tf
-
-The possible values in this file are:
-
- =========================== ===============================
- 'Not affected' The processor is not vulnerable
- 'Mitigation: PTE Inversion' The host protection is active
- =========================== ===============================
-
-If KVM/VMX is enabled and the processor is vulnerable then the following
-information is appended to the 'Mitigation: PTE Inversion' part:
-
- - SMT status:
-
- ===================== ================
- 'VMX: SMT vulnerable' SMT is enabled
- 'VMX: SMT disabled' SMT is disabled
- ===================== ================
-
- - L1D Flush mode:
-
- ================================ ====================================
- 'L1D vulnerable' L1D flushing is disabled
-
- 'L1D conditional cache flushes' L1D flush is conditionally enabled
-
- 'L1D cache flushes' L1D flush is unconditionally enabled
- ================================ ====================================
-
-The resulting grade of protection is discussed in the following sections.
-
-
-Host mitigation mechanism
--------------------------
-
-The kernel is unconditionally protected against L1TF attacks from malicious
-user space running on the host.
-
-
-Guest mitigation mechanisms
----------------------------
-
-.. _l1d_flush:
-
-1. L1D flush on VMENTER
-^^^^^^^^^^^^^^^^^^^^^^^
-
- To make sure that a guest cannot attack data which is present in the L1D
- the hypervisor flushes the L1D before entering the guest.
-
- Flushing the L1D evicts not only the data which should not be accessed
- by a potentially malicious guest, it also flushes the guest
- data. Flushing the L1D has a performance impact as the processor has to
- bring the flushed guest data back into the L1D. Depending on the
- frequency of VMEXIT/VMENTER and the type of computations in the guest
- performance degradation in the range of 1% to 50% has been observed. For
- scenarios where guest VMEXIT/VMENTER are rare the performance impact is
- minimal. Virtio and mechanisms like posted interrupts are designed to
- confine the VMEXITs to a bare minimum, but specific configurations and
- application scenarios might still suffer from a high VMEXIT rate.
-
- The kernel provides two L1D flush modes:
- - conditional ('cond')
- - unconditional ('always')
-
- The conditional mode avoids L1D flushing after VMEXITs which execute
- only audited code paths before the corresponding VMENTER. These code
- paths have been verified that they cannot expose secrets or other
- interesting data to an attacker, but they can leak information about the
- address space layout of the hypervisor.
-
- Unconditional mode flushes L1D on all VMENTER invocations and provides
- maximum protection. It has a higher overhead than the conditional
- mode. The overhead cannot be quantified correctly as it depends on the
- workload scenario and the resulting number of VMEXITs.
-
- The general recommendation is to enable L1D flush on VMENTER. The kernel
- defaults to conditional mode on affected processors.
-
- **Note**, that L1D flush does not prevent the SMT problem because the
- sibling thread will also bring back its data into the L1D which makes it
- attackable again.
-
- L1D flush can be controlled by the administrator via the kernel command
- line and sysfs control files. See :ref:`mitigation_control_command_line`
- and :ref:`mitigation_control_kvm`.
-
-.. _guest_confinement:
-
-2. Guest VCPU confinement to dedicated physical cores
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- To address the SMT problem, it is possible to make a guest or a group of
- guests affine to one or more physical cores. The proper mechanism for
- that is to utilize exclusive cpusets to ensure that no other guest or
- host tasks can run on these cores.
-
- If only a single guest or related guests run on sibling SMT threads on
- the same physical core then they can only attack their own memory and
- restricted parts of the host memory.
-
- Host memory is attackable, when one of the sibling SMT threads runs in
- host OS (hypervisor) context and the other in guest context. The amount
- of valuable information from the host OS context depends on the context
- which the host OS executes, i.e. interrupts, soft interrupts and kernel
- threads. The amount of valuable data from these contexts cannot be
- declared as non-interesting for an attacker without deep inspection of
- the code.
-
- **Note**, that assigning guests to a fixed set of physical cores affects
- the ability of the scheduler to do load balancing and might have
- negative effects on CPU utilization depending on the hosting
- scenario. Disabling SMT might be a viable alternative for particular
- scenarios.
-
- For further information about confining guests to a single or to a group
- of cores consult the cpusets documentation:
-
- https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
-
-.. _interrupt_isolation:
-
-3. Interrupt affinity
-^^^^^^^^^^^^^^^^^^^^^
-
- Interrupts can be made affine to logical CPUs. This is not universally
- true because there are types of interrupts which are truly per CPU
- interrupts, e.g. the local timer interrupt. Aside of that multi queue
- devices affine their interrupts to single CPUs or groups of CPUs per
- queue without allowing the administrator to control the affinities.
-
- Moving the interrupts, which can be affinity controlled, away from CPUs
- which run untrusted guests, reduces the attack vector space.
-
- Whether the interrupts with are affine to CPUs, which run untrusted
- guests, provide interesting data for an attacker depends on the system
- configuration and the scenarios which run on the system. While for some
- of the interrupts it can be assumed that they won't expose interesting
- information beyond exposing hints about the host OS memory layout, there
- is no way to make general assumptions.
-
- Interrupt affinity can be controlled by the administrator via the
- /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
- available at:
-
- https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
-
-.. _smt_control:
-
-4. SMT control
-^^^^^^^^^^^^^^
-
- To prevent the SMT issues of L1TF it might be necessary to disable SMT
- completely. Disabling SMT can have a significant performance impact, but
- the impact depends on the hosting scenario and the type of workloads.
- The impact of disabling SMT needs also to be weighted against the impact
- of other mitigation solutions like confining guests to dedicated cores.
-
- The kernel provides a sysfs interface to retrieve the status of SMT and
- to control it. It also provides a kernel command line interface to
- control SMT.
-
- The kernel command line interface consists of the following options:
-
- =========== ==========================================================
- nosmt Affects the bring up of the secondary CPUs during boot. The
- kernel tries to bring all present CPUs online during the
- boot process. "nosmt" makes sure that from each physical
- core only one - the so called primary (hyper) thread is
- activated. Due to a design flaw of Intel processors related
- to Machine Check Exceptions the non primary siblings have
- to be brought up at least partially and are then shut down
- again. "nosmt" can be undone via the sysfs interface.
-
- nosmt=force Has the same effect as "nosmt" but it does not allow to
- undo the SMT disable via the sysfs interface.
- =========== ==========================================================
-
- The sysfs interface provides two files:
-
- - /sys/devices/system/cpu/smt/control
- - /sys/devices/system/cpu/smt/active
-
- /sys/devices/system/cpu/smt/control:
-
- This file allows to read out the SMT control state and provides the
- ability to disable or (re)enable SMT. The possible states are:
-
- ============== ===================================================
- on SMT is supported by the CPU and enabled. All
- logical CPUs can be onlined and offlined without
- restrictions.
-
- off SMT is supported by the CPU and disabled. Only
- the so called primary SMT threads can be onlined
- and offlined without restrictions. An attempt to
- online a non-primary sibling is rejected
-
- forceoff Same as 'off' but the state cannot be controlled.
- Attempts to write to the control file are rejected.
-
- notsupported The processor does not support SMT. It's therefore
- not affected by the SMT implications of L1TF.
- Attempts to write to the control file are rejected.
- ============== ===================================================
-
- The possible states which can be written into this file to control SMT
- state are:
-
- - on
- - off
- - forceoff
-
- /sys/devices/system/cpu/smt/active:
-
- This file reports whether SMT is enabled and active, i.e. if on any
- physical core two or more sibling threads are online.
-
- SMT control is also possible at boot time via the l1tf kernel command
- line parameter in combination with L1D flush control. See
- :ref:`mitigation_control_command_line`.
-
-5. Disabling EPT
-^^^^^^^^^^^^^^^^
-
- Disabling EPT for virtual machines provides full mitigation for L1TF even
- with SMT enabled, because the effective page tables for guests are
- managed and sanitized by the hypervisor. Though disabling EPT has a
- significant performance impact especially when the Meltdown mitigation
- KPTI is enabled.
-
- EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
-
-There is ongoing research and development for new mitigation mechanisms to
-address the performance impact of disabling SMT or EPT.
-
-.. _mitigation_control_command_line:
-
-Mitigation control on the kernel command line
----------------------------------------------
-
-The kernel command line allows to control the L1TF mitigations at boot
-time with the option "l1tf=". The valid arguments for this option are:
-
- ============ =============================================================
- full Provides all available mitigations for the L1TF
- vulnerability. Disables SMT and enables all mitigations in
- the hypervisors, i.e. unconditional L1D flushing
-
- SMT control and L1D flush control via the sysfs interface
- is still possible after boot. Hypervisors will issue a
- warning when the first VM is started in a potentially
- insecure configuration, i.e. SMT enabled or L1D flush
- disabled.
-
- full,force Same as 'full', but disables SMT and L1D flush runtime
- control. Implies the 'nosmt=force' command line option.
- (i.e. sysfs control of SMT is disabled.)
-
- flush Leaves SMT enabled and enables the default hypervisor
- mitigation, i.e. conditional L1D flushing
-
- SMT control and L1D flush control via the sysfs interface
- is still possible after boot. Hypervisors will issue a
- warning when the first VM is started in a potentially
- insecure configuration, i.e. SMT enabled or L1D flush
- disabled.
-
- flush,nosmt Disables SMT and enables the default hypervisor mitigation,
- i.e. conditional L1D flushing.
-
- SMT control and L1D flush control via the sysfs interface
- is still possible after boot. Hypervisors will issue a
- warning when the first VM is started in a potentially
- insecure configuration, i.e. SMT enabled or L1D flush
- disabled.
-
- flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is
- started in a potentially insecure configuration.
-
- off Disables hypervisor mitigations and doesn't emit any
- warnings.
- ============ =============================================================
-
-The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
-
-
-.. _mitigation_control_kvm:
-
-Mitigation control for KVM - module parameter
--------------------------------------------------------------
-
-The KVM hypervisor mitigation mechanism, flushing the L1D cache when
-entering a guest, can be controlled with a module parameter.
-
-The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
-following arguments:
-
- ============ ==============================================================
- always L1D cache flush on every VMENTER.
-
- cond Flush L1D on VMENTER only when the code between VMEXIT and
- VMENTER can leak host memory which is considered
- interesting for an attacker. This still can leak host memory
- which allows e.g. to determine the hosts address space layout.
-
- never Disables the mitigation
- ============ ==============================================================
-
-The parameter can be provided on the kernel command line, as a module
-parameter when loading the modules and at runtime modified via the sysfs
-file:
-
-/sys/module/kvm_intel/parameters/vmentry_l1d_flush
-
-The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
-line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
-module parameter is ignored and writes to the sysfs file are rejected.
-
-
-Mitigation selection guide
---------------------------
-
-1. No virtualization in use
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The system is protected by the kernel unconditionally and no further
- action is required.
-
-2. Virtualization with trusted guests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- If the guest comes from a trusted source and the guest OS kernel is
- guaranteed to have the L1TF mitigations in place the system is fully
- protected against L1TF and no further action is required.
-
- To avoid the overhead of the default L1D flushing on VMENTER the
- administrator can disable the flushing via the kernel command line and
- sysfs control files. See :ref:`mitigation_control_command_line` and
- :ref:`mitigation_control_kvm`.
-
-
-3. Virtualization with untrusted guests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-3.1. SMT not supported or disabled
-""""""""""""""""""""""""""""""""""
-
- If SMT is not supported by the processor or disabled in the BIOS or by
- the kernel, it's only required to enforce L1D flushing on VMENTER.
-
- Conditional L1D flushing is the default behaviour and can be tuned. See
- :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
-
-3.2. EPT not supported or disabled
-""""""""""""""""""""""""""""""""""
-
- If EPT is not supported by the processor or disabled in the hypervisor,
- the system is fully protected. SMT can stay enabled and L1D flushing on
- VMENTER is not required.
-
- EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
-
-3.3. SMT and EPT supported and active
-"""""""""""""""""""""""""""""""""""""
-
- If SMT and EPT are supported and active then various degrees of
- mitigations can be employed:
-
- - L1D flushing on VMENTER:
-
- L1D flushing on VMENTER is the minimal protection requirement, but it
- is only potent in combination with other mitigation methods.
-
- Conditional L1D flushing is the default behaviour and can be tuned. See
- :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.
-
- - Guest confinement:
-
- Confinement of guests to a single or a group of physical cores which
- are not running any other processes, can reduce the attack surface
- significantly, but interrupts, soft interrupts and kernel threads can
- still expose valuable data to a potential attacker. See
- :ref:`guest_confinement`.
-
- - Interrupt isolation:
-
- Isolating the guest CPUs from interrupts can reduce the attack surface
- further, but still allows a malicious guest to explore a limited amount
- of host physical memory. This can at least be used to gain knowledge
- about the host address space layout. The interrupts which have a fixed
- affinity to the CPUs which run the untrusted guests can depending on
- the scenario still trigger soft interrupts and schedule kernel threads
- which might expose valuable information. See
- :ref:`interrupt_isolation`.
-
-The above three mitigation methods combined can provide protection to a
-certain degree, but the risk of the remaining attack surface has to be
-carefully analyzed. For full protection the following methods are
-available:
-
- - Disabling SMT:
-
- Disabling SMT and enforcing the L1D flushing provides the maximum
- amount of protection. This mitigation is not depending on any of the
- above mitigation methods.
-
- SMT control and L1D flushing can be tuned by the command line
- parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
- time with the matching sysfs control files. See :ref:`smt_control`,
- :ref:`mitigation_control_command_line` and
- :ref:`mitigation_control_kvm`.
-
- - Disabling EPT:
-
- Disabling EPT provides the maximum amount of protection as well. It is
- not depending on any of the above mitigation methods. SMT can stay
- enabled and L1D flushing is not required, but the performance impact is
- significant.
-
- EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
- parameter.
-
-3.4. Nested virtual machines
-""""""""""""""""""""""""""""
-
-When nested virtualization is in use, three operating systems are involved:
-the bare metal hypervisor, the nested hypervisor and the nested virtual
-machine. VMENTER operations from the nested hypervisor into the nested
-guest will always be processed by the bare metal hypervisor. If KVM is the
-bare metal hypervisor it wiil:
-
- - Flush the L1D cache on every switch from the nested hypervisor to the
- nested virtual machine, so that the nested hypervisor's secrets are not
- exposed to the nested virtual machine;
-
- - Flush the L1D cache on every switch from the nested virtual machine to
- the nested hypervisor; this is a complex operation, and flushing the L1D
- cache avoids that the bare metal hypervisor's secrets are exposed to the
- nested virtual machine;
-
- - Instruct the nested hypervisor to not perform any L1D cache flush. This
- is an optimization to avoid double L1D flushing.
-
-
-.. _default_mitigations:
-
-Default mitigations
--------------------
-
- The kernel default mitigations for vulnerable processors are:
-
- - PTE inversion to protect against malicious user space. This is done
- unconditionally and cannot be controlled.
-
- - L1D conditional flushing on VMENTER when EPT is enabled for
- a guest.
-
- The kernel does not by default enforce the disabling of SMT, which leaves
- SMT systems vulnerable when running untrusted guests with EPT enabled.
-
- The rationale for this choice is:
-
- - Force disabling SMT can break existing setups, especially with
- unattended updates.
-
- - If regular users run untrusted guests on their machine, then L1TF is
- just an add on to other malware which might be embedded in an untrusted
- guest, e.g. spam-bots or attacks on the local network.
-
- There is no technical way to prevent a user from running untrusted code
- on their machines blindly.
-
- - It's technically extremely unlikely and from today's knowledge even
- impossible that L1TF can be exploited via the most popular attack
- mechanisms like JavaScript because these mechanisms have no way to
- control PTEs. If this would be possible and not other mitigation would
- be possible, then the default might be different.
-
- - The administrators of cloud and hosting setups have to carefully
- analyze the risk for their scenarios and make the appropriate
- mitigation choices, which might even vary across their deployed
- machines and also result in other changes of their overall setup.
- There is no way for the kernel to provide a sensible default for this
- kind of scenarios.
diff --git a/Documentation/spec_ctrl.txt b/Documentation/spec_ctrl.txt
index 32f3d55c54b7..c4dbe6f7cdae 100644
--- a/Documentation/spec_ctrl.txt
+++ b/Documentation/spec_ctrl.txt
@@ -92,3 +92,12 @@ Speculation misfeature controls
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_ENABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_FORCE_DISABLE, 0, 0);
+
+- PR_SPEC_INDIR_BRANCH: Indirect Branch Speculation in User Processes
+ (Mitigate Spectre V2 style attacks against user processes)
+
+ Invocations:
+ * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
diff --git a/Documentation/x86/conf.py b/Documentation/x86/conf.py
new file mode 100644
index 000000000000..33c5c3142e20
--- /dev/null
+++ b/Documentation/x86/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = "X86 architecture specific documentation"
+
+tags.add("subproject")
+
+latex_documents = [
+ ('index', 'x86.tex', project,
+ 'The kernel development community', 'manual'),
+]
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
new file mode 100644
index 000000000000..ef389dcf1b1d
--- /dev/null
+++ b/Documentation/x86/index.rst
@@ -0,0 +1,8 @@
+==========================
+x86 architecture specifics
+==========================
+
+.. toctree::
+ :maxdepth: 1
+
+ mds
diff --git a/Documentation/x86/mds.rst b/Documentation/x86/mds.rst
new file mode 100644
index 000000000000..534e9baa4e1d
--- /dev/null
+++ b/Documentation/x86/mds.rst
@@ -0,0 +1,225 @@
+Microarchitectural Data Sampling (MDS) mitigation
+=================================================
+
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
+on internal buffers in Intel CPUs. The variants are:
+
+ - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+ - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+ - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+ - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)
+
+MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
+dependent load (store-to-load forwarding) as an optimization. The forward
+can also happen to a faulting or assisting load operation for a different
+memory address, which can be exploited under certain conditions. Store
+buffers are partitioned between Hyper-Threads so cross thread forwarding is
+not possible. But if a thread enters or exits a sleep state the store
+buffer is repartitioned which can expose data from one thread to the other.
+
+MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
+L1 miss situations and to hold data which is returned or sent in response
+to a memory or I/O operation. Fill buffers can forward data to a load
+operation and also write data to the cache. When the fill buffer is
+deallocated it can retain the stale data of the preceding operations which
+can then be forwarded to a faulting or assisting load operation, which can
+be exploited under certain conditions. Fill buffers are shared between
+Hyper-Threads so cross thread leakage is possible.
+
+MLPDS leaks Load Port Data. Load ports are used to perform load operations
+from memory or I/O. The received data is then forwarded to the register
+file or a subsequent operation. In some implementations the Load Port can
+contain stale data from a previous operation which can be forwarded to
+faulting or assisting loads under certain conditions, which again can be
+exploited eventually. Load ports are shared between Hyper-Threads so cross
+thread leakage is possible.
+
+MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
+memory that takes a fault or assist can leave data in a microarchitectural
+structure that may later be observed using one of the same methods used by
+MSBDS, MFBDS or MLPDS.
+
+Exposure assumptions
+--------------------
+
+It is assumed that attack code resides in user space or in a guest with one
+exception. The rationale behind this assumption is that the code construct
+needed for exploiting MDS requires:
+
+ - to control the load to trigger a fault or assist
+
+ - to have a disclosure gadget which exposes the speculatively accessed
+ data for consumption through a side channel.
+
+ - to control the pointer through which the disclosure gadget exposes the
+ data
+
+The existence of such a construct in the kernel cannot be excluded with
+100% certainty, but the complexity involved makes it extremly unlikely.
+
+There is one exception, which is untrusted BPF. The functionality of
+untrusted BPF is limited, but it needs to be thoroughly investigated
+whether it can be used to create such a construct.
+
+
+Mitigation strategy
+-------------------
+
+All variants have the same mitigation strategy at least for the single CPU
+thread case (SMT off): Force the CPU to clear the affected buffers.
+
+This is achieved by using the otherwise unused and obsolete VERW
+instruction in combination with a microcode update. The microcode clears
+the affected CPU buffers when the VERW instruction is executed.
+
+For virtualization there are two ways to achieve CPU buffer
+clearing. Either the modified VERW instruction or via the L1D Flush
+command. The latter is issued when L1TF mitigation is enabled so the extra
+VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
+be issued.
+
+If the VERW instruction with the supplied segment selector argument is
+executed on a CPU without the microcode update there is no side effect
+other than a small number of pointlessly wasted CPU cycles.
+
+This does not protect against cross Hyper-Thread attacks except for MSBDS
+which is only exploitable cross Hyper-thread when one of the Hyper-Threads
+enters a C-state.
+
+The kernel provides a function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
+(idle) transitions.
+
+As a special quirk to address virtualization scenarios where the host has
+the microcode updated, but the hypervisor does not (yet) expose the
+MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
+hope that it might actually clear the buffers. The state is reflected
+accordingly.
+
+According to current knowledge additional mitigations inside the kernel
+itself are not required because the necessary gadgets to expose the leaked
+data cannot be controlled in a way which allows exploitation from malicious
+user space or VM guests.
+
+Kernel internal mitigation modes
+--------------------------------
+
+ ======= ============================================================
+ off Mitigation is disabled. Either the CPU is not affected or
+ mds=off is supplied on the kernel command line
+
+ full Mitigation is enabled. CPU is affected and MD_CLEAR is
+ advertised in CPUID.
+
+ vmwerv Mitigation is enabled. CPU is affected and MD_CLEAR is not
+ advertised in CPUID. That is mainly for virtualization
+ scenarios where the host has the updated microcode but the
+ hypervisor does not expose MD_CLEAR in CPUID. It's a best
+ effort approach without guarantee.
+ ======= ============================================================
+
+If the CPU is affected and mds=off is not supplied on the kernel command
+line then the kernel selects the appropriate mitigation mode depending on
+the availability of the MD_CLEAR CPUID bit.
+
+Mitigation points
+-----------------
+
+1. Return to user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+ When transitioning from kernel to user space the CPU buffers are flushed
+ on affected CPUs when the mitigation is not disabled on the kernel
+ command line. The migitation is enabled through the static key
+ mds_user_clear.
+
+ The mitigation is invoked in prepare_exit_to_usermode() which covers
+ most of the kernel to user space transitions. There are a few exceptions
+ which are not invoking prepare_exit_to_usermode() on return to user
+ space. These exceptions use the paranoid exit code.
+
+ - Non Maskable Interrupt (NMI):
+
+ Access to sensible data like keys, credentials in the NMI context is
+ mostly theoretical: The CPU can do prefetching or execute a
+ misspeculated code path and thereby fetching data which might end up
+ leaking through a buffer.
+
+ But for mounting other attacks the kernel stack address of the task is
+ already valuable information. So in full mitigation mode, the NMI is
+ mitigated on the return from do_nmi() to provide almost complete
+ coverage.
+
+ - Double fault (#DF):
+
+ A double fault is usually fatal, but the ESPFIX workaround, which can
+ be triggered from user space through modify_ldt(2) is a recoverable
+ double fault. #DF uses the paranoid exit path, so explicit mitigation
+ in the double fault handler is required.
+
+ - Machine Check Exception (#MC):
+
+ Another corner case is a #MC which hits between the CPU buffer clear
+ invocation and the actual return to user. As this still is in kernel
+ space it takes the paranoid exit path which does not clear the CPU
+ buffers. So the #MC handler repopulates the buffers to some
+ extent. Machine checks are not reliably controllable and the window is
+ extremly small so mitigation would just tick a checkbox that this
+ theoretical corner case is covered. To keep the amount of special
+ cases small, ignore #MC.
+
+ - Debug Exception (#DB):
+
+ This takes the paranoid exit path only when the INT1 breakpoint is in
+ kernel space. #DB on a user space address takes the regular exit path,
+ so no extra mitigation required.
+
+
+2. C-State transition
+^^^^^^^^^^^^^^^^^^^^^
+
+ When a CPU goes idle and enters a C-State the CPU buffers need to be
+ cleared on affected CPUs when SMT is active. This addresses the
+ repartitioning of the store buffer when one of the Hyper-Threads enters
+ a C-State.
+
+ When SMT is inactive, i.e. either the CPU does not support it or all
+ sibling threads are offline CPU buffer clearing is not required.
+
+ The idle clearing is enabled on CPUs which are only affected by MSBDS
+ and not by any other MDS variant. The other MDS variants cannot be
+ protected against cross Hyper-Thread attacks because the Fill Buffer and
+ the Load Ports are shared. So on CPUs affected by other variants, the
+ idle clearing would be a window dressing exercise and is therefore not
+ activated.
+
+ The invocation is controlled by the static key mds_idle_clear which is
+ switched depending on the chosen mitigation mode and the SMT state of
+ the system.
+
+ The buffer clear is only invoked before entering the C-State to prevent
+ that stale data from the idling CPU from spilling to the Hyper-Thread
+ sibling after the store buffer got repartitioned and all entries are
+ available to the non idle sibling.
+
+ When coming out of idle the store buffer is partitioned again so each
+ sibling has half of it available. The back from idle CPU could be then
+ speculatively exposed to contents of the sibling. The buffers are
+ flushed either on exit to user space or on VMENTER so malicious code
+ in user space or the guest cannot speculatively access them.
+
+ The mitigation is hooked into all variants of halt()/mwait(), but does
+ not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
+ has been superseded by the intel_idle driver around 2010 and is
+ preferred on all affected CPUs which are expected to gain the MD_CLEAR
+ functionality in microcode. Aside of that the IO-Port mechanism is a
+ legacy interface which is only used on older systems which are either
+ not affected or do not receive microcode updates anymore.
diff --git a/Makefile b/Makefile
index e52b0579e176..92fe701e5582 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
VERSION = 4
PATCHLEVEL = 9
-SUBLEVEL = 175
+SUBLEVEL = 176
EXTRAVERSION =
NAME = Roaring Lionus
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5a4591ff8407..e0055b4302d6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -937,13 +937,7 @@ config NR_CPUS
approximately eight kilobytes to the kernel image.
config SCHED_SMT
- bool "SMT (Hyperthreading) scheduler support"
- depends on SMP
- ---help---
- SMT scheduler support improves the CPU scheduler's decision making
- when dealing with Intel Pentium 4 chips with HyperThreading at a
- cost of slightly increased overhead in some places. If unsure say
- N here.
+ def_bool y if SMP
config SCHED_MC
def_bool y
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index b0cd306dc527..8841d016b4a4 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -28,6 +28,7 @@
#include <asm/vdso.h>
#include <asm/uaccess.h>
#include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -206,6 +207,8 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
#endif
user_enter_irqoff();
+
+ mds_user_clear_cpu_buffers();
}
#define SYSCALL_EXIT_WORK_FLAGS \
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a30829052a00..cb8178a2783a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3750,11 +3750,11 @@ __init int intel_pmu_init(void)
pr_cont("Nehalem events, ");
break;
- case INTEL_FAM6_ATOM_PINEVIEW:
- case INTEL_FAM6_ATOM_LINCROFT:
- case INTEL_FAM6_ATOM_PENWELL:
- case INTEL_FAM6_ATOM_CLOVERVIEW:
- case INTEL_FAM6_ATOM_CEDARVIEW:
+ case INTEL_FAM6_ATOM_BONNELL:
+ case INTEL_FAM6_ATOM_BONNELL_MID:
+ case INTEL_FAM6_ATOM_SALTWELL:
+ case INTEL_FAM6_ATOM_SALTWELL_MID:
+ case INTEL_FAM6_ATOM_SALTWELL_TABLET:
memcpy(hw_cache_event_ids, atom_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
@@ -3766,9 +3766,11 @@ __init int intel_pmu_init(void)
pr_cont("Atom events, ");
break;
- case INTEL_FAM6_ATOM_SILVERMONT1:
- case INTEL_FAM6_ATOM_SILVERMONT2:
+ case INTEL_FAM6_ATOM_SILVERMONT:
+ case INTEL_FAM6_ATOM_SILVERMONT_X:
+ case INTEL_FAM6_ATOM_SILVERMONT_MID:
case INTEL_FAM6_ATOM_AIRMONT:
+ case INTEL_FAM6_ATOM_AIRMONT_MID:
memcpy(hw_cache_event_ids, slm_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, slm_hw_cache_extra_regs,
@@ -3785,7 +3787,7 @@ __init int intel_pmu_init(void)
break;
case INTEL_FAM6_ATOM_GOLDMONT:
- case INTEL_FAM6_ATOM_DENVERTON:
+ case INTEL_FAM6_ATOM_GOLDMONT_X:
memcpy(hw_cache_event_ids, glm_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, glm_hw_cache_extra_regs,
diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 47d526c700a1..72d09340c24d 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -531,8 +531,8 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
X86_CSTATES_MODEL(INTEL_FAM6_HASWELL_ULT, hswult_cstates),
- X86_CSTATES_MODEL(INTEL_FAM6_ATOM_SILVERMONT1, slm_cstates),
- X86_CSTATES_MODEL(INTEL_FAM6_ATOM_SILVERMONT2, slm_cstates),
+ X86_CSTATES_MODEL(INTEL_FAM6_ATOM_SILVERMONT, slm_cstates),
+ X86_CSTATES_MODEL(INTEL_FAM6_ATOM_SILVERMONT_X, slm_cstates),
X86_CSTATES_MODEL(INTEL_FAM6_ATOM_AIRMONT, slm_cstates),
X86_CSTATES_MODEL(INTEL_FAM6_BROADWELL_CORE, snb_cstates),
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index be0b1968d60a..68144a341903 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -61,8 +61,8 @@ static bool test_intel(int idx)
case INTEL_FAM6_BROADWELL_GT3E:
case INTEL_FAM6_BROADWELL_X:
- case INTEL_FAM6_ATOM_SILVERMONT1:
- case INTEL_FAM6_ATOM_SILVERMONT2:
+ case INTEL_FAM6_ATOM_SILVERMONT:
+ case INTEL_FAM6_ATOM_SILVERMONT_X:
case INTEL_FAM6_ATOM_AIRMONT:
if (idx == PERF_MSR_SMI)
return true;
diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h
index 75b748a1deb8..ba7b6f736414 100644
--- a/arch/x86/include/asm/intel-family.h
+++ b/arch/x86/include/asm/intel-family.h
@@ -50,19 +50,23 @@
/* "Small Core" Processors (Atom) */
-#define INTEL_FAM6_ATOM_PINEVIEW 0x1C
-#define INTEL_FAM6_ATOM_LINCROFT 0x26
-#define INTEL_FAM6_ATOM_PENWELL 0x27
-#define INTEL_FAM6_ATOM_CLOVERVIEW 0x35
-#define INTEL_FAM6_ATOM_CEDARVIEW 0x36
-#define INTEL_FAM6_ATOM_SILVERMONT1 0x37 /* BayTrail/BYT / Valleyview */
-#define INTEL_FAM6_ATOM_SILVERMONT2 0x4D /* Avaton/Rangely */
-#define INTEL_FAM6_ATOM_AIRMONT 0x4C /* CherryTrail / Braswell */
-#define INTEL_FAM6_ATOM_MERRIFIELD 0x4A /* Tangier */
-#define INTEL_FAM6_ATOM_MOOREFIELD 0x5A /* Anniedale */
-#define INTEL_FAM6_ATOM_GOLDMONT 0x5C
-#define INTEL_FAM6_ATOM_DENVERTON 0x5F /* Goldmont Microserver */
-#define INTEL_FAM6_ATOM_GEMINI_LAKE 0x7A
+#define INTEL_FAM6_ATOM_BONNELL 0x1C /* Diamondville, Pineview */
+#define INTEL_FAM6_ATOM_BONNELL_MID 0x26 /* Silverthorne, Lincroft */
+
+#define INTEL_FAM6_ATOM_SALTWELL 0x36 /* Cedarview */
+#define INTEL_FAM6_ATOM_SALTWELL_MID 0x27 /* Penwell */
+#define INTEL_FAM6_ATOM_SALTWELL_TABLET 0x35 /* Cloverview */
+
+#define INTEL_FAM6_ATOM_SILVERMONT 0x37 /* Bay Trail, Valleyview */
+#define INTEL_FAM6_ATOM_SILVERMONT_X 0x4D /* Avaton, Rangely */
+#define INTEL_FAM6_ATOM_SILVERMONT_MID 0x4A /* Merriefield */
+
+#define INTEL_FAM6_ATOM_AIRMONT 0x4C /* Cherry Trail, Braswell */
+#define INTEL_FAM6_ATOM_AIRMONT_MID 0x5A /* Moorefield */
+
+#define INTEL_FAM6_ATOM_GOLDMONT 0x5C /* Apollo Lake */
+#define INTEL_FAM6_ATOM_GOLDMONT_X 0x5F /* Denverton */
+#define INTEL_FAM6_ATOM_GOLDMONT_PLUS 0x7A /* Gemini Lake */
/* Xeon Phi */
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 508a062e6cf1..0c8f4281b151 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -5,6 +5,8 @@
#ifndef __ASSEMBLY__
+#include <asm/nospec-branch.h>
+
/* Provide __cpuidle; we can't safely include <linux/cpu.h> */
#define __cpuidle __attribute__((__section__(".cpuidle.text")))
@@ -53,11 +55,13 @@ static inline void native_irq_enable(void)
static inline __cpuidle void native_safe_halt(void)
{
+ mds_idle_clear_cpu_buffers();
asm volatile("sti; hlt": : :"memory");
}
static inline __cpuidle void native_halt(void)
{
+ mds_idle_clear_cpu_buffers();
asm volatile("hlt": : :"memory");
}
diff --git a/arch/x86/include/asm/microcode_intel.h b/arch/x86/include/asm/microcode_intel.h
index 5e69154c9f07..a61ec81b27db 100644
--- a/arch/x86/include/asm/microcode_intel.h
+++ b/arch/x86/include/asm/microcode_intel.h
@@ -52,6 +52,21 @@ struct extended_sigtable {
#define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE)
+static inline u32 intel_get_microcode_revision(void)
+{
+ u32 rev, dummy;
+
+ native_wrmsrl(MSR_IA32_UCODE_REV, 0);
+
+ /* As documented in the SDM: Do a CPUID 1 here */
+ sync_core();
+
+ /* get the current revision from MSR 0x8B */
+ native_rdmsr(MSR_IA32_UCODE_REV, dummy, rev);
+
+ return rev;
+}
+
extern int has_newer_microcode(void *mc, unsigned int csig, int cpf, int rev);
extern int microcode_sanity_check(void *mc, int print_err);
extern int find_matching_signature(void *mc, unsigned int csig, int cpf);
diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index f37f2d8a2989..0b40cc442bda 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -4,6 +4,7 @@
#include <linux/sched.h>
#include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
#define MWAIT_SUBSTATE_MASK 0xf
#define MWAIT_CSTATE_MASK 0xf
@@ -38,6 +39,8 @@ static inline void __monitorx(const void *eax, unsigned long ecx,
static inline void __mwait(unsigned long eax, unsigned long ecx)
{
+ mds_idle_clear_cpu_buffers();
+
/* "mwait %eax, %ecx;" */
asm volatile(".byte 0x0f, 0x01, 0xc9;"
:: "a" (eax), "c" (ecx));
@@ -72,6 +75,8 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
static inline void __mwaitx(unsigned long eax, unsigned long ebx,
unsigned long ecx)
{
+ /* No MDS buffer clear as this is AMD/HYGON only */
+
/* "mwaitx %eax, %ebx, %ecx;" */
asm volatile(".byte 0x0f, 0x01, 0xfb;"
:: "a" (eax), "b" (ebx), "c" (ecx));
@@ -79,6 +84,8 @@ static inline void __mwaitx(unsigned long eax, unsigned long ebx,
static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
{
+ mds_idle_clear_cpu_buffers();
+
trace_hardirqs_on();
/* "mwait %eax, %ecx;" */
asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 221a32ed1372..f12e61e2a86b 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -44,15 +44,15 @@ struct mm_struct;
void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);
-static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
+static inline void native_set_pte(pte_t *ptep, pte_t pte)
{
- *ptep = native_make_pte(0);
+ WRITE_ONCE(*ptep, pte);
}
-static inline void native_set_pte(pte_t *ptep, pte_t pte)
+static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep)
{
- *ptep = pte;
+ native_set_pte(ptep, native_make_pte(0));
}
static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
@@ -62,7 +62,7 @@ static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
- *pmdp = pmd;
+ WRITE_ONCE(*pmdp, pmd);
}
static inline void native_pmd_clear(pmd_t *pmd)
@@ -98,7 +98,7 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
static inline void native_set_pud(pud_t *pudp, pud_t pud)
{
- *pudp = pud;
+ WRITE_ONCE(*pudp, pud);
}
static inline void native_pud_clear(pud_t *pud)
@@ -131,7 +131,7 @@ static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp)
static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
- *pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
+ WRITE_ONCE(*pgdp, kaiser_set_shadow_pgd(pgdp, pgd));
}
static inline void native_pgd_clear(pgd_t *pgd)
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 5cb436acd463..676e84f521ba 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,9 +8,6 @@ struct task_struct *__switch_to_asm(struct task_struct *prev,
__visible struct task_struct *__switch_to(struct task_struct *prev,
struct task_struct *next);
-struct tss_struct;
-void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
- struct tss_struct *tss);
/* This runs runs on the previous thread's stack. */
static inline void prepare_switch_to(struct task_struct *prev,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 686a58d793e5..f5ca15622dc9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -68,8 +68,12 @@ static inline void invpcid_flush_all_nonglobals(void)
struct tlb_state {
struct mm_struct *active_mm;
int state;
- /* last user mm's ctx id */
- u64 last_ctx_id;
+
+ /* Last user mm for optimizing IBPB */
+ union {
+ struct mm_struct *last_user_mm;
+ unsigned long last_user_mm_ibpb;
+ };
/*
* Access to this CR4 shadow and to H/W CR4 is protected by
diff --git a/arch/x86/include/uapi/asm/Kbuild b/arch/x86/include/uapi/asm/Kbuild
index 3dec769cadf7..1c532b3f18ea 100644
--- a/arch/x86/include/uapi/asm/Kbuild
+++ b/arch/x86/include/uapi/asm/Kbuild
@@ -27,7 +27,6 @@ header-y += ldt.h
header-y += mce.h
header-y += mman.h
header-y += msgbuf.h
-header-y += msr-index.h
header-y += msr.h
header-y += mtrr.h
header-y += param.h
diff --git a/arch/x86/include/uapi/asm/mce.h b/arch/x86/include/uapi/asm/mce.h
index 69a6e07e3149..db7dae58745f 100644
--- a/arch/x86/include/uapi/asm/mce.h
+++ b/arch/x86/include/uapi/asm/mce.h
@@ -28,6 +28,8 @@ struct mce {
__u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
__u64 synd; /* MCA_SYND MSR: only valid on SMCA systems */
__u64 ipid; /* MCA_IPID MSR: only valid on SMCA systems */
+ __u64 ppin; /* Protected Processor Inventory Number */
+ __u32 microcode;/* Microcode revision */
};
#define MCE_GET_RECORD_LEN _IOR('M', 1, int)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index cee0fec0d232..860f2fd9f540 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -14,6 +14,7 @@
#include <asm/bugs.h>
#include <asm/cpu.h>
#include <asm/intel-family.h>
+#include <asm/microcode_intel.h>
#ifdef CONFIG_X86_64
#include <linux/topology.h>
@@ -137,14 +138,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
(c->x86 == 0x6 && c->x86_model >= 0x0e))
set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
- if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64)) {
- unsigned lower_word;
-
- wrmsr(MSR_IA32_UCODE_REV, 0, 0);
- /* Required by the SDM */
- sync_core();
- rdmsr(MSR_IA32_UCODE_REV, lower_word, c->microcode);
- }
+ if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64))
+ c->microcode = intel_get_microcode_revision();
/* Now if any of them are set, check the blacklist and clear the lot */
if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) ||
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 25310d2b8609..d9ad49ca3cbe 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -139,6 +139,8 @@ void mce_setup(struct mce *m)
m->socketid = cpu_data(m->extcpu).phys_proc_id;
m->apicid = cpu_data(m->extcpu).initial_apicid;
rdmsrl(MSR_IA32_MCG_CAP, m->mcgcap);
+
+ m->microcode = boot_cpu_data.microcode;
}
DEFINE_PER_CPU(struct mce, injectm);
@@ -309,7 +311,7 @@ static void print_mce(struct mce *m)
*/
pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode %x\n",
m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid,
- cpu_data(m->extcpu).microcode);
+ m->microcode);
pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
}
diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
index 732bb03fcf91..a19fddfb6bf8 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -707,22 +707,26 @@ int apply_microcode_amd(int cpu)
return -1;
/* need to apply patch? */
- if (rev >= mc_amd->hdr.patch_id) {
- c->microcode = rev;
- uci->cpu_sig.rev = rev;
- return 0;
- }
+ if (rev >= mc_amd->hdr.patch_id)
+ goto out;
if (__apply_microcode_amd(mc_amd)) {
pr_err("CPU%d: update failed for patch_level=0x%08x\n",
cpu, mc_amd->hdr.patch_id);
return -1;
}
- pr_info("CPU%d: new patch_level=0x%08x\n", cpu,
- mc_amd->hdr.patch_id);
- uci->cpu_sig.rev = mc_amd->hdr.patch_id;
- c->microcode = mc_amd->hdr.patch_id;
+ rev = mc_amd->hdr.patch_id;
+
+ pr_info("CPU%d: new patch_level=0x%08x\n", cpu, rev);
+
+out:
+ uci->cpu_sig.rev = rev;
+ c->microcode = rev;
+
+ /* Update boot_cpu_data's revision too, if we're on the BSP: */
+ if (c->cpu_index == boot_cpu_data.cpu_index)
+ boot_cpu_data.microcode = rev;
return 0;
}
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 79291d6fb301..1308abfc4758 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -386,15 +386,8 @@ static int collect_cpu_info_early(struct ucode_cpu_info *uci)
native_rdmsr(MSR_IA32_PLATFORM_ID, val[0], val[1]);
csig.pf = 1 << ((val[1] >> 18) & 7);
}
- native_wrmsrl(MSR_IA32_UCODE_REV, 0);
- /* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
-
- /* get the current revision from MSR 0x8B */
- native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
-
- csig.rev = val[1];
+ csig.rev = intel_get_microcode_revision();
uci->cpu_sig = csig;
uci->valid = 1;
@@ -618,29 +611,35 @@ static inline void print_ucode(struct ucode_cpu_info *uci)
static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
{
struct microcode_intel *mc;
- unsigned int val[2];
+ u32 rev;
mc = uci->mc;
if (!mc)
return 0;
+ /*
+ * Save us the MSR write below - which is a particular expensive
+ * operation - when the other hyperthread has updated the microcode
+ * already.
+ */
+ rev = intel_get_microcode_revision();
+ if (rev >= mc->hdr.rev) {
+ uci->cpu_sig.rev = rev;
+ return 0;
+ }
+
/* write microcode via MSR 0x79 */
native_wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
- native_wrmsrl(MSR_IA32_UCODE_REV, 0);
-
- /* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
- /* get the current revision from MSR 0x8B */
- native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
- if (val[1] != mc->hdr.rev)
+ rev = intel_get_microcode_revision();
+ if (rev != mc->hdr.rev)
return -1;
#ifdef CONFIG_X86_64
/* Flush global tlb. This is precaution. */
flush_tlb_early();
#endif
- uci->cpu_sig.rev = val[1];
+ uci->cpu_sig.rev = rev;
if (early)
print_ucode(uci);
@@ -903,9 +902,9 @@ static int apply_microcode_intel(int cpu)
{
struct microcode_intel *mc;
struct ucode_cpu_info *uci;
- struct cpuinfo_x86 *c;
- unsigned int val[2];
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
static int prev_rev;
+ u32 rev;
/* We should bind the task to the CPU */
if (WARN_ON(raw_smp_processor_id() != cpu))
@@ -924,35 +923,42 @@ static int apply_microcode_intel(int cpu)
if (!get_matching_mc(mc, cpu))
return 0;
+ /*
+ * Save us the MSR write below - which is a particular expensive
+ * operation - when the other hyperthread has updated the microcode
+ * already.
+ */
+ rev = intel_get_microcode_revision();
+ if (rev >= mc->hdr.rev)
+ goto out;
+
/* write microcode via MSR 0x79 */
wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
- wrmsrl(MSR_IA32_UCODE_REV, 0);
- /* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ rev = intel_get_microcode_revision();
- /* get the current revision from MSR 0x8B */
- rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
-
- if (val[1] != mc->hdr.rev) {
+ if (rev != mc->hdr.rev) {
pr_err("CPU%d update to revision 0x%x failed\n",
cpu, mc->hdr.rev);
return -1;
}
- if (val[1] != prev_rev) {
+ if (rev != prev_rev) {
pr_info("updated to revision 0x%x, date = %04x-%02x-%02x\n",
- val[1],
+ rev,
mc->hdr.date & 0xffff,
mc->hdr.date >> 24,
(mc->hdr.date >> 16) & 0xff);
- prev_rev = val[1];
+ prev_rev = rev;
}
- c = &cpu_data(cpu);
+out:
+ uci->cpu_sig.rev = rev;
+ c->microcode = rev;
- uci->cpu_sig.rev = val[1];
- c->microcode = val[1];
+ /* Update boot_cpu_data's revision too, if we're on the BSP: */
+ if (c->cpu_index == boot_cpu_data.cpu_index)
+ boot_cpu_data.microcode = rev;
return 0;
}
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bfe4d6c96fbd..6b7b35d80264 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -32,6 +32,7 @@
#include <asm/x86_init.h>
#include <asm/reboot.h>
#include <asm/cache.h>
+#include <asm/nospec-branch.h>
#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -544,6 +545,9 @@ nmi_restart:
write_cr2(this_cpu_read(nmi_cr2));
if (this_cpu_dec_return(nmi_state))
goto nmi_restart;
+
+ if (user_mode(regs))
+ mds_user_clear_cpu_buffers();
}
NOKPROBE_SYMBOL(do_nmi);
diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
new file mode 100644
index 000000000000..898e97cf6629
--- /dev/null
+++ b/arch/x86/kernel/process.h
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Code shared between 32 and 64 bit
+
+#include <asm/spec-ctrl.h>
+
+void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
+
+/*
+ * This needs to be inline to optimize for the common case where no extra
+ * work needs to be done.
+ */
+static inline void switch_to_extra(struct task_struct *prev,
+ struct task_struct *next)
+{
+ unsigned long next_tif = task_thread_info(next)->flags;
+ unsigned long prev_tif = task_thread_info(prev)->flags;
+
+ if (IS_ENABLED(CONFIG_SMP)) {
+ /*
+ * Avoid __switch_to_xtra() invocation when conditional
+ * STIPB is disabled and the only different bit is
+ * TIF_SPEC_IB. For CONFIG_SMP=n TIF_SPEC_IB is not
+ * in the TIF_WORK_CTXSW masks.
+ */
+ if (!static_branch_likely(&switch_to_cond_stibp)) {
+ prev_tif &= ~_TIF_SPEC_IB;
+ next_tif &= ~_TIF_SPEC_IB;
+ }
+ }
+
+ /*
+ * __switch_to_xtra() handles debug registers, i/o bitmaps,
+ * speculation mitigations etc.
+ */
+ if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
+ prev_tif & _TIF_WORK_CTXSW_PREV))
+ __switch_to_xtra(prev, next);
+}
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index bd7be8efdc4c..912246fd6cd9 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -55,6 +55,8 @@
#include <asm/switch_to.h>
#include <asm/vm86.h>
+#include "process.h"
+
void __show_regs(struct pt_regs *regs, int all)
{
unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
@@ -264,12 +266,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
set_iopl_mask(next->iopl);
- /*
- * Now maybe handle debug registers and/or IO bitmaps
- */
- if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||
- task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
- __switch_to_xtra(prev_p, next_p, tss);
+ switch_to_extra(prev_p, next_p);
/*
* Leave lazy mode, flushing any hypercalls made here.
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a2661814bde0..81eec65fe053 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -51,6 +51,8 @@
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
+#include "process.h"
+
__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
/* Prints also some state that isn't saved in the pt_regs */
@@ -454,12 +456,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Reload esp0 and ss1. This changes current_thread_info(). */
load_sp0(tss, next);
- /*
- * Now maybe reload the debug registers and handle I/O bitmaps
- */
- if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
- task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
- __switch_to_xtra(prev_p, next_p, tss);
+ switch_to_extra(prev_p, next_p);
#ifdef CONFIG_XEN
/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 5bbfa2f63b8c..ef225fa8e928 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -62,6 +62,7 @@
#include <asm/alternative.h>
#include <asm/fpu/xstate.h>
#include <asm/trace/mpx.h>
+#include <asm/nospec-branch.h>
#include <asm/mpx.h>
#include <asm/vm86.h>
@@ -340,6 +341,13 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&normal_regs->orig_ax;
+ /*
+ * This situation can be triggered by userspace via
+ * modify_ldt(2) and the return does not take the regular
+ * user space exit, so a CPU buffer clear is required when
+ * MDS mitigation is enabled.
+ */
+ mds_user_clear_cpu_buffers();
return;
}
#endif
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 769c370011d6..cb768417429d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -713,7 +713,7 @@ unsigned long native_calibrate_tsc(void)
case INTEL_FAM6_KABYLAKE_DESKTOP:
crystal_khz = 24000; /* 24.0 MHz */
break;
- case INTEL_FAM6_ATOM_DENVERTON:
+ case INTEL_FAM6_ATOM_GOLDMONT_X:
crystal_khz = 25000; /* 25.0 MHz */
break;
case INTEL_FAM6_ATOM_GOLDMONT:
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 90801a8f19c9..ce092a62fc5d 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -790,7 +790,7 @@ unsigned long max_swapfile_size(void)
pages = generic_max_swapfile_size();
- if (boot_cpu_has_bug(X86_BUG_L1TF)) {
+ if (boot_cpu_has_bug(X86_BUG_L1TF) && l1tf_mitigation != L1TF_MITIGATION_OFF) {
/* Limit the swap file size to MAX_PA/2 for L1TF workaround */
unsigned long long l1tf_limit = l1tf_pfn_limit();
/*
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 3f729e20f0e3..12522dbae615 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -9,6 +9,7 @@
#include <linux/spinlock.h>
#include <linux/mm.h>
#include <linux/uaccess.h>
+#include <linux/cpu.h>
#undef pr_fmt
#define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt
@@ -297,7 +298,8 @@ void __init kaiser_check_boottime_disable(void)
goto skip;
}
- if (cmdline_find_option_bool(boot_command_line, "nopti"))
+ if (cmdline_find_option_bool(boot_command_line, "nopti") ||
+ cpu_mitigations_off())
goto disable;
skip:
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e30baa8ad94f..dff8ac2d255c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -251,7 +251,7 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
if (pgd_val(pgd) != 0) {
pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
- pgdp[i] = native_make_pgd(0);
+ pgd_clear(&pgdp[i]);
paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
pmd_free(mm, pmd);
@@ -419,7 +419,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed && dirty) {
- *ptep = entry;
+ set_pte(ptep, entry);
pte_update(vma->vm_mm, address, ptep);
}
@@ -436,7 +436,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma,
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
if (changed && dirty) {
- *pmdp = entry;
+ set_pmd(pmdp, entry);
/*
* We had a write-protection fault here and changed the pmd
* to to more permissive. No need to flush the TLB for that,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index eac92e2d171b..a112bb175dd4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,12 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/
+/*
+ * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_ibpb.
+ */
+#define LAST_USER_MM_IBPB 0x1UL
+
atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
struct flush_tlb_info {
@@ -101,33 +107,101 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}
+static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+{
+ unsigned long next_tif = task_thread_info(next)->flags;
+ unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+
+ return (unsigned long)next->mm | ibpb;
+}
+
+static void cond_ibpb(struct task_struct *next)
+{
+ if (!next || !next->mm)
+ return;
+
+ /*
+ * Both, the conditional and the always IBPB mode use the mm
+ * pointer to avoid the IBPB when switching between tasks of the
+ * same process. Using the mm pointer instead of mm->context.ctx_id
+ * opens a hypothetical hole vs. mm_struct reuse, which is more or
+ * less impossible to control by an attacker. Aside of that it
+ * would only affect the first schedule so the theoretically
+ * exposed data is not really interesting.
+ */
+ if (static_branch_likely(&switch_mm_cond_ibpb)) {
+ unsigned long prev_mm, next_mm;
+
+ /*
+ * This is a bit more complex than the always mode because
+ * it has to handle two cases:
+ *
+ * 1) Switch from a user space task (potential attacker)
+ * which has TIF_SPEC_IB set to a user space task
+ * (potential victim) which has TIF_SPEC_IB not set.
+ *
+ * 2) Switch from a user space task (potential attacker)
+ * which has TIF_SPEC_IB not set to a user space task
+ * (potential victim) which has TIF_SPEC_IB set.
+ *
+ * This could be done by unconditionally issuing IBPB when
+ * a task which has TIF_SPEC_IB set is either scheduled in
+ * or out. Though that results in two flushes when:
+ *
+ * - the same user space task is scheduled out and later
+ * scheduled in again and only a kernel thread ran in
+ * between.
+ *
+ * - a user space task belonging to the same process is
+ * scheduled in after a kernel thread ran in between
+ *
+ * - a user space task belonging to the same process is
+ * scheduled in immediately.
+ *
+ * Optimize this with reasonably small overhead for the
+ * above cases. Mangle the TIF_SPEC_IB bit into the mm
+ * pointer of the incoming task which is stored in
+ * cpu_tlbstate.last_user_mm_ibpb for comparison.
+ */
+ next_mm = mm_mangle_tif_spec_ib(next);
+ prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
+
+ /*
+ * Issue IBPB only if the mm's are different and one or
+ * both have the IBPB bit set.
+ */
+ if (next_mm != prev_mm &&
+ (next_mm | prev_mm) & LAST_USER_MM_IBPB)
+ indirect_branch_prediction_barrier();
+
+ this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
+ }
+
+ if (static_branch_unlikely(&switch_mm_always_ibpb)) {
+ /*
+ * Only flush when switching to a user space task with a
+ * different context than the user space task which ran
+ * last on this CPU.
+ */
+ if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+ indirect_branch_prediction_barrier();
+ this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
+ }
+ }
+}
+
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
unsigned cpu = smp_processor_id();
if (likely(prev != next)) {
- u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
-
/*
* Avoid user/user BTB poisoning by flushing the branch
* predictor when switching between processes. This stops
* one process from doing Spectre-v2 attacks on another.
- *
- * As an optimization, flush indirect branches only when
- * switching into processes that disable dumping. This
- * protects high value processes like gpg, without having
- * too high performance overhead. IBPB is *expensive*!
- *
- * This will not flush branches when switching into kernel
- * threads. It will also not flush if we switch to idle
- * thread and back to the same process. It will flush if we
- * switch to a different non-dumpable process.
*/
- if (tsk && tsk->mm &&
- tsk->mm->context.ctx_id != last_ctx_id &&
- get_dumpable(tsk->mm) != SUID_DUMP_USER)
- indirect_branch_prediction_barrier();
+ cond_ibpb(tsk);
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
@@ -143,14 +217,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
}
- /*
- * Record last user mm's context id, so we can avoid
- * flushing branch buffer with IBPB if we switch back
- * to the same user.
- */
- if (next != &init_mm)
- this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
-
this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
this_cpu_write(cpu_tlbstate.active_mm, next);
diff --git a/arch/x86/platform/atom/punit_atom_debug.c b/arch/x86/platform/atom/punit_atom_debug.c
index d49d3be81953..ecb5866aaf84 100644
--- a/arch/x86/platform/atom/punit_atom_debug.c
+++ b/arch/x86/platform/atom/punit_atom_debug.c
@@ -154,8 +154,8 @@ static void punit_dbgfs_unregister(void)
(kernel_ulong_t)&drv_data }
static const struct x86_cpu_id intel_punit_cpu_ids[] = {
- ICPU(INTEL_FAM6_ATOM_SILVERMONT1, punit_device_byt),
- ICPU(INTEL_FAM6_ATOM_MERRIFIELD, punit_device_tng),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT, punit_device_byt),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT_MID, punit_device_tng),
ICPU(INTEL_FAM6_ATOM_AIRMONT, punit_device_cht),
{}
};
diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c
index 957d3fa3b543..8e38249311bd 100644
--- a/drivers/acpi/acpi_lpss.c
+++ b/drivers/acpi/acpi_lpss.c
@@ -243,7 +243,7 @@ static const struct lpss_device_desc bsw_spi_dev_desc = {
#define ICPU(model) { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, }
static const struct x86_cpu_id lpss_cpu_ids[] = {
- ICPU(INTEL_FAM6_ATOM_SILVERMONT1), /* Valleyview, Bay Trail */
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT), /* Valleyview, Bay Trail */
ICPU(INTEL_FAM6_ATOM_AIRMONT), /* Braswell, Cherry Trail */
{}
};
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index f1f4ce7ddb47..3b123735a1c4 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -531,11 +531,18 @@ ssize_t __weak cpu_show_l1tf(struct device *dev,
return sprintf(buf, "Not affected\n");
}
+ssize_t __weak cpu_show_mds(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "Not affected\n");
+}
+
static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
static struct attribute *cpu_root_vulnerabilities_attrs[] = {
&dev_attr_meltdown.attr,
@@ -543,6 +550,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
&dev_attr_spectre_v2.attr,
&dev_attr_spec_store_bypass.attr,
&dev_attr_l1tf.attr,
+ &dev_attr_mds.attr,
NULL
};
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f690085b1ad9..4fe999687415 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1413,7 +1413,7 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
ICPU(INTEL_FAM6_SANDYBRIDGE, core_params),
ICPU(INTEL_FAM6_SANDYBRIDGE_X, core_params),
- ICPU(INTEL_FAM6_ATOM_SILVERMONT1, silvermont_params),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT, silvermont_params),
ICPU(INTEL_FAM6_IVYBRIDGE, core_params),
ICPU(INTEL_FAM6_HASWELL_CORE, core_params),
ICPU(INTEL_FAM6_BROADWELL_CORE, core_params),
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 5ded9b22b015..a6fa32c7e068 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1107,14 +1107,14 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
ICPU(INTEL_FAM6_WESTMERE, idle_cpu_nehalem),
ICPU(INTEL_FAM6_WESTMERE_EP, idle_cpu_nehalem),
ICPU(INTEL_FAM6_NEHALEM_EX, idle_cpu_nehalem),
- ICPU(INTEL_FAM6_ATOM_PINEVIEW, idle_cpu_atom),
- ICPU(INTEL_FAM6_ATOM_LINCROFT, idle_cpu_lincroft),
+ ICPU(INTEL_FAM6_ATOM_BONNELL, idle_cpu_atom),
+ ICPU(INTEL_FAM6_ATOM_BONNELL_MID, idle_cpu_lincroft),
ICPU(INTEL_FAM6_WESTMERE_EX, idle_cpu_nehalem),
ICPU(INTEL_FAM6_SANDYBRIDGE, idle_cpu_snb),
ICPU(INTEL_FAM6_SANDYBRIDGE_X, idle_cpu_snb),
- ICPU(INTEL_FAM6_ATOM_CEDARVIEW, idle_cpu_atom),
- ICPU(INTEL_FAM6_ATOM_SILVERMONT1, idle_cpu_byt),
- ICPU(INTEL_FAM6_ATOM_MERRIFIELD, idle_cpu_tangier),
+ ICPU(INTEL_FAM6_ATOM_SALTWELL, idle_cpu_atom),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT, idle_cpu_byt),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT_MID, idle_cpu_tangier),
ICPU(INTEL_FAM6_ATOM_AIRMONT, idle_cpu_cht),
ICPU(INTEL_FAM6_IVYBRIDGE, idle_cpu_ivb),
ICPU(INTEL_FAM6_IVYBRIDGE_X, idle_cpu_ivt),
@@ -1122,7 +1122,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
ICPU(INTEL_FAM6_HASWELL_X, idle_cpu_hsw),
ICPU(INTEL_FAM6_HASWELL_ULT, idle_cpu_hsw),
ICPU(INTEL_FAM6_HASWELL_GT3E, idle_cpu_hsw),
- ICPU(INTEL_FAM6_ATOM_SILVERMONT2, idle_cpu_avn),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT_X, idle_cpu_avn),
ICPU(INTEL_FAM6_BROADWELL_CORE, idle_cpu_bdw),
ICPU(INTEL_FAM6_BROADWELL_GT3E, idle_cpu_bdw),
ICPU(INTEL_FAM6_BROADWELL_X, idle_cpu_bdw),
@@ -1134,7 +1134,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
ICPU(INTEL_FAM6_SKYLAKE_X, idle_cpu_skx),
ICPU(INTEL_FAM6_XEON_PHI_KNL, idle_cpu_knl),
ICPU(INTEL_FAM6_ATOM_GOLDMONT, idle_cpu_bxt),
- ICPU(INTEL_FAM6_ATOM_DENVERTON, idle_cpu_dnv),
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT_X, idle_cpu_dnv),
{}
};
diff --git a/drivers/mmc/host/sdhci-acpi.c b/drivers/mmc/host/sdhci-acpi.c
index 80918abfc468..4398398c0935 100644
--- a/drivers/mmc/host/sdhci-acpi.c
+++ b/drivers/mmc/host/sdhci-acpi.c
@@ -127,7 +127,7 @@ static const struct sdhci_acpi_chip sdhci_acpi_chip_int = {
static bool sdhci_acpi_byt(void)
{
static const struct x86_cpu_id byt[] = {
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1 },
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT },
{}
};
diff --git a/drivers/pci/pci-mid.c b/drivers/pci/pci-mid.c
index c7f3408e3148..54b3f9bc5ad8 100644
--- a/drivers/pci/pci-mid.c
+++ b/drivers/pci/pci-mid.c
@@ -71,8 +71,8 @@ static struct pci_platform_pm_ops mid_pci_platform_pm = {
* arch/x86/platform/intel-mid/pwr.c.
*/
static const struct x86_cpu_id lpss_cpu_ids[] = {
- ICPU(INTEL_FAM6_ATOM_PENWELL),
- ICPU(INTEL_FAM6_ATOM_MERRIFIELD),
+ ICPU(INTEL_FAM6_ATOM_SALTWELL_MID),
+ ICPU(INTEL_FAM6_ATOM_SILVERMONT_MID),
{}
};
diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 3c71f608b444..8809c1a20bed 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -1175,12 +1175,12 @@ static const struct x86_cpu_id rapl_ids[] __initconst = {
RAPL_CPU(INTEL_FAM6_KABYLAKE_MOBILE, rapl_defaults_core),
RAPL_CPU(INTEL_FAM6_KABYLAKE_DESKTOP, rapl_defaults_core),
- RAPL_CPU(INTEL_FAM6_ATOM_SILVERMONT1, rapl_defaults_byt),
+ RAPL_CPU(INTEL_FAM6_ATOM_SILVERMONT, rapl_defaults_byt),
RAPL_CPU(INTEL_FAM6_ATOM_AIRMONT, rapl_defaults_cht),
- RAPL_CPU(INTEL_FAM6_ATOM_MERRIFIELD, rapl_defaults_tng),
- RAPL_CPU(INTEL_FAM6_ATOM_MOOREFIELD, rapl_defaults_ann),
+ RAPL_CPU(INTEL_FAM6_ATOM_SILVERMONT_MID,rapl_defaults_tng),
+ RAPL_CPU(INTEL_FAM6_ATOM_AIRMONT_MID, rapl_defaults_ann),
RAPL_CPU(INTEL_FAM6_ATOM_GOLDMONT, rapl_defaults_core),
- RAPL_CPU(INTEL_FAM6_ATOM_DENVERTON, rapl_defaults_core),
+ RAPL_CPU(INTEL_FAM6_ATOM_GOLDMONT_X, rapl_defaults_core),
RAPL_CPU(INTEL_FAM6_XEON_PHI_KNL, rapl_defaults_hsw_server),
{}
diff --git a/drivers/thermal/intel_soc_dts_thermal.c b/drivers/thermal/intel_soc_dts_thermal.c
index b2bbaa1c60b0..18788109cae6 100644
--- a/drivers/thermal/intel_soc_dts_thermal.c
+++ b/drivers/thermal/intel_soc_dts_thermal.c
@@ -43,7 +43,7 @@ static irqreturn_t soc_irq_thread_fn(int irq, void *dev_data)
}
static const struct x86_cpu_id soc_thermal_ids[] = {
- { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT1, 0,
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_ATOM_SILVERMONT, 0,
BYT_SOC_DTS_APIC_IRQ},
{}
};
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index a83c822c35c2..d4b167fc9ecb 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -1,28 +1,9 @@
#ifndef _LINUX_BITOPS_H
#define _LINUX_BITOPS_H
#include <asm/types.h>
+#include <linux/bits.h>
-#ifdef __KERNEL__
-#define BIT(nr) (1UL << (nr))
-#define BIT_ULL(nr) (1ULL << (nr))
-#define BIT_MASK(nr) (1UL << ((nr) % BITS_PER_LONG))
-#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
-#define BIT_ULL_MASK(nr) (1ULL << ((nr) % BITS_PER_LONG_LONG))
-#define BIT_ULL_WORD(nr) ((nr) / BITS_PER_LONG_LONG)
-#define BITS_PER_BYTE 8
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
-#endif
-
-/*
- * Create a contiguous bitmask starting at bit position @l and ending at
- * position @h. For example
- * GENMASK_ULL(39, 21) gives us the 64bit vector 0x000000ffffe00000.
- */
-#define GENMASK(h, l) \
- (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
-
-#define GENMASK_ULL(h, l) \
- (((~0ULL) << (l)) & (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))
extern unsigned int __sw_hweight8(unsigned int w);
extern unsigned int __sw_hweight16(unsigned int w);
diff --git a/include/linux/bits.h b/include/linux/bits.h
new file mode 100644
index 000000000000..2b7b532c1d51
--- /dev/null
+++ b/include/linux/bits.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_BITS_H
+#define __LINUX_BITS_H
+#include <asm/bitsperlong.h>
+
+#define BIT(nr) (1UL << (nr))
+#define BIT_ULL(nr) (1ULL << (nr))
+#define BIT_MASK(nr) (1UL << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
+#define BIT_ULL_MASK(nr) (1ULL << ((nr) % BITS_PER_LONG_LONG))
+#define BIT_ULL_WORD(nr) ((nr) / BITS_PER_LONG_LONG)
+#define BITS_PER_BYTE 8
+
+/*
+ * Create a contiguous bitmask starting at bit position @l and ending at
+ * position @h. For example
+ * GENMASK_ULL(39, 21) gives us the 64bit vector 0x000000ffffe00000.
+ */
+#define GENMASK(h, l) \
+ (((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+
+#define GENMASK_ULL(h, l) \
+ (((~0ULL) - (1ULL << (l)) + 1) & \
+ (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))
+
+#endif /* __LINUX_BITS_H */
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ae5ac89324df..166686209f2c 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -54,6 +54,8 @@ extern ssize_t cpu_show_spec_store_bypass(struct device *dev,
struct device_attribute *attr, char *buf);
extern ssize_t cpu_show_l1tf(struct device *dev,
struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_mds(struct device *dev,
+ struct device_attribute *attr, char *buf);
extern __printf(4, 5)
struct device *cpu_device_create(struct device *parent, void *drvdata,
@@ -276,4 +278,28 @@ static inline void cpu_smt_check_topology_early(void) { }
static inline void cpu_smt_check_topology(void) { }
#endif
+/*
+ * These are used for a global "mitigations=" cmdline option for toggling
+ * optional CPU mitigations.
+ */
+enum cpu_mitigations {
+ CPU_MITIGATIONS_OFF,
+ CPU_MITIGATIONS_AUTO,
+ CPU_MITIGATIONS_AUTO_NOSMT,
+};
+
+extern enum cpu_mitigations cpu_mitigations;
+
+/* mitigations=off */
+static inline bool cpu_mitigations_off(void)
+{
+ return cpu_mitigations == CPU_MITIGATIONS_OFF;
+}
+
+/* mitigations=auto,nosmt */
+static inline bool cpu_mitigations_auto_nosmt(void)
+{
+ return cpu_mitigations == CPU_MITIGATIONS_AUTO_NOSMT;
+}
+
#endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index d53a23100401..58ae371556bc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -60,14 +60,17 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
#define PTRACE_MODE_READ 0x01
#define PTRACE_MODE_ATTACH 0x02
#define PTRACE_MODE_NOAUDIT 0x04
-#define PTRACE_MODE_FSCREDS 0x08
-#define PTRACE_MODE_REALCREDS 0x10
+#define PTRACE_MODE_FSCREDS 0x08
+#define PTRACE_MODE_REALCREDS 0x10
+#define PTRACE_MODE_SCHED 0x20
+#define PTRACE_MODE_IBPB 0x40
/* shorthands for READ/ATTACH and FSCREDS/REALCREDS combinations */
#define PTRACE_MODE_READ_FSCREDS (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_READ_REALCREDS (PTRACE_MODE_READ | PTRACE_MODE_REALCREDS)
#define PTRACE_MODE_ATTACH_FSCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS)
#define PTRACE_MODE_ATTACH_REALCREDS (PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS)
+#define PTRACE_MODE_SPEC_IBPB (PTRACE_MODE_ATTACH_REALCREDS | PTRACE_MODE_IBPB)
/**
* ptrace_may_access - check whether the caller is permitted to access
@@ -85,6 +88,20 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
*/
extern bool ptrace_may_access(struct task_struct *task, unsigned int mode);
+/**
+ * ptrace_may_access - check whether the caller is permitted to access
+ * a target task.
+ * @task: target task
+ * @mode: selects type of access and caller credentials
+ *
+ * Returns true on success, false on denial.
+ *
+ * Similar to ptrace_may_access(). Only to be called from context switch
+ * code. Does not call into audit and the regular LSM hooks due to locking
+ * constraints.
+ */
+extern bool ptrace_may_access_sched(struct task_struct *task, unsigned int mode);
+
static inline int ptrace_reparented(struct task_struct *child)
{
return !same_thread_group(child->real_parent, child->parent);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ebd0afb35d16..1c487a3abd84 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2357,6 +2357,8 @@ static inline void memalloc_noio_restore(unsigned int flags)
#define PFA_LMK_WAITING 3 /* Lowmemorykiller is waiting */
#define PFA_SPEC_SSB_DISABLE 4 /* Speculative Store Bypass disabled */
#define PFA_SPEC_SSB_FORCE_DISABLE 5 /* Speculative Store Bypass force disabled*/
+#define PFA_SPEC_IB_DISABLE 6 /* Indirect branch speculation restricted */
+#define PFA_SPEC_IB_FORCE_DISABLE 7 /* Indirect branch speculation permanently restricted */
#define TASK_PFA_TEST(name, func) \
@@ -2390,6 +2392,13 @@ TASK_PFA_CLEAR(SPEC_SSB_DISABLE, spec_ssb_disable)
TASK_PFA_TEST(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable)
TASK_PFA_SET(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable)
+TASK_PFA_TEST(SPEC_IB_DISABLE, spec_ib_disable)
+TASK_PFA_SET(SPEC_IB_DISABLE, spec_ib_disable)
+TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
+
+TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
+TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
+
/*
* task->jobctl flags
*/
diff --git a/include/linux/sched/smt.h b/include/linux/sched/smt.h
new file mode 100644
index 000000000000..559ac4590593
--- /dev/null
+++ b/include/linux/sched/smt.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_SMT_H
+#define _LINUX_SCHED_SMT_H
+
+#include <linux/atomic.h>
+
+#ifdef CONFIG_SCHED_SMT
+extern atomic_t sched_smt_present;
+
+static __always_inline bool sched_smt_active(void)
+{
+ return atomic_read(&sched_smt_present);
+}
+#else
+static inline bool sched_smt_active(void) { return false; }
+#endif
+
+void arch_smt_update(void);
+
+#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 64776b72e1eb..64ec0d62e5f5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,6 +202,7 @@ struct prctl_mm_map {
#define PR_SET_SPECULATION_CTRL 53
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
+# define PR_SPEC_INDIRECT_BRANCH 1
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
# define PR_SPEC_NOT_AFFECTED 0
# define PR_SPEC_PRCTL (1UL << 0)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f39a7be98fc1..efba851ee018 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -258,6 +258,9 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
{
+ if (mode & PTRACE_MODE_SCHED)
+ return false;
+
if (mode & PTRACE_MODE_NOAUDIT)
return has_ns_capability_noaudit(current, ns, CAP_SYS_PTRACE);
else
@@ -325,9 +328,16 @@ ok:
!ptrace_has_cap(mm->user_ns, mode)))
return -EPERM;
+ if (mode & PTRACE_MODE_SCHED)
+ return 0;
return security_ptrace_access_check(task, mode);
}
+bool ptrace_may_access_sched(struct task_struct *task, unsigned int mode)
+{
+ return __ptrace_may_access(task, mode | PTRACE_MODE_SCHED);
+}
+
bool ptrace_may_access(struct task_struct *task, unsigned int mode)
{
int err;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6b3fff6a6437..50e80b1be2c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7355,11 +7355,22 @@ static int cpuset_cpu_inactive(unsigned int cpu)
return 0;
}
+#ifdef CONFIG_SCHED_SMT
+atomic_t sched_smt_present = ATOMIC_INIT(0);
+#endif
+
int sched_cpu_activate(unsigned int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * When going up, increment the number of cores with SMT present.
+ */
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ atomic_inc(&sched_smt_present);
+#endif
set_cpu_active(cpu, true);
if (sched_smp_initialized) {
@@ -7408,6 +7419,14 @@ int sched_cpu_deactivate(unsigned int cpu)
else
synchronize_rcu();
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * When going down, decrement the number of cores with SMT present.
+ */
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ atomic_dec(&sched_smt_present);
+#endif
+
if (!sched_smp_initialized)
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec6e838e991a..15c08752926b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2,6 +2,7 @@
#include <linux/sched.h>
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
+#include <linux/sched/smt.h>
#include <linux/u64_stats_sync.h>
#include <linux/sched/deadline.h>
#include <linux/kernel_stat.h>
diff --git a/tools/power/x86/turbostat/Makefile b/tools/power/x86/turbostat/Makefile
index 8561e7ddca59..92be948c922d 100644
--- a/tools/power/x86/turbostat/Makefile
+++ b/tools/power/x86/turbostat/Makefile
@@ -8,7 +8,7 @@ ifeq ("$(origin O)", "command line")
endif
turbostat : turbostat.c
-CFLAGS += -Wall
+CFLAGS += -Wall -I../../../include
CFLAGS += -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
%: %.c