From patchwork Wed Jul 6 22:00:10 2022
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
X-Patchwork-Id: 12908714
Date: Wed, 6 Jul 2022 16:00:10 -0600
Message-Id: <20220706220022.968789-2-yuzhao@google.com>
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
References: <20220706220022.968789-1-yuzhao@google.com>
Subject: [PATCH v13 01/14] mm: x86, arm64: add arch_has_hw_pte_young()
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>

Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from built-in ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).

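A minimal sketch of how a caller might act on this capability (not part
of this patch; scan_ptes_example() and the batch size of 64 are
hypothetical):

/*
 * Hypothetical example: clear the accessed bit over a range of PTEs,
 * batching less aggressively when each cleared PTE will later be
 * repopulated through a minor page fault.
 */
static void scan_ptes_example(struct vm_area_struct *vma, pte_t *pte,
			      unsigned long addr, unsigned long end)
{
	unsigned long batch = arch_has_hw_pte_young() ? ULONG_MAX : 64;

	for (; addr != end && batch; addr += PAGE_SIZE, pte++, batch--)
		ptep_test_and_clear_young(vma, addr, pte);
}
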
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
arch/arm64/include/asm/pgtable.h | 15 ++-------------
arch/x86/include/asm/pgtable.h | 6 +++---
include/linux/pgtable.h | 13 +++++++++++++
mm/memory.c | 14 +-------------
4 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b6632f18364..c46399c0500c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1066,24 +1066,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
* page after fork() + CoW for pfn mappings. We don't always have a
* hardware-managed access flag on arm64.
*/
-static inline bool arch_faults_on_old_pte(void)
-{
- /* The register read below requires a stable CPU to make any sense */
- cant_migrate();
-
- return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
+#define arch_has_hw_pte_young cpu_has_hw_af

/*
* Experimentally, it's cheap to set the access flag in hardware and we
* benefit from prefaulting mappings as 'old' to start with.
*/
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
- return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte cpu_has_hw_af

static inline bool pud_sect_supported(void)
{
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..dc5f7d8ef68a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1431,10 +1431,10 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
}

-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
{
- return false;
+ return true;
}

#ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3cdc16cfd867..8eee31bc9bde 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported on the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+ return false;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 7a089145cad4..49500390b91b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -125,18 +125,6 @@ int randomize_va_space __read_mostly =
2;
#endif

-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
- /*
- * Those arches which don't have hw access flag feature need to
- * implement their own helper. By default, "true" means pagefault
- * will be hit on old pte.
- */
- return true;
-}
-#endif
-
#ifndef arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
@@ -2862,7 +2850,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);

From patchwork Wed Jul 6 22:00:11 2022
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
X-Patchwork-Id: 12908715
Date: Wed, 6 Jul 2022 16:00:11 -0600
Message-Id: <20220706220022.968789-3-yuzhao@google.com>
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
References: <20220706220022.968789-1-yuzhao@google.com>
Subject: [PATCH v13 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>

Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.

Note that:
1. Although an inline function is preferable, this capability is added
   as a configuration option for consistency with the existing macros.
2. Due to the limited interest in other varieties, this capability was
   only tested on Intel and AMD CPUs.

Thanks to the following developers for their efforts [2][3].
  Randy Dunlap <rdunlap@infradead.org>
  Stephen Rothwell <sfr@canb.auug.org.au>

[1] Intel 64 and IA-32 Architectures Software Developer's Manual
    Volume 3 (June 2021), section 4.8
[2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
[3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/

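A minimal sketch of how a page table walker might use this capability
(not part of this patch; pmd_range_was_used_example() is a hypothetical
helper):

/*
 * Hypothetical example: when the architecture maintains the accessed
 * bit in non-leaf PMD entries, a clear bit means no PTE under this PMD
 * was used since the last scan, so the whole range can be skipped.
 */
static bool pmd_range_was_used_example(struct vm_area_struct *vma,
				       unsigned long addr, pmd_t *pmd)
{
	if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
		return pmdp_test_and_clear_young(vma, addr, pmd);

	/* Otherwise every PTE under this PMD has to be checked. */
	return true;
}
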
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
arch/Kconfig | 8 ++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 3 ++-
arch/x86/mm/pgtable.c | 5 ++++-
include/linux/pgtable.h | 4 ++--
5 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index fcf9a41a4ef5..eaeec187bd6a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1403,6 +1403,14 @@ config DYNAMIC_SIGFRAME
config HAVE_ARCH_NODE_DEV_GROUP
bool

+config ARCH_HAS_NONLEAF_PMD_YOUNG
+ bool
+ help
+ Architectures that select this option are capable of setting the
+ accessed bit in non-leaf PMD entries when using them as part of linear
+ address translations. Page table walkers that clear the accessed bit
+ may use this capability to reduce their search space.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df6..5715111abe13 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
+ select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_COPY_MC if X86_64
select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index dc5f7d8ef68a..5059799bebe3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -815,7 +815,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)

static inline int pmd_bad(pmd_t pmd)
{
- return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+ return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+ (_KERNPG_TABLE & ~_PAGE_ACCESSED);
}

static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index a932d7712d85..8525f2876fb4 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
return ret;
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,

return ret;
}
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int pudp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8eee31bc9bde..9c57c5cc49c2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -213,7 +213,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
#endif

#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
@@ -234,7 +234,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
BUILD_BUG();
return 0;
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
#endif

#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH

From patchwork Wed Jul 6 22:00:12 2022
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
X-Patchwork-Id: 12908716
Date: Wed, 6 Jul 2022 16:00:12 -0600
Message-Id: <20220706220022.968789-4-yuzhao@google.com>
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
References: <20220706220022.968789-1-yuzhao@google.com>
Subject: [PATCH v13 03/14] mm/vmscan.c: refactor shrink_node()
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>

This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

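The net effect on shrink_node() is roughly the following (an abridged
sketch of the result, not the complete function; see the diff below):

static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
	/* ... declarations and target_lruvec setup unchanged ... */
again:
	memset(&sc->nr, 0, sizeof(sc->nr));

	nr_reclaimed = sc->nr_reclaimed;
	nr_scanned = sc->nr_scanned;

	/* ~100 lines of scan-balance setup now live in one helper. */
	prepare_scan_count(pgdat, sc);

	shrink_node_memcgs(pgdat, sc);
	/* ... */
}
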
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
mm/vmscan.c | 198 +++++++++++++++++++++++++++-------------------------
1 file changed, 104 insertions(+), 94 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..fddb9bd3c6c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2670,6 +2670,109 @@ enum scan_balance {
SCAN_FILE,
};

+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+ unsigned long file;
+ struct lruvec *target_lruvec;
+
+ target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+ /*
+ * Flush the memory cgroup stats, so that we read accurate per-memcg
+ * lruvec stats for heuristics.
+ */
+ mem_cgroup_flush_stats();
+
+ /*
+ * Determine the scan balance between anon and file LRUs.
+ */
+ spin_lock_irq(&target_lruvec->lru_lock);
+ sc->anon_cost = target_lruvec->anon_cost;
+ sc->file_cost = target_lruvec->file_cost;
+ spin_unlock_irq(&target_lruvec->lru_lock);
+
+ /*
+ * Target desirable inactive:active list ratios for the anon
+ * and file LRU lists.
+ */
+ if (!sc->force_deactivate) {
+ unsigned long refaults;
+
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_ANON);
+ if (refaults != target_lruvec->refaults[0] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+ sc->may_deactivate |= DEACTIVATE_ANON;
+ else
+ sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+ /*
+ * When refaults are being observed, it means a new
+ * workingset is being established. Deactivate to get
+ * rid of any stale active pages quickly.
+ */
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_FILE);
+ if (refaults != target_lruvec->refaults[1] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+ sc->may_deactivate |= DEACTIVATE_FILE;
+ else
+ sc->may_deactivate &= ~DEACTIVATE_FILE;
+ } else
+ sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+ /*
+ * If we have plenty of inactive file pages that aren't
+ * thrashing, try to reclaim those first before touching
+ * anonymous pages.
+ */
+ file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+ if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+ sc->cache_trim_mode = 1;
+ else
+ sc->cache_trim_mode = 0;
+
+ /*
+ * Prevent the reclaimer from falling into the cache trap: as
+ * cache pages start out inactive, every cache fault will tip
+ * the scan balance towards the file LRU. And as the file LRU
+ * shrinks, so does the window for rotation from references.
+ * This means we have a runaway feedback loop where a tiny
+ * thrashing file LRU becomes infinitely more attractive than
+ * anon pages. Try to detect this based on file LRU size.
+ */
+ if (!cgroup_reclaim(sc)) {
+ unsigned long total_high_wmark = 0;
+ unsigned long free, anon;
+ int z;
+
+ free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+ file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE);
+
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = &pgdat->node_zones[z];
+
+ if (!managed_zone(zone))
+ continue;
+
+ total_high_wmark += high_wmark_pages(zone);
+ }
+
+ /*
+ * Consider anon: if that's low too, this isn't a
+ * runaway file reclaim problem, but rather just
+ * extreme pressure. Reclaim as per usual then.
+ */
+ anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+ sc->file_is_tiny =
+ file + free <= total_high_wmark &&
+ !(sc->may_deactivate & DEACTIVATE_ANON) &&
+ anon >> sc->priority;
+ }
+}
+
/*
* Determine how aggressively the anon and file LRU lists should be
* scanned.
@@ -3138,109 +3241,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long nr_reclaimed, nr_scanned;
struct lruvec *target_lruvec;
bool reclaimable = false;
- unsigned long file;

target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);

again:
- /*
- * Flush the memory cgroup stats, so that we read accurate per-memcg
- * lruvec stats for heuristics.
- */
- mem_cgroup_flush_stats();
-
memset(&sc->nr, 0, sizeof(sc->nr));

nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;

- /*
- * Determine the scan balance between anon and file LRUs.
- */
- spin_lock_irq(&target_lruvec->lru_lock);
- sc->anon_cost = target_lruvec->anon_cost;
- sc->file_cost = target_lruvec->file_cost;
- spin_unlock_irq(&target_lruvec->lru_lock);
-
- /*
- * Target desirable inactive:active list ratios for the anon
- * and file LRU lists.
- */
- if (!sc->force_deactivate) {
- unsigned long refaults;
-
- refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE_ANON);
- if (refaults != target_lruvec->refaults[0] ||
- inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
- sc->may_deactivate |= DEACTIVATE_ANON;
- else
- sc->may_deactivate &= ~DEACTIVATE_ANON;
-
- /*
- * When refaults are being observed, it means a new
- * workingset is being established. Deactivate to get
- * rid of any stale active pages quickly.
- */
- refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE_FILE);
- if (refaults != target_lruvec->refaults[1] ||
- inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
- sc->may_deactivate |= DEACTIVATE_FILE;
- else
- sc->may_deactivate &= ~DEACTIVATE_FILE;
- } else
- sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
- /*
- * If we have plenty of inactive file pages that aren't
- * thrashing, try to reclaim those first before touching
- * anonymous pages.
- */
- file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
- sc->cache_trim_mode = 1;
- else
- sc->cache_trim_mode = 0;
-
- /*
- * Prevent the reclaimer from falling into the cache trap: as
- * cache pages start out inactive, every cache fault will tip
- * the scan balance towards the file LRU. And as the file LRU
- * shrinks, so does the window for rotation from references.
- * This means we have a runaway feedback loop where a tiny
- * thrashing file LRU becomes infinitely more attractive than
- * anon pages. Try to detect this based on file LRU size.
- */
- if (!cgroup_reclaim(sc)) {
- unsigned long total_high_wmark = 0;
- unsigned long free, anon;
- int z;
-
- free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
- file = node_page_state(pgdat, NR_ACTIVE_FILE) +
- node_page_state(pgdat, NR_INACTIVE_FILE);
-
- for (z = 0; z < MAX_NR_ZONES; z++) {
- struct zone *zone = &pgdat->node_zones[z];
- if (!managed_zone(zone))
- continue;
-
- total_high_wmark += high_wmark_pages(zone);
- }
-
- /*
- * Consider anon: if that's low too, this isn't a
- * runaway file reclaim problem, but rather just
- * extreme pressure. Reclaim as per usual then.
- */
- anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
- sc->file_is_tiny =
- file + free <= total_high_wmark &&
- !(sc->may_deactivate & DEACTIVATE_ANON) &&
- anon >> sc->priority;
- }
+ prepare_scan_count(pgdat, sc);

shrink_node_memcgs(pgdat, sc);

From patchwork Wed Jul 6 22:00:13 2022
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
X-Patchwork-Id: 12908717
Date: Wed, 6 Jul 2022 16:00:13 -0600
Message-Id: <20220706220022.968789-5-yuzhao@google.com>
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
References: <20220706220022.968789-1-yuzhao@google.com>
Subject: [PATCH v13 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller"
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>

This patch undoes the following refactor:
commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller")

The upcoming changes to include/linux/mm_inline.h will reuse
__update_lru_size().

Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
include/linux/mm_inline.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 7b25b53c474a..fb8aadb81cd6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -34,7 +34,7 @@ static inline int page_is_file_lru(struct page *page)
return folio_is_file_lru(page_folio(page));
}

-static __always_inline void update_lru_size(struct lruvec *lruvec,
+static __always_inline void __update_lru_size(struct lruvec *lruvec,
enum lru_list lru, enum zone_type zid,
long nr_pages)
{
@@ -43,6 +43,13 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
__mod_zone_page_state(&pgdat->node_zones[zid],
NR_ZONE_LRU_BASE + lru, nr_pages);
+}
+
+static __always_inline void update_lru_size(struct lruvec *lruvec,
+ enum lru_list lru, enum zone_type zid,
+ long nr_pages)
+{
+ __update_lru_size(lruvec, lru, zid, nr_pages);
#ifdef CONFIG_MEMCG
mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
#endif

From patchwork Wed Jul 6 22:00:14 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908760
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id 7C2FCCCA47C
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 23:01:34 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=QhQfptQUplzgwhexjbYhxCNeKXxRhFFoHwd8RoTGX9g=; b=cl163jCxmDFOUalIyodnmGpEOy
|
|
8mMKEUy4SUBM6PJdO2AZ2qDK6eXviOtAFv8UwNqyo/KvuwH8eJ070Agw3XJGZGhOGvGmEFlXzAd5l
|
|
W7UY8xpf3cL26fDERfcNhCrrRwdkYw7WdHfHq1OJhtRq+iOqaFy5idCt8Brmhf3lYRk+SBD6wlFka
|
|
mRo8SN66fUSTSx9Cx76gcQn2m3R16KvSLFWTsN000qPgCAST3iIKOJMbyrupwB+zloH501KHw7JDJ
|
|
KBO9Yok1pqJLR9kY1cAPKICh+H9KZVG3d13vcNsVMVstCuds5zWEMsyoK5qoLzI0touOBwD5ugK9W
|
|
YQNJ6IxA==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9Dzt-00CfSK-LR; Wed, 06 Jul 2022 22:59:53 +0000
|
|
Received: from desiato.infradead.org ([2001:8b0:10b:1:d65d:64ff:fe57:4e05])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9Dzj-00CfQO-BX
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:59:43 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=desiato.20200630; h=Content-Transfer-Encoding:Content-Type
|
|
:Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:
|
|
Sender:Reply-To:Content-ID:Content-Description;
|
|
bh=y74IoQ1Un60Xq7yBx41XqudQ9pnNmGNgLv+0SiGV5r4=; b=RKsmvw/oXLQeSDokGxnb8swuMa
|
|
MwGXUM7Tp5wWFX/5CjCLfm4aw/hp6f07yNthMmPxDfZi12up4f5/HEf155K/ZosBKj9ZpZm9JuyMN
|
|
Er3pmd/K+RWFZouI2/nI9kLGmvoLufimFrgHvfqfF3HPQuDZA4Do00HJbz2tW3ibf+x1z7eAwYV7M
|
|
dM1i2xb+w7SPC7y4RWxzhcKQMt+hVPRfCPsLsZrgx03jHQzzVhiUJJRh66F0NrOdBuEgBAyugpwZL
|
|
2i5WYpUstkFHZq8uwjVJ/2IKA2hxnrm2j00dUxK1pGB2lRDMjk/lsBDRGmk0EU6Pe1JnYYY7khhyX
|
|
l0655uMQ==;
|
|
Received: from mail-io1-xd4a.google.com ([2607:f8b0:4864:20::d4a])
|
|
by desiato.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D59-000zVe-Et
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:19 +0000
|
|
Received: by mail-io1-xd4a.google.com with SMTP id
|
|
q75-20020a6b8e4e000000b0067275f1e6c4so8871362iod.14
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:00:59 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=y74IoQ1Un60Xq7yBx41XqudQ9pnNmGNgLv+0SiGV5r4=;
|
|
b=p293qdy+AJ1NK8wVIFYa38QTJD9CsNtfxZWrFxc99swgPytMvFTFgMhkjdcKezzZie
|
|
yrDLuqEO4g2bHuYcfru6gtGl/vlEBzugJSUw9t9SSuHD0KPbwuSBuj6k/Z4E6o/3VSjs
|
|
nmEwp3FaQzQrq+AvQ75NBZLJcjJnu2S/L2SRP5n2jtLL27l7UQfJTw+nlDEN61Y6wnKm
|
|
cTbYVguOwFUEjdFi2ghze0M0n87A9CNsBCyQHS9wRzczRWbW6m+LMwO/fsge9KEjZcyq
|
|
WUlwLSCnJuEi3hDOUrhrpLVnbT1LO6KIzff4/TXK4ud4HZ+BORPfFQeF2zBQpAIt8foH
|
|
VdwQ==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=y74IoQ1Un60Xq7yBx41XqudQ9pnNmGNgLv+0SiGV5r4=;
|
|
b=EYHMYjhSXJ27z03LCpk5h5v9jwwNPJtkyTTgPkiVJ8bzaqW9YxlWmAGnVL3DdOYGxx
|
|
5o8pzQ2ASFHriwH8HR2zTenS4cjbthLONDOgkT4KTfHRcIm6QRikoNOPok+ld+1igcdK
|
|
RpIgPiyO7sqznFD/9v/uNO2vyp+tp7rbiGKWs2ZlJmBTOlRWJaIRDsvHwqRhDbLOSZXd
|
|
WAcQtQNtNcFvMneq51t/zaL2ay5kkQIZC44StrDjTRrNZznFYFb9rlcmQQVmzyMOQ3Hm
|
|
Ng14frHNBBjcc6oV2q+gFRMk5fYQ0cQTila0p6jlK2SQ88ops2aXqQ7Hc8A+ZVL8vzN6
|
|
Zvkw==
|
|
X-Gm-Message-State: AJIora+GWWIbsqlxOHf8Z+W0MYO2ytPz+wWlKbiBK5BlbLsuEcwUUiJb
|
|
oa50KgBx6Qe3s0QGWooqb1gW0QjWsAw=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1vTPigokpkBMxkuw/ymV5qWW3cjnNF2AOB7Hi8viYhEQm+kOAzrEtDgBoJ1BwoaUWa5EKU0D3T6qsI=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a05:6638:2114:b0:33e:8e12:e5ee with SMTP id
|
|
n20-20020a056638211400b0033e8e12e5eemr22734068jaj.281.1657144858015; Wed, 06
|
|
Jul 2022 15:00:58 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:14 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-6-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 05/14] mm: multi-gen LRU: groundwork
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230116_407114_E634BA49
|
|
X-CRM114-Status: GOOD ( 31.69 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

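For reference, the arithmetic behind the gen counter is compact enough to
show in a few lines. The following is a minimal userspace sketch, not part
of the patch; MAX_NR_GENS, LRU_GEN_PGOFF and LRU_GEN_MASK below are
stand-ins for the kernel definitions (the real helpers are
lru_gen_from_seq() and folio_lru_gen() added by this patch):

  /* Userspace sketch of the gen counter arithmetic; values are illustrative. */
  #include <stdio.h>

  #define MAX_NR_GENS   4UL
  #define LRU_GEN_PGOFF 24UL                      /* placement is config-dependent */
  #define LRU_GEN_MASK  (((1UL << 3) - 1) << LRU_GEN_PGOFF)

  static unsigned long gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;               /* index into lrugen->lists[] */
  }

  int main(void)
  {
          unsigned long seq = 7, gen = gen_from_seq(seq), flags = 0;

          /* the gen counter stores gen+1; 0 means not on lrugen->lists[] */
          flags = (flags & ~LRU_GEN_MASK) | ((gen + 1) << LRU_GEN_PGOFF);
          printf("seq %lu -> gen %lu, counter %lu\n", seq, gen,
                 (flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF);
          return 0;
  }
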
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These techniques are
commonly used to optimize job scheduling (bin packing) in data
centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch so that later patches produce smaller diffs.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

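As an illustration only (not the kernel code), the second-chance protocol
can be sketched as follows; the accessed/checked fields stand in for the
hardware accessed bit and the aging's bookkeeping:

  /* Schematic of second chance: evict only after a check that follows a clear. */
  #include <stdbool.h>
  #include <stdio.h>

  struct page_sketch {
          bool accessed;          /* stands in for the hardware accessed bit */
          bool checked;           /* has the aging looked at this page before? */
  };

  static bool hand_over_to_eviction(struct page_sketch *p)
  {
          bool seen_before = p->checked;
          bool used = p->accessed;

          p->checked = true;
          p->accessed = false;    /* every check clears the bit */

          /* evictable only if a previous check already cleared the bit */
          return seen_before && !used;
  }

  int main(void)
  {
          struct page_sketch p = { .accessed = true }; /* set by the initial fault */

          for (int i = 1; i <= 2; i++)
                  printf("check %d: evictable=%d\n", i, hand_over_to_eviction(&p));
          return 0;
  }
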
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

---
 fs/fuse/dev.c                     |   3 +-
 include/linux/mm.h                |   2 +
 include/linux/mm_inline.h         | 175 ++++++++++++++++++++++++++++++
 include/linux/mmzone.h            | 100 +++++++++++++++++
 include/linux/page-flags-layout.h |  13 ++-
 include/linux/page-flags.h        |   4 +-
 include/linux/sched.h             |   4 +
 kernel/bounds.c                   |   5 +
 mm/Kconfig                        |   8 ++
 mm/huge_memory.c                  |   3 +-
 mm/memcontrol.c                   |   2 +
 mm/memory.c                       |  25 +++++
 mm/mm_init.c                      |   6 +-
 mm/mmzone.c                       |   2 +
 mm/swap.c                         |   9 +-
 mm/vmscan.c                       |  75 +++++++++++++
 16 files changed, 423 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
|
|
index 0e537e580dc1..5d36015071d2 100644
|
|
--- a/fs/fuse/dev.c
|
|
+++ b/fs/fuse/dev.c
|
|
@@ -777,7 +777,8 @@ static int fuse_check_page(struct page *page)
|
|
1 << PG_active |
|
|
1 << PG_workingset |
|
|
1 << PG_reclaim |
|
|
- 1 << PG_waiters))) {
|
|
+ 1 << PG_waiters |
|
|
+ LRU_GEN_MASK | LRU_REFS_MASK))) {
|
|
dump_page(page, "fuse: trying to steal weird page");
|
|
return 1;
|
|
}
|
|
diff --git a/include/linux/mm.h b/include/linux/mm.h
|
|
index cf3d0d673f6b..ed5393e5930d 100644
|
|
--- a/include/linux/mm.h
|
|
+++ b/include/linux/mm.h
|
|
@@ -1060,6 +1060,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
|
|
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
|
|
#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
|
|
#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
|
|
+#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
|
|
+#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
|
|
|
|
/*
|
|
* Define the bit shifts to access each section. For non-existent
|
|
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
|
|
index fb8aadb81cd6..2ff703900fd0 100644
|
|
--- a/include/linux/mm_inline.h
|
|
+++ b/include/linux/mm_inline.h
|
|
@@ -40,6 +40,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
|
|
{
|
|
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
|
|
|
|
+ lockdep_assert_held(&lruvec->lru_lock);
|
|
+ WARN_ON_ONCE(nr_pages != (int)nr_pages);
|
|
+
|
|
__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
|
|
__mod_zone_page_state(&pgdat->node_zones[zid],
|
|
NR_ZONE_LRU_BASE + lru, nr_pages);
|
|
@@ -101,11 +104,177 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
|
|
return lru;
|
|
}
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+
|
|
+static inline bool lru_gen_enabled(void)
|
|
+{
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_in_fault(void)
|
|
+{
|
|
+ return current->in_lru_fault;
|
|
+}
|
|
+
|
|
+static inline int lru_gen_from_seq(unsigned long seq)
|
|
+{
|
|
+ return seq % MAX_NR_GENS;
|
|
+}
|
|
+
|
|
+static inline int folio_lru_gen(struct folio *folio)
|
|
+{
|
|
+ unsigned long flags = READ_ONCE(folio->flags);
|
|
+
|
|
+ return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
|
|
+{
|
|
+ unsigned long max_seq = lruvec->lrugen.max_seq;
|
|
+
|
|
+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
|
|
+
|
|
+ /* see the comment on MIN_NR_GENS */
|
|
+ return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
|
|
+}
|
|
+
|
|
+static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
|
|
+ int old_gen, int new_gen)
|
|
+{
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ int zone = folio_zonenum(folio);
|
|
+ int delta = folio_nr_pages(folio);
|
|
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS);
|
|
+ VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS);
|
|
+ VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1);
|
|
+
|
|
+ if (old_gen >= 0)
|
|
+ WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
|
|
+ lrugen->nr_pages[old_gen][type][zone] - delta);
|
|
+ if (new_gen >= 0)
|
|
+ WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
|
|
+ lrugen->nr_pages[new_gen][type][zone] + delta);
|
|
+
|
|
+ /* addition */
|
|
+ if (old_gen < 0) {
|
|
+ if (lru_gen_is_active(lruvec, new_gen))
|
|
+ lru += LRU_ACTIVE;
|
|
+ __update_lru_size(lruvec, lru, zone, delta);
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ /* deletion */
|
|
+ if (new_gen < 0) {
|
|
+ if (lru_gen_is_active(lruvec, old_gen))
|
|
+ lru += LRU_ACTIVE;
|
|
+ __update_lru_size(lruvec, lru, zone, -delta);
|
|
+ return;
|
|
+ }
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
+{
|
|
+ unsigned long seq;
|
|
+ unsigned long flags;
|
|
+ int gen = folio_lru_gen(folio);
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ int zone = folio_zonenum(folio);
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(gen != -1, folio);
|
|
+
|
|
+ if (folio_test_unevictable(folio))
|
|
+ return false;
|
|
+ /*
|
|
+ * There are three common cases for this page:
|
|
+ * 1. If it's hot, e.g., freshly faulted in or previously hot and
|
|
+ * migrated, add it to the youngest generation.
|
|
+ * 2. If it's cold but can't be evicted immediately, i.e., an anon page
|
|
+ * not in swapcache or a dirty page pending writeback, add it to the
|
|
+ * second oldest generation.
|
|
+ * 3. Everything else (clean, cold) is added to the oldest generation.
|
|
+ */
|
|
+ if (folio_test_active(folio))
|
|
+ seq = lrugen->max_seq;
|
|
+ else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
|
|
+ (folio_test_reclaim(folio) &&
|
|
+ (folio_test_dirty(folio) || folio_test_writeback(folio))))
|
|
+ seq = lrugen->min_seq[type] + 1;
|
|
+ else
|
|
+ seq = lrugen->min_seq[type];
|
|
+
|
|
+ gen = lru_gen_from_seq(seq);
|
|
+ flags = (gen + 1UL) << LRU_GEN_PGOFF;
|
|
+ /* see the comment on MIN_NR_GENS about PG_active */
|
|
+ set_mask_bits(&folio->flags, LRU_GEN_MASK | BIT(PG_active), flags);
|
|
+
|
|
+ lru_gen_update_size(lruvec, folio, -1, gen);
|
|
+ /* for folio_rotate_reclaimable() */
|
|
+ if (reclaiming)
|
|
+ list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
|
|
+ else
|
|
+ list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
+{
|
|
+ unsigned long flags;
|
|
+ int gen = folio_lru_gen(folio);
|
|
+
|
|
+ if (gen < 0)
|
|
+ return false;
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
|
|
+
|
|
+ /* for folio_migrate_flags() */
|
|
+ flags = !reclaiming && lru_gen_is_active(lruvec, gen) ? BIT(PG_active) : 0;
|
|
+ flags = set_mask_bits(&folio->flags, LRU_GEN_MASK, flags);
|
|
+ gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
|
|
+
|
|
+ lru_gen_update_size(lruvec, folio, gen, -1);
|
|
+ list_del(&folio->lru);
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
+#else /* !CONFIG_LRU_GEN */
|
|
+
|
|
+static inline bool lru_gen_enabled(void)
|
|
+{
|
|
+ return false;
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_in_fault(void)
|
|
+{
|
|
+ return false;
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
+{
|
|
+ return false;
|
|
+}
|
|
+
|
|
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
+{
|
|
+ return false;
|
|
+}
|
|
+
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
static __always_inline
|
|
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
|
|
{
|
|
enum lru_list lru = folio_lru_list(folio);
|
|
|
|
+ if (lru_gen_add_folio(lruvec, folio, false))
|
|
+ return;
|
|
+
|
|
update_lru_size(lruvec, lru, folio_zonenum(folio),
|
|
folio_nr_pages(folio));
|
|
if (lru != LRU_UNEVICTABLE)
|
|
@@ -123,6 +292,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
|
|
{
|
|
enum lru_list lru = folio_lru_list(folio);
|
|
|
|
+ if (lru_gen_add_folio(lruvec, folio, true))
|
|
+ return;
|
|
+
|
|
update_lru_size(lruvec, lru, folio_zonenum(folio),
|
|
folio_nr_pages(folio));
|
|
/* This is not expected to be used on LRU_UNEVICTABLE */
|
|
@@ -140,6 +312,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
|
|
{
|
|
enum lru_list lru = folio_lru_list(folio);
|
|
|
|
+ if (lru_gen_del_folio(lruvec, folio, false))
|
|
+ return;
|
|
+
|
|
if (lru != LRU_UNEVICTABLE)
|
|
list_del(&folio->lru);
|
|
update_lru_size(lruvec, lru, folio_zonenum(folio),
|
|
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
|
|
index aab70355d64f..c90c2282044e 100644
|
|
--- a/include/linux/mmzone.h
|
|
+++ b/include/linux/mmzone.h
|
|
@@ -314,6 +314,102 @@ enum lruvec_flags {
|
|
*/
|
|
};
|
|
|
|
+#endif /* !__GENERATING_BOUNDS_H */
|
|
+
|
|
+/*
|
|
+ * Evictable pages are divided into multiple generations. The youngest and the
|
|
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
|
|
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
|
|
+ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
|
|
+ * corresponding generation. The gen counter in folio->flags stores gen+1 while
|
|
+ * a page is on one of lrugen->lists[]. Otherwise it stores 0.
|
|
+ *
|
|
+ * A page is added to the youngest generation on faulting. The aging needs to
|
|
+ * check the accessed bit at least twice before handing this page over to the
|
|
+ * eviction. The first check takes care of the accessed bit set on the initial
|
|
+ * fault; the second check makes sure this page hasn't been used since then.
|
|
+ * This process, AKA second chance, requires a minimum of two generations,
|
|
+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
|
|
+ * LRU, e.g., /proc/vmstat, these two generations are considered active; the
|
|
+ * rest of generations, if they exist, are considered inactive. See
|
|
+ * lru_gen_is_active().
|
|
+ *
|
|
+ * PG_active is always cleared while a page is on one of lrugen->lists[] so that
|
|
+ * the aging needs not to worry about it. And it's set again when a page
|
|
+ * considered active is isolated for non-reclaiming purposes, e.g., migration.
|
|
+ * See lru_gen_add_folio() and lru_gen_del_folio().
|
|
+ *
|
|
+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
|
|
+ * number of categories of the active/inactive LRU when keeping track of
|
|
+ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
|
|
+ * in folio->flags.
|
|
+ */
|
|
+#define MIN_NR_GENS 2U
|
|
+#define MAX_NR_GENS 4U
|
|
+
|
|
+#ifndef __GENERATING_BOUNDS_H
|
|
+
|
|
+struct lruvec;
|
|
+
|
|
+#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
|
|
+#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
|
|
+
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+
|
|
+enum {
|
|
+ LRU_GEN_ANON,
|
|
+ LRU_GEN_FILE,
|
|
+};
|
|
+
|
|
+/*
|
|
+ * The youngest generation number is stored in max_seq for both anon and file
|
|
+ * types as they are aged on an equal footing. The oldest generation numbers are
|
|
+ * stored in min_seq[] separately for anon and file types as clean file pages
|
|
+ * can be evicted regardless of swap constraints.
|
|
+ *
|
|
+ * Normally anon and file min_seq are in sync. But if swapping is constrained,
|
|
+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon
|
|
+ * min_seq behind.
|
|
+ *
|
|
+ * The number of pages in each generation is eventually consistent and therefore
|
|
+ * can be transiently negative.
|
|
+ */
|
|
+struct lru_gen_struct {
|
|
+ /* the aging increments the youngest generation number */
|
|
+ unsigned long max_seq;
|
|
+ /* the eviction increments the oldest generation numbers */
|
|
+ unsigned long min_seq[ANON_AND_FILE];
|
|
+ /* the multi-gen LRU lists, lazily sorted on eviction */
|
|
+ struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
|
|
+ /* the multi-gen LRU sizes, eventually consistent */
|
|
+ long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
|
|
+};
|
|
+
|
|
+void lru_gen_init_lruvec(struct lruvec *lruvec);
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
|
|
+void lru_gen_exit_memcg(struct mem_cgroup *memcg);
|
|
+#endif
|
|
+
|
|
+#else /* !CONFIG_LRU_GEN */
|
|
+
|
|
+static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
|
|
+{
|
|
+}
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
|
|
+{
|
|
+}
|
|
+
|
|
+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
|
|
+{
|
|
+}
|
|
+#endif
|
|
+
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
struct lruvec {
|
|
struct list_head lists[NR_LRU_LISTS];
|
|
/* per lruvec lru_lock for memcg */
|
|
@@ -331,6 +427,10 @@ struct lruvec {
|
|
unsigned long refaults[ANON_AND_FILE];
|
|
/* Various lruvec state flags (enum lruvec_flags) */
|
|
unsigned long flags;
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ /* evictable pages divided into generations */
|
|
+ struct lru_gen_struct lrugen;
|
|
+#endif
|
|
#ifdef CONFIG_MEMCG
|
|
struct pglist_data *pgdat;
|
|
#endif
|
|
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
|
|
index ef1e3e736e14..240905407a18 100644
|
|
--- a/include/linux/page-flags-layout.h
|
|
+++ b/include/linux/page-flags-layout.h
|
|
@@ -55,7 +55,8 @@
|
|
#define SECTIONS_WIDTH 0
|
|
#endif
|
|
|
|
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
|
|
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
|
|
+ <= BITS_PER_LONG - NR_PAGEFLAGS
|
|
#define NODES_WIDTH NODES_SHIFT
|
|
#elif defined(CONFIG_SPARSEMEM_VMEMMAP)
|
|
#error "Vmemmap: No space for nodes field in page flags"
|
|
@@ -89,8 +90,8 @@
|
|
#define LAST_CPUPID_SHIFT 0
|
|
#endif
|
|
|
|
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
|
|
- <= BITS_PER_LONG - NR_PAGEFLAGS
|
|
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
|
|
+ KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
|
|
#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
|
|
#else
|
|
#define LAST_CPUPID_WIDTH 0
|
|
@@ -100,10 +101,12 @@
|
|
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
|
|
#endif
|
|
|
|
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
|
|
- > BITS_PER_LONG - NR_PAGEFLAGS
|
|
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
|
|
+ KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
|
|
#error "Not enough bits in page flags"
|
|
#endif
|
|
|
|
+#define LRU_REFS_WIDTH 0
|
|
+
|
|
#endif
|
|
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
|
|
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
|
|
index e66f7aa3191d..8d466d724852 100644
|
|
--- a/include/linux/page-flags.h
|
|
+++ b/include/linux/page-flags.h
|
|
@@ -1059,7 +1059,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
|
|
1UL << PG_private | 1UL << PG_private_2 | \
|
|
1UL << PG_writeback | 1UL << PG_reserved | \
|
|
1UL << PG_slab | 1UL << PG_active | \
|
|
- 1UL << PG_unevictable | __PG_MLOCKED)
|
|
+ 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
|
|
|
|
/*
|
|
* Flags checked when a page is prepped for return by the page allocator.
|
|
@@ -1070,7 +1070,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
|
|
* alloc-free cycle to prevent from reusing the page.
|
|
*/
|
|
#define PAGE_FLAGS_CHECK_AT_PREP \
|
|
- (PAGEFLAGS_MASK & ~__PG_HWPOISON)
|
|
+ ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
|
|
|
|
#define PAGE_FLAGS_PRIVATE \
|
|
(1UL << PG_private | 1UL << PG_private_2)
|
|
diff --git a/include/linux/sched.h b/include/linux/sched.h
|
|
index c46f3a63b758..744340a96ace 100644
|
|
--- a/include/linux/sched.h
|
|
+++ b/include/linux/sched.h
|
|
@@ -912,6 +912,10 @@ struct task_struct {
|
|
#ifdef CONFIG_MEMCG
|
|
unsigned in_user_fault:1;
|
|
#endif
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ /* whether the LRU algorithm may apply to this access */
|
|
+ unsigned in_lru_fault:1;
|
|
+#endif
|
|
#ifdef CONFIG_COMPAT_BRK
|
|
unsigned brk_randomized:1;
|
|
#endif
|
|
diff --git a/kernel/bounds.c b/kernel/bounds.c
|
|
index 9795d75b09b2..5ee60777d8e4 100644
|
|
--- a/kernel/bounds.c
|
|
+++ b/kernel/bounds.c
|
|
@@ -22,6 +22,11 @@ int main(void)
|
|
DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
|
|
#endif
|
|
DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
|
|
+#else
|
|
+ DEFINE(LRU_GEN_WIDTH, 0);
|
|
+#endif
|
|
/* End of constants */
|
|
|
|
return 0;
|
|
diff --git a/mm/Kconfig b/mm/Kconfig
|
|
index 169e64192e48..cee109f3128a 100644
|
|
--- a/mm/Kconfig
|
|
+++ b/mm/Kconfig
|
|
@@ -1130,6 +1130,14 @@ config PTE_MARKER_UFFD_WP
|
|
purposes. It is required to enable userfaultfd write protection on
|
|
file-backed memory types like shmem and hugetlbfs.
|
|
|
|
+config LRU_GEN
|
|
+ bool "Multi-Gen LRU"
|
|
+ depends on MMU
|
|
+ # make sure folio->flags has enough spare bits
|
|
+ depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
|
|
+ help
|
|
+ A high performance LRU implementation to overcommit memory.
|
|
+
|
|
source "mm/damon/Kconfig"
|
|
|
|
endmenu
|
|
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
|
|
index 834f288b3769..5500583e35b8 100644
|
|
--- a/mm/huge_memory.c
|
|
+++ b/mm/huge_memory.c
|
|
@@ -2370,7 +2370,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
|
|
#ifdef CONFIG_64BIT
|
|
(1L << PG_arch_2) |
|
|
#endif
|
|
- (1L << PG_dirty)));
|
|
+ (1L << PG_dirty) |
|
|
+ LRU_GEN_MASK | LRU_REFS_MASK));
|
|
|
|
/* ->mapping in first tail page is compound_mapcount */
|
|
VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
|
|
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
|
|
index 618c366a2f07..7d58e8a73ece 100644
|
|
--- a/mm/memcontrol.c
|
|
+++ b/mm/memcontrol.c
|
|
@@ -5105,6 +5105,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
|
|
|
|
static void mem_cgroup_free(struct mem_cgroup *memcg)
|
|
{
|
|
+ lru_gen_exit_memcg(memcg);
|
|
memcg_wb_domain_exit(memcg);
|
|
__mem_cgroup_free(memcg);
|
|
}
|
|
@@ -5163,6 +5164,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
|
|
memcg->deferred_split_queue.split_queue_len = 0;
|
|
#endif
|
|
idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
|
|
+ lru_gen_init_memcg(memcg);
|
|
return memcg;
|
|
fail:
|
|
mem_cgroup_id_remove(memcg);
|
|
diff --git a/mm/memory.c b/mm/memory.c
|
|
index 49500390b91b..85d3961c2bd5 100644
|
|
--- a/mm/memory.c
|
|
+++ b/mm/memory.c
|
|
@@ -5091,6 +5091,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
|
|
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
|
|
}
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
|
|
+{
|
|
+ /* the LRU algorithm doesn't apply to sequential or random reads */
|
|
+ current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
|
|
+}
|
|
+
|
|
+static void lru_gen_exit_fault(void)
|
|
+{
|
|
+ current->in_lru_fault = false;
|
|
+}
|
|
+#else
|
|
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
|
|
+{
|
|
+}
|
|
+
|
|
+static void lru_gen_exit_fault(void)
|
|
+{
|
|
+}
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
/*
|
|
* By the time we get here, we already hold the mm semaphore
|
|
*
|
|
@@ -5122,11 +5143,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
|
|
if (flags & FAULT_FLAG_USER)
|
|
mem_cgroup_enter_user_fault();
|
|
|
|
+ lru_gen_enter_fault(vma);
|
|
+
|
|
if (unlikely(is_vm_hugetlb_page(vma)))
|
|
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
|
|
else
|
|
ret = __handle_mm_fault(vma, address, flags);
|
|
|
|
+ lru_gen_exit_fault();
|
|
+
|
|
if (flags & FAULT_FLAG_USER) {
|
|
mem_cgroup_exit_user_fault();
|
|
/*
|
|
diff --git a/mm/mm_init.c b/mm/mm_init.c
|
|
index 9ddaf0e1b0ab..0d7b2bd2454a 100644
|
|
--- a/mm/mm_init.c
|
|
+++ b/mm/mm_init.c
|
|
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
|
|
|
|
shift = 8 * sizeof(unsigned long);
|
|
width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
|
|
- - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
|
|
+ - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
|
|
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
|
|
- "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
|
|
+ "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
|
|
SECTIONS_WIDTH,
|
|
NODES_WIDTH,
|
|
ZONES_WIDTH,
|
|
LAST_CPUPID_WIDTH,
|
|
KASAN_TAG_WIDTH,
|
|
+ LRU_GEN_WIDTH,
|
|
+ LRU_REFS_WIDTH,
|
|
NR_PAGEFLAGS);
|
|
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
|
|
"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
|
|
diff --git a/mm/mmzone.c b/mm/mmzone.c
|
|
index 0ae7571e35ab..68e1511be12d 100644
|
|
--- a/mm/mmzone.c
|
|
+++ b/mm/mmzone.c
|
|
@@ -88,6 +88,8 @@ void lruvec_init(struct lruvec *lruvec)
|
|
* Poison its list head, so that any operations on it would crash.
|
|
*/
|
|
list_del(&lruvec->lists[LRU_UNEVICTABLE]);
|
|
+
|
|
+ lru_gen_init_lruvec(lruvec);
|
|
}
|
|
|
|
#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
|
|
diff --git a/mm/swap.c b/mm/swap.c
|
|
index 034bb24879a3..b062729b340f 100644
|
|
--- a/mm/swap.c
|
|
+++ b/mm/swap.c
|
|
@@ -460,6 +460,11 @@ void folio_add_lru(struct folio *folio)
|
|
VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
|
|
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
|
|
|
|
+ /* see the comment in lru_gen_add_folio() */
|
|
+ if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
|
|
+ lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
|
|
+ folio_set_active(folio);
|
|
+
|
|
folio_get(folio);
|
|
local_lock(&lru_pvecs.lock);
|
|
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
|
|
@@ -551,7 +556,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
|
|
|
|
static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
|
|
{
|
|
- if (PageActive(page) && !PageUnevictable(page)) {
|
|
+ if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
|
|
int nr_pages = thp_nr_pages(page);
|
|
|
|
del_page_from_lru_list(page, lruvec);
|
|
@@ -666,7 +671,7 @@ void deactivate_file_folio(struct folio *folio)
|
|
*/
|
|
void deactivate_page(struct page *page)
|
|
{
|
|
- if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
|
|
+ if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
|
|
struct pagevec *pvec;
|
|
|
|
local_lock(&lru_pvecs.lock);
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index fddb9bd3c6c2..1fcc0feed985 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -2992,6 +2992,81 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
|
|
return can_demote(pgdat->node_id, sc);
|
|
}
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+
|
|
+/******************************************************************************
|
|
+ * shorthand helpers
|
|
+ ******************************************************************************/
|
|
+
|
|
+#define for_each_gen_type_zone(gen, type, zone) \
|
|
+ for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
|
|
+ for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
|
|
+ for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
|
|
+
|
|
+static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
|
|
+{
|
|
+ struct pglist_data *pgdat = NODE_DATA(nid);
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+ if (memcg) {
|
|
+ struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
|
|
+
|
|
+ /* for hotadd_new_pgdat() */
|
|
+ if (!lruvec->pgdat)
|
|
+ lruvec->pgdat = pgdat;
|
|
+
|
|
+ return lruvec;
|
|
+ }
|
|
+#endif
|
|
+ VM_WARN_ON_ONCE(!mem_cgroup_disabled());
|
|
+
|
|
+ return pgdat ? &pgdat->__lruvec : NULL;
|
|
+}
|
|
+
|
|
+/******************************************************************************
|
|
+ * initialization
|
|
+ ******************************************************************************/
|
|
+
|
|
+void lru_gen_init_lruvec(struct lruvec *lruvec)
|
|
+{
|
|
+ int gen, type, zone;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ lrugen->max_seq = MIN_NR_GENS + 1;
|
|
+
|
|
+ for_each_gen_type_zone(gen, type, zone)
|
|
+ INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
|
|
+}
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+void lru_gen_init_memcg(struct mem_cgroup *memcg)
|
|
+{
|
|
+}
|
|
+
|
|
+void lru_gen_exit_memcg(struct mem_cgroup *memcg)
|
|
+{
|
|
+ int nid;
|
|
+
|
|
+ for_each_node(nid) {
|
|
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
|
|
+
|
|
+ VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
|
|
+ sizeof(lruvec->lrugen.nr_pages)));
|
|
+ }
|
|
+}
|
|
+#endif
|
|
+
|
|
+static int __init init_lru_gen(void)
|
|
+{
|
|
+ BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
|
|
+ BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
|
|
+
|
|
+ return 0;
|
|
+};
|
|
+late_initcall(init_lru_gen);
|
|
+
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
{
|
|
unsigned long nr[NR_LRU_LISTS];
|
|
|
|
From patchwork Wed Jul 6 22:00:15 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908759
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id DF723C433EF
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 23:01:33 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=pw8mCXXDI8v0MDJQc0IHBiDeoukPtRi9Qa3LXTuFRYA=; b=GAKINjSR2jYAuZKQXyHYQ27csK
|
|
vIU+WS8duuSYhE/PbBP55QJHsR+I0Jfop1mhMZQWp+JLGURrwirNRr7QG6l3s6VgzxFmlwNcCUn06
|
|
BWrIyWmLqstBgYPdf6HQnbNzax2pBhJ1xh/+cuyA+2Pqlw8K4g2seeG2bs2QikHc1L+eg305DJGa5
|
|
sHcqeqCJybqHsnQm18wJvTzp60UCiJ1RUbtXxvM9/MhVsqlQW7c6M6QOwawjC9YIXACgREGd5TAA4
|
|
6wSfeEcsUPJNdQxo8FtH/7otsEZ0XFsGyQc7uEPG6yd9Fmnn0obWhhnWi1ldW4dzjST3O20ctGn8K
|
|
4ayXklNA==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9E01-00CfSt-Sp; Wed, 06 Jul 2022 23:00:02 +0000
|
|
Received: from desiato.infradead.org ([2001:8b0:10b:1:d65d:64ff:fe57:4e05])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9Dzj-00CfQj-BX
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:59:43 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=desiato.20200630; h=Content-Transfer-Encoding:Content-Type
|
|
:Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:
|
|
Sender:Reply-To:Content-ID:Content-Description;
|
|
bh=y0ZxSqiIOv2HRYm553wZrJx5fChLkGPPbLO1qwZgmyQ=; b=U+5pDuyHmP7hMdyFTyZSxXBvIC
|
|
CyPg3dBX0ZNabvqb04JUFgowU+vhbfnXXGWenRYUry0+QvuLjNLPR4DPA0XtAKPeGRJj/6CJuHUQI
|
|
370p5Kwc42oWEYM4N4gokoL5vskFSRCGJQXY2pP99BZuIVFyNfxBJA7tKUh5ppf7MSLqHajpuS+7T
|
|
QYUSzn8JZoeViLtyvHbv5zAAQEIvASWq50eH7QIYd+opHWFZr3jgO4c/r6xZUnjbKv+ETyKoQZLQ9
|
|
BDxQLYvZlAVViA92guSLrZhAY6PCDws0FLfZIqTXSVm9SYV2Q3s1ctjTmp9+x0Oc0rh7J9hDh3/F6
|
|
U0FI6tcA==;
|
|
Received: from mail-yw1-x114a.google.com ([2607:f8b0:4864:20::114a])
|
|
by desiato.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D51-000zVk-Is
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:14 +0000
|
|
Received: by mail-yw1-x114a.google.com with SMTP id
|
|
00721157ae682-31c9f68d48cso60703577b3.0
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:01 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=y0ZxSqiIOv2HRYm553wZrJx5fChLkGPPbLO1qwZgmyQ=;
|
|
b=k4czIYvx4CiuCTGm0ZE5CP3ROAwcGkVPLViBUVhaVvkR7uaNKMq35oiGoZrpr9wmyA
|
|
3m25Gt55w07/Zl+RDxl25UcbFclUuv1IhW8RxSswLcgrHkQRPfvrY4sHXWvh8Zx9tcVy
|
|
57vPZrwMAdg5KxxrjfPcq/qdHGTF/uyJnTdFe8v4GztZ5hfTrusX1wVVySS9zGZ/5Iow
|
|
Nd9yluqy3C3Vy/90KJx2guGDz9MOF3sU6l1ICpYZ9vNR6C8Rq/+pMVqKsY9lUtmogcQ9
|
|
4GYcy0Nvop1G8oE5zpjlPJBv9NQtnMO9nw2qaCn4RWoOH37nG4jPSXNMIBpa8zn061RW
|
|
FgQg==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=y0ZxSqiIOv2HRYm553wZrJx5fChLkGPPbLO1qwZgmyQ=;
|
|
b=xtHRuVLwEVDT6BEEfqS7cyABRjlZVNEHso4704tNj0XV1xL4tSSLuBKqXU27Tkkb5W
|
|
jXKrE1jYh/ryK/fKMP7l9ZP1R9dRLO37lnhge+apWtiiEBZweIlv+6f+Hc1LaVfzoBWS
|
|
IbLfmwGuUoJMUJddDaL/k6ZaTinhW+70ee37TknmvEM/LcTgdWK4gaw2T+oEp0fg8P6x
|
|
BbxVbyzFh6UeTJ88+YcGCK+7JYuutwDNh+zjRltj9sgoVbuRILIm6rnkDl4wF8/KgsYm
|
|
a0reWinV71iy1yMbM3OVtY/jptNlAR3nh9P77r9uj4JAqo3nk8+F1/EWRjMhBv3O4FT0
|
|
eMoQ==
|
|
X-Gm-Message-State: AJIora+TtRPHQubgfOOrXrX9/K0Jr903+bTvYrxQ0260Sq53UaYlPlRv
|
|
vYGJSVlxXMx+KAvtWaEejGBv2gYgK+8=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1sRDxtae2IlBgXPvJfXEts8Wxw8Va1kZtVIMGzblX4Mg8zS6Ie6RM5yT6WBMCN4GAE5u4jJ09Jf3oM=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a81:ab4d:0:b0:31c:8655:2207 with SMTP id
|
|
d13-20020a81ab4d000000b0031c86552207mr26036050ywk.389.1657144860068; Wed, 06
|
|
Jul 2022 15:01:00 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:15 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-7-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 06/14] mm: multi-gen LRU: minimal implementation
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230110_483219_C59F8AB1
|
|
X-CRM114-Status: GOOD ( 26.69 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. Promotion in the aging path
does not involve any LRU list operations, only the updates of the gen
counter and lrugen->nr_pages[]; demotion, unless it is the result of
the increment of max_seq, requires LRU list operations, e.g.,
lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages),
since it is only interested in hot pages.

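A hedged sketch of that trigger, not the in-kernel heuristic: the number of
generations is max_seq-min_seq+1 (cf. get_nr_gens() added to mm/vmscan.c),
and the aging is needed once that count is about to fall to MIN_NR_GENS:

  #include <stdbool.h>
  #include <stdio.h>

  #define MIN_NR_GENS 2UL

  /* true when only MIN_NR_GENS generations are left, i.e., time to age */
  static bool need_aging(unsigned long max_seq, unsigned long min_seq)
  {
          return max_seq - min_seq + 1 <= MIN_NR_GENS;
  }

  int main(void)
  {
          printf("%d\n", need_aging(4, 3));       /* 2 generations left: age */
          printf("%d\n", need_aging(5, 3));       /* 3 generations left: not yet */
          return 0;
  }
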
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes
empty. A feedback loop modeled after the PID controller monitors
refaults over anon and file types and decides which type to evict when
both types are available from the same generation.

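The comparison made by that feedback loop can be sketched as below; this is
a simplification of the in-kernel version (see the ctrl_pos helpers added
to mm/vmscan.c), which additionally folds in gains and exponential moving
averages:

  #include <stdbool.h>
  #include <stdio.h>

  struct refault_stats {
          unsigned long refaulted;                /* refaults observed */
          unsigned long total;                    /* evicted + protected */
  };

  /* protect the PV candidate when its refault ratio exceeds the SP baseline */
  static bool worth_protecting(struct refault_stats sp, struct refault_stats pv)
  {
          /* pv.refaulted/pv.total > sp.refaulted/sp.total, without dividing */
          return pv.refaulted * sp.total > sp.refaulted * pv.total;
  }

  int main(void)
  {
          struct refault_stats sp = { .refaulted = 10, .total = 1000 };
          struct refault_stats pv = { .refaulted = 50, .total = 1000 };

          printf("protect: %d\n", worth_protecting(sp, pv));
          return 0;
  }
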
The protection of pages accessed multiple times through file
descriptors takes place in the eviction path. Each generation is
divided into multiple tiers. A page accessed N times through file
descriptors is in tier order_base_2(N). Tiers do not have dedicated
lrugen->lists[], only bits in folio->flags. The aforementioned
feedback loop also monitors refaults over all tiers and decides when
to protect pages in which tiers (N>1), using the first tier (N=0,1) as
a baseline. The first tier contains single-use unmapped clean pages,
which are most likely the best choices. In contrast to promotion in
the aging path, the protection of a page in the eviction path is
achieved by moving this page to the next generation, i.e., min_seq+1,
if the feedback loop decides so. This approach has the following
advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

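A small sketch of the tier mapping described above (mirroring
lru_tier_from_refs() added by this patch); order_base_2() here is a
userspace stand-in for the kernel macro:

  #include <stdio.h>

  /* smallest order such that (1 << order) >= n; order_base_2(0) == 0 */
  static int order_base_2(unsigned long n)
  {
          int order = 0;

          while ((1UL << order) < n)
                  order++;
          return order;
  }

  int main(void)
  {
          /* N=0,1 share the first tier; N=2 -> 1; N=3,4 -> 2; N=5..8 -> 3 */
          for (unsigned long n = 0; n <= 8; n++)
                  printf("N=%lu -> tier %d\n", n, order_base_2(n));
          return 0;
  }
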
Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
              IOPS         BW
    5.19-rc1: 2673k        10.2GiB/s
    patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
              Ops/sec      KB/sec
    5.19-rc1: 1161501.04   45177.25
    patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < page = alloc_page(gfp_flags);
    ---
    > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

---
 include/linux/mm_inline.h         |  36 ++
 include/linux/mmzone.h            |  41 ++
 include/linux/page-flags-layout.h |   5 +-
 kernel/bounds.c                   |   2 +
 mm/Kconfig                        |  11 +
 mm/swap.c                         |  39 ++
 mm/vmscan.c                       | 810 +++++++++++++++++++++++++++++-
 mm/workingset.c                   | 110 +++-
 8 files changed, 1044 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
|
|
index 2ff703900fd0..f2b2296a42f9 100644
|
|
--- a/include/linux/mm_inline.h
|
|
+++ b/include/linux/mm_inline.h
|
|
@@ -121,6 +121,33 @@ static inline int lru_gen_from_seq(unsigned long seq)
|
|
return seq % MAX_NR_GENS;
|
|
}
|
|
|
|
+static inline int lru_hist_from_seq(unsigned long seq)
|
|
+{
|
|
+ return seq % NR_HIST_GENS;
|
|
+}
|
|
+
|
|
+static inline int lru_tier_from_refs(int refs)
|
|
+{
|
|
+ VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
|
|
+
|
|
+ /* see the comment in folio_lru_refs() */
|
|
+ return order_base_2(refs + 1);
|
|
+}
|
|
+
|
|
+static inline int folio_lru_refs(struct folio *folio)
|
|
+{
|
|
+ unsigned long flags = READ_ONCE(folio->flags);
|
|
+ bool workingset = flags & BIT(PG_workingset);
|
|
+
|
|
+ /*
|
|
+ * Return the number of accesses beyond PG_referenced, i.e., N-1 if the
|
|
+ * total number of accesses is N>1, since N=0,1 both map to the first
|
|
+ * tier. lru_tier_from_refs() will account for this off-by-one. Also see
|
|
+ * the comment on MAX_NR_TIERS.
|
|
+ */
|
|
+ return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
|
|
+}
|
|
+
|
|
static inline int folio_lru_gen(struct folio *folio)
|
|
{
|
|
unsigned long flags = READ_ONCE(folio->flags);
|
|
@@ -173,6 +200,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
|
|
__update_lru_size(lruvec, lru, zone, -delta);
|
|
return;
|
|
}
|
|
+
|
|
+ /* promotion */
|
|
+ if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
|
|
+ __update_lru_size(lruvec, lru, zone, -delta);
|
|
+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
|
|
+ }
|
|
+
|
|
+ /* demotion requires isolation, e.g., lru_deactivate_fn() */
|
|
+ VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
|
|
}
|
|
|
|
static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
|
|
index c90c2282044e..0d76222501ed 100644
|
|
--- a/include/linux/mmzone.h
|
|
+++ b/include/linux/mmzone.h
|
|
@@ -347,6 +347,28 @@ enum lruvec_flags {
|
|
#define MIN_NR_GENS 2U
|
|
#define MAX_NR_GENS 4U
|
|
|
|
+/*
|
|
+ * Each generation is divided into multiple tiers. A page accessed N times
|
|
+ * through file descriptors is in tier order_base_2(N). A page in the first tier
|
|
+ * (N=0,1) is marked by PG_referenced unless it was faulted in through page
|
|
+ * tables or read ahead. A page in any other tier (N>1) is marked by
|
|
+ * PG_referenced and PG_workingset. This implies a minimum of two tiers is
|
|
+ * supported without using additional bits in folio->flags.
|
|
+ *
|
|
+ * In contrast to moving across generations which requires the LRU lock, moving
|
|
+ * across tiers only involves atomic operations on folio->flags and therefore
|
|
+ * has a negligible cost in the buffered access path. In the eviction path,
|
|
+ * comparisons of refaulted/(evicted+protected) from the first tier and the
|
|
+ * rest infer whether pages accessed multiple times through file descriptors
|
|
+ * are statistically hot and thus worth protecting.
|
|
+ *
|
|
+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
|
|
+ * number of categories of the active/inactive LRU when keeping track of
|
|
+ * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in
|
|
+ * folio->flags.
|
|
+ */
|
|
+#define MAX_NR_TIERS 4U
|
|
+
|
|
#ifndef __GENERATING_BOUNDS_H
|
|
|
|
struct lruvec;
|
|
@@ -361,6 +383,16 @@ enum {
|
|
LRU_GEN_FILE,
|
|
};
|
|
|
|
+#define MIN_LRU_BATCH BITS_PER_LONG
|
|
+#define MAX_LRU_BATCH (MIN_LRU_BATCH * 128)
|
|
+
|
|
+/* whether to keep historical stats from evicted generations */
|
|
+#ifdef CONFIG_LRU_GEN_STATS
|
|
+#define NR_HIST_GENS MAX_NR_GENS
|
|
+#else
|
|
+#define NR_HIST_GENS 1U
|
|
+#endif
|
|
+
|
|
/*
|
|
* The youngest generation number is stored in max_seq for both anon and file
|
|
* types as they are aged on an equal footing. The oldest generation numbers are
|
|
@@ -383,6 +415,15 @@ struct lru_gen_struct {
|
|
struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
|
|
/* the multi-gen LRU sizes, eventually consistent */
|
|
long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
|
|
+ /* the exponential moving average of refaulted */
|
|
+ unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
|
|
+ /* the exponential moving average of evicted+protected */
|
|
+ unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
|
|
+ /* the first tier doesn't need protection, hence the minus one */
|
|
+ unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
|
|
+ /* can be modified without holding the LRU lock */
|
|
+ atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
|
|
+ atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
|
|
};
|
|
|
|
void lru_gen_init_lruvec(struct lruvec *lruvec);
|
|
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
|
|
index 240905407a18..7d79818dc065 100644
|
|
--- a/include/linux/page-flags-layout.h
|
|
+++ b/include/linux/page-flags-layout.h
|
|
@@ -106,7 +106,10 @@
|
|
#error "Not enough bits in page flags"
|
|
#endif
|
|
|
|
-#define LRU_REFS_WIDTH 0
|
|
+/* see the comment on MAX_NR_TIERS */
|
|
+#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \
|
|
+ ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
|
|
+ NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)
|
|
|
|
#endif
|
|
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
|
|
diff --git a/kernel/bounds.c b/kernel/bounds.c
|
|
index 5ee60777d8e4..b529182e8b04 100644
|
|
--- a/kernel/bounds.c
|
|
+++ b/kernel/bounds.c
|
|
@@ -24,8 +24,10 @@ int main(void)
|
|
DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
|
|
#ifdef CONFIG_LRU_GEN
|
|
DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
|
|
+ DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
|
|
#else
|
|
DEFINE(LRU_GEN_WIDTH, 0);
|
|
+ DEFINE(__LRU_REFS_WIDTH, 0);
|
|
#endif
|
|
/* End of constants */
|
|
|
|
diff --git a/mm/Kconfig b/mm/Kconfig
|
|
index cee109f3128a..a93478acf341 100644
|
|
--- a/mm/Kconfig
|
|
+++ b/mm/Kconfig
|
|
@@ -1130,6 +1130,7 @@ config PTE_MARKER_UFFD_WP
|
|
purposes. It is required to enable userfaultfd write protection on
|
|
file-backed memory types like shmem and hugetlbfs.
|
|
|
|
+# multi-gen LRU {
|
|
config LRU_GEN
|
|
bool "Multi-Gen LRU"
|
|
depends on MMU
|
|
@@ -1138,6 +1139,16 @@ config LRU_GEN
|
|
help
|
|
A high performance LRU implementation to overcommit memory.
|
|
|
|
+config LRU_GEN_STATS
|
|
+ bool "Full stats for debugging"
|
|
+ depends on LRU_GEN
|
|
+ help
|
|
+ Do not enable this option unless you plan to look at historical stats
|
|
+ from evicted generations for debugging purpose.
|
|
+
|
|
+ This option has a per-memcg and per-node memory overhead.
|
|
+# }
|
|
+
|
|
source "mm/damon/Kconfig"
|
|
|
|
endmenu
|
|
diff --git a/mm/swap.c b/mm/swap.c
|
|
index b062729b340f..67e7962fbacc 100644
|
|
--- a/mm/swap.c
|
|
+++ b/mm/swap.c
|
|
@@ -405,6 +405,40 @@ static void __lru_cache_activate_folio(struct folio *folio)
|
|
local_unlock(&lru_pvecs.lock);
|
|
}
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+static void folio_inc_refs(struct folio *folio)
|
|
+{
|
|
+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
|
|
+
|
|
+ if (folio_test_unevictable(folio))
|
|
+ return;
|
|
+
|
|
+ if (!folio_test_referenced(folio)) {
|
|
+ folio_set_referenced(folio);
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ if (!folio_test_workingset(folio)) {
|
|
+ folio_set_workingset(folio);
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ /* see the comment on MAX_NR_TIERS */
|
|
+ do {
|
|
+ new_flags = old_flags & LRU_REFS_MASK;
|
|
+ if (new_flags == LRU_REFS_MASK)
|
|
+ break;
|
|
+
|
|
+ new_flags += BIT(LRU_REFS_PGOFF);
|
|
+ new_flags |= old_flags & ~LRU_REFS_MASK;
|
|
+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
|
|
+}
|
|
+#else
|
|
+static void folio_inc_refs(struct folio *folio)
|
|
+{
|
|
+}
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
/*
|
|
* Mark a page as having seen activity.
|
|
*
|
|
@@ -417,6 +451,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
|
|
*/
|
|
void folio_mark_accessed(struct folio *folio)
|
|
{
|
|
+ if (lru_gen_enabled()) {
|
|
+ folio_inc_refs(folio);
|
|
+ return;
|
|
+ }
|
|
+
|
|
if (!folio_test_referenced(folio)) {
|
|
folio_set_referenced(folio);
|
|
} else if (folio_test_unevictable(folio)) {
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index 1fcc0feed985..f768d61e7b85 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -1273,9 +1273,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
|
|
|
|
if (folio_test_swapcache(folio)) {
|
|
swp_entry_t swap = folio_swap_entry(folio);
|
|
- mem_cgroup_swapout(folio, swap);
|
|
+
|
|
+ /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
|
|
if (reclaimed && !mapping_exiting(mapping))
|
|
shadow = workingset_eviction(folio, target_memcg);
|
|
+ mem_cgroup_swapout(folio, swap);
|
|
__delete_from_swap_cache(&folio->page, swap, shadow);
|
|
xa_unlock_irq(&mapping->i_pages);
|
|
put_swap_page(&folio->page, swap);
|
|
@@ -2675,6 +2677,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
|
|
unsigned long file;
|
|
struct lruvec *target_lruvec;
|
|
|
|
+ if (lru_gen_enabled())
|
|
+ return;
|
|
+
|
|
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
|
|
|
|
/*
|
|
@@ -2998,6 +3003,17 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
|
|
* shorthand helpers
|
|
******************************************************************************/
|
|
|
|
+#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset))
|
|
+
|
|
+#define DEFINE_MAX_SEQ(lruvec) \
|
|
+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
|
|
+
|
|
+#define DEFINE_MIN_SEQ(lruvec) \
|
|
+ unsigned long min_seq[ANON_AND_FILE] = { \
|
|
+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \
|
|
+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \
|
|
+ }
|
|
+
|
|
#define for_each_gen_type_zone(gen, type, zone) \
|
|
for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
|
|
for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
|
|
@@ -3023,6 +3039,764 @@ static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int ni
|
|
return pgdat ? &pgdat->__lruvec : NULL;
|
|
}
|
|
|
|
+static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
|
|
+{
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
|
|
+
|
|
+ if (!can_demote(pgdat->node_id, sc) &&
|
|
+ mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
|
|
+ return 0;
|
|
+
|
|
+ return mem_cgroup_swappiness(memcg);
|
|
+}
|
|
+
|
|
+static int get_nr_gens(struct lruvec *lruvec, int type)
|
|
+{
|
|
+ return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
|
|
+}
|
|
+
|
|
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
|
|
+{
|
|
+ /* see the comment on lru_gen_struct */
|
|
+ return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
|
|
+ get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
|
|
+ get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
|
|
+}
|
|
+
|
|
+/******************************************************************************
|
|
+ * refault feedback loop
|
|
+ ******************************************************************************/
|
|
+
|
|
+/*
|
|
+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
|
|
+ *
|
|
+ * The P term is refaulted/(evicted+protected) from a tier in the generation
|
|
+ * currently being evicted; the I term is the exponential moving average of the
|
|
+ * P term over the generations previously evicted, using the smoothing factor
|
|
+ * 1/2; the D term isn't supported.
|
|
+ *
|
|
+ * The setpoint (SP) is always the first tier of one type; the process variable
|
|
+ * (PV) is either any tier of the other type or any other tier of the same
|
|
+ * type.
|
|
+ *
|
|
+ * The error is the difference between the SP and the PV; the correction is to
|
|
+ * turn off protection when SP>PV or turn on protection when SP<PV.
|
|
+ *
|
|
+ * For future optimizations:
|
|
+ * 1. The D term may discount the other two terms over time so that long-lived
|
|
+ * generations can resist stale information.
|
|
+ */
|
|
+struct ctrl_pos {
|
|
+ unsigned long refaulted;
|
|
+ unsigned long total;
|
|
+ int gain;
|
|
+};
|
|
+
|
|
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
|
|
+ struct ctrl_pos *pos)
|
|
+{
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ int hist = lru_hist_from_seq(lrugen->min_seq[type]);
|
|
+
|
|
+ pos->refaulted = lrugen->avg_refaulted[type][tier] +
|
|
+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
|
|
+ pos->total = lrugen->avg_total[type][tier] +
|
|
+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
|
|
+ if (tier)
|
|
+ pos->total += lrugen->protected[hist][type][tier - 1];
|
|
+ pos->gain = gain;
|
|
+}
|
|
+
|
|
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
|
|
+{
|
|
+ int hist, tier;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
|
|
+ unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
|
|
+
|
|
+ lockdep_assert_held(&lruvec->lru_lock);
|
|
+
|
|
+ if (!carryover && !clear)
|
|
+ return;
|
|
+
|
|
+ hist = lru_hist_from_seq(seq);
|
|
+
|
|
+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
|
|
+ if (carryover) {
|
|
+ unsigned long sum;
|
|
+
|
|
+ sum = lrugen->avg_refaulted[type][tier] +
|
|
+ atomic_long_read(&lrugen->refaulted[hist][type][tier]);
|
|
+ WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
|
|
+
|
|
+ sum = lrugen->avg_total[type][tier] +
|
|
+ atomic_long_read(&lrugen->evicted[hist][type][tier]);
|
|
+ if (tier)
|
|
+ sum += lrugen->protected[hist][type][tier - 1];
|
|
+ WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
|
|
+ }
|
|
+
|
|
+ if (clear) {
|
|
+ atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
|
|
+ atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
|
|
+ if (tier)
|
|
+ WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
|
|
+ }
|
|
+ }
|
|
+}
|
|
+
|
|
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
|
|
+{
|
|
+ /*
|
|
+ * Return true if the PV has a limited number of refaults or a lower
|
|
+ * refaulted/total than the SP.
|
|
+ */
|
|
+ return pv->refaulted < MIN_LRU_BATCH ||
|
|
+ pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
|
|
+ (sp->refaulted + 1) * pv->total * pv->gain;
|
|
+}
|
|
+
|
|
+/******************************************************************************
|
|
+ * the aging
|
|
+ ******************************************************************************/
|
|
+
|
|
+/* protect pages accessed multiple times through file descriptors */
|
|
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
+{
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
|
|
+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
|
|
+
|
|
+ do {
|
|
+ new_gen = (old_gen + 1) % MAX_NR_GENS;
|
|
+
|
|
+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
|
|
+ new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
|
|
+ /* for folio_end_writeback() */
|
|
+ if (reclaiming)
|
|
+ new_flags |= BIT(PG_reclaim);
|
|
+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
|
|
+
|
|
+ lru_gen_update_size(lruvec, folio, old_gen, new_gen);
|
|
+
|
|
+ return new_gen;
|
|
+}
|
|
+
|
|
+static void inc_min_seq(struct lruvec *lruvec, int type)
|
|
+{
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ reset_ctrl_pos(lruvec, type, true);
|
|
+ WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
|
|
+}
|
|
+
|
|
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
|
|
+{
|
|
+ int gen, type, zone;
|
|
+ bool success = false;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
|
|
+
|
|
+ /* find the oldest populated generation */
|
|
+ for (type = !can_swap; type < ANON_AND_FILE; type++) {
|
|
+ while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
|
|
+ gen = lru_gen_from_seq(min_seq[type]);
|
|
+
|
|
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
|
|
+ if (!list_empty(&lrugen->lists[gen][type][zone]))
|
|
+ goto next;
|
|
+ }
|
|
+
|
|
+ min_seq[type]++;
|
|
+ }
|
|
+next:
|
|
+ ;
|
|
+ }
|
|
+
|
|
+ /* see the comment on lru_gen_struct */
|
|
+ if (can_swap) {
|
|
+ min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
|
|
+ min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
|
|
+ }
|
|
+
|
|
+ for (type = !can_swap; type < ANON_AND_FILE; type++) {
|
|
+ if (min_seq[type] == lrugen->min_seq[type])
|
|
+ continue;
|
|
+
|
|
+ reset_ctrl_pos(lruvec, type, true);
|
|
+ WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
|
|
+ success = true;
|
|
+ }
|
|
+
|
|
+ return success;
|
|
+}
|
|
+
|
|
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap)
|
|
+{
|
|
+ int prev, next;
|
|
+ int type, zone;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
|
|
+
|
|
+ if (max_seq != lrugen->max_seq)
|
|
+ goto unlock;
|
|
+
|
|
+ for (type = 0; type < ANON_AND_FILE; type++) {
|
|
+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
|
|
+ continue;
|
|
+
|
|
+ VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap);
|
|
+
|
|
+ inc_min_seq(lruvec, type);
|
|
+ }
|
|
+
|
|
+ /*
|
|
+ * Update the active/inactive LRU sizes for compatibility. Both sides of
|
|
+ * the current max_seq need to be covered, since max_seq+1 can overlap
|
|
+ * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do
|
|
+ * overlap, cold/hot inversion happens.
|
|
+ */
|
|
+ prev = lru_gen_from_seq(lrugen->max_seq - 1);
|
|
+ next = lru_gen_from_seq(lrugen->max_seq + 1);
|
|
+
|
|
+ for (type = 0; type < ANON_AND_FILE; type++) {
|
|
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
|
|
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
|
|
+ long delta = lrugen->nr_pages[prev][type][zone] -
|
|
+ lrugen->nr_pages[next][type][zone];
|
|
+
|
|
+ if (!delta)
|
|
+ continue;
|
|
+
|
|
+ __update_lru_size(lruvec, lru, zone, delta);
|
|
+ __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
|
|
+ }
|
|
+ }
|
|
+
|
|
+ for (type = 0; type < ANON_AND_FILE; type++)
|
|
+ reset_ctrl_pos(lruvec, type, false);
|
|
+
|
|
+ /* make sure preceding modifications appear */
|
|
+ smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
|
|
+unlock:
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+}
|
|
+
|
|
+static unsigned long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
|
|
+ unsigned long *min_seq, bool can_swap, bool *need_aging)
|
|
+{
|
|
+ int gen, type, zone;
|
|
+ unsigned long old = 0;
|
|
+ unsigned long young = 0;
|
|
+ unsigned long total = 0;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ for (type = !can_swap; type < ANON_AND_FILE; type++) {
|
|
+ unsigned long seq;
|
|
+
|
|
+ for (seq = min_seq[type]; seq <= max_seq; seq++) {
|
|
+ unsigned long size = 0;
|
|
+
|
|
+ gen = lru_gen_from_seq(seq);
|
|
+
|
|
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
|
|
+ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
|
|
+
|
|
+ total += size;
|
|
+ if (seq == max_seq)
|
|
+ young += size;
|
|
+ if (seq + MIN_NR_GENS == max_seq)
|
|
+ old += size;
|
|
+ }
|
|
+ }
|
|
+
|
|
+ /*
|
|
+ * The aging tries to be lazy to reduce the overhead. On the other hand,
|
|
+ * the eviction stalls when the number of generations reaches
|
|
+ * MIN_NR_GENS. So ideally, there should be MIN_NR_GENS+1 generations,
|
|
+ * hence the first two if's.
|
|
+ *
|
|
+ * Also it's ideal to spread pages out evenly, meaning 1/(MIN_NR_GENS+1)
|
|
+ * of the total number of pages for each generation. A reasonable range
|
|
+ * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
|
|
+ * eviction cares about the lower bound of cold pages, whereas the aging
|
|
+ * cares about the upper bound of hot pages.
|
|
+ */
|
|
+ if (min_seq[!can_swap] + MIN_NR_GENS > max_seq)
|
|
+ *need_aging = true;
|
|
+ else if (min_seq[!can_swap] + MIN_NR_GENS < max_seq)
|
|
+ *need_aging = false;
|
|
+ else if (young * MIN_NR_GENS > total)
|
|
+ *need_aging = true;
|
|
+ else if (old * (MIN_NR_GENS + 2) < total)
|
|
+ *need_aging = true;
|
|
+ else
|
|
+ *need_aging = false;
|
|
+
|
|
+ return total;
|
|
+}
|
|
+
|
|
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
+{
|
|
+ bool need_aging;
|
|
+ unsigned long nr_to_scan;
|
|
+ int swappiness = get_swappiness(lruvec, sc);
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ DEFINE_MAX_SEQ(lruvec);
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ VM_WARN_ON_ONCE(sc->memcg_low_reclaim);
|
|
+
|
|
+ mem_cgroup_calculate_protection(NULL, memcg);
|
|
+
|
|
+ if (mem_cgroup_below_min(memcg))
|
|
+ return;
|
|
+
|
|
+ nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
|
|
+ if (!nr_to_scan)
|
|
+ return;
|
|
+
|
|
+ nr_to_scan >>= mem_cgroup_online(memcg) ? sc->priority : 0;
|
|
+
|
|
+ if (nr_to_scan && need_aging)
|
|
+ inc_max_seq(lruvec, max_seq, swappiness);
|
|
+}
|
|
+
|
|
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
+{
|
|
+ struct mem_cgroup *memcg;
|
|
+
|
|
+ VM_WARN_ON_ONCE(!current_is_kswapd());
|
|
+
|
|
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
|
|
+ do {
|
|
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
|
+
|
|
+ age_lruvec(lruvec, sc);
|
|
+
|
|
+ cond_resched();
|
|
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
|
|
+}
|
|
+
|
|
+/******************************************************************************
|
|
+ * the eviction
|
|
+ ******************************************************************************/
|
|
+
|
|
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
|
|
+{
|
|
+ bool success;
|
|
+ int gen = folio_lru_gen(folio);
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ int zone = folio_zonenum(folio);
|
|
+ int delta = folio_nr_pages(folio);
|
|
+ int refs = folio_lru_refs(folio);
|
|
+ int tier = lru_tier_from_refs(refs);
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
|
|
+
|
|
+ /* unevictable */
|
|
+ if (!folio_evictable(folio)) {
|
|
+ success = lru_gen_del_folio(lruvec, folio, true);
|
|
+ VM_WARN_ON_ONCE_FOLIO(!success, folio);
|
|
+ folio_set_unevictable(folio);
|
|
+ lruvec_add_folio(lruvec, folio);
|
|
+ __count_vm_events(UNEVICTABLE_PGCULLED, delta);
|
|
+ return true;
|
|
+ }
|
|
+
|
|
+ /* dirty lazyfree */
|
|
+ if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
|
|
+ success = lru_gen_del_folio(lruvec, folio, true);
|
|
+ VM_WARN_ON_ONCE_FOLIO(!success, folio);
|
|
+ folio_set_swapbacked(folio);
|
|
+ lruvec_add_folio_tail(lruvec, folio);
|
|
+ return true;
|
|
+ }
|
|
+
|
|
+ /* protected */
|
|
+ if (tier > tier_idx) {
|
|
+ int hist = lru_hist_from_seq(lrugen->min_seq[type]);
|
|
+
|
|
+ gen = folio_inc_gen(lruvec, folio, false);
|
|
+ list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
|
|
+
|
|
+ WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
|
|
+ lrugen->protected[hist][type][tier - 1] + delta);
|
|
+ __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
|
|
+ return true;
|
|
+ }
|
|
+
|
|
+ /* waiting for writeback */
|
|
+ if (folio_test_locked(folio) || folio_test_writeback(folio) ||
|
|
+ (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
|
|
+ gen = folio_inc_gen(lruvec, folio, true);
|
|
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
|
|
+ return true;
|
|
+ }
|
|
+
|
|
+ return false;
|
|
+}
|
|
+
|
|
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
|
|
+{
|
|
+ bool success;
|
|
+
|
|
+ /* unmapping inhibited */
|
|
+ if (!sc->may_unmap && folio_mapped(folio))
|
|
+ return false;
|
|
+
|
|
+ /* swapping inhibited */
|
|
+ if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
|
|
+ (folio_test_dirty(folio) ||
|
|
+ (folio_test_anon(folio) && !folio_test_swapcache(folio))))
|
|
+ return false;
|
|
+
|
|
+ /* raced with release_pages() */
|
|
+ if (!folio_try_get(folio))
|
|
+ return false;
|
|
+
|
|
+ /* raced with another isolation */
|
|
+ if (!folio_test_clear_lru(folio)) {
|
|
+ folio_put(folio);
|
|
+ return false;
|
|
+ }
|
|
+
|
|
+ /* see the comment on MAX_NR_TIERS */
|
|
+ if (!folio_test_referenced(folio))
|
|
+ set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
|
|
+
|
|
+ /* for shrink_page_list() */
|
|
+ folio_clear_reclaim(folio);
|
|
+ folio_clear_referenced(folio);
|
|
+
|
|
+ success = lru_gen_del_folio(lruvec, folio, true);
|
|
+ VM_WARN_ON_ONCE_FOLIO(!success, folio);
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
|
|
+ int type, int tier, struct list_head *list)
|
|
+{
|
|
+ int gen, zone;
|
|
+ enum vm_event_item item;
|
|
+ int sorted = 0;
|
|
+ int scanned = 0;
|
|
+ int isolated = 0;
|
|
+ int remaining = MAX_LRU_BATCH;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+
|
|
+ VM_WARN_ON_ONCE(!list_empty(list));
|
|
+
|
|
+ if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
|
|
+ return 0;
|
|
+
|
|
+ gen = lru_gen_from_seq(lrugen->min_seq[type]);
|
|
+
|
|
+ for (zone = sc->reclaim_idx; zone >= 0; zone--) {
|
|
+ LIST_HEAD(moved);
|
|
+ int skipped = 0;
|
|
+ struct list_head *head = &lrugen->lists[gen][type][zone];
|
|
+
|
|
+ while (!list_empty(head)) {
|
|
+ struct folio *folio = lru_to_folio(head);
|
|
+ int delta = folio_nr_pages(folio);
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
|
|
+
|
|
+ scanned += delta;
|
|
+
|
|
+ if (sort_folio(lruvec, folio, tier))
|
|
+ sorted += delta;
|
|
+ else if (isolate_folio(lruvec, folio, sc)) {
|
|
+ list_add(&folio->lru, list);
|
|
+ isolated += delta;
|
|
+ } else {
|
|
+ list_move(&folio->lru, &moved);
|
|
+ skipped += delta;
|
|
+ }
|
|
+
|
|
+ if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
|
|
+ break;
|
|
+ }
|
|
+
|
|
+ if (skipped) {
|
|
+ list_splice(&moved, head);
|
|
+ __count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
|
|
+ }
|
|
+
|
|
+ if (!remaining || isolated >= MIN_LRU_BATCH)
|
|
+ break;
|
|
+ }
|
|
+
|
|
+ item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
|
|
+ if (!cgroup_reclaim(sc)) {
|
|
+ __count_vm_events(item, isolated);
|
|
+ __count_vm_events(PGREFILL, sorted);
|
|
+ }
|
|
+ __count_memcg_events(memcg, item, isolated);
|
|
+ __count_memcg_events(memcg, PGREFILL, sorted);
|
|
+ __count_vm_events(PGSCAN_ANON + type, isolated);
|
|
+
|
|
+ /*
|
|
+ * There might not be eligible pages due to reclaim_idx, may_unmap and
|
|
+ * may_writepage. Check the remaining scan budget to prevent a livelock
+ * if it's not making progress.
|
|
+ */
|
|
+ return isolated || !remaining ? scanned : 0;
|
|
+}
|
|
+
|
|
+static int get_tier_idx(struct lruvec *lruvec, int type)
|
|
+{
|
|
+ int tier;
|
|
+ struct ctrl_pos sp, pv;
|
|
+
|
|
+ /*
|
|
+ * To leave a margin for fluctuations, use a larger gain factor (1:2).
|
|
+ * This value is chosen because any other tier would have at least twice
|
|
+ * as many refaults as the first tier.
|
|
+ */
|
|
+ read_ctrl_pos(lruvec, type, 0, 1, &sp);
|
|
+ for (tier = 1; tier < MAX_NR_TIERS; tier++) {
|
|
+ read_ctrl_pos(lruvec, type, tier, 2, &pv);
|
|
+ if (!positive_ctrl_err(&sp, &pv))
|
|
+ break;
|
|
+ }
|
|
+
|
|
+ return tier - 1;
|
|
+}
|
|
+
|
|
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
|
|
+{
|
|
+ int type, tier;
|
|
+ struct ctrl_pos sp, pv;
|
|
+ int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
|
|
+
|
|
+ /*
|
|
+ * Compare the first tier of anon with that of file to determine which
|
|
+ * type to scan. Also need to compare other tiers of the selected type
|
|
+ * with the first tier of the other type to determine the last tier (of
|
|
+ * the selected type) to evict.
|
|
+ */
|
|
+ read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
|
|
+ read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
|
|
+ type = positive_ctrl_err(&sp, &pv);
|
|
+
|
|
+ read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
|
|
+ for (tier = 1; tier < MAX_NR_TIERS; tier++) {
|
|
+ read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
|
|
+ if (!positive_ctrl_err(&sp, &pv))
|
|
+ break;
|
|
+ }
|
|
+
|
|
+ *tier_idx = tier - 1;
|
|
+
|
|
+ return type;
|
|
+}
|
|
+
|
|
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
|
|
+ int *type_scanned, struct list_head *list)
|
|
+{
|
|
+ int i;
|
|
+ int type;
|
|
+ int scanned;
|
|
+ int tier = -1;
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ /*
|
|
+ * Try to make the obvious choice first. When anon and file are both
|
|
+ * available from the same generation, interpret swappiness 1 as file
|
|
+ * first and 200 as anon first.
|
|
+ */
|
|
+ if (!swappiness)
|
|
+ type = LRU_GEN_FILE;
|
|
+ else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
|
|
+ type = LRU_GEN_ANON;
|
|
+ else if (swappiness == 1)
|
|
+ type = LRU_GEN_FILE;
|
|
+ else if (swappiness == 200)
|
|
+ type = LRU_GEN_ANON;
|
|
+ else
|
|
+ type = get_type_to_scan(lruvec, swappiness, &tier);
|
|
+
|
|
+ for (i = !swappiness; i < ANON_AND_FILE; i++) {
|
|
+ if (tier < 0)
|
|
+ tier = get_tier_idx(lruvec, type);
|
|
+
|
|
+ scanned = scan_folios(lruvec, sc, type, tier, list);
|
|
+ if (scanned)
|
|
+ break;
|
|
+
|
|
+ type = !type;
|
|
+ tier = -1;
|
|
+ }
|
|
+
|
|
+ *type_scanned = type;
|
|
+
|
|
+ return scanned;
|
|
+}
|
|
+
|
|
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
|
|
+{
|
|
+ int type;
|
|
+ int scanned;
|
|
+ int reclaimed;
|
|
+ LIST_HEAD(list);
|
|
+ struct folio *folio;
|
|
+ enum vm_event_item item;
|
|
+ struct reclaim_stat stat;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
|
|
+
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
|
|
+
|
|
+ scanned += try_to_inc_min_seq(lruvec, swappiness);
|
|
+
|
|
+ if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS)
|
|
+ scanned = 0;
|
|
+
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ if (list_empty(&list))
|
|
+ return scanned;
|
|
+
|
|
+ reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
|
|
+
|
|
+ list_for_each_entry(folio, &list, lru) {
|
|
+ /* restore LRU_REFS_FLAGS cleared by isolate_folio() */
|
|
+ if (folio_test_workingset(folio))
|
|
+ folio_set_referenced(folio);
|
|
+
|
|
+ /* don't add rejected pages to the oldest generation */
|
|
+ if (folio_test_reclaim(folio) &&
|
|
+ (folio_test_dirty(folio) || folio_test_writeback(folio)))
|
|
+ folio_clear_active(folio);
|
|
+ else
|
|
+ folio_set_active(folio);
|
|
+ }
|
|
+
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ move_pages_to_lru(lruvec, &list);
|
|
+
|
|
+ item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
|
|
+ if (!cgroup_reclaim(sc))
|
|
+ __count_vm_events(item, reclaimed);
|
|
+ __count_memcg_events(memcg, item, reclaimed);
|
|
+ __count_vm_events(PGSTEAL_ANON + type, reclaimed);
|
|
+
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ mem_cgroup_uncharge_list(&list);
|
|
+ free_unref_page_list(&list);
|
|
+
|
|
+ sc->nr_reclaimed += reclaimed;
|
|
+
|
|
+ return scanned;
|
|
+}
|
|
+
|
|
+static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
|
|
+ bool can_swap, unsigned long reclaimed)
|
|
+{
|
|
+ int priority;
|
|
+ bool need_aging;
|
|
+ unsigned long nr_to_scan;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ DEFINE_MAX_SEQ(lruvec);
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ if (mem_cgroup_below_min(memcg) ||
|
|
+ (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
|
|
+ return 0;
|
|
+
|
|
+ nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
|
|
+ if (!nr_to_scan)
|
|
+ return 0;
|
|
+
|
|
+ /* adjust priority if memcg is offline or the target is met */
|
|
+ if (!mem_cgroup_online(memcg))
|
|
+ priority = 0;
|
|
+ else if (sc->nr_reclaimed - reclaimed >= sc->nr_to_reclaim)
|
|
+ priority = DEF_PRIORITY;
|
|
+ else
|
|
+ priority = sc->priority;
|
|
+
|
|
+ nr_to_scan >>= priority;
|
|
+ if (!nr_to_scan)
|
|
+ return 0;
|
|
+
|
|
+ if (!need_aging)
|
|
+ return nr_to_scan;
|
|
+
|
|
+ /* skip the aging path at the default priority */
|
|
+ if (priority == DEF_PRIORITY)
|
|
+ goto done;
|
|
+
|
|
+ /* leave the work to lru_gen_age_node() */
|
|
+ if (current_is_kswapd())
|
|
+ return 0;
|
|
+
|
|
+ inc_max_seq(lruvec, max_seq, can_swap);
|
|
+done:
|
|
+ return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
|
|
+}
|
|
+
|
|
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
+{
|
|
+ struct blk_plug plug;
|
|
+ unsigned long scanned = 0;
|
|
+ unsigned long reclaimed = sc->nr_reclaimed;
|
|
+
|
|
+ lru_add_drain();
|
|
+
|
|
+ blk_start_plug(&plug);
|
|
+
|
|
+ while (true) {
|
|
+ int delta;
|
|
+ int swappiness;
|
|
+ unsigned long nr_to_scan;
|
|
+
|
|
+ if (sc->may_swap)
|
|
+ swappiness = get_swappiness(lruvec, sc);
|
|
+ else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc))
|
|
+ swappiness = 1;
|
|
+ else
|
|
+ swappiness = 0;
|
|
+
|
|
+ nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, reclaimed);
|
|
+ if (!nr_to_scan)
|
|
+ break;
|
|
+
|
|
+ delta = evict_folios(lruvec, sc, swappiness);
|
|
+ if (!delta)
|
|
+ break;
|
|
+
|
|
+ scanned += delta;
|
|
+ if (scanned >= nr_to_scan)
|
|
+ break;
|
|
+
|
|
+ cond_resched();
|
|
+ }
|
|
+
|
|
+ blk_finish_plug(&plug);
|
|
+}
|
|
+
|
|
/******************************************************************************
|
|
* initialization
|
|
******************************************************************************/
|
|
@@ -3065,6 +3839,16 @@ static int __init init_lru_gen(void)
|
|
};
|
|
late_initcall(init_lru_gen);
|
|
|
|
+#else /* !CONFIG_LRU_GEN */
|
|
+
|
|
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
+{
|
|
+}
|
|
+
|
|
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
+{
|
|
+}
|
|
+
|
|
#endif /* CONFIG_LRU_GEN */
|
|
|
|
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
@@ -3078,6 +3862,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
struct blk_plug plug;
|
|
bool scan_adjusted;
|
|
|
|
+ if (lru_gen_enabled()) {
|
|
+ lru_gen_shrink_lruvec(lruvec, sc);
|
|
+ return;
|
|
+ }
|
|
+
|
|
get_scan_count(lruvec, sc, nr);
|
|
|
|
/* Record the original scan target for proportional adjustments later */
|
|
@@ -3582,6 +4371,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
|
|
struct lruvec *target_lruvec;
|
|
unsigned long refaults;
|
|
|
|
+ if (lru_gen_enabled())
|
|
+ return;
|
|
+
|
|
target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
|
|
refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
|
|
target_lruvec->refaults[0] = refaults;
|
|
@@ -3946,12 +4738,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
|
|
}
|
|
#endif
|
|
|
|
-static void age_active_anon(struct pglist_data *pgdat,
|
|
+static void kswapd_age_node(struct pglist_data *pgdat,
|
|
struct scan_control *sc)
|
|
{
|
|
struct mem_cgroup *memcg;
|
|
struct lruvec *lruvec;
|
|
|
|
+ if (lru_gen_enabled()) {
|
|
+ lru_gen_age_node(pgdat, sc);
|
|
+ return;
|
|
+ }
|
|
+
|
|
if (!can_age_anon_pages(pgdat, sc))
|
|
return;
|
|
|
|
@@ -4271,12 +5068,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
|
|
sc.may_swap = !nr_boost_reclaim;
|
|
|
|
/*
|
|
- * Do some background aging of the anon list, to give
|
|
- * pages a chance to be referenced before reclaiming. All
|
|
- * pages are rotated regardless of classzone as this is
|
|
- * about consistent aging.
|
|
+ * Do some background aging, to give pages a chance to be
|
|
+ * referenced before reclaiming. All pages are rotated
|
|
+ * regardless of classzone as this is about consistent aging.
|
|
*/
|
|
- age_active_anon(pgdat, &sc);
|
|
+ kswapd_age_node(pgdat, &sc);
|
|
|
|
/*
|
|
* If we're getting trouble reclaiming, start doing writepage
|
|
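
The refault feedback loop added to mm/vmscan.c above ultimately reduces to one
cross-multiplication in positive_ctrl_err(): a tier (the PV) keeps its
protection only while its gain-weighted refault ratio exceeds that of the first
tier (the SP). Here is a standalone sketch of that comparison with invented
sample numbers; it mirrors the kernel function but is not the kernel code.

/* ctrl_pos_sketch.c - the SP/PV comparison from the refault feedback loop,
 * reproduced outside the kernel with illustrative numbers. */
#include <stdbool.h>
#include <stdio.h>

#define MIN_LRU_BATCH 64        /* same spirit as the kernel constant */

struct ctrl_pos {
        unsigned long refaulted;        /* refaults charged to this tier */
        unsigned long total;            /* evicted + protected */
        int gain;                       /* weight: e.g. 1 for SP, 2 for PV */
};

/* true: the PV refaults no more (after gain weighting) than the SP,
 * so it does not need protection */
static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
{
        return pv->refaulted < MIN_LRU_BATCH ||
               pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
               (sp->refaulted + 1) * pv->total * pv->gain;
}

int main(void)
{
        /* first tier vs. a higher tier; the numbers are invented */
        struct ctrl_pos sp = { .refaulted = 500,  .total = 10000, .gain = 1 };
        struct ctrl_pos pv = { .refaulted = 2000, .total = 8000,  .gain = 2 };

        printf("higher tier needs no protection? %s\n",
               positive_ctrl_err(&sp, &pv) ? "yes" : "no");
        return 0;
}
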
diff --git a/mm/workingset.c b/mm/workingset.c
|
|
index 592569a8974c..84a9e0ab04ad 100644
|
|
--- a/mm/workingset.c
|
|
+++ b/mm/workingset.c
|
|
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
|
|
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
|
|
bool workingset)
|
|
{
|
|
- eviction >>= bucket_order;
|
|
eviction &= EVICTION_MASK;
|
|
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
|
|
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
|
|
@@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
|
|
|
|
*memcgidp = memcgid;
|
|
*pgdat = NODE_DATA(nid);
|
|
- *evictionp = entry << bucket_order;
|
|
+ *evictionp = entry;
|
|
*workingsetp = workingset;
|
|
}
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+
|
|
+static void *lru_gen_eviction(struct folio *folio)
|
|
+{
|
|
+ int hist;
|
|
+ unsigned long token;
|
|
+ unsigned long min_seq;
|
|
+ struct lruvec *lruvec;
|
|
+ struct lru_gen_struct *lrugen;
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ int delta = folio_nr_pages(folio);
|
|
+ int refs = folio_lru_refs(folio);
|
|
+ int tier = lru_tier_from_refs(refs);
|
|
+ struct mem_cgroup *memcg = folio_memcg(folio);
|
|
+ struct pglist_data *pgdat = folio_pgdat(folio);
|
|
+
|
|
+ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
|
|
+
|
|
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
|
+ lrugen = &lruvec->lrugen;
|
|
+ min_seq = READ_ONCE(lrugen->min_seq[type]);
|
|
+ token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
|
|
+
|
|
+ hist = lru_hist_from_seq(min_seq);
|
|
+ atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
|
|
+
|
|
+ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
|
|
+}
|
|
+
|
|
+static void lru_gen_refault(struct folio *folio, void *shadow)
|
|
+{
|
|
+ int hist, tier, refs;
|
|
+ int memcg_id;
|
|
+ bool workingset;
|
|
+ unsigned long token;
|
|
+ unsigned long min_seq;
|
|
+ struct lruvec *lruvec;
|
|
+ struct lru_gen_struct *lrugen;
|
|
+ struct mem_cgroup *memcg;
|
|
+ struct pglist_data *pgdat;
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ int delta = folio_nr_pages(folio);
|
|
+
|
|
+ unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
|
|
+
|
|
+ if (pgdat != folio_pgdat(folio))
|
|
+ return;
|
|
+
|
|
+ rcu_read_lock();
|
|
+
|
|
+ memcg = folio_memcg_rcu(folio);
|
|
+ if (memcg_id != mem_cgroup_id(memcg))
|
|
+ goto unlock;
|
|
+
|
|
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
|
+ lrugen = &lruvec->lrugen;
|
|
+
|
|
+ min_seq = READ_ONCE(lrugen->min_seq[type]);
|
|
+ if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
|
|
+ goto unlock;
|
|
+
|
|
+ hist = lru_hist_from_seq(min_seq);
|
|
+ /* see the comment in folio_lru_refs() */
|
|
+ refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
|
|
+ tier = lru_tier_from_refs(refs);
|
|
+
|
|
+ atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
|
|
+ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
|
|
+
|
|
+ /*
|
|
+ * Count the following two cases as stalls:
|
|
+ * 1. For pages accessed through page tables, hotter pages pushed out
|
|
+ * hot pages which refaulted immediately.
|
|
+ * 2. For pages accessed multiple times through file descriptors,
|
|
+ * the number of accesses might have exceeded the range the tiers can track.
|
|
+ */
|
|
+ if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
|
|
+ folio_set_workingset(folio);
|
|
+ mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
|
|
+ }
|
|
+unlock:
|
|
+ rcu_read_unlock();
|
|
+}
|
|
+
|
|
+#else /* !CONFIG_LRU_GEN */
|
|
+
|
|
+static void *lru_gen_eviction(struct folio *folio)
|
|
+{
|
|
+ return NULL;
|
|
+}
|
|
+
|
|
+static void lru_gen_refault(struct folio *folio, void *shadow)
|
|
+{
|
|
+}
|
|
+
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
/**
|
|
* workingset_age_nonresident - age non-resident entries as LRU ages
|
|
* @lruvec: the lruvec that was aged
|
|
@@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
|
|
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
|
|
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
|
|
|
|
+ if (lru_gen_enabled())
|
|
+ return lru_gen_eviction(folio);
|
|
+
|
|
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
|
|
/* XXX: target_memcg can be NULL, go through lruvec */
|
|
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
|
|
eviction = atomic_long_read(&lruvec->nonresident_age);
|
|
+ eviction >>= bucket_order;
|
|
workingset_age_nonresident(lruvec, folio_nr_pages(folio));
|
|
return pack_shadow(memcgid, pgdat, eviction,
|
|
folio_test_workingset(folio));
|
|
@@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow)
|
|
int memcgid;
|
|
long nr;
|
|
|
|
+ if (lru_gen_enabled()) {
|
|
+ lru_gen_refault(folio, shadow);
|
|
+ return;
|
|
+ }
|
|
+
|
|
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
|
|
+ eviction <<= bucket_order;
|
|
|
|
rcu_read_lock();
|
|
/*
|
|
|
|
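
To recap the mm/workingset.c half of this patch: an MGLRU shadow entry no
longer stores a scaled eviction counter; it records the generation the folio
was evicted from (min_seq at eviction time) together with the tier-deciding
reference count, packed alongside the memcg and node IDs. The sketch below
packs and unpacks such a token in userspace; the field widths and helper names
are illustrative, not the kernel's.

/* shadow_token_sketch.c - packing an MGLRU-style shadow entry.
 * Field widths below are invented for the example. */
#include <stdio.h>

#define REFS_WIDTH   2  /* bits for the reference count */
#define MEMCG_WIDTH 16  /* bits for the memcg ID */
#define NODE_WIDTH   6  /* bits for the NUMA node ID */

static unsigned long pack(unsigned long min_seq, unsigned refs,
                          unsigned memcg_id, unsigned node)
{
        unsigned long token = (min_seq << REFS_WIDTH) | (refs ? refs - 1 : 0);

        token = (token << MEMCG_WIDTH) | memcg_id;
        token = (token << NODE_WIDTH) | node;
        return token;
}

static void unpack(unsigned long entry, unsigned long *token,
                   unsigned *memcg_id, unsigned *node)
{
        *node = entry & ((1U << NODE_WIDTH) - 1);
        entry >>= NODE_WIDTH;
        *memcg_id = entry & ((1U << MEMCG_WIDTH) - 1);
        entry >>= MEMCG_WIDTH;
        *token = entry;
}

int main(void)
{
        unsigned long entry = pack(41, 3, 1234, 2);
        unsigned long token;
        unsigned memcg_id, node;

        unpack(entry, &token, &memcg_id, &node);
        /* the refault path compares the generation part of the token
         * against the current min_seq of the same type */
        printf("gen=%lu refs=%lu memcg=%u node=%u\n",
               token >> REFS_WIDTH, (token & ((1UL << REFS_WIDTH) - 1)) + 1,
               memcg_id, node);
        return 0;
}
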
From patchwork Wed Jul 6 22:00:16 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908758
|
|
Date: Wed, 6 Jul 2022 16:00:16 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-8-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 07/14] mm: multi-gen LRU: exploit locality in rmap
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Barry Song <baohua@kernel.org>, Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>

Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, searching the rmap
can incur the highest CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[3, 5]%
                Ops/sec      KB/sec
      patch1-6: 1106168.46   43025.04
      patch1-7: 1147696.57   44640.29

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

  Configurations:
    no change

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/memcontrol.h |  31 +++++++
 include/linux/mm.h         |   5 +
 include/linux/mmzone.h     |   6 ++
 mm/internal.h              |   1 +
 mm/memcontrol.c            |   1 +
 mm/rmap.c                  |   6 ++
 mm/swap.c                  |   4 +-
 mm/vmscan.c                | 184 +++++++++++++++++++++++++++++++++++++
 8 files changed, 236 insertions(+), 2 deletions(-)

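The "at most BITS_PER_LONG-1 adjacent PTEs" in the description above falls out
of how the scan window is chosen: take the PMD range covering the faulting
address, clip it to the VMA, and if it is still wider than MIN_LRU_BATCH pages,
re-centre it on the faulting address. A userspace sketch of just that window
arithmetic follows (page and PMD sizes are hard-coded for the example); the
real logic lives in lru_gen_look_around() in the mm/vmscan.c hunk below.

/* lookaround_window_sketch.c - choosing the PTE scan window, sizes illustrative */
#include <stdio.h>

#define PAGE_SIZE     4096UL
#define PMD_SIZE      (512 * PAGE_SIZE)
#define PMD_MASK      (~(PMD_SIZE - 1))
#define MIN_LRU_BATCH 64UL      /* BITS_PER_LONG on a 64-bit build */

static void pick_window(unsigned long addr, unsigned long vm_start,
                        unsigned long vm_end, unsigned long *start,
                        unsigned long *end)
{
        /* clip the covering PMD range to the VMA */
        *start = addr & PMD_MASK;
        if (*start < vm_start)
                *start = vm_start;
        *end = (addr | ~PMD_MASK) + 1;
        if (*end > vm_end)
                *end = vm_end;

        /* cap the window at MIN_LRU_BATCH pages, keeping addr inside it */
        if (*end - *start > MIN_LRU_BATCH * PAGE_SIZE) {
                if (addr - *start < MIN_LRU_BATCH * PAGE_SIZE / 2)
                        *end = *start + MIN_LRU_BATCH * PAGE_SIZE;
                else if (*end - addr < MIN_LRU_BATCH * PAGE_SIZE / 2)
                        *start = *end - MIN_LRU_BATCH * PAGE_SIZE;
                else {
                        *start = addr - MIN_LRU_BATCH * PAGE_SIZE / 2;
                        *end = addr + MIN_LRU_BATCH * PAGE_SIZE / 2;
                }
        }
}

int main(void)
{
        unsigned long start, end;
        unsigned long addr = 0x7f0000234000UL;  /* arbitrary example fault */

        pick_window(addr, 0x7f0000000000UL, 0x7f0001000000UL, &start, &end);
        printf("scan %lu PTEs: [%#lx, %#lx)\n",
               (end - start) / PAGE_SIZE, start, end);
        return 0;
}
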
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
|
|
index 9ecead1042b9..9d0fea17f9ef 100644
|
|
--- a/include/linux/memcontrol.h
|
|
+++ b/include/linux/memcontrol.h
|
|
@@ -444,6 +444,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
|
|
* - LRU isolation
|
|
* - lock_page_memcg()
|
|
* - exclusive reference
|
|
+ * - mem_cgroup_trylock_pages()
|
|
*
|
|
* For a kmem folio a caller should hold an rcu read lock to protect memcg
|
|
* associated with a kmem folio from being released.
|
|
@@ -505,6 +506,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
|
|
* - LRU isolation
|
|
* - lock_page_memcg()
|
|
* - exclusive reference
|
|
+ * - mem_cgroup_trylock_pages()
|
|
*
|
|
* For a kmem page a caller should hold an rcu read lock to protect memcg
|
|
* associated with a kmem page from being released.
|
|
@@ -950,6 +952,23 @@ void unlock_page_memcg(struct page *page);
|
|
|
|
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
|
|
|
|
+/* try to stabilize folio_memcg() for all the pages in a memcg */
|
|
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
|
|
+{
|
|
+ rcu_read_lock();
|
|
+
|
|
+ if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
|
|
+ return true;
|
|
+
|
|
+ rcu_read_unlock();
|
|
+ return false;
|
|
+}
|
|
+
|
|
+static inline void mem_cgroup_unlock_pages(void)
|
|
+{
|
|
+ rcu_read_unlock();
|
|
+}
|
|
+
|
|
/* idx can be of type enum memcg_stat_item or node_stat_item */
|
|
static inline void mod_memcg_state(struct mem_cgroup *memcg,
|
|
int idx, int val)
|
|
@@ -1401,6 +1420,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
|
|
{
|
|
}
|
|
|
|
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
|
|
+{
|
|
+ /* to match folio_memcg_rcu() */
|
|
+ rcu_read_lock();
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static inline void mem_cgroup_unlock_pages(void)
|
|
+{
|
|
+ rcu_read_unlock();
|
|
+}
|
|
+
|
|
static inline void mem_cgroup_handle_over_high(void)
|
|
{
|
|
}
|
|
diff --git a/include/linux/mm.h b/include/linux/mm.h
|
|
index ed5393e5930d..981b2e447936 100644
|
|
--- a/include/linux/mm.h
|
|
+++ b/include/linux/mm.h
|
|
@@ -1523,6 +1523,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
|
|
return page_to_pfn(&folio->page);
|
|
}
|
|
|
|
+static inline struct folio *pfn_folio(unsigned long pfn)
|
|
+{
|
|
+ return page_folio(pfn_to_page(pfn));
|
|
+}
|
|
+
|
|
static inline atomic_t *folio_pincount_ptr(struct folio *folio)
|
|
{
|
|
return &folio_page(folio, 1)->compound_pincount;
|
|
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
|
|
index 0d76222501ed..4fd7fc16eeb4 100644
|
|
--- a/include/linux/mmzone.h
|
|
+++ b/include/linux/mmzone.h
|
|
@@ -372,6 +372,7 @@ enum lruvec_flags {
|
|
#ifndef __GENERATING_BOUNDS_H
|
|
|
|
struct lruvec;
|
|
+struct page_vma_mapped_walk;
|
|
|
|
#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
|
|
#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
|
|
@@ -427,6 +428,7 @@ struct lru_gen_struct {
|
|
};
|
|
|
|
void lru_gen_init_lruvec(struct lruvec *lruvec);
|
|
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
|
|
|
|
#ifdef CONFIG_MEMCG
|
|
void lru_gen_init_memcg(struct mem_cgroup *memcg);
|
|
@@ -439,6 +441,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
|
|
{
|
|
}
|
|
|
|
+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
+{
|
|
+}
|
|
+
|
|
#ifdef CONFIG_MEMCG
|
|
static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
|
|
{
|
|
diff --git a/mm/internal.h b/mm/internal.h
|
|
index c0f8fbe0445b..3d070582052e 100644
|
|
--- a/mm/internal.h
|
|
+++ b/mm/internal.h
|
|
@@ -83,6 +83,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
|
|
void folio_rotate_reclaimable(struct folio *folio);
|
|
bool __folio_end_writeback(struct folio *folio);
|
|
void deactivate_file_folio(struct folio *folio);
|
|
+void folio_activate(struct folio *folio);
|
|
|
|
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
|
|
unsigned long floor, unsigned long ceiling);
|
|
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
|
|
index 7d58e8a73ece..743f8513f1c3 100644
|
|
--- a/mm/memcontrol.c
|
|
+++ b/mm/memcontrol.c
|
|
@@ -2777,6 +2777,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
|
|
* - LRU isolation
|
|
* - lock_page_memcg()
|
|
* - exclusive reference
|
|
+ * - mem_cgroup_trylock_pages()
|
|
*/
|
|
folio->memcg_data = (unsigned long)memcg;
|
|
}
|
|
diff --git a/mm/rmap.c b/mm/rmap.c
|
|
index 5bcb334cd6f2..dce1a56b02f8 100644
|
|
--- a/mm/rmap.c
|
|
+++ b/mm/rmap.c
|
|
@@ -830,6 +830,12 @@ static bool folio_referenced_one(struct folio *folio,
|
|
}
|
|
|
|
if (pvmw.pte) {
|
|
+ if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
|
|
+ !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
|
|
+ lru_gen_look_around(&pvmw);
|
|
+ referenced++;
|
|
+ }
|
|
+
|
|
if (ptep_clear_flush_young_notify(vma, address,
|
|
pvmw.pte)) {
|
|
/*
|
|
diff --git a/mm/swap.c b/mm/swap.c
|
|
index 67e7962fbacc..131fc76242a3 100644
|
|
--- a/mm/swap.c
|
|
+++ b/mm/swap.c
|
|
@@ -342,7 +342,7 @@ static bool need_activate_page_drain(int cpu)
|
|
return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
|
|
}
|
|
|
|
-static void folio_activate(struct folio *folio)
|
|
+void folio_activate(struct folio *folio)
|
|
{
|
|
if (folio_test_lru(folio) && !folio_test_active(folio) &&
|
|
!folio_test_unevictable(folio)) {
|
|
@@ -362,7 +362,7 @@ static inline void activate_page_drain(int cpu)
|
|
{
|
|
}
|
|
|
|
-static void folio_activate(struct folio *folio)
|
|
+void folio_activate(struct folio *folio)
|
|
{
|
|
struct lruvec *lruvec;
|
|
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index f768d61e7b85..ec786fc556a7 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -1574,6 +1574,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
|
|
if (!sc->may_unmap && folio_mapped(folio))
|
|
goto keep_locked;
|
|
|
|
+ /* folio_update_gen() tried to promote this page? */
|
|
+ if (lru_gen_enabled() && !ignore_references &&
|
|
+ folio_mapped(folio) && folio_test_referenced(folio))
|
|
+ goto keep_locked;
|
|
+
|
|
/*
|
|
* The number of dirty pages determines if a node is marked
|
|
* reclaim_congested. kswapd will stall and start writing
|
|
@@ -3161,6 +3166,29 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
|
|
* the aging
|
|
******************************************************************************/
|
|
|
|
+/* promote pages accessed through page tables */
|
|
+static int folio_update_gen(struct folio *folio, int gen)
|
|
+{
|
|
+ unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
|
|
+
|
|
+ VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
|
|
+ VM_WARN_ON_ONCE(!rcu_read_lock_held());
|
|
+
|
|
+ do {
|
|
+ /* lru_gen_del_folio() has isolated this page? */
|
|
+ if (!(old_flags & LRU_GEN_MASK)) {
|
|
+ /* for shrink_page_list() */
|
|
+ new_flags = old_flags | BIT(PG_referenced);
|
|
+ continue;
|
|
+ }
|
|
+
|
|
+ new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
|
|
+ new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
|
|
+ } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
|
|
+
|
|
+ return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
|
|
+}
|
|
+
|
|
/* protect pages accessed multiple times through file descriptors */
|
|
static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
|
|
{
|
|
@@ -3172,6 +3200,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
|
|
VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
|
|
|
|
do {
|
|
+ new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
|
|
+ /* folio_update_gen() has promoted this page? */
|
|
+ if (new_gen >= 0 && new_gen != old_gen)
|
|
+ return new_gen;
|
|
+
|
|
new_gen = (old_gen + 1) % MAX_NR_GENS;
|
|
|
|
new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
|
|
@@ -3186,6 +3219,43 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
|
|
return new_gen;
|
|
}
|
|
|
|
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
|
|
+{
|
|
+ unsigned long pfn = pte_pfn(pte);
|
|
+
|
|
+ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end);
|
|
+
|
|
+ if (!pte_present(pte) || is_zero_pfn(pfn))
|
|
+ return -1;
|
|
+
|
|
+ if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
|
|
+ return -1;
|
|
+
|
|
+ if (WARN_ON_ONCE(!pfn_valid(pfn)))
|
|
+ return -1;
|
|
+
|
|
+ return pfn;
|
|
+}
|
|
+
|
|
+static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
|
|
+ struct pglist_data *pgdat)
|
|
+{
|
|
+ struct folio *folio;
|
|
+
|
|
+ /* try to avoid unnecessary memory loads */
|
|
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
|
|
+ return NULL;
|
|
+
|
|
+ folio = pfn_folio(pfn);
|
|
+ if (folio_nid(folio) != pgdat->node_id)
|
|
+ return NULL;
|
|
+
|
|
+ if (folio_memcg_rcu(folio) != memcg)
|
|
+ return NULL;
|
|
+
|
|
+ return folio;
|
|
+}
|
|
+
|
|
static void inc_min_seq(struct lruvec *lruvec, int type)
|
|
{
|
|
struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
@@ -3387,6 +3457,114 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
|
|
}
|
|
|
|
+/*
|
|
+ * This function exploits spatial locality when shrink_page_list() walks the
|
|
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
|
|
+ */
|
|
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
+{
|
|
+ int i;
|
|
+ pte_t *pte;
|
|
+ unsigned long start;
|
|
+ unsigned long end;
|
|
+ unsigned long addr;
|
|
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
|
|
+ struct folio *folio = pfn_folio(pvmw->pfn);
|
|
+ struct mem_cgroup *memcg = folio_memcg(folio);
|
|
+ struct pglist_data *pgdat = folio_pgdat(folio);
|
|
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
|
+ DEFINE_MAX_SEQ(lruvec);
|
|
+ int old_gen, new_gen = lru_gen_from_seq(max_seq);
|
|
+
|
|
+ lockdep_assert_held(pvmw->ptl);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
|
|
+
|
|
+ if (spin_is_contended(pvmw->ptl))
|
|
+ return;
|
|
+
|
|
+ start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
|
|
+ end = min(pvmw->address | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;
|
|
+
|
|
+ if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
|
|
+ if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
|
|
+ end = start + MIN_LRU_BATCH * PAGE_SIZE;
|
|
+ else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
|
|
+ start = end - MIN_LRU_BATCH * PAGE_SIZE;
|
|
+ else {
|
|
+ start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
|
|
+ end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
|
|
+ }
|
|
+ }
|
|
+
|
|
+ pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
|
|
+
|
|
+ rcu_read_lock();
|
|
+ arch_enter_lazy_mmu_mode();
|
|
+
|
|
+ for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
|
|
+ unsigned long pfn;
|
|
+
|
|
+ pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
|
|
+ if (pfn == -1)
|
|
+ continue;
|
|
+
|
|
+ if (!pte_young(pte[i]))
|
|
+ continue;
|
|
+
|
|
+ folio = get_pfn_folio(pfn, memcg, pgdat);
|
|
+ if (!folio)
|
|
+ continue;
|
|
+
|
|
+ if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
|
|
+ continue;
|
|
+
|
|
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
|
|
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
|
|
+ !folio_test_swapcache(folio)))
|
|
+ folio_mark_dirty(folio);
|
|
+
|
|
+ old_gen = folio_lru_gen(folio);
|
|
+ if (old_gen < 0)
|
|
+ folio_set_referenced(folio);
|
|
+ else if (old_gen != new_gen)
|
|
+ __set_bit(i, bitmap);
|
|
+ }
|
|
+
|
|
+ arch_leave_lazy_mmu_mode();
|
|
+ rcu_read_unlock();
|
|
+
|
|
+ if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
|
|
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
|
|
+ folio = pfn_folio(pte_pfn(pte[i]));
|
|
+ folio_activate(folio);
|
|
+ }
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ /* folio_update_gen() requires stable folio_memcg() */
|
|
+ if (!mem_cgroup_trylock_pages(memcg))
|
|
+ return;
|
|
+
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
|
|
+
|
|
+ for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
|
|
+ folio = pfn_folio(pte_pfn(pte[i]));
|
|
+ if (folio_memcg_rcu(folio) != memcg)
|
|
+ continue;
|
|
+
|
|
+ old_gen = folio_update_gen(folio, new_gen);
|
|
+ if (old_gen < 0 || old_gen == new_gen)
|
|
+ continue;
|
|
+
|
|
+ lru_gen_update_size(lruvec, folio, old_gen, new_gen);
|
|
+ }
|
|
+
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ mem_cgroup_unlock_pages();
|
|
+}
|
|
+
|
|
/******************************************************************************
|
|
* the eviction
|
|
******************************************************************************/
|
|
@@ -3423,6 +3601,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
|
|
return true;
|
|
}
|
|
|
|
+ /* promoted */
|
|
+ if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
|
|
+ list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
|
|
+ return true;
|
|
+ }
|
|
+
|
|
/* protected */
|
|
if (tier > tier_idx) {
|
|
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
|
|
|
|
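
One design point in lru_gen_look_around() above is easy to miss: while the
page-table lock is held, the scan only clears the accessed bits and records
hits in an on-stack bitmap; the per-folio generation updates happen afterwards,
via folio_activate() or in one batch under lru_lock. Below is a minimal,
kernel-free sketch of that two-pass pattern; the item structure and names are
invented for the example.

/* two_pass_bitmap_sketch.c - record hits cheaply first, update state later */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define BATCH 64

struct item {
        int gen;
        bool young;     /* stands in for the PTE accessed bit */
};

int main(void)
{
        unsigned long bitmap = 0;       /* BATCH <= 64, one word is enough */
        struct item items[BATCH];
        int max_gen = 3;

        memset(items, 0, sizeof(items));
        items[5].young = items[17].young = items[42].young = true;

        /* pass 1: pretend we hold a hot lock; only note which items were hit */
        for (int i = 0; i < BATCH; i++) {
                if (items[i].young) {
                        items[i].young = false; /* like clearing the accessed bit */
                        bitmap |= 1UL << i;
                }
        }

        /* pass 2: the expensive updates, done after dropping the hot lock */
        for (int i = 0; i < BATCH; i++) {
                if (!(bitmap & (1UL << i)))
                        continue;
                items[i].gen = max_gen; /* like folio_update_gen(folio, new_gen) */
                printf("item %d promoted to gen %d\n", i, items[i].gen);
        }
        return 0;
}
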
From patchwork Wed Jul 6 22:00:17 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908746
|
|
Date: Wed, 6 Jul 2022 16:00:17 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-9-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 08/14] mm: multi-gen LRU: support page table walks
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
|
|
To further exploit spatial locality, the aging prefers to walk page
|
|
tables to search for young PTEs and promote hot pages. A kill switch
|
|
will be added in the next patch to disable this behavior. When
|
|
disabled, the aging relies on the rmap only.
|
|
|
|
NB: this behavior bears no resemblance to the page table scanning in
|
|
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
|
|
pages to swapcache and unmaps them.
|
|
|
|
To avoid confusion, the term "iteration" specifically means the
|
|
traversal of an entire mm_struct list; the term "walk" will be applied
|
|
to page tables and the rmap, as usual.
|
|
|
|
An mm_struct list is maintained for each memcg, and an mm_struct
|
|
follows its owner task to the new memcg when this task is migrated.
|
|
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
|
|
walk_page_range() with each mm_struct on this list to promote hot
|
|
pages before it increments max_seq.
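As a rough sketch of the iteration described above (not the exact code; see
try_to_inc_max_seq() and walk_mm() in the diff below), a walker repeatedly
asks the per-memcg mm_struct list for the next mm and walks it before the
generation counter is advanced. The wrapper name age_one_lruvec() is
hypothetical, and locking, the no-walk fallback and error handling are
omitted:

static void age_one_lruvec(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
{
	struct mm_struct *mm = NULL;

	do {
		/* hands out a unique mm_struct, so multiple walkers can run concurrently */
		iterate_mm_list(lruvec, walk, &mm);
		if (mm)
			/* walk_page_range() under mmap_read_lock(), batching promotions */
			walk_mm(lruvec, mm, walk);
		cond_resched();
	} while (mm);

	/* the list has been fully traversed; the real code lets only the last walker do this */
	inc_max_seq(lruvec, walk->can_swap);
}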
|
|
|
|
When multiple page table walkers iterate the same list, each of them
|
|
gets a unique mm_struct; therefore they can run concurrently. Page
|
|
table walkers ignore any misplaced pages, e.g., if an mm_struct was
|
|
migrated, pages it left in the previous memcg will not be promoted
|
|
when its current memcg is under reclaim. Similarly, page table walkers
|
|
will not promote pages from nodes other than the one under reclaim.
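The hypothetical helper below illustrates the "misplaced pages" rule from the
previous paragraph; the real checks are spread across get_pfn_folio() and
walk_pmd_range() in the diff, and the name folio_under_reclaim() is made up
for this sketch. It assumes the caller holds rcu_read_lock() so that
folio_memcg_rcu() is safe:

static struct folio *folio_under_reclaim(unsigned long pfn, struct mem_cgroup *memcg,
					 struct pglist_data *pgdat)
{
	struct folio *folio;

	/* only promote pages on the node under reclaim */
	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
		return NULL;

	folio = pfn_folio(pfn);
	if (folio_nid(folio) != pgdat->node_id)
		return NULL;

	/* skip pages left behind in a previous memcg after task migration */
	if (folio_memcg_rcu(folio) != memcg)
		return NULL;

	return folio;
}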
|
|
|
|
This patch uses the following optimizations when walking page tables:
|
|
1. It tracks the usage of mm_structs between context switches so that
|
|
page table walkers can skip processes that have been sleeping since
|
|
the last iteration.
|
|
2. It uses generational Bloom filters to record populated branches so
|
|
that page table walkers can reduce their search space based on the
|
|
query results, e.g., to skip page tables containing mostly holes or
|
|
misplaced pages (a minimal sketch of such a filter follows this list).
|
|
3. It takes advantage of the accessed bit in non-leaf PMD entries when
|
|
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
|
|
4. It does not zigzag between a PGD table and the same PMD table
|
|
spanning multiple VMAs. IOW, it finishes all the VMAs within the
|
|
range of the same PMD table before it returns to a PGD table. This
|
|
improves the cache performance for workloads that have large
|
|
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
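The snippet below is a minimal sketch of the generational Bloom filter from
item 2, assuming a fixed 2^15-bit bitmap per generation and two keys derived
from a single hash_ptr() value; the real code is in get_item_key(),
update_bloom_filter() and test_bloom_filter() in the diff. The SKETCH_* names
are made up here:

#define SKETCH_BLOOM_SHIFT	15	/* m = 1 << 15 bits, k = 2 hash functions */

static void sketch_bloom_add(unsigned long *filter, void *item)
{
	u32 hash = hash_ptr(item, SKETCH_BLOOM_SHIFT * 2);

	set_bit(hash & (BIT(SKETCH_BLOOM_SHIFT) - 1), filter);
	set_bit(hash >> SKETCH_BLOOM_SHIFT, filter);
}

static bool sketch_bloom_test(unsigned long *filter, void *item)
{
	u32 hash = hash_ptr(item, SKETCH_BLOOM_SHIFT * 2);

	return test_bit(hash & (BIT(SKETCH_BLOOM_SHIFT) - 1), filter) &&
	       test_bit(hash >> SKETCH_BLOOM_SHIFT, filter);
}

A walker only descends into PMD entries that test positive in the filter built
by the previous generation, and entries still rich in young PTEs are added to
the filter for the next generation, which is what flips the two bitmaps back
and forth.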
|
|
|
|
Server benchmark results:
|
|
Single workload:
|
|
fio (buffered I/O): no change
|
|
|
|
Single workload:
|
|
memcached (anon): +[8, 10]%
|
|
Ops/sec KB/sec
|
|
patch1-7: 1147696.57 44640.29
|
|
patch1-8: 1245274.91 48435.66
|
|
|
|
Configurations:
|
|
no change
|
|
|
|
Client benchmark results:
|
|
kswapd profiles:
|
|
patch1-7
|
|
48.16% lzo1x_1_do_compress (real work)
|
|
8.20% page_vma_mapped_walk (overhead)
|
|
7.06% _raw_spin_unlock_irq
|
|
2.92% ptep_clear_flush
|
|
2.53% __zram_bvec_write
|
|
2.11% do_raw_spin_lock
|
|
2.02% memmove
|
|
1.93% lru_gen_look_around
|
|
1.56% free_unref_page_list
|
|
1.40% memset
|
|
|
|
patch1-8
|
|
49.44% lzo1x_1_do_compress (real work)
|
|
6.19% page_vma_mapped_walk (overhead)
|
|
5.97% _raw_spin_unlock_irq
|
|
3.13% get_pfn_folio
|
|
2.85% ptep_clear_flush
|
|
2.42% __zram_bvec_write
|
|
2.08% do_raw_spin_lock
|
|
1.92% memmove
|
|
1.44% alloc_zspage
|
|
1.36% memset
|
|
|
|
Configurations:
|
|
no change
|
|
|
|
Thanks to the following developers for their efforts [3].
|
|
kernel test robot <lkp@intel.com>
|
|
|
|
[1] https://lwn.net/Articles/23732/
|
|
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
|
|
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
|
|
|
|
Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
---
|
|
fs/exec.c | 2 +
|
|
include/linux/memcontrol.h | 5 +
|
|
include/linux/mm_types.h | 77 +++
|
|
include/linux/mmzone.h | 56 +-
|
|
include/linux/swap.h | 4 +
|
|
kernel/exit.c | 1 +
|
|
kernel/fork.c | 9 +
|
|
kernel/sched/core.c | 1 +
|
|
mm/memcontrol.c | 25 +
|
|
mm/vmscan.c | 1000 +++++++++++++++++++++++++++++++++++-
|
|
10 files changed, 1163 insertions(+), 17 deletions(-)
|
|
|
|
diff --git a/fs/exec.c b/fs/exec.c
|
|
index 0989fb8472a1..b1fda634e01a 100644
|
|
--- a/fs/exec.c
|
|
+++ b/fs/exec.c
|
|
@@ -1015,6 +1015,7 @@ static int exec_mmap(struct mm_struct *mm)
|
|
active_mm = tsk->active_mm;
|
|
tsk->active_mm = mm;
|
|
tsk->mm = mm;
|
|
+ lru_gen_add_mm(mm);
|
|
/*
|
|
* This prevents preemption while active_mm is being loaded and
|
|
* it and mm are being updated, which could cause problems for
|
|
@@ -1030,6 +1031,7 @@ static int exec_mmap(struct mm_struct *mm)
|
|
tsk->mm->vmacache_seqnum = 0;
|
|
vmacache_flush(tsk);
|
|
task_unlock(tsk);
|
|
+ lru_gen_use_mm(mm);
|
|
if (old_mm) {
|
|
mmap_read_unlock(old_mm);
|
|
BUG_ON(active_mm != old_mm);
|
|
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
|
|
index 9d0fea17f9ef..eca62345fdd5 100644
|
|
--- a/include/linux/memcontrol.h
|
|
+++ b/include/linux/memcontrol.h
|
|
@@ -350,6 +350,11 @@ struct mem_cgroup {
|
|
struct deferred_split deferred_split_queue;
|
|
#endif
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ /* per-memcg mm_struct list */
|
|
+ struct lru_gen_mm_list mm_list;
|
|
+#endif
|
|
+
|
|
struct mem_cgroup_per_node *nodeinfo[];
|
|
};
|
|
|
|
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
|
|
index c29ab4c0cd5c..7db51151a28b 100644
|
|
--- a/include/linux/mm_types.h
|
|
+++ b/include/linux/mm_types.h
|
|
@@ -3,6 +3,7 @@
|
|
#define _LINUX_MM_TYPES_H
|
|
|
|
#include <linux/mm_types_task.h>
|
|
+#include <linux/sched.h>
|
|
|
|
#include <linux/auxvec.h>
|
|
#include <linux/kref.h>
|
|
@@ -17,6 +18,7 @@
|
|
#include <linux/page-flags-layout.h>
|
|
#include <linux/workqueue.h>
|
|
#include <linux/seqlock.h>
|
|
+#include <linux/mmdebug.h>
|
|
|
|
#include <asm/mmu.h>
|
|
|
|
@@ -667,6 +669,22 @@ struct mm_struct {
|
|
*/
|
|
unsigned long ksm_merging_pages;
|
|
#endif
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ struct {
|
|
+ /* this mm_struct is on lru_gen_mm_list */
|
|
+ struct list_head list;
|
|
+ /*
|
|
+ * Set when switching to this mm_struct, as a hint of
|
|
+ * whether it has been used since the last time per-node
|
|
+ * page table walkers cleared the corresponding bits.
|
|
+ */
|
|
+ unsigned long bitmap;
|
|
+#ifdef CONFIG_MEMCG
|
|
+ /* points to the memcg of "owner" above */
|
|
+ struct mem_cgroup *memcg;
|
|
+#endif
|
|
+ } lru_gen;
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
} __randomize_layout;
|
|
|
|
/*
|
|
@@ -693,6 +711,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
|
|
return (struct cpumask *)&mm->cpu_bitmap;
|
|
}
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+
|
|
+struct lru_gen_mm_list {
|
|
+ /* mm_struct list for page table walkers */
|
|
+ struct list_head fifo;
|
|
+ /* protects the list above */
|
|
+ spinlock_t lock;
|
|
+};
|
|
+
|
|
+void lru_gen_add_mm(struct mm_struct *mm);
|
|
+void lru_gen_del_mm(struct mm_struct *mm);
|
|
+#ifdef CONFIG_MEMCG
|
|
+void lru_gen_migrate_mm(struct mm_struct *mm);
|
|
+#endif
|
|
+
|
|
+static inline void lru_gen_init_mm(struct mm_struct *mm)
|
|
+{
|
|
+ INIT_LIST_HEAD(&mm->lru_gen.list);
|
|
+ mm->lru_gen.bitmap = 0;
|
|
+#ifdef CONFIG_MEMCG
|
|
+ mm->lru_gen.memcg = NULL;
|
|
+#endif
|
|
+}
|
|
+
|
|
+static inline void lru_gen_use_mm(struct mm_struct *mm)
|
|
+{
|
|
+ /* unlikely but not a bug when racing with lru_gen_migrate_mm() */
|
|
+ VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
|
|
+
|
|
+ if (!(current->flags & PF_KTHREAD))
|
|
+ WRITE_ONCE(mm->lru_gen.bitmap, -1);
|
|
+}
|
|
+
|
|
+#else /* !CONFIG_LRU_GEN */
|
|
+
|
|
+static inline void lru_gen_add_mm(struct mm_struct *mm)
|
|
+{
|
|
+}
|
|
+
|
|
+static inline void lru_gen_del_mm(struct mm_struct *mm)
|
|
+{
|
|
+}
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
|
|
+{
|
|
+}
|
|
+#endif
|
|
+
|
|
+static inline void lru_gen_init_mm(struct mm_struct *mm)
|
|
+{
|
|
+}
|
|
+
|
|
+static inline void lru_gen_use_mm(struct mm_struct *mm)
|
|
+{
|
|
+}
|
|
+
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
struct mmu_gather;
|
|
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
|
|
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
|
|
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
|
|
index 4fd7fc16eeb4..0cf0856b484a 100644
|
|
--- a/include/linux/mmzone.h
|
|
+++ b/include/linux/mmzone.h
|
|
@@ -405,7 +405,7 @@ enum {
|
|
* min_seq behind.
|
|
*
|
|
* The number of pages in each generation is eventually consistent and therefore
|
|
- * can be transiently negative.
|
|
+ * can be transiently negative when reset_batch_size() is pending.
|
|
*/
|
|
struct lru_gen_struct {
|
|
/* the aging increments the youngest generation number */
|
|
@@ -427,6 +427,53 @@ struct lru_gen_struct {
|
|
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
|
|
};
|
|
|
|
+enum {
|
|
+ MM_LEAF_TOTAL, /* total leaf entries */
|
|
+ MM_LEAF_OLD, /* old leaf entries */
|
|
+ MM_LEAF_YOUNG, /* young leaf entries */
|
|
+ MM_NONLEAF_TOTAL, /* total non-leaf entries */
|
|
+ MM_NONLEAF_FOUND, /* non-leaf entries found in Bloom filters */
|
|
+ MM_NONLEAF_ADDED, /* non-leaf entries added to Bloom filters */
|
|
+ NR_MM_STATS
|
|
+};
|
|
+
|
|
+/* double-buffering Bloom filters */
|
|
+#define NR_BLOOM_FILTERS 2
|
|
+
|
|
+struct lru_gen_mm_state {
|
|
+ /* set to max_seq after each iteration */
|
|
+ unsigned long seq;
|
|
+ /* where the current iteration continues (inclusive) */
|
|
+ struct list_head *head;
|
|
+ /* where the last iteration ended (exclusive) */
|
|
+ struct list_head *tail;
|
|
+ /* to wait for the last page table walker to finish */
|
|
+ struct wait_queue_head wait;
|
|
+ /* Bloom filters flip after each iteration */
|
|
+ unsigned long *filters[NR_BLOOM_FILTERS];
|
|
+ /* the mm stats for debugging */
|
|
+ unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
|
|
+ /* the number of concurrent page table walkers */
|
|
+ int nr_walkers;
|
|
+};
|
|
+
|
|
+struct lru_gen_mm_walk {
|
|
+ /* the lruvec under reclaim */
|
|
+ struct lruvec *lruvec;
|
|
+ /* unstable max_seq from lru_gen_struct */
|
|
+ unsigned long max_seq;
|
|
+ /* the next address within an mm to scan */
|
|
+ unsigned long next_addr;
|
|
+ /* to batch promoted pages */
|
|
+ int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
|
|
+ /* to batch the mm stats */
|
|
+ int mm_stats[NR_MM_STATS];
|
|
+ /* total batched items */
|
|
+ int batched;
|
|
+ bool can_swap;
|
|
+ bool force_scan;
|
|
+};
|
|
+
|
|
void lru_gen_init_lruvec(struct lruvec *lruvec);
|
|
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
|
|
|
|
@@ -477,6 +524,8 @@ struct lruvec {
|
|
#ifdef CONFIG_LRU_GEN
|
|
/* evictable pages divided into generations */
|
|
struct lru_gen_struct lrugen;
|
|
+ /* to concurrently iterate lru_gen_mm_list */
|
|
+ struct lru_gen_mm_state mm_state;
|
|
#endif
|
|
#ifdef CONFIG_MEMCG
|
|
struct pglist_data *pgdat;
|
|
@@ -1070,6 +1119,11 @@ typedef struct pglist_data {
|
|
|
|
unsigned long flags;
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ /* kswapd mm walk data */
|
|
+ struct lru_gen_mm_walk mm_walk;
|
|
+#endif
|
|
+
|
|
ZONE_PADDING(_pad2_)
|
|
|
|
/* Per-node vmstats */
|
|
diff --git a/include/linux/swap.h b/include/linux/swap.h
|
|
index 0c0fed1b348f..b66cbc7ea93c 100644
|
|
--- a/include/linux/swap.h
|
|
+++ b/include/linux/swap.h
|
|
@@ -162,6 +162,10 @@ union swap_header {
|
|
*/
|
|
struct reclaim_state {
|
|
unsigned long reclaimed_slab;
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ /* per-thread mm walk data */
|
|
+ struct lru_gen_mm_walk *mm_walk;
|
|
+#endif
|
|
};
|
|
|
|
#ifdef __KERNEL__
|
|
diff --git a/kernel/exit.c b/kernel/exit.c
|
|
index f072959fcab7..f2d4d48ea790 100644
|
|
--- a/kernel/exit.c
|
|
+++ b/kernel/exit.c
|
|
@@ -466,6 +466,7 @@ void mm_update_next_owner(struct mm_struct *mm)
|
|
goto retry;
|
|
}
|
|
WRITE_ONCE(mm->owner, c);
|
|
+ lru_gen_migrate_mm(mm);
|
|
task_unlock(c);
|
|
put_task_struct(c);
|
|
}
|
|
diff --git a/kernel/fork.c b/kernel/fork.c
|
|
index 9d44f2d46c69..67b7666d7321 100644
|
|
--- a/kernel/fork.c
|
|
+++ b/kernel/fork.c
|
|
@@ -1152,6 +1152,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
|
|
goto fail_nocontext;
|
|
|
|
mm->user_ns = get_user_ns(user_ns);
|
|
+ lru_gen_init_mm(mm);
|
|
return mm;
|
|
|
|
fail_nocontext:
|
|
@@ -1194,6 +1195,7 @@ static inline void __mmput(struct mm_struct *mm)
|
|
}
|
|
if (mm->binfmt)
|
|
module_put(mm->binfmt->module);
|
|
+ lru_gen_del_mm(mm);
|
|
mmdrop(mm);
|
|
}
|
|
|
|
@@ -2676,6 +2678,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
|
|
get_task_struct(p);
|
|
}
|
|
|
|
+ if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
|
|
+ /* lock the task to synchronize with memcg migration */
|
|
+ task_lock(p);
|
|
+ lru_gen_add_mm(p->mm);
|
|
+ task_unlock(p);
|
|
+ }
|
|
+
|
|
wake_up_new_task(p);
|
|
|
|
/* forking complete and child started to run, tell ptracer */
|
|
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
|
|
index da0bf6fe9ecd..320d82697037 100644
|
|
--- a/kernel/sched/core.c
|
|
+++ b/kernel/sched/core.c
|
|
@@ -5130,6 +5130,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
|
|
* finish_task_switch()'s mmdrop().
|
|
*/
|
|
switch_mm_irqs_off(prev->active_mm, next->mm, next);
|
|
+ lru_gen_use_mm(next->mm);
|
|
|
|
if (!prev->mm) { // from kernel
|
|
/* will mmdrop() in finish_task_switch(). */
|
|
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
|
|
index 743f8513f1c3..84f3707667bc 100644
|
|
--- a/mm/memcontrol.c
|
|
+++ b/mm/memcontrol.c
|
|
@@ -6133,6 +6133,30 @@ static void mem_cgroup_move_task(void)
|
|
}
|
|
#endif
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
|
|
+{
|
|
+ struct task_struct *task;
|
|
+ struct cgroup_subsys_state *css;
|
|
+
|
|
+ /* find the first leader if there is any */
|
|
+ cgroup_taskset_for_each_leader(task, css, tset)
|
|
+ break;
|
|
+
|
|
+ if (!task)
|
|
+ return;
|
|
+
|
|
+ task_lock(task);
|
|
+ if (task->mm && task->mm->owner == task)
|
|
+ lru_gen_migrate_mm(task->mm);
|
|
+ task_unlock(task);
|
|
+}
|
|
+#else
|
|
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
|
|
+{
|
|
+}
|
|
+#endif /* CONFIG_LRU_GEN */
|
|
+
|
|
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
|
|
{
|
|
if (value == PAGE_COUNTER_MAX)
|
|
@@ -6536,6 +6560,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
|
|
.css_reset = mem_cgroup_css_reset,
|
|
.css_rstat_flush = mem_cgroup_css_rstat_flush,
|
|
.can_attach = mem_cgroup_can_attach,
|
|
+ .attach = mem_cgroup_attach,
|
|
.cancel_attach = mem_cgroup_cancel_attach,
|
|
.post_attach = mem_cgroup_move_task,
|
|
.dfl_cftypes = memory_files,
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index ec786fc556a7..8e55a1ce1ae0 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -50,6 +50,8 @@
|
|
#include <linux/printk.h>
|
|
#include <linux/dax.h>
|
|
#include <linux/psi.h>
|
|
+#include <linux/pagewalk.h>
|
|
+#include <linux/shmem_fs.h>
|
|
|
|
#include <asm/tlbflush.h>
|
|
#include <asm/div64.h>
|
|
@@ -3024,7 +3026,7 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
|
|
for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
|
|
for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
|
|
|
|
-static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
|
|
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
|
|
{
|
|
struct pglist_data *pgdat = NODE_DATA(nid);
|
|
|
|
@@ -3069,6 +3071,372 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
|
|
get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
|
|
}
|
|
|
|
+/******************************************************************************
|
|
+ * mm_struct list
|
|
+ ******************************************************************************/
|
|
+
|
|
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
|
|
+{
|
|
+ static struct lru_gen_mm_list mm_list = {
|
|
+ .fifo = LIST_HEAD_INIT(mm_list.fifo),
|
|
+ .lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
|
|
+ };
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+ if (memcg)
|
|
+ return &memcg->mm_list;
|
|
+#endif
|
|
+ VM_WARN_ON_ONCE(!mem_cgroup_disabled());
|
|
+
|
|
+ return &mm_list;
|
|
+}
|
|
+
|
|
+void lru_gen_add_mm(struct mm_struct *mm)
|
|
+{
|
|
+ int nid;
|
|
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
|
|
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
|
|
+
|
|
+ VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list));
|
|
+#ifdef CONFIG_MEMCG
|
|
+ VM_WARN_ON_ONCE(mm->lru_gen.memcg);
|
|
+ mm->lru_gen.memcg = memcg;
|
|
+#endif
|
|
+ spin_lock(&mm_list->lock);
|
|
+
|
|
+ for_each_node_state(nid, N_MEMORY) {
|
|
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
|
|
+
|
|
+ if (!lruvec)
|
|
+ continue;
|
|
+
|
|
+ /* the first addition since the last iteration */
|
|
+ if (lruvec->mm_state.tail == &mm_list->fifo)
|
|
+ lruvec->mm_state.tail = &mm->lru_gen.list;
|
|
+ }
|
|
+
|
|
+ list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
|
|
+
|
|
+ spin_unlock(&mm_list->lock);
|
|
+}
|
|
+
|
|
+void lru_gen_del_mm(struct mm_struct *mm)
|
|
+{
|
|
+ int nid;
|
|
+ struct lru_gen_mm_list *mm_list;
|
|
+ struct mem_cgroup *memcg = NULL;
|
|
+
|
|
+ if (list_empty(&mm->lru_gen.list))
|
|
+ return;
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+ memcg = mm->lru_gen.memcg;
|
|
+#endif
|
|
+ mm_list = get_mm_list(memcg);
|
|
+
|
|
+ spin_lock(&mm_list->lock);
|
|
+
|
|
+ for_each_node(nid) {
|
|
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
|
|
+
|
|
+ if (!lruvec)
|
|
+ continue;
|
|
+
|
|
+ /* where the last iteration ended (exclusive) */
|
|
+ if (lruvec->mm_state.tail == &mm->lru_gen.list)
|
|
+ lruvec->mm_state.tail = lruvec->mm_state.tail->next;
|
|
+
|
|
+ /* where the current iteration continues (inclusive) */
|
|
+ if (lruvec->mm_state.head != &mm->lru_gen.list)
|
|
+ continue;
|
|
+
|
|
+ lruvec->mm_state.head = lruvec->mm_state.head->next;
|
|
+ /* the deletion ends the current iteration */
|
|
+ if (lruvec->mm_state.head == &mm_list->fifo)
|
|
+ WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
|
|
+ }
|
|
+
|
|
+ list_del_init(&mm->lru_gen.list);
|
|
+
|
|
+ spin_unlock(&mm_list->lock);
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+ mem_cgroup_put(mm->lru_gen.memcg);
|
|
+ mm->lru_gen.memcg = NULL;
|
|
+#endif
|
|
+}
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+void lru_gen_migrate_mm(struct mm_struct *mm)
|
|
+{
|
|
+ struct mem_cgroup *memcg;
|
|
+
|
|
+ lockdep_assert_held(&mm->owner->alloc_lock);
|
|
+
|
|
+ /* for mm_update_next_owner() */
|
|
+ if (mem_cgroup_disabled())
|
|
+ return;
|
|
+
|
|
+ rcu_read_lock();
|
|
+ memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
|
|
+ rcu_read_unlock();
|
|
+ if (memcg == mm->lru_gen.memcg)
|
|
+ return;
|
|
+
|
|
+ VM_WARN_ON_ONCE(!mm->lru_gen.memcg);
|
|
+ VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
|
|
+
|
|
+ lru_gen_del_mm(mm);
|
|
+ lru_gen_add_mm(mm);
|
|
+}
|
|
+#endif
|
|
+
|
|
+/*
|
|
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
|
|
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
|
|
+ * bits in a bitmap, k is the number of hash functions and n is the number of
|
|
+ * inserted items.
|
|
+ *
|
|
+ * Page table walkers use one of the two filters to reduce their search space.
|
|
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
|
|
+ * aging uses the double-buffering technique to flip to the other filter each
|
|
+ * time it produces a new generation. For non-leaf entries that have enough
|
|
+ * leaf entries, the aging carries them over to the next generation in
|
|
+ * walk_pmd_range(); the eviction also reports them when walking the rmap
|
|
+ * in lru_gen_look_around().
|
|
+ *
|
|
+ * For future optimizations:
|
|
+ * 1. It's not necessary to keep both filters all the time. The spare one can be
|
|
+ * freed after the RCU grace period and reallocated if needed again.
|
|
+ * 2. And when reallocating, it's worth scaling its size according to the number
|
|
+ * of inserted entries in the other filter, to reduce the memory overhead on
|
|
+ * small systems and false positives on large systems.
|
|
+ * 3. Jenkins' hash function is an alternative to Knuth's.
|
|
+ */
|
|
+#define BLOOM_FILTER_SHIFT 15
|
|
+
|
|
+static inline int filter_gen_from_seq(unsigned long seq)
|
|
+{
|
|
+ return seq % NR_BLOOM_FILTERS;
|
|
+}
|
|
+
|
|
+static void get_item_key(void *item, int *key)
|
|
+{
|
|
+ u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
|
|
+
|
|
+ BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
|
|
+
|
|
+ key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
|
|
+ key[1] = hash >> BLOOM_FILTER_SHIFT;
|
|
+}
|
|
+
|
|
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
|
|
+{
|
|
+ unsigned long *filter;
|
|
+ int gen = filter_gen_from_seq(seq);
|
|
+
|
|
+ filter = lruvec->mm_state.filters[gen];
|
|
+ if (filter) {
|
|
+ bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT),
|
|
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
|
|
+ WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
|
|
+}
|
|
+
|
|
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
|
|
+{
|
|
+ int key[2];
|
|
+ unsigned long *filter;
|
|
+ int gen = filter_gen_from_seq(seq);
|
|
+
|
|
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
|
|
+ if (!filter)
|
|
+ return;
|
|
+
|
|
+ get_item_key(item, key);
|
|
+
|
|
+ if (!test_bit(key[0], filter))
|
|
+ set_bit(key[0], filter);
|
|
+ if (!test_bit(key[1], filter))
|
|
+ set_bit(key[1], filter);
|
|
+}
|
|
+
|
|
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
|
|
+{
|
|
+ int key[2];
|
|
+ unsigned long *filter;
|
|
+ int gen = filter_gen_from_seq(seq);
|
|
+
|
|
+ filter = READ_ONCE(lruvec->mm_state.filters[gen]);
|
|
+ if (!filter)
|
|
+ return true;
|
|
+
|
|
+ get_item_key(item, key);
|
|
+
|
|
+ return test_bit(key[0], filter) && test_bit(key[1], filter);
|
|
+}
|
|
+
|
|
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
|
|
+{
|
|
+ int i;
|
|
+ int hist;
|
|
+
|
|
+ lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
|
|
+
|
|
+ if (walk) {
|
|
+ hist = lru_hist_from_seq(walk->max_seq);
|
|
+
|
|
+ for (i = 0; i < NR_MM_STATS; i++) {
|
|
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i],
|
|
+ lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
|
|
+ walk->mm_stats[i] = 0;
|
|
+ }
|
|
+ }
|
|
+
|
|
+ if (NR_HIST_GENS > 1 && last) {
|
|
+ hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
|
|
+
|
|
+ for (i = 0; i < NR_MM_STATS; i++)
|
|
+ WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
|
|
+ }
|
|
+}
|
|
+
|
|
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
|
|
+{
|
|
+ int type;
|
|
+ unsigned long size = 0;
|
|
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
|
|
+ int key = pgdat->node_id % BITS_PER_TYPE(mm->lru_gen.bitmap);
|
|
+
|
|
+ if (!walk->force_scan && !test_bit(key, &mm->lru_gen.bitmap))
|
|
+ return true;
|
|
+
|
|
+ clear_bit(key, &mm->lru_gen.bitmap);
|
|
+
|
|
+ for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
|
|
+ size += type ? get_mm_counter(mm, MM_FILEPAGES) :
|
|
+ get_mm_counter(mm, MM_ANONPAGES) +
|
|
+ get_mm_counter(mm, MM_SHMEMPAGES);
|
|
+ }
|
|
+
|
|
+ if (size < MIN_LRU_BATCH)
|
|
+ return true;
|
|
+
|
|
+ if (test_bit(MMF_OOM_REAP_QUEUED, &mm->flags))
|
|
+ return true;
|
|
+
|
|
+ return !mmget_not_zero(mm);
|
|
+}
|
|
+
|
|
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
|
|
+ struct mm_struct **iter)
|
|
+{
|
|
+ bool first = false;
|
|
+ bool last = true;
|
|
+ struct mm_struct *mm = NULL;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
|
|
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
|
|
+
|
|
+ /*
|
|
+ * There are four interesting cases for this page table walker:
|
|
+ * 1. It tries to start a new iteration of mm_list with a stale max_seq;
|
|
+ * there is nothing left to do.
|
|
+ * 2. It's the first of the current generation, and it needs to reset
|
|
+ * the Bloom filter for the next generation.
|
|
+ * 3. It reaches the end of mm_list, and it needs to increment
|
|
+ * mm_state->seq; the iteration is done.
|
|
+ * 4. It's the last of the current generation, and it needs to reset the
|
|
+ * mm stats counters for the next generation.
|
|
+ */
|
|
+ spin_lock(&mm_list->lock);
|
|
+
|
|
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->max_seq);
|
|
+ VM_WARN_ON_ONCE(*iter && mm_state->seq > walk->max_seq);
|
|
+ VM_WARN_ON_ONCE(*iter && !mm_state->nr_walkers);
|
|
+
|
|
+ if (walk->max_seq <= mm_state->seq) {
|
|
+ if (!*iter)
|
|
+ last = false;
|
|
+ goto done;
|
|
+ }
|
|
+
|
|
+ if (!mm_state->nr_walkers) {
|
|
+ VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
|
|
+
|
|
+ mm_state->head = mm_list->fifo.next;
|
|
+ first = true;
|
|
+ }
|
|
+
|
|
+ while (!mm && mm_state->head != &mm_list->fifo) {
|
|
+ mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
|
|
+
|
|
+ mm_state->head = mm_state->head->next;
|
|
+
|
|
+ /* force scan for those added after the last iteration */
|
|
+ if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) {
|
|
+ mm_state->tail = mm_state->head;
|
|
+ walk->force_scan = true;
|
|
+ }
|
|
+
|
|
+ if (should_skip_mm(mm, walk))
|
|
+ mm = NULL;
|
|
+ }
|
|
+
|
|
+ if (mm_state->head == &mm_list->fifo)
|
|
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
|
|
+done:
|
|
+ if (*iter && !mm)
|
|
+ mm_state->nr_walkers--;
|
|
+ if (!*iter && mm)
|
|
+ mm_state->nr_walkers++;
|
|
+
|
|
+ if (mm_state->nr_walkers)
|
|
+ last = false;
|
|
+
|
|
+ if (*iter || last)
|
|
+ reset_mm_stats(lruvec, walk, last);
|
|
+
|
|
+ spin_unlock(&mm_list->lock);
|
|
+
|
|
+ if (mm && first)
|
|
+ reset_bloom_filter(lruvec, walk->max_seq + 1);
|
|
+
|
|
+ if (*iter)
|
|
+ mmput_async(*iter);
|
|
+
|
|
+ *iter = mm;
|
|
+
|
|
+ return last;
|
|
+}
|
|
+
|
|
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
|
|
+{
|
|
+ bool success = false;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
|
|
+ struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
|
|
+
|
|
+ spin_lock(&mm_list->lock);
|
|
+
|
|
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
|
|
+
|
|
+ if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
|
|
+ VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
|
|
+
|
|
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
|
|
+ reset_mm_stats(lruvec, NULL, true);
|
|
+ success = true;
|
|
+ }
|
|
+
|
|
+ spin_unlock(&mm_list->lock);
|
|
+
|
|
+ return success;
|
|
+}
|
|
+
|
|
/******************************************************************************
|
|
* refault feedback loop
|
|
******************************************************************************/
|
|
@@ -3219,6 +3587,118 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
|
|
return new_gen;
|
|
}
|
|
|
|
+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
|
|
+ int old_gen, int new_gen)
|
|
+{
|
|
+ int type = folio_is_file_lru(folio);
|
|
+ int zone = folio_zonenum(folio);
|
|
+ int delta = folio_nr_pages(folio);
|
|
+
|
|
+ VM_WARN_ON_ONCE(old_gen >= MAX_NR_GENS);
|
|
+ VM_WARN_ON_ONCE(new_gen >= MAX_NR_GENS);
|
|
+
|
|
+ walk->batched++;
|
|
+
|
|
+ walk->nr_pages[old_gen][type][zone] -= delta;
|
|
+ walk->nr_pages[new_gen][type][zone] += delta;
|
|
+}
|
|
+
|
|
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
|
|
+{
|
|
+ int gen, type, zone;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ walk->batched = 0;
|
|
+
|
|
+ for_each_gen_type_zone(gen, type, zone) {
|
|
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
|
|
+ int delta = walk->nr_pages[gen][type][zone];
|
|
+
|
|
+ if (!delta)
|
|
+ continue;
|
|
+
|
|
+ walk->nr_pages[gen][type][zone] = 0;
|
|
+ WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
|
|
+ lrugen->nr_pages[gen][type][zone] + delta);
|
|
+
|
|
+ if (lru_gen_is_active(lruvec, gen))
|
|
+ lru += LRU_ACTIVE;
|
|
+ __update_lru_size(lruvec, lru, zone, delta);
|
|
+ }
|
|
+}
|
|
+
|
|
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
|
|
+{
|
|
+ struct address_space *mapping;
|
|
+ struct vm_area_struct *vma = args->vma;
|
|
+ struct lru_gen_mm_walk *walk = args->private;
|
|
+
|
|
+ if (!vma_is_accessible(vma))
|
|
+ return true;
|
|
+
|
|
+ if (is_vm_hugetlb_page(vma))
|
|
+ return true;
|
|
+
|
|
+ if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ))
|
|
+ return true;
|
|
+
|
|
+ if (vma == get_gate_vma(vma->vm_mm))
|
|
+ return true;
|
|
+
|
|
+ if (vma_is_anonymous(vma))
|
|
+ return !walk->can_swap;
|
|
+
|
|
+ if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
|
|
+ return true;
|
|
+
|
|
+ mapping = vma->vm_file->f_mapping;
|
|
+ if (mapping_unevictable(mapping))
|
|
+ return true;
|
|
+
|
|
+ if (shmem_mapping(mapping))
|
|
+ return !walk->can_swap;
|
|
+
|
|
+ /* to exclude special mappings like dax, etc. */
|
|
+ return !mapping->a_ops->read_folio;
|
|
+}
|
|
+
|
|
+/*
|
|
+ * Some userspace memory allocators map many single-page VMAs. Instead of
|
|
+ * returning to the PGD table for each such VMA, finish an entire PMD
|
|
+ * table to reduce zigzags and improve cache performance.
|
|
+ */
|
|
+static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk *args,
|
|
+ unsigned long *vm_start, unsigned long *vm_end)
|
|
+{
|
|
+ unsigned long start = round_up(*vm_end, size);
|
|
+ unsigned long end = (start | ~mask) + 1;
|
|
+
|
|
+ VM_WARN_ON_ONCE(mask & size);
|
|
+ VM_WARN_ON_ONCE((start & mask) != (*vm_start & mask));
|
|
+
|
|
+ while (args->vma) {
|
|
+ if (start >= args->vma->vm_end) {
|
|
+ args->vma = args->vma->vm_next;
|
|
+ continue;
|
|
+ }
|
|
+
|
|
+ if (end && end <= args->vma->vm_start)
|
|
+ return false;
|
|
+
|
|
+ if (should_skip_vma(args->vma->vm_start, args->vma->vm_end, args)) {
|
|
+ args->vma = args->vma->vm_next;
|
|
+ continue;
|
|
+ }
|
|
+
|
|
+ *vm_start = max(start, args->vma->vm_start);
|
|
+ *vm_end = min(end - 1, args->vma->vm_end - 1) + 1;
|
|
+
|
|
+ return true;
|
|
+ }
|
|
+
|
|
+ return false;
|
|
+}
|
|
+
|
|
static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
|
|
{
|
|
unsigned long pfn = pte_pfn(pte);
|
|
@@ -3237,8 +3717,28 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
|
|
return pfn;
|
|
}
|
|
|
|
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
|
|
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
|
|
+{
|
|
+ unsigned long pfn = pmd_pfn(pmd);
|
|
+
|
|
+ VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end);
|
|
+
|
|
+ if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
|
|
+ return -1;
|
|
+
|
|
+ if (WARN_ON_ONCE(pmd_devmap(pmd)))
|
|
+ return -1;
|
|
+
|
|
+ if (WARN_ON_ONCE(!pfn_valid(pfn)))
|
|
+ return -1;
|
|
+
|
|
+ return pfn;
|
|
+}
|
|
+#endif
|
|
+
|
|
static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
|
|
- struct pglist_data *pgdat)
|
|
+ struct pglist_data *pgdat, bool can_swap)
|
|
{
|
|
struct folio *folio;
|
|
|
|
@@ -3253,9 +3753,371 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
|
|
if (folio_memcg_rcu(folio) != memcg)
|
|
return NULL;
|
|
|
|
+ /* file VMAs can contain anon pages from COW */
|
|
+ if (!folio_is_file_lru(folio) && !can_swap)
|
|
+ return NULL;
|
|
+
|
|
return folio;
|
|
}
|
|
|
|
+static bool suitable_to_scan(int total, int young)
|
|
+{
|
|
+ int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
|
|
+
|
|
+ /* suitable if the average number of young PTEs per cacheline is >=1 */
|
|
+ return young * n >= total;
|
|
+}
|
|
+
|
|
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
|
|
+ struct mm_walk *args)
|
|
+{
|
|
+ int i;
|
|
+ pte_t *pte;
|
|
+ spinlock_t *ptl;
|
|
+ unsigned long addr;
|
|
+ int total = 0;
|
|
+ int young = 0;
|
|
+ struct lru_gen_mm_walk *walk = args->private;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
|
|
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
|
|
+ int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
|
|
+
|
|
+ VM_WARN_ON_ONCE(pmd_leaf(*pmd));
|
|
+
|
|
+ ptl = pte_lockptr(args->mm, pmd);
|
|
+ if (!spin_trylock(ptl))
|
|
+ return false;
|
|
+
|
|
+ arch_enter_lazy_mmu_mode();
|
|
+
|
|
+ pte = pte_offset_map(pmd, start & PMD_MASK);
|
|
+restart:
|
|
+ for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
|
|
+ unsigned long pfn;
|
|
+ struct folio *folio;
|
|
+
|
|
+ total++;
|
|
+ walk->mm_stats[MM_LEAF_TOTAL]++;
|
|
+
|
|
+ pfn = get_pte_pfn(pte[i], args->vma, addr);
|
|
+ if (pfn == -1)
|
|
+ continue;
|
|
+
|
|
+ if (!pte_young(pte[i])) {
|
|
+ walk->mm_stats[MM_LEAF_OLD]++;
|
|
+ continue;
|
|
+ }
|
|
+
|
|
+ folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
|
|
+ if (!folio)
|
|
+ continue;
|
|
+
|
|
+ if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
|
|
+ continue;
|
|
+
|
|
+ young++;
|
|
+ walk->mm_stats[MM_LEAF_YOUNG]++;
|
|
+
|
|
+ if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
|
|
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
|
|
+ !folio_test_swapcache(folio)))
|
|
+ folio_mark_dirty(folio);
|
|
+
|
|
+ old_gen = folio_update_gen(folio, new_gen);
|
|
+ if (old_gen >= 0 && old_gen != new_gen)
|
|
+ update_batch_size(walk, folio, old_gen, new_gen);
|
|
+ }
|
|
+
|
|
+ if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
|
|
+ goto restart;
|
|
+
|
|
+ pte_unmap(pte);
|
|
+
|
|
+ arch_leave_lazy_mmu_mode();
|
|
+ spin_unlock(ptl);
|
|
+
|
|
+ return suitable_to_scan(total, young);
|
|
+}
|
|
+
|
|
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
|
|
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
|
|
+ struct mm_walk *args, unsigned long *bitmap, unsigned long *start)
|
|
+{
|
|
+ int i;
|
|
+ pmd_t *pmd;
|
|
+ spinlock_t *ptl;
|
|
+ struct lru_gen_mm_walk *walk = args->private;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
|
|
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
|
|
+ int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
|
|
+
|
|
+ VM_WARN_ON_ONCE(pud_leaf(*pud));
|
|
+
|
|
+ /* try to batch at most 1+MIN_LRU_BATCH+1 entries */
|
|
+ if (*start == -1) {
|
|
+ *start = next;
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
|
|
+ if (i && i <= MIN_LRU_BATCH) {
|
|
+ __set_bit(i - 1, bitmap);
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ pmd = pmd_offset(pud, *start);
|
|
+
|
|
+ ptl = pmd_lockptr(args->mm, pmd);
|
|
+ if (!spin_trylock(ptl))
|
|
+ goto done;
|
|
+
|
|
+ arch_enter_lazy_mmu_mode();
|
|
+
|
|
+ do {
|
|
+ unsigned long pfn;
|
|
+ struct folio *folio;
|
|
+ unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
|
|
+
|
|
+ pfn = get_pmd_pfn(pmd[i], vma, addr);
|
|
+ if (pfn == -1)
|
|
+ goto next;
|
|
+
|
|
+ if (!pmd_trans_huge(pmd[i])) {
|
|
+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
|
|
+ pmdp_test_and_clear_young(vma, addr, pmd + i);
|
|
+ goto next;
|
|
+ }
|
|
+
|
|
+ folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
|
|
+ if (!folio)
|
|
+ goto next;
|
|
+
|
|
+ if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
|
|
+ goto next;
|
|
+
|
|
+ walk->mm_stats[MM_LEAF_YOUNG]++;
|
|
+
|
|
+ if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
|
|
+ !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
|
|
+ !folio_test_swapcache(folio)))
|
|
+ folio_mark_dirty(folio);
|
|
+
|
|
+ old_gen = folio_update_gen(folio, new_gen);
|
|
+ if (old_gen >= 0 && old_gen != new_gen)
|
|
+ update_batch_size(walk, folio, old_gen, new_gen);
|
|
+next:
|
|
+ i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
|
|
+ } while (i <= MIN_LRU_BATCH);
|
|
+
|
|
+ arch_leave_lazy_mmu_mode();
|
|
+ spin_unlock(ptl);
|
|
+done:
|
|
+ *start = -1;
|
|
+ bitmap_zero(bitmap, MIN_LRU_BATCH);
|
|
+}
|
|
+#else
|
|
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
|
|
+ struct mm_walk *args, unsigned long *bitmap, unsigned long *start)
|
|
+{
|
|
+}
|
|
+#endif
|
|
+
|
|
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
|
|
+ struct mm_walk *args)
|
|
+{
|
|
+ int i;
|
|
+ pmd_t *pmd;
|
|
+ unsigned long next;
|
|
+ unsigned long addr;
|
|
+ struct vm_area_struct *vma;
|
|
+ unsigned long pos = -1;
|
|
+ struct lru_gen_mm_walk *walk = args->private;
|
|
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
|
|
+
|
|
+ VM_WARN_ON_ONCE(pud_leaf(*pud));
|
|
+
|
|
+ /*
|
|
+ * Finish an entire PMD in two passes: the first only reaches to PTE
|
|
+ * tables to avoid taking the PMD lock; the second, if necessary, takes
|
|
+ * the PMD lock to clear the accessed bit in PMD entries.
|
|
+ */
|
|
+ pmd = pmd_offset(pud, start & PUD_MASK);
|
|
+restart:
|
|
+ /* walk_pte_range() may call get_next_vma() */
|
|
+ vma = args->vma;
|
|
+ for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
|
|
+ pmd_t val = pmd_read_atomic(pmd + i);
|
|
+
|
|
+ /* for pmd_read_atomic() */
|
|
+ barrier();
|
|
+
|
|
+ next = pmd_addr_end(addr, end);
|
|
+
|
|
+ if (!pmd_present(val) || is_huge_zero_pmd(val)) {
|
|
+ walk->mm_stats[MM_LEAF_TOTAL]++;
|
|
+ continue;
|
|
+ }
|
|
+
|
|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
+ if (pmd_trans_huge(val)) {
|
|
+ unsigned long pfn = pmd_pfn(val);
|
|
+ struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
|
|
+
|
|
+ walk->mm_stats[MM_LEAF_TOTAL]++;
|
|
+
|
|
+ if (!pmd_young(val)) {
|
|
+ walk->mm_stats[MM_LEAF_OLD]++;
|
|
+ continue;
|
|
+ }
|
|
+
|
|
+ /* try to avoid unnecessary memory loads */
|
|
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
|
|
+ continue;
|
|
+
|
|
+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos);
|
|
+ continue;
|
|
+ }
|
|
+#endif
|
|
+ walk->mm_stats[MM_NONLEAF_TOTAL]++;
|
|
+
|
|
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
|
|
+ if (!pmd_young(val))
|
|
+ continue;
|
|
+
|
|
+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos);
|
|
+#endif
|
|
+ if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i))
|
|
+ continue;
|
|
+
|
|
+ walk->mm_stats[MM_NONLEAF_FOUND]++;
|
|
+
|
|
+ if (!walk_pte_range(&val, addr, next, args))
|
|
+ continue;
|
|
+
|
|
+ walk->mm_stats[MM_NONLEAF_ADDED]++;
|
|
+
|
|
+ /* carry over to the next generation */
|
|
+ update_bloom_filter(walk->lruvec, walk->max_seq + 1, pmd + i);
|
|
+ }
|
|
+
|
|
+ walk_pmd_range_locked(pud, -1, vma, args, bitmap, &pos);
|
|
+
|
|
+ if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
|
|
+ goto restart;
|
|
+}
|
|
+
|
|
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
|
|
+ struct mm_walk *args)
|
|
+{
|
|
+ int i;
|
|
+ pud_t *pud;
|
|
+ unsigned long addr;
|
|
+ unsigned long next;
|
|
+ struct lru_gen_mm_walk *walk = args->private;
|
|
+
|
|
+ VM_WARN_ON_ONCE(p4d_leaf(*p4d));
|
|
+
|
|
+ pud = pud_offset(p4d, start & P4D_MASK);
|
|
+restart:
|
|
+ for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
|
|
+ pud_t val = READ_ONCE(pud[i]);
|
|
+
|
|
+ next = pud_addr_end(addr, end);
|
|
+
|
|
+ if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
|
|
+ continue;
|
|
+
|
|
+ walk_pmd_range(&val, addr, next, args);
|
|
+
|
|
+ if (walk->batched >= MAX_LRU_BATCH) {
|
|
+ end = (addr | ~PUD_MASK) + 1;
|
|
+ goto done;
|
|
+ }
|
|
+ }
|
|
+
|
|
+ if (i < PTRS_PER_PUD && get_next_vma(P4D_MASK, PUD_SIZE, args, &start, &end))
|
|
+ goto restart;
|
|
+
|
|
+ end = round_up(end, P4D_SIZE);
|
|
+done:
|
|
+ if (!end || !args->vma)
|
|
+ return 1;
|
|
+
|
|
+ walk->next_addr = max(end, args->vma->vm_start);
|
|
+
|
|
+ return -EAGAIN;
|
|
+}
|
|
+
|
|
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
|
|
+{
|
|
+ static const struct mm_walk_ops mm_walk_ops = {
|
|
+ .test_walk = should_skip_vma,
|
|
+ .p4d_entry = walk_pud_range,
|
|
+ };
|
|
+
|
|
+ int err;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+
|
|
+ walk->next_addr = FIRST_USER_ADDRESS;
|
|
+
|
|
+ do {
|
|
+ err = -EBUSY;
|
|
+
|
|
+ /* folio_update_gen() requires stable folio_memcg() */
|
|
+ if (!mem_cgroup_trylock_pages(memcg))
|
|
+ break;
|
|
+
|
|
+ /* the caller might be holding the lock for write */
|
|
+ if (mmap_read_trylock(mm)) {
|
|
+ err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
|
|
+
|
|
+ mmap_read_unlock(mm);
|
|
+ }
|
|
+
|
|
+ mem_cgroup_unlock_pages();
|
|
+
|
|
+ if (walk->batched) {
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+ reset_batch_size(lruvec, walk);
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+ }
|
|
+
|
|
+ cond_resched();
|
|
+ } while (err == -EAGAIN);
|
|
+}
|
|
+
|
|
+static struct lru_gen_mm_walk *set_mm_walk(struct pglist_data *pgdat)
|
|
+{
|
|
+ struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
|
|
+
|
|
+ if (pgdat && current_is_kswapd()) {
|
|
+ VM_WARN_ON_ONCE(walk);
|
|
+
|
|
+ walk = &pgdat->mm_walk;
|
|
+ } else if (!pgdat && !walk) {
|
|
+ VM_WARN_ON_ONCE(current_is_kswapd());
|
|
+
|
|
+ walk = kzalloc(sizeof(*walk), __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
|
|
+ }
|
|
+
|
|
+ current->reclaim_state->mm_walk = walk;
|
|
+
|
|
+ return walk;
|
|
+}
|
|
+
|
|
+static void clear_mm_walk(void)
|
|
+{
|
|
+ struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
|
|
+
|
|
+ VM_WARN_ON_ONCE(walk && memchr_inv(walk->nr_pages, 0, sizeof(walk->nr_pages)));
|
|
+ VM_WARN_ON_ONCE(walk && memchr_inv(walk->mm_stats, 0, sizeof(walk->mm_stats)));
|
|
+
|
|
+ current->reclaim_state->mm_walk = NULL;
|
|
+
|
|
+ if (!current_is_kswapd())
|
|
+ kfree(walk);
|
|
+}
|
|
+
|
|
static void inc_min_seq(struct lruvec *lruvec, int type)
|
|
{
|
|
struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
@@ -3307,7 +4169,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
|
|
return success;
|
|
}
|
|
|
|
-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap)
|
|
+static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
|
|
{
|
|
int prev, next;
|
|
int type, zone;
|
|
@@ -3317,9 +4179,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s
|
|
|
|
VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
|
|
|
|
- if (max_seq != lrugen->max_seq)
|
|
- goto unlock;
|
|
-
|
|
for (type = 0; type < ANON_AND_FILE; type++) {
|
|
if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
|
|
continue;
|
|
@@ -3357,10 +4216,76 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s
|
|
|
|
/* make sure preceding modifications appear */
|
|
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
|
|
-unlock:
|
|
+
|
|
spin_unlock_irq(&lruvec->lru_lock);
|
|
}
|
|
|
|
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
|
|
+ struct scan_control *sc, bool can_swap)
|
|
+{
|
|
+ bool success;
|
|
+ struct lru_gen_mm_walk *walk;
|
|
+ struct mm_struct *mm = NULL;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
|
|
+
|
|
+ /* see the comment in iterate_mm_list() */
|
|
+ if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
|
|
+ success = false;
|
|
+ goto done;
|
|
+ }
|
|
+
|
|
+ /*
|
|
+ * If the hardware doesn't automatically set the accessed bit, fallback
|
|
+ * to lru_gen_look_around(), which only clears the accessed bit in a
|
|
+ * handful of PTEs. Spreading the work out over a period of time usually
|
|
+ * is less efficient, but it avoids bursty page faults.
|
|
+ */
|
|
+ if (!arch_has_hw_pte_young()) {
|
|
+ success = iterate_mm_list_nowalk(lruvec, max_seq);
|
|
+ goto done;
|
|
+ }
|
|
+
|
|
+ walk = set_mm_walk(NULL);
|
|
+ if (!walk) {
|
|
+ success = iterate_mm_list_nowalk(lruvec, max_seq);
|
|
+ goto done;
|
|
+ }
|
|
+
|
|
+ walk->lruvec = lruvec;
|
|
+ walk->max_seq = max_seq;
|
|
+ walk->can_swap = can_swap;
|
|
+ walk->force_scan = false;
|
|
+
|
|
+ do {
|
|
+ success = iterate_mm_list(lruvec, walk, &mm);
|
|
+ if (mm)
|
|
+ walk_mm(lruvec, mm, walk);
|
|
+
|
|
+ cond_resched();
|
|
+ } while (mm);
|
|
+done:
|
|
+ if (!success) {
|
|
+ if (!current_is_kswapd() && !sc->priority)
|
|
+ wait_event_killable(lruvec->mm_state.wait,
|
|
+ max_seq < READ_ONCE(lrugen->max_seq));
|
|
+
|
|
+ return max_seq < READ_ONCE(lrugen->max_seq);
|
|
+ }
|
|
+
|
|
+ VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq));
|
|
+
|
|
+ inc_max_seq(lruvec, can_swap);
|
|
+ /* either this sees any waiters or they will see updated max_seq */
|
|
+ if (wq_has_sleeper(&lruvec->mm_state.wait))
|
|
+ wake_up_all(&lruvec->mm_state.wait);
|
|
+
|
|
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
static unsigned long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
|
|
unsigned long *min_seq, bool can_swap, bool *need_aging)
|
|
{
|
|
@@ -3438,7 +4363,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
nr_to_scan >>= mem_cgroup_online(memcg) ? sc->priority : 0;
|
|
|
|
if (nr_to_scan && need_aging)
|
|
- inc_max_seq(lruvec, max_seq, swappiness);
|
|
+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
|
|
}
|
|
|
|
static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
@@ -3447,6 +4372,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
|
|
VM_WARN_ON_ONCE(!current_is_kswapd());
|
|
|
|
+ set_mm_walk(pgdat);
|
|
+
|
|
memcg = mem_cgroup_iter(NULL, NULL, NULL);
|
|
do {
|
|
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
|
@@ -3455,11 +4382,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
|
|
cond_resched();
|
|
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
|
|
+
|
|
+ clear_mm_walk();
|
|
}
|
|
|
|
/*
|
|
* This function exploits spatial locality when shrink_page_list() walks the
|
|
- * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
|
|
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
|
|
+ * the scan was done cacheline efficiently, it adds the PMD entry pointing to
|
|
+ * the PTE table to the Bloom filter. This forms a feedback loop between the
|
|
+ * eviction and the aging.
|
|
*/
|
|
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
{
|
|
@@ -3468,6 +4400,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
unsigned long start;
|
|
unsigned long end;
|
|
unsigned long addr;
|
|
+ struct lru_gen_mm_walk *walk;
|
|
+ int young = 0;
|
|
unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
|
|
struct folio *folio = pfn_folio(pvmw->pfn);
|
|
struct mem_cgroup *memcg = folio_memcg(folio);
|
|
@@ -3497,6 +4431,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
}
|
|
|
|
pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
|
|
+ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
|
|
|
|
rcu_read_lock();
|
|
arch_enter_lazy_mmu_mode();
|
|
@@ -3511,13 +4446,15 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
if (!pte_young(pte[i]))
|
|
continue;
|
|
|
|
- folio = get_pfn_folio(pfn, memcg, pgdat);
|
|
+ folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
|
|
if (!folio)
|
|
continue;
|
|
|
|
if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
|
|
continue;
|
|
|
|
+ young++;
|
|
+
|
|
if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
|
|
!(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
|
|
!folio_test_swapcache(folio)))
|
|
@@ -3533,7 +4470,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
arch_leave_lazy_mmu_mode();
|
|
rcu_read_unlock();
|
|
|
|
- if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
|
|
+ /* feedback from rmap walkers to page table walkers */
|
|
+ if (suitable_to_scan(i, young))
|
|
+ update_bloom_filter(lruvec, max_seq, pvmw->pmd);
|
|
+
|
|
+ if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
|
|
for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
|
|
folio = pfn_folio(pte_pfn(pte[i]));
|
|
folio_activate(folio);
|
|
@@ -3545,8 +4486,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
if (!mem_cgroup_trylock_pages(memcg))
|
|
return;
|
|
|
|
- spin_lock_irq(&lruvec->lru_lock);
|
|
- new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
|
|
+ if (!walk) {
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+ new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
|
|
+ }
|
|
|
|
for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
|
|
folio = pfn_folio(pte_pfn(pte[i]));
|
|
@@ -3557,10 +4500,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
|
|
if (old_gen < 0 || old_gen == new_gen)
|
|
continue;
|
|
|
|
- lru_gen_update_size(lruvec, folio, old_gen, new_gen);
|
|
+ if (walk)
|
|
+ update_batch_size(walk, folio, old_gen, new_gen);
|
|
+ else
|
|
+ lru_gen_update_size(lruvec, folio, old_gen, new_gen);
|
|
}
|
|
|
|
- spin_unlock_irq(&lruvec->lru_lock);
|
|
+ if (!walk)
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
|
|
mem_cgroup_unlock_pages();
|
|
}
|
|
@@ -3843,6 +4790,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
|
|
struct folio *folio;
|
|
enum vm_event_item item;
|
|
struct reclaim_stat stat;
|
|
+ struct lru_gen_mm_walk *walk;
|
|
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
|
|
|
|
@@ -3879,6 +4827,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
|
|
|
|
move_pages_to_lru(lruvec, &list);
|
|
|
|
+ walk = current->reclaim_state->mm_walk;
|
|
+ if (walk && walk->batched)
|
|
+ reset_batch_size(lruvec, walk);
|
|
+
|
|
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
|
|
if (!cgroup_reclaim(sc))
|
|
__count_vm_events(item, reclaimed);
|
|
@@ -3936,7 +4888,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
|
|
if (current_is_kswapd())
|
|
return 0;
|
|
|
|
- inc_max_seq(lruvec, max_seq, can_swap);
|
|
+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap))
|
|
+ return nr_to_scan;
|
|
done:
|
|
return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
|
|
}
|
|
@@ -3951,6 +4904,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
|
|
|
|
blk_start_plug(&plug);
|
|
|
|
+ set_mm_walk(lruvec_pgdat(lruvec));
|
|
+
|
|
while (true) {
|
|
int delta;
|
|
int swappiness;
|
|
@@ -3978,6 +4933,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
|
|
cond_resched();
|
|
}
|
|
|
|
+ clear_mm_walk();
|
|
+
|
|
blk_finish_plug(&plug);
|
|
}
|
|
|
|
@@ -3994,15 +4951,21 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
|
|
|
|
for_each_gen_type_zone(gen, type, zone)
|
|
INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
|
|
+
|
|
+ lruvec->mm_state.seq = MIN_NR_GENS;
|
|
+ init_waitqueue_head(&lruvec->mm_state.wait);
|
|
}
|
|
|
|
#ifdef CONFIG_MEMCG
|
|
void lru_gen_init_memcg(struct mem_cgroup *memcg)
|
|
{
|
|
+ INIT_LIST_HEAD(&memcg->mm_list.fifo);
|
|
+ spin_lock_init(&memcg->mm_list.lock);
|
|
}
|
|
|
|
void lru_gen_exit_memcg(struct mem_cgroup *memcg)
|
|
{
|
|
+ int i;
|
|
int nid;
|
|
|
|
for_each_node(nid) {
|
|
@@ -4010,6 +4973,11 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
|
|
|
|
VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
|
|
sizeof(lruvec->lrugen.nr_pages)));
|
|
+
|
|
+ for (i = 0; i < NR_BLOOM_FILTERS; i++) {
|
|
+ bitmap_free(lruvec->mm_state.filters[i]);
|
|
+ lruvec->mm_state.filters[i] = NULL;
|
|
+ }
|
|
}
|
|
}
|
|
#endif
|
|
|
|
From patchwork Wed Jul 6 22:00:18 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908747
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id F042EC433EF
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 22:25:03 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=tOld0mycu7Btf63ByyZvx/Z0eAcW0k4TTCwG8saV5Is=; b=WZ0z5mNFTsK7hRYJL/tCGGNA2R
|
|
iiznQtlbgg8vLE61AzG+y1WOAL8n4oAZaDRxzs7VG7T/XKSlnSIXj2zfskgR16i9+sx+SmMrTshj3
|
|
Rhjm+IRJTRR51xdqhBqyHYd9gEZY6d+nFfuSWzik8lZjsa7kgq2U28rXqc3iBYRKo64G2hQkr1Fh2
|
|
0AAIblG8sUW91n9klrycAut5Kn+W2H8PrBW1fMV+6LmXbTlLa38BUCauswJqNQXspcogqgL5QxDy9
|
|
AnrXpXmaC3WcGCZaWItqtaNf3IbcI+uBOqIaAvE/JlW0AXERLFg/86HqXddt44qdFmQmq11/vfgJr
|
|
/LhEzUUA==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DQx-00CaKQ-7t; Wed, 06 Jul 2022 22:23:47 +0000
|
|
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPt-00CZxO-A3
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:22:41 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
|
|
Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:Sender
|
|
:Reply-To:Content-ID:Content-Description;
|
|
bh=ZS7i+zWbsuFYZiRlenI/F/Y7PzZj3Cv3ABmuogIV+d0=; b=TeAh1eIZ03w6HrNkTRlz9qeQMr
|
|
vjtPWcTiEX5m7VQpC2G8vHYID0KM123xTb+3PPWyWmjSLmi0qJNX/0Jy/rOobi4BpgGBtmqVIFog7
|
|
knes6t5A8tnUvgxsNychZrfNzSR19niYLR9GAB99rqI9DXXnuZ+fg76tfMWeFMG2OxQPqE27YV+V7
|
|
OFJWjWOmK1ePwmkg25C0umCQgtjhfTU+xwuBPXcyeLvgSGRMN6lMjmKPqCc5EKr2n22riO+UKHnkq
|
|
y2ctRxFyq08boHwyErmsC+2N/WKWW054aLRYpWaUpGTFOvkwALng4wVazpw8NZEW3FTM4hGltvGWd
|
|
tksBRiAw==;
|
|
Received: from mail-yw1-x114a.google.com ([2607:f8b0:4864:20::114a])
|
|
by casper.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D54-0021TC-Ht
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:12 +0000
|
|
Received: by mail-yw1-x114a.google.com with SMTP id
|
|
00721157ae682-31c858e18c8so80044267b3.4
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:09 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=ZS7i+zWbsuFYZiRlenI/F/Y7PzZj3Cv3ABmuogIV+d0=;
|
|
b=Is9nnwDLdoF8cmdhQhl8FEZEIPLZOTCQNPziPrZ3WCv4Hkh+8SM7Qirn2/JzlJe5Qt
|
|
IMzoKhGVVu62zPGO2f8uqvwVO7ZBpwGEu3Y0nx+xsR+UR6rSMs9BgDYfSl6hxumhEzXQ
|
|
AVU29P45SCq1drQE+AuDu2NsKyQ+R9NLi2XNN7GjQzGIS59mnKnciabxZ70kUwocqXEh
|
|
TsuagDSQmmH5SjPkOzOUNm6Sk8f3JEhf7X8a1bPpbg+ozA3KspzkTBjkMrHomLe9ffcm
|
|
BFgwNEyH9XBgnj0m4gnfT2SYRWWY1k3MsXJMQ+zIJmqc6vDRB4WpYW/qGMJadOFCZfMM
|
|
nXgA==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=ZS7i+zWbsuFYZiRlenI/F/Y7PzZj3Cv3ABmuogIV+d0=;
|
|
b=m5GMfpcTgXdZv0bvOBFOaQ5/3zkN4nLnJ2JXpxvwe+7VsjltyOTJbO4+xVuYq/aLyR
|
|
K2h6SW8P0jwd6uMtnxMi6OaUGHMtqzg+fssK04A/eEyzhTtfvvb8oIQfUQhfCMOK84rC
|
|
Ug3Mh/5TYlvXawV8msprDi7+5cmJ6V1YMiIfRwdfDEBZlKMfrxPqa4iHtJ0zM7WvL6GP
|
|
ny3n/eqmmBenMRf57/T0qjyxf1iCcY0mMSJpCt7EemKa67Nm7b4+IboRKlU3xCCInkcO
|
|
jX2AyNmVY4Ycd9QL+fXratviYp8zhj2H+x0kpbsX01ml+n7zP2ST5UNyUqN/YklJh8eP
|
|
EWpg==
|
|
X-Gm-Message-State: AJIora91IwfVwl9xAefQ4nVfpr0qkzxfwL8I866kZDLzQEcZJP6CVSie
|
|
sYu9K3RnrWDpjoU+PQ0L1oouffqAR/0=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1uwFcCEi/xhgZ0h4sIMbRvSkHm3hRssR6SxZ63hO5m3+xAhe4vRctYY9iQ6nmc/njn1u9BXzW/MiYM=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a25:abb3:0:b0:66e:2f9a:4201 with SMTP id
|
|
v48-20020a25abb3000000b0066e2f9a4201mr26479914ybi.125.1657144864829; Wed, 06
|
|
Jul 2022 15:01:04 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:18 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-10-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 09/14] mm: multi-gen LRU: optimize multiple memcgs
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230110_630694_3009F2C2
|
|
X-CRM114-Status: GOOD ( 17.04 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
|
|
|
|
When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping. These two decisions can be costly.

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., heavy anon workload in one
memcg and heavy buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.
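
For readers skimming the description, here is a minimal userspace sketch
(not part of the patch, compilable on its own) of the flag handshake this
optimization adds to struct scan_control in the diff below. Only the flag
transitions mirror the patch; the pass helpers and the main() scenario are
illustrative assumptions.

        /*
         * Sketch of the memcgs_need_* handshake: kswapd optimistically skips
         * the aging and swapping paths unless the previous eviction pass
         * cleared the corresponding flags.
         */
        #include <stdbool.h>
        #include <stdio.h>

        struct scan_flags {
                bool memcgs_need_aging;
                bool memcgs_need_swapping;
                bool memcgs_avoid_swapping;
        };

        /* models lru_gen_age_node(): decide whether this pass ages at all */
        static bool age_pass(struct scan_flags *sc)
        {
                if (!sc->memcgs_need_aging) {
                        /* eviction made progress last time; skip aging and re-arm */
                        sc->memcgs_need_aging = true;
                        sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
                        sc->memcgs_need_swapping = true;
                        return false;
                }
                sc->memcgs_need_swapping = true;
                sc->memcgs_avoid_swapping = true;
                return true;    /* walk memcgs and age them */
        }

        /* models the end of lru_gen_shrink_lruvec(): report what eviction saw */
        static void evict_pass(struct scan_flags *sc, bool needed_aging, bool swapped)
        {
                if (!needed_aging)
                        sc->memcgs_need_aging = false;
                if (!swapped)
                        sc->memcgs_need_swapping = false;
        }

        int main(void)
        {
                struct scan_flags sc = { .memcgs_need_aging = true };

                /* pass 1: eviction finds enough clean file pages, no swapping */
                printf("pass 1 ages memcgs: %d\n", age_pass(&sc));
                evict_pass(&sc, false, false);

                /* pass 2: aging is skipped and swapping is avoided optimistically */
                printf("pass 2 ages memcgs: %d\n", age_pass(&sc));
                printf("pass 2 avoids swapping: %d\n", sc.memcgs_avoid_swapping);
                return 0;
        }

The point of the handshake is that kswapd pays for aging or swapping only
after an eviction pass has reported that it could not get by without them.
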
Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): +[19, 21]%
                IOPS         BW
      patch1-8: 1880k        7343MiB/s
      patch1-9: 2252k        8796MiB/s

    memcached (anon): +[119, 123]%
                Ops/sec      KB/sec
      patch1-8: 862768.65    33514.68
      patch1-9: 1911022.12   74234.54

  Mixed workloads:
    fio (buffered I/O): +[75, 77]%
                IOPS         BW
      5.19-rc1: 1279k        4996MiB/s
      patch1-9: 2252k        8796MiB/s

    memcached (anon): +[13, 15]%
                Ops/sec      KB/sec
      5.19-rc1: 1673524.04   65008.87
      patch1-9: 1911022.12   74234.54

Configurations:
  (changes since patch 6)

  cat mixed.sh
  modprobe brd rd_nr=2 rd_size=56623104

  swapoff -a
  mkswap /dev/ram0
  swapon /dev/ram0

  mkfs.ext4 /dev/ram1
  mount -t ext4 /dev/ram1 /mnt

  memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
    --ratio 1:0 --pipeline 8 -d 2000

  fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=10m --runtime=90m --group_reporting &
  pid=$!

  sleep 200

  memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
    --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

  kill -INT $pid
  wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
---
|
|
mm/vmscan.c | 55 ++++++++++++++++++++++++++++++++++++++++++++---------
|
|
1 file changed, 46 insertions(+), 9 deletions(-)
|
|
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index 8e55a1ce1ae0..f469a2740835 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -129,6 +129,13 @@ struct scan_control {
|
|
/* Always discard instead of demoting to lower tier memory */
|
|
unsigned int no_demotion:1;
|
|
|
|
+#ifdef CONFIG_LRU_GEN
|
|
+ /* help make better choices when multiple memcgs are available */
|
|
+ unsigned int memcgs_need_aging:1;
|
|
+ unsigned int memcgs_need_swapping:1;
|
|
+ unsigned int memcgs_avoid_swapping:1;
|
|
+#endif
|
|
+
|
|
/* Allocation order */
|
|
s8 order;
|
|
|
|
@@ -4372,6 +4379,22 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
|
|
VM_WARN_ON_ONCE(!current_is_kswapd());
|
|
|
|
+ /*
|
|
+ * To reduce the chance of going into the aging path or swapping, which
|
|
+ * can be costly, optimistically skip them unless their corresponding
|
|
+ * flags were cleared in the eviction path. This improves the overall
|
|
+ * performance when multiple memcgs are available.
|
|
+ */
|
|
+ if (!sc->memcgs_need_aging) {
|
|
+ sc->memcgs_need_aging = true;
|
|
+ sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
|
|
+ sc->memcgs_need_swapping = true;
|
|
+ return;
|
|
+ }
|
|
+
|
|
+ sc->memcgs_need_swapping = true;
|
|
+ sc->memcgs_avoid_swapping = true;
|
|
+
|
|
set_mm_walk(pgdat);
|
|
|
|
memcg = mem_cgroup_iter(NULL, NULL, NULL);
|
|
@@ -4781,7 +4804,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
|
|
return scanned;
|
|
}
|
|
|
|
-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
|
|
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
|
|
+ bool *need_swapping)
|
|
{
|
|
int type;
|
|
int scanned;
|
|
@@ -4844,14 +4868,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
|
|
|
|
sc->nr_reclaimed += reclaimed;
|
|
|
|
+ if (type == LRU_GEN_ANON && need_swapping)
|
|
+ *need_swapping = true;
|
|
+
|
|
return scanned;
|
|
}
|
|
|
|
static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
|
|
- bool can_swap, unsigned long reclaimed)
|
|
+ bool can_swap, unsigned long reclaimed, bool *need_aging)
|
|
{
|
|
int priority;
|
|
- bool need_aging;
|
|
unsigned long nr_to_scan;
|
|
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
DEFINE_MAX_SEQ(lruvec);
|
|
@@ -4861,7 +4887,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
|
|
(mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
|
|
return 0;
|
|
|
|
- nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
|
|
+ nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, need_aging);
|
|
if (!nr_to_scan)
|
|
return 0;
|
|
|
|
@@ -4877,7 +4903,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
|
|
if (!nr_to_scan)
|
|
return 0;
|
|
|
|
- if (!need_aging)
|
|
+ if (!*need_aging)
|
|
return nr_to_scan;
|
|
|
|
/* skip the aging path at the default priority */
|
|
@@ -4897,6 +4923,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
|
|
static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
{
|
|
struct blk_plug plug;
|
|
+ bool need_aging = false;
|
|
+ bool need_swapping = false;
|
|
unsigned long scanned = 0;
|
|
unsigned long reclaimed = sc->nr_reclaimed;
|
|
|
|
@@ -4918,21 +4946,30 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
|
|
else
|
|
swappiness = 0;
|
|
|
|
- nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, reclaimed);
|
|
+ nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, reclaimed, &need_aging);
|
|
if (!nr_to_scan)
|
|
- break;
|
|
+ goto done;
|
|
|
|
- delta = evict_folios(lruvec, sc, swappiness);
|
|
+ delta = evict_folios(lruvec, sc, swappiness, &need_swapping);
|
|
if (!delta)
|
|
- break;
|
|
+ goto done;
|
|
|
|
scanned += delta;
|
|
if (scanned >= nr_to_scan)
|
|
break;
|
|
|
|
+ if (sc->memcgs_avoid_swapping && swappiness < 200 && need_swapping)
|
|
+ break;
|
|
+
|
|
cond_resched();
|
|
}
|
|
|
|
+ /* see the comment in lru_gen_age_node() */
|
|
+ if (!need_aging)
|
|
+ sc->memcgs_need_aging = false;
|
|
+ if (!need_swapping)
|
|
+ sc->memcgs_need_swapping = false;
|
|
+done:
|
|
clear_mm_walk();
|
|
|
|
blk_finish_plug(&plug);
|
|
|
|
From patchwork Wed Jul 6 22:00:19 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908744
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id 00BDCC43334
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 22:24:13 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=oT8I/wh31nIjGoMtUNgMYpt4+aQqf3FV7NOEFIyrojo=; b=t3xaS3tp93CE9yFazHhWLhcOfr
|
|
xQjt0gn1PLodED28Dq7newQV9JGh/Xq45yc0iG+/RAVmiI22mp6PZVcNK1aDYDtg5kG/RoWGG9jyh
|
|
aC0SjX3mT/z/lmr0cz6Qa9b6dnSj1YZhA4Nkd9an8X+Hg6mYuEsmTV3pO4GhbuxnVz0M8ejZniacB
|
|
D8+01cOT/7liS6H9dC3AsX8fX/qsr2K5gHBZgtZxjSR8qttLgW90m/V/93neVCrD7b469gCTpbYz7
|
|
PLJ770qcZVswnc+n1TRwMOToF3EfDWQICi7e9fiApnb+hpPxmmR5ZoiqEyOdMlaTIrNksgQck2o2o
|
|
0CXp0cwQ==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DQI-00Ca53-3F; Wed, 06 Jul 2022 22:23:06 +0000
|
|
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPs-00CZxO-7b
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:22:40 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
|
|
Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:Sender
|
|
:Reply-To:Content-ID:Content-Description;
|
|
bh=t3JqYbFJT9lP6E96sRUzmCzQEu0iJg+mfU6dciROW6I=; b=U1gyEy98aPvujY0pmAgcic+k5p
|
|
Pha5K2wpAouo3fQgrvtwa26UFiInOSjZIIvdZex6IIxsatGU4ezNfMf7f3ANrg52OExrjjqpwWWLo
|
|
irqxwa8VrdlMenRjEyXmsumJzfNyUfyhb0FqiZfatfHvzyfg1a1m7EBlXuT/9+9o9iPMw1N9DqcJy
|
|
z19r1qPLRMco2ULEBqBsrks4eNid397ZCiOmNh1+zPf6UBhVHVxzIDiMxLuNxZCmESTLGPQW9Uvrf
|
|
pmEmnSdRtKtyD3zBYBhV5/Vp70N7kugtlrCVLVjQlcoND17+ftX4wcJXZpT+xFZfkqKEV/qupnHYK
|
|
YQuY9YJg==;
|
|
Received: from mail-yw1-x1149.google.com ([2607:f8b0:4864:20::1149])
|
|
by casper.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D57-0021TD-Cj
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:17 +0000
|
|
Received: by mail-yw1-x1149.google.com with SMTP id
|
|
00721157ae682-31c8a5d51adso72236937b3.14
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:09 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=t3JqYbFJT9lP6E96sRUzmCzQEu0iJg+mfU6dciROW6I=;
|
|
b=K/nKIb14JmIaSQ25G+voEr3Xu6sFBToolWxLX2DrPdbxAa6BpfoEW4/5621Rzsff4D
|
|
1k3G9tp+5ESbNVZCZfqietdtMt6OTAchdy14TXI4WTiTZLglVlIfr80zpxGfIGcphLBv
|
|
c2R6icWOjZ0upEVkivTfwH9rKBl233YFlYCWfHzoiU07eBFA2yPOzHZx49n6UFl3tbHt
|
|
eSai05q6oFPAPMqEwWKLLg5e2ewTiqoowbahH4nTTyw69dIDZhmip41HFaA0/Sczzyq3
|
|
JDic9dSJ+BDTRQ6TaWU0nw7eqP8mi+/sxNdfATpIluPgr0W9A0QZ1JCn1D9q09woZwV/
|
|
PFjA==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=t3JqYbFJT9lP6E96sRUzmCzQEu0iJg+mfU6dciROW6I=;
|
|
b=dLnGTIHf9lH3Hl4B46mWBDrCtFFWhd+6Hzir7XLYcQ2XDRsArWl3Rf9h/pVLfeSAoe
|
|
yaAL5OeultJkkVwSugm3FtULkZH1erIvJA5xM83gfRruRuxN/481Z7xcZVchIYxv8b3T
|
|
QVIx2hMkpPkpkbB/7OREDtY3oh/dc9M8ZUaGrJutAcqkZjvNSAymyZmUbN255ZDCSiZH
|
|
6L0AoCgZ35yh7JNPkzGe9+UNN9aSdh1PcynTezmIyi9nvHgxS+YZjG9gCOYj4lVEpklj
|
|
b/0pI0uZd7sJdipHJK45lcS0o/y/ngEGG+rBbCu+QekMvwR1IeBsn3nXxsQKN8oLuDOg
|
|
QsBQ==
|
|
X-Gm-Message-State: AJIora/rGlUf3M9+99VcXzAYV4ogUvnysuD0jBJtozkq82GpvjAKRK4F
|
|
8Y5Hwx7XrW9fmo3n2LTWIll2irGs6ak=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1uCmWNAcyv7l4c+bwlvsNWjdcmS50NXK/ousi79Gs9bHWyAObimB3RXzG41nJY/wFbH1TL7Js/68Zk=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a25:e74e:0:b0:66e:32d4:1f0 with SMTP id
|
|
e75-20020a25e74e000000b0066e32d401f0mr24265460ybh.421.1657144866511; Wed, 06
|
|
Jul 2022 15:01:06 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:19 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-11-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 10/14] mm: multi-gen LRU: kill switch
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230113_462895_844458BD
|
|
X-CRM114-Status: GOOD ( 25.01 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
|
|
|
|
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page tables, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005
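
As a reading aid, a minimal userspace sketch (not kernel code) of how the
bitmask above decodes into the three components. The enum mirrors the
LRU_GEN_* capabilities added to include/linux/mmzone.h in the diff below;
the decode helper and main() are illustrative assumptions.

        #include <stdio.h>

        enum {
                LRU_GEN_CORE,           /* 0x0001 */
                LRU_GEN_MM_WALK,        /* 0x0002 */
                LRU_GEN_NONLEAF_YOUNG,  /* 0x0004 */
                NR_LRU_GEN_CAPS
        };

        static const char *cap_names[NR_LRU_GEN_CAPS] = {
                "multi-gen LRU core",
                "page table walks",
                "non-leaf PMD accessed bit",
        };

        /* print which components a given caps value enables */
        static void decode_caps(unsigned int caps)
        {
                int i;

                printf("0x%04x:\n", caps);
                for (i = 0; i < NR_LRU_GEN_CAPS; i++)
                        printf("  %-26s %s\n", cap_names[i],
                               caps & (1u << i) ? "enabled" : "disabled");
        }

        int main(void)
        {
                decode_caps(0x0007);    /* echo y: all components on */
                decode_caps(0x0005);    /* echo 5: core + non-leaf PMD, no walks */
                return 0;
        }
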
NB: the page table walks happen on the scale of seconds under heavy
memory pressure, in which case the mmap_lock contention is a lesser
concern, compared with the LRU lock contention and the I/O congestion.
So far the only well-known case of the mmap_lock contention happens on
Android, due to Scudo [1], which allocates several thousand VMAs for
merely a few hundred MBs. The SPF and the Maple Tree have also
provided their own assessments [2][3]. However, if walking page tables
does worsen the mmap_lock contention, the kill switch can be used to
disable it. In this case the multi-gen LRU will suffer a minor
performance degradation, as shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
---
|
|
include/linux/cgroup.h | 15 ++-
|
|
include/linux/mm_inline.h | 15 ++-
|
|
include/linux/mmzone.h | 9 ++
|
|
kernel/cgroup/cgroup-internal.h | 1 -
|
|
mm/Kconfig | 6 +
|
|
mm/vmscan.c | 231 +++++++++++++++++++++++++++++++-
|
|
6 files changed, 268 insertions(+), 9 deletions(-)
|
|
|
|
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
|
|
index 0d1ada8968d7..1bc0cabf993f 100644
|
|
--- a/include/linux/cgroup.h
|
|
+++ b/include/linux/cgroup.h
|
|
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
|
|
css_put(&cgrp->self);
|
|
}
|
|
|
|
+extern struct mutex cgroup_mutex;
|
|
+
|
|
+static inline void cgroup_lock(void)
|
|
+{
|
|
+ mutex_lock(&cgroup_mutex);
|
|
+}
|
|
+
|
|
+static inline void cgroup_unlock(void)
|
|
+{
|
|
+ mutex_unlock(&cgroup_mutex);
|
|
+}
|
|
+
|
|
/**
|
|
* task_css_set_check - obtain a task's css_set with extra access conditions
|
|
* @task: the task to obtain css_set for
|
|
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
|
|
* as locks used during the cgroup_subsys::attach() methods.
|
|
*/
|
|
#ifdef CONFIG_PROVE_RCU
|
|
-extern struct mutex cgroup_mutex;
|
|
extern spinlock_t css_set_lock;
|
|
#define task_css_set_check(task, __c) \
|
|
rcu_dereference_check((task)->cgroups, \
|
|
@@ -708,6 +719,8 @@ struct cgroup;
|
|
static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
|
|
static inline void css_get(struct cgroup_subsys_state *css) {}
|
|
static inline void css_put(struct cgroup_subsys_state *css) {}
|
|
+static inline void cgroup_lock(void) {}
|
|
+static inline void cgroup_unlock(void) {}
|
|
static inline int cgroup_attach_task_all(struct task_struct *from,
|
|
struct task_struct *t) { return 0; }
|
|
static inline int cgroupstats_build(struct cgroupstats *stats,
|
|
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
|
|
index f2b2296a42f9..4949eda9a9a2 100644
|
|
--- a/include/linux/mm_inline.h
|
|
+++ b/include/linux/mm_inline.h
|
|
@@ -106,10 +106,21 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
|
|
|
|
#ifdef CONFIG_LRU_GEN
|
|
|
|
+#ifdef CONFIG_LRU_GEN_ENABLED
|
|
static inline bool lru_gen_enabled(void)
|
|
{
|
|
- return true;
|
|
+ DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]);
|
|
+
|
|
+ return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]);
|
|
}
|
|
+#else
|
|
+static inline bool lru_gen_enabled(void)
|
|
+{
|
|
+ DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]);
|
|
+
|
|
+ return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]);
|
|
+}
|
|
+#endif
|
|
|
|
static inline bool lru_gen_in_fault(void)
|
|
{
|
|
@@ -222,7 +233,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
|
|
|
|
VM_WARN_ON_ONCE_FOLIO(gen != -1, folio);
|
|
|
|
- if (folio_test_unevictable(folio))
|
|
+ if (folio_test_unevictable(folio) || !lrugen->enabled)
|
|
return false;
|
|
/*
|
|
* There are three common cases for this page:
|
|
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
|
|
index 0cf0856b484a..840b7ca8b91f 100644
|
|
--- a/include/linux/mmzone.h
|
|
+++ b/include/linux/mmzone.h
|
|
@@ -384,6 +384,13 @@ enum {
|
|
LRU_GEN_FILE,
|
|
};
|
|
|
|
+enum {
|
|
+ LRU_GEN_CORE,
|
|
+ LRU_GEN_MM_WALK,
|
|
+ LRU_GEN_NONLEAF_YOUNG,
|
|
+ NR_LRU_GEN_CAPS
|
|
+};
|
|
+
|
|
#define MIN_LRU_BATCH BITS_PER_LONG
|
|
#define MAX_LRU_BATCH (MIN_LRU_BATCH * 128)
|
|
|
|
@@ -425,6 +432,8 @@ struct lru_gen_struct {
|
|
/* can be modified without holding the LRU lock */
|
|
atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
|
|
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
|
|
+ /* whether the multi-gen LRU is enabled */
|
|
+ bool enabled;
|
|
};
|
|
|
|
enum {
|
|
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
|
|
index 5da09c74228d..c966e55cab29 100644
|
|
--- a/kernel/cgroup/cgroup-internal.h
|
|
+++ b/kernel/cgroup/cgroup-internal.h
|
|
@@ -164,7 +164,6 @@ struct cgroup_mgctx {
|
|
#define DEFINE_CGROUP_MGCTX(name) \
|
|
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
|
|
|
|
-extern struct mutex cgroup_mutex;
|
|
extern spinlock_t css_set_lock;
|
|
extern struct cgroup_subsys *cgroup_subsys[];
|
|
extern struct list_head cgroup_roots;
|
|
diff --git a/mm/Kconfig b/mm/Kconfig
|
|
index a93478acf341..0c2ef0af0036 100644
|
|
--- a/mm/Kconfig
|
|
+++ b/mm/Kconfig
|
|
@@ -1139,6 +1139,12 @@ config LRU_GEN
|
|
help
|
|
A high performance LRU implementation to overcommit memory.
|
|
|
|
+config LRU_GEN_ENABLED
|
|
+ bool "Enable by default"
|
|
+ depends on LRU_GEN
|
|
+ help
|
|
+ This option enables the multi-gen LRU by default.
|
|
+
|
|
config LRU_GEN_STATS
|
|
bool "Full stats for debugging"
|
|
depends on LRU_GEN
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index f469a2740835..4c8b475429ed 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -52,6 +52,7 @@
|
|
#include <linux/psi.h>
|
|
#include <linux/pagewalk.h>
|
|
#include <linux/shmem_fs.h>
|
|
+#include <linux/ctype.h>
|
|
|
|
#include <asm/tlbflush.h>
|
|
#include <asm/div64.h>
|
|
@@ -3013,6 +3014,14 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
|
|
|
|
#ifdef CONFIG_LRU_GEN
|
|
|
|
+#ifdef CONFIG_LRU_GEN_ENABLED
|
|
+DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
|
|
+#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap])
|
|
+#else
|
|
+DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
|
|
+#define get_cap(cap) static_branch_unlikely(&lru_gen_caps[cap])
|
|
+#endif
|
|
+
|
|
/******************************************************************************
|
|
* shorthand helpers
|
|
******************************************************************************/
|
|
@@ -3890,7 +3899,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area
|
|
goto next;
|
|
|
|
if (!pmd_trans_huge(pmd[i])) {
|
|
- if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
|
|
+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
|
|
+ get_cap(LRU_GEN_NONLEAF_YOUNG))
|
|
pmdp_test_and_clear_young(vma, addr, pmd + i);
|
|
goto next;
|
|
}
|
|
@@ -3988,10 +3998,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
|
|
walk->mm_stats[MM_NONLEAF_TOTAL]++;
|
|
|
|
#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
|
|
- if (!pmd_young(val))
|
|
- continue;
|
|
+ if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
|
|
+ if (!pmd_young(val))
|
|
+ continue;
|
|
|
|
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos);
|
|
+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos);
|
|
+ }
|
|
#endif
|
|
if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i))
|
|
continue;
|
|
@@ -4249,7 +4261,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
|
|
* handful of PTEs. Spreading the work out over a period of time usually
|
|
* is less efficient, but it avoids bursty page faults.
|
|
*/
|
|
- if (!arch_has_hw_pte_young()) {
|
|
+ if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) {
|
|
success = iterate_mm_list_nowalk(lruvec, max_seq);
|
|
goto done;
|
|
}
|
|
@@ -4975,6 +4987,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
|
|
blk_finish_plug(&plug);
|
|
}
|
|
|
|
+/******************************************************************************
|
|
+ * state change
|
|
+ ******************************************************************************/
|
|
+
|
|
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
|
|
+{
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ if (lrugen->enabled) {
|
|
+ enum lru_list lru;
|
|
+
|
|
+ for_each_evictable_lru(lru) {
|
|
+ if (!list_empty(&lruvec->lists[lru]))
|
|
+ return false;
|
|
+ }
|
|
+ } else {
|
|
+ int gen, type, zone;
|
|
+
|
|
+ for_each_gen_type_zone(gen, type, zone) {
|
|
+ if (!list_empty(&lrugen->lists[gen][type][zone]))
|
|
+ return false;
|
|
+
|
|
+ /* unlikely but not a bug when reset_batch_size() is pending */
|
|
+ VM_WARN_ON_ONCE(lrugen->nr_pages[gen][type][zone]);
|
|
+ }
|
|
+ }
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static bool fill_evictable(struct lruvec *lruvec)
|
|
+{
|
|
+ enum lru_list lru;
|
|
+ int remaining = MAX_LRU_BATCH;
|
|
+
|
|
+ for_each_evictable_lru(lru) {
|
|
+ int type = is_file_lru(lru);
|
|
+ bool active = is_active_lru(lru);
|
|
+ struct list_head *head = &lruvec->lists[lru];
|
|
+
|
|
+ while (!list_empty(head)) {
|
|
+ bool success;
|
|
+ struct folio *folio = lru_to_folio(head);
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio) != active, folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_lru_gen(folio) != -1, folio);
|
|
+
|
|
+ lruvec_del_folio(lruvec, folio);
|
|
+ success = lru_gen_add_folio(lruvec, folio, false);
|
|
+ VM_WARN_ON_ONCE(!success);
|
|
+
|
|
+ if (!--remaining)
|
|
+ return false;
|
|
+ }
|
|
+ }
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static bool drain_evictable(struct lruvec *lruvec)
|
|
+{
|
|
+ int gen, type, zone;
|
|
+ int remaining = MAX_LRU_BATCH;
|
|
+
|
|
+ for_each_gen_type_zone(gen, type, zone) {
|
|
+ struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
|
|
+
|
|
+ while (!list_empty(head)) {
|
|
+ bool success;
|
|
+ struct folio *folio = lru_to_folio(head);
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
|
|
+
|
|
+ success = lru_gen_del_folio(lruvec, folio, false);
|
|
+ VM_WARN_ON_ONCE(!success);
|
|
+ lruvec_add_folio(lruvec, folio);
|
|
+
|
|
+ if (!--remaining)
|
|
+ return false;
|
|
+ }
|
|
+ }
|
|
+
|
|
+ return true;
|
|
+}
|
|
+
|
|
+static void lru_gen_change_state(bool enabled)
|
|
+{
|
|
+ static DEFINE_MUTEX(state_mutex);
|
|
+
|
|
+ struct mem_cgroup *memcg;
|
|
+
|
|
+ cgroup_lock();
|
|
+ cpus_read_lock();
|
|
+ get_online_mems();
|
|
+ mutex_lock(&state_mutex);
|
|
+
|
|
+ if (enabled == lru_gen_enabled())
|
|
+ goto unlock;
|
|
+
|
|
+ if (enabled)
|
|
+ static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
|
|
+ else
|
|
+ static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
|
|
+
|
|
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
|
|
+ do {
|
|
+ int nid;
|
|
+
|
|
+ for_each_node(nid) {
|
|
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
|
|
+
|
|
+ if (!lruvec)
|
|
+ continue;
|
|
+
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+
|
|
+ VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
|
|
+ VM_WARN_ON_ONCE(!state_is_valid(lruvec));
|
|
+
|
|
+ lruvec->lrugen.enabled = enabled;
|
|
+
|
|
+ while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+ cond_resched();
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+ }
|
|
+
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+ }
|
|
+
|
|
+ cond_resched();
|
|
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
|
|
+unlock:
|
|
+ mutex_unlock(&state_mutex);
|
|
+ put_online_mems();
|
|
+ cpus_read_unlock();
|
|
+ cgroup_unlock();
|
|
+}
|
|
+
|
|
+/******************************************************************************
|
|
+ * sysfs interface
|
|
+ ******************************************************************************/
|
|
+
|
|
+static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
|
|
+{
|
|
+ unsigned int caps = 0;
|
|
+
|
|
+ if (get_cap(LRU_GEN_CORE))
|
|
+ caps |= BIT(LRU_GEN_CORE);
|
|
+
|
|
+ if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
|
|
+ caps |= BIT(LRU_GEN_MM_WALK);
|
|
+
|
|
+ if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG))
|
|
+ caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
|
|
+
|
|
+ return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
|
|
+}
|
|
+
|
|
+static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr,
|
|
+ const char *buf, size_t len)
|
|
+{
|
|
+ int i;
|
|
+ unsigned int caps;
|
|
+
|
|
+ if (tolower(*buf) == 'n')
|
|
+ caps = 0;
|
|
+ else if (tolower(*buf) == 'y')
|
|
+ caps = -1;
|
|
+ else if (kstrtouint(buf, 0, &caps))
|
|
+ return -EINVAL;
|
|
+
|
|
+ for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
|
|
+ bool enabled = caps & BIT(i);
|
|
+
|
|
+ if (i == LRU_GEN_CORE)
|
|
+ lru_gen_change_state(enabled);
|
|
+ else if (enabled)
|
|
+ static_branch_enable(&lru_gen_caps[i]);
|
|
+ else
|
|
+ static_branch_disable(&lru_gen_caps[i]);
|
|
+ }
|
|
+
|
|
+ return len;
|
|
+}
|
|
+
|
|
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
|
|
+ enabled, 0644, show_enabled, store_enabled
|
|
+);
|
|
+
|
|
+static struct attribute *lru_gen_attrs[] = {
|
|
+ &lru_gen_enabled_attr.attr,
|
|
+ NULL
|
|
+};
|
|
+
|
|
+static struct attribute_group lru_gen_attr_group = {
|
|
+ .name = "lru_gen",
|
|
+ .attrs = lru_gen_attrs,
|
|
+};
|
|
+
|
|
/******************************************************************************
|
|
* initialization
|
|
******************************************************************************/
|
|
@@ -4985,6 +5202,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
|
|
struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
|
|
lrugen->max_seq = MIN_NR_GENS + 1;
|
|
+ lrugen->enabled = lru_gen_enabled();
|
|
|
|
for_each_gen_type_zone(gen, type, zone)
|
|
INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
|
|
@@ -5024,6 +5242,9 @@ static int __init init_lru_gen(void)
|
|
BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
|
|
BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
|
|
|
|
+ if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
|
|
+ pr_err("lru_gen: failed to create sysfs group\n");
|
|
+
|
|
return 0;
|
|
};
|
|
late_initcall(init_lru_gen);
|
|
|
|
From patchwork Wed Jul 6 22:00:20 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908741
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id 9142FC433EF
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 22:24:03 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=5C0yJiz6WGUIxWbyHoYl+ayNd1dwNg6CZB1yq15IdGU=; b=cmNuZRCClzxDF8e+AKeO+FbAnG
|
|
XpOCYuIe3hHYTBwVN/O5acc0VMEEmeyD2Q5qBd8r04KVOHhxCRfEa0osfhWnVYzdbM9Go1hL1rhYo
|
|
UGk6i5XT7JXlfZZl5zjMgU645o1O2IZX6djwMUxxoeXnWSru/nePtRCJFvK8mHbsqapu/YGDFTLUo
|
|
wc5rgB4I93eR0cLr30c2oos/UPqJqhdLAzsCdHnD8kZOZ5Rn24aiqE65S6IrmEh0l1a4eBBxxHZGc
|
|
9JdU10gWMmZuHL+T9q/amMYpztHrSrNZ1osXCBM3JJvR8LURJe5x4F8m1h9lidgtwetquwmX4FlfK
|
|
uX3uCUCQ==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DQ9-00Ca3O-K3; Wed, 06 Jul 2022 22:22:57 +0000
|
|
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPr-00CZxO-Rx
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:22:40 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
|
|
Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:Sender
|
|
:Reply-To:Content-ID:Content-Description;
|
|
bh=m3EW4cfAlntTqnxn3SvhsZvF1ytN+sfDtB6iRdzihvY=; b=TxbXhvZjV+G4MSXEQ/BL+gVGbT
|
|
cmnl3lP3sXj5XJ0PnUIrzjjo8ZMWPm8wBkCN8BO4dfWKJew8h7h1H892hUkkbKvRpCh2reCqiTZQM
|
|
NqiHKKRUhWPT6eV9AOqRdYgKTKMaG8TWr5WhgQnCI4Cw1l4eT01koJbC2wCfVu24WW7otpAmQ+riN
|
|
cZYT3g5R+BcprwyTVXIl5W4q7TVFtFyXEy8Ht4K5+kx8wTjq5fBUlRtYH/ypqfHXmH6mLJP/1h+Y2
|
|
qbjOGLMoLrCisw03/m64Ytq3xxg17oM8fByugb7OdIzjmBGfmHxuLP72NN8Cys8puzYuv0PMBGZyg
|
|
dcK2b+8Q==;
|
|
Received: from mail-yw1-x114a.google.com ([2607:f8b0:4864:20::114a])
|
|
by casper.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D57-0021TK-E5
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:18 +0000
|
|
Received: by mail-yw1-x114a.google.com with SMTP id
|
|
00721157ae682-31cd7ade3d6so34564667b3.3
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:10 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=m3EW4cfAlntTqnxn3SvhsZvF1ytN+sfDtB6iRdzihvY=;
|
|
b=NNZxOJisLedvEph13coGoCeVo89XYF3cKhoLr0Qj+8EQSroRh25w+qZuSGaKvrNfmO
|
|
djUv79dYHeRCliQ2lBYEsuuPJN6lgSZ6cKW987LKYkUaRIiHw552kndr1VR1raRgUvCU
|
|
568te5aggKYg95okJZ0cLsdFaiOBB18/hCGgU+4bQM73SosPCL/NpSqGWL8mW9AiVFs+
|
|
hT7ErHYOnMn+bCDzuk8GAu9J4/5Gq8c/6z9M6D6X+HmVK0MeVpaKpZ0jPz/vsi747v3J
|
|
zvNibUS9XJKNBhR7/Fg26FpINdlMkWHvvcikRiTD5O+czcMeNF2XfnGAvAgAPgyPnYK8
|
|
b6mw==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=m3EW4cfAlntTqnxn3SvhsZvF1ytN+sfDtB6iRdzihvY=;
|
|
b=yH41R8nbp6IxBw199GGlk/iPcO03Q8VAcXOdHKygVCllrYglfxJnBNrsoPjYhyO8fz
|
|
4QX1Ch3XXQsa3z7/zc52X1GN5BR32z/ODiTM1i8OIqJocVPEydQqIUw4OstrvY7JwCuN
|
|
iOQoCr9BKkiuRwU4lEDgFSCm9XdtXCr4SOKhny8AIad9g7Lt1gK4GIFjPtC49CZXc1Jt
|
|
IP0Z4CdmWF5W2H029L2RoHLtm8Av4Sl3HW4g0cZaLqaDQBNnYlpPetk9X1TRZdlg9r/3
|
|
i+8Rmbwso/+8NTo71fHg+C0x94YYIzWJw/3Mv9iV2JqnAzu5+9UuJaj0pSz7sZcPmv3q
|
|
VVqQ==
|
|
X-Gm-Message-State: AJIora+c59wHWTW8KnajqgVu51ah4kP3TzcU3GkosQlSNLmSwYrtLYFm
|
|
zOkN1D5jS+J5a2AI6J9OdjeA++qt4DI=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1uVFLai60fRsrxUz+UveX+2HvTnchQxr73gyI+bA9ud92MMOTkT47lvZz9+aNC2VPhD8jfbEwKxJDM=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a25:1c56:0:b0:66e:2d23:d65d with SMTP id
|
|
c83-20020a251c56000000b0066e2d23d65dmr26931039ybc.253.1657144867700; Wed, 06
|
|
Jul 2022 15:01:07 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:20 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-12-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 11/14] mm: multi-gen LRU: thrashing prevention
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230113_486599_E9231268
|
|
X-CRM114-Status: GOOD ( 18.37 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
|
|
|
|
Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
requested by many desktop users [1].

When set to value N, it prevents the working set of N milliseconds
from getting evicted. The OOM killer is triggered if this working set
cannot be kept in memory. Based on the average human detectable lag
(~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
Larger values like N=3000 make lags less noticeable at the risk of
premature OOM kills.
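
As a reading aid, a minimal userspace sketch (not kernel code) of the
time-based check described above, with plain seconds standing in for
jiffies. Only the comparison mirrors the age_lruvec() change in the diff
below; the sample generations and values are illustrative assumptions.

        #include <stdbool.h>
        #include <stdio.h>
        #include <time.h>

        /* models the time_is_after_jiffies(birth + min_ttl) test in age_lruvec() */
        static bool protected_by_min_ttl(time_t birth, time_t now, time_t min_ttl)
        {
                return now < birth + min_ttl;
        }

        int main(void)
        {
                time_t now = time(NULL);
                time_t min_ttl = 1;                     /* ~ min_ttl_ms = 1000 */
                time_t births[] = { now, now - 5 };     /* a fresh and a stale generation */
                bool success = false;
                int i;

                for (i = 0; i < 2; i++) {
                        if (protected_by_min_ttl(births[i], now, min_ttl)) {
                                printf("generation %d: younger than min_ttl, skip\n", i);
                                continue;
                        }
                        printf("generation %d: old enough, eligible for eviction\n", i);
                        success = true;
                }

                if (!success)
                        printf("no generation old enough: would invoke the OOM killer\n");
                return 0;
        }
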
Compared with the size-based approach [2], this time-based approach
has the following advantages:
1. It is easier to configure because it is agnostic to applications
   and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

[1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/
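
Assuming the sysfs interface added below, the N=1000 guidance above would
typically be applied as follows (output shown assuming the value
round-trips exactly through jiffies):
  echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms
  cat /sys/kernel/mm/lru_gen/min_ttl_ms
  1000
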
Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
---
|
|
include/linux/mmzone.h | 2 ++
|
|
mm/vmscan.c | 71 +++++++++++++++++++++++++++++++++++++++---
|
|
2 files changed, 69 insertions(+), 4 deletions(-)
|
|
|
|
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
|
|
index 840b7ca8b91f..472bd5335517 100644
|
|
--- a/include/linux/mmzone.h
|
|
+++ b/include/linux/mmzone.h
|
|
@@ -419,6 +419,8 @@ struct lru_gen_struct {
|
|
unsigned long max_seq;
|
|
/* the eviction increments the oldest generation numbers */
|
|
unsigned long min_seq[ANON_AND_FILE];
|
|
+ /* the birth time of each generation in jiffies */
|
|
+ unsigned long timestamps[MAX_NR_GENS];
|
|
/* the multi-gen LRU lists, lazily sorted on eviction */
|
|
struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
|
|
/* the multi-gen LRU sizes, eventually consistent */
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index 4c8b475429ed..1f2892a0dc41 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -4233,6 +4233,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
|
|
for (type = 0; type < ANON_AND_FILE; type++)
|
|
reset_ctrl_pos(lruvec, type, false);
|
|
|
|
+ WRITE_ONCE(lrugen->timestamps[next], jiffies);
|
|
/* make sure preceding modifications appear */
|
|
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
|
|
|
|
@@ -4359,7 +4360,7 @@ static unsigned long get_nr_evictable(struct lruvec *lruvec, unsigned long max_s
|
|
return total;
|
|
}
|
|
|
|
-static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long min_ttl)
|
|
{
|
|
bool need_aging;
|
|
unsigned long nr_to_scan;
|
|
@@ -4373,21 +4374,40 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
|
|
mem_cgroup_calculate_protection(NULL, memcg);
|
|
|
|
if (mem_cgroup_below_min(memcg))
|
|
- return;
|
|
+ return false;
|
|
|
|
nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
|
|
if (!nr_to_scan)
|
|
- return;
|
|
+ return false;
|
|
|
|
nr_to_scan >>= mem_cgroup_online(memcg) ? sc->priority : 0;
|
|
|
|
+ if (min_ttl) {
|
|
+ int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
|
|
+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
|
|
+
|
|
+ if (time_is_after_jiffies(birth + min_ttl))
|
|
+ return false;
|
|
+
|
|
+ /* the size is likely too small to be helpful */
|
|
+ if (!nr_to_scan && sc->priority != DEF_PRIORITY)
|
|
+ return false;
|
|
+ }
|
|
+
|
|
if (nr_to_scan && need_aging)
|
|
try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
|
|
+
|
|
+ return true;
|
|
}
|
|
|
|
+/* to protect the working set of the last N jiffies */
|
|
+static unsigned long lru_gen_min_ttl __read_mostly;
|
|
+
|
|
static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
{
|
|
struct mem_cgroup *memcg;
|
|
+ bool success = false;
|
|
+ unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
|
|
|
|
VM_WARN_ON_ONCE(!current_is_kswapd());
|
|
|
|
@@ -4413,12 +4433,28 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
|
|
do {
|
|
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
|
|
|
- age_lruvec(lruvec, sc);
|
|
+ if (age_lruvec(lruvec, sc, min_ttl))
|
|
+ success = true;
|
|
|
|
cond_resched();
|
|
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
|
|
|
|
clear_mm_walk();
|
|
+
|
|
+ /*
|
|
+ * The main goal is to OOM kill if every generation from all memcgs is
|
|
+ * younger than min_ttl. However, another theoretical possibility is all
|
|
+ * memcgs are either below min or empty.
|
|
+ */
|
|
+ if (!success && !sc->order && mutex_trylock(&oom_lock)) {
|
|
+ struct oom_control oc = {
|
|
+ .gfp_mask = sc->gfp_mask,
|
|
+ };
|
|
+
|
|
+ out_of_memory(&oc);
|
|
+
|
|
+ mutex_unlock(&oom_lock);
|
|
+ }
|
|
}
|
|
|
|
/*
|
|
@@ -5135,6 +5171,28 @@ static void lru_gen_change_state(bool enabled)
|
|
* sysfs interface
|
|
******************************************************************************/
|
|
|
|
+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
|
|
+{
|
|
+ return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
|
|
+}
|
|
+
|
|
+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
|
|
+ const char *buf, size_t len)
|
|
+{
|
|
+ unsigned int msecs;
|
|
+
|
|
+ if (kstrtouint(buf, 0, &msecs))
|
|
+ return -EINVAL;
|
|
+
|
|
+ WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
|
|
+
|
|
+ return len;
|
|
+}
|
|
+
|
|
+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
|
|
+ min_ttl_ms, 0644, show_min_ttl, store_min_ttl
|
|
+);
|
|
+
|
|
static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
|
|
{
|
|
unsigned int caps = 0;
|
|
@@ -5183,6 +5241,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
|
|
);
|
|
|
|
static struct attribute *lru_gen_attrs[] = {
|
|
+ &lru_gen_min_ttl_attr.attr,
|
|
&lru_gen_enabled_attr.attr,
|
|
NULL
|
|
};
|
|
@@ -5198,12 +5257,16 @@ static struct attribute_group lru_gen_attr_group = {
|
|
|
|
void lru_gen_init_lruvec(struct lruvec *lruvec)
|
|
{
|
|
+ int i;
|
|
int gen, type, zone;
|
|
struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
|
|
lrugen->max_seq = MIN_NR_GENS + 1;
|
|
lrugen->enabled = lru_gen_enabled();
|
|
|
|
+ for (i = 0; i <= MIN_NR_GENS + 1; i++)
|
|
+ lrugen->timestamps[i] = jiffies;
|
|
+
|
|
for_each_gen_type_zone(gen, type, zone)
|
|
INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
|
|
|
|
|
|
From patchwork Wed Jul 6 22:00:21 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908745
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id 69C7CC433EF
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 22:24:29 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=kn0zTwUAkJ9YvXEbvH6Zd1cB4SwLG9B87GcNlQE3sck=; b=CpGgArDWloz9xlpWVlfYrCehce
|
|
QDoCjckhmXmlpkrSh/0SyOr0ftP7+rBJrP46fHB0yjS+S3rApCc4QvtqPsLL5c2MGXV9Zgn/fKE/a
|
|
n1bmtatTm1KQSkPQcdfXVvcfj521gj9Fk2SacYsgtpObS0hVAKcc7bEQ3QL5FyjkQCoJjVjEkrZ53
|
|
RJaBV/lU12nQTyGIOuk31nyT4Qn6mi6XcIIige5uFzmIhBIU12w7CnYZMyp3CMaL9Tn0slOJtlNGj
|
|
LbnixMG+qbhklr0L2qAUni4/bG7mRm91CBJNZqwCoEtyI2OIoM/hBQftpijuGp2xDN0WGZTRpBh7q
|
|
nmk5WdqA==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DQR-00Ca72-28; Wed, 06 Jul 2022 22:23:15 +0000
|
|
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPs-00CZxO-Iw
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:22:40 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
|
|
Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:Sender
|
|
:Reply-To:Content-ID:Content-Description;
|
|
bh=nff1jLrA4AEpo88lpO2ZCXRvuzs0CKl/TI+ofmEg1y8=; b=Lyl/3qQ7SrpOKYLNfds67zMHRt
|
|
eUIMA3I01yD2ZB3zW7E3WG6t6jdPJHDmA1lRT9tg70NmW9zIHc/cE3/uv3h/468xUf3XrwNqJrMAE
|
|
+wo5EejgDrZ1yiuMHTIaQwhs09NHLQnLp3fsSoKVrJ84XhesaZMmcmhLXoF512WUh7+wNKYU7oKmj
|
|
sZ3FlVUNZO1uSBd81LSTK5DxXVIZ4NA71DuMGL/JOZCwG30bew76VWUVqxj/DVhTVE9RSXTYdCkZw
|
|
29CvP6s9ig31LBeATsWqUrcP+BnFL0+5CWolreyVQJqaY29pN6oOjPz39dUF9ZVMK73mXNmRGt5yX
|
|
y4QarRyw==;
|
|
Received: from mail-yb1-xb4a.google.com ([2607:f8b0:4864:20::b4a])
|
|
by casper.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D57-0021Td-B1
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:17 +0000
|
|
Received: by mail-yb1-xb4a.google.com with SMTP id
|
|
y4-20020a25b9c4000000b0066e573fb0fcso5956820ybj.21
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:11 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=nff1jLrA4AEpo88lpO2ZCXRvuzs0CKl/TI+ofmEg1y8=;
|
|
b=CPQvXMErOqHr1LM+OMqtT0F59XyB+HiQxBX+EbwoUSnPn/FOpbR4dV1NCCwYakR+KD
|
|
gThfZIfqp3Y1SzCO2443reP2Soe3KDHNgAEXCZ5YNoeE7AXlAuA2fgD7YeAXZovjmVIh
|
|
7mERrjTMT6/EWjW531e5FNoxfhaMBEMBEgwjAOQ3Km57LeRgBcWr2IgRe48XaW69M16C
|
|
KWj2PGLEmurhGwwHU4NVVPpbjL3o7cE3vD/yehuUCz476hIOcC2Nqpn4krz36H5vP68u
|
|
MNeJkhynrE7FhYi7+GgffibtX96Vf3x/16YGAxyUCnSyvvk6OhNUeqKo/LQmoS3LAyl4
|
|
LFpw==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=nff1jLrA4AEpo88lpO2ZCXRvuzs0CKl/TI+ofmEg1y8=;
|
|
b=So5lW4F21tDMASYDUfsRHvrcdCadkcE48vj7OigkMBZEvhMQcpurwcw40ffw+5CWRo
|
|
JQgbp7sKYlel0o8/R+yX4TbguLQNaRZJTxNa5NdK///r2TyNiI0UZVgRJUllo91I1iBS
|
|
bMnX0UmyJaXs5HfcdJ2mBCynhlG3Port8pSAlSM2VR0L21710yjA9fCWttF3xJxzU04Y
|
|
mpJZv0XS+ADKDmufvYB3x3LQ5tS6ezdaKnQo8+3o+3qCQ42CuFnZwpX9/dfyxfrjTWfP
|
|
9cg5vyriTtszaGPSY7jn+fKiRPEzsXFpCFv3q1qdzRIJzJgsv8i3yvx3Ifizmaw3E2xu
|
|
U7/w==
|
|
X-Gm-Message-State: AJIora8FvoX31aCeblFiCHzHMkTsZOFkBDJywfNPqiEQ53UsxMHV1vvA
|
|
mEwrwu6C/L9QfRnrTDaqGibQKh8fOyg=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1shBkUATwCAbsz8cAeEoY3s7WAj+Jhs0L0rMlWdOOLCX8yRP4QO9OI90Aiszy92GtEPUW7W76UGd7w=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a0d:c486:0:b0:31c:3b63:91fe with SMTP id
|
|
g128-20020a0dc486000000b0031c3b6391femr43427605ywd.7.1657144869573; Wed, 06
|
|
Jul 2022 15:01:09 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:21 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-13-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 12/14] mm: multi-gen LRU: debugfs interface
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Qi Zheng <zhengqi.arch@bytedance.com>, Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230113_411424_932E34AC
|
|
X-CRM114-Status: GOOD ( 21.98 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
|
|
|
|
Add /sys/kernel/debug/lru_gen for working set estimation and proactive
|
|
reclaim. These techniques are commonly used to optimize job scheduling
|
|
(bin packing) in data centers [1][2].
|
|
|
|
Compared with the page table-based approach and the PFN-based
|
|
approach, this lruvec-based approach has the following advantages:
|
|
1. It offers better choices because it is aware of memcgs, NUMA nodes,
|
|
shared mappings and unmapped page cache.
|
|
2. It is more scalable because it is O(nr_hot_pages), whereas the
|
|
PFN-based approach is O(nr_total_pages).
|
|
|
|
Add /sys/kernel/debug/lru_gen_full for debugging.
|
|
|
|
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
|
|
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
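Both files can be driven entirely from userspace. A minimal sketch (illustrative only, not part of this patch; it assumes debugfs is mounted at /sys/kernel/debug and root privileges) that simply dumps the two files:

/* Illustrative only: dump the per-memcg, per-node generation histograms. */
#include <stdio.h>

static void dump(const char *path)
{
        char line[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
}

int main(void)
{
        dump("/sys/kernel/debug/lru_gen");
        dump("/sys/kernel/debug/lru_gen_full");  /* extra stats for debugging */
        return 0;
}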
|
|
|
|
Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
---
|
|
include/linux/nodemask.h | 1 +
|
|
mm/vmscan.c | 411 ++++++++++++++++++++++++++++++++++++++-
|
|
2 files changed, 402 insertions(+), 10 deletions(-)
|
|
|
|
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
|
|
index 0f233b76c9ce..292ec0ce0d63 100644
|
|
--- a/include/linux/nodemask.h
|
|
+++ b/include/linux/nodemask.h
|
|
@@ -485,6 +485,7 @@ static inline int num_node_state(enum node_states state)
|
|
#define first_online_node 0
|
|
#define first_memory_node 0
|
|
#define next_online_node(nid) (MAX_NUMNODES)
|
|
+#define next_memory_node(nid) (MAX_NUMNODES)
|
|
#define nr_node_ids 1U
|
|
#define nr_online_nodes 1U
|
|
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index 1f2892a0dc41..fbcd298adca7 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -53,6 +53,7 @@
|
|
#include <linux/pagewalk.h>
|
|
#include <linux/shmem_fs.h>
|
|
#include <linux/ctype.h>
|
|
+#include <linux/debugfs.h>
|
|
|
|
#include <asm/tlbflush.h>
|
|
#include <asm/div64.h>
|
|
@@ -4137,12 +4138,40 @@ static void clear_mm_walk(void)
|
|
kfree(walk);
|
|
}
|
|
|
|
-static void inc_min_seq(struct lruvec *lruvec, int type)
|
|
+static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
|
|
{
|
|
+ int zone;
|
|
+ int remaining = MAX_LRU_BATCH;
|
|
struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
|
|
|
|
+ if (type == LRU_GEN_ANON && !can_swap)
|
|
+ goto done;
|
|
+
|
|
+ /* prevent cold/hot inversion if force_scan is true */
|
|
+ for (zone = 0; zone < MAX_NR_ZONES; zone++) {
|
|
+ struct list_head *head = &lrugen->lists[old_gen][type][zone];
|
|
+
|
|
+ while (!list_empty(head)) {
|
|
+ struct folio *folio = lru_to_folio(head);
|
|
+
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
|
|
+ VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
|
|
+
|
|
+ new_gen = folio_inc_gen(lruvec, folio, false);
|
|
+ list_move_tail(&folio->lru, &lrugen->lists[new_gen][type][zone]);
|
|
+
|
|
+ if (!--remaining)
|
|
+ return false;
|
|
+ }
|
|
+ }
|
|
+done:
|
|
reset_ctrl_pos(lruvec, type, true);
|
|
WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
|
|
+
|
|
+ return true;
|
|
}
|
|
|
|
static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
|
|
@@ -4188,7 +4217,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
|
|
return success;
|
|
}
|
|
|
|
-static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
|
|
+static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
|
|
{
|
|
int prev, next;
|
|
int type, zone;
|
|
@@ -4202,9 +4231,13 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
|
|
if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
|
|
continue;
|
|
|
|
- VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap);
|
|
+ VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
|
|
|
|
- inc_min_seq(lruvec, type);
|
|
+ while (!inc_min_seq(lruvec, type, can_swap)) {
|
|
+ spin_unlock_irq(&lruvec->lru_lock);
|
|
+ cond_resched();
|
|
+ spin_lock_irq(&lruvec->lru_lock);
|
|
+ }
|
|
}
|
|
|
|
/*
|
|
@@ -4241,7 +4274,7 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
|
|
}
|
|
|
|
static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
|
|
- struct scan_control *sc, bool can_swap)
|
|
+ struct scan_control *sc, bool can_swap, bool force_scan)
|
|
{
|
|
bool success;
|
|
struct lru_gen_mm_walk *walk;
|
|
@@ -4262,7 +4295,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
|
|
* handful of PTEs. Spreading the work out over a period of time usually
|
|
* is less efficient, but it avoids bursty page faults.
|
|
*/
|
|
- if (!(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) {
|
|
+ if (!force_scan && !(arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))) {
|
|
success = iterate_mm_list_nowalk(lruvec, max_seq);
|
|
goto done;
|
|
}
|
|
@@ -4276,7 +4309,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
|
|
walk->lruvec = lruvec;
|
|
walk->max_seq = max_seq;
|
|
walk->can_swap = can_swap;
|
|
- walk->force_scan = false;
|
|
+ walk->force_scan = force_scan;
|
|
|
|
do {
|
|
success = iterate_mm_list(lruvec, walk, &mm);
|
|
@@ -4296,7 +4329,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
|
|
|
|
VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq));
|
|
|
|
- inc_max_seq(lruvec, can_swap);
|
|
+ inc_max_seq(lruvec, can_swap, force_scan);
|
|
/* either this sees any waiters or they will see updated max_seq */
|
|
if (wq_has_sleeper(&lruvec->mm_state.wait))
|
|
wake_up_all(&lruvec->mm_state.wait);
|
|
@@ -4395,7 +4428,7 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned
|
|
}
|
|
|
|
if (nr_to_scan && need_aging)
|
|
- try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
|
|
+ try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
|
|
|
|
return true;
|
|
}
|
|
@@ -4962,7 +4995,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
|
|
if (current_is_kswapd())
|
|
return 0;
|
|
|
|
- if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap))
|
|
+ if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false))
|
|
return nr_to_scan;
|
|
done:
|
|
return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
|
|
@@ -5251,6 +5284,361 @@ static struct attribute_group lru_gen_attr_group = {
|
|
.attrs = lru_gen_attrs,
|
|
};
|
|
|
|
+/******************************************************************************
|
|
+ * debugfs interface
|
|
+ ******************************************************************************/
|
|
+
|
|
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
|
|
+{
|
|
+ struct mem_cgroup *memcg;
|
|
+ loff_t nr_to_skip = *pos;
|
|
+
|
|
+ m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
|
|
+ if (!m->private)
|
|
+ return ERR_PTR(-ENOMEM);
|
|
+
|
|
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
|
|
+ do {
|
|
+ int nid;
|
|
+
|
|
+ for_each_node_state(nid, N_MEMORY) {
|
|
+ if (!nr_to_skip--)
|
|
+ return get_lruvec(memcg, nid);
|
|
+ }
|
|
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
|
|
+
|
|
+ return NULL;
|
|
+}
|
|
+
|
|
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
|
|
+{
|
|
+ if (!IS_ERR_OR_NULL(v))
|
|
+ mem_cgroup_iter_break(NULL, lruvec_memcg(v));
|
|
+
|
|
+ kvfree(m->private);
|
|
+ m->private = NULL;
|
|
+}
|
|
+
|
|
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
|
|
+{
|
|
+ int nid = lruvec_pgdat(v)->node_id;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(v);
|
|
+
|
|
+ ++*pos;
|
|
+
|
|
+ nid = next_memory_node(nid);
|
|
+ if (nid == MAX_NUMNODES) {
|
|
+ memcg = mem_cgroup_iter(NULL, memcg, NULL);
|
|
+ if (!memcg)
|
|
+ return NULL;
|
|
+
|
|
+ nid = first_memory_node;
|
|
+ }
|
|
+
|
|
+ return get_lruvec(memcg, nid);
|
|
+}
|
|
+
|
|
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
|
|
+ unsigned long max_seq, unsigned long *min_seq,
|
|
+ unsigned long seq)
|
|
+{
|
|
+ int i;
|
|
+ int type, tier;
|
|
+ int hist = lru_hist_from_seq(seq);
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+
|
|
+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
|
|
+ seq_printf(m, " %10d", tier);
|
|
+ for (type = 0; type < ANON_AND_FILE; type++) {
|
|
+ const char *s = " ";
|
|
+ unsigned long n[3] = {};
|
|
+
|
|
+ if (seq == max_seq) {
|
|
+ s = "RT ";
|
|
+ n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
|
|
+ n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
|
|
+ } else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
|
|
+ s = "rep";
|
|
+ n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
|
|
+ n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
|
|
+ if (tier)
|
|
+ n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
|
|
+ }
|
|
+
|
|
+ for (i = 0; i < 3; i++)
|
|
+ seq_printf(m, " %10lu%c", n[i], s[i]);
|
|
+ }
|
|
+ seq_putc(m, '\n');
|
|
+ }
|
|
+
|
|
+ seq_puts(m, " ");
|
|
+ for (i = 0; i < NR_MM_STATS; i++) {
|
|
+ const char *s = " ";
|
|
+ unsigned long n = 0;
|
|
+
|
|
+ if (seq == max_seq && NR_HIST_GENS == 1) {
|
|
+ s = "LOYNFA";
|
|
+ n = READ_ONCE(lruvec->mm_state.stats[hist][i]);
|
|
+ } else if (seq != max_seq && NR_HIST_GENS > 1) {
|
|
+ s = "loynfa";
|
|
+ n = READ_ONCE(lruvec->mm_state.stats[hist][i]);
|
|
+ }
|
|
+
|
|
+ seq_printf(m, " %10lu%c", n, s[i]);
|
|
+ }
|
|
+ seq_putc(m, '\n');
|
|
+}
|
|
+
|
|
+static int lru_gen_seq_show(struct seq_file *m, void *v)
|
|
+{
|
|
+ unsigned long seq;
|
|
+ bool full = !debugfs_real_fops(m->file)->write;
|
|
+ struct lruvec *lruvec = v;
|
|
+ struct lru_gen_struct *lrugen = &lruvec->lrugen;
|
|
+ int nid = lruvec_pgdat(lruvec)->node_id;
|
|
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
|
|
+ DEFINE_MAX_SEQ(lruvec);
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ if (nid == first_memory_node) {
|
|
+ const char *path = memcg ? m->private : "";
|
|
+
|
|
+#ifdef CONFIG_MEMCG
|
|
+ if (memcg)
|
|
+ cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
|
|
+#endif
|
|
+ seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
|
|
+ }
|
|
+
|
|
+ seq_printf(m, " node %5d\n", nid);
|
|
+
|
|
+ if (!full)
|
|
+ seq = min_seq[LRU_GEN_ANON];
|
|
+ else if (max_seq >= MAX_NR_GENS)
|
|
+ seq = max_seq - MAX_NR_GENS + 1;
|
|
+ else
|
|
+ seq = 0;
|
|
+
|
|
+ for (; seq <= max_seq; seq++) {
|
|
+ int type, zone;
|
|
+ int gen = lru_gen_from_seq(seq);
|
|
+ unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
|
|
+
|
|
+ seq_printf(m, " %10lu %10u", seq, jiffies_to_msecs(jiffies - birth));
|
|
+
|
|
+ for (type = 0; type < ANON_AND_FILE; type++) {
|
|
+ unsigned long size = 0;
|
|
+ char mark = full && seq < min_seq[type] ? 'x' : ' ';
|
|
+
|
|
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
|
|
+ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
|
|
+
|
|
+ seq_printf(m, " %10lu%c", size, mark);
|
|
+ }
|
|
+
|
|
+ seq_putc(m, '\n');
|
|
+
|
|
+ if (full)
|
|
+ lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
|
|
+ }
|
|
+
|
|
+ return 0;
|
|
+}
|
|
+
|
|
+static const struct seq_operations lru_gen_seq_ops = {
|
|
+ .start = lru_gen_seq_start,
|
|
+ .stop = lru_gen_seq_stop,
|
|
+ .next = lru_gen_seq_next,
|
|
+ .show = lru_gen_seq_show,
|
|
+};
|
|
+
|
|
+static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
|
|
+ bool can_swap, bool force_scan)
|
|
+{
|
|
+ DEFINE_MAX_SEQ(lruvec);
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ if (seq < max_seq)
|
|
+ return 0;
|
|
+
|
|
+ if (seq > max_seq)
|
|
+ return -EINVAL;
|
|
+
|
|
+ if (!force_scan && min_seq[!can_swap] + MAX_NR_GENS - 1 <= max_seq)
|
|
+ return -ERANGE;
|
|
+
|
|
+ try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, force_scan);
|
|
+
|
|
+ return 0;
|
|
+}
|
|
+
|
|
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
|
|
+ int swappiness, unsigned long nr_to_reclaim)
|
|
+{
|
|
+ DEFINE_MAX_SEQ(lruvec);
|
|
+
|
|
+ if (seq + MIN_NR_GENS > max_seq)
|
|
+ return -EINVAL;
|
|
+
|
|
+ sc->nr_reclaimed = 0;
|
|
+
|
|
+ while (!signal_pending(current)) {
|
|
+ DEFINE_MIN_SEQ(lruvec);
|
|
+
|
|
+ if (seq < min_seq[!swappiness])
|
|
+ return 0;
|
|
+
|
|
+ if (sc->nr_reclaimed >= nr_to_reclaim)
|
|
+ return 0;
|
|
+
|
|
+ if (!evict_folios(lruvec, sc, swappiness, NULL))
|
|
+ return 0;
|
|
+
|
|
+ cond_resched();
|
|
+ }
|
|
+
|
|
+ return -EINTR;
|
|
+}
|
|
+
|
|
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
|
|
+ struct scan_control *sc, int swappiness, unsigned long opt)
|
|
+{
|
|
+ struct lruvec *lruvec;
|
|
+ int err = -EINVAL;
|
|
+ struct mem_cgroup *memcg = NULL;
|
|
+
|
|
+ if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
|
|
+ return -EINVAL;
|
|
+
|
|
+ if (!mem_cgroup_disabled()) {
|
|
+ rcu_read_lock();
|
|
+ memcg = mem_cgroup_from_id(memcg_id);
|
|
+#ifdef CONFIG_MEMCG
|
|
+ if (memcg && !css_tryget(&memcg->css))
|
|
+ memcg = NULL;
|
|
+#endif
|
|
+ rcu_read_unlock();
|
|
+
|
|
+ if (!memcg)
|
|
+ return -EINVAL;
|
|
+ }
|
|
+
|
|
+ if (memcg_id != mem_cgroup_id(memcg))
|
|
+ goto done;
|
|
+
|
|
+ lruvec = get_lruvec(memcg, nid);
|
|
+
|
|
+ if (swappiness < 0)
|
|
+ swappiness = get_swappiness(lruvec, sc);
|
|
+ else if (swappiness > 200)
|
|
+ goto done;
|
|
+
|
|
+ switch (cmd) {
|
|
+ case '+':
|
|
+ err = run_aging(lruvec, seq, sc, swappiness, opt);
|
|
+ break;
|
|
+ case '-':
|
|
+ err = run_eviction(lruvec, seq, sc, swappiness, opt);
|
|
+ break;
|
|
+ }
|
|
+done:
|
|
+ mem_cgroup_put(memcg);
|
|
+
|
|
+ return err;
|
|
+}
|
|
+
|
|
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
|
|
+ size_t len, loff_t *pos)
|
|
+{
|
|
+ void *buf;
|
|
+ char *cur, *next;
|
|
+ unsigned int flags;
|
|
+ struct blk_plug plug;
|
|
+ int err = -EINVAL;
|
|
+ struct scan_control sc = {
|
|
+ .may_writepage = true,
|
|
+ .may_unmap = true,
|
|
+ .may_swap = true,
|
|
+ .reclaim_idx = MAX_NR_ZONES - 1,
|
|
+ .gfp_mask = GFP_KERNEL,
|
|
+ };
|
|
+
|
|
+ buf = kvmalloc(len + 1, GFP_KERNEL);
|
|
+ if (!buf)
|
|
+ return -ENOMEM;
|
|
+
|
|
+ if (copy_from_user(buf, src, len)) {
|
|
+ kvfree(buf);
|
|
+ return -EFAULT;
|
|
+ }
|
|
+
|
|
+ set_task_reclaim_state(current, &sc.reclaim_state);
|
|
+ flags = memalloc_noreclaim_save();
|
|
+ blk_start_plug(&plug);
|
|
+ if (!set_mm_walk(NULL)) {
|
|
+ err = -ENOMEM;
|
|
+ goto done;
|
|
+ }
|
|
+
|
|
+ next = buf;
|
|
+ next[len] = '\0';
|
|
+
|
|
+ while ((cur = strsep(&next, ",;\n"))) {
|
|
+ int n;
|
|
+ int end;
|
|
+ char cmd;
|
|
+ unsigned int memcg_id;
|
|
+ unsigned int nid;
|
|
+ unsigned long seq;
|
|
+ unsigned int swappiness = -1;
|
|
+ unsigned long opt = -1;
|
|
+
|
|
+ cur = skip_spaces(cur);
|
|
+ if (!*cur)
|
|
+ continue;
|
|
+
|
|
+ n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
|
|
+ &seq, &end, &swappiness, &end, &opt, &end);
|
|
+ if (n < 4 || cur[end]) {
|
|
+ err = -EINVAL;
|
|
+ break;
|
|
+ }
|
|
+
|
|
+ err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
|
|
+ if (err)
|
|
+ break;
|
|
+ }
|
|
+done:
|
|
+ clear_mm_walk();
|
|
+ blk_finish_plug(&plug);
|
|
+ memalloc_noreclaim_restore(flags);
|
|
+ set_task_reclaim_state(current, NULL);
|
|
+
|
|
+ kvfree(buf);
|
|
+
|
|
+ return err ? : len;
|
|
+}
|
|
+
|
|
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
|
|
+{
|
|
+ return seq_open(file, &lru_gen_seq_ops);
|
|
+}
|
|
+
|
|
+static const struct file_operations lru_gen_rw_fops = {
|
|
+ .open = lru_gen_seq_open,
|
|
+ .read = seq_read,
|
|
+ .write = lru_gen_seq_write,
|
|
+ .llseek = seq_lseek,
|
|
+ .release = seq_release,
|
|
+};
|
|
+
|
|
+static const struct file_operations lru_gen_ro_fops = {
|
|
+ .open = lru_gen_seq_open,
|
|
+ .read = seq_read,
|
|
+ .llseek = seq_lseek,
|
|
+ .release = seq_release,
|
|
+};
|
|
+
|
|
/******************************************************************************
|
|
* initialization
|
|
******************************************************************************/
|
|
@@ -5308,6 +5696,9 @@ static int __init init_lru_gen(void)
|
|
if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
|
|
pr_err("lru_gen: failed to create sysfs group\n");
|
|
|
|
+ debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
|
|
+ debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
|
|
+
|
|
return 0;
|
|
};
|
|
late_initcall(init_lru_gen);
|
|
|
|
From patchwork Wed Jul 6 22:00:22 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908742
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id DD484C43334
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 22:24:05 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=Ao3jm1+lj+7gAmG7fQl/WTPtirEMNTYNpcD+Xu8Sm1o=; b=kAyVjTSjpE0AWlMkfBg/QqkCGv
|
|
wzSIicSDVFNQozS8FTJWN8p7wuikom5TiHmYMvWOAJ0WaqxVkNl5NfbjgQSO6W62YoXV49Kz5Kz7T
|
|
b0V9ApU0/S4zuexXgh9pHjwP3F+R628AB0LjoNIZhHS1WCKMGJjP59fgualauPloRGp5itAmeoXwA
|
|
cAOpRZqvF1jDeJKS4iy+BJYYkTL/DMlAQ0Szw6ij/irglhBsvk8vc8NjJYydRKQ+kC+rtxfUMy/dx
|
|
IwXzoqtIrzTzhI4owkmChVYOSvlOrsJfPlKj4p8QLXNDOtn1FM6Vh1zPaG0GViv+/T6FNIFj2KUbE
|
|
BjXVA/Lw==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DQ1-00Ca2T-4n; Wed, 06 Jul 2022 22:22:49 +0000
|
|
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPr-00CZxO-Gf
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:22:39 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
|
|
Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:Sender
|
|
:Reply-To:Content-ID:Content-Description;
|
|
bh=gpspQZzCiCtLDF1mE2Rbzp3OUWg7vlq70C4xLE3ya+E=; b=ARZggGLH9o40RcLPMwBeww6+VP
|
|
PLL7QiZxiKyeNnbPTR8cF433hgZzCpN5IYS6hLttM9dHVh57wBu/x2l8Xf1vXieVgm3KWkcBFTbKu
|
|
Idenk/EGoXjyJmtz+Cdlwmm/VP+OnSiQiR8z7hvFw3r7sGYp4e5Qad0qpj6X7NsGjIo84HaE7lZd+
|
|
ldumkxCgx8dCdHqYB+hKuyPvqrHcmNwK6E+CAFtwoB7Sbbyl6iyfdPGVhr/CxfqE8XZwvoKeUj5N+
|
|
tJIsDUMpglTNHmyVYZgptCqhyz3Q6iAwHvsgkII8RiL6wBPSFAoVyzsLUuRpfEfw+0AvZeOEH01Fs
|
|
yuHPXZUQ==;
|
|
Received: from mail-yw1-x1149.google.com ([2607:f8b0:4864:20::1149])
|
|
by casper.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D58-0021UO-PF
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:19 +0000
|
|
Received: by mail-yw1-x1149.google.com with SMTP id
|
|
00721157ae682-31cd7ade3d6so34565957b3.3
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:12 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=gpspQZzCiCtLDF1mE2Rbzp3OUWg7vlq70C4xLE3ya+E=;
|
|
b=KmRh3W6zCTnYhuu2uLwH/71AGZzl5TVUrtsNnUP5zXTmGsYrVbcqdtCu+MA/r0Ndp0
|
|
Swx6K5/Y1yzZuona+ojX9pyfPH0vSgmsnPUuGuK8IgKoxke8pbVIOMVO1oHB4MFfbJr9
|
|
MZQ2DHsaZhnv+oABy231/ZNYVnut1uI8HXMoZE64GkKDaX0oTm6VD5IWp6Pjb9e4CCS2
|
|
4l6LRlV0GkUZbtfNu7oRMgYKOcOBXuCtbtOCopiW839uMoofW0liroJ2wElyPDiAsF2j
|
|
ZEKcyiLmzwxANf1QRl8D0H0t207nTseUwQuoJ0fGq2geu1GyW7/GzRuxYm66v/+UUfVJ
|
|
Ti/g==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=gpspQZzCiCtLDF1mE2Rbzp3OUWg7vlq70C4xLE3ya+E=;
|
|
b=xnAxXpDSgNsOs+lWJLT3RYX89lubGk4FDAvw0IPo8W1qjlt0NXcG5gR2auTfte337U
|
|
UgmnKLuiPgigLuGza3GEjS56nFmS1vUv5D/aEG535f0m0ZTFvqU1riGQGES5H7kSZfCf
|
|
Pel49wyuFOHpVdsDhtmNIBSkLrzMikAAFXrlz4AxdT4v6Oh/8364cIv1mL17YZnRMtzd
|
|
gFN7NDRajHOwDxBlbmY5L42+1Obtqi5I031OZ1xIw8V8MC6e/8Yt0qQfCcKTgA0Sf0wc
|
|
3GG0Ae0mCbueu6Y7EWcbpUeXskqengXyjkar6LXO8Yi5Nc0vA8vweM6B8BQcYW0STZtR
|
|
HWMQ==
|
|
X-Gm-Message-State: AJIora+8o9YEFxj5O8CUA7VpVBqwtwg5yeKK6fa/HS8j91uA9VSY1g70
|
|
bKdpftypH9ReDuqE+vtoYgrjKJseL9k=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1tcS0bhkovBqaAMGcBTFG0LXet+IyIY3UhCyBJaxouYWPrdgSATtWZUnD1044Cxo6jW3UsFPLIHGBk=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a25:73d1:0:b0:66e:aee4:feb3 with SMTP id
|
|
o200-20020a2573d1000000b0066eaee4feb3mr1925521ybc.452.1657144871215; Wed, 06
|
|
Jul 2022 15:01:11 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:22 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-14-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 13/14] mm: multi-gen LRU: admin guide
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230114_864603_F86478F1
|
|
X-CRM114-Status: GOOD ( 22.99 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
|
|
|
|
Add an admin guide.
|
|
|
|
Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
|
|
---
|
|
Documentation/admin-guide/mm/index.rst | 1 +
|
|
Documentation/admin-guide/mm/multigen_lru.rst | 156 ++++++++++++++++++
|
|
mm/Kconfig | 3 +-
|
|
mm/vmscan.c | 4 +
|
|
4 files changed, 163 insertions(+), 1 deletion(-)
|
|
create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
|
|
|
|
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
|
|
index c21b5823f126..2cf5bae62036 100644
|
|
--- a/Documentation/admin-guide/mm/index.rst
|
|
+++ b/Documentation/admin-guide/mm/index.rst
|
|
@@ -32,6 +32,7 @@ the Linux memory management.
|
|
idle_page_tracking
|
|
ksm
|
|
memory-hotplug
|
|
+ multigen_lru
|
|
nommu-mmap
|
|
numa_memory_policy
|
|
numaperf
|
|
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
|
|
new file mode 100644
|
|
index 000000000000..6355f2b5019d
|
|
--- /dev/null
|
|
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
|
|
@@ -0,0 +1,156 @@
|
|
+.. SPDX-License-Identifier: GPL-2.0
|
|
+
|
|
+=============
|
|
+Multi-Gen LRU
|
|
+=============
|
|
+The multi-gen LRU is an alternative LRU implementation that optimizes
|
|
+page reclaim and improves performance under memory pressure. Page
|
|
+reclaim decides the kernel's caching policy and ability to overcommit
|
|
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
|
|
+
|
|
+Quick start
|
|
+===========
|
|
+Build the kernel with the following configurations.
|
|
+
|
|
+* ``CONFIG_LRU_GEN=y``
|
|
+* ``CONFIG_LRU_GEN_ENABLED=y``
|
|
+
|
|
+All set!
|
|
+
|
|
+Runtime options
|
|
+===============
|
|
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
|
|
+following subsections.
|
|
+
|
|
+Kill switch
|
|
+-----------
|
|
+``enabled`` accepts different values to enable or disable the
|
|
+following components. Its default value depends on
|
|
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
|
|
+unless some of them have unforeseen side effects. Writing to
|
|
+``enabled`` has no effect when a component is not supported by the
|
|
+hardware, and valid values will be accepted even when the main switch
|
|
+is off.
|
|
+
|
|
+====== ===============================================================
|
|
+Values Components
|
|
+====== ===============================================================
|
|
+0x0001 The main switch for the multi-gen LRU.
|
|
+0x0002 Clearing the accessed bit in leaf page table entries in large
|
|
+ batches, when MMU sets it (e.g., on x86). This behavior can
|
|
+ theoretically worsen lock contention (mmap_lock). If it is
|
|
+ disabled, the multi-gen LRU will suffer a minor performance
|
|
+ degradation for workloads that contiguously map hot pages,
|
|
+ whose accessed bits can be otherwise cleared by fewer larger
|
|
+ batches.
|
|
+0x0004 Clearing the accessed bit in non-leaf page table entries as
|
|
+ well, when MMU sets it (e.g., on x86). This behavior was not
|
|
+ verified on x86 varieties other than Intel and AMD. If it is
|
|
+ disabled, the multi-gen LRU will suffer a negligible
|
|
+ performance degradation.
|
|
+[yYnN] Apply to all the components above.
|
|
+====== ===============================================================
|
|
+
|
|
+E.g.,
|
|
+::
|
|
+
|
|
+ echo y >/sys/kernel/mm/lru_gen/enabled
|
|
+ cat /sys/kernel/mm/lru_gen/enabled
|
|
+ 0x0007
|
|
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
|
|
+ cat /sys/kernel/mm/lru_gen/enabled
|
|
+ 0x0005
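The bitmask reported by ``enabled`` can also be decoded programmatically. A minimal userspace sketch (illustrative only, not part of this patch) that maps each bit to the components in the table above:

/* Illustrative only: decode /sys/kernel/mm/lru_gen/enabled. */
#include <stdio.h>

int main(void)
{
        unsigned int caps = 0;
        FILE *f = fopen("/sys/kernel/mm/lru_gen/enabled", "r");

        if (!f || fscanf(f, "%x", &caps) != 1) {
                perror("lru_gen/enabled");
                return 1;
        }
        fclose(f);

        printf("main switch:               %s\n", caps & 0x0001 ? "on" : "off");
        printf("batched leaf PTE clearing: %s\n", caps & 0x0002 ? "on" : "off");
        printf("non-leaf PTE clearing:     %s\n", caps & 0x0004 ? "on" : "off");
        return 0;
}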
|
|
+
|
|
+Thrashing prevention
|
|
+--------------------
|
|
+Personal computers are more sensitive to thrashing because it can
|
|
+cause janks (lags when rendering UI) and negatively impact user
|
|
+experience. The multi-gen LRU offers thrashing prevention to the
|
|
+majority of laptop and desktop users who do not have ``oomd``.
|
|
+
|
|
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
|
|
+``N`` milliseconds from getting evicted. The OOM killer is triggered
|
|
+if this working set cannot be kept in memory. In other words, this
|
|
+option works as an adjustable pressure relief valve, and when open, it
|
|
+terminates applications that are hopefully not being used.
|
|
+
|
|
+Based on the average human-detectable lag (~100ms), ``N=1000`` usually

|
|
+eliminates intolerable janks due to thrashing. Larger values like
|
|
+``N=3000`` make janks less noticeable at the risk of premature OOM
|
|
+kills.
|
|
+
|
|
+The default value ``0`` means disabled.
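A minimal userspace sketch (illustrative only, not part of this patch), equivalent to ``echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms``, that protects roughly the last second of the working set:

/* Illustrative only: keep the working set of the last 1000 ms resident. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *val = "1000\n";
        int fd = open("/sys/kernel/mm/lru_gen/min_ttl_ms", O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0) {
                perror("min_ttl_ms");
                return 1;
        }
        close(fd);
        return 0;
}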
|
|
+
|
|
+Experimental features
|
|
+=====================
|
|
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
|
|
+following subsections. Multiple command lines are supported, so does
|
|
+concatenation with delimiters ``,`` and ``;``.
|
|
+
|
|
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
|
|
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
|
|
+evicted generations in this file.
|
|
+
|
|
+Working set estimation
|
|
+----------------------
|
|
+Working set estimation measures how much memory an application needs
|
|
+in a given time interval, and it is usually done with little impact on
|
|
+the performance of the application. E.g., data centers want to
|
|
+optimize job scheduling (bin packing) to improve memory utilization.
|
|
+When a new job comes in, the job scheduler needs to find out whether
|
|
+each server it manages can allocate a certain amount of memory for
|
|
+this new job before it can pick a candidate. To do so, the job
|
|
+scheduler needs to estimate the working sets of the existing jobs.
|
|
+
|
|
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
|
|
+accessed over different time intervals for each memcg and node.
|
|
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
|
|
+histograms are noncumulative.
|
|
+::
|
|
+
|
|
+ memcg memcg_id memcg_path
|
|
+ node node_id
|
|
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
|
|
+ ...
|
|
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
|
|
+
|
|
+Each bin contains an estimated number of pages that have been accessed
|
|
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
|
|
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
|
|
+the former is the largest and that of the latter is the smallest.
|
|
+
|
|
+Users can write ``+ memcg_id node_id max_gen_nr
|
|
+[can_swap [force_scan]]`` to ``lru_gen`` to create a new generation
|
|
+``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
|
|
+is set to ``1``, it forces the scan of anon pages when swap is off,
|
|
+and vice versa. ``force_scan`` defaults to ``1`` and, if it is set to
|
|
+``0``, it employs heuristics to reduce the overhead, which is likely
|
|
+to reduce the coverage as well.
|
|
+
|
|
+A typical use case is that a job scheduler writes to ``lru_gen`` at a
|
|
+certain time interval to create new generations, and it ranks the
|
|
+servers it manages based on the sizes of their cold pages defined by
|
|
+this time interval.
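A minimal sketch of the scheduler side (illustrative only, not part of this patch; the memcg ID, node ID and generation number are placeholders) that creates a new generation with the ``+`` command described above:

/*
 * Illustrative only: age memcg 1 on node 0, creating generation
 * max_gen_nr+1.  In practice max_gen_nr comes from parsing a prior read
 * of /sys/kernel/debug/lru_gen; 7 is only a placeholder.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* "+ memcg_id node_id max_gen_nr [can_swap [force_scan]]" */
        const char *cmd = "+ 1 0 7 1 1\n";
        int fd = open("/sys/kernel/debug/lru_gen", O_WRONLY);

        if (fd < 0 || write(fd, cmd, strlen(cmd)) < 0) {
                perror("lru_gen");  /* e.g. -EINVAL if max_gen_nr is stale */
                return 1;
        }
        close(fd);
        return 0;
}

The scheduler would then re-read ``lru_gen`` and rank its servers by the sizes reported for the oldest generations.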
|
|
+
|
|
+Proactive reclaim
|
|
+-----------------
|
|
+Proactive reclaim induces page reclaim when there is no memory
|
|
+pressure. It usually targets cold pages only. E.g., when a new job
|
|
+comes in, the job scheduler wants to proactively reclaim cold pages on
|
|
+the server it selected to improve the chance of successfully landing
|
|
+this new job.
|
|
+
|
|
+Users can write ``- memcg_id node_id min_gen_nr [swappiness
|
|
+[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
|
|
+equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
|
|
+``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
|
|
+aged and therefore cannot be evicted. ``swappiness`` overrides the
|
|
+default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
|
|
+the number of pages to evict.
|
|
+
|
|
+A typical use case is that a job scheduler writes to ``lru_gen``
|
|
+before it tries to land a new job on a server. If it fails to
|
|
+materialize enough cold pages because of the overestimation, it
|
|
+retries on the next server according to the ranking result obtained
|
|
+from the working set estimation step. This less forceful approach
|
|
+limits the impacts on the existing jobs.
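A minimal sketch of the eviction side (illustrative only, not part of this patch; the IDs, generation number, swappiness and page count are placeholders) using the ``-`` command described above:

/*
 * Illustrative only: evict up to 1024 pages from generations <= 4 of
 * memcg 1 on node 0, with swappiness 50.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* "- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]" */
        const char *cmd = "- 1 0 4 50 1024\n";
        int fd = open("/sys/kernel/debug/lru_gen", O_WRONLY);

        if (fd < 0 || write(fd, cmd, strlen(cmd)) < 0) {
                perror("lru_gen");  /* -EINVAL if min_gen_nr is not old enough */
                return 1;
        }
        close(fd);
        return 0;
}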
|
|
diff --git a/mm/Kconfig b/mm/Kconfig
|
|
index 0c2ef0af0036..a0f7b6e66410 100644
|
|
--- a/mm/Kconfig
|
|
+++ b/mm/Kconfig
|
|
@@ -1137,7 +1137,8 @@ config LRU_GEN
|
|
# make sure folio->flags has enough spare bits
|
|
depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
|
|
help
|
|
- A high performance LRU implementation to overcommit memory.
|
|
+ A high performance LRU implementation to overcommit memory. See
|
|
+ Documentation/admin-guide/mm/multigen_lru.rst for details.
|
|
|
|
config LRU_GEN_ENABLED
|
|
bool "Enable by default"
|
|
diff --git a/mm/vmscan.c b/mm/vmscan.c
|
|
index fbcd298adca7..7096ff7836db 100644
|
|
--- a/mm/vmscan.c
|
|
+++ b/mm/vmscan.c
|
|
@@ -5209,6 +5209,7 @@ static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, c
|
|
return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
|
|
}
|
|
|
|
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
|
|
static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
|
|
const char *buf, size_t len)
|
|
{
|
|
@@ -5242,6 +5243,7 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
|
|
return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
|
|
}
|
|
|
|
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
|
|
static ssize_t store_enabled(struct kobject *kobj, struct kobj_attribute *attr,
|
|
const char *buf, size_t len)
|
|
{
|
|
@@ -5389,6 +5391,7 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
|
|
seq_putc(m, '\n');
|
|
}
|
|
|
|
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
|
|
static int lru_gen_seq_show(struct seq_file *m, void *v)
|
|
{
|
|
unsigned long seq;
|
|
@@ -5547,6 +5550,7 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
|
|
return err;
|
|
}
|
|
|
|
+/* see Documentation/admin-guide/mm/multigen_lru.rst for details */
|
|
static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
|
|
size_t len, loff_t *pos)
|
|
{
|
|
|
|
From patchwork Wed Jul 6 22:00:23 2022
|
|
Content-Type: text/plain; charset="utf-8"
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 8bit
|
|
X-Patchwork-Submitter: Yu Zhao <yuzhao@google.com>
|
|
X-Patchwork-Id: 12908743
|
|
Return-Path:
|
|
<linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
|
|
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
|
|
aws-us-west-2-korg-lkml-1.web.codeaurora.org
|
|
Received: from bombadil.infradead.org (bombadil.infradead.org
|
|
[198.137.202.133])
|
|
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
|
|
(No client certificate requested)
|
|
by smtp.lore.kernel.org (Postfix) with ESMTPS id F1049CCA480
|
|
for <linux-arm-kernel@archiver.kernel.org>;
|
|
Wed, 6 Jul 2022 22:24:05 +0000 (UTC)
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=lists.infradead.org; s=bombadil.20210309; h=Sender:
|
|
Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
|
|
List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References:
|
|
Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID:
|
|
Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
|
|
:Resent-Message-ID:List-Owner;
|
|
bh=Xgk3+Of0YIMWf8O9JMQ9/XvK0/M+Y4+Fo7V1nirjGWk=; b=Ikczo9d8yKpAV17EBvGgifkzCg
|
|
XdnREf27M/HCIddMXnZhXRO8ZOnhXihDFD2xQXOykZluf1K2TRg+frxOZISO9Btjbntdm1Z5sTT+P
|
|
OYv5lp/iPndBxEr3bCycW0syk4Ua3Ehvy4tervFebak85EKj2hyyitkRzF7F7CXrEj3XugynJ8EDZ
|
|
S6AQxF9Z58ceg+HaoKkw4eBo5xOp4wB+WaVtuFnjppN98e93jFaZqWD5Au6aHpMr9UkzdVPTWIPtD
|
|
zwYBpvNFPXbsQyhROrfIjRG3cKKlCRkTja1vD02USxJCWU762nNv/GYl0Az04P48qSRAUA74Et57P
|
|
9dVDSF6g==;
|
|
Received: from localhost ([::1] helo=bombadil.infradead.org)
|
|
by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPs-00Ca1j-DN; Wed, 06 Jul 2022 22:22:40 +0000
|
|
Received: from casper.infradead.org ([2001:8b0:10b:1236::1])
|
|
by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9DPq-00CZxO-FA
|
|
for linux-arm-kernel@bombadil.infradead.org; Wed, 06 Jul 2022 22:22:38 +0000
|
|
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
|
|
d=infradead.org; s=casper.20170209; h=Content-Transfer-Encoding:Content-Type:
|
|
Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date:Sender
|
|
:Reply-To:Content-ID:Content-Description;
|
|
bh=Y7M5+uMCyjK2Tw3gtvlFnf3s0uMKtiqOOKU+iupOzGc=; b=eNxhaQnsSLRlm7xcZWf4vRDO1L
|
|
eg0kAREJse9rScOQRPFAJ7NMrXz7IlQKwOeQiB7VT3p4J3bF9y9rqPv/kCWD78aSn9pj1qjD0/BOi
|
|
f8/D2cdI/noxqLqYNlxZd9MrEwAOHz/BUhX8oySPF36Iqx0OElYC+q2YFFb6st+Cbk6+yoKU3s/P0
|
|
yer4vAYtoZWXt6/xSmm2cClxZTBOVHoQd5XrrcfIJ0uAG08kyLrEhFinhs9wyBv4aceWYG+rMfXat
|
|
hch+qaompy9Nop2MGqHfkzvNgqzTwBU87F20TW6h25klabtjJ4H1MG2czWzkZ6h4Gwva2tyJuViYM
|
|
uXiB/lSA==;
|
|
Received: from mail-yb1-xb4a.google.com ([2607:f8b0:4864:20::b4a])
|
|
by casper.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
|
|
id 1o9D59-0021Vr-JL
|
|
for linux-arm-kernel@lists.infradead.org; Wed, 06 Jul 2022 22:01:20 +0000
|
|
Received: by mail-yb1-xb4a.google.com with SMTP id
|
|
o200-20020a2541d1000000b0066ebb148de6so26443yba.15
|
|
for <linux-arm-kernel@lists.infradead.org>;
|
|
Wed, 06 Jul 2022 15:01:14 -0700 (PDT)
|
|
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=google.com; s=20210112;
|
|
h=date:in-reply-to:message-id:mime-version:references:subject:from:to
|
|
:cc:content-transfer-encoding;
|
|
bh=Y7M5+uMCyjK2Tw3gtvlFnf3s0uMKtiqOOKU+iupOzGc=;
|
|
b=RaJYVCw6kQFWZr57Fj6Z+M7CjIu+Fy2mkXaD9icGpAKOAxyz1uufDA95qkMfXqksCy
|
|
CttyIsR4+X5trkDvd0W5HTI3/XFLKoLEsiRSAv23qebNkIOkH8cPlNd2JsU/+DVzJUpM
|
|
TGOZ6teMB/sFPIH8IZKMODnpg+VxKIyScGqlsqOiDoxcPPCMP8e0zolM240kI1HmhYsj
|
|
WxZdSDL+OZnX2V8pTDz516/mmCsEM23W0x65TiLdKDGOIFAAkNP/EIcvQWWj8SBUz/dL
|
|
a0IGdBEhZobBNts8S/4QPXOFk1zc9TBNhY+OPo4y5YJG3duUWWVQ+373DmVdZPluRI23
|
|
DgVQ==
|
|
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
|
|
d=1e100.net; s=20210112;
|
|
h=x-gm-message-state:date:in-reply-to:message-id:mime-version
|
|
:references:subject:from:to:cc:content-transfer-encoding;
|
|
bh=Y7M5+uMCyjK2Tw3gtvlFnf3s0uMKtiqOOKU+iupOzGc=;
|
|
b=JM79P1Doc4hsUfPvsK/e/hKehYZK+ZXOr/USvLx8kdfIe82+N6ZoRSL5utdrQjK5m4
|
|
ZoMbDc0ww7uCAzo0Vn+1O0QBVzRo9AONyV8tJ8vLhMdAvKcn+7dsoNKH1+tC7My9PnAo
|
|
oo7vfTaO2YFfl+7aP9tylcSSkpMuurh5UQlsuaG683jw/YbSlspA1PDjof6ePZLtBpCS
|
|
7rnYkfpeSh7NpyV7uix/+PmKRWVTYkP5ESZMxgrITfZN0ZmzY9T+qfKe2IRll26QwOta
|
|
GfFb3h1ENFZiTuNk7YBU5RX3m4Vk/WfbtFVaRLTNo3BSk3/1aqr0h22KlVhKwdqxJhnS
|
|
YpAQ==
|
|
X-Gm-Message-State: AJIora90QDswnLliNc4CJI3HSKU/Z12qwP0gefR10TL2nkFmFawy/Qx/
|
|
rIQ75H3tAXOupF733QrK6bmUWVh/Glw=
|
|
X-Google-Smtp-Source:
|
|
AGRyM1tPyG6w7lg37p0dKVbMplDSUgwZboH2lG42opEnpdXZgbjOhtWD7cZCMHKO+sLemtrKnNTphNyTinE=
|
|
X-Received: from yuzhao.bld.corp.google.com
|
|
([2620:15c:183:200:b89c:e10a:466e:cf7d])
|
|
(user=yuzhao job=sendgmr) by 2002:a25:b806:0:b0:663:d35d:8b8a with SMTP id
|
|
v6-20020a25b806000000b00663d35d8b8amr45647399ybj.69.1657144872662; Wed, 06
|
|
Jul 2022 15:01:12 -0700 (PDT)
|
|
Date: Wed, 6 Jul 2022 16:00:23 -0600
|
|
In-Reply-To: <20220706220022.968789-1-yuzhao@google.com>
|
|
Message-Id: <20220706220022.968789-15-yuzhao@google.com>
|
|
Mime-Version: 1.0
|
|
References: <20220706220022.968789-1-yuzhao@google.com>
|
|
X-Mailer: git-send-email 2.37.0.rc0.161.g10f37bed90-goog
|
|
Subject: [PATCH v13 14/14] mm: multi-gen LRU: design doc
|
|
From: Yu Zhao <yuzhao@google.com>
|
|
To: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: Andi Kleen <ak@linux.intel.com>,
|
|
Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
|
|
Catalin Marinas <catalin.marinas@arm.com>,
|
|
Dave Hansen <dave.hansen@linux.intel.com>, Hillf Danton <hdanton@sina.com>,
|
|
Jens Axboe <axboe@kernel.dk>, Johannes Weiner <hannes@cmpxchg.org>,
|
|
Jonathan Corbet <corbet@lwn.net>,
|
|
Linus Torvalds <torvalds@linux-foundation.org>,
|
|
Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
|
|
Michael Larabel <Michael@michaellarabel.com>,
|
|
Michal Hocko <mhocko@kernel.org>, Mike Rapoport <rppt@kernel.org>,
|
|
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
|
|
Vlastimil Babka <vbabka@suse.cz>, Will Deacon <will@kernel.org>,
|
|
linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
|
|
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
|
|
page-reclaim@google.com, Yu Zhao <yuzhao@google.com>,
|
|
Brian Geffon <bgeffon@google.com>,
|
|
Jan Alexander Steffens <heftig@archlinux.org>,
|
|
Oleksandr Natalenko <oleksandr@natalenko.name>,
|
|
Steven Barrett <steven@liquorix.net>,
|
|
Suleiman Souhlal <suleiman@google.com>, Daniel Byrne <djbyrne@mtu.edu>,
|
|
Donald Carr <d@chaos-reins.com>,
|
|
" =?utf-8?q?Holger_Hoffst=C3=A4tte?= " <holger@applied-asynchrony.com>,
|
|
Konstantin Kharlamov <Hi-Angel@yandex.ru>,
|
|
Shuang Zhai <szhai2@cs.rochester.edu>, Sofia Trinh <sofia.trinh@edi.works>,
|
|
Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3
|
|
X-CRM114-CacheID: sfid-20220706_230115_703977_08753111
|
|
X-CRM114-Status: GOOD ( 17.78 )
|
|
X-BeenThere: linux-arm-kernel@lists.infradead.org
|
|
X-Mailman-Version: 2.1.34
|
|
Precedence: list
|
|
List-Id: <linux-arm-kernel.lists.infradead.org>
|
|
List-Unsubscribe:
|
|
<http://lists.infradead.org/mailman/options/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
|
|
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
|
|
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
|
|
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
|
|
List-Subscribe:
|
|
<http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
|
|
<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
|
|
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
|
|
Errors-To:
|
|
linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
|
|
|
|
Add a design doc.
|
|
|
|
Signed-off-by: Yu Zhao <yuzhao@google.com>
|
|
Acked-by: Brian Geffon <bgeffon@google.com>
|
|
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
|
|
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
|
|
Acked-by: Steven Barrett <steven@liquorix.net>
|
|
Acked-by: Suleiman Souhlal <suleiman@google.com>
|
|
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
|
|
Tested-by: Donald Carr <d@chaos-reins.com>
|
|
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
|
|
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
|
|
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
|
|
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
|
|
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
|
|
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
|
|
---
|
|
Documentation/vm/index.rst | 1 +
|
|
Documentation/vm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++
|
|
2 files changed, 160 insertions(+)
|
|
create mode 100644 Documentation/vm/multigen_lru.rst
|
|
|
|
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
|
|
index 575ccd40e30c..4aa12b8be278 100644
|
|
--- a/Documentation/vm/index.rst
|
|
+++ b/Documentation/vm/index.rst
|
|
@@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose.
|
|
ksm
|
|
memory-model
|
|
mmu_notifier
|
|
+ multigen_lru
|
|
numa
|
|
overcommit-accounting
|
|
page_migration
|
|
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
|
|
new file mode 100644
|
|
index 000000000000..d7062c6a8946
|
|
--- /dev/null
|
|
+++ b/Documentation/vm/multigen_lru.rst
|
|
@@ -0,0 +1,159 @@
|
|
+.. SPDX-License-Identifier: GPL-2.0
|
|
+
|
|
+=============
|
|
+Multi-Gen LRU
|
|
+=============
|
|
+The multi-gen LRU is an alternative LRU implementation that optimizes
|
|
+page reclaim and improves performance under memory pressure. Page
|
|
+reclaim decides the kernel's caching policy and ability to overcommit
|
|
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
|
|
+
|
|
+Design overview
|
|
+===============
|
|
+Objectives
|
|
+----------
|
|
+The design objectives are:
|
|
+
|
|
+* Good representation of access recency
|
|
+* Try to profit from spatial locality
|
|
+* Fast paths to make obvious choices
|
|
+* Simple self-correcting heuristics
|
|
+
|
|
+The representation of access recency is at the core of all LRU
+implementations. In the multi-gen LRU, each generation represents a
+group of pages with similar access recency. Generations establish a
+(time-based) common frame of reference and therefore help make better
+choices, e.g., between different memcgs on a computer or different
+computers in a data center (for job scheduling).
+
+Exploiting spatial locality improves efficiency when gathering the
+accessed bit. An rmap walk targets a single page and does not try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but the address space can be too
+sparse to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+do not require TLB flushes; clean pages do not require writeback.
+These facts are only helpful when other conditions, e.g., access
+recency, are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is necessary.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation can be
+categorized based on additional factors, and a feedback loop can
+statistically compare the refault percentages across those categories
+and infer which of them are better choices.
+
+Assumptions
+-----------
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+   channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+   flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+   applications usually do not prepare themselves for major page
+   faults like they do for blocked I/O. E.g., GUI applications
+   commonly use dedicated I/O threads to avoid blocking rendering
+   threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean file
+pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
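+For illustration only, this bookkeeping can be pictured with the
+following standalone sketch; ``gen_seq_sketch`` and ``ANON_AND_FILE``
+are names made up for this document and do not reflect the actual
+kernel data structures::
+
+   #include <stdio.h>
+
+   #define ANON_AND_FILE 2
+
+   /* Conceptual only: the three monotonically increasing counters. */
+   struct gen_seq_sketch {
+           unsigned long max_seq;                /* youngest, shared */
+           unsigned long min_seq[ANON_AND_FILE]; /* oldest, per type */
+   };
+
+   int main(void)
+   {
+           struct gen_seq_sketch s = { .max_seq = 8, .min_seq = { 5, 6 } };
+           int type;
+
+           for (type = 0; type < ANON_AND_FILE; type++)
+                   printf("type %d spans %lu generations\n",
+                          type, s.max_seq - s.min_seq[type] + 1);
+           return 0;
+   }
+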
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores a value
+within ``[1, MAX_NR_GENS]`` while a page is on one of
+``lrugen->lists[]``; otherwise it stores zero.
+
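+The arithmetic above can be exercised with a small userspace program;
+``gen_index_from_seq()`` is a name made up for this sketch, and four
+generations are assumed, so the counter needs ``order_base_2(4+1)``,
+i.e., three bits::
+
+   #include <stdio.h>
+
+   #define MAX_NR_GENS 4
+
+   /* Truncate a monotonically increasing seq to a lists[] index. */
+   static unsigned int gen_index_from_seq(unsigned long seq)
+   {
+           return seq % MAX_NR_GENS;
+   }
+
+   int main(void)
+   {
+           unsigned long seq;
+
+           for (seq = 20; seq < 26; seq++) {
+                   unsigned int gen = gen_index_from_seq(seq);
+                   /* the gen counter while the page is on a list */
+                   unsigned int ctr = gen + 1;
+
+                   printf("seq %lu -> index %u, counter %u\n",
+                          seq, gen, ctr);
+           }
+           return 0;
+   }
+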
+Each generation is divided into multiple tiers. A page accessed ``N``
+times through file descriptors is in tier ``order_base_2(N)``. Unlike
+generations, tiers do not have dedicated ``lrugen->lists[]``. In
+contrast to moving across generations, which requires the LRU lock,
+moving across tiers only involves atomic operations on
+``folio->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refaults over all the tiers
+from anon and file types and decides which tiers from which types to
+evict or protect.
+
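+A standalone illustration of the tier arithmetic follows; it is not
+kernel code, ``tier_from_refs()`` is a made-up name, and
+``order_base_2()`` is approximated with a portable loop::
+
+   #include <stdio.h>
+
+   /* ceil(log2(n)) for n >= 1, i.e., order_base_2(n) */
+   static unsigned int tier_from_refs(unsigned long n)
+   {
+           unsigned int tier = 0;
+
+           while ((1UL << tier) < n)
+                   tier++;
+           return tier;
+   }
+
+   int main(void)
+   {
+           unsigned long n;
+
+           /* 1 access -> tier 0, 2 -> 1, 3..4 -> 2, 5..8 -> 3 */
+           for (n = 1; n <= 8; n++)
+                   printf("%lu accesses -> tier %u\n",
+                          n, tier_from_refs(n));
+           return 0;
+   }
+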
+There are two conceptually independent procedures: the aging and the
+eviction. They form a closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs, and after each iteration, it increments ``max_seq``. For
+the latter, when the eviction walks the rmap and finds a young PTE,
+the aging scans the adjacent PTEs. For both, on finding a young PTE,
+the aging clears the accessed bit and updates the gen counter of the
+page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
+
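+A rough sketch of the arithmetic above (not kernel code; the values
+``MIN_NR_GENS=2`` and ``MAX_NR_GENS=4`` are assumed here, and the
+``<=`` test is only one plausible reading of "approaches")::
+
+   #include <stdio.h>
+
+   #define MIN_NR_GENS 2
+   #define MAX_NR_GENS 4
+
+   int main(void)
+   {
+           unsigned long max_seq = 7, min_seq = 6; /* one type's min_seq */
+
+           /* Too few generations left: the aging creates a younger one. */
+           if (max_seq - min_seq + 1 <= MIN_NR_GENS)
+                   max_seq++;
+
+           /* A page found young through page tables gets the youngest
+            * generation's counter value, per the text above. */
+           printf("promoted gen counter: %lu\n",
+                  (max_seq % MAX_NR_GENS) + 1);
+           return 0;
+   }
+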
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If both types are equally old, it selects the one whose first tier has
+a lower refault percentage. The first tier contains single-use
+unmapped clean pages, which are the best bet. The eviction sorts a
+page according to its gen counter if the aging has found this page
+accessed through page tables and updated its gen counter. It also
+moves a page to the next generation, i.e., ``min_seq+1``, if this page
+was accessed multiple times through file descriptors and the feedback
+loop has detected outlying refaults from the tier this page is in. To
+this end, the feedback loop uses the first tier as the baseline, for
+the reason stated earlier.
+
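+The type selection can be sketched as follows; this is not kernel
+code, and the names and the use of a plain percentage are made up for
+illustration::
+
+   #include <stdio.h>
+
+   enum { TYPE_ANON, TYPE_FILE, NR_TYPES };
+
+   /* The older type (smaller min_seq) is evicted first; on a tie, the
+    * type whose first tier refaults less is the safer choice. */
+   static int select_type(const unsigned long min_seq[NR_TYPES],
+                          const double tier0_refault_pct[NR_TYPES])
+   {
+           if (min_seq[TYPE_ANON] != min_seq[TYPE_FILE])
+                   return min_seq[TYPE_ANON] < min_seq[TYPE_FILE] ?
+                          TYPE_ANON : TYPE_FILE;
+           return tier0_refault_pct[TYPE_ANON] <=
+                  tier0_refault_pct[TYPE_FILE] ? TYPE_ANON : TYPE_FILE;
+   }
+
+   int main(void)
+   {
+           unsigned long min_seq[NR_TYPES] = { 5, 5 };
+           double tier0_refault_pct[NR_TYPES] = { 12.0, 3.5 };
+
+           printf("evict from: %s\n",
+                  select_type(min_seq, tier0_refault_pct) == TYPE_ANON ?
+                  "anon" : "file");
+           return 0;
+   }
+
+With both types equally old in this example, the file type refaults
+less in its first tier, so it is the one evicted from.
+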
+Summary
+-------
+The multi-gen LRU can be disassembled into the following parts:
+
+* Generations
+* Rmap walks
+* Page table walks
+* Bloom filters
+* PID controller
+
+The aging and the eviction form a producer-consumer model;
+specifically, the latter drives the former by the sliding window over
+generations. Within the aging, rmap walks drive page table walks by
+inserting hot, densely populated page tables into the Bloom filters.
+Within the eviction, the PID controller uses refaults as the feedback
+to select types to evict and tiers to protect.
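+
+As a closing illustration, the refault feedback can be reduced to a
+proportional-only comparison against the first tier; this is a
+deliberate oversimplification, and ``tier_is_outlier()`` and
+``struct tier_stats`` are made-up names, not the in-kernel PID
+controller::
+
+   #include <stdio.h>
+
+   struct tier_stats {
+           unsigned long refaulted; /* refaults observed in this tier */
+           unsigned long evicted;   /* pages evicted from this tier   */
+   };
+
+   /* A tier that refaults noticeably more than the baseline (the first
+    * tier) deserves protection rather than eviction. */
+   static int tier_is_outlier(const struct tier_stats *base,
+                              const struct tier_stats *tier)
+   {
+           /* cross-multiplied form of comparing the two refault ratios */
+           return tier->refaulted * base->evicted >
+                  base->refaulted * tier->evicted;
+   }
+
+   int main(void)
+   {
+           struct tier_stats tier0 = { .refaulted = 10, .evicted = 1000 };
+           struct tier_stats tier2 = { .refaulted = 90, .evicted = 1000 };
+
+           printf("protect tier 2: %s\n",
+                  tier_is_outlier(&tier0, &tier2) ? "yes" : "no");
+           return 0;
+   }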