* [RFC PATCH 00/29] arm64: Scalable Vector Extension core support
@ 2016-11-25 19:39 Dave Martin
2016-11-25 19:41 ` [RFC PATCH 16/29] arm64/sve: signal: Add SVE state record to sigcontext Dave Martin
` (6 more replies)
0 siblings, 7 replies; 30+ messages in thread
From: Dave Martin @ 2016-11-25 19:39 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Christoffer Dall, Florian Weimer, Ard Biesheuvel, Marc Zyngier,
Alan Hayward, libc-alpha, gdb
The Scalable Vector Extension (SVE) [1] is an extension to AArch64 which
adds extra SIMD functionality and supports much larger vectors.
This series implements core Linux support for SVE.
Recipients not copied on the whole series can find the rest of the
patches in the linux-arm-kernel archives [2].
The first 5 patches "arm64: signal: ..." factor out the allocation and
placement of state information in the signal frame. The first three
are prerequisites for the SVE support patches.
Patches 04-05 implement expansion of the signal frame, and may remain
controversial due to ABI break issues:
* Discussion is needed on how userspace should detect/negotiate signal
frame size in order for this expansion mechanism to be workable.
The remaining patches implement initial SVE support for Linux, with the
following limitations:
* No KVM/virtualisation support for guests.
* No independent SVE vector length configuration per thread. This is
planned, but will follow as a separate add-on series.
* As a temporary workaround for the signal frame size issue, vector
length is software-limited to 512 bits (see patch 29), with a
build-time kernel configuration option to relax this.
Discussion is needed on how to smoothly address the signal ABI issues
so that this workaround can be removed.
* A fair number of development BUG_ON()s are still present, which
will be demoted or removed for merge.
* There is a context-switch race condition lurking somewhere which
fires in certain situations with my development KVM hacks (not part
of this posting) -- the underlying bug might or might not be in this
series.
Review and comments welcome.
Cheers
---Dave
[1] https://community.arm.com/groups/processors/blog/2016/08/22/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/thread.html
Alan Hayward (1):
arm64/sve: ptrace support
Dave Martin (28):
arm64: signal: Refactor sigcontext parsing in rt_sigreturn
arm64: signal: factor frame layout and population into separate passes
arm64: signal: factor out signal frame record allocation
arm64: signal: Allocate extra sigcontext space as needed
arm64: signal: Parse extra_context during sigreturn
arm64: efi: Add missing Kconfig dependency on KERNEL_MODE_NEON
arm64/sve: Allow kernel-mode NEON to be disabled in Kconfig
arm64/sve: Low-level save/restore code
arm64/sve: Boot-time feature detection and reporting
arm64/sve: Boot-time feature enablement
arm64/sve: Expand task_struct for Scalable Vector Extension state
arm64/sve: Save/restore SVE state on context switch paths
arm64/sve: Basic support for KERNEL_MODE_NEON
Revert "arm64/sve: Allow kernel-mode NEON to be disabled in Kconfig"
arm64/sve: Restore working FPSIMD save/restore around signals
arm64/sve: signal: Add SVE state record to sigcontext
arm64/sve: signal: Dump Scalable Vector Extension registers to user
stack
arm64/sve: signal: Restore FPSIMD/SVE state in rt_sigreturn
arm64/sve: Avoid corruption when replacing the SVE state
arm64/sve: traps: Add descriptive string for SVE exceptions
arm64/sve: Enable SVE on demand for userspace
arm64/sve: Implement FPSIMD-only context for tasks not using SVE
arm64/sve: Move ZEN handling to the common task_fpsimd_load() path
arm64/sve: Discard SVE state on system call
arm64/sve: Avoid preempt_disable() during sigreturn
arm64/sve: Avoid stale user register state after SVE access exception
arm64: KVM: Treat SVE use by guests as undefined instruction execution
arm64/sve: Limit vector length to 512 bits by default
arch/arm64/Kconfig | 48 +++
arch/arm64/include/asm/esr.h | 3 +-
arch/arm64/include/asm/fpsimd.h | 37 +++
arch/arm64/include/asm/fpsimdmacros.h | 145 +++++++++
arch/arm64/include/asm/kvm_arm.h | 1 +
arch/arm64/include/asm/sysreg.h | 11 +
arch/arm64/include/asm/thread_info.h | 2 +
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/ptrace.h | 125 ++++++++
arch/arm64/include/uapi/asm/sigcontext.h | 117 ++++++++
arch/arm64/kernel/cpufeature.c | 3 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-fpsimd.S | 17 ++
arch/arm64/kernel/entry.S | 18 +-
arch/arm64/kernel/fpsimd.c | 301 ++++++++++++++++++-
arch/arm64/kernel/head.S | 16 +-
arch/arm64/kernel/process.c | 2 +-
arch/arm64/kernel/ptrace.c | 254 +++++++++++++++-
arch/arm64/kernel/setup.c | 3 +
arch/arm64/kernel/signal.c | 497 +++++++++++++++++++++++++++++--
arch/arm64/kernel/signal32.c | 2 +-
arch/arm64/kernel/traps.c | 1 +
arch/arm64/kvm/handle_exit.c | 9 +
arch/arm64/mm/proc.S | 27 +-
include/uapi/linux/elf.h | 1 +
25 files changed, 1583 insertions(+), 59 deletions(-)
--
2.1.4
^ permalink raw reply [flat|nested] 30+ messages in thread* [RFC PATCH 16/29] arm64/sve: signal: Add SVE state record to sigcontext 2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin @ 2016-11-25 19:41 ` Dave Martin 2016-11-25 19:41 ` [RFC PATCH 24/29] arm64/sve: Discard SVE state on system call Dave Martin ` (5 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-11-25 19:41 UTC (permalink / raw) To: linux-arm-kernel; +Cc: Florian Weimer, libc-alpha, gdb This patch adds a record to sigcontext that will contain the SVE state. Subsequent patches will implement the actual register dumping. Signed-off-by: Dave Martin <Dave.Martin@arm.com> --- arch/arm64/include/uapi/asm/sigcontext.h | 86 ++++++++++++++++++++++++++++++++ arch/arm64/kernel/signal.c | 62 +++++++++++++++++++++++ 2 files changed, 148 insertions(+) diff --git a/arch/arm64/include/uapi/asm/sigcontext.h b/arch/arm64/include/uapi/asm/sigcontext.h index 1af8437..11c915d 100644 --- a/arch/arm64/include/uapi/asm/sigcontext.h +++ b/arch/arm64/include/uapi/asm/sigcontext.h @@ -88,4 +88,90 @@ struct extra_context { __u32 size; /* size in bytes of the extra space */ }; +#define SVE_MAGIC 0x53564501 + +struct sve_context { + struct _aarch64_ctx head; + __u16 vl; + __u16 __reserved[3]; +}; + +/* + * The SVE architecture leaves space for future expansion of the + * vector length beyond its initial architectural limit of 2048 bits + * (16 quadwords). + */ +#define SVE_VQ_MIN 1 +#define SVE_VQ_MAX 0x200 + +#define SVE_VL_MIN (SVE_VQ_MIN * 0x10) +#define SVE_VL_MAX (SVE_VQ_MAX * 0x10) + +#define SVE_NUM_ZREGS 32 +#define SVE_NUM_PREGS 16 + +#define sve_vl_valid(vl) \ + ((vl) % 0x10 == 0 && (vl) >= SVE_VL_MIN && (vl) <= SVE_VL_MAX) +#define sve_vq_from_vl(vl) ((vl) / 0x10) + +/* + * The total size of meaningful data in the SVE context in bytes, + * including the header, is given by SVE_SIG_CONTEXT_SIZE(vq). 
+ * + * Note: for all these macros, the "vq" argument denotes the SVE + * vector length in quadwords (i.e., units of 128 bits). + * + * The correct way to obtain vq is to use sve_vq_from_vl(vl). The + * result is valid if and only if sve_vl_valid(vl) is true. This is + * guaranteed for a struct sve_context written by the kernel. + * + * + * Additional macros describe the contents and layout of the payload. + * For each, SVE_SIG_x_OFFSET(args) is the start offset relative to + * the start of struct sve_context, and SVE_SIG_x_SIZE(args) is the + * size in bytes: + * + * x type description + * - ---- ----------- + * REGS the entire SVE context + * + * ZREGS __uint128_t[SVE_NUM_ZREGS][vq] all Z-registers + * ZREG __uint128_t[vq] individual Z-register Zn + * + * PREGS uint16_t[SVE_NUM_PREGS][vq] all P-registers + * PREG uint16_t[vq] individual P-register Pn + * + * FFR uint16_t[vq] first-fault status register + * + * Additional data might be appended in the future. + */ + +#define SVE_SIG_ZREG_SIZE(vq) ((__u32)(vq) * 16) +#define SVE_SIG_PREG_SIZE(vq) ((__u32)(vq) * 2) +#define SVE_SIG_FFR_SIZE(vq) SVE_SIG_PREG_SIZE(vq) + +#define SVE_SIG_REGS_OFFSET ((sizeof(struct sve_context) + 15) / 16 * 16) + +#define SVE_SIG_ZREGS_OFFSET SVE_SIG_REGS_OFFSET +#define SVE_SIG_ZREG_OFFSET(vq, n) \ + (SVE_SIG_ZREGS_OFFSET + SVE_SIG_ZREG_SIZE(vq) * (n)) +#define SVE_SIG_ZREGS_SIZE(vq) \ + (SVE_SIG_ZREG_OFFSET(vq, SVE_NUM_ZREGS) - SVE_SIG_ZREGS_OFFSET) + +#define SVE_SIG_PREGS_OFFSET(vq) \ + (SVE_SIG_ZREGS_OFFSET + SVE_SIG_ZREGS_SIZE(vq)) +#define SVE_SIG_PREG_OFFSET(vq, n) \ + (SVE_SIG_PREGS_OFFSET(vq) + SVE_SIG_PREG_SIZE(vq) * (n)) +#define SVE_SIG_PREGS_SIZE(vq) \ + (SVE_SIG_PREG_OFFSET(vq, SVE_NUM_PREGS) - SVE_SIG_PREGS_OFFSET(vq)) + +#define SVE_SIG_FFR_OFFSET(vq) \ + (SVE_SIG_PREGS_OFFSET(vq) + SVE_SIG_PREGS_SIZE(vq)) + +#define SVE_SIG_REGS_SIZE(vq) \ + (SVE_SIG_FFR_OFFSET(vq) + SVE_SIG_FFR_SIZE(vq) - SVE_SIG_REGS_OFFSET) + +#define SVE_SIG_CONTEXT_SIZE(vq) (SVE_SIG_REGS_OFFSET + 
SVE_SIG_REGS_SIZE(vq)) + + #endif /* _UAPI__ASM_SIGCONTEXT_H */ diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 1e430b4..7418237 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -57,6 +57,7 @@ struct rt_sigframe_user_layout { unsigned long fpsimd_offset; unsigned long esr_offset; + unsigned long sve_offset; unsigned long extra_offset; unsigned long end_offset; }; @@ -209,8 +210,39 @@ static int restore_fpsimd_context(struct fpsimd_context __user *ctx) return err ? -EFAULT : 0; } + +#ifdef CONFIG_ARM64_SVE + +static int preserve_sve_context(struct sve_context __user *ctx) +{ + int err = 0; + u16 reserved[ARRAY_SIZE(ctx->__reserved)]; + unsigned int vl = sve_get_vl(); + unsigned int vq = sve_vq_from_vl(vl); + + memset(reserved, 0, sizeof(reserved)); + + __put_user_error(SVE_MAGIC, &ctx->head.magic, err); + __put_user_error(round_up(SVE_SIG_CONTEXT_SIZE(vq), 16), + &ctx->head.size, err); + __put_user_error(vl, &ctx->vl, err); + BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved)); + err |= copy_to_user(&ctx->__reserved, reserved, sizeof(reserved)); + + return err ? -EFAULT : 0; +} + +#else /* ! CONFIG_ARM64_SVE */ + +/* Turn any non-optimised out attempt to use this into a link error: */ +extern int preserve_sve_context(void __user *ctx); + +#endif /* ! 
CONFIG_ARM64_SVE */ + + struct user_ctxs { struct fpsimd_context __user *fpsimd; + struct sve_context __user *sve; }; static int parse_user_sigframe(struct user_ctxs *user, @@ -224,6 +256,7 @@ static int parse_user_sigframe(struct user_ctxs *user, bool have_extra_context = false; user->fpsimd = NULL; + user->sve = NULL; if (!IS_ALIGNED((unsigned long)base, 16)) goto invalid; @@ -271,6 +304,19 @@ static int parse_user_sigframe(struct user_ctxs *user, /* ignore */ break; + case SVE_MAGIC: + if (!IS_ENABLED(CONFIG_ARM64_SVE)) + goto invalid; + + if (user->sve) + goto invalid; + + if (size < sizeof(*user->sve)) + goto invalid; + + user->sve = (struct sve_context __user *)head; + break; + case EXTRA_MAGIC: if (have_extra_context) goto invalid; @@ -417,6 +463,15 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user) return err; } + if (IS_ENABLED(CONFIG_ARM64_SVE) && (elf_hwcap & HWCAP_SVE)) { + unsigned int vq = sve_vq_from_vl(sve_get_vl()); + + err = sigframe_alloc(user, &user->sve_offset, + SVE_SIG_CONTEXT_SIZE(vq)); + if (err) + return err; + } + return sigframe_alloc_end(user); } @@ -458,6 +513,13 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user, __put_user_error(current->thread.fault_code, &esr_ctx->esr, err); } + /* Scalable Vector Extension state, if present */ + if (IS_ENABLED(CONFIG_ARM64_SVE) && err == 0 && user->sve_offset) { + struct sve_context __user *sve_ctx = + apply_user_offset(user, user->sve_offset); + err |= preserve_sve_context(sve_ctx); + } + if (err == 0 && user->extra_offset) { struct extra_context __user *extra = apply_user_offset(user, user->extra_offset); -- 2.1.4 ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 24/29] arm64/sve: Discard SVE state on system call
2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin
2016-11-25 19:41 ` [RFC PATCH 16/29] arm64/sve: signal: Add SVE state record to sigcontext Dave Martin
@ 2016-11-25 19:41 ` Dave Martin
2016-11-25 19:41 ` [RFC PATCH 18/29] arm64/sve: signal: Restore FPSIMD/SVE state in rt_sigreturn Dave Martin
` (4 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: Dave Martin @ 2016-11-25 19:41 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: libc-alpha, gdb

The base procedure call standard for the Scalable Vector Extension
defines all of the SVE programmer's model state (Z0-31, P0-15, FFR) as
caller-save, except for that subset of the state that aliases FPSIMD
state.

System calls from userspace will almost always be made through C
library wrappers -- as a consequence of the PCS there will thus rarely
if ever be any live SVE state at syscall entry in practice.

This gives us an opportunity to make SVE explicitly caller-save around
SVC and so stop carrying around the SVE state for tasks that use SVE
only occasionally (say, by calling a library).

Note that FPSIMD state will still be preserved around SVC.

As a crude heuristic to avoid pathological cases where a thread that
uses SVE frequently has to fault back into the kernel again to
re-enable SVE after a syscall, we switch the thread back to FPSIMD-only
context tracking only if the context is actually switched out before
returning to userspace.
Signed-off-by: Dave Martin <Dave.Martin@arm.com> --- arch/arm64/kernel/fpsimd.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 5834f81..2e1056e 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -203,6 +203,23 @@ static void task_fpsimd_load(struct task_struct *task) static void task_fpsimd_save(struct task_struct *task) { if (IS_ENABLED(CONFIG_ARM64_SVE) && + task_pt_regs(task)->syscallno != ~0UL && + test_tsk_thread_flag(task, TIF_SVE)) { + unsigned long tmp; + + clear_tsk_thread_flag(task, TIF_SVE); + + /* Trap if the task tries to use SVE again: */ + asm volatile ( + "mrs %[tmp], cpacr_el1\n\t" + "bic %[tmp], %[tmp], %[mask]\n\t" + "msr cpacr_el1, %[tmp]" + : [tmp] "=r" (tmp) + : [mask] "i" (CPACR_EL1_ZEN_EL0EN) + ); + } + + if (IS_ENABLED(CONFIG_ARM64_SVE) && test_tsk_thread_flag(task, TIF_SVE)) sve_save_state(__task_pffr(task), &task->thread.fpsimd_state.fpsr); -- 2.1.4 ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 18/29] arm64/sve: signal: Restore FPSIMD/SVE state in rt_sigreturn
2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin
2016-11-25 19:41 ` [RFC PATCH 16/29] arm64/sve: signal: Add SVE state record to sigcontext Dave Martin
2016-11-25 19:41 ` [RFC PATCH 24/29] arm64/sve: Discard SVE state on system call Dave Martin
@ 2016-11-25 19:41 ` Dave Martin
2016-11-25 19:41 ` [RFC PATCH 17/29] arm64/sve: signal: Dump Scalable Vector Extension registers to user stack Dave Martin
` (3 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: Dave Martin @ 2016-11-25 19:41 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Florian Weimer, libc-alpha, gdb

This patch adds the missing logic to restore the SVE state in
rt_sigreturn.

Because the FPSIMD and SVE state alias, this code replaces the
existing fpsimd restore code when there is SVE state to restore.
For Zn[127:0], the saved FPSIMD state in Vn takes precedence.

Since __task_fpsimd_to_sve() is used to merge the FPSIMD and SVE state
back together, and only for this purpose, we don't want it to zero out
the SVE state -- hence delete the memset() from there.
Signed-off-by: Dave Martin <Dave.Martin@arm.com> --- arch/arm64/kernel/fpsimd.c | 4 --- arch/arm64/kernel/signal.c | 87 ++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 76 insertions(+), 15 deletions(-) diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 4ef2e37..b1a8d3e 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -266,9 +266,6 @@ static void task_sve_to_fpsimd(struct task_struct *task __always_unused) { } void fpsimd_signal_preserve_current_state(void) { - WARN_ONCE(elf_hwcap & HWCAP_SVE, - "SVE state save/restore around signals doesn't work properly, expect userspace corruption!\n"); - fpsimd_preserve_current_state(); task_sve_to_fpsimd(current); } @@ -301,7 +298,6 @@ static void __task_fpsimd_to_sve(struct task_struct *task, unsigned int vq) struct fpsimd_state *fst = &task->thread.fpsimd_state; unsigned int i; - memset(sst, 0, sizeof(*sst)); for (i = 0; i < 32; ++i) sst->zregs[i][0] = fst->vregs[i]; } diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 038e7338..2697d09 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -211,6 +211,11 @@ static int restore_fpsimd_context(struct fpsimd_context __user *ctx) } +struct user_ctxs { + struct fpsimd_context __user *fpsimd; + struct sve_context __user *sve; +}; + #ifdef CONFIG_ARM64_SVE static int preserve_sve_context(struct sve_context __user *ctx) @@ -240,19 +245,68 @@ static int preserve_sve_context(struct sve_context __user *ctx) return err ? 
-EFAULT : 0; } +static int __restore_sve_fpsimd_context(struct user_ctxs *user, + unsigned int vl, unsigned int vq) +{ + int err; + struct fpsimd_sve_state(vq) *task_sve_regs = + __task_sve_state(current); + struct fpsimd_state fpsimd; + + if (vl != sve_get_vl()) + return -EINVAL; + + BUG_ON(SVE_SIG_REGS_SIZE(vq) > sizeof(*task_sve_regs)); + BUG_ON(round_up(SVE_SIG_REGS_SIZE(vq), 16) < sizeof(*task_sve_regs)); + BUG_ON(SVE_SIG_FFR_OFFSET(vq) - SVE_SIG_REGS_OFFSET != + (char *)&task_sve_regs->ffr - (char *)task_sve_regs); + err = __copy_from_user(task_sve_regs, + (char __user const *)user->sve + + SVE_SIG_REGS_OFFSET, + SVE_SIG_REGS_SIZE(vq)); + if (err) + return err; + + /* copy the FP and status/control registers */ + /* restore_sigframe() already checked that user->fpsimd != NULL. */ + err = __copy_from_user(fpsimd.vregs, user->fpsimd->vregs, + sizeof(fpsimd.vregs)); + __get_user_error(fpsimd.fpsr, &user->fpsimd->fpsr, err); + __get_user_error(fpsimd.fpcr, &user->fpsimd->fpcr, err); + + /* load the hardware registers from the fpsimd_state structure */ + if (!err) + fpsimd_update_current_state(&fpsimd); + + return err; +} + +static int restore_sve_fpsimd_context(struct user_ctxs *user) +{ + int err; + u16 vl, vq; + + err = __get_user(vl, &user->sve->vl); + if (err) + return err; + + if (!sve_vl_valid(vl)) + return -EINVAL; + + vq = sve_vq_from_vl(vl); + + return __restore_sve_fpsimd_context(user, vl, vq); +} + #else /* ! CONFIG_ARM64_SVE */ -/* Turn any non-optimised out attempt to use this into a link error: */ +/* Turn any non-optimised out attempts to use these into a link error: */ extern int preserve_sve_context(void __user *ctx); +extern int restore_sve_fpsimd_context(struct user_ctxs *user); #endif /* ! 
CONFIG_ARM64_SVE */ -struct user_ctxs { - struct fpsimd_context __user *fpsimd; - struct sve_context __user *sve; -}; - static int parse_user_sigframe(struct user_ctxs *user, struct rt_sigframe __user *sf) { @@ -316,6 +370,9 @@ static int parse_user_sigframe(struct user_ctxs *user, if (!IS_ENABLED(CONFIG_ARM64_SVE)) goto invalid; + if (!(elf_hwcap & HWCAP_SVE)) + goto invalid; + if (user->sve) goto invalid; @@ -375,9 +432,6 @@ static int parse_user_sigframe(struct user_ctxs *user, } done: - if (!user->fpsimd) - goto invalid; - return 0; invalid: @@ -411,8 +465,19 @@ static int restore_sigframe(struct pt_regs *regs, if (err == 0) err = parse_user_sigframe(&user, sf); - if (err == 0) - err = restore_fpsimd_context(user.fpsimd); + if (err == 0) { + if (!user.fpsimd) + return -EINVAL; + + if (user.sve) { + if (!IS_ENABLED(CONFIG_ARM64_SVE) || + !(elf_hwcap & HWCAP_SVE)) + return -EINVAL; + + err = restore_sve_fpsimd_context(&user); + } else + err = restore_fpsimd_context(user.fpsimd); + } return err; } -- 2.1.4 ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 17/29] arm64/sve: signal: Dump Scalable Vector Extension registers to user stack 2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin ` (2 preceding siblings ...) 2016-11-25 19:41 ` [RFC PATCH 18/29] arm64/sve: signal: Restore FPSIMD/SVE state in rt_sigreturn Dave Martin @ 2016-11-25 19:41 ` Dave Martin 2016-11-25 19:42 ` [RFC PATCH 27/29] arm64/sve: ptrace support Dave Martin ` (2 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-11-25 19:41 UTC (permalink / raw) To: linux-arm-kernel; +Cc: Florian Weimer, libc-alpha, gdb This patch populates the sve_regs() area reserved on the user stack with the actual register context. Signed-off-by: Dave Martin <Dave.Martin@arm.com> --- arch/arm64/include/asm/fpsimd.h | 1 + arch/arm64/kernel/fpsimd.c | 5 ++--- arch/arm64/kernel/signal.c | 8 ++++++++ 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index aa82b38..e39066a 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -93,6 +93,7 @@ extern void fpsimd_load_partial_state(struct fpsimd_partial_state *state); extern void __init fpsimd_init_task_struct_size(void); +extern void *__task_sve_state(struct task_struct *task); extern void sve_save_state(void *state, u32 *pfpsr); extern void sve_load_state(void const *state, u32 const *pfpsr); extern unsigned int sve_get_vl(void); diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 9a90921..4ef2e37 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -128,7 +128,7 @@ void do_fpsimd_exc(unsigned int esr, struct pt_regs *regs) #ifdef CONFIG_ARM64_SVE -static void *__task_sve_state(struct task_struct *task) +void *__task_sve_state(struct task_struct *task) { return (char *)task + ALIGN(sizeof(*task), 16); } @@ -143,8 +143,7 @@ static void *__task_pffr(struct task_struct *task) #else /* 
!CONFIG_ARM64_SVE */ -/* Turn any non-optimised out attempts to use these into a link error: */ -extern void *__task_sve_state(struct task_struct *task); +/* Turn any non-optimised out attempts to use this into a link error: */ extern void *__task_pffr(struct task_struct *task); #endif /* !CONFIG_ARM64_SVE */ diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 7418237..038e7338 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -229,6 +229,14 @@ static int preserve_sve_context(struct sve_context __user *ctx) BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved)); err |= copy_to_user(&ctx->__reserved, reserved, sizeof(reserved)); + /* + * This assumes that the SVE state has already been saved to + * the task struct by calling preserve_fpsimd_context(). + */ + err |= copy_to_user((char __user *)ctx + SVE_SIG_REGS_OFFSET, + __task_sve_state(current), + SVE_SIG_REGS_SIZE(vq)); + return err ? -EFAULT : 0; } -- 2.1.4 ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 27/29] arm64/sve: ptrace support 2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin ` (3 preceding siblings ...) 2016-11-25 19:41 ` [RFC PATCH 17/29] arm64/sve: signal: Dump Scalable Vector Extension registers to user stack Dave Martin @ 2016-11-25 19:42 ` Dave Martin 2016-11-30 9:56 ` [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Yao Qi 2016-11-30 10:08 ` Florian Weimer 6 siblings, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-11-25 19:42 UTC (permalink / raw) To: linux-arm-kernel; +Cc: Alan Hayward, gdb From: Alan Hayward <alan.hayward@arm.com> This patch adds support for accessing a task's SVE registers via ptrace. Some additional helpers are added in order to support the SVE/ FPSIMD register view synchronisation operations that are required in order to make the NT_PRFPREG and NT_ARM_SVE regsets interact correctly. fpr_set()/fpr_get() are refactored into backend/frontend functions, so that the core can be reused by sve_set()/sve_get() for the case where no SVE registers are stored for a thread. Signed-off-by: Alan Hayward <alan.hayward@arm.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> --- arch/arm64/include/asm/fpsimd.h | 20 +++ arch/arm64/include/uapi/asm/ptrace.h | 125 +++++++++++++++ arch/arm64/include/uapi/asm/sigcontext.h | 4 + arch/arm64/kernel/fpsimd.c | 42 +++++ arch/arm64/kernel/ptrace.c | 254 ++++++++++++++++++++++++++++++- include/uapi/linux/elf.h | 1 + 6 files changed, 440 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index e39066a..88bcf69 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -35,6 +35,10 @@ struct fpsimd_state { __uint128_t vregs[32]; u32 fpsr; u32 fpcr; + /* + * For ptrace compatibility, pad to next 128-bit + * boundary here if extending this struct. 
+ */ }; }; /* the id of the last cpu to have restored this state */ @@ -98,6 +102,22 @@ extern void sve_save_state(void *state, u32 *pfpsr); extern void sve_load_state(void const *state, u32 const *pfpsr); extern unsigned int sve_get_vl(void); +/* + * FPSIMD/SVE synchronisation helpers for ptrace: + * For use on stopped tasks only + */ + +extern void fpsimd_sync_to_sve(struct task_struct *task); + +#ifdef CONFIG_ARM64_SVE +extern void fpsimd_sync_to_fpsimd(struct task_struct *task); +extern void fpsimd_sync_from_fpsimd_zeropad(struct task_struct *task); +#else /* !CONFIG_ARM64_SVE */ +static void __maybe_unused fpsimd_sync_to_fpsimd(struct task_struct *task) { } +static void __maybe_unused fpsimd_sync_from_fpsimd_zeropad( + struct task_struct *task) { } +#endif /* !CONFIG_ARM64_SVE */ + #endif #endif diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h index b5c3933..48b57a0 100644 --- a/arch/arm64/include/uapi/asm/ptrace.h +++ b/arch/arm64/include/uapi/asm/ptrace.h @@ -22,6 +22,7 @@ #include <linux/types.h> #include <asm/hwcap.h> +#include <asm/sigcontext.h> /* @@ -77,6 +78,7 @@ struct user_fpsimd_state { __uint128_t vregs[32]; __u32 fpsr; __u32 fpcr; + /* Pad to next 128-bit boundary here if extending this struct */ }; struct user_hwdebug_state { @@ -89,6 +91,129 @@ struct user_hwdebug_state { } dbg_regs[16]; }; +/* SVE/FP/SIMD state (NT_ARM_SVE) */ + +struct user_sve_header { + __u32 size; /* total meaningful regset content in bytes */ + __u32 max_size; /* maxmium possible size for this thread */ + __u16 vl; /* current vector length */ + __u16 max_vl; /* maximum possible vector length */ + __u16 flags; + __u16 __reserved; +}; + +/* Definitions for user_sve_header.flags: */ +#define SVE_PT_REGS_MASK (1 << 0) + +#define SVE_PT_REGS_FPSIMD 0 +#define SVE_PT_REGS_SVE SVE_PT_REGS_MASK + + +/* + * The remainder of the SVE state follows struct user_sve_header. 
The + * total size of the SVE state (including header) depends on the + * metadata in the header: SVE_PT_SIZE(vq, flags) gives the total size + * of the state in bytes, including the header. + * + * Refer to <asm/sigcontext.h> for details of how to pass the correct + * "vq" argument to these macros. + */ + +/* Offset from the start of struct user_sve_header to the register data */ +#define SVE_PT_REGS_OFFSET ((sizeof(struct sve_context) + 15) / 16 * 16) + +/* + * The register data content and layout depends on the value of the + * flags field. + */ + +/* + * (flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_FPSIMD case: + * + * The payload starts at offset SVE_PT_FPSIMD_OFFSET, and is of type + * struct user_fpsimd_state. Additional data might be appended in the + * future: use SVE_PT_FPSIMD_SIZE(vq, flags) to compute the total size. + * SVE_PT_FPSIMD_SIZE(vq, flags) will never be less than + * sizeof(struct user_fpsimd_state). + */ + +#define SVE_PT_FPSIMD_OFFSET SVE_PT_REGS_OFFSET + +#define SVE_PT_FPSIMD_SIZE(vq, flags) (sizeof(struct user_fpsimd_state)) + +/* + * (flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_SVE case: + * + * The payload starts at offset SVE_PT_SVE_OFFSET, and is of size + * SVE_PT_SVE_SIZE(vq, flags). + * + * Additional macros describe the contents and layout of the payload. + * For each, SVE_PT_SVE_x_OFFSET(args) is the start offset relative to + * the start of struct user_sve_header, and SVE_PT_SVE_x_SIZE(args) is + * the size in bytes: + * + * x type description + * - ---- ----------- + * ZREGS \ + * ZREG | + * PREGS | refer to <asm/sigcontext.h> + * PREG | + * FFR / + * + * FPSR uint32_t FPSR + * FPCR uint32_t FPCR + * + * Additional data might be appended in the future. 
+ */ + +#define SVE_PT_SVE_ZREG_SIZE(vq) SVE_SIG_ZREG_SIZE(vq) +#define SVE_PT_SVE_PREG_SIZE(vq) SVE_SIG_PREG_SIZE(vq) +#define SVE_PT_SVE_FFR_SIZE(vq) SVE_SIG_FFR_SIZE(vq) +#define SVE_PT_SVE_FPSR_SIZE sizeof(__u32) +#define SVE_PT_SVE_FPCR_SIZE sizeof(__u32) + +#define __SVE_SIG_TO_PT(offset) \ + ((offset) - SVE_SIG_REGS_OFFSET + SVE_PT_REGS_OFFSET) + +#define SVE_PT_SVE_OFFSET SVE_PT_REGS_OFFSET + +#define SVE_PT_SVE_ZREGS_OFFSET \ + __SVE_SIG_TO_PT(SVE_SIG_ZREGS_OFFSET) +#define SVE_PT_SVE_ZREG_OFFSET(vq, n) \ + __SVE_SIG_TO_PT(SVE_SIG_ZREG_OFFSET(vq, n)) +#define SVE_PT_SVE_ZREGS_SIZE(vq) \ + (SVE_PT_SVE_ZREG_OFFSET(vq, SVE_NUM_ZREGS) - SVE_PT_SVE_ZREGS_OFFSET) + +#define SVE_PT_SVE_PREGS_OFFSET(vq) \ + __SVE_SIG_TO_PT(SVE_SIG_PREGS_OFFSET(vq)) +#define SVE_PT_SVE_PREG_OFFSET(vq, n) \ + __SVE_SIG_TO_PT(SVE_SIG_PREG_OFFSET(vq, n)) +#define SVE_PT_SVE_PREGS_SIZE(vq) \ + (SVE_PT_SVE_PREG_OFFSET(vq, SVE_NUM_PREGS) - \ + SVE_PT_SVE_PREGS_OFFSET(vq)) + +#define SVE_PT_SVE_FFR_OFFSET(vq) \ + __SVE_SIG_TO_PT(SVE_SIG_FFR_OFFSET(vq)) + +#define SVE_PT_SVE_FPSR_OFFSET(vq) \ + ((SVE_PT_SVE_FFR_OFFSET(vq) + SVE_PT_SVE_FFR_SIZE(vq) + 15) / 16 * 16) +#define SVE_PT_SVE_FPCR_OFFSET(vq) \ + (SVE_PT_SVE_FPSR_OFFSET(vq) + SVE_PT_SVE_FPSR_SIZE) + +/* + * Any future extension appended after FPCR must be aligned to the next + * 128-bit boundary. + */ + +#define SVE_PT_SVE_SIZE(vq, flags) \ + ((SVE_PT_SVE_FPCR_OFFSET(vq) + SVE_PT_SVE_FPCR_SIZE - \ + SVE_PT_SVE_OFFSET + 15) / 16 * 16) + +#define SVE_PT_SIZE(vq, flags) \ + (((flags) & SVE_PT_REGS_MASK) == SVE_PT_REGS_SVE ? 
\ + SVE_PT_SVE_OFFSET + SVE_PT_SVE_SIZE(vq, flags) \ + : SVE_PT_FPSIMD_OFFSET + SVE_PT_FPSIMD_SIZE(vq, flags)) + #endif /* __ASSEMBLY__ */ #endif /* _UAPI__ASM_PTRACE_H */ diff --git a/arch/arm64/include/uapi/asm/sigcontext.h b/arch/arm64/include/uapi/asm/sigcontext.h index 11c915d..91e55de 100644 --- a/arch/arm64/include/uapi/asm/sigcontext.h +++ b/arch/arm64/include/uapi/asm/sigcontext.h @@ -16,6 +16,8 @@ #ifndef _UAPI__ASM_SIGCONTEXT_H #define _UAPI__ASM_SIGCONTEXT_H +#ifndef __ASSEMBLY__ + #include <linux/types.h> /* @@ -96,6 +98,8 @@ struct sve_context { __u16 __reserved[3]; }; +#endif /* !__ASSEMBLY__ */ + /* * The SVE architecture leaves space for future expansion of the * vector length beyond its initial architectural limit of 2048 bits diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 1750301..6a5e725 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -417,6 +417,48 @@ void fpsimd_flush_task_state(struct task_struct *t) t->thread.fpsimd_state.cpu = NR_CPUS; } +#ifdef CONFIG_ARM64_SVE + +/* FPSIMD/SVE synchronisation helpers for ptrace */ + +void fpsimd_sync_to_sve(struct task_struct *task) +{ + if (!test_tsk_thread_flag(task, TIF_SVE)) + task_fpsimd_to_sve(task); +} + +void fpsimd_sync_to_fpsimd(struct task_struct *task) +{ + if (test_tsk_thread_flag(task, TIF_SVE)) + task_sve_to_fpsimd(task); +} + +static void __fpsimd_sync_from_fpsimd_zeropad(struct task_struct *task, + unsigned int vq) +{ + struct sve_struct fpsimd_sve_state(vq) *sst = + __task_sve_state(task); + struct fpsimd_state *fst = &task->thread.fpsimd_state; + unsigned int i; + + if (!test_tsk_thread_flag(task, TIF_SVE)) + return; + + memset(sst->zregs, 0, sizeof(sst->zregs)); + + for (i = 0; i < 32; ++i) + sst->zregs[i][0] = fst->vregs[i]; +} + +void fpsimd_sync_from_fpsimd_zeropad(struct task_struct *task) +{ + unsigned int vl = sve_get_vl(); + + __fpsimd_sync_from_fpsimd_zeropad(task, sve_vq_from_vl(vl)); +} + +#endif /* CONFIG_ARM64_SVE */ + 
#ifdef CONFIG_KERNEL_MODE_NEON static DEFINE_PER_CPU(struct fpsimd_partial_state, hardirq_fpsimdstate); diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index e0c81da..bdd2ad3 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -30,7 +30,9 @@ #include <linux/seccomp.h> #include <linux/security.h> #include <linux/init.h> +#include <linux/sched.h> #include <linux/signal.h> +#include <linux/string.h> #include <linux/uaccess.h> #include <linux/perf_event.h> #include <linux/hw_breakpoint.h> @@ -611,13 +613,46 @@ static int gpr_set(struct task_struct *target, const struct user_regset *regset, /* * TODO: update fp accessors for lazy context switching (sync/flush hwstate) */ +static int __fpr_get(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + void *kbuf, void __user *ubuf, unsigned int start_pos) +{ + struct user_fpsimd_state *uregs; + + fpsimd_sync_to_fpsimd(target); + + uregs = &target->thread.fpsimd_state.user_fpsimd; + return user_regset_copyout(&pos, &count, &kbuf, &ubuf, uregs, + start_pos, start_pos + sizeof(*uregs)); +} + static int fpr_get(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, void *kbuf, void __user *ubuf) { - struct user_fpsimd_state *uregs; - uregs = &target->thread.fpsimd_state.user_fpsimd; - return user_regset_copyout(&pos, &count, &kbuf, &ubuf, uregs, 0, -1); + return __fpr_get(target, regset, pos, count, kbuf, ubuf, 0); +} + +static int __fpr_set(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + const void *kbuf, const void __user *ubuf, + unsigned int start_pos) +{ + int ret; + struct user_fpsimd_state newstate; + + fpsimd_sync_to_fpsimd(target); + + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &newstate, + start_pos, start_pos + sizeof(newstate)); + if (ret) + return ret; + + target->thread.fpsimd_state.user_fpsimd = newstate; + + 
return ret; } static int fpr_set(struct task_struct *target, const struct user_regset *regset, @@ -625,14 +660,14 @@ static int fpr_set(struct task_struct *target, const struct user_regset *regset, const void *kbuf, const void __user *ubuf) { int ret; - struct user_fpsimd_state newstate; - ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &newstate, 0, -1); + ret = __fpr_set(target, regset, pos, count, kbuf, ubuf, 0); if (ret) return ret; - target->thread.fpsimd_state.user_fpsimd = newstate; + fpsimd_sync_from_fpsimd_zeropad(target); fpsimd_flush_task_state(target); + return ret; } @@ -685,6 +720,204 @@ static int system_call_set(struct task_struct *target, return ret; } +#ifdef CONFIG_ARM64_SVE + +static int sve_get(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + void *kbuf, void __user *ubuf) +{ + int ret; + struct user_sve_header header; + unsigned int vq; + unsigned long start, end; + + /* Header */ + memset(&header, 0, sizeof(header)); + + header.vl = sve_get_vl(); + + BUG_ON(!sve_vl_valid(header.vl)); + vq = sve_vq_from_vl(header.vl); + + /* Until runtime or per-task vector length changing is supported: */ + header.max_vl = header.vl; + + header.flags = test_tsk_thread_flag(target, TIF_SVE) ? 
+ SVE_PT_REGS_SVE : SVE_PT_REGS_FPSIMD; + + header.size = SVE_PT_SIZE(vq, header.flags); + header.max_size = SVE_PT_SIZE(vq, SVE_PT_REGS_SVE); + + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &header, + 0, sizeof(header)); + if (ret) + return ret; + + /* Registers: FPSIMD-only case */ + + BUILD_BUG_ON(SVE_PT_FPSIMD_OFFSET != sizeof(header)); + + if ((header.flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_FPSIMD) + return __fpr_get(target, regset, pos, count, kbuf, ubuf, + SVE_PT_FPSIMD_OFFSET); + + /* Otherwise: full SVE case */ + + BUILD_BUG_ON(SVE_PT_SVE_OFFSET != sizeof(header)); + + start = SVE_PT_SVE_OFFSET; + end = SVE_PT_SVE_FFR_OFFSET(vq) + SVE_PT_SVE_FFR_SIZE(vq); + + BUG_ON((char *)__task_sve_state(target) < (char *)target); + BUG_ON(end < start); + BUG_ON(arch_task_struct_size < end - start); + BUG_ON((char *)__task_sve_state(target) - (char *)target > + arch_task_struct_size - (end - start)); + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + __task_sve_state(target), + start, end); + if (ret) + return ret; + + start = end; + end = SVE_PT_SVE_FPSR_OFFSET(vq); + + BUG_ON(end < start); + ret = user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf, + start, end); + if (ret) + return ret; + + start = end; + end = SVE_PT_SVE_FPCR_OFFSET(vq) + SVE_PT_SVE_FPCR_SIZE; + + BUG_ON((char *)(&target->thread.fpsimd_state.fpcr + 1) < + (char *)&target->thread.fpsimd_state.fpsr); + BUG_ON(end < start); + BUG_ON((char *)(&target->thread.fpsimd_state.fpcr + 1) - + (char *)&target->thread.fpsimd_state.fpsr != + end - start); + + ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf, + &target->thread.fpsimd_state.fpsr, + start, end); + if (ret) + return ret; + + start = end; + end = (SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE) + 15) / 16 * 16; + BUG_ON(end < start); + + return user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf, + start, end); +} + +static int sve_set(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int 
count, + const void *kbuf, const void __user *ubuf) +{ + int ret; + struct user_sve_header header; + unsigned int vq; + unsigned long start, end; + + /* Header */ + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &header, + 0, sizeof(header)); + if (ret) + goto out; + + if (header.vl != sve_get_vl()) + return -EINVAL; + + BUG_ON(!sve_vl_valid(header.vl)); + vq = sve_vq_from_vl(header.vl); + + if (header.flags & ~SVE_PT_REGS_MASK) + return -EINVAL; + + /* Registers: FPSIMD-only case */ + + BUILD_BUG_ON(SVE_PT_FPSIMD_OFFSET != sizeof(header)); + + if ((header.flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_FPSIMD) { + ret = __fpr_set(target, regset, pos, count, kbuf, ubuf, + SVE_PT_FPSIMD_OFFSET); + clear_tsk_thread_flag(target, TIF_SVE); + goto out; + } + + /* Otherwise: full SVE case */ + + fpsimd_sync_to_sve(target); + set_tsk_thread_flag(target, TIF_SVE); + + BUILD_BUG_ON(SVE_PT_SVE_OFFSET != sizeof(header)); + + start = SVE_PT_SVE_OFFSET; + end = SVE_PT_SVE_FFR_OFFSET(vq) + SVE_PT_SVE_FFR_SIZE(vq); + + BUG_ON((char *)__task_sve_state(target) < (char *)target); + BUG_ON(end < start); + BUG_ON(arch_task_struct_size < end - start); + BUG_ON((char *)__task_sve_state(target) - (char *)target > + arch_task_struct_size - (end - start)); + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, + __task_sve_state(target), + start, end); + if (ret) + goto out; + + start = end; + end = SVE_PT_SVE_FPSR_OFFSET(vq); + + BUG_ON(end < start); + ret = user_regset_copyin_ignore(&pos, &count, &kbuf, &ubuf, + start, end); + if (ret) + goto out; + + start = end; + end = SVE_PT_SVE_FPCR_OFFSET(vq) + SVE_PT_SVE_FPCR_SIZE; + + BUG_ON((char *)(&target->thread.fpsimd_state.fpcr + 1) < + (char *)&target->thread.fpsimd_state.fpsr); + BUG_ON(end < start); + BUG_ON((char *)(&target->thread.fpsimd_state.fpcr + 1) - + (char *)&target->thread.fpsimd_state.fpsr != + end - start); + + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, + &target->thread.fpsimd_state.fpsr, + start, end); + +out: + 
fpsimd_flush_task_state(target); + return ret; +} + +#else /* !CONFIG_ARM64_SVE */ + +static int sve_get(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + void *kbuf, void __user *ubuf) +{ + return -EINVAL; +} + +static int sve_set(struct task_struct *target, + const struct user_regset *regset, + unsigned int pos, unsigned int count, + const void *kbuf, const void __user *ubuf) +{ + return -EINVAL; +} + +#endif /* !CONFIG_ARM64_SVE */ + enum aarch64_regset { REGSET_GPR, REGSET_FPR, @@ -694,6 +927,7 @@ enum aarch64_regset { REGSET_HW_WATCH, #endif REGSET_SYSTEM_CALL, + REGSET_SVE, }; static const struct user_regset aarch64_regsets[] = { @@ -751,6 +985,14 @@ static const struct user_regset aarch64_regsets[] = { .get = system_call_get, .set = system_call_set, }, + [REGSET_SVE] = { /* Scalable Vector Extension */ + .core_note_type = NT_ARM_SVE, + .n = (SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE) + 15) / 16, + .size = 16, + .align = 16, + .get = sve_get, + .set = sve_set, + }, }; static const struct user_regset_view user_aarch64_view = { diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index b59ee07..23c6585 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -414,6 +414,7 @@ typedef struct elf64_shdr { #define NT_ARM_HW_BREAK 0x402 /* ARM hardware breakpoint registers */ #define NT_ARM_HW_WATCH 0x403 /* ARM hardware watchpoint registers */ #define NT_ARM_SYSTEM_CALL 0x404 /* ARM system call number */ +#define NT_ARM_SVE 0x405 /* ARM Scalable Vector Extension registers */ #define NT_METAG_CBUF 0x500 /* Metag catch buffer registers */ #define NT_METAG_RPIPE 0x501 /* Metag read pipeline state */ #define NT_METAG_TLS 0x502 /* Metag TLS pointer */ -- 2.1.4 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin ` (4 preceding siblings ...) 2016-11-25 19:42 ` [RFC PATCH 27/29] arm64/sve: ptrace support Dave Martin @ 2016-11-30 9:56 ` Yao Qi 2016-11-30 12:07 ` Dave Martin 2016-11-30 10:08 ` Florian Weimer 6 siblings, 1 reply; 30+ messages in thread From: Yao Qi @ 2016-11-30 9:56 UTC (permalink / raw) To: Dave Martin Cc: linux-arm-kernel, Christoffer Dall, Florian Weimer, Ard Biesheuvel, Marc Zyngier, Alan Hayward, libc-alpha, GDB Hi, Dave, On Fri, Nov 25, 2016 at 7:38 PM, Dave Martin <Dave.Martin@arm.com> wrote: > * No independent SVE vector length configuration per thread. This is > planned, but will follow as a separate add-on series. If I read "independent SVE vector length configuration per thread" correctly, SVE vector length can be different in each thread, so the size of vector registers is different too. In GDB, we describe registers by "target description" which is per process, not per thread. -- Yao (齐尧) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 9:56 ` [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Yao Qi @ 2016-11-30 12:07 ` Dave Martin 2016-11-30 12:22 ` Szabolcs Nagy ` (3 more replies) 0 siblings, 4 replies; 30+ messages in thread From: Dave Martin @ 2016-11-30 12:07 UTC (permalink / raw) To: Florian Weimer, Yao Qi Cc: linux-arm-kernel, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Alan Hayward, Torvald Riegel, Christoffer Dall On Wed, Nov 30, 2016 at 11:08:50AM +0100, Florian Weimer wrote: > On 11/25/2016 08:38 PM, Dave Martin wrote: > >The Scalable Vector Extension (SVE) [1] is an extension to AArch64 which > >adds extra SIMD functionality and supports much larger vectors. > > > >This series implements core Linux support for SVE. > > > >Recipents not copied on the whole series can find the rest of the > >patches in the linux-arm-kernel archives [2]. > > > > > >The first 5 patches "arm64: signal: ..." factor out the allocation and > >placement of state information in the signal frame. The first three > >are prerequisites for the SVE support patches. > > > >Patches 04-05 implement expansion of the signal frame, and may remain > >controversial due to ABI break issues: > > > > * Discussion is needed on how userspace should detect/negotiate signal > > frame size in order for this expansion mechanism to be workable. > > I'm leaning towards a simple increase in the glibc headers (despite the ABI > risk), plus a personality flag to disable really wide vector registers in > case this causes problems with old binaries. I'm concerned here that there may be no sensible fixed size for the signal frame. We would make it ridiculously large in order to minimise the chance of hitting this problem again -- but then it would be ridiculously large, which is a potential problem for massively threaded workloads. Or we could be more conservative, but risk a re-run of similar ABI breaks in the future. 
A personality flag may also discourage use of larger vectors, even though the vast majority of software will work fine with them. > A more elaborate mechanism will likely introduce more bugs than it makes > existing applications working, due to its complexity. Yes, I was a bit concerned about this when I tried to sketch something out. [...] > > * No independent SVE vector length configuration per thread. This is > > planned, but will follow as a separate add-on series. > > Per-thread register widths will likely make coroutine switching (setcontext) > and C++ resumable functions/executors quite challenging. > > Can you detail your plans in this area? > > Thanks, > Florian I'll also respond to Yao's question here, since it's closely related: On Wed, Nov 30, 2016 at 09:56:14AM +0000, Yao Qi wrote: [...] > If I read "independent SVE vector length configuration per thread" > correctly, SVE vector length can be different in each thread, so the > size of vector registers is different too. In GDB, we describe registers > by "target description" which is per process, not per thread. > > -- > Yao (齐尧) So, my key goal is to support _per-process_ vector length control. From the kernel perspective, it is easiest to achieve this by providing per-thread control since that is the unit that context switching acts on. How useful it really is to have threads with different VLs in the same process is an open question. It's theoretically useful for runtime environments, which may want to dispatch code optimised for different VLs -- changing the VL on-the-fly within a single thread is not something I want to encourage, due to overhead and ABI issues, but switching between threads of different VLs would be more manageable. However, I expect mixing different VLs within a single process to be very much a special case -- it's not something I'd expect to work with general-purpose code. 
Since the need for independent VLs per thread is not proven, we could * forbid it -- i.e., only a thread-group leader with no children is permitted to change the VL, which is then inherited by any child threads that are subsequently created * permit it only if a special flag is specified when requesting the VL change * permit it and rely on userspace to be sensible -- easiest option for the kernel. For setcontext/setjmp, we don't save/restore any SVE state due to the caller-save status of SVE, and I would not consider it necessary to save/restore VL itself because of the no-change-on-the-fly policy for this. I'm not familiar with resumable functions/executors -- are these in the C++ standards yet (not that that would cause me to be familiar with them... ;) Any implementation of coroutines (i.e., cooperative switching) is likely to fall under the "setcontext" argument above. Thoughts? ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:07 ` Dave Martin @ 2016-11-30 12:22 ` Szabolcs Nagy 2016-11-30 14:10 ` Dave Martin 2016-11-30 12:38 ` Florian Weimer ` (2 subsequent siblings) 3 siblings, 1 reply; 30+ messages in thread From: Szabolcs Nagy @ 2016-11-30 12:22 UTC (permalink / raw) To: Dave Martin, Florian Weimer, Yao Qi Cc: nd, linux-arm-kernel, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Alan Hayward, Torvald Riegel, Christoffer Dall On 30/11/16 12:06, Dave Martin wrote: > For setcontext/setjmp, we don't save/restore any SVE state due to the > caller-save status of SVE, and I would not consider it necessary to > save/restore VL itself because of the no-change-on-the-fly policy for > this. the problem is not changing VL within a thread, but that setcontext can resume a context of a different thread which had different VL and there might be SVE regs spilled on the stack according to that. (i consider this usage undefined, but at least the gccgo runtime does this) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:22 ` Szabolcs Nagy @ 2016-11-30 14:10 ` Dave Martin 0 siblings, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-11-30 14:10 UTC (permalink / raw) To: Szabolcs Nagy Cc: Florian Weimer, Yao Qi, Torvald Riegel, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, nd, linux-arm-kernel On Wed, Nov 30, 2016 at 12:22:32PM +0000, Szabolcs Nagy wrote: > On 30/11/16 12:06, Dave Martin wrote: > > For setcontext/setjmp, we don't save/restore any SVE state due to the > > caller-save status of SVE, and I would not consider it necessary to > > save/restore VL itself because of the no-change-on-the-fly policy for > > this. > > the problem is not changing VL within a thread, > but that setcontext can resume a context of a > different thread which had different VL and there > might be SVE regs spilled on the stack according > to that. > > (i consider this usage undefined, but at least > the gccgo runtime does this) Understood -- which is part of the reason for the argument that although the kernel may permit different threads to have different VLs, whether this actually works usefully also depends on your userspace runtime environment. This again leads me to the conclusion that the request to create threads with different VLs within a single process should be explicit, in order to avoid accidents. Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:07 ` Dave Martin 2016-11-30 12:22 ` Szabolcs Nagy @ 2016-11-30 12:38 ` Florian Weimer 2016-11-30 13:56 ` Dave Martin 2016-12-02 11:49 ` Dave Martin 2016-12-02 21:56 ` Yao Qi 2016-12-05 22:42 ` Torvald Riegel 3 siblings, 2 replies; 30+ messages in thread From: Florian Weimer @ 2016-11-30 12:38 UTC (permalink / raw) To: Dave Martin, Yao Qi Cc: linux-arm-kernel, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Alan Hayward, Torvald Riegel, Christoffer Dall On 11/30/2016 01:06 PM, Dave Martin wrote: > I'm concerned here that there may be no sensible fixed size for the > signal frame. We would make it ridiculously large in order to minimise > the chance of hitting this problem again -- but then it would be > ridiculously large, which is a potential problem for massively threaded > workloads. What's ridiculously large? We could add a system call to get the right stack size. But as it depends on VL, I'm not sure what it looks like. Particularly if you need determine the stack size before creating a thread that uses a specific VL setting. > For setcontext/setjmp, we don't save/restore any SVE state due to the > caller-save status of SVE, and I would not consider it necessary to > save/restore VL itself because of the no-change-on-the-fly policy for > this. Okay, so we'd potentially set it on thread creation only? That might not be too bad. I really want to avoid a repeat of the setxid fiasco, where we need to run code on all threads to get something that approximates the POSIX-mandated behavior (process attribute) from what the kernel provides (thread/task attribute). > I'm not familiar with resumable functions/executors -- are these in > the C++ standards yet (not that that would cause me to be familiar > with them... ;) Any implementation of coroutines (i.e., > cooperative switching) is likely to fall under the "setcontext" > argument above. There are different ways to implement coroutines. 
Stack switching (like setcontext) is obviously impacted by non-uniform register sizes. But even the most conservative variant, rather similar to switch-based emulation you sometimes see in C coroutine implementations, might have trouble restoring the state if it just cannot restore the saved state due to register size reductions. Thanks, Florian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:38 ` Florian Weimer @ 2016-11-30 13:56 ` Dave Martin 2016-12-01 9:21 ` Florian Weimer 2016-12-02 11:49 ` Dave Martin 1 sibling, 1 reply; 30+ messages in thread From: Dave Martin @ 2016-11-30 13:56 UTC (permalink / raw) To: Florian Weimer Cc: Yao Qi, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On Wed, Nov 30, 2016 at 01:38:28PM +0100, Florian Weimer wrote: > On 11/30/2016 01:06 PM, Dave Martin wrote: > > >I'm concerned here that there may be no sensible fixed size for the > >signal frame. We would make it ridiculously large in order to minimise > >the chance of hitting this problem again -- but then it would be > >ridiculously large, which is a potential problem for massively threaded > >workloads. > > What's ridiculously large? The SVE architecture permits VLs up to 2048 bits per vector initially -- but it makes space for future architecture revisions to expand up to 65536 bits per vector, which would result in a signal frame > 270 KB. It's far from certain we'll ever see such large vectors, but it's hard to know where to draw the line. > We could add a system call to get the right stack size. But as it depends > on VL, I'm not sure what it looks like. Particularly if you need determine > the stack size before creating a thread that uses a specific VL setting. I think that the most likely time to set the VL is libc startup or ld.so startup -- so really a process considers the VL fixed, and a hypothetical getsigstksz() function would return a constant value depending on the VL that was set. I'd expect that only specialised code such as libc/ld.so itself or fancy runtimes would need to cope with the need to synchronise stack allocation with VL setting. The initial stack after exec is determined by RLIMIT_STACK -- we can expect that to be easily large enough for the initial thread, under any remotely normal scenario. 
> >For setcontext/setjmp, we don't save/restore any SVE state due to the > >caller-save status of SVE, and I would not consider it necessary to > >save/restore VL itself because of the no-change-on-the-fly policy for > >this. > > Okay, so we'd potentially set it on thread creation only? That might not be > too bad. Basically, yes. A runtime _could_ set it at other times, and my view is that the kernel shouldn't arbitrarily forbid this -- but it's up to userspace to determine when it's safe to do it, ensure that there's no VL-dependent data live in memory, and to arrange to reallocate stacks or pre-arrange that allocations were already big enough etc. > I really want to avoid a repeat of the setxid fiasco, where we need to run > code on all threads to get something that approximates the POSIX-mandated > behavior (process attribute) from what the kernel provides (thread/task > attribute). Yeah, that would suck. However, for the proposed ABI there is no illusion to preserve here, since the VL is proposed as a per-thread property everywhere, and this is outside the scope of POSIX. If we do have distinct "set process VL" and "set thread VL" interfaces, then my view is that the former should fail if there are already multiple threads, rather than just setting the VL of a single thread or (worse) asynchronously changing the VL of threads other than the caller... > >I'm not familiar with resumable functions/executors -- are these in > >the C++ standards yet (not that that would cause me to be familiar > >with them... ;) Any implementation of coroutines (i.e., > >cooperative switching) is likely to fall under the "setcontext" > >argument above. > > There are different ways to implement coroutines. Stack switching (like > setcontext) is obviously impacted by non-uniform register sizes. 
But even > the most conservative variant, rather similar to switch-based emulation you > sometimes see in C coroutine implementations, might have trouble restoring > the state if it just cannot restore the saved state due to register size > reductions. Which is not a problem if the variably-sized state is not part of the switched context? Because the SVE procedure call standard determines that the SVE registers are caller-save, they are not live at any external function boundary -- so in cooperative switching it is useless to save/restore this state unless the coroutine framework is defined to have a special procedure call standard. Similarly, my view is that we don't attempt to magically save and restore VL itself either. Code that changes VL after startup would be expected to be aware of and deal with the consequences itself. Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 13:56 ` Dave Martin @ 2016-12-01 9:21 ` Florian Weimer 2016-12-01 10:30 ` Dave Martin 0 siblings, 1 reply; 30+ messages in thread From: Florian Weimer @ 2016-12-01 9:21 UTC (permalink / raw) To: Dave Martin Cc: Yao Qi, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On 11/30/2016 02:56 PM, Dave Martin wrote: > If we do have distinct "set process VL" and "set thread VL" interfaces, > then my view is that the former should fail if there are already > multiple threads, rather than just setting the VL of a single thread or > (worse) asynchronously changing the VL of threads other than the > caller... Yes, looks feasible to me. >>> I'm not familiar with resumable functions/executors -- are these in >>> the C++ standards yet (not that that would cause me to be familiar >>> with them... ;) Any implementation of coroutines (i.e., >>> cooperative switching) is likely to fall under the "setcontext" >>> argument above. >> >> There are different ways to implement coroutines. Stack switching (like >> setcontext) is obviously impacted by non-uniform register sizes. But even >> the most conservative variant, rather similar to switch-based emulation you >> sometimes see in C coroutine implementations, might have trouble restoring >> the state if it just cannot restore the saved state due to register size >> reductions. > > Which is not a problem if the variably-sized state is not part of the > switched context? The VL value is implicitly thread-local data, and the encoded state may have an implicit dependency on it, although it does not contain vector registers as such. > Because the SVE procedure call standard determines that the SVE > registers are caller-save, By the way, how is this implemented? Some of them overlap existing callee-saved registers. 
> they are not live at any external function > boundary -- so in cooperative switching it is useless to save/restore > this state unless the coroutine framework is defined to have a special > procedure call standard. It can use the standard calling convention, but it may have selected a particular implementation based on the VL value before suspension. Florian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-01 9:21 ` Florian Weimer @ 2016-12-01 10:30 ` Dave Martin 2016-12-01 12:19 ` Dave Martin 2016-12-05 10:44 ` Florian Weimer 0 siblings, 2 replies; 30+ messages in thread From: Dave Martin @ 2016-12-01 10:30 UTC (permalink / raw) To: Florian Weimer Cc: libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, linux-arm-kernel, Alan Hayward, Torvald Riegel, Christoffer Dall On Thu, Dec 01, 2016 at 10:21:03AM +0100, Florian Weimer wrote: > On 11/30/2016 02:56 PM, Dave Martin wrote: > > >If we do have distinct "set process VL" and "set thread VL" interfaces, > >then my view is that the former should fail if there are already > >multiple threads, rather than just setting the VL of a single thread or > >(worse) asynchronously changing the VL of threads other than the > >caller... > > Yes, looks feasible to me. OK, I'll try to hack up something along these lines. > >>>I'm not familiar with resumable functions/executors -- are these in > >>>the C++ standards yet (not that that would cause me to be familiar > >>>with them... ;) Any implementation of coroutines (i.e., > >>>cooperative switching) is likely to fall under the "setcontext" > >>>argument above. > >> > >>There are different ways to implement coroutines. Stack switching (like > >>setcontext) is obviously impacted by non-uniform register sizes. But even > >>the most conservative variant, rather similar to switch-based emulation you > >>sometimes see in C coroutine implementations, might have trouble restoring > >>the state if it just cannot restore the saved state due to register size > >>reductions. > > > >Which is not a problem if the variably-sized state is not part of the > >switched context? > > The VL value is implicitly thread-local data, and the encoded state may have > an implicit dependency on it, although it does not contain vector registers > as such. This doesn't sound like an absolute requirement to me. 
If we presume that the SVE registers never need to get saved or restored, what stops the context data format being VL-independent? The setcontext()/getcontext() implementation for example will not change at all for SVE. > >Because the SVE procedure call standard determines that the SVE > >registers are caller-save, > > By the way, how is this implemented? Some of them overlap existing > callee-saved registers. Basically, all the *new* state is caller-save. The Neon/FPSIMD regs V8-V15 are callee-save, so in the SVE view Zn[bits 127:0] is callee-save for all n = 8..15. > >they are not live at any external function > >boundary -- so in cooperative switching it is useless to save/restore > >this state unless the coroutine framework is defined to have a special > >procedure call standard. > > It can use the standard calling convention, but it may have selected a > particular implementation based on the VL value before suspension. If the save/restore logic doesn't touch SVE, why would its implementation be VL-dependent? Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-01 10:30 ` Dave Martin @ 2016-12-01 12:19 ` Dave Martin 2016-12-05 10:44 ` Florian Weimer 1 sibling, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-12-01 12:19 UTC (permalink / raw) To: Florian Weimer Cc: libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On Thu, Dec 01, 2016 at 10:30:51AM +0000, Dave Martin wrote: [...] > Basically, all the *new* state is caller-save. > > The Neon/FPSIMD regs V8-V15 are callee-save, so in the SVE view > Zn[bits 127:0] is callee-save for all n = 8..15. Ramana is right -- the current procedure call standard (ARM IHI 0055B) only requires the bottom _64_ bits of V8-V15 to be preserved (not all 128 bits as I stated). [...] Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-01 10:30 ` Dave Martin 2016-12-01 12:19 ` Dave Martin @ 2016-12-05 10:44 ` Florian Weimer 2016-12-05 11:07 ` Szabolcs Nagy 2016-12-05 15:05 ` Dave Martin 1 sibling, 2 replies; 30+ messages in thread From: Florian Weimer @ 2016-12-05 10:44 UTC (permalink / raw) To: Dave Martin Cc: libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, linux-arm-kernel, Alan Hayward, Torvald Riegel, Christoffer Dall On 12/01/2016 11:30 AM, Dave Martin wrote: >> The VL value is implicitly thread-local data, and the encoded state may have >> an implicit dependency on it, although it does not contain vector registers >> as such. > > This doesn't sound like an absolute requirement to me. > > If we presume that the SVE registers never need to get saved or > restored, what stops the context data format being VL-independent? I'm concerned the suspended computation has code which has been selected to fit a particular VL value. > If the save/restore logic doesn't touch SVE, why would its > implementation be VL-dependent? Because it has been optimized for a certain vector length? >>> Because the SVE procedure call standard determines that the SVE >>> registers are caller-save, >> >> By the way, how is this implemented? Some of them overlap existing >> callee-saved registers. > > Basically, all the *new* state is caller-save. > > The Neon/FPSIMD regs V8-V15 are callee-save, so in the SVE view > Zn[bits 127:0] is callee-save for all n = 8..15. Are the extension parts of registers v8 to v15 used for argument passing? If not, we should be able to use the existing dynamic linker trampoline. Thanks, Florian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-05 10:44 ` Florian Weimer @ 2016-12-05 11:07 ` Szabolcs Nagy 2016-12-05 15:05 ` Dave Martin 1 sibling, 0 replies; 30+ messages in thread From: Szabolcs Nagy @ 2016-12-05 11:07 UTC (permalink / raw) To: Florian Weimer, Dave Martin Cc: nd, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, linux-arm-kernel, Alan Hayward, Torvald Riegel, Christoffer Dall On 05/12/16 10:44, Florian Weimer wrote: >>> By the way, how is this implemented? Some of them overlap existing >>> callee-saved registers. >> >> Basically, all the *new* state is caller-save. >> >> The Neon/FPSIMD regs V8-V15 are callee-save, so in the SVE view >> Zn[bits 127:0] is callee-save for all n = 8..15. > > Are the extension parts of registers v8 to v15 used for argument passing? > > If not, we should be able to use the existing dynamic linker trampoline. > if sve arguments are passed to a function then it has special call abi (which is probably not yet documented), this call abi requires that such a call does not go through plt to avoid requiring sve aware libc. same for tls access: the top part of sve regs have to be saved by the caller before accessing tls so the tlsdesc entry does not have to save them. so current trampolines should be fine. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-05 10:44 ` Florian Weimer 2016-12-05 11:07 ` Szabolcs Nagy @ 2016-12-05 15:05 ` Dave Martin 1 sibling, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-12-05 15:05 UTC (permalink / raw) To: Florian Weimer Cc: libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On Mon, Dec 05, 2016 at 11:44:38AM +0100, Florian Weimer wrote: > On 12/01/2016 11:30 AM, Dave Martin wrote: > > >>The VL value is implicitly thread-local data, and the encoded state may have > >>an implicit dependency on it, although it does not contain vector registers > >>as such. > > > >This doesn't sound like an absolute requirement to me. > > > >If we presume that the SVE registers never need to get saved or > >restored, what stops the context data format being VL-independent? > > I'm concerned the suspended computation has code which has been selected to > fit a particular VL value. > > > If the save/restore logic doesn't touch SVE, why would its > > implementation be VL-dependent? > > Because it has been optimized for a certain vector length? I'll respond to these via Szabolcs' reply. > >>>Because the SVE procedure call standard determines that the SVE > >>>registers are caller-save, > >> > >>By the way, how is this implemented? Some of them overlap existing > >>callee-saved registers. > > > >Basically, all the *new* state is caller-save. > > > >The Neon/FPSIMD regs V8-V15 are callee-save, so in the SVE view > >Zn[bits 127:0] is callee-save for all n = 8..15. > > Are the extension parts of registers v8 to v15 used for argument passing? No -- the idea is to be directly compatible with the existing PCS. > If not, we should be able to use the existing dynamic linker trampoline. Yes, I believe so. Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:38 ` Florian Weimer 2016-11-30 13:56 ` Dave Martin @ 2016-12-02 11:49 ` Dave Martin 2016-12-02 16:34 ` Florian Weimer 1 sibling, 1 reply; 30+ messages in thread From: Dave Martin @ 2016-12-02 11:49 UTC (permalink / raw) To: Florian Weimer Cc: Yao Qi, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On Wed, Nov 30, 2016 at 01:38:28PM +0100, Florian Weimer wrote: [...] > We could add a system call to get the right stack size. But as it depends > on VL, I'm not sure what it looks like. Particularly if you need to determine > the stack size before creating a thread that uses a specific VL setting. I missed this point previously -- apologies for that. What would you think of: set_vl(vl_for_new_thread); minsigstksz = get_minsigstksz(); set_vl(my_vl); This avoids get_minsigstksz() requiring parameters -- which is mainly a concern because the parameters tomorrow might be different from the parameters today. If it is possible to create the new thread without any SVE-dependent code, then we could set_vl(vl_for_new_thread); new_thread_stack = malloc(get_minsigstksz()); new_thread = create_thread(..., new_thread_stack); set_vl(my_vl); which has the nice property that the new thread directly inherits the configuration that was used for get_minsigstksz(). However, it would be necessary to prevent GCC from moving any code across these statements -- in particular, SVE code that accesses VL-dependent data spilled on the stack is liable to go wrong if reordered with the above. So the sequence would need to go in an external function (or a single asm...) Failing that, we could maybe pass some extensible struct to get_minsigstksz(). Thoughts? Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-02 11:49 ` Dave Martin @ 2016-12-02 16:34 ` Florian Weimer 2016-12-02 16:59 ` Joseph Myers 0 siblings, 1 reply; 30+ messages in thread From: Florian Weimer @ 2016-12-02 16:34 UTC (permalink / raw) To: Dave Martin Cc: Yao Qi, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On 12/02/2016 12:48 PM, Dave Martin wrote: > On Wed, Nov 30, 2016 at 01:38:28PM +0100, Florian Weimer wrote: > > [...] > >> We could add a system call to get the right stack size. But as it depends >> on VL, I'm not sure what it looks like. Particularly if you need to determine >> the stack size before creating a thread that uses a specific VL setting. > > I missed this point previously -- apologies for that. > > What would you think of: > > set_vl(vl_for_new_thread); > minsigstksz = get_minsigstksz(); > set_vl(my_vl); > > This avoids get_minsigstksz() requiring parameters -- which is mainly a > concern because the parameters tomorrow might be different from the > parameters today. > > If it is possible to create the new thread without any SVE-dependent code, > then we could > > set_vl(vl_for_new_thread); > new_thread_stack = malloc(get_minsigstksz()); > new_thread = create_thread(..., new_thread_stack); > set_vl(my_vl); > > which has the nice property that the new thread directly inherits the > configuration that was used for get_minsigstksz(). Because all SVE registers are caller-saved, it's acceptable to temporarily reduce the VL value, I think. So this should work. One complication is that both the kernel and the libc need to reserve stack space, so the kernel-returned value and the one which has to be used in reality will be different. 
> However, it would be necessary to prevent GCC from moving any code > across these statements -- in particular, SVE code that accesses > VL-dependent data spilled on the stack is liable to go wrong if reordered > with the above. So the sequence would need to go in an external > function (or a single asm...) I would talk to GCC folks—we have similar issues with changing the FPU rounding mode, I assume. Thanks, Florian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-02 16:34 ` Florian Weimer @ 2016-12-02 16:59 ` Joseph Myers 2016-12-02 18:21 ` Dave Martin 0 siblings, 1 reply; 30+ messages in thread From: Joseph Myers @ 2016-12-02 16:59 UTC (permalink / raw) To: Florian Weimer Cc: Dave Martin, Yao Qi, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On Fri, 2 Dec 2016, Florian Weimer wrote: > > However, it would be necessary to prevent GCC from moving any code > > across these statements -- in particular, SVE code that accesses > > VL-dependent data spilled on the stack is liable to go wrong if reordered > > with the above. So the sequence would need to go in an external > > function (or a single asm...) > > I would talk to GCC folks—we have similar issues with changing the FPU > rounding mode, I assume. In general, GCC doesn't track the implicit uses of thread-local state involved in floating-point exceptions and rounding modes, and so doesn't avoid moving code across manipulations of such state; there are various open bugs in this area (though many of the open bugs are for local rather than global issues with code generation or local optimizations not respecting exceptions and rounding modes, which are easier to fix). Hence glibc using various macros such as math_opt_barrier and math_force_eval which use asms to prevent such motion. I'm not familiar enough with the optimizers to judge the right way to address such issues with implicit use of thread-local state. And I haven't thought much yet about how to implement TS 18661-1 constant rounding modes, which would involve the compiler implicitly inserting rounding modes changes, though I think it would be fairly straightforward given underlying support for avoiding inappropriate code motion. -- Joseph S. 
Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-02 16:59 ` Joseph Myers @ 2016-12-02 18:21 ` Dave Martin 2016-12-02 21:57 ` Joseph Myers 0 siblings, 1 reply; 30+ messages in thread From: Dave Martin @ 2016-12-02 18:21 UTC (permalink / raw) To: Joseph Myers Cc: Florian Weimer, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, linux-arm-kernel, Alan Hayward, Torvald Riegel, Christoffer Dall On Fri, Dec 02, 2016 at 04:59:27PM +0000, Joseph Myers wrote: > On Fri, 2 Dec 2016, Florian Weimer wrote: > > > > However, it would be necessary to prevent GCC from moving any code > > > across these statements -- in particular, SVE code that accesses > > > VL-dependent data spilled on the stack is liable to go wrong if reordered > > > with the above. So the sequence would need to go in an external > > > function (or a single asm...) > > > > I would talk to GCC folks—we have similar issues with changing the FPU > > rounding mode, I assume. > > In general, GCC doesn't track the implicit uses of thread-local state > involved in floating-point exceptions and rounding modes, and so doesn't > avoid moving code across manipulations of such state; there are various > open bugs in this area (though many of the open bugs are for local rather > than global issues with code generation or local optimizations not > respecting exceptions and rounding modes, which are easier to fix). Hence > glibc using various macros such as math_opt_barrier and math_force_eval > which use asms to prevent such motion. Presumably the C language specs specify that fenv manipulations cannot be reordered with respect to evaluation of floating-point expressions? Sanity would seem to require this, though I've not dug into the specs myself yet. This doesn't get us off the hook for prctl() -- the C specs can only define constraints on reordering for things that appear in the C spec. prctl() is just an external function call in this context, and doesn't enjoy the same guarantees. 
> I'm not familiar enough with the optimizers to judge the right way to > address such issues with implicit use of thread-local state. And I > haven't thought much yet about how to implement TS 18661-1 constant > rounding modes, which would involve the compiler implicitly inserting > rounding modes changes, though I think it would be fairly straightforward > given underlying support for avoiding inappropriate code motion. My concern is that the compiler has no clue about what code motions are appropriate or not with respect to a system call, beyond what applies to a system call in general (i.e., asm volatile ( ::: "memory" ) for GCC). Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-02 18:21 ` Dave Martin @ 2016-12-02 21:57 ` Joseph Myers 0 siblings, 0 replies; 30+ messages in thread From: Joseph Myers @ 2016-12-02 21:57 UTC (permalink / raw) To: Dave Martin Cc: Florian Weimer, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, linux-arm-kernel, Alan Hayward, Torvald Riegel, Christoffer Dall On Fri, 2 Dec 2016, Dave Martin wrote: > Presumably the C language specs specify that fenv manipulations cannot > be reordered with respect to evaluation of floating-point expressions? Yes (in the context of #pragma STDC FENV_ACCESS ON). And you need to presume that an arbitrary function call might manipulate the environment unless you know it doesn't. -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:07 ` Dave Martin 2016-11-30 12:22 ` Szabolcs Nagy 2016-11-30 12:38 ` Florian Weimer @ 2016-12-02 21:56 ` Yao Qi 2016-12-05 15:12 ` Dave Martin 2016-12-05 22:42 ` Torvald Riegel 3 siblings, 1 reply; 30+ messages in thread From: Yao Qi @ 2016-12-02 21:56 UTC (permalink / raw) To: Dave Martin Cc: Florian Weimer, linux-arm-kernel, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Alan Hayward, Torvald Riegel, Christoffer Dall On 16-11-30 12:06:54, Dave Martin wrote: > So, my key goal is to support _per-process_ vector length control. > > From the kernel perspective, it is easiest to achieve this by providing > per-thread control since that is the unit that context switching acts > on. > Hi, Dave, Thanks for the explanation. > How useful it really is to have threads with different VLs in the same > process is an open question. It's theoretically useful for runtime > environments, which may want to dispatch code optimised for different > VLs -- changing the VL on-the-fly within a single thread is not > something I want to encourage, due to overhead and ABI issues, but > switching between threads of different VLs would be more manageable. This is a weird programming model. > However, I expect mixing different VLs within a single process to be > very much a special case -- it's not something I'd expect to work with > general-purpose code. > > Since the need for independent VLs per thread is not proven, we could > > * forbid it -- i.e., only a thread-group leader with no children is > permitted to change the VL, which is then inherited by any child threads > that are subsequently created > > * permit it only if a special flag is specified when requesting the VL > change > > * permit it and rely on userspace to be sensible -- easiest option for > the kernel. Both the first and the third one are reasonable to me, but the first one fits well in the existing GDB design. 
I don't know how useful it is to have per-thread VL; there may be some workloads that can be implemented that way. GDB needs some changes to support "per-thread" target description. -- Yao ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-02 21:56 ` Yao Qi @ 2016-12-05 15:12 ` Dave Martin 0 siblings, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-12-05 15:12 UTC (permalink / raw) To: Yao Qi Cc: Florian Weimer, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Christoffer Dall, Alan Hayward, Torvald Riegel, linux-arm-kernel On Fri, Dec 02, 2016 at 09:56:46PM +0000, Yao Qi wrote: > On 16-11-30 12:06:54, Dave Martin wrote: > > So, my key goal is to support _per-process_ vector length control. > > > > From the kernel perspective, it is easiest to achieve this by providing > > per-thread control since that is the unit that context switching acts > > on. > > > > Hi, Dave, > Thanks for the explanation. > > > How useful it really is to have threads with different VLs in the same > > process is an open question. It's theoretically useful for runtime > > environments, which may want to dispatch code optimised for different > > VLs -- changing the VL on-the-fly within a single thread is not > > something I want to encourage, due to overhead and ABI issues, but > > switching between threads of different VLs would be more manageable. > > This is a weird programming model. I may not have explained that very well. What I meant is, you have two threads communicating with one another, say. Providing that they don't exchange data using a VL-dependent representation, it should not matter that the two threads are running with different VLs. This may make sense if a particular piece of work was optimised for a particular VL: you can pick a worker thread with the correct VL and dispatch the job there for best performance. I wouldn't expect this ability to be exploited except by specialised frameworks. > > However, I expect mixing different VLs within a single process to be > > very much a special case -- it's not something I'd expect to work with > > general-purpose code. 
> > > > Since the need for independent VLs per thread is not proven, we could > > > > * forbid it -- i.e., only a thread-group leader with no children is > > permitted to change the VL, which is then inherited by any child threads > > that are subsequently created > > > > * permit it only if a special flag is specified when requesting the VL > > change > > > > * permit it and rely on userspace to be sensible -- easiest option for > > the kernel. > > Both the first and the third one are reasonable to me, but the first one > fits well in the existing GDB design. I don't know how useful it is to have > per-thread VL; there may be some workloads that can be implemented that way. > GDB needs some changes to support "per-thread" target description. OK -- I'll implement per-thread VL for now, but this can be clarified later. Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 12:07 ` Dave Martin ` (2 preceding siblings ...) 2016-12-02 21:56 ` Yao Qi @ 2016-12-05 22:42 ` Torvald Riegel 2016-12-06 14:46 ` Dave Martin 3 siblings, 1 reply; 30+ messages in thread From: Torvald Riegel @ 2016-12-05 22:42 UTC (permalink / raw) To: Dave Martin Cc: Florian Weimer, Yao Qi, linux-arm-kernel, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Alan Hayward, Christoffer Dall On Wed, 2016-11-30 at 12:06 +0000, Dave Martin wrote: > So, my key goal is to support _per-process_ vector length control. > > From the kernel perspective, it is easiest to achieve this by providing > per-thread control since that is the unit that context switching acts > on. > > How useful it really is to have threads with different VLs in the same > process is an open question. It's theoretically useful for runtime > environments, which may want to dispatch code optimised for different > VLs What would be the primary use case(s)? Vectorization of short vectors (eg, if having an array of structs or sth like that)? > -- changing the VL on-the-fly within a single thread is not > something I want to encourage, due to overhead and ABI issues, but > switching between threads of different VLs would be more manageable. So if on-the-fly switching is probably not useful, that would mean we need special threads for the use cases. Is that a realistic assumption for the use cases? Or do you primarily want to keep it possible to do this, regardless of whether there are real use cases now? I suppose allowing for a per-thread setting of VL could also be added as a feature in the future without breaking existing code. > For setcontext/setjmp, we don't save/restore any SVE state due to the > caller-save status of SVE, and I would not consider it necessary to > save/restore VL itself because of the no-change-on-the-fly policy for > this. 
Thus, you would basically consider VL changes or per-thread VL as in the realm of compilation internals? So, the specific size for a particular piece of code would not be part of an ABI? > I'm not familiar with resumable functions/executors -- are these in > the C++ standards yet (not that that would cause me to be familiar > with them... ;) Any implementation of coroutines (i.e., > cooperative switching) is likely to fall under the "setcontext" > argument above. These are not part of the C++ standard yet, but will appear in TSes. There are various features for which implementations would be assumed to use one OS thread for several tasks, coroutines, etc. Some of them switch between these tasks or coroutines while these are running, whereas the ones that will be in C++17 only run more than one parallel task on the same OS thread, but one after the other (like in a thread pool). However, if we are careful not to expose VL or make promises about it, this may just end up being a detail similar to, say, register allocation, which isn't exposed beyond the internals of a particular compiler either. Exposing it as a feature the user can set without messing with the implementation would introduce additional thread-specific state, as Florian said. This might not be a show-stopper by itself, but the more thread-specific state we have the more an implementation has to take care of or switch, and the higher the runtime costs are. C++17 already makes weaker promises for TLS for parallel tasks, so that implementations don't have to run TLS constructors or destructors just because a small parallel task was executed. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-12-05 22:42 ` Torvald Riegel @ 2016-12-06 14:46 ` Dave Martin 0 siblings, 0 replies; 30+ messages in thread From: Dave Martin @ 2016-12-06 14:46 UTC (permalink / raw) To: Torvald Riegel Cc: Florian Weimer, libc-alpha, Ard Biesheuvel, Marc Zyngier, gdb, Yao Qi, Christoffer Dall, Alan Hayward, linux-arm-kernel On Mon, Dec 05, 2016 at 11:42:19PM +0100, Torvald Riegel wrote: Hi there, > On Wed, 2016-11-30 at 12:06 +0000, Dave Martin wrote: > > So, my key goal is to support _per-process_ vector length control. > > > > From the kernel perspective, it is easiest to achieve this by providing > > per-thread control since that is the unit that context switching acts > > on. > > > > How useful it really is to have threads with different VLs in the same > > process is an open question. It's theoretically useful for runtime > > environments, which may want to dispatch code optimised for different > > VLs > > What would be the primary use case(s)? Vectorization of short vectors > (eg, if having an array of structs or sth like that)? I'm not sure exactly what you're asking here. SVE supports a regular SIMD-type computational model, along with scalable vectors and features for speculative vectorisation of loops whose iteration count is not statically known (or, possibly not known even at loop entry at runtime). It's intended as a compiler target, so any algorithm that involves iterative computation may get some benefit -- though the amount of benefit, and how the benefit scales with vector length, will depend on the algorithm in question. So some algorithms may get more benefit from large VLs than others. 
For jobs where performance tends to saturate at a shorter VL, it may make sense to get the compiler to compile for the shorter VL -- this may enable the same binary code to perform more optimally on a wider range of hardware, but that may also mean you want to run that job with the VL it was compiled for instead of what the hardware supports. In high-assurance scenarios, you might also want to restrict a particular job to run at the VL that you validated for. > > -- changing the VL on-the-fly within a single thread is not > > something I want to encourage, due to overhead and ABI issues, but > > switching between threads of different VLs would be more manageable. > > So if on-the-fly switching is probably not useful, that would mean we > need special threads for the use cases. Is that a realistic assumption > for the use cases? Or do you primarily want to keep it possible to do > this, regardless of whether there are real use cases now? > I suppose allowing for a per-thread setting of VL could also be added as > a feature in the future without breaking existing code. Per-thread VL use cases are hypothetical for now. It's easy to support per-thread VLs in the kernel, but we could deny it initially and wait for someone to come along with a concrete use case. > > For setcontext/setjmp, we don't save/restore any SVE state due to the > > caller-save status of SVE, and I would not consider it necessary to > > save/restore VL itself because of the no-change-on-the-fly policy for > > this. > > Thus, you would basically consider VL changes or per-thread VL as in the > realm of compilation internals? So, the specific size for a particular > piece of code would not be part of an ABI? Basically yes. For most people, this would be hidden in libc/ld.so/some framework. This goes for most prctl()s -- random user code shouldn't normally touch them unless it knows what it's doing. 
> > I'm not familiar with resumable functions/executors -- are these in > > the C++ standards yet (not that that would cause me to be familiar > > with them... ;) Any implementation of coroutines (i.e., > > cooperative switching) is likely to fall under the "setcontext" > > argument above. > > These are not part of the C++ standard yet, but will appear in TSes. > There are various features for which implementations would be assumed to > use one OS thread for several tasks, coroutines, etc. Some of them > switch between these tasks or coroutines while these are running, Is the switching ever preemptive? If not, then these features are unlikely to be a concern for SVE. It's preemptive switching that would require the saving of extra SVE state (which is why we need to care for signals). > whereas the ones that will be in C++17 only run more than one parallel task > on the same OS thread, but one after the other (like in a thread pool). If jobs are only run to completion before yielding, that again isn't a concern for SVE. > However, if we are careful not to expose VL or make promises about it, > this may just end up being a detail similar to, say, register > allocation, which isn't exposed beyond the internals of a particular > compiler either. > Exposing it as a feature the user can set without messing with the > implementation would introduce additional thread-specific state, as > Florian said. This might not be a show-stopper by itself, but the more > thread-specific state we have the more an implementation has to take > care of or switch, and the higher the runtime costs are. C++17 already > makes weaker promises for TLS for parallel tasks, so that > implementations don't have to run TLS constructors or destructors just > because a small parallel task was executed. There's a difference between a feature that is exposed by the kernel, and a feature endorsed by the language / runtime. 
For example, random code can enable seccomp via prctl(PR_SET_SECCOMP) -- this may make most of libc unsafe to use, because under strict seccomp most syscalls simply kill the thread. libc doesn't pretend to support this out of the box, but this feature is also not needlessly denied to user code that knows what it's doing. I tend to put setting the VL into this category: it is safe, and useful or even necessary to change the VL in some situations, but userspace is responsible for managing this for itself. The kernel doesn't have enough information to make these decisions unilaterally. Cheers ---Dave ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin ` (5 preceding siblings ...) 2016-11-30 9:56 ` [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Yao Qi @ 2016-11-30 10:08 ` Florian Weimer 2016-11-30 11:06 ` Szabolcs Nagy 6 siblings, 1 reply; 30+ messages in thread From: Florian Weimer @ 2016-11-30 10:08 UTC (permalink / raw) To: Dave Martin, linux-arm-kernel Cc: Christoffer Dall, Ard Biesheuvel, Marc Zyngier, Alan Hayward, libc-alpha, gdb, Torvald Riegel On 11/25/2016 08:38 PM, Dave Martin wrote: > The Scalable Vector Extension (SVE) [1] is an extension to AArch64 which > adds extra SIMD functionality and supports much larger vectors. > > This series implements core Linux support for SVE. > > Recipents not copied on the whole series can find the rest of the > patches in the linux-arm-kernel archives [2]. > > > The first 5 patches "arm64: signal: ..." factor out the allocation and > placement of state information in the signal frame. The first three > are prerequisites for the SVE support patches. > > Patches 04-05 implement expansion of the signal frame, and may remain > controversial due to ABI break issues: > > * Discussion is needed on how userspace should detect/negotiate signal > frame size in order for this expansion mechanism to be workable. I'm leaning towards a simple increase in the glibc headers (despite the ABI risk), plus a personality flag to disable really wide vector registers in case this causes problems with old binaries. A more elaborate mechanism will likely introduce more bugs than it makes existing applications working, due to its complexity. > The remaining patches implement initial SVE support for Linux, with the > following limitations: > > * No KVM/virtualisation support for guests. > > * No independent SVE vector length configuration per thread. This is > planned, but will follow as a separate add-on series. 
Per-thread register widths will likely make coroutine switching (setcontext) and C++ resumable functions/executors quite challenging. Can you detail your plans in this area? Thanks, Florian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support 2016-11-30 10:08 ` Florian Weimer @ 2016-11-30 11:06 ` Szabolcs Nagy 2016-11-30 14:06 ` Dave Martin 0 siblings, 1 reply; 30+ messages in thread From: Szabolcs Nagy @ 2016-11-30 11:06 UTC (permalink / raw) To: Florian Weimer, Dave Martin, linux-arm-kernel Cc: nd, Christoffer Dall, Ard Biesheuvel, Marc Zyngier, Alan Hayward, libc-alpha, gdb, Torvald Riegel On 30/11/16 10:08, Florian Weimer wrote: > On 11/25/2016 08:38 PM, Dave Martin wrote: >> The Scalable Vector Extension (SVE) [1] is an extension to AArch64 which >> adds extra SIMD functionality and supports much larger vectors. >> >> This series implements core Linux support for SVE. >> >> Recipents not copied on the whole series can find the rest of the >> patches in the linux-arm-kernel archives [2]. >> >> >> The first 5 patches "arm64: signal: ..." factor out the allocation and >> placement of state information in the signal frame. The first three >> are prerequisites for the SVE support patches. >> >> Patches 04-05 implement expansion of the signal frame, and may remain >> controversial due to ABI break issues: >> >> * Discussion is needed on how userspace should detect/negotiate signal >> frame size in order for this expansion mechanism to be workable. > > I'm leaning towards a simple increase in the glibc headers (despite the ABI risk), plus a personality flag to > disable really wide vector registers in case this causes problems with old binaries. > if the kernel does not increase the size and libc does not add size checks then old binaries would work with new libc just fine.. but that's non-conforming, posix requires the check. if the kernel increases the size then it has to be changed in bionic and musl as well and old binaries may break. > A more elaborate mechanism will likely introduce more bugs than it makes existing applications working, due to > its complexity. 
> >> The remaining patches implement initial SVE support for Linux, with the >> following limitations: >> >> * No KVM/virtualisation support for guests. >> >> * No independent SVE vector length configuration per thread. This is >> planned, but will follow as a separate add-on series. > > Per-thread register widths will likely make coroutine switching (setcontext) and C++ resumable > functions/executors quite challenging. > i'd assume it's undefined to context switch to a different thread or to resume a function on a different thread (because the implementation can cache thread local state on the stack: e.g. errno pointer).. of course this does not stop ppl from doing it, but the practice is questionable. > Can you detail your plans in this area? > > Thanks, > Florian ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 00/29] arm64: Scalable Vector Extension core support
  2016-11-30 11:06 ` Szabolcs Nagy
@ 2016-11-30 14:06 ` Dave Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Martin @ 2016-11-30 14:06 UTC (permalink / raw)
To: Szabolcs Nagy
Cc: Florian Weimer, linux-arm-kernel, Torvald Riegel, libc-alpha,
	Ard Biesheuvel, Marc Zyngier, gdb, Alan Hayward, nd,
	Christoffer Dall

On Wed, Nov 30, 2016 at 11:05:41AM +0000, Szabolcs Nagy wrote:
> On 30/11/16 10:08, Florian Weimer wrote:
> > On 11/25/2016 08:38 PM, Dave Martin wrote:

[...]

> >> * Discussion is needed on how userspace should detect/negotiate signal
> >>   frame size in order for this expansion mechanism to be workable.
> >
> > I'm leaning towards a simple increase in the glibc headers (despite the
> > ABI risk), plus a personality flag to disable really wide vector
> > registers in case this causes problems with old binaries.
>
> If the kernel does not increase the size and libc does not add size
> checks, then old binaries would work with new libc just fine, but
> that's non-conforming: POSIX requires the check.
>
> If the kernel increases the size, then it has to be changed in bionic
> and musl as well, and old binaries may break.

Or we need a personality flag or similar to distinguish the two cases.

[...]

> > A more elaborate mechanism will likely introduce more bugs than it
> > makes existing applications work, due to its complexity.
>
> >> The remaining patches implement initial SVE support for Linux, with the
> >> following limitations:
> >>
> >> * No KVM/virtualisation support for guests.
> >>
> >> * No independent SVE vector length configuration per thread.  This is
> >>   planned, but will follow as a separate add-on series.
> >
> > Per-thread register widths will likely make coroutine switching
> > (setcontext) and C++ resumable functions/executors quite challenging.
>
> I'd assume it's undefined to context-switch to a different thread or to
> resume a function on a different thread (because the implementation can
> cache thread-local state on the stack, e.g. the errno pointer).  Of
> course this does not stop people from doing it, but the practice is
> questionable.

I don't have a view on this.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2016-12-06 14:46 UTC | newest]

Thread overview: 30+ messages
2016-11-25 19:39 [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Dave Martin
2016-11-25 19:41 ` [RFC PATCH 16/29] arm64/sve: signal: Add SVE state record to sigcontext Dave Martin
2016-11-25 19:41 ` [RFC PATCH 24/29] arm64/sve: Discard SVE state on system call Dave Martin
2016-11-25 19:41 ` [RFC PATCH 18/29] arm64/sve: signal: Restore FPSIMD/SVE state in rt_sigreturn Dave Martin
2016-11-25 19:41 ` [RFC PATCH 17/29] arm64/sve: signal: Dump Scalable Vector Extension registers to user stack Dave Martin
2016-11-25 19:42 ` [RFC PATCH 27/29] arm64/sve: ptrace support Dave Martin
2016-11-30  9:56 ` [RFC PATCH 00/29] arm64: Scalable Vector Extension core support Yao Qi
2016-11-30 12:07 ` Dave Martin
2016-11-30 12:22 ` Szabolcs Nagy
2016-11-30 14:10 ` Dave Martin
2016-11-30 12:38 ` Florian Weimer
2016-11-30 13:56 ` Dave Martin
2016-12-01  9:21 ` Florian Weimer
2016-12-01 10:30 ` Dave Martin
2016-12-01 12:19 ` Dave Martin
2016-12-05 10:44 ` Florian Weimer
2016-12-05 11:07 ` Szabolcs Nagy
2016-12-05 15:05 ` Dave Martin
2016-12-02 11:49 ` Dave Martin
2016-12-02 16:34 ` Florian Weimer
2016-12-02 16:59 ` Joseph Myers
2016-12-02 18:21 ` Dave Martin
2016-12-02 21:57 ` Joseph Myers
2016-12-02 21:56 ` Yao Qi
2016-12-05 15:12 ` Dave Martin
2016-12-05 22:42 ` Torvald Riegel
2016-12-06 14:46 ` Dave Martin
2016-11-30 10:08 ` Florian Weimer
2016-11-30 11:06 ` Szabolcs Nagy
2016-11-30 14:06 ` Dave Martin