Discussion:
[RFC net-next v2] bridge lwtunnel, VPLS & NVGRE
(too old to reply)
David Lamparter
2017-08-21 17:15:17 UTC
Permalink
Hi all,


this is an update on the earlier "[RFC net-next] VPLS support". Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)

As previously, iproute2 / FRR patches are at:
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
while this patchset is also available at:
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)

The NVGRE implementation in the 3rd patch in this series is actually an
accident - I was just wiring up gretap as a reference; only after I was
done I noticed that that sums up to NVGRE, more or less. IMHO, it does
serve well to demonstrate the bridge changes are not VPLS-specific.
I've tested some basic setups, the chain from LDP down into the kernel
works at least in these. FRR has some testcases around from OpenBSD
VPLS support, I haven't wired that up to run against Linux / this
patchset yet.
Same as before (API didn't change).
The patchset needs a lot of polishing (yes I left my TODO notes in the
commit messages), for now my primary concern is overall design
feedback. Roopa has already provided a lot of input (Thanks!); the
major topic I'm expecting to get discussion on is the bridge FDB
changes.
Got some useful input; but still need feedback on the bridge FDB
changes (first 2 patches). I don't believe it to have a significant
impact on existing bridge operation, and I believe a multipoint tunnel
driver without its own FDB (e.g. NVGRE in this set) should perform
better than one with its own FDB (e.g. existing VXLAN).
P.S.: For a little context on the bridge FDB changes - I'm hoping to
find some time to extend this to the MDB to allow aggregating dst
metadata and handing down a list of dst metas on TX. This isn't
specifically for VPLS but rather to give sufficient information to the
802.11 stack to allow it to optimize selecting rates (or unicasting)
for multicast traffic by having the multicast subscriber list known.
This is done by major commercial wifi solutions (e.g. google "dynamic
multicast optimization".)
You can find hacks at this on:
https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
Please note that the patches in that branch are not at an acceptable
quality level, but you can see the semantic relation to 802.11.

I would, however, like to point out that this branch has pseudo-working
IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
(I'll do that as soon as I get to it, it'll pop up on that branch too.)

This is relevant to the discussion because it's a feature which is
non-obvious (to me) on how to do with the VXLAN model of having an
entirely separate FDB. Meanwhile, with this architecture, the proof of
concept / hack is coming in at a measly cost of:
8 files changed, 176 insertions(+), 15 deletions(-)


Cheers,

-David


--- diffstat:
include/linux/netdevice.h | 18 ++++++
include/net/dst_metadata.h | 51 ++++++++++++++---
include/net/ip_tunnels.h | 5 ++
include/uapi/linux/lwtunnel.h | 8 +++
include/uapi/linux/neighbour.h | 2 +
include/uapi/linux/rtnetlink.h | 5 ++
net/bridge/br.c | 2 +-
net/bridge/br_device.c | 4 ++
net/bridge/br_fdb.c | 119 ++++++++++++++++++++++++++++++++--------
net/bridge/br_input.c | 6 +-
net/bridge/br_private.h | 6 +-
net/core/lwtunnel.c | 1 +
net/ipv4/ip_gre.c | 40 ++++++++++++--
net/ipv4/ip_tunnel.c | 1 +
net/ipv4/ip_tunnel_core.c | 87 +++++++++++++++++++++++------
net/mpls/Kconfig | 11 ++++
net/mpls/Makefile | 1 +
net/mpls/af_mpls.c | 113 ++++++++++++++++++++++++++++++++------
net/mpls/internal.h | 44 +++++++++++++--
net/mpls/vpls.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
20 files changed, 990 insertions(+), 84 deletions(-)
David Lamparter
2017-08-21 17:15:18 UTC
Permalink
This implements holding tunnel config in the form of metadata_dst
information in the bridge layer, though only for unicast right now.
Multicast is still left to design and implement.

While struct lwtunnel_state might seem the more appropriate structure to
use here, there are two problems with that:
- I haven't found a good way to stuff it onto a SKB
(there's dst_entry->lwtstate, but if we're adding a dst, we might as
well go with a metadata_dst)
- it also needs to propagate upwards on received packets, which is
already in place for tunnel metadata collection

[v2: fixed race in fdb update with atomic_xchg]
[v3: consistently use metadata_dst pointer]
[v4: patch renamed]
Signed-off-by: David Lamparter <***@diac24.net>
---
include/net/dst_metadata.h | 27 ++++++++++++++++++---------
net/bridge/br_device.c | 4 ++++
net/bridge/br_fdb.c | 46 ++++++++++++++++++++++++++++++++--------------
net/bridge/br_input.c | 6 ++++--
net/bridge/br_private.h | 5 ++++-
5 files changed, 62 insertions(+), 26 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index a803129a4849..4bcc0f314853 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -24,7 +24,7 @@ struct metadata_dst {
} u;
};

-static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
+static inline struct metadata_dst *skb_metadata_dst(const struct sk_buff *skb)
{
struct metadata_dst *md_dst = (struct metadata_dst *) skb_dst(skb);

@@ -34,6 +34,11 @@ static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
return NULL;
}

+static inline struct metadata_dst *metadata_dst_clone(struct metadata_dst *md_dst)
+{
+ return (struct metadata_dst *)dst_clone(&md_dst->dst);
+}
+
static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
{
struct metadata_dst *md_dst = skb_metadata_dst(skb);
@@ -56,17 +61,12 @@ static inline bool skb_valid_dst(const struct sk_buff *skb)
return dst && !(dst->flags & DST_METADATA);
}

-static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
- const struct sk_buff *skb_b)
+static inline int metadata_dst_cmp(const struct metadata_dst *a,
+ const struct metadata_dst *b)
{
- const struct metadata_dst *a, *b;
-
- if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
+ if (!(a || b))
return 0;

- a = (const struct metadata_dst *) skb_dst(skb_a);
- b = (const struct metadata_dst *) skb_dst(skb_b);
-
if (!a != !b || a->type != b->type)
return 1;

@@ -83,6 +83,15 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
}
}

+static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
+ const struct sk_buff *skb_b)
+{
+ if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
+ return 0;
+ return metadata_dst_cmp(skb_metadata_dst(skb_a),
+ skb_metadata_dst(skb_b));
+}
+
void metadata_dst_free(struct metadata_dst *);
struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
gfp_t flags);
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 861ae2a165f4..f98bc2016ddd 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -53,6 +53,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
brstats->tx_bytes += skb->len;
u64_stats_update_end(&brstats->syncp);

+ skb_dst_drop(skb);
BR_INPUT_SKB_CB(skb)->brdev = dev;

skb_reset_mac_header(skb);
@@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
else
br_flood(br, skb, BR_PKT_MULTICAST, false, true);
} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
+ struct metadata_dst *md_dst = rcu_dereference(dst->md_dst);
+ if (md_dst)
+ skb_dst_set_noref(skb, &md_dst->dst);
br_forward(dst->dst, skb, false, true);
} else {
br_flood(br, skb, BR_PKT_UNICAST, false, true);
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index a79b648aac88..6ac3b916c39b 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -25,11 +25,13 @@
#include <asm/unaligned.h>
#include <linux/if_vlan.h>
#include <net/switchdev.h>
+#include <net/dst_metadata.h>
#include "br_private.h"

static struct kmem_cache *br_fdb_cache __read_mostly;
static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
- const unsigned char *addr, u16 vid);
+ struct metadata_dst *md_dst, const unsigned char *addr,
+ u16 vid);
static void fdb_notify(struct net_bridge *br,
const struct net_bridge_fdb_entry *, int);

@@ -174,6 +176,8 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
if (f->is_static)
fdb_del_hw_addr(br, f->addr.addr);

+ dst_release(&(rcu_access_pointer(f->md_dst)->dst));
+
hlist_del_init_rcu(&f->hlist);
fdb_notify(br, f, RTM_DELNEIGH);
call_rcu(&f->rcu, fdb_rcu_free);
@@ -260,7 +264,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)

insert:
/* insert new address, may fail if invalid address or dup. */
- fdb_insert(br, p, newaddr, 0);
+ fdb_insert(br, p, NULL, newaddr, 0);

if (!vg || !vg->num_vlans)
goto done;
@@ -270,7 +274,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
* from under us.
*/
list_for_each_entry(v, &vg->vlan_list, vlist)
- fdb_insert(br, p, newaddr, v->vid);
+ fdb_insert(br, p, NULL, newaddr, v->vid);

done:
spin_unlock_bh(&br->hash_lock);
@@ -289,10 +293,11 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
if (f && f->is_local && !f->dst && !f->added_by_user)
fdb_delete_local(br, NULL, f);

- fdb_insert(br, NULL, newaddr, 0);
+ fdb_insert(br, NULL, NULL, newaddr, 0);
vg = br_vlan_group(br);
if (!vg || !vg->num_vlans)
goto out;
+
/* Now remove and add entries for every VLAN configured on the
* bridge. This function runs under RTNL so the bitmap will not
* change from under us.
@@ -303,7 +308,7 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
f = br_fdb_find(br, br->dev->dev_addr, v->vid);
if (f && f->is_local && !f->dst && !f->added_by_user)
fdb_delete_local(br, NULL, f);
- fdb_insert(br, NULL, newaddr, v->vid);
+ fdb_insert(br, NULL, NULL, newaddr, v->vid);
}
out:
spin_unlock_bh(&br->hash_lock);
@@ -477,6 +482,7 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,

static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
struct net_bridge_port *source,
+ struct metadata_dst *md_dst,
const unsigned char *addr,
__u16 vid,
unsigned char is_local,
@@ -488,6 +494,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
if (fdb) {
memcpy(fdb->addr.addr, addr, ETH_ALEN);
fdb->dst = source;
+ rcu_assign_pointer(fdb->md_dst, metadata_dst_clone(md_dst));
fdb->vlan_id = vid;
fdb->is_local = is_local;
fdb->is_static = is_static;
@@ -501,7 +508,8 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
}

static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
- const unsigned char *addr, u16 vid)
+ struct metadata_dst *md_dst, const unsigned char *addr,
+ u16 vid)
{
struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
struct net_bridge_fdb_entry *fdb;
@@ -521,7 +529,7 @@ static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
fdb_delete(br, fdb);
}

- fdb = fdb_create(head, source, addr, vid, 1, 1);
+ fdb = fdb_create(head, source, md_dst, addr, vid, 1, 1);
if (!fdb)
return -ENOMEM;

@@ -537,13 +545,14 @@ int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
int ret;

spin_lock_bh(&br->hash_lock);
- ret = fdb_insert(br, source, addr, vid);
+ ret = fdb_insert(br, source, NULL, addr, vid);
spin_unlock_bh(&br->hash_lock);
return ret;
}

void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
- const unsigned char *addr, u16 vid, bool added_by_user)
+ struct metadata_dst *md_dst, const unsigned char *addr,
+ u16 vid, bool added_by_user)
{
struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
struct net_bridge_fdb_entry *fdb;
@@ -567,10 +576,19 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
source->dev->name, addr, vid);
} else {
unsigned long now = jiffies;
+ struct metadata_dst *md_ref;
+
+ md_ref = rcu_access_pointer(fdb->md_dst);

/* fastpath: update of existing entry */
- if (unlikely(source != fdb->dst)) {
+ if (unlikely(source != fdb->dst ||
+ metadata_dst_cmp(md_dst, md_ref))) {
fdb->dst = source;
+
+ md_ref = xchg(&fdb->md_dst,
+ metadata_dst_clone(md_dst));
+ dst_release(&md_ref->dst);
+
fdb_modified = true;
/* Take over HW learned entry */
if (unlikely(fdb->added_by_external_learn))
@@ -586,7 +604,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
} else {
spin_lock(&br->hash_lock);
if (likely(!fdb_find_rcu(head, addr, vid))) {
- fdb = fdb_create(head, source, addr, vid, 0, 0);
+ fdb = fdb_create(head, source, md_dst, addr, vid, 0, 0);
if (fdb) {
if (unlikely(added_by_user))
fdb->added_by_user = 1;
@@ -781,7 +799,7 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
if (!(flags & NLM_F_CREATE))
return -ENOENT;

- fdb = fdb_create(head, source, addr, vid, 0, 0);
+ fdb = fdb_create(head, source, NULL, addr, vid, 0, 0);
if (!fdb)
return -ENOMEM;

@@ -844,7 +862,7 @@ static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
}
local_bh_disable();
rcu_read_lock();
- br_fdb_update(br, p, addr, vid, true);
+ br_fdb_update(br, p, NULL, addr, vid, true);
rcu_read_unlock();
local_bh_enable();
} else if (ndm->ndm_flags & NTF_EXT_LEARNED) {
@@ -1071,7 +1089,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
head = &br->hash[br_mac_hash(addr, vid)];
fdb = br_fdb_find(br, addr, vid);
if (!fdb) {
- fdb = fdb_create(head, p, addr, vid, 0, 0);
+ fdb = fdb_create(head, p, NULL, addr, vid, 0, 0);
if (!fdb) {
err = -ENOMEM;
goto err_unlock;
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7637f58c1226..3fd0fab49de2 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -150,7 +150,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
/* insert into forwarding database after filtering to avoid spoofing */
br = p->br;
if (p->flags & BR_LEARNING)
- br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, false);
+ br_fdb_update(br, p, skb_metadata_dst(skb),
+ eth_hdr(skb)->h_source, vid, false);

local_rcv = !!(br->dev->flags & IFF_PROMISC);
dest = eth_hdr(skb)->h_dest;
@@ -230,7 +231,8 @@ static void __br_handle_local_finish(struct sk_buff *skb)

/* check if vlan is allowed, to avoid spoofing */
if (p->flags & BR_LEARNING && br_should_learn(p, skb, &vid))
- br_fdb_update(p->br, p, eth_hdr(skb)->h_source, vid, false);
+ br_fdb_update(p->br, p, skb_metadata_dst(skb),
+ eth_hdr(skb)->h_source, vid, false);
}

/* note: already called with rcu_read_lock */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index fd9ee73e0a6d..66d33352681f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -22,6 +22,7 @@
#include <linux/if_vlan.h>
#include <linux/rhashtable.h>
#include <linux/refcount.h>
+#include <net/dst_metadata.h>

#define BR_HASH_BITS 8
#define BR_HASH_SIZE (1 << BR_HASH_BITS)
@@ -164,6 +165,7 @@ struct net_bridge_vlan_group {
struct net_bridge_fdb_entry {
struct hlist_node hlist;
struct net_bridge_port *dst;
+ struct metadata_dst __rcu *md_dst;

mac_addr addr;
__u16 vlan_id;
@@ -524,7 +526,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf, unsigned long count,
int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
const unsigned char *addr, u16 vid);
void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
- const unsigned char *addr, u16 vid, bool added_by_user);
+ struct metadata_dst *md_dst, const unsigned char *addr,
+ u16 vid, bool added_by_user);

int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
struct net_device *dev, const unsigned char *addr, u16 vid);
--
2.13.0
David Lamparter
2017-08-21 17:15:19 UTC
Permalink
This makes each FDB entry's metadata dst accessible through the same
ENCAP uapi as lwtunnel uses. The function signature is slightly
different due to metadata_dst <> lwtunnel_state.

Netlink encapsulation is done by callbacks in net_device_ops. This is
because the metadata is always used in the context of a port / device on
the bridge; it's not meaningful in a "vacuum". It makes no sense to
allow inputting metadata of a type that doesn't match the device (where
in lwtunnel it does, by just switching the encapsulation.) Also, this
way a device can do extended checks of the validity of incoming data
from the user, ensuring it is actually usable.

Note this is not related to ndo_fill_metadata_dst(), that one is used
only by OVS and operates on a packet that is currently being switched,
i.e. data plane. The API in this patch is control plane.

[TODO: maybe just pass the entire netlink attr block down?]
Signed-off-by: David Lamparter <***@diac24.net>
---
include/linux/netdevice.h | 18 +++++++++
include/net/ip_tunnels.h | 5 +++
include/uapi/linux/neighbour.h | 2 +
net/bridge/br.c | 2 +-
net/bridge/br_fdb.c | 79 +++++++++++++++++++++++++++++++-------
net/bridge/br_private.h | 1 +
net/ipv4/ip_tunnel_core.c | 87 +++++++++++++++++++++++++++++++++---------
7 files changed, 162 insertions(+), 32 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0f1c4cb2441e..2de46f8b3f4f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -828,6 +828,8 @@ struct xfrmdev_ops {
};
#endif

+struct metadata_dst;
+
/*
* This structure defines the management hooks for network devices.
* The following hooks can be defined; unless noted otherwise, they are
@@ -1128,6 +1130,15 @@ struct xfrmdev_ops {
* void (*ndo_xdp_flush)(struct net_device *dev);
* This function is used to inform the driver to flush a paticular
* xpd tx queue. Must be called on same CPU as xdp_xmit.
+ * int (*ndo_metadst_fill)(struct sk_buff *skb, struct metadata_dst *dst);
+ * Used to encapsulate a metadata_dst that is associated with this
+ * netdevice into the appropriate netlink attributes on skb.
+ * Needs to return a lwtunnel_encap_types value if valid data was filled.
+ * int (*ndo_metadst_build)(struct net_device *dev, struct nlattr *meta,
+ * struct metadata_dst **dst,
+ * struct netlink_ext_ack *extack);
+ * Reverse of the previous function, build a metadata_dst from netlink
+ * attributes. Should perform appropriate validation.
*/
struct net_device_ops {
int (*ndo_init)(struct net_device *dev);
@@ -1314,6 +1325,13 @@ struct net_device_ops {
int (*ndo_xdp_xmit)(struct net_device *dev,
struct xdp_buff *xdp);
void (*ndo_xdp_flush)(struct net_device *dev);
+
+ int (*ndo_metadst_fill)(struct sk_buff *skb,
+ struct metadata_dst *dst);
+ int (*ndo_metadst_build)(struct net_device *dev,
+ struct nlattr *meta,
+ struct metadata_dst **dst,
+ struct netlink_ext_ack *extack);
};

/**
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 520809912f03..e6181fb83324 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -451,6 +451,11 @@ void __init ip_tunnel_core_init(void);
void ip_tunnel_need_metadata(void);
void ip_tunnel_unneed_metadata(void);

+int ip_tunnel_fill_metadst(struct sk_buff *skb, struct metadata_dst *md_dst);
+int ip_tunnel_build_metadst(struct net_device *dev, struct nlattr *meta,
+ struct metadata_dst **dst,
+ struct netlink_ext_ack *extack);
+
#else /* CONFIG_INET */

static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
index 3199d28980b3..cd98ce4b8dd9 100644
--- a/include/uapi/linux/neighbour.h
+++ b/include/uapi/linux/neighbour.h
@@ -27,6 +27,8 @@ enum {
NDA_MASTER,
NDA_LINK_NETNSID,
NDA_SRC_VNI,
+ NDA_ENCAP_TYPE,
+ NDA_ENCAP,
__NDA_MAX
};

diff --git a/net/bridge/br.c b/net/bridge/br.c
index 1407d1ba7577..822dfcef2649 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -140,7 +140,7 @@ static int br_switchdev_event(struct notifier_block *unused,
switch (event) {
case SWITCHDEV_FDB_ADD_TO_BRIDGE:
fdb_info = ptr;
- err = br_fdb_external_learn_add(br, p, fdb_info->addr,
+ err = br_fdb_external_learn_add(br, p, NULL, fdb_info->addr,
fdb_info->vid);
if (err) {
err = notifier_from_errno(err);
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 6ac3b916c39b..452d88bab1a0 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -671,6 +671,27 @@ static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br,

if (fdb->vlan_id && nla_put(skb, NDA_VLAN, sizeof(u16), &fdb->vlan_id))
goto nla_put_failure;
+ if (fdb->md_dst && fdb->dst) {
+ struct net_device *dev = fdb->dst->dev;
+
+ if (dev->netdev_ops &&
+ dev->netdev_ops->ndo_metadst_fill) {
+ struct nlattr *nest;
+ int ret;
+
+ nest = nla_nest_start(skb, NDA_ENCAP);
+ if (!nest)
+ goto nla_put_failure;
+ ret = dev->netdev_ops->ndo_metadst_fill(skb,
+ fdb->md_dst);
+ if (ret < 0)
+ goto nla_put_failure;
+ nla_nest_end(skb, nest);
+
+ if (ret && nla_put_u16(skb, NDA_ENCAP_TYPE, ret))
+ goto nla_put_failure;
+ }
+ }

nlmsg_end(skb, nlh);
return 0;
@@ -776,10 +797,12 @@ int br_fdb_dump(struct sk_buff *skb,

/* Update (create or replace) forwarding database entry */
static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
- const __u8 *addr, __u16 state, __u16 flags, __u16 vid)
+ struct metadata_dst *md_dst, const __u8 *addr,
+ __u16 state, __u16 flags, __u16 vid)
{
struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
struct net_bridge_fdb_entry *fdb;
+ struct metadata_dst *old_dst;
bool modified = false;

/* If the port cannot learn allow only local and static entries */
@@ -799,7 +822,7 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
if (!(flags & NLM_F_CREATE))
return -ENOENT;

- fdb = fdb_create(head, source, NULL, addr, vid, 0, 0);
+ fdb = fdb_create(head, source, md_dst, addr, vid, 0, 0);
if (!fdb)
return -ENOMEM;

@@ -810,6 +833,11 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,

if (fdb->dst != source) {
fdb->dst = source;
+
+ old_dst = xchg(&fdb->md_dst,
+ metadata_dst_clone(md_dst));
+ dst_release(&old_dst->dst);
+
modified = true;
}
}
@@ -849,8 +877,8 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
}

static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
- struct net_bridge_port *p, const unsigned char *addr,
- u16 nlh_flags, u16 vid)
+ struct net_bridge_port *p, struct metadata_dst *md_dst,
+ const unsigned char *addr, u16 nlh_flags, u16 vid)
{
int err = 0;

@@ -862,14 +890,14 @@ static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
}
local_bh_disable();
rcu_read_lock();
- br_fdb_update(br, p, NULL, addr, vid, true);
+ br_fdb_update(br, p, md_dst, addr, vid, true);
rcu_read_unlock();
local_bh_enable();
} else if (ndm->ndm_flags & NTF_EXT_LEARNED) {
- err = br_fdb_external_learn_add(br, p, addr, vid);
+ err = br_fdb_external_learn_add(br, p, md_dst, addr, vid);
} else {
spin_lock_bh(&br->hash_lock);
- err = fdb_add_entry(br, p, addr, ndm->ndm_state,
+ err = fdb_add_entry(br, p, md_dst, addr, ndm->ndm_state,
nlh_flags, vid);
spin_unlock_bh(&br->hash_lock);
}
@@ -886,6 +914,7 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
struct net_bridge_port *p = NULL;
struct net_bridge_vlan *v;
struct net_bridge *br = NULL;
+ struct metadata_dst *md_dst = NULL;
int err = 0;

if (!(ndm->ndm_state & (NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE))) {
@@ -898,6 +927,22 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
return -EINVAL;
}

+ if (tb[NDA_ENCAP_TYPE] && tb[NDA_ENCAP]) {
+ if (!dev->netdev_ops ||
+ !dev->netdev_ops->ndo_metadst_build) {
+ pr_info("bridge: target device does not support ENCAP\n");
+ return -EINVAL;
+ }
+
+ err = dev->netdev_ops->ndo_metadst_build(dev, tb[NDA_ENCAP],
+ &md_dst, NULL);
+ if (err)
+ return err;
+ } else if (tb[NDA_ENCAP_TYPE] || tb[NDA_ENCAP]) {
+ pr_info("bridge: RTM_NEWNEIGH with unpaired ENCAP_TYPE / ENCAP\n");
+ return -EINVAL;
+ }
+
if (dev->priv_flags & IFF_EBRIDGE) {
br = netdev_priv(dev);
vg = br_vlan_group(br);
@@ -906,7 +951,8 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
if (!p) {
pr_info("bridge: RTM_NEWNEIGH %s not a bridge port\n",
dev->name);
- return -EINVAL;
+ err = -EINVAL;
+ goto out;
}
br = p->br;
vg = nbp_vlan_group(p);
@@ -916,13 +962,14 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
v = br_vlan_find(vg, vid);
if (!v || !br_vlan_should_use(v)) {
pr_info("bridge: RTM_NEWNEIGH with unconfigured vlan %d on %s\n", vid, dev->name);
- return -EINVAL;
+ err = -EINVAL;
+ goto out;
}

/* VID was specified, so use it. */
- err = __br_fdb_add(ndm, br, p, addr, nlh_flags, vid);
+ err = __br_fdb_add(ndm, br, p, md_dst, addr, nlh_flags, vid);
} else {
- err = __br_fdb_add(ndm, br, p, addr, nlh_flags, 0);
+ err = __br_fdb_add(ndm, br, p, md_dst, addr, nlh_flags, 0);
if (err || !vg || !vg->num_vlans)
goto out;

@@ -933,13 +980,14 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
list_for_each_entry(v, &vg->vlan_list, vlist) {
if (!br_vlan_should_use(v))
continue;
- err = __br_fdb_add(ndm, br, p, addr, nlh_flags, v->vid);
+ err = __br_fdb_add(ndm, br, p, md_dst, addr, nlh_flags, v->vid);
if (err)
goto out;
}
}

out:
+ dst_release(&md_dst->dst);
return err;
}

@@ -1077,9 +1125,11 @@ void br_fdb_unsync_static(struct net_bridge *br, struct net_bridge_port *p)
}

int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
+ struct metadata_dst *md_dst,
const unsigned char *addr, u16 vid)
{
struct net_bridge_fdb_entry *fdb;
+ struct metadata_dst *old_dst;
struct hlist_head *head;
bool modified = false;
int err = 0;
@@ -1089,7 +1139,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
head = &br->hash[br_mac_hash(addr, vid)];
fdb = br_fdb_find(br, addr, vid);
if (!fdb) {
- fdb = fdb_create(head, p, NULL, addr, vid, 0, 0);
+ fdb = fdb_create(head, p, md_dst, addr, vid, 0, 0);
if (!fdb) {
err = -ENOMEM;
goto err_unlock;
@@ -1101,6 +1151,9 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,

if (fdb->dst != p) {
fdb->dst = p;
+ old_dst = xchg(&fdb->md_dst,
+ metadata_dst_clone(md_dst));
+ dst_release(&old_dst->dst);
modified = true;
}

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 66d33352681f..dd426ccf7475 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -538,6 +538,7 @@ int br_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
int br_fdb_sync_static(struct net_bridge *br, struct net_bridge_port *p);
void br_fdb_unsync_static(struct net_bridge *br, struct net_bridge_port *p);
int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
+ struct metadata_dst *md_dst,
const unsigned char *addr, u16 vid);
int br_fdb_external_learn_del(struct net_bridge *br, struct net_bridge_port *p,
const unsigned char *addr, u16 vid);
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 2f39479be92f..9f921d4e2544 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -228,13 +228,10 @@ static const struct nla_policy ip_tun_policy[LWTUNNEL_IP_MAX + 1] = {
[LWTUNNEL_IP_FLAGS] = { .type = NLA_U16 },
};

-static int ip_tun_build_state(struct nlattr *attr,
- unsigned int family, const void *cfg,
- struct lwtunnel_state **ts,
- struct netlink_ext_ack *extack)
+static int ip_tun_build_common(struct ip_tunnel_info *tun_info,
+ struct nlattr *attr,
+ struct netlink_ext_ack *extack)
{
- struct ip_tunnel_info *tun_info;
- struct lwtunnel_state *new_state;
struct nlattr *tb[LWTUNNEL_IP_MAX + 1];
int err;

@@ -243,14 +240,6 @@ static int ip_tun_build_state(struct nlattr *attr,
if (err < 0)
return err;

- new_state = lwtunnel_state_alloc(sizeof(*tun_info));
- if (!new_state)
- return -ENOMEM;
-
- new_state->type = LWTUNNEL_ENCAP_IP;
-
- tun_info = lwt_tun_info(new_state);
-
if (tb[LWTUNNEL_IP_ID])
tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP_ID]);

@@ -272,16 +261,59 @@ static int ip_tun_build_state(struct nlattr *attr,
tun_info->mode = IP_TUNNEL_INFO_TX;
tun_info->options_len = 0;

- *ts = new_state;
+ return 0;
+}
+
+static int ip_tun_build_state(struct nlattr *attr,
+ unsigned int family, const void *cfg,
+ struct lwtunnel_state **ts,
+ struct netlink_ext_ack *extack)
+{
+ struct ip_tunnel_info *tun_info;
+ struct lwtunnel_state *new_state;
+ int err;
+
+ new_state = lwtunnel_state_alloc(sizeof(*tun_info));
+ if (!new_state)
+ return -ENOMEM;

+ new_state->type = LWTUNNEL_ENCAP_IP;
+
+ tun_info = lwt_tun_info(new_state);
+ err = ip_tun_build_common(tun_info, attr, extack);
+ if (err) {
+ lwtstate_free(new_state);
+ return err;
+ }
+
+ *ts = new_state;
return 0;
}

-static int ip_tun_fill_encap_info(struct sk_buff *skb,
- struct lwtunnel_state *lwtstate)
+int ip_tunnel_build_metadst(struct net_device *dev, struct nlattr *meta,
+ struct metadata_dst **dst,
+ struct netlink_ext_ack *extack)
{
- struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate);
+ struct metadata_dst *md_dst;
+ int err;
+
+ md_dst = metadata_dst_alloc(0, METADATA_IP_TUNNEL, GFP_ATOMIC);
+ if (!md_dst)
+ return -ENOMEM;

+ err = ip_tun_build_common(&md_dst->u.tun_info, meta, extack);
+ if (err) {
+ dst_release(&md_dst->dst);
+ return err;
+ }
+ *dst = md_dst;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ip_tunnel_build_metadst);
+
+static int ip_tun_fill_common(struct sk_buff *skb,
+ struct ip_tunnel_info *tun_info)
+{
if (nla_put_be64(skb, LWTUNNEL_IP_ID, tun_info->key.tun_id,
LWTUNNEL_IP_PAD) ||
nla_put_in_addr(skb, LWTUNNEL_IP_DST, tun_info->key.u.ipv4.dst) ||
@@ -294,6 +326,25 @@ static int ip_tun_fill_encap_info(struct sk_buff *skb,
return 0;
}

+static int ip_tun_fill_encap_info(struct sk_buff *skb,
+ struct lwtunnel_state *lwtstate)
+{
+ struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate);
+ return ip_tun_fill_common(skb, tun_info);
+}
+
+int ip_tunnel_fill_metadst(struct sk_buff *skb, struct metadata_dst *md_dst)
+{
+ int err;
+ if (md_dst->type != METADATA_IP_TUNNEL)
+ return 0;
+ err = ip_tun_fill_common(skb, &md_dst->u.tun_info);
+ if (err)
+ return err;
+ return LWTUNNEL_ENCAP_IP;
+}
+EXPORT_SYMBOL_GPL(ip_tunnel_fill_metadst);
+
static int ip_tun_encap_nlsize(struct lwtunnel_state *lwtstate)
{
return nla_total_size_64bit(8) /* LWTUNNEL_IP_ID */
--
2.13.0
David Lamparter
2017-08-21 17:15:20 UTC
Permalink
This enables using an IPv4 multicast destination for gretap and enables
learning unicast destinations through the bridge fdb. Alternatively, a
zero destination can be used together with manual entry creation via
netlink.

This is essentially basic NVGRE support, if the GRE key is configured
appropriately. VLAN to key mapping is not supported. Implementing
NVGRE was actually an unintentional side effect.

[v2: move src/dst flipping to RX path, allow zero tunnel daddr]
Signed-off-by: David Lamparter <***@diac24.net>
---
net/ipv4/ip_gre.c | 40 +++++++++++++++++++++++++++++++++++-----
net/ipv4/ip_tunnel.c | 1 +
2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 7a7829e839c2..1596f709e5b6 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -266,15 +266,35 @@ static int __ipgre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi,
skb_pop_mac_header(skb);
else
skb_reset_mac_header(skb);
- if (tunnel->collect_md) {
+ if (tunnel->collect_md ||
+ tunnel->parms.iph.daddr == 0 ||
+ ipv4_is_multicast(tunnel->parms.iph.daddr)) {
__be16 flags;
__be64 tun_id;

flags = tpi->flags & (TUNNEL_CSUM | TUNNEL_KEY);
tun_id = key32_to_tunnel_id(tpi->key);
- tun_dst = ip_tun_rx_dst(skb, flags, tun_id, 0);
- if (!tun_dst)
- return PACKET_REJECT;
+ if (tunnel->collect_md) {
+ tun_dst = ip_tun_rx_dst(skb, flags, tun_id, 0);
+ if (!tun_dst)
+ return PACKET_REJECT;
+ } else {
+ tun_dst = metadata_dst_alloc(0,
+ METADATA_IP_TUNNEL, GFP_ATOMIC);
+ if (!tun_dst)
+ return PACKET_REJECT;
+
+ /* build dst appropriate for responding */
+ tun_dst->u.tun_info.options_len = 0;
+ tun_dst->u.tun_info.mode = IP_TUNNEL_INFO_TX;
+
+ ip_tunnel_key_init(&tun_dst->u.tun_info.key,
+ tunnel->parms.iph.saddr,
+ iph->saddr,
+ tunnel->parms.iph.tos,
+ tunnel->parms.iph.ttl,
+ 0, 0, 0, tun_id, flags);
+ }
}

ip_tunnel_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
@@ -507,11 +527,14 @@ static netdev_tx_t gre_tap_xmit(struct sk_buff *skb,
struct net_device *dev)
{
struct ip_tunnel *tunnel = netdev_priv(dev);
+ struct ip_tunnel_info *tun_info = skb_tunnel_info(skb);

- if (tunnel->collect_md) {
+ if (tunnel->collect_md || tun_info) {
gre_fb_xmit(skb, dev, htons(ETH_P_TEB));
return NETDEV_TX_OK;
}
+ /* tunnel layer doesn't expect a metadata dst */
+ skb_dst_drop(skb);

if (gre_handle_offloads(skb, !!(tunnel->parms.o_flags & TUNNEL_CSUM)))
goto free_skb;
@@ -933,6 +956,7 @@ static int gre_tap_init(struct net_device *dev)
{
__gre_tunnel_init(dev);
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+ netif_keep_dst(dev);

return ip_tunnel_init(dev);
}
@@ -940,6 +964,10 @@ static int gre_tap_init(struct net_device *dev)
static const struct net_device_ops gre_tap_netdev_ops = {
.ndo_init = gre_tap_init,
.ndo_uninit = ip_tunnel_uninit,
+#ifdef CONFIG_NET_IPGRE_BROADCAST
+ .ndo_open = ipgre_open,
+ .ndo_stop = ipgre_close,
+#endif
.ndo_start_xmit = gre_tap_xmit,
.ndo_set_mac_address = eth_mac_addr,
.ndo_validate_addr = eth_validate_addr,
@@ -947,6 +975,8 @@ static const struct net_device_ops gre_tap_netdev_ops = {
.ndo_get_stats64 = ip_tunnel_get_stats64,
.ndo_get_iflink = ip_tunnel_get_iflink,
.ndo_fill_metadata_dst = gre_fill_metadata_dst,
+ .ndo_metadst_fill = ip_tunnel_fill_metadst,
+ .ndo_metadst_build = ip_tunnel_build_metadst,
};

static void ipgre_tap_setup(struct net_device *dev)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 129d1a3616f8..451c11fc9ae5 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -140,6 +140,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct ip_tunnel_net *itn,

hlist_for_each_entry_rcu(t, head, hash_node) {
if ((local != t->parms.iph.saddr || t->parms.iph.daddr != 0) &&
+ (local != t->parms.iph.saddr || !ipv4_is_multicast(t->parms.iph.daddr)) &&
(local != t->parms.iph.daddr || !ipv4_is_multicast(local)))
continue;
--
2.13.0
David Lamparter
2017-08-21 17:15:21 UTC
Permalink
This makes mpls_rt_xmit() available for use in upcoming VPLS code. Same
for mpls_route_input_rcu().

Signed-off-by: David Lamparter <***@diac24.net>
---
net/mpls/af_mpls.c | 48 ++++++++++++++++++++++++++++++------------------
net/mpls/internal.h | 4 ++++
2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c5b9ce41d66f..1de2b3501dc8 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -43,7 +43,7 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
struct nlmsghdr *nlh, struct net *net, u32 portid,
unsigned int nlm_flags);

-static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
+struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
{
struct mpls_route *rt = NULL;

@@ -313,15 +313,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
struct net *net = dev_net(dev);
struct mpls_shim_hdr *hdr;
struct mpls_route *rt;
- struct mpls_nh *nh;
struct mpls_entry_decoded dec;
- struct net_device *out_dev;
- struct mpls_dev *out_mdev;
struct mpls_dev *mdev;
- unsigned int hh_len;
- unsigned int new_header_size;
- unsigned int mtu;
- int err;

/* Careful this entire function runs inside of an rcu critical section */

@@ -356,9 +349,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
goto drop;
}

- nh = mpls_select_multipath(rt, skb);
- if (!nh)
- goto err;

/* Pop the label */
skb_pull(skb, sizeof(*hdr));
@@ -376,6 +366,32 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
goto err;
dec.ttl -= 1;

+ if (mpls_rt_xmit(skb, rt, dec))
+ goto drop;
+ return 0;
+
+err:
+ MPLS_INC_STATS(mdev, rx_errors);
+drop:
+ kfree_skb(skb);
+ return NET_RX_DROP;
+}
+
+int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
+ struct mpls_entry_decoded dec)
+{
+ struct mpls_nh *nh;
+ struct net_device *out_dev = NULL;
+ struct mpls_dev *out_mdev;
+ unsigned int hh_len;
+ unsigned int new_header_size;
+ unsigned int mtu;
+ int err;
+
+ nh = mpls_select_multipath(rt, skb);
+ if (!nh)
+ goto tx_err;
+
/* Find the output device */
out_dev = rcu_dereference(nh->nh_dev);
if (!mpls_output_possible(out_dev))
@@ -401,8 +417,9 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
if (unlikely(!new_header_size && dec.bos)) {
/* Penultimate hop popping */
if (!mpls_egress(dev_net(out_dev), rt, skb, dec))
- goto err;
+ goto tx_err;
} else {
+ struct mpls_shim_hdr *hdr;
bool bos;
int i;
skb_push(skb, new_header_size);
@@ -435,12 +452,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
out_mdev = out_dev ? mpls_dev_get(out_dev) : NULL;
if (out_mdev)
MPLS_INC_STATS(out_mdev, tx_errors);
- goto drop;
-err:
- MPLS_INC_STATS(mdev, rx_errors);
-drop:
- kfree_skb(skb);
- return NET_RX_DROP;
+ return -1;
}

static struct packet_type mpls_packet_type __read_mostly = {
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index cf65aec2e551..b70c6663d4f3 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -210,4 +210,8 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu);
void mpls_stats_inc_outucastpkts(struct net_device *dev,
const struct sk_buff *skb);

+struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index);
+int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
+ struct mpls_entry_decoded dec);
+
#endif /* MPLS_INTERNAL_H */
--
2.13.0
David Lamparter
2017-08-21 17:15:22 UTC
Permalink
This wires up the neccessary calls for VPLS into the MPLS forwarding
pieces. Since CONFIG_MPLS_VPLS doesn't exist yet in Kconfig, it'll
never be enabled, so we're on the stubs for now.

[TODO: maybe rename VPLS_FLAGS to MPLS_FLAGS and use it for
non-pseudowire OAM bits too (e.g. enabling G-ACh or LSP ping.)]

Signed-off-by: David Lamparter <***@diac24.net>
---
include/uapi/linux/rtnetlink.h | 5 ++++
net/mpls/af_mpls.c | 65 ++++++++++++++++++++++++++++++++++++++++++
net/mpls/internal.h | 40 ++++++++++++++++++++++----
3 files changed, 104 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index dab7dad9e01a..b5a34e0e4327 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -326,6 +326,8 @@ enum rtattr_type_t {
RTA_PAD,
RTA_UID,
RTA_TTL_PROPAGATE,
+ RTA_VPLS_IF,
+ RTA_VPLS_FLAGS,
__RTA_MAX
};

@@ -334,6 +336,9 @@ enum rtattr_type_t {
#define RTM_RTA(r) ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct rtmsg))))
#define RTM_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct rtmsg))

+#define RTA_VPLS_F_CW_RX (1 << 0)
+#define RTA_VPLS_F_CW_TX (1 << 1)
+
/* RTM_MULTIPATH --- array of struct rtnexthop.
*
* "struct rtnexthop" describes all necessary nexthop information,
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 1de2b3501dc8..7a21d230f9e7 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -299,6 +299,11 @@ static bool mpls_egress(struct net *net, struct mpls_route *rt,
success = true;
break;
}
+ case MPT_VPLS:
+ /* nothing to do here, no TTL in Ethernet
+ * (and we shouldn't mess with the TTL in inner IP packets,
+ * pseudowires are supposed to be transparent) */
+ break;
case MPT_UNSPEC:
/* Should have decided which protocol it is by now */
break;
@@ -349,6 +354,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
goto drop;
}

+ if (rt->rt_payload_type == MPT_VPLS)
+ return vpls_rcv(skb, dev, pt, rt, hdr, orig_dev);

/* Pop the label */
skb_pull(skb, sizeof(*hdr));
@@ -469,6 +476,8 @@ static const struct nla_policy rtm_mpls_policy[RTA_MAX+1] = {
struct mpls_route_config {
u32 rc_protocol;
u32 rc_ifindex;
+ u32 rc_vpls_ifindex;
+ u8 rc_vpls_flags;
u8 rc_via_table;
u8 rc_via_alen;
u8 rc_via[MAX_VIA_ALEN];
@@ -541,6 +550,8 @@ static void mpls_route_update(struct net *net, unsigned index,
rt = rtnl_dereference(platform_label[index]);
rcu_assign_pointer(platform_label[index], new);

+ vpls_label_update(index, rt, new);
+
mpls_notify_route(net, index, rt, new, info);

/* If we removed a route free it now */
@@ -942,6 +953,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
struct mpls_route __rcu **platform_label;
struct net *net = cfg->rc_nlinfo.nl_net;
struct mpls_route *rt, *old;
+ struct net_device *vpls_dev = NULL;
int err = -EINVAL;
u8 max_via_alen;
unsigned index;
@@ -996,6 +1008,24 @@ static int mpls_route_add(struct mpls_route_config *cfg,
goto errout;
}

+ if (cfg->rc_vpls_ifindex) {
+ vpls_dev = dev_get_by_index(net, cfg->rc_vpls_ifindex);
+ if (!vpls_dev) {
+ err = -ENODEV;
+ NL_SET_ERR_MSG(extack, "Invalid VPLS ifindex");
+ goto errout;
+ }
+ /* we're under RTNL; and we'll drop routes when we're
+ * notified the device is going away. */
+ dev_put(vpls_dev);
+
+ if (!is_vpls_dev(vpls_dev)) {
+ err = -ENODEV;
+ NL_SET_ERR_MSG(extack, "Not a VPLS device");
+ goto errout;
+ }
+ }
+
err = -ENOMEM;
rt = mpls_rt_alloc(nhs, max_via_alen, max_labels);
if (IS_ERR(rt)) {
@@ -1006,6 +1036,8 @@ static int mpls_route_add(struct mpls_route_config *cfg,
rt->rt_protocol = cfg->rc_protocol;
rt->rt_payload_type = cfg->rc_payload_type;
rt->rt_ttl_propagate = cfg->rc_ttl_propagate;
+ rt->rt_vpls_dev = vpls_dev;
+ rt->rt_vpls_flags = cfg->rc_vpls_flags;

if (cfg->rc_mp)
err = mpls_nh_build_multi(cfg, rt, max_labels, extack);
@@ -1430,6 +1462,14 @@ static void mpls_ifdown(struct net_device *dev, int event)
if (!rt)
continue;

+ if (rt->rt_vpls_dev == dev) {
+ switch (event) {
+ case NETDEV_UNREGISTER:
+ mpls_route_update(net, index, NULL, NULL);
+ continue;
+ }
+ }
+
alive = 0;
deleted = 0;
change_nexthops(rt) {
@@ -1777,6 +1817,13 @@ static int rtm_to_route_config(struct sk_buff *skb,
case RTA_OIF:
cfg->rc_ifindex = nla_get_u32(nla);
break;
+ case RTA_VPLS_IF:
+ cfg->rc_vpls_ifindex = nla_get_u32(nla);
+ cfg->rc_payload_type = MPT_VPLS;
+ break;
+ case RTA_VPLS_FLAGS:
+ cfg->rc_vpls_flags = nla_get_u8(nla);
+ break;
case RTA_NEWDST:
if (nla_get_labels(nla, MAX_NEW_LABELS,
&cfg->rc_output_labels,
@@ -1911,6 +1958,14 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
ttl_propagate))
goto nla_put_failure;
}
+
+ if (rt->rt_vpls_dev)
+ if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
+ goto nla_put_failure;
+ if (rt->rt_vpls_flags)
+ if (nla_put_u8(skb, RTA_VPLS_FLAGS, rt->rt_vpls_flags))
+ goto nla_put_failure;
+
if (rt->rt_nhn == 1) {
const struct mpls_nh *nh = rt->rt_nh;

@@ -2220,6 +2275,13 @@ static int mpls_getroute(struct sk_buff *in_skb, struct nlmsghdr *in_nlh,
if (nla_put_labels(skb, RTA_DST, 1, &in_label))
goto nla_put_failure;

+ if (rt->rt_vpls_dev)
+ if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
+ goto nla_put_failure;
+ if (rt->rt_vpls_flags)
+ if (nla_put_u8(skb, RTA_VPLS_FLAGS, rt->rt_vpls_flags))
+ goto nla_put_failure;
+
if (nh->nh_labels &&
nla_put_labels(skb, RTA_NEWDST, nh->nh_labels,
nh->nh_label))
@@ -2491,6 +2553,8 @@ static int __init mpls_init(void)

rtnl_af_register(&mpls_af_ops);

+ vpls_init();
+
rtnl_register(PF_MPLS, RTM_NEWROUTE, mpls_rtm_newroute, NULL, 0);
rtnl_register(PF_MPLS, RTM_DELROUTE, mpls_rtm_delroute, NULL, 0);
rtnl_register(PF_MPLS, RTM_GETROUTE, mpls_getroute, mpls_dump_routes,
@@ -2510,6 +2574,7 @@ module_init(mpls_init);
static void __exit mpls_exit(void)
{
rtnl_unregister_all(PF_MPLS);
+ vpls_exit();
rtnl_af_unregister(&mpls_af_ops);
dev_remove_pack(&mpls_packet_type);
unregister_netdevice_notifier(&mpls_dev_notifier);
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index b70c6663d4f3..f22e50b693ea 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -76,13 +76,9 @@ struct sk_buff;

enum mpls_payload_type {
MPT_UNSPEC, /* IPv4 or IPv6 */
+ MPT_VPLS = 2, /* pseudowire */
MPT_IPV4 = 4,
MPT_IPV6 = 6,
-
- /* Other types not implemented:
- * - Pseudo-wire with or without control word (RFC4385)
- * - GAL (RFC5586)
- */
};

struct mpls_nh { /* next hop label forwarding entry */
@@ -152,7 +148,10 @@ struct mpls_route { /* next hop label forwarding entry */
u8 rt_nhn_alive;
u8 rt_nh_size;
u8 rt_via_offset;
- u8 rt_reserved1;
+
+ u8 rt_vpls_flags;
+ struct net_device *rt_vpls_dev;
+
struct mpls_nh rt_nh[0];
};

@@ -214,4 +213,33 @@ struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index);
int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
struct mpls_entry_decoded dec);

+#ifdef CONFIG_MPLS_VPLS
+int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
+ struct packet_type *pt, struct mpls_route *rt,
+ struct mpls_shim_hdr *hdr, struct net_device *orig_dev);
+void vpls_label_update(unsigned label, struct mpls_route *rt_old,
+ struct mpls_route *rt_new);
+__init int vpls_init(void);
+__exit void vpls_exit(void);
+int is_vpls_dev(struct net_device *dev);
+
+#else /* !CONFIG_MPLS_VPLS */
+static inline int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
+ struct packet_type *pt, struct mpls_route *rt,
+ struct mpls_shim_hdr *hdr,
+ struct net_device *orig_dev)
+{
+ kfree_skb(skb);
+ return NET_RX_DROP;
+}
+static inline int is_vpls_dev(struct net_device *dev)
+{
+ return 0;
+}
+
+#define vpls_label_update(label, rt_old, rt_new) do { } while (0)
+#define vpls_init() do { } while (0)
+#define vpls_exit() do { } while (0)
+#endif /* !CONFIG_MPLS_VPLS */
+
#endif /* MPLS_INTERNAL_H */
--
2.13.0
David Lamparter
2017-08-21 17:15:23 UTC
Permalink
[work-in-progress, works but needs changes]
[v2: refactored lots of things, e.g. dst_metadata, no more genetlink]
[v4: removed pointless include/net/vpls.h, squashed pseudowire control
word support, squashed netlink lwtunnel access bits]

Signed-off-by: David Lamparter <***@diac24.net>
---
include/net/dst_metadata.h | 24 ++
include/uapi/linux/lwtunnel.h | 8 +
net/core/lwtunnel.c | 1 +
net/mpls/Kconfig | 11 +
net/mpls/Makefile | 1 +
net/mpls/vpls.c | 550 ++++++++++++++++++++++++++++++++++++++++++
6 files changed, 595 insertions(+)
create mode 100644 net/mpls/vpls.c

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 4bcc0f314853..307df53e5185 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -8,6 +8,7 @@
enum metadata_type {
METADATA_IP_TUNNEL,
METADATA_HW_PORT_MUX,
+ METADATA_VPLS,
};

struct hw_port_info {
@@ -15,12 +16,17 @@ struct hw_port_info {
u32 port_id;
};

+struct vpls_info {
+ u32 pw_label;
+};
+
struct metadata_dst {
struct dst_entry dst;
enum metadata_type type;
union {
struct ip_tunnel_info tun_info;
struct hw_port_info port_info;
+ struct vpls_info vpls_info;
} u;
};

@@ -54,6 +60,15 @@ static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
return NULL;
}

+static inline struct vpls_info *skb_vpls_info(struct sk_buff *skb)
+{
+ struct metadata_dst *md_dst = skb_metadata_dst(skb);
+ if (md_dst && md_dst->type == METADATA_VPLS)
+ return &md_dst->u.vpls_info;
+ return NULL;
+}
+
+
static inline bool skb_valid_dst(const struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
@@ -74,6 +89,9 @@ static inline int metadata_dst_cmp(const struct metadata_dst *a,
case METADATA_HW_PORT_MUX:
return memcmp(&a->u.port_info, &b->u.port_info,
sizeof(a->u.port_info));
+ case METADATA_VPLS:
+ return memcmp(&a->u.vpls_info, &b->u.vpls_info,
+ sizeof(a->u.vpls_info));
case METADATA_IP_TUNNEL:
return memcmp(&a->u.tun_info, &b->u.tun_info,
sizeof(a->u.tun_info) +
@@ -220,4 +238,10 @@ static inline struct metadata_dst *ipv6_tun_rx_dst(struct sk_buff *skb,
0, ip6_flowlabel(ip6h), flags, tunnel_id,
md_size);
}
+
+static inline struct metadata_dst *vpls_rx_dst(void)
+{
+ return metadata_dst_alloc(0, METADATA_VPLS, GFP_ATOMIC);
+}
+
#endif /* __NET_DST_METADATA_H */
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index 7fdd19ca7511..b7281441bccb 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -12,6 +12,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_SEG6,
LWTUNNEL_ENCAP_BPF,
LWTUNNEL_ENCAP_SEG6_LOCAL,
+ LWTUNNEL_ENCAP_PSEUDOWIRE,
__LWTUNNEL_ENCAP_MAX,
};

@@ -67,4 +68,11 @@ enum {

#define LWT_BPF_MAX_HEADROOM 256

+enum {
+ LWT_PSEUDOWIRE_LOCAL_LABEL,
+ __LWT_PSEUDOWIRE_MAX,
+};
+
+#define LWT_PSEUDOWIRE_MAX (__LWT_PSEUDOWIRE_MAX - 1)
+
#endif /* _UAPI_LWTUNNEL_H_ */
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 0b171756453c..52563a4ecc22 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -49,6 +49,7 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
case LWTUNNEL_ENCAP_IP6:
case LWTUNNEL_ENCAP_IP:
case LWTUNNEL_ENCAP_NONE:
+ case LWTUNNEL_ENCAP_PSEUDOWIRE:
case __LWTUNNEL_ENCAP_MAX:
/* should not have got here */
WARN_ON(1);
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index 5c467ef97311..c15ba73efb34 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -27,6 +27,17 @@ config MPLS_ROUTING
---help---
Add support for forwarding of mpls packets.

+config MPLS_VPLS
+ bool "VPLS support"
+ default y
+ depends on MPLS_ROUTING && BRIDGE_NETFILTER=n
+ ---help---
+ Add support for de-&encapsulating VPLS. Not compatible with
+ bridge netfilter due to the latter stomping over VPLS' dst metadata.
+
+comment "disable 'Bridged IP/ARP packets filtering' for VPLS support"
+ depends on BRIDGE_NETFILTER
+
config MPLS_IPTUNNEL
tristate "MPLS: IP over MPLS tunnel support"
depends on LWTUNNEL && MPLS_ROUTING
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 9ca923625016..3c028600a980 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
obj-$(CONFIG_MPLS_IPTUNNEL) += mpls_iptunnel.o

mpls_router-y := af_mpls.o
+mpls_router-$(CONFIG_MPLS_VPLS) += vpls.o
diff --git a/net/mpls/vpls.c b/net/mpls/vpls.c
new file mode 100644
index 000000000000..12cbfe3063ae
--- /dev/null
+++ b/net/mpls/vpls.c
@@ -0,0 +1,550 @@
+/*
+ * net/mpls/vpls.c
+ *
+ * Copyright (C) 2016 David Lamparter
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <linux/slab.h>
+#include <linux/ethtool.h>
+#include <linux/etherdevice.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/mpls.h>
+
+#include <net/rtnetlink.h>
+#include <net/dst.h>
+#include <net/xfrm.h>
+#include <net/mpls.h>
+#include <linux/module.h>
+#include <net/dst_metadata.h>
+#include <net/ip_tunnels.h>
+#include <linux/lwtunnel.h>
+
+#include "internal.h"
+
+#define DRV_NAME "vpls"
+
+#define MIN_MTU 68 /* Min L3 MTU */
+#define MAX_MTU 65535 /* Max L3 MTU (arbitrary) */
+
+struct vpls_cw {
+ u8 type_flags;
+#define VPLS_CWTYPE(cw) ((cw)->type_flags & 0x0f)
+
+ u8 len;
+ u16 seqno;
+};
+
+struct vpls_wirelist {
+ struct rcu_head rcu;
+ size_t count;
+ unsigned wires[0];
+};
+
+struct vpls_priv {
+ struct net *encap_net;
+ struct vpls_wirelist __rcu *wires;
+};
+
+static int vpls_xmit_wire(struct sk_buff *skb, struct net_device *dev,
+ struct vpls_priv *vpls, u32 wire)
+{
+ struct mpls_route *rt;
+ struct mpls_entry_decoded dec;
+
+ dec.bos = 1;
+ dec.ttl = 255;
+
+ rt = mpls_route_input_rcu(vpls->encap_net, wire);
+ if (!rt)
+ return -ENOENT;
+ if (rt->rt_vpls_dev != dev)
+ return -EINVAL;
+
+ if (rt->rt_vpls_flags & RTA_VPLS_F_CW_TX) {
+ struct vpls_cw *cw;
+ if (skb_cow(skb, sizeof(*cw)))
+ return -ENOMEM;
+ cw = skb_push(skb, sizeof(*cw));
+ memset(cw, 0, sizeof(*cw));
+ }
+
+ return mpls_rt_xmit(skb, rt, dec);
+}
+
+static netdev_tx_t vpls_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ int err = -EINVAL, ok_count = 0;
+ struct vpls_priv *priv = netdev_priv(dev);
+ struct vpls_info *vi;
+ struct pcpu_sw_netstats *stats;
+ size_t len = skb->len;
+
+ vi = skb_vpls_info(skb);
+
+ skb_orphan(skb);
+ skb_forward_csum(skb);
+
+ if (vi) {
+ err = vpls_xmit_wire(skb, dev, priv, vi->pw_label);
+ if (err)
+ goto out_err;
+ } else {
+ struct sk_buff *cloned;
+ struct vpls_wirelist *wl;
+ size_t i;
+
+ rcu_read_lock();
+ wl = rcu_dereference(priv->wires);
+ if (wl->count == 0) {
+ dev->stats.tx_carrier_errors++;
+ goto out_err_rcu;
+ }
+
+ for (i = 0; i < wl->count; i++) {
+ cloned = skb_clone(skb, GFP_KERNEL);
+ if (vpls_xmit_wire(cloned, dev, priv, wl->wires[i]))
+ consume_skb(cloned);
+ else
+ ok_count++;
+ }
+ rcu_read_unlock();
+
+ if (!ok_count)
+ goto out_err;
+ consume_skb(skb);
+ }
+
+ stats = this_cpu_ptr(dev->tstats);
+ u64_stats_update_begin(&stats->syncp);
+ stats->tx_packets++;
+ stats->tx_bytes += len;
+ u64_stats_update_end(&stats->syncp);
+
+ return 0;
+
+out_err_rcu:
+ rcu_read_unlock();
+out_err:
+ dev->stats.tx_errors++;
+
+ consume_skb(skb);
+ return err;
+}
+
+int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
+ struct packet_type *pt, struct mpls_route *rt,
+ struct mpls_shim_hdr *hdr, struct net_device *orig_dev)
+{
+ struct net_device *dev = rt->rt_vpls_dev;
+ struct mpls_entry_decoded dec;
+ struct metadata_dst *md_dst;
+ struct pcpu_sw_netstats *stats;
+ void *next;
+
+ if (!dev)
+ goto drop_nodev;
+
+ dec = mpls_entry_decode(hdr);
+ if (!dec.bos) {
+ dev->stats.rx_frame_errors++;
+ goto drop;
+ }
+
+ /* bottom label is still in the skb */
+ next = skb_pull(skb, sizeof(*hdr));
+
+ if (rt->rt_vpls_flags & RTA_VPLS_F_CW_RX) {
+ struct vpls_cw *cw = next;
+ if (unlikely(!pskb_may_pull(skb, sizeof(*cw)))) {
+ dev->stats.rx_length_errors++;
+ goto drop;
+ }
+ next = skb_pull(skb, sizeof(*cw));
+
+ if (VPLS_CWTYPE(cw) != 0) {
+ /* insert MPLS OAM implementation here */
+ goto drop_nodev;
+ }
+ }
+
+ if (unlikely(!pskb_may_pull(skb, ETH_HLEN))) {
+ dev->stats.rx_length_errors++;
+ goto drop;
+ }
+
+ md_dst = vpls_rx_dst();
+ if (unlikely(!md_dst)) {
+ netdev_err(dev, "failed to allocate dst metadata\n");
+ goto drop;
+ }
+ md_dst->u.vpls_info.pw_label = dec.label;
+
+ skb->dev = dev;
+
+ skb_reset_mac_header(skb);
+ skb->protocol = eth_type_trans(skb, dev);
+ skb->ip_summed = CHECKSUM_NONE;
+ skb->pkt_type = PACKET_HOST;
+
+ skb_clear_hash(skb);
+ skb->vlan_tci = 0;
+ skb_set_queue_mapping(skb, 0);
+ skb_scrub_packet(skb, !net_eq(dev_net(in_dev), dev_net(dev)));
+
+ skb_reset_network_header(skb);
+ skb_probe_transport_header(skb, 0);
+
+ skb_dst_drop(skb);
+ skb_dst_set(skb, &md_dst->dst);
+
+ stats = this_cpu_ptr(dev->tstats);
+ u64_stats_update_begin(&stats->syncp);
+ stats->rx_packets++;
+ stats->rx_bytes += skb->len;
+ u64_stats_update_end(&stats->syncp);
+
+ netif_rx(skb);
+ return 0;
+
+drop:
+ dev->stats.rx_errors++;
+drop_nodev:
+ kfree_skb(skb);
+ return NET_RX_DROP;
+}
+
+void vpls_label_update(unsigned label, struct mpls_route *rt_old,
+ struct mpls_route *rt_new)
+{
+ struct vpls_priv *priv;
+ struct vpls_wirelist *wl, *wl_new;
+ size_t i;
+
+ ASSERT_RTNL();
+
+ if (rt_old && rt_new && rt_old->rt_vpls_dev == rt_new->rt_vpls_dev)
+ return;
+
+ if (rt_old && rt_old->rt_vpls_dev) {
+ priv = netdev_priv(rt_old->rt_vpls_dev);
+ wl = rcu_dereference(priv->wires);
+
+ for (i = 0; i < wl->count; i++)
+ if (wl->wires[i] == label)
+ break;
+
+ if (i == wl->count) {
+ netdev_err(rt_old->rt_vpls_dev,
+ "can't find pseudowire to remove!\n");
+ goto update_new;
+ }
+
+ wl_new = kmalloc(sizeof(*wl) +
+ (wl->count - 1) * sizeof(wl->wires[0]),
+ GFP_ATOMIC);
+ if (!wl_new) {
+ netdev_err(rt_old->rt_vpls_dev,
+ "out of memory for pseudowire delete!\n");
+ goto update_new;
+ }
+
+ wl_new->count = wl->count - 1;
+ memcpy(wl_new->wires, wl->wires, i * sizeof(wl->wires[0]));
+ memcpy(wl_new->wires + i, wl->wires + i + 1,
+ (wl->count - i - 1) * sizeof(wl->wires[0]));
+
+ rcu_assign_pointer(priv->wires, wl_new);
+ kfree_rcu(wl, rcu);
+
+ if (wl_new->count == 0)
+ netif_carrier_off(rt_old->rt_vpls_dev);
+ }
+
+update_new:
+ if (rt_new && rt_new->rt_vpls_dev) {
+ priv = netdev_priv(rt_new->rt_vpls_dev);
+ wl = rcu_dereference(priv->wires);
+
+ wl_new = kmalloc(sizeof(*wl) +
+ (wl->count + 1) * sizeof(wl->wires[0]),
+ GFP_ATOMIC);
+ if (!wl_new) {
+ netdev_err(rt_new->rt_vpls_dev,
+ "out of memory for pseudowire add!\n");
+ return;
+ }
+ wl_new->count = wl->count + 1;
+ memcpy(wl_new->wires, wl->wires,
+ wl->count * sizeof(wl->wires[0]));
+ wl_new->wires[wl->count] = label;
+
+ rcu_assign_pointer(priv->wires, wl_new);
+ kfree_rcu(wl, rcu);
+
+ if (wl_new->count == 1)
+ netif_carrier_on(rt_new->rt_vpls_dev);
+ }
+}
+
+/* fake multicast ability */
+static void vpls_set_multicast_list(struct net_device *dev)
+{
+}
+
+static int vpls_open(struct net_device *dev)
+{
+ struct vpls_priv *priv = netdev_priv(dev);
+ struct vpls_wirelist *wl;
+
+ wl = rcu_dereference(priv->wires);
+ if (wl->count > 0)
+ netif_carrier_on(dev);
+
+ return 0;
+}
+
+static int vpls_close(struct net_device *dev)
+{
+ netif_carrier_off(dev);
+ return 0;
+}
+
+static int is_valid_vpls_mtu(int new_mtu)
+{
+ return new_mtu >= MIN_MTU && new_mtu <= MAX_MTU;
+}
+
+static int vpls_change_mtu(struct net_device *dev, int new_mtu)
+{
+ if (!is_valid_vpls_mtu(new_mtu))
+ return -EINVAL;
+ dev->mtu = new_mtu;
+ return 0;
+}
+
+static int vpls_dev_init(struct net_device *dev)
+{
+ struct vpls_priv *priv = netdev_priv(dev);
+ priv->wires = kzalloc(sizeof(struct vpls_wirelist), GFP_KERNEL);
+ if (!priv->wires)
+ return -ENOMEM;
+
+ dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+ if (!dev->tstats) {
+ kfree(priv->wires);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void vpls_dev_free(struct net_device *dev)
+{
+ struct vpls_priv *priv = netdev_priv(dev);
+
+ free_percpu(dev->tstats);
+
+ if (priv->wires)
+ kfree(priv->wires);
+
+ if (priv->encap_net)
+ put_net(priv->encap_net);
+
+ free_netdev(dev);
+}
+
+static const struct nla_policy vpls_meta_policy[LWT_PSEUDOWIRE_MAX + 1] = {
+ [LWT_PSEUDOWIRE_LOCAL_LABEL] = { .type = NLA_U32 },
+};
+
+static int vpls_fill_metadst(struct sk_buff *skb, struct metadata_dst *md_dst)
+{
+ struct vpls_info *vi;
+ if (md_dst->type != METADATA_VPLS)
+ return 0;
+
+ vi = &md_dst->u.vpls_info;
+ if (nla_put_u32(skb, LWT_PSEUDOWIRE_LOCAL_LABEL, vi->pw_label))
+ return -ENOMEM;
+ return LWTUNNEL_ENCAP_PSEUDOWIRE;
+}
+
+static int vpls_build_metadst(struct net_device *dev, struct nlattr *meta,
+ struct metadata_dst **dst,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[LWT_PSEUDOWIRE_MAX + 1];
+ struct metadata_dst *rv;
+ int err;
+ unsigned wire;
+
+ err = nla_parse_nested(tb, LWT_PSEUDOWIRE_MAX, meta,
+ vpls_meta_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (!tb[LWT_PSEUDOWIRE_LOCAL_LABEL])
+ return -EINVAL;
+ wire = nla_get_u32(tb[LWT_PSEUDOWIRE_LOCAL_LABEL]);
+ if (wire < MPLS_LABEL_FIRST_UNRESERVED)
+ return -EINVAL;
+
+ rv = vpls_rx_dst();
+ if (!rv)
+ return -ENOMEM;
+ rv->u.vpls_info.pw_label = wire;
+
+ *dst = rv;
+ return 0;
+}
+
+static const struct net_device_ops vpls_netdev_ops = {
+ .ndo_init = vpls_dev_init,
+ .ndo_open = vpls_open,
+ .ndo_stop = vpls_close,
+ .ndo_start_xmit = vpls_xmit,
+ .ndo_change_mtu = vpls_change_mtu,
+ .ndo_get_stats64 = ip_tunnel_get_stats64,
+ .ndo_set_rx_mode = vpls_set_multicast_list,
+ .ndo_set_mac_address = eth_mac_addr,
+ .ndo_features_check = passthru_features_check,
+ .ndo_metadst_fill = vpls_fill_metadst,
+ .ndo_metadst_build = vpls_build_metadst,
+};
+
+int is_vpls_dev(struct net_device *dev)
+{
+ return dev->netdev_ops == &vpls_netdev_ops;
+}
+
+#define VPLS_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | \
+ NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA)
+
+static void vpls_setup(struct net_device *dev)
+{
+ ether_setup(dev);
+
+ dev->priv_flags &= ~IFF_TX_SKB_SHARING;
+ dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+ dev->priv_flags |= IFF_NO_QUEUE;
+
+ dev->netdev_ops = &vpls_netdev_ops;
+ dev->features |= NETIF_F_LLTX;
+ dev->features |= VPLS_FEATURES;
+ dev->vlan_features = dev->features;
+ dev->priv_destructor = vpls_dev_free;
+
+ dev->hw_features = VPLS_FEATURES;
+ dev->hw_enc_features = VPLS_FEATURES;
+
+ netif_keep_dst(dev);
+}
+
+/*
+ * netlink interface
+ */
+
+static int vpls_validate(struct nlattr *tb[], struct nlattr *data[],
+ struct netlink_ext_ack *extack)
+{
+ if (tb[IFLA_ADDRESS]) {
+ if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS],
+ "Invalid Ethernet address length");
+ return -EINVAL;
+ }
+ if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS],
+ "Invalid Ethernet address");
+ return -EADDRNOTAVAIL;
+ }
+ }
+ if (tb[IFLA_MTU]) {
+ if (!is_valid_vpls_mtu(nla_get_u32(tb[IFLA_MTU]))) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_MTU],
+ "Invalid MTU");
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+static struct rtnl_link_ops vpls_link_ops;
+
+static int vpls_newlink(struct net *src_net, struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[],
+ struct netlink_ext_ack *extack)
+{
+ int err;
+ struct vpls_priv *priv = netdev_priv(dev);
+
+ if (tb[IFLA_ADDRESS] == NULL)
+ eth_hw_addr_random(dev);
+
+ if (tb[IFLA_IFNAME])
+ nla_strlcpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
+ else
+ snprintf(dev->name, IFNAMSIZ, DRV_NAME "%%d");
+
+ err = register_netdevice(dev);
+ if (err < 0)
+ goto err;
+ priv->encap_net = get_net(src_net);
+
+ netif_carrier_off(dev);
+ return 0;
+
+err:
+ return err;
+}
+
+static void vpls_dellink(struct net_device *dev, struct list_head *head)
+{
+ unregister_netdevice_queue(dev, head);
+}
+
+
+static struct rtnl_link_ops vpls_link_ops = {
+ .kind = DRV_NAME,
+ .priv_size = sizeof(struct vpls_priv),
+ .setup = vpls_setup,
+ .validate = vpls_validate,
+ .newlink = vpls_newlink,
+ .dellink = vpls_dellink,
+};
+
+/*
+ * init/fini
+ */
+
+__init int vpls_init(void)
+{
+ int ret;
+
+ ret = rtnl_link_register(&vpls_link_ops);
+ if (ret)
+ goto out;
+
+ return 0;
+
+out:
+ return ret;
+}
+
+__exit void vpls_exit(void)
+{
+ rtnl_link_unregister(&vpls_link_ops);
+}
+
+#if 0
+/* not currently available as a separate module... */
+
+module_init(vpls_init);
+module_exit(vpls_exit);
+
+MODULE_DESCRIPTION("Virtual Private LAN Service");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+#endif
--
2.13.0
Amine Kherbouche
2017-08-28 09:21:27 UTC
Permalink
any opinion on vpls driver path ?
Post by David Lamparter
[work-in-progress, works but needs changes]
[v2: refactored lots of things, e.g. dst_metadata, no more genetlink]
[v4: removed pointless include/net/vpls.h, squashed pseudowire control
word support, squashed netlink lwtunnel access bits]
---
include/net/dst_metadata.h | 24 ++
include/uapi/linux/lwtunnel.h | 8 +
net/core/lwtunnel.c | 1 +
net/mpls/Kconfig | 11 +
net/mpls/Makefile | 1 +
net/mpls/vpls.c | 550 ++++++++++++++++++++++++++++++++++++++++++
6 files changed, 595 insertions(+)
create mode 100644 net/mpls/vpls.c
Stephen Hemminger
2017-08-22 00:01:51 UTC
Permalink
On Mon, 21 Aug 2017 19:15:17 +0200
Post by David Lamparter
Hi all,
this is an update on the earlier "[RFC net-next] VPLS support". Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)
The NVGRE implementation in the 3rd patch in this series is actually an
accident - I was just wiring up gretap as a reference; only after I was
done I noticed that that sums up to NVGRE, more or less. IMHO, it does
serve well to demonstrate the bridge changes are not VPLS-specific.
I've tested some basic setups, the chain from LDP down into the kernel
works at least in these. FRR has some testcases around from OpenBSD
VPLS support, I haven't wired that up to run against Linux / this
patchset yet.
Same as before (API didn't change).
The patchset needs a lot of polishing (yes I left my TODO notes in the
commit messages), for now my primary concern is overall design
feedback. Roopa has already provided a lot of input (Thanks!); the
major topic I'm expecting to get discussion on is the bridge FDB
changes.
Got some useful input; but still need feedback on the bridge FDB
changes (first 2 patches). I don't believe it to have a significant
impact on existing bridge operation, and I believe a multipoint tunnel
driver without its own FDB (e.g. NVGRE in this set) should perform
better than one with its own FDB (e.g. existing VXLAN).
P.S.: For a little context on the bridge FDB changes - I'm hoping to
find some time to extend this to the MDB to allow aggregating dst
metadata and handing down a list of dst metas on TX. This isn't
specifically for VPLS but rather to give sufficient information to the
802.11 stack to allow it to optimize selecting rates (or unicasting)
for multicast traffic by having the multicast subscriber list known.
This is done by major commercial wifi solutions (e.g. google "dynamic
multicast optimization".)
https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
Please note that the patches in that branch are not at an acceptable
quality level, but you can see the semantic relation to 802.11.
I would, however, like to point out that this branch has pseudo-working
IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
(I'll do that as soon as I get to it, it'll pop up on that branch too.)
This is relevant to the discussion because it's a feature which is
non-obvious (to me) on how to do with the VXLAN model of having an
entirely separate FDB. Meanwhile, with this architecture, the proof of
8 files changed, 176 insertions(+), 15 deletions(-)
Cheers,
-David
include/linux/netdevice.h | 18 ++++++
include/net/dst_metadata.h | 51 ++++++++++++++---
include/net/ip_tunnels.h | 5 ++
include/uapi/linux/lwtunnel.h | 8 +++
include/uapi/linux/neighbour.h | 2 +
include/uapi/linux/rtnetlink.h | 5 ++
net/bridge/br.c | 2 +-
net/bridge/br_device.c | 4 ++
net/bridge/br_fdb.c | 119 ++++++++++++++++++++++++++++++++--------
net/bridge/br_input.c | 6 +-
net/bridge/br_private.h | 6 +-
net/core/lwtunnel.c | 1 +
net/ipv4/ip_gre.c | 40 ++++++++++++--
net/ipv4/ip_tunnel.c | 1 +
net/ipv4/ip_tunnel_core.c | 87 +++++++++++++++++++++++------
net/mpls/Kconfig | 11 ++++
net/mpls/Makefile | 1 +
net/mpls/af_mpls.c | 113 ++++++++++++++++++++++++++++++++------
net/mpls/internal.h | 44 +++++++++++++--
net/mpls/vpls.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
20 files changed, 990 insertions(+), 84 deletions(-)
I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you condidered building a new driver (like VXLAN does)
which does the forwarding you want. Having all features in one driver
makes for worse performance, and increased complexity.
David Lamparter
2017-08-22 00:29:56 UTC
Permalink
Post by Stephen Hemminger
Post by David Lamparter
P.S.: For a little context on the bridge FDB changes - I'm hoping to
find some time to extend this to the MDB to allow aggregating dst
metadata and handing down a list of dst metas on TX. This isn't
specifically for VPLS but rather to give sufficient information to the
802.11 stack to allow it to optimize selecting rates (or unicasting)
for multicast traffic by having the multicast subscriber list known.
This is done by major commercial wifi solutions (e.g. google "dynamic
multicast optimization".)
https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
Please note that the patches in that branch are not at an acceptable
quality level, but you can see the semantic relation to 802.11.
I would, however, like to point out that this branch has pseudo-working
IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
(I'll do that as soon as I get to it, it'll pop up on that branch too.)
This is relevant to the discussion because it's a feature which is
non-obvious (to me) on how to do with the VXLAN model of having an
entirely separate FDB. Meanwhile, with this architecture, the proof of
8 files changed, 176 insertions(+), 15 deletions(-)
I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you condidered building a new driver
Yes I have; I dismissed the approach because even though an fdb is
reasonable to duplicate, I did not believe replicating multicast
snooping code into both VPLS and 802.11 (and possibly VXLAN) to be a
viable option. ...is it?
Post by Stephen Hemminger
(like VXLAN does) which does the forwarding you want. Having all
features in one driver makes for worse performance, and increased
complexity.
Can you elaborate? I agree with that sentence as a general statement,
but a general statement needs to apply to a specific situation. As
discussed in the previous thread with Nikolay, checking skb->_refdst
against 0 should be doable without touching additional cachelines, so
the performance cost should be rather small. For complexity - it's
keeping an extra pointer around, which is semantically bound to the
existing net_bridge_fdb_entry->dst. On the other hand, it spares us
from another copy of a fdb implementation, and two copies of multicast
snooping code... I honestly believe this patchset is a good approach.


-David
Nikolay Aleksandrov
2017-08-22 11:01:40 UTC
Permalink
Post by Stephen Hemminger
On Mon, 21 Aug 2017 19:15:17 +0200
Post by David Lamparter
Hi all,
this is an update on the earlier "[RFC net-next] VPLS support". Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)
The NVGRE implementation in the 3rd patch in this series is actually an
accident - I was just wiring up gretap as a reference; only after I was
done I noticed that that sums up to NVGRE, more or less. IMHO, it does
serve well to demonstrate the bridge changes are not VPLS-specific.
I've tested some basic setups, the chain from LDP down into the kernel
works at least in these. FRR has some testcases around from OpenBSD
VPLS support, I haven't wired that up to run against Linux / this
patchset yet.
Same as before (API didn't change).
The patchset needs a lot of polishing (yes I left my TODO notes in the
commit messages), for now my primary concern is overall design
feedback. Roopa has already provided a lot of input (Thanks!); the
major topic I'm expecting to get discussion on is the bridge FDB
changes.
Got some useful input; but still need feedback on the bridge FDB
changes (first 2 patches). I don't believe it to have a significant
impact on existing bridge operation, and I believe a multipoint tunnel
driver without its own FDB (e.g. NVGRE in this set) should perform
better than one with its own FDB (e.g. existing VXLAN).
P.S.: For a little context on the bridge FDB changes - I'm hoping to
find some time to extend this to the MDB to allow aggregating dst
metadata and handing down a list of dst metas on TX. This isn't
specifically for VPLS but rather to give sufficient information to the
802.11 stack to allow it to optimize selecting rates (or unicasting)
for multicast traffic by having the multicast subscriber list known.
This is done by major commercial wifi solutions (e.g. google "dynamic
multicast optimization".)
https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
Please note that the patches in that branch are not at an acceptable
quality level, but you can see the semantic relation to 802.11.
I would, however, like to point out that this branch has pseudo-working
IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
(I'll do that as soon as I get to it, it'll pop up on that branch too.)
This is relevant to the discussion because it's a feature which is
non-obvious (to me) on how to do with the VXLAN model of having an
entirely separate FDB. Meanwhile, with this architecture, the proof of
8 files changed, 176 insertions(+), 15 deletions(-)
Cheers,
-David
include/linux/netdevice.h | 18 ++++++
include/net/dst_metadata.h | 51 ++++++++++++++---
include/net/ip_tunnels.h | 5 ++
include/uapi/linux/lwtunnel.h | 8 +++
include/uapi/linux/neighbour.h | 2 +
include/uapi/linux/rtnetlink.h | 5 ++
net/bridge/br.c | 2 +-
net/bridge/br_device.c | 4 ++
net/bridge/br_fdb.c | 119 ++++++++++++++++++++++++++++++++--------
net/bridge/br_input.c | 6 +-
net/bridge/br_private.h | 6 +-
net/core/lwtunnel.c | 1 +
net/ipv4/ip_gre.c | 40 ++++++++++++--
net/ipv4/ip_tunnel.c | 1 +
net/ipv4/ip_tunnel_core.c | 87 +++++++++++++++++++++++------
net/mpls/Kconfig | 11 ++++
net/mpls/Makefile | 1 +
net/mpls/af_mpls.c | 113 ++++++++++++++++++++++++++++++++------
net/mpls/internal.h | 44 +++++++++++++--
net/mpls/vpls.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
20 files changed, 990 insertions(+), 84 deletions(-)
I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you condidered building a new driver (like VXLAN does)
which does the forwarding you want. Having all features in one driver
makes for worse performance, and increased complexity.
+1

As I said before, a separate implementation will be much cleaner and will not affect
the bridge in any way, paying both performance and complexity price for something that
the majority of users will not be using isn't worth it. In addition this creates a
silent dependency between the bridge and the fdb metadata dst users, it would be much
more preferable to be able to run them separately.
If there is any code that will need to be re-used by VPLS (or anyone else) figure out a way
to factor it out.
David Lamparter
2017-08-22 11:32:07 UTC
Permalink
Post by Nikolay Aleksandrov
Post by Stephen Hemminger
I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you condidered building a new driver (like VXLAN does)
which does the forwarding you want. Having all features in one driver
makes for worse performance, and increased complexity.
+1
As I said before, a separate implementation will be much cleaner and will not affect
the bridge in any way, paying both performance and complexity price for something that
the majority of users will not be using isn't worth it. In addition this creates a
silent dependency between the bridge and the fdb metadata dst users, it would be much
more preferable to be able to run them separately.
If there is any code that will need to be re-used by VPLS (or anyone else) figure out a way
to factor it out.
Could you tell me why this argument didn't apply to the bridge vlan
tunnel code? It adds complexity to the bridge specifically for VXLAN
(and it does *not* transfer to VPLS or 802.11) and reduces performance

... by actually accessing the same metadata that this patchset does.


-David
Nikolay Aleksandrov
2017-08-22 11:55:04 UTC
Permalink
Post by David Lamparter
Post by Nikolay Aleksandrov
Post by Stephen Hemminger
I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you condidered building a new driver (like VXLAN does)
which does the forwarding you want. Having all features in one driver
makes for worse performance, and increased complexity.
+1
As I said before, a separate implementation will be much cleaner and will not affect
the bridge in any way, paying both performance and complexity price for something that
the majority of users will not be using isn't worth it. In addition this creates a
silent dependency between the bridge and the fdb metadata dst users, it would be much
more preferable to be able to run them separately.
If there is any code that will need to be re-used by VPLS (or anyone else) figure out a way
to factor it out.
Could you tell me why this argument didn't apply to the bridge vlan
tunnel code? It adds complexity to the bridge specifically for VXLAN
(and it does *not* transfer to VPLS or 802.11) and reduces performance
... by actually accessing the same metadata that this patchset does.
-David
The separation is clean and does not add any dependencies, that code is well isolated.
As for performance, the impact is minimal as it adds a test for a port flag that is
already in the cache at that point. In fact it can be compiled-out entirely if you
disable bridge vlan support. The metadata you're referring to is not accessed if
the port flag is not set or vlan support is compiled out removing its impact entirely.
You can have a vxlan setup without bridge, no ?
David Lamparter
2017-08-22 12:06:10 UTC
Permalink
Post by Nikolay Aleksandrov
Post by David Lamparter
Post by Nikolay Aleksandrov
Post by Stephen Hemminger
I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you condidered building a new driver (like VXLAN does)
which does the forwarding you want. Having all features in one driver
makes for worse performance, and increased complexity.
+1
As I said before, a separate implementation will be much cleaner and will not affect
the bridge in any way, paying both performance and complexity price for something that
the majority of users will not be using isn't worth it. In addition this creates a
silent dependency between the bridge and the fdb metadata dst users, it would be much
more preferable to be able to run them separately.
If there is any code that will need to be re-used by VPLS (or anyone else) figure out a way
to factor it out.
Could you tell me why this argument didn't apply to the bridge vlan
tunnel code? It adds complexity to the bridge specifically for VXLAN
(and it does *not* transfer to VPLS or 802.11) and reduces performance
... by actually accessing the same metadata that this patchset does.
The separation is clean and does not add any dependencies, that code
is well isolated. As for performance, the impact is minimal as it
adds a test for a port flag that is already in the cache at that
point.
Ah, ok, now this is useful input... I can add a BR_PORT_METADATA flag.
Post by Nikolay Aleksandrov
In fact it can be compiled-out entirely if you disable bridge vlan
support. The metadata you're referring to is not accessed if the port
flag is not set or vlan support is compiled out removing its impact
entirely.
Ok, I can probably adapt this patchset to do the same.
Post by Nikolay Aleksandrov
You can have a vxlan setup without bridge, no ?
You can run the VPLS code without a bridge too... behaviour will be the
same as with other ip tunnels when you set the destination to multicast
(packets get flooded.)


-David
Roopa Prabhu
2017-08-22 04:43:15 UTC
Permalink
Post by David Lamparter
Hi all,
this is an update on the earlier "[RFC net-next] VPLS support". Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)
The NVGRE implementation in the 3rd patch in this series is actually an
accident - I was just wiring up gretap as a reference; only after I was
done I noticed that that sums up to NVGRE, more or less. IMHO, it does
serve well to demonstrate the bridge changes are not VPLS-specific.
I've tested some basic setups, the chain from LDP down into the kernel
works at least in these. FRR has some testcases around from OpenBSD
VPLS support, I haven't wired that up to run against Linux / this
patchset yet.
Same as before (API didn't change).
The patchset needs a lot of polishing (yes I left my TODO notes in the
commit messages), for now my primary concern is overall design
feedback. Roopa has already provided a lot of input (Thanks!); the
major topic I'm expecting to get discussion on is the bridge FDB
changes.
Got some useful input; but still need feedback on the bridge FDB
changes (first 2 patches). I don't believe it to have a significant
impact on existing bridge operation, and I believe a multipoint tunnel
driver without its own FDB (e.g. NVGRE in this set) should perform
better than one with its own FDB (e.g. existing VXLAN).
P.S.: For a little context on the bridge FDB changes - I'm hoping to
find some time to extend this to the MDB to allow aggregating dst
metadata and handing down a list of dst metas on TX. This isn't
specifically for VPLS but rather to give sufficient information to the
802.11 stack to allow it to optimize selecting rates (or unicasting)
for multicast traffic by having the multicast subscriber list known.
This is done by major commercial wifi solutions (e.g. google "dynamic
multicast optimization".)
https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
Please note that the patches in that branch are not at an acceptable
quality level, but you can see the semantic relation to 802.11.
I would, however, like to point out that this branch has pseudo-working
IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
(I'll do that as soon as I get to it, it'll pop up on that branch too.)
This is relevant to the discussion because it's a feature which is
non-obvious (to me) on how to do with the VXLAN model of having an
entirely separate FDB. Meanwhile, with this architecture, the proof of
8 files changed, 176 insertions(+), 15 deletions(-)
David, what is special about the vpls igmp/mld snooping code ?...do
you have to snoop vpls attrs ?.
in the vxlan model.., the vxlan driver can snoop its own attrs eg
vxlan-id, remote dst etc.
and the pkt is passed up to the bridge where it will hit the normal
bridge igmp/mpld snooping code.
can you pls elaborate ?

keeping vpls specific code and api in a separate vpls driver allows
for cleanly extending it in the future.
David Lamparter
2017-08-22 11:24:51 UTC
Permalink
Post by Roopa Prabhu
Post by David Lamparter
This is relevant to the discussion because it's a feature which is
non-obvious (to me) on how to do with the VXLAN model of having an
entirely separate FDB. Meanwhile, with this architecture, the proof of
8 files changed, 176 insertions(+), 15 deletions(-)
David, what is special about the vpls igmp/mld snooping code ?...do
you have to snoop vpls attrs ?.
It just needs to snoop which tunnel endpoint[s] a multicast group is
subscribed on.
Post by Roopa Prabhu
in the vxlan model.., the vxlan driver can snoop its own attrs eg
vxlan-id, remote dst etc.
and the pkt is passed up to the bridge where it will hit the normal
bridge igmp/mpld snooping code.
can you pls elaborate ?
Yes, that's exactly what the hack I have is doing, it's snooping the
driver-specific attrs into a metadata_dst, passing it up to the bridge,
which puts the metadata_dst on the MDB entry.

What I'm arguing against is duplicating the entire MDB into all of the
drivers.

The snooping machinery is a pretty complex piece of code which holds
information as a cartesian combination of MAC * VLAN * Group. It has
several configuration knobs and can even send its own packets. It's
complex enough to have been the subject of several CVEs over its history
(e.g. CVE-2013-4129, CVE-2013-2636.)
Post by Roopa Prabhu
keeping vpls specific code and api in a separate vpls driver allows
for cleanly extending it in the future.
That's exactly my point: it's not VPLS specific. It's exactly the same
functionality for VXLAN, VPLS, NVGRE and 802.11. The functionality is
"remember the tunnel endpoint."
It's the same bits as lwtunnel uses, which is why the 2nd patch in this
series reuses the existing lwtunnel netlink encapsulation functions.


-David
Amine Kherbouche
2017-09-11 08:02:20 UTC
Permalink
Hi David,

Do you plan to send a v3?
Post by David Lamparter
Hi all,
this is an update on the earlier "[RFC net-next] VPLS support". Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)
Amine Kherbouche
2017-09-19 14:46:38 UTC
Permalink
Hi David,

What's next ? do you plan to send a v3 or should I do it ?
Post by Amine Kherbouche
Hi David,
Do you plan to send a v3?
Post by David Lamparter
Hi all,
this is an update on the earlier "[RFC net-next] VPLS support". Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)
Continue reading on narkive:
Loading...