Discussion:
[PATCH] macvlan: add tap device backend
David Miller
2009-08-07 03:20:33 UTC
From: Arnd Bergmann <arnd at arndb.de>
Date: Thu, 6 Aug 2009 21:50:28 +0000
This is a first prototype of a new interface into the network
stack, to eventually replace tun/tap and the bridge driver
in certain virtual machine setups.
I don't know enough to say how good a solution this is for
the problem, but I certainly like this driver for its
utter simplicity and minimalism.
Arnd Bergmann
2009-08-06 21:50:28 UTC
This is a first prototype of a new interface into the network
stack, to eventually replace tun/tap and the bridge driver
in certain virtual machine setups.

Background
----------
The 'Edge Virtual Bridging' working group is discussing ways to overcome
the limitation of virtual bridges in hypervisors. One important part
of this is the Virtual Ethernet Port Aggregator (VEPA), as described in
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf

In short, the idea of VEPA is that virtual machines do not communicate
with each other through direct bridging in the hypervisor but only via
an external managed switch that is already well integrated into the data
center, including network filtering, accounting and monitoring. While
we can do most of that efficiently in the Linux bridge code, doing it
externally simplifies the overall setup.

Related work
------------
Patches to implement VEPA in the Linux bridge driver have been posted by
Anna Fischer in June, see http://patchwork.ozlabs.org/patch/28702/. Those
patches are good and hopefully get merged in 2.6.32, but I think we can
take some shortcuts with an alternative approach:

The macvlan driver already has the property of forwarding all traffic
between guests and an external interface but not between the guests, just
as VEPA needs it. Also, VEPA explicitly does not want or need advanced
filtering in the way that netfilter-bridge provides, so we can use macvlan
to replace the bridge code in this setup, reducing the code path through
the kernel. This works fine with containers and network namespaces,
but not easily with kvm/qemu because we only have a network device.

Or Gerlitz posted a "raw" packet socket backend for qemu to deal with this,
at http://marc.info/?l=qemu-devel&m=124653801212767 and at least three
other people have implemented similar functionality independently.

This driver
-----------
While the other approaches should work as well, doing it using a tap
interface should give additional benefits:

* We can keep using the optimizations for jumbo frames that we have put
into the tun/tap driver.

* No need for the root permissions that packet sockets require: just use 'ip
link add link type macvtap' to create a new device and give it the right
permissions using udev (using one tap per macvlan netdev).

* support for multiqueue network adapters by opening the tap device
multiple times, using one file descriptor per guest CPU/network
queue/interrupt (if the adapter supports multiple queues on a single
MAC address).

* support for zero-copy receive/transmit using async I/O on the tap device
(if the adapter supports per MAC rx queues).

* The same framework in macvlan can be used to add a third backend
into a future kernel based virtio-net implementation.

This version of the driver does not support any of those features,
but they all appear possible to add ;).
The driver is currently called 'macvtap', but I'd be more than happy
to change that if anyone could suggest a better name. The code is
still in an early stage and I wish I had found more time to polish
it, but at this time, I'd first like to know if people agree with the
basic concept at all.
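
To make that tap user space interface concrete, here is a minimal sketch
of how a process such as qemu could drive one of these taps. The node
path /dev/macvtap0 is an assumption: the driver below registers no class
device yet (see the TODO), so the node has to be created by hand with
mknod, using the dynamically allocated major and a minor equal to the
ifindex of the macvtap netdev.

#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

int main(void)
{
	char frame[2048];
	struct iovec iov = { .iov_base = frame, .iov_len = sizeof(frame) };
	ssize_t len;
	int fd;

	/* hypothetical node: mknod /dev/macvtap0 c <major> <ifindex> */
	fd = open("/dev/macvtap0", O_RDWR);
	if (fd < 0)
		return 1;

	/*
	 * I/O goes through the driver's aio_read/aio_write hooks;
	 * each call transfers exactly one ethernet frame, as with
	 * tun/tap.
	 */
	len = readv(fd, &iov, 1);	/* frame from the guest's MAC */
	if (len > 0) {
		iov.iov_len = len;
		writev(fd, &iov, 1);	/* echo it back out the wire */
	}

	close(fd);
	return 0;
}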

Cc: Patrick McHardy <kaber at trash.net>
Cc: Stephen Hemminger <shemminger at linux-foundation.org>
Cc: "David S. Miller" <davem at davemloft.net>
Cc: "Michael S. Tsirkin" <mst at redhat.com>
Cc: Herbert Xu <herbert at gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz at voltaire.com>
Cc: "Fischer, Anna" <anna.fischer at hp.com>
Cc: netdev at vger.kernel.org
Cc: bridge at lists.linux-foundation.org
Cc: linux-kernel at vger.kernel.org
Cc: Edge Virtual Bridging <evb at yahoogroups.com>
Signed-off-by: Arnd Bergmann <arnd at arndb.de>

---

The evb mailing list eats Cc headers, so please make sure to keep everybody
in your Cc list when replying there.
---
drivers/net/Kconfig | 12 ++
drivers/net/Makefile | 1 +
drivers/net/macvlan.c | 39 +++-----
drivers/net/macvlan.h | 37 +++++++
drivers/net/macvtap.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 341 insertions(+), 24 deletions(-)
create mode 100644 drivers/net/macvlan.h
create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..0b9ac6a 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
To compile this driver as a module, choose M here: the module
will be called macvlan.

+config MACVTAP
+ tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+ depends on MACVLAN
+ help
+ This adds a specialized tap character device driver that is based
+ on the MAC-VLAN network interface, called macvtap. A macvtap device
+ can be added in the same way as a macvlan device, using 'type
+ macvtap', and then be accessed through the tap user space interface.
+
+ To compile this driver as a module, choose M here: the module
+ will be called macvtap.
+
config EQUALIZER
tristate "EQL (serial line load balancing) support"
---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..8a2d2d7 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
obj-$(CONFIG_DUMMY) += dummy.o
obj-$(CONFIG_IFB) += ifb.o
obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
obj-$(CONFIG_DE600) += de600.o
obj-$(CONFIG_DE620) += de620.o
obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 99eed9f..9f7dc6a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -30,22 +30,7 @@
#include <linux/if_macvlan.h>
#include <net/rtnetlink.h>

-#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
-
-struct macvlan_port {
- struct net_device *dev;
- struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
- struct list_head vlans;
-};
-
-struct macvlan_dev {
- struct net_device *dev;
- struct list_head list;
- struct hlist_node hlist;
- struct macvlan_port *port;
- struct net_device *lowerdev;
-};
-
+#include "macvlan.h"

static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
const unsigned char *addr)
@@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
else
nskb->pkt_type = PACKET_MULTICAST;

- netif_rx(nskb);
+ vlan->receive(nskb);
}
}
}
@@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
skb->dev = dev;
skb->pkt_type = PACKET_HOST;

- netif_rx(skb);
+ vlan->receive(skb);
return NULL;
}

-static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
+int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
const struct macvlan_dev *vlan = netdev_priv(dev);
unsigned int len = skb->len;
@@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
}
return NETDEV_TX_OK;
}
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);

static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
unsigned short type, const void *daddr,
@@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
.ndo_validate_addr = eth_validate_addr,
};

-static void macvlan_setup(struct net_device *dev)
+void macvlan_setup(struct net_device *dev)
{
ether_setup(dev);

@@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev)
dev->ethtool_ops = &macvlan_ethtool_ops;
dev->tx_queue_len = 0;
}
+EXPORT_SYMBOL_GPL(macvlan_setup);

static int macvlan_port_create(struct net_device *dev)
{
@@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev)
}
}

-static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
+int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
{
if (tb[IFLA_ADDRESS]) {
if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
@@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
}
return 0;
}
+EXPORT_SYMBOL_GPL(macvlan_validate);

-static int macvlan_newlink(struct net_device *dev,
- struct nlattr *tb[], struct nlattr *data[])
+int macvlan_newlink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
{
struct macvlan_dev *vlan = netdev_priv(dev);
struct macvlan_port *port;
@@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev,
vlan->lowerdev = lowerdev;
vlan->dev = dev;
vlan->port = port;
+ vlan->receive = netif_rx;

err = register_netdevice(dev);
if (err < 0)
@@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev,
macvlan_transfer_operstate(dev);
return 0;
}
+EXPORT_SYMBOL_GPL(macvlan_newlink);

-static void macvlan_dellink(struct net_device *dev)
+void macvlan_dellink(struct net_device *dev)
{
struct macvlan_dev *vlan = netdev_priv(dev);
struct macvlan_port *port = vlan->port;
@@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev)
if (list_empty(&port->vlans))
macvlan_port_destroy(port->dev);
}
+EXPORT_SYMBOL_GPL(macvlan_dellink);

static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
.kind = "macvlan",
diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h
new file mode 100644
index 0000000..3f3c6c3
--- /dev/null
+++ b/drivers/net/macvlan.h
@@ -0,0 +1,37 @@
+#ifndef _MACVLAN_H
+#define _MACVLAN_H
+
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <linux/list.h>
+
+#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
+
+struct macvlan_port {
+ struct net_device *dev;
+ struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
+ struct list_head vlans;
+};
+
+struct macvlan_dev {
+ struct net_device *dev;
+ struct list_head list;
+ struct hlist_node hlist;
+ struct macvlan_port *port;
+ struct net_device *lowerdev;
+
+ int (*receive)(struct sk_buff *skb);
+};
+
+extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev);
+
+extern void macvlan_setup(struct net_device *dev);
+
+extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]);
+
+extern int macvlan_newlink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[]);
+
+extern void macvlan_dellink(struct net_device *dev);
+
+#endif /* _MACVLAN_H */
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..d99bfc0
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,276 @@
+#include <linux/etherdevice.h>
+#include <linux/nsproxy.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+
+#include "macvlan.h"
+
+struct macvtap_dev {
+ struct macvlan_dev m;
+ struct cdev cdev;
+ struct sk_buff_head readq;
+ wait_queue_head_t wait;
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a large value
+ */
+static int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+
+static int macvtap_receive(struct sk_buff *skb)
+{
+ struct macvtap_dev *vtap = netdev_priv(skb->dev);
+
+ skb_queue_tail(&vtap->readq, skb);
+ wake_up(&vtap->wait);
+ return 0;
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+ struct net *net = current->nsproxy->net_ns;
+ int ifindex = iminor(inode);
+ struct net_device *dev = dev_get_by_index(net, ifindex);
+ int err;
+
+ err = -ENODEV;
+ if (!dev)
+ goto out1;
+
+ file->private_data = netdev_priv(dev);
+ err = 0;
+out1:
+ return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+ struct macvtap_dev *vtap = file->private_data;
+
+ if (!vtap)
+ return 0;
+
+ dev_put(vtap->m.dev);
+ return 0;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
+ const struct iovec *iv, size_t count,
+ int noblock)
+{
+ struct sk_buff *skb;
+ size_t len = count;
+
+ if (unlikely(len < ETH_HLEN))
+ return -EINVAL;
+
+ skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
+
+ if (!skb) {
+ vtap->m.dev->stats.rx_dropped++;
+ return -ENOMEM;
+ }
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, count);
+
+ if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+ vtap->m.dev->stats.rx_dropped++;
+ kfree_skb(skb);
+ return -EFAULT;
+ }
+
+ skb_set_network_header(skb, ETH_HLEN);
+ skb->dev = vtap->m.lowerdev;
+
+ macvlan_start_xmit(skb, vtap->m.dev);
+
+ return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t result;
+ struct macvtap_dev *vtap = file->private_data;
+
+ result = macvtap_get_user(vtap, iv, iov_length(iv, count),
+ file->f_flags & O_NONBLOCK);
+
+ return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
+ struct sk_buff *skb,
+ struct iovec *iv, int len)
+{
+ int ret;
+
+ skb_push(skb, ETH_HLEN);
+ len = min_t(int, skb->len, len);
+
+ ret = skb_copy_datagram_iovec(skb, 0, iv, len);
+
+ vtap->m.dev->stats.rx_packets++;
+ vtap->m.dev->stats.rx_bytes += len;
+
+ return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct macvtap_dev *vtap = file->private_data;
+ DECLARE_WAITQUEUE(wait, current);
+ struct sk_buff *skb;
+ ssize_t len, ret = 0;
+
+ if (!vtap)
+ return -EBADFD;
+
+ len = iov_length(iv, count);
+ if (len < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ add_wait_queue(&vtap->wait, &wait);
+ while (len) {
+ current->state = TASK_INTERRUPTIBLE;
+
+ /* Read frames from the queue */
+ if (!(skb=skb_dequeue(&vtap->readq))) {
+ if (file->f_flags & O_NONBLOCK) {
+ ret = -EAGAIN;
+ break;
+ }
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ /* Nothing to read, let's sleep */
+ schedule();
+ continue;
+ }
+ ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
+ kfree_skb(skb);
+ break;
+ }
+
+ current->state = TASK_RUNNING;
+ remove_wait_queue(&vtap->wait, &wait);
+
+out:
+ return ret;
+}
+
+struct file_operations macvtap_fops = {
+ .owner = THIS_MODULE,
+ .open = macvtap_open,
+ .release = macvtap_release,
+ .aio_read = macvtap_aio_read,
+ .aio_write = macvtap_aio_write,
+ .llseek = no_llseek,
+};
+
+static int macvtap_newlink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
+{
+ struct macvtap_dev *vtap = netdev_priv(dev);
+ int err;
+
+ err = macvlan_newlink(dev, tb, data);
+ if (err)
+ goto out1;
+
+ cdev_init(&vtap->cdev, &macvtap_fops);
+ vtap->cdev.owner = THIS_MODULE;
+ err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
+
+ if (err)
+ goto out2;
+
+ /*
+ * TODO: add class dev so device node gets created automatically
+ * by udev.
+ */
+ pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
+ __func__, __LINE__, MAJOR(macvtap_major),
+ dev->ifindex, dev->name);
+
+ skb_queue_head_init(&vtap->readq);
+ init_waitqueue_head(&vtap->wait);
+ vtap->m.receive = macvtap_receive;
+
+ return 0;
+
+out2:
+ macvlan_dellink(dev);
+out1:
+ return err;
+}
+
+static void macvtap_dellink(struct net_device *dev)
+{
+ struct macvtap_dev *vtap = netdev_priv(dev);
+ cdev_del(&vtap->cdev);
+ /* TODO: kill open file descriptors */
+ macvlan_dellink(dev);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+ .kind = "macvtap",
+ .priv_size = sizeof(struct macvtap_dev),
+ .setup = macvlan_setup,
+ .validate = macvlan_validate,
+ .newlink = macvtap_newlink,
+ .dellink = macvtap_dellink,
+};
+
+static int macvtap_init(void)
+{
+ int err;
+
+ err = alloc_chrdev_region(&macvtap_major, 0,
+ MACVTAP_NUM_DEVS, "macvtap");
+ if (err)
+ goto out1;
+
+ err = rtnl_link_register(&macvtap_link_ops);
+ if (err)
+ goto out2;
+
+ return 0;
+
+out2:
+ unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+ return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+ rtnl_link_unregister(&macvtap_link_ops);
+ unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd at arndb.de>");
+MODULE_LICENSE("GPL");
--
1.6.0.4
Daniel Robbins
2009-08-07 17:35:48 UTC
This is a first prototype of a new interface into the network
stack, to eventually replace tun/tap and the bridge driver
in certain virtual machine setups.
I have some general questions about the intended use and benefits of
VEPA, from an IT perspective:

In which virtual machine setups and technologies do you foresee this
interface being used?
Is this new interface to be used within a virtual machine or
container, on the master node, or both?
What interface(s) would need to be configured for a single virtual
machine to use VEPA to access the network?
What are the current flexibility, security or performance limitations
of tun/tap and bridge that make this new interface necessary or
beneficial?
Is this new interface useful at all for VPN solutions or is it
*specifically* targeted for connecting virtual machines to the
network?
Is this essentially a bridge with layer-2 isolation for the virtual
machine interfaces built-in? If isolation is provided, what mechanism
is used to accomplish this, and how secure is it?
Does VEPA look like a regular ethernet interface (eth0) on the virtual
machine side?
Are there any associated user-space tools required for configuring a VEPA?

Do you have any HOWTO-style documentation that would demonstrate how
this interface would be used in production? Or a FAQ?

This seems like a very interesting effort but I don't quite have a
good grasp of VEPA's benefits and limitations -- I imagine that others
are in the same boat too.

Best Regards,

Daniel
Arnd Bergmann
2009-08-09 20:42:24 UTC
Post by Arnd Bergmann
* The same framework in macvlan can be used to add a third backend
into a future kernel based virtio-net implementation.
Could you split the patches up, to make this last easier?
patch 1 - export framework
patch 2 - code using it
Sure, will do.
Post by Arnd Bergmann
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
+ const struct iovec *iv, size_t count,
+ int noblock)
+{
+ struct sk_buff *skb;
+ size_t len = count;
+
+ if (unlikely(len < ETH_HLEN))
+ return -EINVAL;
+
+ skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
+
+ if (!skb) {
+ vtap->m.dev->stats.rx_dropped++;
+ return -ENOMEM;
+ }
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, count);
+
+ if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+ vtap->m.dev->stats.rx_dropped++;
+ kfree_skb(skb);
+ return -EFAULT;
+ }
+
+ skb_set_network_header(skb, ETH_HLEN);
+ skb->dev = vtap->m.lowerdev;
+
+ macvlan_start_xmit(skb, vtap->m.dev);
+
+ return count;
+}
With tap, we discovered that not limiting the number of outstanding
skbs hurts UDP performance. And the solution was to limit
the number of outstanding packets - with hacks to work around
the fact that userspace .
Something seems to be missing in your last sentence here.

My driver OTOH is also missing any sort of flow control in both
RX and TX direction ;) For RX, there should probably just be
a limit of frames that get buffered in the ring.

For TX, I guess there should be a way to let the packet
scheduler handle this and give us a chance to block and
unblock at the right time. I haven't found out yet how to
do that.

Would it be enough to check the dev_queue_xmit() return
code for NETDEV_TX_BUSY?

How would I get notified when it gets free again?
Post by Arnd Bergmann
+ ret = skb_copy_datagram_iovec(skb, 0, iv, len);
+
+ vtap->m.dev->stats.rx_packets++;
+ vtap->m.dev->stats.rx_bytes += len;
where does the atomicity guarantee for these counters come from?
AFAIK, we never do that for any driver. They are statistics only and
need not be 100% correct, so the networking stack goes for
lower overhead and 99.9% correctness.
Post by Arnd Bergmann
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct macvtap_dev *vtap = file->private_data;
+ DECLARE_WAITQUEUE(wait, current);
+ struct sk_buff *skb;
+ ssize_t len, ret = 0;
+
+ if (!vtap)
+ return -EBADFD;
+
+ len = iov_length(iv, count);
+ if (len < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ add_wait_queue(&vtap->wait, &wait);
+ while (len) {
+ current->state = TASK_INTERRUPTIBLE;
+
+ /* Read frames from the queue */
+ if (!(skb=skb_dequeue(&vtap->readq))) {
+ if (file->f_flags & O_NONBLOCK) {
+ ret = -EAGAIN;
+ break;
+ }
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ /* Nothing to read, let's sleep */
+ schedule();
+ continue;
+ }
+ ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
Don't cast away the constness. Instead, fix macvtap_put_user
to use skb_copy_datagram_const_iovec, which does not modify the iovec.
Ah, good catch. I had copied that from the tun driver before you
fixed it there and failed to fix it the right way when I adapted
it for the new interface.

Thanks for the review,

Arnd <><
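
The RX-side limit mentioned above could be as simple as bounding the
length of the readq in macvtap_receive(); a sketch, with a made-up
placeholder for the limit:

/* Sketch only: drop frames once too many are queued for the reader. */
#define MACVTAP_QUEUE_LEN 500	/* arbitrary placeholder value */

static int macvtap_receive(struct sk_buff *skb)
{
	struct macvtap_dev *vtap = netdev_priv(skb->dev);

	if (skb_queue_len(&vtap->readq) >= MACVTAP_QUEUE_LEN) {
		vtap->m.dev->stats.rx_dropped++;
		kfree_skb(skb);
		return NET_RX_DROP;
	}

	skb_queue_tail(&vtap->readq, skb);
	wake_up(&vtap->wait);
	return NET_RX_SUCCESS;
}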
Arnd Bergmann
2009-08-10 13:29:46 UTC
Post by Arnd Bergmann
Would it be enough to check the dev_queue_xmit() return
code for NETDEV_TX_BUSY?
How would I get notified when it gets free again?
You can do this by creating a socket. Look at how tun does
this now.
Hmm, I was hoping to be able to avoid this, because I can
interact more directly with the outbound physical interface
using dev_queue_xmit() instead of netif_rx_ni().

I'll have a look. Thanks,

Arnd <><
Michael S. Tsirkin
2009-08-10 13:42:29 UTC
Post by Arnd Bergmann
Post by Arnd Bergmann
Would it be enough to check the dev_queue_xmit() return
code for NETDEV_TX_BUSY?
How would I get notified when it gets free again?
You can do this by creating a socket. Look at how tun does
this now.
Hmm, I was hoping to be able to avoid this, because I can
interact more directly with the outbound physical interface
using dev_queue_xmit() instead of netif_rx_ni().
Yeah, that's what tun does. The socket just notifies you when
packets are freed.
Post by Arnd Bergmann
I'll have a look. Thanks,
Arnd <><
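
For reference, the tun mechanism Michael points to charges each outbound
skb to a socket's send buffer, so the writer blocks (or gets -EAGAIN)
once too much data is in flight, and the skb destructor wakes it when
the driver frees a transmitted packet. A sketch of what that could look
like in macvtap_get_user(), assuming a hypothetical struct sock 'sk'
embedded in macvtap_dev, which the posted driver does not have:

	/* instead of alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL): */
	skb = sock_alloc_send_skb(&vtap->sk, NET_IP_ALIGN + len,
				  noblock, &err);
	if (!skb)
		return err;	/* typically -EAGAIN when sndbuf is full */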
Michael S. Tsirkin
2009-08-10 08:50:03 UTC
Post by Arnd Bergmann
Post by Arnd Bergmann
* The same framework in macvlan can be used to add a third backend
into a future kernel based virtio-net implementation.
Could you split the patches up, to make this last easier?
patch 1 - export framework
patch 2 - code using it
Sure, will do.
Post by Arnd Bergmann
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
+ const struct iovec *iv, size_t count,
+ int noblock)
+{
+ struct sk_buff *skb;
+ size_t len = count;
+
+ if (unlikely(len < ETH_HLEN))
+ return -EINVAL;
+
+ skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
+
+ if (!skb) {
+ vtap->m.dev->stats.rx_dropped++;
+ return -ENOMEM;
+ }
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, count);
+
+ if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+ vtap->m.dev->stats.rx_dropped++;
+ kfree_skb(skb);
+ return -EFAULT;
+ }
+
+ skb_set_network_header(skb, ETH_HLEN);
+ skb->dev = vtap->m.lowerdev;
+
+ macvlan_start_xmit(skb, vtap->m.dev);
+
+ return count;
+}
With tap, we discovered that not limiting the number of outstanding
skbs hurts UDP performance. And the solution was to limit
the number of outstanding packets - with hacks to work around
the fact that userspace .
Something seems to be missing in your last sentence here.
Most userspace does not seem to implement software flow control for UDP,
even though it probably should.
Post by Arnd Bergmann
My driver OTOH is also missing any sort of flow control in both
RX and TX direction ;) For RX, there should probably just be
a limit of frames that get buffered in the ring.
For TX, I guess there should be a way to let the packet
scheduler handle this and give us a chance to block and
unblock at the right time. I haven't found out yet how to
do that.
Would it be enough to check the dev_queue_xmit() return
code for NETDEV_TX_BUSY?
How would I get notified when it gets free again?
You can do this by creating a socket. Look at how tun does
this now.
Post by Arnd Bergmann
Post by Arnd Bergmann
+ ret = skb_copy_datagram_iovec(skb, 0, iv, len);
+
+ vtap->m.dev->stats.rx_packets++;
+ vtap->m.dev->stats.rx_bytes += len;
where does the atomicity guarantee for these counters come from?
AFAIK, we never do that for any driver. They are statistics only and
need not be 100% correct, so the networking stack goes for
lower overhead and 99.9% correctness.
Post by Arnd Bergmann
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct macvtap_dev *vtap = file->private_data;
+ DECLARE_WAITQUEUE(wait, current);
+ struct sk_buff *skb;
+ ssize_t len, ret = 0;
+
+ if (!vtap)
+ return -EBADFD;
+
+ len = iov_length(iv, count);
+ if (len < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ add_wait_queue(&vtap->wait, &wait);
+ while (len) {
+ current->state = TASK_INTERRUPTIBLE;
+
+ /* Read frames from the queue */
+ if (!(skb=skb_dequeue(&vtap->readq))) {
+ if (file->f_flags & O_NONBLOCK) {
+ ret = -EAGAIN;
+ break;
+ }
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ /* Nothing to read, let's sleep */
+ schedule();
+ continue;
+ }
+ ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
Don't cast away the constness. Instead, fix macvtap_put_user
to use skb_copy_datagram_const_iovec, which does not modify the iovec.
Ah, good catch. I had copied that from the tun driver before you
fixed it there and failed to fix it the right way when I adapted
it for the new interface.
Thanks for the review,
Arnd <><
Patrick McHardy
2009-08-10 06:47:27 UTC
Post by Arnd Bergmann
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..d99bfc0
--- /dev/null
+++ b/drivers/net/macvtap.c
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+ struct net *net = current->nsproxy->net_ns;
+ int ifindex = iminor(inode);
+ struct net_device *dev = dev_get_by_index(net, ifindex);
+ int err;
+
+ err = -ENODEV;
+ if (!dev)
+ goto out1;
+
+ file->private_data = netdev_priv(dev);
+ err = 0;
+out1:
+ return err;
+}
macvlan will remove all macvlan/vtap devices when the underlying
device is unregistered, at which time you need to release the
device references you're holding. I'd suggest to change the
macvlan_device_event() handler to use

vlan->dev->rtnl_link_ops->dellink(vlan->dev)

instead of macvlan_dellink() so the macvtap_dellink callback
is invoked.
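
For the NETDEV_UNREGISTER case in macvlan_device_event(), the suggested
change amounts to this (a sketch; 'vlan' and 'next' are the list
cursors that handler already uses):

	case NETDEV_UNREGISTER:
		list_for_each_entry_safe(vlan, next, &port->vlans, list)
			/* was: macvlan_dellink(vlan->dev); */
			vlan->dev->rtnl_link_ops->dellink(vlan->dev);
		break;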
Arnd Bergmann
2009-08-10 18:43:32 UTC
Post by Patrick McHardy
macvlan will remove all macvlan/vtap devices when the underlying
device is unregistered, at which time you need to release the
device references you're holding. I'd suggest to change the
macvlan_device_event() handler to use
vlan->dev->rtnl_link_ops->dellink(vlan->dev)
instead of macvlan_dellink() so the macvtap_dellink callback
is invoked.
Ok, will do that. Thanks,

Arnd <><
Michael S. Tsirkin
2009-08-09 08:02:16 UTC
Post by Arnd Bergmann
This driver
-----------
While the other approaches should work as well, doing it using a tap
* We can keep using the optimizations for jumbo frames that we have put
into the tun/tap driver.
* No need for the root permissions that packet sockets require: just use 'ip
link add link type macvtap' to create a new device and give it the right
permissions using udev (using one tap per macvlan netdev).
* support for multiqueue network adapters by opening the tap device
multiple times, using one file descriptor per guest CPU/network
queue/interrupt (if the adapter supports multiple queues on a single
MAC address).
* support for zero-copy receive/transmit using async I/O on the tap device
(if the adapter supports per MAC rx queues).
* The same framework in macvlan can be used to add a third backend
into a future kernel based virtio-net implementation.
Could you split the patches up, to make this last easier?
patch 1 - export framework
patch 2 - code using it
Post by Arnd Bergmann
This version of the driver does not support any of those features,
but they all appear possible to add ;).
The driver is currently called 'macvtap', but I'd be more than happy
to change that if anyone could suggest a better name. The code is
still in an early stage and I wish I had found more time to polish
it, but at this time, I'd first like to know if people agree with the
basic concept at all.
Cc: Patrick McHardy <kaber at trash.net>
Cc: Stephen Hemminger <shemminger at linux-foundation.org>
Cc: "David S. Miller" <davem at davemloft.net>
Cc: "Michael S. Tsirkin" <mst at redhat.com>
Cc: Herbert Xu <herbert at gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz at voltaire.com>
Cc: "Fischer, Anna" <anna.fischer at hp.com>
Cc: netdev at vger.kernel.org
Cc: bridge at lists.linux-foundation.org
Cc: linux-kernel at vger.kernel.org
Cc: Edge Virtual Bridging <evb at yahoogroups.com>
Signed-off-by: Arnd Bergmann <arnd at arndb.de>
---
The evb mailing list eats Cc headers, so please make sure to keep everybody
in your Cc list when replying there.
---
drivers/net/Kconfig | 12 ++
drivers/net/Makefile | 1 +
drivers/net/macvlan.c | 39 +++-----
drivers/net/macvlan.h | 37 +++++++
drivers/net/macvtap.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 341 insertions(+), 24 deletions(-)
create mode 100644 drivers/net/macvlan.h
create mode 100644 drivers/net/macvtap.c
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..0b9ac6a 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
To compile this driver as a module, choose M here: the module
will be called macvlan.
+config MACVTAP
+ tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+ depends on MACVLAN
+ help
+ This adds a specialized tap character device driver that is based
+ on the MAC-VLAN network interface, called macvtap. A macvtap device
+ can be added in the same way as a macvlan device, using 'type
+ macvtap', and then be accessed through the tap user space interface.
+
+ To compile this driver as a module, choose M here: the module
+ will be called macvtap.
+
config EQUALIZER
tristate "EQL (serial line load balancing) support"
---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..8a2d2d7 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
obj-$(CONFIG_DUMMY) += dummy.o
obj-$(CONFIG_IFB) += ifb.o
obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
obj-$(CONFIG_DE600) += de600.o
obj-$(CONFIG_DE620) += de620.o
obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 99eed9f..9f7dc6a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -30,22 +30,7 @@
#include <linux/if_macvlan.h>
#include <net/rtnetlink.h>
-#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
-
-struct macvlan_port {
- struct net_device *dev;
- struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
- struct list_head vlans;
-};
-
-struct macvlan_dev {
- struct net_device *dev;
- struct list_head list;
- struct hlist_node hlist;
- struct macvlan_port *port;
- struct net_device *lowerdev;
-};
-
+#include "macvlan.h"
static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
const unsigned char *addr)
@@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
else
nskb->pkt_type = PACKET_MULTICAST;
- netif_rx(nskb);
+ vlan->receive(nskb);
}
}
}
@@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
skb->dev = dev;
skb->pkt_type = PACKET_HOST;
- netif_rx(skb);
+ vlan->receive(skb);
return NULL;
}
-static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
+int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
const struct macvlan_dev *vlan = netdev_priv(dev);
unsigned int len = skb->len;
@@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
}
return NETDEV_TX_OK;
}
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
unsigned short type, const void *daddr,
@@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
.ndo_validate_addr = eth_validate_addr,
};
-static void macvlan_setup(struct net_device *dev)
+void macvlan_setup(struct net_device *dev)
{
ether_setup(dev);
@@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev)
dev->ethtool_ops = &macvlan_ethtool_ops;
dev->tx_queue_len = 0;
}
+EXPORT_SYMBOL_GPL(macvlan_setup);
static int macvlan_port_create(struct net_device *dev)
{
@@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev)
}
}
-static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
+int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
{
if (tb[IFLA_ADDRESS]) {
if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
@@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
}
return 0;
}
+EXPORT_SYMBOL_GPL(macvlan_validate);
-static int macvlan_newlink(struct net_device *dev,
- struct nlattr *tb[], struct nlattr *data[])
+int macvlan_newlink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
{
struct macvlan_dev *vlan = netdev_priv(dev);
struct macvlan_port *port;
@@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev,
vlan->lowerdev = lowerdev;
vlan->dev = dev;
vlan->port = port;
+ vlan->receive = netif_rx;
err = register_netdevice(dev);
if (err < 0)
@@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev,
macvlan_transfer_operstate(dev);
return 0;
}
+EXPORT_SYMBOL_GPL(macvlan_newlink);
-static void macvlan_dellink(struct net_device *dev)
+void macvlan_dellink(struct net_device *dev)
{
struct macvlan_dev *vlan = netdev_priv(dev);
struct macvlan_port *port = vlan->port;
@@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev)
if (list_empty(&port->vlans))
macvlan_port_destroy(port->dev);
}
+EXPORT_SYMBOL_GPL(macvlan_dellink);
static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
.kind = "macvlan",
diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h
new file mode 100644
index 0000000..3f3c6c3
--- /dev/null
+++ b/drivers/net/macvlan.h
@@ -0,0 +1,37 @@
+#ifndef _MACVLAN_H
+#define _MACVLAN_H
+
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <linux/list.h>
+
+#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
+
+struct macvlan_port {
+ struct net_device *dev;
+ struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
+ struct list_head vlans;
+};
+
+struct macvlan_dev {
+ struct net_device *dev;
+ struct list_head list;
+ struct hlist_node hlist;
+ struct macvlan_port *port;
+ struct net_device *lowerdev;
+
+ int (*receive)(struct sk_buff *skb);
+};
+
+extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev);
+
+extern void macvlan_setup(struct net_device *dev);
+
+extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]);
+
+extern int macvlan_newlink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[]);
+
+extern void macvlan_dellink(struct net_device *dev);
+
+#endif /* _MACVLAN_H */
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..d99bfc0
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,276 @@
+#include <linux/etherdevice.h>
+#include <linux/nsproxy.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+
+#include "macvlan.h"
+
+struct macvtap_dev {
+ struct macvlan_dev m;
+ struct cdev cdev;
+ struct sk_buff_head readq;
+ wait_queue_head_t wait;
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a large value
+ */
+static int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+
+static int macvtap_receive(struct sk_buff *skb)
+{
+ struct macvtap_dev *vtap = netdev_priv(skb->dev);
+
+ skb_queue_tail(&vtap->readq, skb);
+ wake_up(&vtap->wait);
+ return 0;
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+ struct net *net = current->nsproxy->net_ns;
+ int ifindex = iminor(inode);
+ struct net_device *dev = dev_get_by_index(net, ifindex);
+ int err;
+
+ err = -ENODEV;
+ if (!dev)
+ goto out1;
+
+ file->private_data = netdev_priv(dev);
+ err = 0;
+out1:
+ return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+ struct macvtap_dev *vtap = file->private_data;
+
+ if (!vtap)
+ return 0;
+
+ dev_put(vtap->m.dev);
+ return 0;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
+ const struct iovec *iv, size_t count,
+ int noblock)
+{
+ struct sk_buff *skb;
+ size_t len = count;
+
+ if (unlikely(len < ETH_HLEN))
+ return -EINVAL;
+
+ skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
+
+ if (!skb) {
+ vtap->m.dev->stats.rx_dropped++;
+ return -ENOMEM;
+ }
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, count);
+
+ if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+ vtap->m.dev->stats.rx_dropped++;
+ kfree_skb(skb);
+ return -EFAULT;
+ }
+
+ skb_set_network_header(skb, ETH_HLEN);
+ skb->dev = vtap->m.lowerdev;
+
+ macvlan_start_xmit(skb, vtap->m.dev);
+
+ return count;
+}
With tap, we discovered that not limiting the number of outstanding
skbs hurts UDP performance. And the solution was to limit
the number of outstanding packets - with hacks to work around
the fact that userspace .
Post by Arnd Bergmann
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t result;
+ struct macvtap_dev *vtap = file->private_data;
+
+ result = macvtap_get_user(vtap, iv, iov_length(iv, count),
+ file->f_flags & O_NONBLOCK);
+
+ return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
+ struct sk_buff *skb,
+ struct iovec *iv, int len)
+{
+ int ret;
+
+ skb_push(skb, ETH_HLEN);
+ len = min_t(int, skb->len, len);
+
+ ret = skb_copy_datagram_iovec(skb, 0, iv, len);
+
+ vtap->m.dev->stats.rx_packets++;
+ vtap->m.dev->stats.rx_bytes += len;
where does the atomicity guarantee for these counters come from?
Post by Arnd Bergmann
+
+ return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct macvtap_dev *vtap = file->private_data;
+ DECLARE_WAITQUEUE(wait, current);
+ struct sk_buff *skb;
+ ssize_t len, ret = 0;
+
+ if (!vtap)
+ return -EBADFD;
+
+ len = iov_length(iv, count);
+ if (len < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ add_wait_queue(&vtap->wait, &wait);
+ while (len) {
+ current->state = TASK_INTERRUPTIBLE;
+
+ /* Read frames from the queue */
+ if (!(skb=skb_dequeue(&vtap->readq))) {
+ if (file->f_flags & O_NONBLOCK) {
+ ret = -EAGAIN;
+ break;
+ }
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ /* Nothing to read, let's sleep */
+ schedule();
+ continue;
+ }
+ ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
Don't cast away the constness. Instead, fix macvtap_put_user
to use skb_copy_datagram_const_iovec, which does not modify the iovec.
Post by Arnd Bergmann
+ kfree_skb(skb);
+ break;
+ }
+
+ current->state = TASK_RUNNING;
+ remove_wait_queue(&vtap->wait, &wait);
+
+out:
+ return ret;
+}
+
+struct file_operations macvtap_fops = {
+ .owner = THIS_MODULE,
+ .open = macvtap_open,
+ .release = macvtap_release,
+ .aio_read = macvtap_aio_read,
+ .aio_write = macvtap_aio_write,
+ .llseek = no_llseek,
+};
+
+static int macvtap_newlink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
+{
+ struct macvtap_dev *vtap = netdev_priv(dev);
+ int err;
+
+ err = macvlan_newlink(dev, tb, data);
+ if (err)
+ goto out1;
+
+ cdev_init(&vtap->cdev, &macvtap_fops);
+ vtap->cdev.owner = THIS_MODULE;
+ err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
+
+ if (err)
+ goto out2;
+
+ /*
+ * TODO: add class dev so device node gets created automatically
+ * by udev.
+ */
+ pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
+ __func__, __LINE__, MAJOR(macvtap_major),
+ dev->ifindex, dev->name);
+
+ skb_queue_head_init(&vtap->readq);
+ init_waitqueue_head(&vtap->wait);
+ vtap->m.receive = macvtap_receive;
+
+ return 0;
+
+out2:
+ macvlan_dellink(dev);
+out1:
+ return err;
+}
+
+static void macvtap_dellink(struct net_device *dev)
+{
+ struct macvtap_dev *vtap = netdev_priv(dev);
+ cdev_del(&vtap->cdev);
+ /* TODO: kill open file descriptors */
+ macvlan_dellink(dev);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+ .kind = "macvtap",
+ .priv_size = sizeof(struct macvtap_dev),
+ .setup = macvlan_setup,
+ .validate = macvlan_validate,
+ .newlink = macvtap_newlink,
+ .dellink = macvtap_dellink,
+};
+
+static int macvtap_init(void)
+{
+ int err;
+
+ err = alloc_chrdev_region(&macvtap_major, 0,
+ MACVTAP_NUM_DEVS, "macvtap");
+ if (err)
+ goto out1;
+
+ err = rtnl_link_register(&macvtap_link_ops);
+ if (err)
+ goto out2;
+
+ return 0;
+
+out2:
+ unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+ return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+ rtnl_link_unregister(&macvtap_link_ops);
+ unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd at arndb.de>");
+MODULE_LICENSE("GPL");
--
1.6.0.4
Paul Congdon (UC Davis)
2009-08-07 19:10:07 UTC
Responding to Daniel's questions...
Post by Daniel Robbins
I have some general questions about the intended use and benefits of
In which virtual machine setups and technologies do you foresee this
interface being used?
The benefit of VEPA is the coordination and unification with the external network switch. So, in environments where you need or want your feature-rich, wire-speed external network device (firewall/switch/IPS/content-filter) to provide consistent policy enforcement, and you want your VMs' traffic to be subject to that enforcement, you will want their traffic directed externally. Perhaps you have some VMs that are on a DMZ, or clustering an application, or implementing a multi-tier application where you would normally place a firewall between the tiers.
Post by Daniel Robbins
Is this new interface to be used within a virtual machine or
container, on the master node, or both?
It is really an interface to a new type of virtual switch. When you create a virtual network, I would imagine it being a new mode of operation (bridge, NAT, VEPA, etc.).
Post by Daniel Robbins
What interface(s) would need to be configured for a single virtual
machine to use VEPA to access the network?
It would be the same as if that machine were configured to use a bridge to access the network, but the bridge mode would be different.
Post by Daniel Robbins
What are the current flexibility, security or performance limitations
of tun/tap and bridge that make this new interface necessary or
beneficial?
If you have VMs that will be communicating with one another on the same physical machine, and you want their traffic to be exposed to an in-line network device such as an application firewall/IPS/content-filter, then without this feature you will have to have that device co-located within the same physical server. This will use up CPU cycles that you presumably purchased to run applications, it will require a lot of consistent configuration on all physical machines, and it could potentially involve a lot of software licensing, additional cost, etc. Everything would need to be replicated on each physical machine. With the VEPA capability, you can leverage all this functionality in an external network device and have it managed and configured in one place. The external implementation is likely a higher-performance, silicon-based implementation. It should also make it easier to migrate machines from one physical server to another while maintaining the same network policy enforcement.
Post by Daniel Robbins
Is this new interface useful at all for VPN solutions or is it
*specifically* targeted for connecting virtual machines to the
network?
I'm not sure I see the benefit for VPN solutions, but I'd have to understand the deployment scenario better. Certainly this targets connecting VMs to the adjacent physical LAN.
Post by Daniel Robbins
Is this essentially a bridge with layer-2 isolation for the virtual
machine interfaces built-in? If isolation is provided, what mechanism
is used to accomplish this, and how secure is it?
That might be an oversimplification, but you can achieve layer-2 isolation if you connect to a standard external switch. If that switch has 'hairpin' forwarding, then the VMs can talk at L2, but their traffic is forced through the bridge. If that bridge is a security device (e.g. a firewall), then their traffic is exposed to that.

The isolation in the outbound direction is created by the way frames are forwarded. They are simply dropped on the wire, so no VMs can talk directly to one another without their traffic first going external. In the inbound direction, the isolation is created using the forwarding table.
Post by Daniel Robbins
Does VEPA look like a regular ethernet interface (eth0) on the virtual
machine side?
Yes
Post by Daniel Robbins
Are there any associated user-space tools required for configuring a
VEPA?
The standard brctl utility has been augmented to enable/disable the capability.
Post by Daniel Robbins
Do you have any HOWTO-style documentation that would demonstrate how
this interface would be used in production? Or a FAQ?
None yet.
Post by Daniel Robbins
This seems like a very interesting effort but I don't quite have a
good grasp of VEPA's benefits and limitations -- I imagine that others
are in the same boat too.
There are some seminar slides available on the IEEE 802.1 website and elsewhere. The patch had a reference to one seminar, but here is another you might find helpful:

http://www.internet2.edu/presentations/jt2009jul/20090719-congdon.pdf

I'm happy to try to explain further...

Paul
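
The 'hairpin' forwarding Paul describes boils down to relaxing the rule
that a bridge never reflects a frame back out the port it arrived on. A
sketch modelled on the bridge VEPA patches referenced earlier (the
hairpin_mode field name is an assumption taken from those patches):

/*
 * A frame may go back out its ingress port only if that port is in
 * hairpin mode; this is what lets an external switch bounce
 * inter-VM traffic back to the same host.
 */
static inline int should_deliver(const struct net_bridge_port *p,
				 const struct sk_buff *skb)
{
	return (skb->dev != p->dev || p->hairpin_mode) &&
	       p->state == BR_STATE_FORWARDING;
}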
Stephen Hemminger
2009-08-07 19:35:54 UTC
On Fri, 7 Aug 2009 12:10:07 -0700
Post by Paul Congdon (UC Davis)
Responding to Daniel's questions...
Post by Daniel Robbins
I have some general questions about the intended use and benefits of
In which virtual machine setups and technologies do you foresee this
interface being used?
The benefit of VEPA is the coordination and unification with the external network switch. So, in environments where you need or want your feature-rich, wire-speed external network device (firewall/switch/IPS/content-filter) to provide consistent policy enforcement, and you want your VMs' traffic to be subject to that enforcement, you will want their traffic directed externally. Perhaps you have some VMs that are on a DMZ, or clustering an application, or implementing a multi-tier application where you would normally place a firewall between the tiers.
I do have to raise the point that Linux is perfectly capable of keeping up without
the need for an external switch. Whether you want policy external or internal is
an architecture decision that should not be driven by misinformation about performance.
Fischer, Anna
2009-08-07 19:44:05 UTC
Subject: Re: [Bridge] [PATCH] macvlan: add tap device backend
On Fri, 7 Aug 2009 12:10:07 -0700
Post by Paul Congdon (UC Davis)
Responding to Daniel's questions...
Post by Daniel Robbins
I have some general questions about the intended use and benefits of
Post by Paul Congdon (UC Davis)
Post by Daniel Robbins
In which virtual machine setups and technologies do you foresee this
interface being used?
The benefit of VEPA is the coordination and unification with the
external network switch. So, in environments where you need or want
your feature-rich, wire-speed external network device
(firewall/switch/IPS/content-filter) to provide consistent policy
enforcement, and you want your VMs' traffic to be subject to that
enforcement, you will want their traffic directed externally. Perhaps
you have some VMs that are on a DMZ, or clustering an application, or
implementing a multi-tier application where you would normally place a
firewall between the tiers.
I do have to raise the point that Linux is perfectly capable of keeping up without
the need for an external switch. Whether you want policy external or internal is
an architecture decision that should not be driven by misinformation about performance.
VEPA is not only about enabling faster packet processing (firewall/switch/IPS/content-filter, etc.) by doing it on the external switch.

Due to the rather low performance of software-based I/O virtualization approaches, a lot of effort has recently been going into hardware-based implementations of virtual network interfaces, such as those that SR-IOV NICs provide. Without VEPA, such a NIC would have to implement sophisticated virtual switching capabilities. VEPA, however, is very simple and therefore perfectly suited to a hardware-based implementation. So in the future, it will give you direct-I/O-like performance and all the capabilities your adjacent switch provides.

Anna
david
2009-08-07 20:17:31 UTC
Subject: RE: [Bridge] [PATCH] macvlan: add tap device backend
Subject: Re: [Bridge] [PATCH] macvlan: add tap device backend
On Fri, 7 Aug 2009 12:10:07 -0700
Post by Paul Congdon (UC Davis)
Responding to Daniel's questions...
Post by Daniel Robbins
I have some general questions about the intended use and benefits of
Post by Paul Congdon (UC Davis)
Post by Daniel Robbins
In which virtual machine setups and technologies do you foresee this
interface being used?
The benefit of VEPA is the coordination and unification with the
external network switch. So, in environments where you need or want
your feature-rich, wire-speed external network device
(firewall/switch/IPS/content-filter) to provide consistent policy
enforcement, and you want your VMs' traffic to be subject to that
enforcement, you will want their traffic directed externally. Perhaps
you have some VMs that are on a DMZ, or clustering an application, or
implementing a multi-tier application where you would normally place a
firewall between the tiers.
I do have to raise the point that Linux is perfectly capable of keeping up without
the need for an external switch. Whether you want policy external or internal is
an architecture decision that should not be driven by misinformation about performance.
VEPA is not only about enabling faster packet processing (firewall/switch/IPS/content-filter, etc.) by doing it on the external switch.
Due to the rather low performance of software-based I/O virtualization approaches, a lot of effort has recently been going into hardware-based implementations of virtual network interfaces, such as those that SR-IOV NICs provide. Without VEPA, such a NIC would have to implement sophisticated virtual switching capabilities. VEPA, however, is very simple and therefore perfectly suited to a hardware-based implementation. So in the future, it will give you direct-I/O-like performance and all the capabilities your adjacent switch provides.
the performance overhead isn't from switching the packets, it's from
running the firewall/IDS/etc software on the same system.

with VEPA the communications from one VM to another VM running on the same
host will be forced to go out the interface to the datacenter switching
fabric. The overall performance of the network link will be slightly
lower, but it allows other devices to be inserted into the path.

this is something that I would want available if I were to start using VMs
for things. I don't want to have to duplicate my IDS/firewalling functions
within each host system as well as having them as part of the switching
fabric.

David Lang
Paul Congdon (UC Davis)
2009-08-07 19:47:38 UTC
Post by Stephen Hemminger
I do have to raise the point that Linux is perfectly capable of keeping up without
the need of an external switch. Whether you want policy external or internal is
a architecture decision that should not be driven by mis-information about performance.
No argument here. I agree that you can do a lot in Linux. It is, as you
say, an architecture decision that can be enabled with this additional mode
of operation. Without a mode that forces things external, however, you
would always need to put this function internally or play games with
overlapping VLANs to get traffic to forward the way you want it.

Paul
Arnd Bergmann
2009-08-07 21:38:57 UTC
Post by Stephen Hemminger
On Fri, 7 Aug 2009 12:10:07 -0700
Post by Paul Congdon (UC Davis)
Responding to Daniel's questions...
Post by Daniel Robbins
I have some general questions about the intended use and benefits of
In which virtual machine setups and technologies do you foresee this
interface being used?
The benefit of VEPA is the coordination and unification with the
external network switch. So, in environments where you need or want
your feature-rich, wire-speed external network
device (firewall/switch/IPS/content-filter) to provide consistent
policy enforcement, and you want your VMs' traffic to be subject to
that enforcement, you will want their traffic directed externally.
Perhaps you have some VMs that are on a DMZ, or clustering an
application, or implementing a multi-tier application where you
would normally place a firewall between the tiers.
I do have to raise the point that Linux is perfectly capable of keeping up without
the need for an external switch. Whether you want policy external or internal is
an architecture decision that should not be driven by misinformation about performance.
In general, I agree that Linux on a decent virtual machine host will be
able to handle forwarding of network data fast enough, often faster than
the external connectivity allows if it needs to transmit every frame twice.

However, there is a tradeoff between CPU cycles and I/O bandwidth. If your
application needs lots of CPU but you have spare capacity on the PCI bus, the
network wire and the external switch, VEPA can also be a win on the performance
side. As always, performance depends on the application, even if it's not the
main driving factor here.

Arnd <><
Arnd Bergmann
2009-08-07 22:05:32 UTC
Post by Paul Congdon (UC Davis)
Responding to Daniel's questions...
Thanks for the detailed responses. I'll add some more about the
specifics of the macvlan implementation that differ from the
bridge based VEPA implementation.
Post by Paul Congdon (UC Davis)
Post by Daniel Robbins
Is this new interface to be used within a virtual machine or
container, on the master node, or both?
It is really an interface to a new type of virtual switch. When
you create a virtual network, I would imagine it being a new mode
of operation (bridge, NAT, VEPA, etc.).
I think the question was whether the patch needs to applied in the
host or the guest. Both the implementation that you and Anna did
and the one that I posted only apply to the *host* (master node),
the virtual machine does not need to know about it.
Post by Paul Congdon (UC Davis)
Post by Daniel Robbins
What interface(s) would need to be configured for a single virtual
machine to use VEPA to access the network?
It would be the same as if that machine were configured to use a
bridge to access the network, but the bridge mode would be different.
Right, with the bridge based VEPA, you would set up a kvm guest
or a container with the regular tools, then use the sysfs interface
to put the bridge device into VEPA mode.

With the macvlan based mode, you use 'ip link' to add a new tap
device to an external network interface and not use a bridge at
all. Then you configure KVM to use that tap device instead of the
regular bridge/tap setup.
Post by Paul Congdon (UC Davis)
Post by Daniel Robbins
What are the current flexibility, security or performance limitations
of tun/tap and bridge that make this new interface necessary or
beneficial?
If you have VMs that will be communicating with one another on
the same physical machine, and you want their traffic to be
exposed to an in-line network device such as an application
firewall/IPS/content filter, then (without this feature) you will
have to have that device co-located within the same physical server.
This will use up CPU cycles that you presumably purchased to run
applications; it will require a lot of consistent configuration
on all physical machines; and it could involve a lot of software
licensing and additional cost. Everything would
need to be replicated on each physical machine. With the VEPA
capability, you can leverage all this functionality in an
external network device and have it managed and configured in
one place. The external implementation is likely a
higher-performance, silicon-based implementation. It should make it
easier to migrate machines from one physical server to another
and maintain the same network policy enforcement.
It's worth noting that, depending on your network connectivity,
performance is likely to go down significantly with VEPA compared to
the existing bridge/tap setup, because every frame has to cross an
external wire of limited capacity twice: inter-guest traffic can never
exceed what the uplink carries, while an internal bridge is limited
only by host memory bandwidth. So you may lose inter-guest bandwidth
and get more latency in many cases, while you free up CPU cycles.
With the bridge-based VEPA, you might not even gain many cycles
because much of the overhead is still there. On the cost side,
external switches can also get quite expensive compared to x86 servers.

IMHO the real win of VEPA is on the management side, where you can
use a single set of tools for managing the network, rather than
having your network admins deal with both the external switches
and the setup of Linux netfilter rules etc.

The macvlan-based VEPA has the same features as the bridge-based
VEPA, but much simpler code, which allows a number of shortcuts
that save CPU cycles.
Post by Paul Congdon (UC Davis)
The isolation in the outbound direction is created by the way frames
are forwarded. They are simply dropped on the wire, so no VMs can
talk directly to one another without their traffic first going
external. In the inbound direction, the isolation is created using
the forwarding table.
Right. Note that in the macvlan case, the filtering on inbound data is an
inherent part of the macvlan setup; it does not use the dynamic forwarding
table of the bridge driver.
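To make that concrete, here is a much simplified sketch of the inbound
lookup; the types and names are illustrative, not the actual macvlan code:

  #include <string.h>    /* memcmp; the kernel uses its own headers */

  /* One entry per macvlan instance on a lower device. */
  struct macvlan_dev {
          unsigned char mac[6];           /* address of this instance */
          struct macvlan_dev *next;       /* hash-bucket chain */
  };

  struct macvlan_port {                   /* one per lower device */
          struct macvlan_dev *hash[256];  /* indexed by low MAC byte */
  };

  /*
   * A frame coming in from the wire is delivered only on an exact
   * destination MAC match.  The table is filled when macvlan devices
   * are created, never learned from traffic the way a bridge FDB is,
   * and unknown destinations are dropped rather than flooded -- that
   * is the inbound isolation.
   */
  static struct macvlan_dev *lookup_dest(struct macvlan_port *port,
                                         const unsigned char *addr)
  {
          struct macvlan_dev *vlan;

          for (vlan = port->hash[addr[5]]; vlan; vlan = vlan->next)
                  if (memcmp(vlan->mac, addr, 6) == 0)
                          return vlan;
          return NULL;    /* no match: drop */
  }
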
Post by Paul Congdon (UC Davis)
Post by Daniel Robbins
Are there any associated user-space tools required for configuring a
VEPA?
The standard brctl utility has been augmented to enable/disable the capability.
That is for the bridge-based VEPA, while my patch uses the 'ip link'
command that ships with most distros. It does not need any modifications
right now, but might need them if we add other features like support for
multiple MAC addresses in a single guest.

Arnd <><
Fischer, Anna
2009-08-10 12:40:27 UTC
Permalink
Subject: Re: [Bridge] [PATCH] macvlan: add tap device backend
Post by Arnd Bergmann
It's worth noting that, depending on your network connectivity,
performance is likely to go down significantly with VEPA compared to
the existing bridge/tap setup, because every frame has to cross an
external wire of limited capacity twice. So you may lose inter-guest
bandwidth and get more latency in many cases, while you free up CPU
cycles. With the bridge-based VEPA, you might not even gain many
cycles because much of the overhead is still there. On the cost side,
external switches can also get quite expensive compared to x86 servers.
IMHO the real win of VEPA is on the management side, where you can
use a single set of tools for managing the network, rather than
having your network admins deal with both the external switches
and the setup of Linux netfilter rules etc.
The macvlan-based VEPA has the same features as the bridge-based
VEPA, but much simpler code, which allows a number of shortcuts
that save CPU cycles.
I am not yet convinced that the macvlan-based VEPA would be significantly
better from a performance point of view. Really, once you have
implemented all the missing bits and pieces to make the macvlan
driver a VEPA-compatible device, the code path for packet processing
will be very similar. Also, I think you have to keep in mind that,
ultimately, if a user is seriously concerned about high performance,
they would go for a hardware-based solution, e.g. an SR-IOV NIC
with VEPA capabilities. Once you have made the decision for a
software-based approach, tiny performance differences should not have
such a big impact, and so I don't think they should have much
influence on the design decision of where VEPA capabilities are
placed in the kernel.

If you compare macvtap with traditional QEMU networking interfaces that
are typically used in current bridged setups, then yes, performance will
be different. However, I think that this is not necessarily a fair
comparison; the performance difference does not come from the
bridge being slow, but simply from the fact that you have implemented
a better way to connect a virtual interface to a backend device that
can be assigned to a VM. There is no reason why you could not do this
for a bridge port as well.

Anna
Arnd Bergmann
2009-08-10 19:04:54 UTC
Permalink
Post by Fischer, Anna
If you compare macvtap with traditional QEMU networking interfaces that
are typically used in current bridged setups, then yes, performance will
be different. However, I think that this is not necessarily a fair
comparison; the performance difference does not come from the
bridge being slow, but simply from the fact that you have implemented
a better way to connect a virtual interface to a backend device that
can be assigned to a VM. There is no reason why you could not do this
for a bridge port as well.
It's not necessarily the bridge itself being slow (though some people
claim it is), but more that the bridge prevents optimizations or makes
them hard.

You already mentioned hardware filtering by unicast and multicast
MAC addresses, which macvlan already does (for unicast) but which would
be relatively complex with a bridge due to the way it does MAC address
learning.
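For unicast this is straightforward, because each macvlan device has
exactly one fixed MAC address that can be registered with the lower
device when it comes up. A rough sketch of the idea; the wrapper
function is made up, and the exact name of the kernel helper for
adding a secondary unicast address has varied between releases
(dev_unicast_add()/dev_uc_add()):

  #include <linux/netdevice.h>

  /*
   * Register a macvlan instance's address as an additional unicast
   * address of the underlying NIC.  A driver with spare perfect-filter
   * slots programs it into hardware; otherwise the core falls back to
   * promiscuous mode.  A learning bridge has no such fixed, small set
   * of addresses to hand to the hardware up front.
   */
  static int vepa_register_guest_mac(struct net_device *lowerdev,
                                     const unsigned char *guest_mac)
  {
          return dev_uc_add(lowerdev, guest_mac);
  }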

If we want to do zero-copy receives, the hardware will, on top of
this, have to choose the receive buffer based on the MAC address,
with the buffer provided by the guest. I think this is doable with
macvlan, though not easy, while I have no idea where you would even
start with the bridge code.

Arnd <><
Michael S. Tsirkin
2009-08-10 19:32:27 UTC
Permalink
Post by Arnd Bergmann
If we want to do zero-copy receives, the hardware will, on top of
this, have to choose the receive buffer based on the MAC address,
with the buffer provided by the guest. I think this is doable with
macvlan, though not easy, while I have no idea where you would even
start with the bridge code.
Arnd <><
The same applies to zero-copy sends. You need to know when the
buffers have been consumed in order to notify userspace, and this is
very hard with a generic bridge in the middle.
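To illustrate the shape of the problem: a hypothetical sketch of how a
completion notification could hang off the skb (struct zc_ctx, the
wrapper, and stashing the cookie in skb_shinfo()->destructor_arg are
all assumptions here, not anything in this patch):

  #include <linux/skbuff.h>

  /* Hypothetical completion cookie travelling with the skb. */
  struct zc_ctx {
          void (*complete)(struct zc_ctx *ctx); /* e.g. signal an eventfd */
  };

  /*
   * Runs when the skb is freed, i.e. once the NIC has finished DMA
   * from the guest's pinned pages and userspace may reuse them.  With
   * macvtap the skb goes straight to the real device, so this fires
   * at one well-defined point; behind a bridge the skb may be cloned,
   * queued or flooded, and tying the final free back to a single
   * userspace buffer is much harder.
   */
  static void zc_skb_destructor(struct sk_buff *skb)
  {
          struct zc_ctx *ctx = skb_shinfo(skb)->destructor_arg;

          if (ctx)
                  ctx->complete(ctx);
  }

A real implementation would also have to deal with clones and with
drivers that hold on to the pages longer than the skb itself.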
--
MST