Discussion:
[Bridge] [RFC PATCH v2] bridge: make it possible for packets to traverse the bridge without hitting netfilter
Florian Westphal
2015-02-23 16:06:19 UTC
The netfilter code is written with flexibility rather than performance in mind.
So when all we want is to pass packets between different interfaces, the
performance penalty of hitting the netfilter code can be considerable, even when
all firewalling is disabled for the bridge.
This change makes it possible to disable netfilter on a per-bridge basis.
In the case we are interested in, this can lead to a speedup of more than 15%
compared to the case when only bridge-iptables is disabled.
I wonder what the speed difference is between no rules (i.e., we hit the jump label
in NF_HOOK), a single (ebtables) accept-all rule, and this patch, for
the call_nf==false case.
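(For context: the "jump label" here is the static-key check that lets NF_HOOK
bypass the netfilter machinery entirely when no hooks are registered. The
sketch below only illustrates that fast-path idea as a standalone program;
all names in it are invented and it is not the actual kernel code.)

/*
 * Illustrative, self-contained sketch of the NF_HOOK fast-path idea:
 * when no hooks are registered for a given protocol family and hook
 * number, a cheap check (a static key / jump label in the kernel, a
 * plain counter here) lets packets skip the netfilter machinery
 * entirely.  All names are invented for this example.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPROTO 4
#define NHOOKS 5

/* Stand-in for the kernel's per-family/per-hook static keys. */
static int hooks_registered[NPROTO][NHOOKS];

static bool hooks_active(int pf, int hook)
{
        return hooks_registered[pf][hook] > 0;
}

/* Stand-in for the slow path that walks the registered hooks/rules. */
static int run_hooks(int pf, int hook, const char *pkt)
{
        printf("running hooks for pf=%d hook=%d on %s\n", pf, hook, pkt);
        return 1; /* accept */
}

/* Stand-in for NF_HOOK(): enter the slow path only if hooks exist. */
static int example_nf_hook(int pf, int hook, const char *pkt,
                           int (*okfn)(const char *))
{
        if (hooks_active(pf, hook) && run_hooks(pf, hook, pkt) != 1)
                return 0;       /* dropped by a hook */
        return okfn(pkt);       /* fast path: straight to the output function */
}

static int deliver(const char *pkt)
{
        printf("delivering %s\n", pkt);
        return 0;
}

int main(void)
{
        example_nf_hook(0, 1, "packet A", deliver); /* no hooks: fast path */
        hooks_registered[0][1]++;                   /* e.g. a rule gets loaded */
        example_nf_hook(0, 1, "packet B", deliver); /* slow path taken */
        return 0;
}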

I guess your 15% speedup figure is coming from ebtables' O(n) rule
evaluation overhead? If yes, how many rules are we talking about?

If that's true, then the 'better' (I know, it won't help you) solution
would be to use nftables bridgeport-based verdict maps...

If that's still too much overhead, then we clearly need to do *something*...

Thanks,
Florian
David Miller
2015-02-26 16:34:31 UTC
From: Imre Palik <***@amazon.de>
Date: Thu, 26 Feb 2015 11:19:25 +0100
I am on 4k pages, and perf is not working :-(
(I am trying to fix those too, but that is far from being low-hanging fruit.)
So my guess would be that the packet pipeline doesn't fit in the cache/TLB.
Pure speculation until you can actually use perf to measure these
things.

And I don't want to apply patches which were designed based upon
pure speculation.
Pablo Neira Ayuso
2015-03-06 14:29:32 UTC
Post by David Miller
Date: Thu, 26 Feb 2015 11:19:25 +0100
I am on 4k pages, and perf is not working :-(
(I am trying to fix those too, but that is far from being low-hanging fruit.)
So my guess would be that the packet pipeline doesn't fit in the cache/TLB.
Pure speculation until you can actually use perf to measure these
things.
And I don't want to apply patches which were designed based upon
pure speculation.
I removed the pieces of the packet pipeline that I don't necessarily
need, one by one, then measured their effect on small-packet
performance.
This was the only part that produced a considerable effect.
The pure speculation was about why the effect is a more than 15%
increase in packet throughput, even though the avoided code path
contains far less than 15% of the packet pipeline's code. It seems
Felix Fietkau profiled similar changes and found my guess well
founded.
Now could anybody explain to me what else is wrong with my patch?
We have to come up with a more generic solution for this.

These sysfs tweaks you're proposing look to me like an obscure way to
tune this.
Florian Westphal
2015-03-06 16:37:00 UTC
Post by Pablo Neira Ayuso
Post by David Miller
Date: Thu, 26 Feb 2015 11:19:25 +0100
I am on 4k pages, and perf is not working :-(
(I am trying to fix those too, but that is far from being low-hanging fruit.)
So my guess would be that the packet pipeline doesn't fit in the cache/TLB.
Pure speculation until you can actually use perf to measure these
things.
And I don't want to apply patches which were designed based upon
pure speculation.
I removed the pieces of the packet pipeline that I don't necessarily
need, one by one, then measured their effect on small-packet
performance.
This was the only part that produced a considerable effect.
The pure speculation was about why the effect is a more than 15%
increase in packet throughput, even though the avoided code path
contains far less than 15% of the packet pipeline's code. It seems
Felix Fietkau profiled similar changes and found my guess well
founded.
Now could anybody explain to me what else is wrong with my patch?
We have to come up with a more generic solution for this.
Jiri Benc suggested allowing netfilter hooks to be attached e.g. via a tc
action; maybe that would be an option worth investigating.

Then you could for instance add filtering rules only to the bridge port
that needs it.
Post by Pablo Neira Ayuso
These sysfs tweaks you're proposing look to me like an obscure way to
tune this.
I agree, adding more tunables isn't all that helpful; in the past this
has only helped to prolong the problem.
David Woodhouse
2018-03-09 15:31:15 UTC
Post by Florian Westphal
Post by Pablo Neira Ayuso
 
I removed the pieces of the packet pipeline that I don't necessarily
need, one by one, then measured their effect on small-packet
performance.

This was the only part that produced a considerable effect.

The pure speculation was about why the effect is a more than 15%
increase in packet throughput, even though the avoided code path
contains far less than 15% of the packet pipeline's code. It seems
Felix Fietkau profiled similar changes and found my guess well
founded.

Now could anybody explain to me what else is wrong with my patch?
 
We have to come up with a more generic solution for this.
Jiri Benc suggested allowing netfilter hooks to be attached e.g. via a tc
action; maybe that would be an option worth investigating.
Then you could for instance add filtering rules only to the bridge port
that needs it.
Post by Pablo Neira Ayuso
These sysfs tweaks you're proposing look to me like an obscure way to
tune this.
I agree, adding more tunables isn't all that helpful; in the past this
has only helped to prolong the problem.
How feasible would it be to make it completely dynamic?

A given hook could automatically disable itself (for a given device) if
the result of running it the first time was *tautologically* false for
that device (i.e. regardless of the packet itself, or anything else).

The hook would need to be automatically re-enabled if the rule chain
ever changes (and might subsequently disable itself again).
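(A rough sketch of how such self-disabling hooks might look, written as a
standalone toy program; every name below is hypothetical and nothing here is
existing kernel code. The hook turns itself off once it observes that its
ruleset cannot affect the verdict, and a rule-change notification re-arms it.)

/*
 * Hypothetical sketch of the auto-disable idea described above.
 * A hook stays "enabled" only while its ruleset can actually affect a
 * verdict; a rule-change notification re-enables it so it can
 * re-evaluate on the next packet.
 */
#include <stdbool.h>
#include <stdio.h>

struct ruleset {
        int nrules;          /* 0 means "accept everything" */
};

struct dev_hook {
        const char *dev;
        bool enabled;        /* cleared when the ruleset is a tautological accept */
        struct ruleset *rules;
};

static int hook_run(struct dev_hook *h, const char *pkt)
{
        if (!h->enabled)
                return 1;                       /* accept, zero rule-walk cost */

        if (h->rules->nrules == 0) {
                /* Result does not depend on the packet: disable ourselves. */
                h->enabled = false;
                printf("%s: hook disabled (ruleset is accept-all)\n", h->dev);
                return 1;
        }
        printf("%s: evaluating %d rules for %s\n", h->dev, h->rules->nrules, pkt);
        return 1;
}

/* Would be driven by a rule-change notifier in a real implementation. */
static void ruleset_changed(struct dev_hook *h)
{
        h->enabled = true;
}

int main(void)
{
        struct ruleset rs = { .nrules = 0 };
        struct dev_hook h = { .dev = "br0", .enabled = true, .rules = &rs };

        hook_run(&h, "pkt1");   /* disables itself */
        hook_run(&h, "pkt2");   /* fast path */
        rs.nrules = 1;
        ruleset_changed(&h);    /* rules changed: re-enable */
        hook_run(&h, "pkt3");   /* evaluates rules again */
        return 0;
}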

Is that something that's worth exploring for the general case?

Eschewing a 15% speedup on the basis that "well, even though we've had
three of these already for a decade, we're worried that adding a fourth
might open the floodgates to further patches" does seem a little odd to
me, FWIW.
David Miller
2018-03-09 15:57:24 UTC
From: David Woodhouse <***@infradead.org>
Date: Fri, 09 Mar 2018 15:31:15 +0000
Post by David Woodhouse
Eschewing a 15% speedup on the basis that "well, even though we've had
three of these already for a decade, we're worried that adding a fourth
might open the floodgates to further patches" does seem a little odd to
me, FWIW.
The cost we are dealing with is a fundamental one which is a result of
the hook design.

Indirect calls are killer.

Indirect calls are even more killer now in the age of Spectre and
retpolines.
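(To illustrate the point: the generic hook traversal stores a function
pointer per registered hook and makes one indirect call per entry for every
packet, which is exactly the pattern retpolines make more expensive. A purely
schematic standalone example, not the kernel's actual data structures:)

/* Schematic illustration of the per-hook indirect call; not kernel code. */
#include <stdio.h>

struct hook_entry {
        int (*fn)(const char *pkt);   /* each entry is an indirect call target */
};

static int ebt_hook(const char *pkt) { printf("ebtables: %s\n", pkt); return 1; }
static int ipt_hook(const char *pkt) { printf("iptables: %s\n", pkt); return 1; }

static int traverse(struct hook_entry *hooks, int n, const char *pkt)
{
        for (int i = 0; i < n; i++)
                if (hooks[i].fn(pkt) != 1)   /* one indirect call per hook */
                        return 0;            /* drop */
        return 1;                            /* accept */
}

int main(void)
{
        struct hook_entry hooks[] = { { ebt_hook }, { ipt_hook } };

        return !traverse(hooks, 2, "packet");
}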

I definitely would rather see the fundamental issue addressed rather
than poking at it randomly with knobs for this case and that.

Thank you.
David Woodhouse
2018-03-09 16:15:14 UTC
Post by David Miller
Date: Fri, 09 Mar 2018 15:31:15 +0000
Post by David Woodhouse
Eschewing a 15% speedup on the basis that "well, even though we've had
three of these already for a decade, we're worried that adding a fourth
might open the floodgates to further patches" does seem a little odd to
me, FWIW.
The cost we are dealing with is a fundamental one which is a result of
the hook design.
Indirect calls are killer.
Indirect calls are even more killer now in the age of Spectre and
retpolines.
Imre's 15% measurement was, obviously, before that. We should redo it
and confirm the numbers.
Post by David Miller
I definitely would rather see the fundamental issue addressed rather
than poking at it randomly with knobs for this case and that.
Yeah. What do you think of the suggestion I made — that a given hook
should automatically disable itself if it tautologically does nothing?
Coupled with notifiers for when the rules change, to re-enable the
hooks again? I confess I don't even know how realistic that is.
Florian Westphal
2018-03-09 16:26:08 UTC
Post by David Woodhouse
Post by Florian Westphal
Post by Pablo Neira Ayuso
 
I removed the pieces of the packet pipeline that I don't necessarily
need, one by one, then measured their effect on small-packet
performance.

This was the only part that produced a considerable effect.

The pure speculation was about why the effect is a more than 15%
increase in packet throughput, even though the avoided code path
contains far less than 15% of the packet pipeline's code. It seems
Felix Fietkau profiled similar changes and found my guess well
founded.

Now could anybody explain to me what else is wrong with my patch?
 
We have to come up with a more generic solution for this.
Jiri Benc suggested allowing netfilter hooks to be attached e.g. via a tc
action; maybe that would be an option worth investigating.
Then you could for instance add filtering rules only to the bridge port
that needs it.
Post by Pablo Neira Ayuso
These sysfs tweaks you're proposing look to me like an obscure way to
tune this.
I agree, adding more tunables isn't all that helpful; in the past this
has only helped to prolong the problem.
How feasible would it be to make it completely dynamic?
A given hook could automatically disable itself (for a given device) if
the result of running it the first time was *tautologically* false for
that device (i.e. regardless of the packet itself, or anything else).
The hook would need to be automatically re-enabled if the rule chain
ever changes (and might subsequently disable itself again).
Is that something that's worth exploring for the general case?
AF_BRIDGE hooks sit in the net namespace, so it's enough for
one bridge to request filtering to bring in the hook overhead for all
bridges in the same netns.

Alternatives:
- place the bridges that need filtering in different netns
- use tc ingress for filtering
- use the nftables ingress hook for filtering (it sits in almost the same
location as the tc ingress hook) to attach the ruleset to those bridge
ports that need packet filtering.

(The original request came from a user with tons of bridges where only
a single bridge needed filtering.)

One alternative I see is to place the bridge hooks into the
bridge device (the net_bridge struct, which lives in the netdev private area).

But, as you already mentioned, we would need to annotate the hooks
to figure out which device(s) they are for.

This sounds rather fragile to me, so I would just use nft ingress:

#!/sbin/nft -f

table netdev ingress {
        chain in_public {
                type filter hook ingress device eth0 priority 0;
                ip saddr 192.168.0.1 counter
        }
}

David Miller
2015-03-06 17:49:30 UTC
From: Imre Palik <***@gmail.com>
Date: Fri, 06 Mar 2015 11:34:29 +0100
I removed the pieces of the packet pipeline that I don't necessarily
need, one by one, then measured their effect on small-packet
performance.
This was the only part that produced a considerable effect.
The pure speculation was about why the effect is a more than 15%
increase in packet throughput, even though the avoided code path
contains far less than 15% of the packet pipeline's code. It seems
Felix Fietkau profiled similar changes and found my guess well
founded.
Yes, and that's the part being left out: the "why".

That's part of what I expect in the justification.

Look, we're not doing things this way. We need to find a clean
and generic way to make the netfilter hooks as cheap as possible
when netfilter rules are not in use.
Felix Fietkau
2015-02-26 21:17:42 UTC
Post by Florian Westphal
The netfilter code is written with flexibility rather than performance in mind.
So when all we want is to pass packets between different interfaces, the
performance penalty of hitting the netfilter code can be considerable, even when
all firewalling is disabled for the bridge.
This change makes it possible to disable netfilter on a per-bridge basis.
In the case we are interested in, this can lead to a speedup of more than 15%
compared to the case when only bridge-iptables is disabled.
I wonder what the speed difference is between no rules (i.e., we hit the jump label
in NF_HOOK), a single (ebtables) accept-all rule, and this patch, for
the call_nf==false case.
I guess your 15% speedup figure is coming from ebtables' O(n) rule
evaluation overhead? If yes, how many rules are we talking about?
If that's true, then the 'better' (I know, it won't help you) solution
would be to use nftables bridgeport-based verdict maps...
If that's still too much overhead, then we clearly need to do *something*...
I work with MIPS-based routers that typically only have 32 or 64 KB of
Dcache. I've had quite a bit of 'fun' working on optimizing netfilter on
these systems. I've done a lot of measurements using oprofile (I'm going to
use perf on my next run).

On these devices, even without netfilter compiled in, the data
structures and code are already way too big for the hot path to fit in
the Dcache (not to mention Icache). This problem has typically gotten a
little bit worse with every new kernel release, aside from just a few
exceptions.

This means that in the hot path, any unnecessary memory access to packet
data (especially IP headers), and to some degree also to the extra data
structures for netfilter, ebtables, etc., has a significant and visible
performance impact. The cost of these memory accesses is orders of
magnitude bigger than the pure cycles spent running the actual code.

In OpenWrt, I made similar hacks a long time ago, and on the system I
tested on, the speedup was even bigger than 15%, probably closer to 30%.
By the way, this was also with a completely empty ruleset.

Maybe there's a way to get reasonable performance by optimizing NF_HOOK;
however, I'd like to remind you guys that if we have to fetch some
netfilter/nftables/ebtables data structures and run part of the table
processing code on a system where no rules are present (or ebtables
functionality is otherwise not needed for a particular bridge), then
performance is going to suck, at least on most small-scale embedded
devices.

Based on that, I support the general approach taken by this patch, at
least until somebody has shown that a better approach is feasible.

- Felix