* [Galene] fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 13:46 UTC
To: galene; +Cc: Dave Taht

Dave, Toke,

We're having a large meeting on the lab's Galène server, and it looks like
fq-codel is thrashing: the new_flow_count increases as fast as the packet
counter. How do I fix that?

-- Juliusz
* [Galene] Re: fq-codel thrashing
From: Toke Høiland-Jørgensen @ 2021-01-12 15:55 UTC
To: Juliusz Chroboczek, galene; +Cc: Dave Taht

Juliusz Chroboczek <jch@irif.fr> writes:

> Dave, Toke,
>
> We're having a large meeting on the lab's Galène server, and it looks like
> fq-codel is thrashing: the new_flow_count increases as fast as the packet
> counter. How do I fix that?

That just sounds like wherever you're running FQ-CoDel is not actually
the bottleneck? If there's no backpressure, the queue will drain for each
packet, so that by the time the next packet comes along it'll be
considered "new" again.

If you're running a few video flows on a gigabit link (and thus using a
fraction of the bandwidth), that would be expected. But it only happens
if you're also running a shaper (such as HTB or TBF) on the interface,
since if FQ-CoDel is installed on a physical interface it will set the
TCQ_F_CAN_BYPASS flag, which means that when the queue is completely
empty it will be bypassed entirely...

I just replicated the phenomenon like this:

$ sudo tc qdisc replace dev testns root tbf rate 1gbit latency 10ms burst 1024
$ sudo tc qdisc add dev testns parent 8002: fq_codel
$ tc -s qdisc
qdisc tbf 8002: dev testns root refcnt 2 rate 1Gbit burst 1000b lat 10ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 8003: dev testns parent 8002: limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

$ ping -c 5 fc00:dead:cafe:1::2
PING fc00:dead:cafe:1::2(fc00:dead:cafe:1::2) 56 data bytes
64 bytes from fc00:dead:cafe:1::2: icmp_seq=1 ttl=64 time=0.088 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=3 ttl=64 time=0.102 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=4 ttl=64 time=0.103 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=5 ttl=64 time=0.104 ms

--- fc00:dead:cafe:1::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4041ms
rtt min/avg/max/mdev = 0.088/0.107/0.138/0.016 ms

$ tc -s qdisc
qdisc tbf 8002: dev testns root refcnt 2 rate 1Gbit burst 1000b lat 10ms
 Sent 590 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 8003: dev testns parent 8002: limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 590 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 118 drop_overlimit 0 new_flow_count 5 ecn_mark 0
  new_flows_len 0 old_flows_len 0

Whereas if fq_codel is the root qdisc:

$ sudo tc qdisc replace dev testns root fq_codel
$ tc -s qdisc
qdisc fq_codel 8004: dev testns root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

$ ping -c 5 fc00:dead:cafe:1::2
PING fc00:dead:cafe:1::2(fc00:dead:cafe:1::2) 56 data bytes
64 bytes from fc00:dead:cafe:1::2: icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=3 ttl=64 time=0.098 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=5 ttl=64 time=0.096 ms

--- fc00:dead:cafe:1::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4058ms

$ tc -s qdisc
qdisc fq_codel 8004: dev testns root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 590 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

-Toke
* [Galene] Re: fq-codel thrashing
From: Dave Taht @ 2021-01-12 16:01 UTC
To: Toke Høiland-Jørgensen; +Cc: Juliusz Chroboczek, galene, Dave Taht

On Tue, Jan 12, 2021 at 7:55 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> Juliusz Chroboczek <jch@irif.fr> writes:
>
> > Dave, Toke,
> >
> > We're having a large meeting on the lab's Galène server, and it looks like
> > fq-codel is thrashing: the new_flow_count increases as fast as the packet
> > counter. How do I fix that?

It is a pretty useless statistic; I never liked it. In this case it is
a positive sign that you are not filling the queue, not a negative one,
as Toke explains in more detail below.

If you are having fun with fq_codel, try cake instead. It has a lot more
stats, some of which are far more relevant to your application.

tc qdisc add dev eth0 root cake bandwidth XMbit  # besteffort if you don't want to bother with diffserv
tc -s qdisc show

                Bulk   Best Effort        Voice
  thresh    562496bit        9Mbit     2250Kbit
  target       32.3ms        5.0ms        8.1ms
  interval    127.3ms      100.0ms      103.1ms
  pk_delay      9.2ms        224us        494us
  av_delay      2.9ms         29us         35us
  sp_delay      153us          4us          3us
  backlog          0b           0b           0b
  pkts          74853     35044344       162590
  bytes      80542836   9388672264     27301751
  way_inds       2538      2129502        10508
  way_miss       2575       549864         7374
  way_cols          0            0            0
  drops            11         7987            0
  marks            22            1            0
  ack_drop          0      3616230            0
  sp_flows          1            1            1
  bk_flows          0            1            0
  un_flows          0            0            0
  max_len       15605        25262         3169
  quantum         300          300          300

> That just sounds like wherever you're running FQ-CoDel is not actually
> the bottleneck? If there's no backpressure, the queue will drain for each
> packet, so that by the time the next packet comes along it'll be
> considered "new" again.
>
> If you're running a few video flows on a gigabit link (and thus using a
> fraction of the bandwidth), that would be expected. But it only happens
> if you're also running a shaper (such as HTB or TBF) on the interface,
> since if FQ-CoDel is installed on a physical interface it will set the
> TCQ_F_CAN_BYPASS flag, which means that when the queue is completely
> empty it will be bypassed entirely...
>
> I just replicated the phenomenon like this:
>
> [...]
>
> -Toke

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 17:38 UTC
To: Toke Høiland-Jørgensen; +Cc: galene, Dave Taht

>> We're having a large meeting on the lab's Galène server, and it looks like
>> fq-codel is thrashing: the new_flow_count increases as fast as the packet
>> counter.

> That just sounds like wherever you're running FQ-CoDel is not actually
> the bottleneck?

Right.

> But it only happens if you're also running a shaper (such as HTB or TBF)
> on the interface, since if FQ-CoDel is installed on a physical interface
> it will set the TCQ_F_CAN_BYPASS flag, which means that when the queue
> is completely empty it will be bypassed entirely...

I'm not running a shaper. The interface is reported as

  00:03.0 Ethernet controller: Red Hat, Inc Virtio network device

Ethtool doesn't give anything interesting.

Here's the output of tc -s qdisc show:

qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
 Sent 1270306213261 bytes 3032714775 pkt (dropped 2839148, overlimits 0 requeues 10623)
 backlog 0b 0p requeues 10623
  maxpacket 68130 drop_overlimit 1576971 new_flow_count 732081615 ecn_mark 22
  new_flows_len 0 old_flows_len 0
* [Galene] Re: fq-codel thrashing
From: Dave Taht @ 2021-01-12 17:42 UTC
To: Juliusz Chroboczek; +Cc: galene, Dave Taht

On Tue, Jan 12, 2021 at 9:38 AM Juliusz Chroboczek <jch@irif.fr> wrote:
>
> [...]
>
> I'm not running a shaper. The interface is reported as
>
>   00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
>
> Ethtool doesn't give anything interesting.
>
> Here's the output of tc -s qdisc show:
>
> qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
>  Sent 1270306213261 bytes 3032714775 pkt (dropped 2839148, overlimits 0 requeues 10623)
>  backlog 0b 0p requeues 10623
>   maxpacket 68130 drop_overlimit 1576971 new_flow_count 732081615 ecn_mark 22
>   new_flows_len 0 old_flows_len 0

Nice to see drops and marks.

If you have spare CPU, you can also run cake natively with no shaper, and
strip GRO (maxpacket ~64k indicates GRO is on). On the other hand, GRO
saves on context switches in VMs.

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
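For reference, the GRO/GSO settings mentioned above can be inspected and
changed with ethtool. This is only a sketch: "eth0" is a placeholder for
the actual virtio interface name, and not every driver allows every
offload to be toggled.

  # Show whether generic receive/segmentation offload is currently enabled
  $ ethtool -k eth0 | grep -E 'generic-(segmentation|receive)-offload'

  # Turn both off, so the qdisc sees individual packets rather than ~64k aggregates
  $ sudo ethtool -K eth0 gro off gso off

Turning GRO off trades the per-packet CPU savings mentioned above for
finer-grained queueing, so it is worth measuring both ways.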
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 18:10 UTC
To: Dave Taht; +Cc: galene, Dave Taht

> If you have spare CPU, you can also run cake natively with no shaper, and
> strip GRO (maxpacket ~64k indicates GRO is on).

GRO will only trigger for HTTP, WebSocket and TURN traffic, I believe;
most of Galène's traffic is RTP over UDP, with packets under 1200 bytes.

We just had a meeting with 70 people, and at around 40 cameras switched on,
Galène became unusable — there were too many voice drops, which indicates
two issues:

* I need to think of a better way of prioritising voice over video when
  under load;

* there are fairness issues — some clients were receiving okay-ish
  audio, others were not.

Galène recovered after some people switched their cameras off, I didn't
need to restart anything. At the highest point, Galène was at 270% CPU,
and the TURN server was using another 50%. That's on a four-core VM.

> On the other hand, GRO saves on context switches in VMs.

$ PRODUCT INSTALL GALENE /CONFIGURATION=DEFAULT /LOG

-- Juliusz
* [Galene] Re: fq-codel thrashing
From: Dave Taht @ 2021-01-12 19:05 UTC
To: Juliusz Chroboczek; +Cc: galene, Dave Taht

On Tue, Jan 12, 2021 at 10:10 AM Juliusz Chroboczek <jch@irif.fr> wrote:
>
> > If you have spare CPU, you can also run cake natively with no shaper, and
> > strip GRO (maxpacket ~64k indicates GRO is on).
>
> GRO will only trigger for HTTP, WebSocket and TURN traffic, I believe;
> most of Galène's traffic is RTP over UDP, with packets under 1200 bytes.
>
> We just had a meeting with 70 people, and at around 40 cameras switched on,
> Galène became unusable — there were too many voice drops, which indicates
> two issues:

I have generally found that VMs perform badly for lots of small packets
and r/t. Please try cake. And collect a capture on the underlying hw if
possible.

> * I need to think of a better way of prioritising voice over video when
>   under load;

I take it (I really didn't understand) that unbundling these two types
is not currently feasible in the javascript or web browser? Still, I am
pretty interested in this layer of stuff, but since crypto was added to
RTP it's got really hard to look at it.

> * there are fairness issues — some clients were receiving okay-ish
>   audio, others were not.

Collisions in fq_codel start to occur at around sqrt(1024) = 32 flows, so
at 70 users the odds are that 2 or 3 of your sessions were colliding.
Cake uses an 8-way set-associative hash: 0 collisions at this load.

I imagine you were not tracking the actual backlog in fq_codel during
this conference?

(while :; do tc -s qdisc show dev bla >> whatever; sleep 1; done)

All that said, I tend to point fingers at loss at the virtio layer and
the underlying hw first, just because I'm defensive about people blaming
fq_codel for anything. :) I also hate VMs for r/t traffic. You had a lot
of GRO traffic, it looked like. You can disable gso/gro via ethtool and
stick with fq_codel, or use cake with the...

> Galène recovered after some people switched their cameras off, I didn't
> need to restart anything. At the highest point, Galène was at 270% CPU,
> and the TURN server was using another 50%. That's on a four-core VM.

But it does sound like more virtual cores would help at this load? Are
you in a position to profile? You are most likely also context switching
like crazy.

> > On the other hand, GRO saves on context switches in VMs.
>
> $ PRODUCT INSTALL GALENE /CONFIGURATION=DEFAULT /LOG
>
> -- Juliusz

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
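As a concrete illustration of the two suggestions above, something along
these lines would do. It is only a sketch, assuming the interface is called
eth0 and that the sch_cake module is available on the server's kernel:

  # cake as the root qdisc with no shaper (omitting "bandwidth" leaves the
  # link itself to provide backpressure); "besteffort" skips the diffserv tins
  $ sudo tc qdisc replace dev eth0 root cake besteffort

  # Sample the qdisc statistics (including backlog) once a second during a meeting
  $ while :; do date >> qdisc-stats.log; tc -s qdisc show dev eth0 >> qdisc-stats.log; sleep 1; done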
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-12 19:52 UTC
To: galene

On 1/12/21 8:05 PM, Dave Taht wrote:
> I have generally found that VMs perform badly for lots of small packets
> and r/t.

Being a rookie with all this, I can only say that I'm quite surprised by
what my Galène/libvirt/qemu-kvm setup manages on really slow fan-less
hardware.

Maybe this also depends on the kernel, qemu and libvirt versions? I
vaguely remember that there was work done on improving virtio. Running
openSUSE Tumbleweed, all the relevant software in my setup is pretty new.

Ciao, Michael.

P.S.: I've extracted some screenshots from grafana I could send in a
personal e-mail.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 21:02 UTC
To: Dave Taht; +Cc: galene

> Please try cake.

Both my servers are used in production right now, so I'm not very keen on
experimenting. I should have some funding for Galène soon.

> And collect a capture on the underlying hw if possible.

I don't have access to the underlying hardware; it's a cloud provider.

>> * I need to think of a better way of prioritising voice over video when
>>   under load;

> I take it (I really didn't understand) that unbundling these two types
> is not currently feasible in the javascript or web browser?

[...]

> Collisions in fq_codel start to occur at around sqrt(1024) = 32 flows, so
> at 70 users the odds are that 2 or 3 of your sessions were colliding.

I don't think the network is the problem. The problem is that there's
a lot of buffering in Pion, and as I push packets down the layers, I have
no way of finding out the amount already queued -- so when we're short on
CPU, the video tends to crowd out the audio. I'm not sure what the
solution is.

>> Galène recovered after some people switched their cameras off, I didn't
>> need to restart anything. At the highest point, Galène was at 270% CPU,
>> and the TURN server was using another 50%. That's on a four-core VM.

> But it does sound like more virtual cores would help at this load?

According to profiling, 60% of CPU time is spent in the write system call.

> You are most likely also context switching like crazy.

I'll measure next time, but I doubt it: Go's scheduler operates almost
entirely in userspace.

-- Juliusz
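For the measurements being discussed here, something like the following
would be a starting point. This is a sketch only: it assumes the server
process is called "galene", that the sysstat and perf tools are installed,
and (for the last line) that the Go process exposes a net/http/pprof
endpoint on port 6060, which may well not be the case.

  # Voluntary/involuntary context switches per second for the running process
  $ pidstat -w -p "$(pidof galene)" 1

  # Context switches and write() calls over a ten-second window
  $ sudo perf stat -e context-switches,syscalls:sys_enter_write -p "$(pidof galene)" -- sleep 10

  # 30-second CPU profile, if a pprof endpoint is enabled
  $ go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"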
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-12 19:29 UTC
To: galene

On 1/12/21 7:10 PM, Juliusz Chroboczek wrote:
> We just had a meeting with 70 people, and at around 40 cameras switched on,

Which send quality were they all using? "normal"?

Frankly, I understand only ~5% of what you're talking about in this
thread. But do you really expect all users' end devices to decode 40
video streams? Not to speak of all the crappy Internet connections,
already overloaded while other family members are watching Netflix, or
similar situations in shared flats.

As said before, I'm running Galène in a VM on insanely slow hardware.
But even with this setup and the send quality set to "lowest", we managed
to overwhelm some older iPad devices or older laptops with just 7 video
streams. In my local tests with slow and ancient 10+ year old laptops
I even get video drop-outs within my LAN with only 3 video streams.

> Galène became unusable — there were too many voice drops, which indicates
> two issues:
>
> * I need to think of a better way of prioritising voice over video when
>   under load;

We had one hearing impaired user who hears a little bit with in-ear
devices. Normally the user also follows spoken text by lip-reading to
get more context. But this is nearly impossible for her in a video
session because audio and video are not sufficiently synchronised with
our setup.

It would probably be helpful if the prioritisation of voice over video
could be changed per user, to at least have a chance for such a special
case.

> * there are fairness issues — some clients were receiving okay-ish
>   audio, others were not.

Not sure whether that's really a fairness issue within Galène. I can see
differing latencies in /stats for different connections. The connection
with higher latency, most times on all "Down" streams, has the higher
latency consistently throughout the whole session. I suspect the
receiver side is the issue.

> Galène recovered after some people switched their cameras off, I didn't
> need to restart anything.

I can confirm that Galène behaves pretty predictably if streams are
turned on or off.

> At the highest point, Galène was at 270% CPU,
> and the TURN server was using another 50%. That's on a four-core VM.

I'm far away from such a setup. So I wonder whether my response is
useful at all.

Ciao, Michael.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 21:22 UTC
To: Michael Ströder; +Cc: galene

>> We just had a meeting with 70 people, and at around 40 cameras switched on,

> Which send quality were they all using? "normal"?

The quality selected in the menu is the maximum allowable quality; Galène
will rather eagerly drop down beneath it, all the way down to 200kbit/s
(it will drop quality even more aggressively in the future). So we were
running at "normal", but the resulting bitrate was somewhere between "low"
and "lowest".

> Frankly, I understand only ~5% of what you're talking about in this
> thread.

Yeah, that's also our case. This doesn't prevent us from speaking,
though ;-)

> But do you really expect all users' end devices to decode 40 video
> streams? [...] In my local tests with slow and ancient 10+ year old
> laptops I even get video drop-outs within my LAN with only 3 video
> streams.

At lab meetings, everyone has got rather nice laptops, and Galène works
reasonably well up to 25 videos or so, except for the elegant people with
the fancy fanless MacBooks. The situation is different at lectures, of
course, but then the students are not too keen on switching their cameras
on. (Back in December, a student who wasn't reacting to a question
admitted to being busy frying eggs. I naturally accepted his excuse as
perfectly legitimate.)

> Not to speak of all the crappy Internet connections, already overloaded
> while other family members are watching Netflix, or similar situations
> in shared flats.

Galène should in principle drop the quality down when a link is
congested. We're currently dropping down to the lowest rate that everyone
can tolerate (but not beneath 200kbit/s); with simulcast or SVC, as
discussed in a previous mail, we'll be able to send different qualities
to different users.

> We had one hearing impaired user who hears a little bit with in-ear
> devices. Normally the user also follows spoken text by lip-reading to
> get more context. But this is nearly impossible for her in a video
> session because audio and video are not sufficiently synchronised with
> our setup.

Hmm... was that with Galène? Which browser?

In principle, Galène generates all the bits of protocol to perform
accurate lipsynch on the receiving side. I have verified that it works
well with Chrome.

> Not sure whether that's really a fairness issue within Galène. I can see
> differing latencies in /stats for different connections. The connection
> with higher latency, most times on all "Down" streams, has the higher
> latency consistently throughout the whole session. I suspect the
> receiver side is the issue.

You might be mis-reading the statistics. For an up stream, Galène only
keeps track of the amount of jitter. For a down stream, Galène keeps
track of both average delay and jitter.

  Up: ±3ms means 3ms average jitter;
  Down: 30ms±3ms means 30ms average delay with 3ms average jitter.

>> At the highest point, Galène was at 270% CPU, and the TURN server was
>> using another 50%. That's on a four-core VM.

> I'm far away from such a setup. So I wonder whether my response is
> useful at all.

It's very useful. We want Galène to scale all the way from a single-core
ARMv7 (thanks for the Beaglebone, Dave) to a 16-core, 32-thread server.

-- Juliusz
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-13 19:09 UTC
To: galene

On 1/12/21 10:22 PM, Juliusz Chroboczek wrote:
>>> We just had a meeting with 70 people, and at around 40 cameras switched on,
>
>> Which send quality were they all using? "normal"?
>
> The quality selected in the menu is the maximum allowable quality; Galène
> will rather eagerly drop down beneath it, all the way down to 200kbit/s
> (it will drop quality even more aggressively in the future). So we were
> running at "normal", but the resulting bitrate was somewhere between "low"
> and "lowest".

But if users are thrown out of sessions and reconnect, the browser will
again try a higher send rate, right? If that happens to several sending
users within a short time-frame, you will get some peaks affecting all
the receiving users. Well, I'm speculating here. (This reminds me of
the tuning of feedback control systems.)

>> We had one hearing impaired user who hears a little bit with in-ear
>> devices. Normally the user also follows spoken text by lip-reading to
>> get more context. But this is nearly impossible for her in a video
>> session because audio and video are not sufficiently synchronised with
>> our setup.
>
> Hmm... was that with Galène? Which browser?

Yes, with Galène. I don't know which browser was used, though.

> In principle, Galène generates all the bits of protocol to perform
> accurate lipsynch on the receiving side. I have verified that it works
> well with Chrome.

I'm not really capable of lip-reading, so I can't tell which quality
a lip-reading user would need. But there was some visible jitter even
within my LAN with just a one-to-one test. Maybe I can contact this user
to do some specific tests...

>> Not sure whether that's really a fairness issue within Galène. I can see
>> differing latencies in /stats for different connections. The connection
>> with higher latency, most times on all "Down" streams, has the higher
>> latency consistently throughout the whole session. I suspect the
>> receiver side is the issue.
>
> You might be mis-reading the statistics.

Yes, maybe.

> For an up stream, Galène only
> keeps track of the amount of jitter. For a down stream, Galène keeps
> track of both average delay and jitter.
>
>   Up: ±3ms means 3ms average jitter;
>   Down: 30ms±3ms means 30ms average delay with 3ms average jitter.

This matches my interpretation. As said, the average down-stream delay
was consistently higher for users having issues.

Most users here are using the usual VDSL or cable-TV connections within
Germany. I have no knowledge whether some users are still using slower
DSL lines.

For connections/users without issues I see average down-stream delays of
40..60 ms with jitter of 10..40 ms. Users reported no problems even with
down-stream delays of 100..150 ms. Real problems start when the average
down-stream delay is 150+ ms most of the time.

Ciao, Michael.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-14 12:59 UTC
To: Michael Ströder; +Cc: galene

> Real problems start when the average down-stream delay is 150+ ms most
> of the time.

That's rather extreme. Next time I have a meeting on Galène, I'll
artificially increase my delay to 200ms to see how well it works.
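One way to simulate such a delay on the client side is netem; a minimal
sketch, with "eth0" standing in for whatever interface actually carries
the traffic:

  # Add 200ms of artificial delay to all outgoing packets
  $ sudo tc qdisc add dev eth0 root netem delay 200ms

  # Remove it again after the test
  $ sudo tc qdisc del dev eth0 root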
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-14 13:03 UTC
To: galene

On 1/14/21 1:59 PM, Juliusz Chroboczek wrote:
>> Real problems start when the average down-stream delay is 150+ ms most
>> of the time.
>
> That's rather extreme. Next time I have a meeting on Galène, I'll
> artificially increase my delay to 200ms to see how well it works.

I just wanted to share what happens in practice, so don't dedicate too
much effort to such a scenario. It is a rather rare technical issue at
the receiver's side which has to be solved there anyway.

Ciao, Michael.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-14 13:10 UTC
To: Michael Ströder; +Cc: galene

> I just wanted to share what happens in practice, so don't dedicate too
> much effort to such a scenario.

A lot of our students follow lectures over 4G. Making our Galène usable
by people on a poor 4G connection is a worthy goal.
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-14 13:23 UTC
To: galene

On 1/14/21 2:10 PM, Juliusz Chroboczek wrote:
>> I just wanted to share what happens in practice, so don't dedicate too
>> much effort to such a scenario.
>
> A lot of our students follow lectures over 4G. Making our Galène usable
> by people on a poor 4G connection is a worthy goal.

Just wanted to make clear that I'm not asking you to take care of such
a high latency.

Ciao, Michael.