* [Galene] fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 13:46 UTC
To: galene; +Cc: Dave Taht

Dave, Toke,

We're having a large meeting on the lab's Galène server, and it looks like
fq-codel is thrashing: the new_flow_count increases as fast as the packet
counter. How do I fix that?

-- Juliusz
* [Galene] Re: fq-codel thrashing
From: Toke Høiland-Jørgensen @ 2021-01-12 15:55 UTC
To: Juliusz Chroboczek, galene; +Cc: Dave Taht

Juliusz Chroboczek <jch@irif.fr> writes:

> Dave, Toke,
>
> We're having a large meeting on the lab's Galène server, and it looks like
> fq-codel is thrashing: the new_flow_count increases as fast as the packet
> counter. How do I fix that?

That just sounds like wherever you're running FQ-CoDel is not actually
the bottleneck? If there's no backpressure, the queue will drain for each
packet, so that by the time the next packet comes along it'll be
considered "new" again.

If you're running a few video flows on a gigabit link (and thus using a
fraction of the bandwidth), that would be expected. But it only happens
if you're also running a shaper (such as HTB or TBF) on the interface,
since if FQ-CoDel is installed on a physical interface it will set the
TCQ_F_CAN_BYPASS flag, which means that when the queue is completely
empty it will be bypassed entirely...

I just replicated the phenomenon like this:

$ sudo tc qdisc replace dev testns root tbf rate 1gbit latency 10ms burst 1024
$ sudo tc qdisc add dev testns parent 8002: fq_codel
$ tc -s qdisc
qdisc tbf 8002: dev testns root refcnt 2 rate 1Gbit burst 1000b lat 10ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 8003: dev testns parent 8002: limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

$ ping -c 5 fc00:dead:cafe:1::2
PING fc00:dead:cafe:1::2(fc00:dead:cafe:1::2) 56 data bytes
64 bytes from fc00:dead:cafe:1::2: icmp_seq=1 ttl=64 time=0.088 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=3 ttl=64 time=0.102 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=4 ttl=64 time=0.103 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=5 ttl=64 time=0.104 ms

--- fc00:dead:cafe:1::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4041ms
rtt min/avg/max/mdev = 0.088/0.107/0.138/0.016 ms

$ tc -s qdisc
qdisc tbf 8002: dev testns root refcnt 2 rate 1Gbit burst 1000b lat 10ms
 Sent 590 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 8003: dev testns parent 8002: limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 590 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 118 drop_overlimit 0 new_flow_count 5 ecn_mark 0
  new_flows_len 0 old_flows_len 0

Whereas if fq_codel is the root qdisc:

$ sudo tc qdisc replace dev testns root fq_codel
$ tc -s qdisc
qdisc fq_codel 8004: dev testns root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

$ ping -c 5 fc00:dead:cafe:1::2
PING fc00:dead:cafe:1::2(fc00:dead:cafe:1::2) 56 data bytes
64 bytes from fc00:dead:cafe:1::2: icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=3 ttl=64 time=0.098 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from fc00:dead:cafe:1::2: icmp_seq=5 ttl=64 time=0.096 ms

--- fc00:dead:cafe:1::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4058ms

$ tc -s qdisc
qdisc fq_codel 8004: dev testns root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 590 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

-Toke
* [Galene] Re: fq-codel thrashing
From: Dave Taht @ 2021-01-12 16:01 UTC
To: Toke Høiland-Jørgensen; +Cc: Juliusz Chroboczek, galene, Dave Taht

On Tue, Jan 12, 2021 at 7:55 AM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> Juliusz Chroboczek <jch@irif.fr> writes:
>
> > Dave, Toke,
> >
> > We're having a large meeting on the lab's Galène server, and it looks like
> > fq-codel is thrashing: the new_flow_count increases as fast as the packet
> > counter. How do I fix that?

It is a pretty useless statistic; I never liked it. In this case it is
a positive sign that you are not filling the queue, not a negative one,
as Toke explains in more detail below.

If you are having fun with fq_codel, try cake instead. It has a lot more
stats, some of which are far more relevant to your application.

tc qdisc add dev eth0 root cake bandwidth XMbit  # besteffort if you don't want to bother with diffserv
tc -s qdisc show

                Bulk   Best Effort        Voice
  thresh    562496bit        9Mbit     2250Kbit
  target       32.3ms        5.0ms        8.1ms
  interval    127.3ms      100.0ms      103.1ms
  pk_delay      9.2ms        224us        494us
  av_delay      2.9ms         29us         35us
  sp_delay      153us          4us          3us
  backlog          0b           0b           0b
  pkts          74853     35044344       162590
  bytes      80542836   9388672264     27301751
  way_inds       2538      2129502        10508
  way_miss       2575       549864         7374
  way_cols          0            0            0
  drops            11         7987            0
  marks            22            1            0
  ack_drop          0      3616230            0
  sp_flows          1            1            1
  bk_flows          0            1            0
  un_flows          0            0            0
  max_len       15605        25262         3169
  quantum         300          300          300

> That just sounds like wherever you're running FQ-CoDel is not actually
> the bottleneck? If there's no backpressure, the queue will drain for each
> packet, so that by the time the next packet comes along it'll be
> considered "new" again.
>
> If you're running a few video flows on a gigabit link (and thus using a
> fraction of the bandwidth), that would be expected. But it only happens
> if you're also running a shaper (such as HTB or TBF) on the interface,
> since if FQ-CoDel is installed on a physical interface it will set the
> TCQ_F_CAN_BYPASS flag, which means that when the queue is completely
> empty it will be bypassed entirely...
>
> I just replicated the phenomenon like this:
>
> [...]
>
> -Toke

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 17:38 UTC
To: Toke Høiland-Jørgensen; +Cc: galene, Dave Taht

>> We're having a large meeting on the lab's Galène server, and it looks like
>> fq-codel is thrashing: the new_flow_count increases as fast as the packet
>> counter.

> That just sounds like wherever you're running FQ-CoDel is not actually
> the bottleneck?

Right.

> But it only happens if you're also running a shaper (such as HTB or TBF)
> on the interface, since if FQ-CoDel is installed on a physical interface
> it will set the TCQ_F_CAN_BYPASS flag, which means that when the queue
> is completely empty it will be bypassed entirely...

I'm not running a shaper. The interface is reported as

  00:03.0 Ethernet controller: Red Hat, Inc Virtio network device

Ethtool doesn't give anything interesting.

Here's the output of tc -s qdisc show:

qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
 Sent 1270306213261 bytes 3032714775 pkt (dropped 2839148, overlimits 0 requeues 10623)
 backlog 0b 0p requeues 10623
  maxpacket 68130 drop_overlimit 1576971 new_flow_count 732081615 ecn_mark 22
  new_flows_len 0 old_flows_len 0
* [Galene] Re: fq-codel thrashing
From: Dave Taht @ 2021-01-12 17:42 UTC
To: Juliusz Chroboczek; +Cc: galene, Dave Taht

On Tue, Jan 12, 2021 at 9:38 AM Juliusz Chroboczek <jch@irif.fr> wrote:
>
> [...]
>
> I'm not running a shaper. The interface is reported as
>
>   00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
>
> Ethtool doesn't give anything interesting.
>
> Here's the output of tc -s qdisc show:
>
> qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
>  Sent 1270306213261 bytes 3032714775 pkt (dropped 2839148, overlimits 0 requeues 10623)
>  backlog 0b 0p requeues 10623
>   maxpacket 68130 drop_overlimit 1576971 new_flow_count 732081615 ecn_mark 22
>   new_flows_len 0 old_flows_len 0

Nice to see drops and marks.

If you have spare CPU, you can also run cake natively with no shaper, and
strip GRO (maxpacket ~64k indicates GRO is on). On the other hand, GRO
saves on context switches in VMs.

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
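For reference, the GRO/GSO settings mentioned above can be inspected and
changed with ethtool. This is only a sketch: "eth0" is a placeholder for
the actual virtio interface name, and not every driver allows every
offload to be toggled.

  # Show whether generic receive/segmentation offload is currently enabled
  $ ethtool -k eth0 | grep -E 'generic-(segmentation|receive)-offload'

  # Turn both off, so the qdisc sees individual packets rather than ~64k aggregates
  $ sudo ethtool -K eth0 gro off gso off

Turning GRO off trades the per-packet CPU savings mentioned above for
finer-grained queueing, so it is worth measuring both ways.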
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 18:10 UTC
To: Dave Taht; +Cc: galene, Dave Taht

> If you have spare CPU, you can also run cake natively with no shaper, and
> strip GRO (maxpacket ~64k indicates GRO is on).

GRO will only trigger for HTTP, WebSocket and TURN traffic, I believe;
most of Galène's traffic is RTP over UDP, with packets under 1200 bytes.

We just had a meeting with 70 people, and at around 40 cameras switched on,
Galène became unusable — there were too many voice drops, which indicates
two issues:

* I need to think of a better way of prioritising voice over video when
  under load;

* there are fairness issues — some clients were receiving okay-ish
  audio, others were not.

Galène recovered after some people switched their cameras off, I didn't
need to restart anything. At the highest point, Galène was at 270% CPU,
and the TURN server was using another 50%. That's on a four-core VM.

> On the other hand, GRO saves on context switches in VMs.

$ PRODUCT INSTALL GALENE /CONFIGURATION=DEFAULT /LOG

-- Juliusz
* [Galene] Re: fq-codel thrashing
From: Dave Taht @ 2021-01-12 19:05 UTC
To: Juliusz Chroboczek; +Cc: galene, Dave Taht

On Tue, Jan 12, 2021 at 10:10 AM Juliusz Chroboczek <jch@irif.fr> wrote:
>
> > If you have spare CPU, you can also run cake natively with no shaper, and
> > strip GRO (maxpacket ~64k indicates GRO is on).
>
> GRO will only trigger for HTTP, WebSocket and TURN traffic, I believe;
> most of Galène's traffic is RTP over UDP, with packets under 1200 bytes.
>
> We just had a meeting with 70 people, and at around 40 cameras switched on,
> Galène became unusable — there were too many voice drops, which indicates
> two issues:

I have generally found that VMs perform badly for lots of small packets
and r/t. Please try cake. And collect a capture on the underlying hw if
possible.

> * I need to think of a better way of prioritising voice over video when
>   under load;

I take it (I really didn't understand) that unbundling these two types
is not currently feasible in the javascript or web browser? Still, I am
pretty interested in this layer of stuff, but since crypto was added to
RTP it's got really hard to look at it.

> * there are fairness issues — some clients were receiving okay-ish
>   audio, others were not.

Collisions in fq_codel start to occur at around sqrt(1024) = 32 flows, so
at 70 users the odds are that 2 or 3 of your sessions were colliding.
Cake uses an 8-way set-associative hash: 0 collisions at this load.

I imagine you were not tracking the actual backlog in fq_codel during
this conference?

(while :; do tc -s qdisc show dev bla >> whatever; sleep 1; done)

All that said, I tend to point fingers at loss at the virtio layer and
the underlying hw first, just because I'm defensive about people blaming
fq_codel for anything. :) I also hate VMs for r/t traffic. You had a lot
of GRO traffic, it looked like. You can disable gso/gro via ethtool and
stick with fq_codel, or use cake with the...

> Galène recovered after some people switched their cameras off, I didn't
> need to restart anything. At the highest point, Galène was at 270% CPU,
> and the TURN server was using another 50%. That's on a four-core VM.

But it does sound like more virtual cores would help at this load? Are
you in a position to profile? You are most likely also context switching
like crazy.

> > On the other hand, GRO saves on context switches in VMs.
>
> $ PRODUCT INSTALL GALENE /CONFIGURATION=DEFAULT /LOG
>
> -- Juliusz

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman
dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729
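As a concrete illustration of the two suggestions above, something along
these lines would do. It is only a sketch, assuming the interface is called
eth0 and that the sch_cake module is available on the server's kernel:

  # cake as the root qdisc with no shaper (omitting "bandwidth" leaves the
  # link itself to provide backpressure); "besteffort" skips the diffserv tins
  $ sudo tc qdisc replace dev eth0 root cake besteffort

  # Sample the qdisc statistics (including backlog) once a second during a meeting
  $ while :; do date >> qdisc-stats.log; tc -s qdisc show dev eth0 >> qdisc-stats.log; sleep 1; done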
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-12 19:52 UTC
To: galene

On 1/12/21 8:05 PM, Dave Taht wrote:
> I have generally found that VMs perform badly for lots of small packets
> and r/t.

Being a rookie with all this, I can only say that I'm quite surprised by
what my Galène/libvirt/qemu-kvm setup manages on really slow fan-less
hardware.

Maybe this also depends on the kernel, qemu and libvirt versions? I
vaguely remember that there was work done on improving virtio. Running
openSUSE Tumbleweed, all the relevant software in my setup is pretty new.

Ciao, Michael.

P.S.: I've extracted some screenshots from grafana I could send in a
personal e-mail.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 21:02 UTC
To: Dave Taht; +Cc: galene

> Please try cake.

Both my servers are used in production right now, so I'm not very keen on
experimenting. I should have some funding for Galène soon.

> And collect a capture on the underlying hw if possible.

I don't have access to the underlying hardware; it's a cloud provider.

>> * I need to think of a better way of prioritising voice over video when
>>   under load;

> I take it (I really didn't understand) that unbundling these two types
> is not currently feasible in the javascript or web browser?

[...]

> Collisions in fq_codel start to occur at around sqrt(1024) = 32 flows, so
> at 70 users the odds are that 2 or 3 of your sessions were colliding.

I don't think the network is the problem. The problem is that there's
a lot of buffering in Pion, and as I push packets down the layers, I have
no way of finding out the amount already queued -- so when we're short on
CPU, the video tends to crowd out the audio. I'm not sure what the
solution is.

>> Galène recovered after some people switched their cameras off, I didn't
>> need to restart anything. At the highest point, Galène was at 270% CPU,
>> and the TURN server was using another 50%. That's on a four-core VM.

> But it does sound like more virtual cores would help at this load?

According to profiling, 60% of CPU time is spent in the write system call.

> You are most likely also context switching like crazy.

I'll measure next time, but I doubt it: Go's scheduler operates almost
entirely in userspace.

-- Juliusz
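For the measurements being discussed here, something like the following
would be a starting point. This is a sketch only: it assumes the server
process is called "galene", that the sysstat and perf tools are installed,
and (for the last line) that the Go process exposes a net/http/pprof
endpoint on port 6060, which may well not be the case.

  # Voluntary/involuntary context switches per second for the running process
  $ pidstat -w -p "$(pidof galene)" 1

  # Context switches and write() calls over a ten-second window
  $ sudo perf stat -e context-switches,syscalls:sys_enter_write -p "$(pidof galene)" -- sleep 10

  # 30-second CPU profile, if a pprof endpoint is enabled
  $ go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"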
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-12 19:29 UTC
To: galene

On 1/12/21 7:10 PM, Juliusz Chroboczek wrote:
> We just had a meeting with 70 people, and at around 40 cameras switched on,

Which send quality were they all using? "normal"?

Frankly, I understand only ~5% of what you're talking about in this
thread. But do you really expect all users' end devices to decode 40
video streams? Not to speak of all the crappy Internet connections,
already overloaded while other family members are watching Netflix, or
similar situations in shared flats.

As said before, I'm running Galène in a VM on insanely slow hardware.
But even with this setup and the send quality set to "lowest", we managed
to overwhelm some older iPad devices or older laptops with just 7 video
streams. In my local tests with slow and ancient 10+ year old laptops
I even get video drop-outs within my LAN with only 3 video streams.

> Galène became unusable — there were too many voice drops, which indicates
> two issues:
>
> * I need to think of a better way of prioritising voice over video when
>   under load;

We had one hearing impaired user who hears a little bit with in-ear
devices. Normally the user also follows spoken text by lip-reading to
get more context. But this is nearly impossible for her in a video
session because audio and video are not sufficiently synchronised with
our setup.

It would probably be helpful if the prioritisation of voice over video
could be changed per user, to at least have a chance for such a special
case.

> * there are fairness issues — some clients were receiving okay-ish
>   audio, others were not.

Not sure whether that's really a fairness issue within Galène. I can see
differing latencies in /stats for different connections. The connection
with higher latency, most times on all "Down" streams, has the higher
latency consistently throughout the whole session. I suspect the
receiver side is the issue.

> Galène recovered after some people switched their cameras off, I didn't
> need to restart anything.

I can confirm that Galène behaves pretty predictably if streams are
turned on or off.

> At the highest point, Galène was at 270% CPU,
> and the TURN server was using another 50%. That's on a four-core VM.

I'm far away from such a setup. So I wonder whether my response is
useful at all.

Ciao, Michael.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-12 21:22 UTC
To: Michael Ströder; +Cc: galene

>> We just had a meeting with 70 people, and at around 40 cameras switched on,

> Which send quality were they all using? "normal"?

The quality selected in the menu is the maximum allowable quality; Galène
will rather eagerly drop down beneath it, all the way down to 200kbit/s
(it will drop quality even more aggressively in the future). So we were
running at "normal", but the resulting bitrate was somewhere between "low"
and "lowest".

> Frankly, I understand only ~5% of what you're talking about in this
> thread.

Yeah, that's also our case. This doesn't prevent us from speaking,
though ;-)

> But do you really expect all users' end devices to decode 40 video
> streams? [...] In my local tests with slow and ancient 10+ year old
> laptops I even get video drop-outs within my LAN with only 3 video
> streams.

At lab meetings, everyone has got rather nice laptops, and Galène works
reasonably well up to 25 videos or so, except for the elegant people with
the fancy fanless MacBooks. The situation is different at lectures, of
course, but then the students are not too keen on switching their cameras
on. (Back in December, a student who wasn't reacting to a question
admitted to being busy frying eggs. I naturally accepted his excuse as
perfectly legitimate.)

> Not to speak of all the crappy Internet connections, already overloaded
> while other family members are watching Netflix, or similar situations
> in shared flats.

Galène should in principle drop the quality down when a link is
congested. We're currently dropping down to the lowest rate that everyone
can tolerate (but not beneath 200kbit/s); with simulcast or SVC, as
discussed in a previous mail, we'll be able to send different qualities
to different users.

> We had one hearing impaired user who hears a little bit with in-ear
> devices. Normally the user also follows spoken text by lip-reading to
> get more context. But this is nearly impossible for her in a video
> session because audio and video are not sufficiently synchronised with
> our setup.

Hmm... was that with Galène? Which browser?

In principle, Galène generates all the bits of protocol to perform
accurate lipsynch on the receiving side. I have verified that it works
well with Chrome.

> Not sure whether that's really a fairness issue within Galène. I can see
> differing latencies in /stats for different connections. The connection
> with higher latency, most times on all "Down" streams, has the higher
> latency consistently throughout the whole session. I suspect the
> receiver side is the issue.

You might be mis-reading the statistics. For an up stream, Galène only
keeps track of the amount of jitter. For a down stream, Galène keeps
track of both average delay and jitter.

  Up: ±3ms means 3ms average jitter;
  Down: 30ms±3ms means 30ms average delay with 3ms average jitter.

>> At the highest point, Galène was at 270% CPU, and the TURN server was
>> using another 50%. That's on a four-core VM.

> I'm far away from such a setup. So I wonder whether my response is
> useful at all.

It's very useful. We want Galène to scale all the way from a single-core
ARMv7 (thanks for the Beaglebone, Dave) to a 16-core, 32-thread server.

-- Juliusz
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-13 19:09 UTC
To: galene

On 1/12/21 10:22 PM, Juliusz Chroboczek wrote:
>>> We just had a meeting with 70 people, and at around 40 cameras switched on,
>
>> Which send quality were they all using? "normal"?
>
> The quality selected in the menu is the maximum allowable quality; Galène
> will rather eagerly drop down beneath it, all the way down to 200kbit/s
> (it will drop quality even more aggressively in the future). So we were
> running at "normal", but the resulting bitrate was somewhere between "low"
> and "lowest".

But if users are thrown out of sessions and reconnect, the browser will
again try a higher send rate, right? If that happens to several sending
users within a short time-frame, you will get some peaks affecting all
the receiving users. Well, I'm speculating here. (This reminds me of
the tuning of feedback control systems.)

>> We had one hearing impaired user who hears a little bit with in-ear
>> devices. Normally the user also follows spoken text by lip-reading to
>> get more context. But this is nearly impossible for her in a video
>> session because audio and video are not sufficiently synchronised with
>> our setup.
>
> Hmm... was that with Galène? Which browser?

Yes, with Galène. I don't know which browser was used, though.

> In principle, Galène generates all the bits of protocol to perform
> accurate lipsynch on the receiving side. I have verified that it works
> well with Chrome.

I'm not really capable of lip-reading, so I can't tell which quality
a lip-reading user would need. But there was some visible jitter even
within my LAN with just a one-to-one test. Maybe I can contact this user
to do some specific tests...

>> Not sure whether that's really a fairness issue within Galène. I can see
>> differing latencies in /stats for different connections. The connection
>> with higher latency, most times on all "Down" streams, has the higher
>> latency consistently throughout the whole session. I suspect the
>> receiver side is the issue.
>
> You might be mis-reading the statistics.

Yes, maybe.

> For an up stream, Galène only
> keeps track of the amount of jitter. For a down stream, Galène keeps
> track of both average delay and jitter.
>
>   Up: ±3ms means 3ms average jitter;
>   Down: 30ms±3ms means 30ms average delay with 3ms average jitter.

This matches my interpretation. As said, the average down-stream delay
was consistently higher for users having issues.

Most users here are using the usual VDSL or cable-TV connections within
Germany. I have no knowledge whether some users are still using slower
DSL lines.

For connections/users without issues I see average down-stream delays of
40..60 ms with jitter of 10..40 ms. Users reported no problems even with
down-stream delays of 100..150 ms. Real problems start when the average
down-stream delay is 150+ ms most of the time.

Ciao, Michael.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-14 12:59 UTC
To: Michael Ströder; +Cc: galene

> Real problems start when the average down-stream delay is 150+ ms most
> of the time.

That's rather extreme. Next time I have a meeting on Galène, I'll
artificially increase my delay to 200ms to see how well it works.
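One way to simulate such a delay on the client side is netem; a minimal
sketch, with "eth0" standing in for whatever interface actually carries
the traffic:

  # Add 200ms of artificial delay to all outgoing packets
  $ sudo tc qdisc add dev eth0 root netem delay 200ms

  # Remove it again after the test
  $ sudo tc qdisc del dev eth0 root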
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-14 13:03 UTC
To: galene

On 1/14/21 1:59 PM, Juliusz Chroboczek wrote:
>> Real problems start when the average down-stream delay is 150+ ms most
>> of the time.
>
> That's rather extreme. Next time I have a meeting on Galène, I'll
> artificially increase my delay to 200ms to see how well it works.

I just wanted to share what happens in practice, so don't dedicate too
much effort to such a scenario. It is a rather rare technical issue at
the receiver's side which has to be solved there anyway.

Ciao, Michael.
* [Galene] Re: fq-codel thrashing
From: Juliusz Chroboczek @ 2021-01-14 13:10 UTC
To: Michael Ströder; +Cc: galene

> I just wanted to share what happens in practice, so don't dedicate too
> much effort to such a scenario.

A lot of our students follow lectures over 4G. Making our Galène usable
by people on a poor 4G connection is a worthy goal.
* [Galene] Re: fq-codel thrashing
From: Michael Ströder @ 2021-01-14 13:23 UTC
To: galene

On 1/14/21 2:10 PM, Juliusz Chroboczek wrote:
>> I just wanted to share what happens in practice, so don't dedicate too
>> much effort to such a scenario.
>
> A lot of our students follow lectures over 4G. Making our Galène usable
> by people on a poor 4G connection is a worthy goal.

Just wanted to make clear that I'm not asking you to take care of such
a high latency.

Ciao, Michael.