Dear all,

I've just pushed a new revision of the protocol, which does renegotiation
correctly. Please let me know if you see any issues.

-- Juliusz
On 2/3/21 8:04 PM, Juliusz Chroboczek wrote:
> I've just pushed a new revision of the protocol, which does renegotiation
> correctly. Please let me know if you see any issues.
Anything in particular you want to be tested?
Ciao, Michael.
>> I've just pushed a new revision of the protocol, which does renegotiation
>> correctly. Please let me know if you see any issues.

> Anything in particular you want to be tested?

Behaviour against anything other than Chrome. For example, we used to
destroy the stream and create it anew when the user changed the value in
the "Receive" menu. We now renegotiate, and while I've hacked around
issues with Chrome, I'm not sure how other browsers will react to the
renegotiation.

The other thing to test is behaviour when large numbers of users arrive
and leave simultaneously, which will happen at the next lecture.

-- Juliusz
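(For concreteness, server-side renegotiation with pion/webrtc v3 amounts
to something like the following rough sketch; the function, package name
and signalling callback are hypothetical, not Galène's actual code.)

    package sfu // hypothetical

    import "github.com/pion/webrtc/v3"

    // renegotiate adds a track to an existing PeerConnection and starts
    // a fresh offer/answer exchange, instead of tearing the connection
    // down and creating it anew.
    func renegotiate(pc *webrtc.PeerConnection, track webrtc.TrackLocal,
        sendOffer func(*webrtc.SessionDescription) error) error {
        if _, err := pc.AddTrack(track); err != nil {
            return err
        }
        offer, err := pc.CreateOffer(nil)
        if err != nil {
            return err
        }
        if err := pc.SetLocalDescription(offer); err != nil {
            return err
        }
        // The client's answer comes back over the signalling channel
        // and is applied with pc.SetRemoteDescription.
        return sendOffer(pc.LocalDescription())
    }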
On 2/3/21 10:11 PM, Juliusz Chroboczek wrote:
>>> I've just pushed a new revision of the protocol, which does renegotiation
>>> correctly. Please let me know if you see any issues.
>
>> Anything in particular you want to be tested?
>
> Behaviour against anything other than Chrome.
I now see *lots* of these messages:
ice WARNING: 2021/02/04 11:07:24 failed to send packet: write udp
10.23.45.67:53560->10.23.45.89:50995: use of closed network connection
Is this something to worry about?
Also, my screen sharing froze several times in a 2-user conference.
Unsharing and sharing again made it work.
I'm using these PION env vars:
PION_LOG_TRACE=""
PION_LOG_DEBUG=""
PIONS_LOG_INFO="ice"
PIONS_LOG_WARNING="all"
PIONS_LOG_ERROR="all"
Ciao, Michael.
On 2/4/21 11:49 AM, Michael Ströder wrote:
> On 2/3/21 10:11 PM, Juliusz Chroboczek wrote:
>>>> I've just pushed a new revision of the protocol, which does renegotiation
>>>> correctly. Please let me know if you see any issues.
>>
>>> Anything in particular you want to be tested?
>>
>> Behaviour against anything other than Chrome.
>
> I now see *lots* of these messages:
>
> ice WARNING: 2021/02/04 11:07:24 failed to send packet: write udp
> 10.23.45.67:53560->10.23.45.89:50995: use of closed network connection
>
> Is this something to worry about?
Hmm, this definitely did not happen before yesterday's commits. And it
is filling my log at quite a high rate.
Ciao, Michael.
Could you pull and see if it's better?
On 2/4/21 9:04 PM, Juliusz Chroboczek wrote:
> Could you pull and see if it's better?
I cannot claim to have a real test plan.
But I've managed to reproduce the issue with git revision
6054ae6cc6b25e92f3fc482eb60d2ec0e7ee9ff6 and two Firefox 85 instances
with different profiles, by wildly switching receive options and leaving
and re-entering the session.
I did not manage to reproduce with git revision
b4240c45059d9b4d8c8119fae8a3ccc1a5969377.
So at first glance the latter is an improvement.
But decent testing is something else. ;-)
Ciao, Michael.
On 2/4/21 9:47 PM, Michael Ströder wrote:
> On 2/4/21 9:04 PM, Juliusz Chroboczek wrote:
>> Could you pull and see if it's better?
>
> I cannot claim to have a real test plan.
>
> But I've managed to reproduce the issue with git revision
> 6054ae6cc6b25e92f3fc482eb60d2ec0e7ee9ff6 and two Firefox 85 instances
> with different profiles, by wildly switching receive options and leaving
> and re-entering the session.
>
> I did not manage to reproduce with git revision
> b4240c45059d9b4d8c8119fae8a3ccc1a5969377.
>
> So at first glance the latter is an improvement.
>
> But decent testing is something else. ;-)
Hmm, *after* writing the above, the log messages started to appear again.
Ciao, Michael.
> Hmm, *after* writing the above, the log messages started to appear again.
Are there any user-visible issues when this happens? I suspect it's
nothing to worry about, just one of the threads that send data racing with
the thread that shuts the other threads down.
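(The race being described is easy to exhibit in a few lines of Go; a
minimal, self-contained illustration, not Galène's actual code:)

    package main

    import (
        "log"
        "net"
        "time"
    )

    func main() {
        // A connected UDP socket; no listener is needed on the other
        // side for Dial to succeed.
        conn, err := net.Dial("udp", "127.0.0.1:50995")
        if err != nil {
            log.Fatal(err)
        }

        // Writer goroutine, standing in for the threads that send data.
        go func() {
            for {
                if _, err := conn.Write([]byte("rtp")); err != nil {
                    // After Close below, this logs
                    // "use of closed network connection".
                    log.Println("failed to send packet:", err)
                    return
                }
                time.Sleep(10 * time.Millisecond)
            }
        }()

        // The shutdown path closes the socket without waiting for the
        // writer to finish first.
        conn.Close()
        time.Sleep(50 * time.Millisecond)
    }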
On 2/4/21 10:57 PM, Juliusz Chroboczek wrote:
>> Hmm, *after* writing the above, the log messages started to appear again.
>
> Are there any user-visible issues when this happens?

Not really.

> I suspect it's nothing to worry about, just one of the threads that
> send data racing with the thread that shuts the other threads down.

Hmm, but pion ice logs this as a warning. Should I really ignore
hundreds or thousands of messages per minute? And why didn't I see this
before?

Ciao, Michael.
>> I suspect it's nothing to worry about, just one of the threads that
>> send data racing with the thread that shuts the other threads down.

> Hmm, but pion ice logs this as a warning. Should I really ignore
> hundreds or thousands of messages per minute? And why didn't I see this
> before?

It looks like I wasn't always closing connections correctly. Could you
please check if 66de0d solves the issue?
On 2/4/21 11:53 PM, Juliusz Chroboczek wrote:
>>> I suspect it's nothing to worry about, just one of the threads that
>>> send data racing with the thread that shuts the other threads down.
>
>> Hmm, but pion ice logs this as a warning. Should I really ignore
>> hundreds or thousands of messages per minute? And why didn't I see this
>> before?
>
> It looks like I wasn't always closing connections correctly. Could you
> please check if 66de0d solves the issue?
Sorry, I still get these warnings.
Ciao, Michael.
On 2/5/21 10:59 AM, Michael Ströder wrote:
> On 2/4/21 11:53 PM, Juliusz Chroboczek wrote:
>>>> I suspect it's nothing to worry about, just one of the threads that
>>>> send data racing with the thread that shuts the other threads down.
>>
>>> Hmm, but pion ice logs this as a warning. Should I really ignore
>>> hundreds or thousands of messages per minute? And why didn't I see this
>>> before?
>>
>> It looks like I wasn't always closing connections correctly. Could you
>> please check if 66de0d solves the issue?
>
> Sorry, I still get these warnings.
Well, my impression is that fewer warning messages are logged, but that's
not something I can really quantify. And they keep being logged after
both test users (one operator, one normal user) have left the session,
until I restart Galène.
Ciao, Michael.
>> Sorry, I still get these warnings.

> And they keep being logged after both test users (one operator, one
> normal user) have left the session, until I restart Galène.

I'm unable to reproduce the warnings, but I am seeing something wrong
with the system call activity which indicates that something is not
being closed correctly.
commit c3a19c9128f2922d2285cc840dc9b633e0f625ce (HEAD -> master, origin/master)
Author: Juliusz Chroboczek <jch@irif.fr>
Date: Fri Feb 5 12:20:33 2021 +0100

Avoid race between closing connections and terminating client.

We need to terminate all down connections synchronously, otherwise
we risk leaving open connections lying around.
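(A minimal sketch of the synchronous teardown described above, with
hypothetical names; not the actual Galène code:)

    package sfu // hypothetical

    import (
        "net"
        "sync"
        "time"
    )

    // downConnection stands in for a down (server-to-client) connection
    // with a single writer goroutine.
    type downConnection struct {
        conn net.Conn
        done chan struct{}  // closed to tell the writer to stop
        wg   sync.WaitGroup // tracks the writer goroutine
    }

    func newDownConnection(conn net.Conn) *downConnection {
        down := &downConnection{conn: conn, done: make(chan struct{})}
        down.wg.Add(1)
        go down.writeLoop()
        return down
    }

    func (down *downConnection) writeLoop() {
        defer down.wg.Done()
        for {
            select {
            case <-down.done:
                return
            default:
                down.conn.Write([]byte("rtp")) // errors elided
                time.Sleep(10 * time.Millisecond)
            }
        }
    }

    // shutdown terminates the connection synchronously: by the time
    // Close runs, the writer has exited and can no longer race with it.
    func (down *downConnection) shutdown() {
        close(down.done)
        down.wg.Wait()
        down.conn.Close()
    }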
On 2/5/21 12:22 PM, Juliusz Chroboczek wrote:
> commit c3a19c9128f2922d2285cc840dc9b633e0f625ce (HEAD -> master, origin/master)
> Author: Juliusz Chroboczek <jch@irif.fr>
> Date: Fri Feb 5 12:20:33 2021 +0100
>
> Avoid race between closing connections and terminating client.
>
> We need to terminate all down connections synchronously, otherwise
> we risk leaving open connections lying around.
So far this looks good. The harder test will be this evening with more
users.
Ciao, Michael.
On 2/5/21 3:19 PM, Michael Ströder wrote:
> On 2/5/21 12:22 PM, Juliusz Chroboczek wrote:
>> commit c3a19c9128f2922d2285cc840dc9b633e0f625ce (HEAD -> master, origin/master)
>> Author: Juliusz Chroboczek <jch@irif.fr>
>> Date: Fri Feb 5 12:20:33 2021 +0100
>>
>> Avoid race between closing connections and terminating client.
>>
>> We need to terminate all down connections synchronously, otherwise
>> we risk leaving open connections lying around.
>
> So far this looks good. The harder test will be this evening with more
> users.
Worked just fine with 7 users and no warning messages at all.
Ciao, Michael.
On 2/5/21 9:40 PM, Michael Ströder wrote:
> On 2/5/21 3:19 PM, Michael Ströder wrote:
>> On 2/5/21 12:22 PM, Juliusz Chroboczek wrote:
>>> commit c3a19c9128f2922d2285cc840dc9b633e0f625ce (HEAD -> master, origin/master)
>>> Author: Juliusz Chroboczek <jch@irif.fr>
>>> Date: Fri Feb 5 12:20:33 2021 +0100
>>>
>>> Avoid race between closing connections and terminating client.
>>>
>>> We need to terminate all down connections synchronously, otherwise
>>> we risk leaving open connections lying around.
>>
>> So far this looks good. The harder test will be this evening with more
>> users.
>
> Worked just fine with 7 users and no warning messages at all.
Unfortunately I get these warning messages again with git revision 8f89ac0.
Prior to that, the user wasn't able to use the chat, and pion/ice logged
this several times, without further info:
ice WARNING: 2021/02/10 18:59:02 pingAllCandidates called with no
candidate pairs. Connection is not possible yet.
The user left the session and re-entered. This resulted in the user
being listed twice and ~70 messages/sec saying:
failed to send packet: write udp 10.1.1.13:52736->10.1.1.28:49213: use
of closed network connection
Note that this seems to be a connection relayed via a TURN server.
I can't tell why relaying was enforced, though.
Ciao, Michael.
Michael Ströder <michael@stroeder.com> writes:
> On 2/5/21 9:40 PM, Michael Ströder wrote:
>> On 2/5/21 3:19 PM, Michael Ströder wrote:
>>> On 2/5/21 12:22 PM, Juliusz Chroboczek wrote:
>>>> commit c3a19c9128f2922d2285cc840dc9b633e0f625ce (HEAD -> master, origin/master)
>>>> Author: Juliusz Chroboczek <jch@irif.fr>
>>>> Date: Fri Feb 5 12:20:33 2021 +0100
>>>>
>>>> Avoid race between closing connections and terminating client.
>>>>
>>>> We need to terminate all down connections synchronously, otherwise
>>>> we risk leaving open connections lying around.
>>>
>>> So far this looks good. The harder test will be this evening with more
>>> users.
>>
>> Worked just fine with 7 users and no warning messages at all.
>
> Unfortunately I get these warning messages again with git revision
> 8f89ac0.
I'm also seeing warnings on this version:
Feb 10 18:58:37 video galene[1251625]: 2021/02/10 18:58:37 client: client is dead
Feb 10 19:08:16 video galene[1251625]: 2021/02/10 19:08:16 client: websocket: close 1006 (abnormal closure): unexpected EOF
Feb 10 19:15:29 video galene[1251625]: 2021/02/10 19:15:29 Read RTCP: io: read/write on closed pipe
Feb 10 19:15:29 video galene[1251625]: 2021/02/10 19:15:29 Read RTCP: io: read/write on closed pipe
Feb 10 19:15:29 video galene[1251625]: 2021/02/10 19:15:29 Read RTCP: io: read/write on closed pipe
Feb 10 19:15:29 video galene[1251625]: 2021/02/10 19:15:29 client: client is dead
Feb 10 19:24:24 video galene[1251625]: 2021/02/10 19:24:24 Read RTCP: io: read/write on closed pipe
Feb 10 19:24:24 video galene[1251625]: 2021/02/10 19:24:24 Read RTCP: io: read/write on closed pipe
Feb 10 19:24:24 video galene[1251625]: 2021/02/10 19:24:24 client: read tcp 45.145.95.8:443->192.38.142.147:1128: read: connection reset by peer
Feb 10 20:25:23 video galene[1251625]: 2021/02/10 20:25:23 sendPLI: io: read/write on closed pipe
Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 sendPLI: io: read/write on closed pipe
Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
I also have a frozen video frame of a user who left stuck in the group
view now (but he's not in the user list).
-Toke
> Feb 10 19:24:24 video galene[1251625]: 2021/02/10 19:24:24 Read RTCP: io: read/write on closed pipe
This one is harmless. Still looking at the other ones.
commit a4cd27988f49454b2947d65a7df58143a1c9d9f0 (HEAD -> master, origin/master)
Author: Juliusz Chroboczek <jch@irif.fr>
Date: Thu Feb 11 15:56:47 2021 +0100
Don't complain about ErrClosedPipe in RTCP receiver.
This simply indicates that the server closed the connection
before we received the close indication from the client.
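(The shape of the fix is presumably a special case in the RTCP read
loop; a hypothetical sketch, not the actual diff:)

    package sfu // hypothetical

    import (
        "io"
        "log"
    )

    func rtcpReadLoop(r io.Reader) {
        buf := make([]byte, 1500)
        for {
            _, err := r.Read(buf)
            if err != nil {
                if err == io.ErrClosedPipe {
                    // The server closed the connection first;
                    // a normal shutdown, so stay quiet.
                    return
                }
                log.Println("Read RTCP:", err)
                return
            }
            // ... handle the RTCP packet ...
        }
    }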
> The user left the session and re-entered. This resulted in the user
> being listed twice and ~70 messages/sec saying:
This one is worrying. It indicates that the per-client automaton is
stuck, which might lead to the whole group getting deadlocked later on.
If you manage to reproduce it, and if nobody important is using the
server, it would be helpful if you could send a SIGABRT to the server
after that happens and send me the backtrace by private mail.
-- Juliusz
> Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist

> I also have a frozen video frame of a user who left stuck in the group
> view now (but he's not in the user list).

Did these two things happen at roughly the same time, or are they
unrelated events?
Juliusz Chroboczek <jch@irif.fr> writes:
>> Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
>
>> I also have a frozen video frame of a user who left stuck in the group
>> view now (but he's not in the user list).
>
> Did these two things happen at roughly the same time, or are they
> unrelated events?
Hmm, the log timestamps are UTC, and seeing as I sent that email right
after it happened, I suppose they could have been related? :)
-Toke
> I also have a frozen video frame of a user who left stuck in the group
> view now (but he's not in the user list).
Thanks Toke, I've managed to reproduce it. It happens if the client shuts
down before negotiation is complete. Probably some subtle ordering issue.
-- Juliusz
Juliusz Chroboczek <jch@irif.fr> writes:
>> I also have a frozen video frame of a user who left stuck in the group
>> view now (but he's not in the user list).
>
> Thanks Toke, I've managed to reproduce it. It happens if the client shuts
> down before negotiation is complete. Probably some subtle ordering
> issue.
Great! And yeah, the person this happened to did have a bit of a flaky
connection, so I guess it could have been because he got disconnected
while negotiation was still ongoing...
-Toke
> Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist

> I also have a frozen video frame of a user who left stuck in the group
> view now (but he's not in the user list).

This should be fixed now.

We were pushing connections asynchronously, which could cause the push
actions to arrive in the wrong order: if a connection was created and then
deleted, the deletion could arrive before the creation, in which case the
deletion would log a warning (file does not exist) and the creation would
succeed, leaving a frozen video.

Of course, we cannot just replace asynchronous with synchronous, as that
might cause deadlocks. So I've had to implement an explicit queue of
pending actions, and make communication between threads use unbounded
amounts of buffering.

Personal opinion -- I much prefer the semantics of Erlang's mailboxes
(which implement unbounded buffering) to that of Go's channels (which use
bounded amounts of buffering, and are therefore susceptible to deadlocks).

-- Juliusz
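(For the curious, an unbounded mailbox of the kind described can be
simulated in Go with a pair of channels and a goroutine that buffers
pending items in a slice; a generic sketch, not Galène's actual
implementation:)

    package main

    import "fmt"

    // mailbox returns a pair of channels such that sends on `in` never
    // block indefinitely: a goroutine queues items in a slice and feeds
    // them to `out` in order.
    func mailbox() (in chan<- interface{}, out <-chan interface{}) {
        inCh := make(chan interface{})
        outCh := make(chan interface{})
        go func() {
            var queue []interface{}
            for {
                // Only offer to send when something is queued; a send
                // on a nil channel blocks forever, disabling that case.
                var send chan interface{}
                var next interface{}
                if len(queue) > 0 {
                    send = outCh
                    next = queue[0]
                }
                select {
                case v, ok := <-inCh:
                    if !ok {
                        // Input closed: drain the queue, then close.
                        for _, v := range queue {
                            outCh <- v
                        }
                        close(outCh)
                        return
                    }
                    queue = append(queue, v)
                case send <- next:
                    queue = queue[1:]
                }
            }
        }()
        return inCh, outCh
    }

    func main() {
        in, out := mailbox()
        for i := 0; i < 3; i++ {
            in <- i // never blocks on a full buffer
        }
        close(in)
        for v := range out {
            fmt.Println(v)
        }
    }

The producer pays for this with unbounded memory if the consumer stalls,
which is exactly the Erlang trade-off mentioned above.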
Juliusz Chroboczek <jch@irif.fr> writes:
>> Feb 10 20:25:25 video galene[1251625]: 2021/02/10 20:25:25 Replace: file does not exist
>
>> I also have a frozen video frame of a user who left stuck in the group
>> view now (but he's not in the user list).
>
> This should be fixed now.
>
> We were pushing connections asynchronously, which could cause the push
> actions to arrive in the wrong order: if a connection was created and then
> deleted, the deletion could arrive before the creation, in which case the
> deletion would log a warning (file does not exist) and the creation would
> succeed, leaving a frozen video.
>
> Of course, we cannot just replace asynchronous with synchronous, as that
> might cause deadlocks. So I've had to implement an explicit queue of
> pending actions, and make communication between threads use unbounded
> amounts of buffering.
Right, deployed the latest revision; will let you know if I experience
any more problems with this :)
-Toke
Having WebRTC tests that actually blow up the network (e.g. netem with
20% loss, blackholed ports, mistreated ECN, mistreated DSCP) would help
long term... and not just in Galène's case. Perhaps tests like this
exist somewhere?