Message boards : Number crunching : Process still present 5 min after writing finish file
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5034 - Posted: 12 Jan 2022, 10:21:39 UTC
Last modified: 12 Jan 2022, 10:26:54 UTC

Three times now since December I've had tasks on different Raspberry Pis fail with:
<core_client_version>7.16.16</core_client_version>
<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>
00:20:43 (10500): called boinc_finish(0)

</stderr_txt>
]]>

This is generated after the task has been crunching for a length of time that suggests it completed normally (as does the error message). The oldest of these tasks was completed successfully by two other hosts, and validated (the other two are "in progress" with wingmen).

Most tasks on the hosts in question are completing and being validated.

What's this all about? Is it something I need to fix my end?
ID: 5034
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 0
Message 5037 - Posted: 12 Jan 2022, 16:51:29 UTC - in response to Message 5034.  

No, it is caused by a server delay after a crash.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 5037
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5038 - Posted: 12 Jan 2022, 18:36:38 UTC - in response to Message 5037.  

OK, thank you.
ID: 5038
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5959 - Posted: 29 Nov 2022, 13:18:23 UTC

This is still a problem...
ID: 5959
Grant (SSSF)
Joined: 23 Apr 22
Posts: 156
Credit: 69,772,000
RAC: 76
Message 5965 - Posted: 30 Nov 2022, 4:56:09 UTC - in response to Message 5959.  

This is still a problem...
Most likely a problem with your system: the CPU is too busy to clean things up in time, or the drive is far too slow, or a combination of both.

Here is an example of how someone else resolved the problem with their Raspberry Pi and Universe@home.
Grant
Darwin NT
ID: 5965
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5966 - Posted: 30 Nov 2022, 11:14:39 UTC - in response to Message 5965.  

Interesting, but I'm not convinced, and it doesn't give me a solution anyway. One of the two Pis affected (and the one erroring most frequently) runs BOINC with one of two cores, so reducing the number of running U@H tasks to two isn't an option (and the problem only occurs with U@H). Also, I have a third Pi, a lowly Pi 2, running BOINC on all four cores without any issues. All three have the same file system, and always have. Can the five-minute time-out be increased, to see if that makes the problem go away?
ID: 5966
Keith Myers
Joined: 10 May 20
Posts: 308
Credit: 4,733,484,700
RAC: 1,694
Message 5967 - Posted: 30 Nov 2022, 17:03:29 UTC - in response to Message 5966.  
Last modified: 30 Nov 2022, 17:04:27 UTC

No, the timeout can't be increased, unless you can convince the BOINC devs to increase it again.

The timeout in the original BOINC was only 10 seconds, but enough people complained about the error that DA was persuaded to increase the timeout to 300 seconds, which he felt was long enough.

https://github.com/BOINC/boinc/issues/3017

https://github.com/BOINC/boinc/pull/3019

A proud member of the OFA (Old Farts Association)
ID: 5967
Grant (SSSF)
Joined: 23 Apr 22
Posts: 156
Credit: 69,772,000
RAC: 76
Message 5968 - Posted: 1 Dec 2022, 11:03:59 UTC - in response to Message 5966.  

All three have the same file system, and always have. Can the five-minute time-out be increased, to see if that makes the problem go away?
The problem is the storage I/O bottleneck.
A single Universe task produces 5 result files, where most other projects produce just one, so that results in a lot of disk I/O. If it isn't completed within 5 minutes, giving it even more time really isn't a good idea.

The issue usually only occurred on systems with very high core/thread counts and older HDD storage.
So for it to occur on a system with only a couple of cores indicates an extremely slow storage device. Either a faster storage device (in particular, one with better random I/O performance) or more system RAM allocated to disk write caching (if available) may help.
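If you want to see how much data the kernel will hold in its write cache, sysctl can show (and change) it. A minimal sketch, with illustrative values rather than recommendations:

# Show how much dirty (not-yet-written) data the kernel will buffer,
# as a percentage of RAM, before forcing writeback.
sysctl vm.dirty_background_ratio vm.dirty_ratio
# Let more data accumulate before flushing (add to /etc/sysctl.conf to persist).
sudo sysctl -w vm.dirty_background_ratio=10
sudo sysctl -w vm.dirty_ratio=40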
Grant
Darwin NT
ID: 5968
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5969 - Posted: 1 Dec 2022, 17:07:41 UTC - in response to Message 5968.  

All three Pis use the same spec SanDisk SD card for storage, but only the Pi 3s have a problem (a Pi 4 in the case of that other thread). The error reports state that the "finish file" was written. Does that mean the files U@H uploads, or something else?
ID: 5969
Grant (SSSF)
Joined: 23 Apr 22
Posts: 156
Credit: 69,772,000
RAC: 76
Message 5970 - Posted: 2 Dec 2022, 6:51:42 UTC - in response to Message 5969.  
Last modified: 2 Dec 2022, 7:06:24 UTC

All three Pis use the same spec SanDisk SD card for storage, but only the Pi 3s have a problem (a Pi 4 in the case of that other thread). The error reports state that the "finish file" was written. Does that mean the files U@H uploads, or something else?
Yes, the result files that are produced when the Task is finished get returned to the Universe@home project. But to return them, they have to be saved to disk first.

This is the part that's causing the problem:
Process still present 5 min after writing finish file;
The Universe application has saved the result files to disk; that is, it has asked the operating system to save those files. Once they have actually been saved, the OS reports back to the program that the files have been saved, and then the Task and all its associated processes can be finalised and end normally.

But with the storage bottleneck, even though the files might have been written, the Universe application still hasn't received confirmation of that from the OS after 5 minutes, so the BOINC Manager clobbers it, thinking that there's an issue with that Task.

The problem is all down to the system the work is running on: for whatever reason, there is a significant bottleneck in the file I/O, and that is the cause of the Tasks failing. After 5 minutes, the program still hasn't received confirmation back from the OS that the files have been saved.
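If you want to watch this happen, you can poll the slot directories. A rough sketch, assuming the Debian/Raspbian data directory and that the science app has "universe" in its name (the finish file itself is called boinc_finish_called):

# Report when a finish file appears and whether the app's process is still running.
watch -n 10 'ls -l /var/lib/boinc-client/slots/*/boinc_finish_called 2>/dev/null; pgrep -af universe'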
Grant
Darwin NT
ID: 5970
Keith Myers
Joined: 10 May 20
Posts: 308
Credit: 4,733,484,700
RAC: 1,694
Message 5971 - Posted: 2 Dec 2022, 7:42:49 UTC

I agree that disk I/O on the SanDisk micro SDHC cards is the slow component of the process, especially on a Pi 3, which is not a fast device in the first place.
Burdening the disk I/O of the SoC with the read-writes of the other projects running on the cores, and then saddling it with the 4 output files from the just-completed Universe task, is asking too much of the system.

I would replace the micro SDHC cards with faster U3/V90 or U3/V30 rated cards. I would also extend the minimum checkpoint time from the standard 60 seconds in the Manager to something more reasonable, like every 3 minutes. That will lessen the frequency of disk writes and give the SoC a bit of a breather between I/O operations.
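From memory, that preference can also be set in global_prefs_override.xml in the BOINC data directory; a minimal sketch, assuming the Debian default path (this replaces any existing overrides, and the client needs restarting afterwards):

# Checkpoint at most every 180 seconds.
sudo tee /var/lib/boinc-client/global_prefs_override.xml >/dev/null <<'EOF'
<global_preferences>
   <disk_interval>180</disk_interval>
</global_preferences>
EOF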

A proud member of the OFA (Old Farts Association)
ID: 5971
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5972 - Posted: 2 Dec 2022, 15:28:49 UTC - in response to Message 5971.  

So why does the even slower Pi 2, with more cores running BOINC, have no problems writing in a timely manner to the same type of SD card?
ID: 5972
Keith Myers
Joined: 10 May 20
Posts: 308
Credit: 4,733,484,700
RAC: 1,694
Message 5973 - Posted: 2 Dec 2022, 17:37:40 UTC - in response to Message 5972.  

Are the same projects running on the Pi 2 and the Pi 3? If so, then I'd suspect a damaged micro SDHC card on the Pi 3, or something wrong with its OS that differs from the Pi 2.
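A quick check for a dying card is to look for MMC errors in the kernel log; a sketch, assuming the Pi's usual SD card device name of mmcblk0:

# I/O errors or retries logged against the SD card.
dmesg | grep -i mmc
# Make sure the filesystem hasn't been remounted read-only after errors.
mount | grep mmcblk0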

A proud member of the OFA (Old Farts Association)
ID: 5973
Grant (SSSF)
Joined: 23 Apr 22
Posts: 156
Credit: 69,772,000
RAC: 76
Message 5974 - Posted: 2 Dec 2022, 22:37:19 UTC - in response to Message 5973.  

Are the same projects running on the Pi 2 and the Pi 3? If so, then I'd suspect a damaged micro SDHC card on the Pi 3, or something wrong with its OS that differs from the Pi 2.
Yep. Even if the hardware is the same, the BIOS is configured the same (e.g. the huge difference between AHCI & IDE for SATA drive performance), the OS is the same, the Projects are the same, and the Tasks being run are the same, is each system configured in exactly the same way?
Are the file systems the same? Is write caching enabled on the systems that are having issues? What is the size of the read and write caches on the good & bad systems? Are the running processes on the good & bad systems the same?

You could monitor the disk I/O on the good and bad systems and compare the activity. Run some disk benchmarks to see how the different systems perform.
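For example, something like this (a sketch; iostat comes with the sysstat package, and fio would need installing):

# Per-device I/O stats every 5 seconds while Tasks are finishing.
iostat -dxm 5
# A small random-write test, closer to BOINC's I/O pattern than a sequential dd.
fio --name=randwrite --directory=/tmp --rw=randwrite --bs=4k --size=64M --runtime=60 --time_based --end_fsync=1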
Grant
Darwin NT
ID: 5974
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5975 - Posted: 3 Dec 2022, 14:09:27 UTC - in response to Message 5974.  
Last modified: 3 Dec 2022, 14:10:35 UTC

All three Pis are set up with the same file system and SD card type and size. All three have their software upgraded once a month, on the same day. All three are running U@H, Einstein, and WCG. Just recently I've let the two Pi 3s loose on Asteroids, but the erroring predates that.

I've tested write speed with the following command:
sudo dd if=/dev/zero of=/tmp/output bs=8k count=10k; sudo rm -f /tmp/output

The Pi 2 achieves around 135 MB/s, whilst the two Pi 3s achieve around 270 MB/s. The person in the linked post has a Pi 4. Obviously I know nothing about its file system, but the assumption would be that it's faster still.

These Pis only upload at night, which has allowed me to copy an entire set of U@H results on one of the Pi 3s to another directory on the same SD card. It happened in the blink of an eye, which is hardly surprising as most of the files are very small, with one "large" file weighing in at 216731 bytes. Taking five minutes to write those files would be absurd.

To summarise:

Pi 2: No problems.
Pi 3: Problems.
Pi 3: Problems.
Pi 4: Problems.

I think the time it takes to write the files is not the issue. When I see something that works just fine on a slower system, and then it starts occasionally locking up on some faster systems, my immediate thought is "thread deadlock".
ID: 5975
Grant (SSSF)
Joined: 23 Apr 22
Posts: 156
Credit: 69,772,000
RAC: 76
Message 5976 - Posted: 3 Dec 2022, 21:58:25 UTC - in response to Message 5975.  

I've tested write speed with the following command:
sudo dd if=/dev/zero of=/tmp/output bs=8k count=10k; sudo rm -f /tmp/output

The Pi 2 achieves around 135 MB/s, whilst the two Pi 3s achieve around 270 MB/s. The person in the linked post has a Pi 4. Obviously I know nothing about its file system, but the assumption would be that it's faster still.
What about random I/O with increasing loads?
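Sequential figures like those will mostly be coming from the RAM cache rather than the card. A variant of your test that forces the data out to the card before reporting (assuming GNU dd) would be:

sudo dd if=/dev/zero of=/tmp/output bs=8k count=10k conv=fdatasync; sudo rm -f /tmp/output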


I think the time it takes to write the files is not the issue. When I see something that works just fine on a slower system, and then it starts occasionally locking up on some faster systems, my immediate thought is "thread deadlock".
The issue is that the OS is not completing the writes within 5 minutes and reporting that back to the program.


What it could come down to is that the CPU just isn't up to the load involved in performing that many writes at once: the faster the storage device is, the greater the load on the CPU to keep the data coming. If the write cache were larger, while that would briefly increase the CPU load still further, it should allow the writes to complete, and be reported back to the program as completed, in less than 5 minutes.
And with Universe producing 5 result files for every Task completed, its write load when a Task completes is 5 times higher than for projects that produce only a single result file. And keep in mind that not only are there the result files to write, but the OS is also doing writes to update its file system records, and that would be close to 5 times greater than with other projects as well.
But if the systems with the issue have faster CPUs, then there must be some other difference between the good and problem systems that results in the I/O bottleneck.


In the end it could simply come down to the fact that the system isn't up to the loads produced by Universe when a Task completes, if it is also doing other (relatively) significant disk I/O at the same time.
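One way to check whether writes really are backing up when a Task finishes (a sketch):

# Watch how much dirty data is queued and how much is being written back.
watch -n 2 "grep -E '^(Dirty|Writeback):' /proc/meminfo"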
Grant
Darwin NT
ID: 5976
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 3
Message 5977 - Posted: 5 Dec 2022, 9:30:34 UTC - in response to Message 5976.  

I'm sorry, but I simply don't buy the argument that a system with two processor cores idle, which can copy the full set of U@H result files in the blink of an eye, has disk I/O so overloaded that it causes massive delays saving half a dozen small files. If things were that bad, I would expect major problems with other tasks, such as logging in and doing the monthly software update (just consider the disk I/O load during that task). But the reality is that everything is working normally except U@H, and all three Pis are responsive to logging in and to the command line. However, even the fastest machine on the planet will get stuck for five minutes writing five small files if the threads involved deadlock, and only the (five-minute) thread wait timeout breaks that deadlock (probably long after the files were actually written).
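Next time one gets stuck, the process state should distinguish the two; a sketch, where "universe" as the process name is an assumption:

# STAT 'D' means uninterruptible I/O wait (storage); 'S' with an idle wait channel points elsewhere, e.g. a lock.
ps -eo pid,stat,wchan:30,etime,cmd | grep -i universe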
ID: 5977
