Message boards : Number crunching : "Output file . . . absent"
Joined: 22 Mar 16 | Posts: 13 | Credit: 1,113,528,333 | RAC: 0
Got about 130 tasks with compute errors after 1 or 2 seconds, and the event log printed something like this:

Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Computation for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 finished
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_0 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_1 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_2 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_4 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_5 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Starting task universe_bh2_190723_398_6021763979_20000_1-999999_770100_0

It happened only on one machine, though, and not on another (completely identical) one.

______________
"less than a pixel"
Joined: 10 May 20 | Posts: 310 | Credit: 4,733,484,700 | RAC: 0
You have a slow storage system, or an anti-virus program running that is locking the slots directories so the client can't access or read the output file.

A proud member of the OFA (Old Farts Association)
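One quick way to test the slot-directory theory is to watch a slot while a task runs and see whether the result files ever appear. A minimal sketch, assuming the default Debian/Ubuntu data directory /var/lib/boinc-client and slot 0 (both vary by installation):

    # List the slot contents once per second while the task is running;
    # the science app's output files should show up here before upload.
    watch -n 1 'ls -l /var/lib/boinc-client/slots/0/'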
Joined: 22 Mar 16 | Posts: 13 | Credit: 1,113,528,333 | RAC: 0
> You have a slow storage system, or an anti-virus program running that is locking the slots directories so the client can't access or read the output file.

Thanks for your input. I'm on 22.04 LTS, have no AV (or would you recommend one?), and the storage is a recent NVMe, which I checked anyway but is OK.

Could it be the server? It obviously couldn't handle yesterday's uploads, and since the pride show effectively starts before its official start, when the needy bunker tasks, the downloads were getting more and more difficult around that time. Buggy downloads? . . . just a theory.

Anyway, it hasn't happened since then, so for now it is just a peculiar curiosity. Thanks again for your thoughts.

BTW: what are the requirements for joining the OFA?

______________
"less than a pixel"
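For reference, a hedged sketch of the kind of drive check mentioned above, assuming smartmontools and nvme-cli are installed and the drive is /dev/nvme0 (device names vary):

    # SMART overview; watch for "Media and Data Integrity Errors"
    # and "Percentage Used" in the output.
    sudo smartctl -a /dev/nvme0
    # NVMe-native health log (temperature, error counts, wear).
    sudo nvme smart-log /dev/nvme0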
Joined: 23 Apr 22 | Posts: 167 | Credit: 69,772,000 | RAC: 0
> Buggy downloads? . . . just a theory.

Unlikely. The error means that there were no result output files on your system to return to the project. The usual causes are an AV programme taking offence and deleting the files, or system I/O so heavy that the science application writes the files but they never actually reach the disk (or at the very least they are written to the disk, but the changes don't make it to the computer's file system, so as far as it's concerned they don't exist).

But why would it occur after a few seconds? That should result in a computation error, and even that should produce some sort of result files. So I guess corrupted downloads are a possibility, but the resulting errors are very odd if that were the case.

Grant
Darwin NT
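To illustrate the "written but not yet on disk" gap described above: on Linux, freshly written data can sit in the page cache before the kernel flushes it to the drive. A hedged sketch:

    # Amount of written data still waiting in the page cache.
    grep -E 'Dirty|Writeback' /proc/meminfo
    # Force everything out to disk, then check again.
    sync
    grep -E 'Dirty|Writeback' /proc/meminfo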
Joined: 3 May 18 | Posts: 2 | Credit: 30,882,667 | RAC: 0
The task output might be more revealing than the BOINC client log. |
Joined: 10 May 20 | Posts: 310 | Credit: 4,733,484,700 | RAC: 0
> The task output might be more revealing than the BOINC client log.

But the task output (the result file(s)) is what is missing. No help there. If you are thinking of the stderr.txt output, that is what he grabbed the errors from. No help there either.

A proud member of the OFA (Old Farts Association)
Joined: 10 May 20 | Posts: 310 | Credit: 4,733,484,700 | RAC: 0
Since you are on Ubuntu 22.04, I assume you are running the 7.18.1 BOINC client. In the latest releases there were changes in what BOINC is allowed to access, because of permissions on the /tmp directory set by the systemd service file. Look at this issue:

https://github.com/BOINC/boinc/issues/3355

It affects some projects because the application writes to /tmp, where the user does not have access permission. Some projects have had to rewrite their applications to work around the issue, or have the user edit the BOINC systemd service file and change these parameters:

PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true

GPUGrid is one of the affected projects I know about, since I crunch there. I have not experienced this issue at Universe here, though. So I still think you had a slow storage system where the writing of the output file was delayed and never made it to disk, or the storage system was bogged down and couldn't respond in time for the BOINC process to read the file and push it upstream home to the project scheduler.

A proud member of the OFA (Old Farts Association)
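A hedged sketch of that workaround, assuming the service is named boinc-client.service as on Debian/Ubuntu (the unit name and the exact directive values depend on the distribution and the project's needs):

    # Create a drop-in override for the BOINC service rather than
    # editing the packaged unit file directly.
    sudo systemctl edit boinc-client.service
    # In the editor, add for example:
    #   [Service]
    #   PrivateTmp=false
    # Then restart the service so the override takes effect.
    sudo systemctl restart boinc-client.service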
Joined: 22 Mar 16 | Posts: 13 | Credit: 1,113,528,333 | RAC: 0
Thanks for all the input. First of all, it hasn't reproduced so far, so it's not urgent, more a curiosity.

> But why would it occur after a few seconds? That should result in a computation error, and even that should produce some sort of result files.

Yeah. I have errors that clearly identify as "Error while downloading", and those don't start to compute. So that's OK then. But why did the others even start to compute?

As for the client: yes, it's 7.18.1. I have 4 machines with the very same OS/update/upgrade/client etc., but it happened on only one. I know 4 is a small sample, but that has me tilted back to the hardware side. Although it's a recent NVMe SSD and I checked/benchmarked it, I could still be the "lucky" guy with a faulty brand-new NVMe SSD. The perpetrator has a complete hardware twin that managed to behave, so I will take both off the project and do tests/benchmarks and see if there is much difference.

(It's a good time for that now, since the pride show obviously overloads the project server; when all egos are satisfied I might be done with my tests and we can go back to regular boincing.)

______________
"less than a pixel"
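A hedged example of the kind of benchmark meant here, assuming fio is installed; the workload parameters are illustrative only:

    # 60-second 4 KiB random-write test; run it on both twins and
    # compare throughput and latency between them.
    fio --name=randwrite --ioengine=libaio --rw=randwrite --bs=4k \
        --size=1G --runtime=60 --time_based --group_reporting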
Joined: 22 Mar 16 | Posts: 13 | Credit: 1,113,528,333 | RAC: 0
. . . almost forgot about this: yes! I could reproduce it on another machine and with another project (though with a different app and different log lingo, yet the same 1-2 second thing). So it's not hardware- or project-specific. It happens sometimes after updates to the runtime environment. A restart of the client is not enough; you have to restart the computer, and then it's gone.

______________
"less than a pixel"