extreme long wu's
Author | Message |
---|---|
Send message Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
I posted about this in December also. It had been running fine until then. Something must have changed to create the issue. What changed? |
Send message Joined: 23 Mar 16 Posts: 95 Credit: 23,431,842 RAC: 0 |
@Jacob. So you are running a system that is running a preview version of Windows, and which reboots every 1.5 days (apparently highlighting a failing in the Boinc watchdog), and adrianxw says "me too" to that. Have you heard the expression "bleeding edge"? Is it any wonder the problem is rare and that krzyszp can't reproduce it? My Pi also runs 365/7/24, running Boinc when it would otherwise be idling, but it's running a standard release of Raspbian, and ordinarily it does not reboot. It may not be bleeding edge, but that's perhaps why it works just fine. BTW, ATLAS@Home has had a problem with sporadic long-running tasks for far longer than Universe@Home. Regarding your work on Work Fetch, would you be the person to complain to about the time I spent trying to get my Windows PC fetching sensible amounts of work for each project? LHC@Home and WCG routinely grabbed more work than they could possibly complete by the deadline, and did so aggressively, whilst ATLAS would grab one task now and again, but mostly it just kept reporting back that the work queue was full. The problem was exacerbated by Boinc choosing to run the most recently downloaded task flat out whilst ignoring a task that was hard up against the deadline (and which would subsequently time out). It took much fiddling with project settings and XML files to make the system work in a reasonable manner. Boinc is great, but it's very far from perfect. |
Send message Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
@Brummig 1: I'm glad you're not having the nasty problem in this thread. Please don't dump on others that do have the problem and are trying to fix it for everyone else. 2: Regarding work fetch, sounds like you're looking for blood. Good thing I'm kind :) I recommend turning on the <work_fetch_debug> option, using either Options > Event Log Options, or editing cc_config.xml ... then, when odd behavior happens, inspect the work fetch output to make sure it behaved correctly. If something looks odd or doesn't make sense, feel free to send me a private message, and I'll try to help you, if you remain patient. Same team, bro. |
Send message Joined: 23 Mar 16 Posts: 95 Credit: 23,431,842 RAC: 0 |
Jacob, I'm not dumping on those that have the problem. I simply asked those with the problem not to dump on the silent, problem-free majority by requesting that the project be suspended until the issue is fixed (which may be some time, of course). Thanks for offering to look into the fetch problem, but for the moment I would very much prefer to leave things alone (since my fixes are currently working). Moreover, I suspect I wouldn't be able to recreate the problem, because all the CERN projects have been consolidated into LHC@Home (leaving two bruisers, LHC and WCG, to slog it out with no collateral damage). The last to be consolidated was ATLAS, and the last ATLAS@Home task was sent out a week or so ago. And no, I'm not looking for blood, but you did choose to mention your contribution was the Work Fetch algorithm that (presumably) gave me so much grief, when all I asked was why you had to reboot your machine so often. |
Send message Joined: 4 Feb 15 Posts: 17 Credit: 158,222,691 RAC: 0 |
Since the beginning of March I have computed over 19k tasks, and fewer than 3% have had this issue. So for me it is not a big problem. Of course it would be good to have this fixed, but I remember times when bigger projects, with better financing and a reputation for being stable, had a much bigger failure ratio. |
Send message Joined: 4 Feb 15 Posts: 846 Credit: 144,180,465 RAC: 0 |
Let me explain something. The "Failed computing" rates are: windows_x86_64 - 0.6751% 0.01; windows_intelx86 - 6.1462%; x86_64-pc-linux-gnu - 0.3296%. The number of "197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED" errors is 29 at the moment, out of about 45,000 WUs. All of them were finished successfully by a wingman, e.g. http://universeathome.pl/universe/workunit.php?wuid=9391200 Can you point me in a direction to find out why Windows 32-bit has a substantially higher failure rate than Win64, and why both are less reliable than Linux? Also, it is seriously difficult to find a bug when other machines finish their tasks correctly... Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Send message Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
Did you change something around Christmas time? It was okay before Christmas, and problematic afterwards. I don't just mean the program, changes may be in the data etc. What compiler do you use? |
Send message Joined: 4 Feb 15 Posts: 846 Credit: 144,180,465 RAC: 0 |
"Did you change something around Christmas time? It was okay before Christmas, and problematic afterwards. I don't just mean the program, changes may be in the data etc. What compiler do you use?" No, the same application has been here since the end of July last year. Only parameters in the input files were changed, but those are very small changes (some parameters slowly increased over time)... I use g++ on Linux and Visual Studio 2010 for Windows. Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Send message Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
I am assuming that the problem started about the time of the thread, certainly, I had not seen the fault before I posted in here. But that is an assumption. Are we SURE the problem started about then? Do you have error rates for periods earlier? What compiler are you using for Windows? Trying to narrow some gaps, throw out certain ideas etc. |
Send message Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
Sorry, can't edit, I can see you added the compilers later, was not there when I wrote the above. |
Send message Joined: 23 Mar 16 Posts: 95 Credit: 23,431,842 RAC: 0 |
I was reading something on CPDN earlier that said that they see very small differences between the results returned by a task on one operating system and the results returned by the same task on a different operating system. Now if, say, an intermediate result were compared with something, then that very small difference between operating systems could mean that a task that runs just fine on one OS could take a different path and go completely awry on another. It could also be that the small changes made to the input file were sufficient to trigger that chaotic behaviour. That may or may not help in tracking down the problem, but it does at least provide a possible explanation for the mystery. |
Send message Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
Is there any way to send me the inputs to the tasks that I aborted, so I may try them again locally? |
Send message Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
After a week of setting this project with maximum resource share, in order to repro the problem, and monitoring the results closely, I have so far been unsuccessful in repro'ing it. I will now set it for normal resource share, and try to keep a lookout for the problem. Again, if anybody has the input files that recreate the issue, I'd LOVE to test with them. I wish I would have saved mine from the problematic tasks earlier, before I aborted them. |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
Jacob, I would try the opposite. Get a couple of other programs to run alongside Universe and see if you can induce the problem. That would explain why krzyszp can not reproduce it, since he probably runs it in isolation. I was running Einstein on 3 cores and Universe on 4 cores (with a GTX 1060 on Folding supported by the other core) when I would get the problem every few weeks. I am trying Universe by itself now, but will need to let it go for at least a month without a long runner before it means anything. Is it possible that BOINC somehow perturbs Universe due to the presence of another program? I don't know. |
Send message Joined: 21 Feb 15 Posts: 53 Credit: 1,385,888 RAC: 0 |
Don't get me wrong, I was running it alongside other programs (I have other high priority RNA World VM tasks, GPUGrid GPU tasks, WuProp non-cpu-intensive tasks, etc.) So, yeah, now it will run amidst more since I set the resource share to be equal to my other 10 projects... And I will continue to try to repro the issue. Thanks. |
Send message Joined: 4 Feb 15 Posts: 846 Credit: 144,180,465 RAC: 0 |
I have just released a new version of the BHspin application; we will see whether the problem continues in less than 24 hours. (The new app version is for Windows and Linux only at the moment.) Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Send message Joined: 4 Feb 15 Posts: 846 Credit: 144,180,465 RAC: 0 |
I had to roll back the new version due to a bug in it :( Krzysztof 'krzyszp' Piszczek Member of Radioactive@Home team My Patreon profile Universe@Home on YT |
Send message Joined: 28 Feb 15 Posts: 253 Credit: 200,562,581 RAC: 0 |
"I had to roll back the new version due to a bug in it :(" I deleted my first batch of 0.02 BHspin v2, but have just received more at 02:30 UTC on 28 March. Are they safe to run, or should I delete them also? |
Send message Joined: 1 Oct 16 Posts: 32 Credit: 268,033 RAC: 0 |
All gone quiet... |
Send message Joined: 5 Feb 17 Posts: 6 Credit: 2,135,900 RAC: 0 |
Are you saying that only about 10% of computers are experiencing issues? That is strange, because there was a 70% performance drop during the last month: http://universeathome.pl/universe/history.php |
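
For anyone following Jacob's suggestion earlier in the thread to turn on `<work_fetch_debug>` by editing cc_config.xml, here is a minimal sketch that builds the file contents with Python's standard library. The `<cc_config>` / `<log_flags>` nesting is the usual layout for BOINC log flags; the function name `build_cc_config` is made up for illustration. The file itself belongs in the BOINC data directory, whose location varies by platform, so no path is hard-coded here. If a cc_config.xml already exists, merge the flag in by hand rather than overwriting the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(work_fetch_debug: bool = True) -> str:
    """Return cc_config.xml text with the work_fetch_debug log flag set.

    build_cc_config is a hypothetical helper name, not part of BOINC.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "work_fetch_debug")
    # BOINC log flags are written as 1 (enabled) or 0 (disabled).
    flag.text = "1" if work_fetch_debug else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML so it can be reviewed before saving it anywhere.
    print(build_cc_config())
```

After saving the file, the client has to re-read its config files (or be restarted) before the extra work fetch lines show up in the Event Log.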
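
Brummig's point above, that a very small numerical difference can send a task down a different path, can be shown with a toy example. This is only an illustrative sketch and has nothing to do with the actual BHspin code: the values, the threshold, and the function names are all invented. It adds the same three numbers in two different orders, which is the kind of reordering different compilers or platforms can introduce, and then compares each total with a threshold.

```python
def sum_forward(values):
    """Add the values left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_backward(values):
    """Add the same values right to left."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


def takes_long_path(total, threshold=0.6):
    """Stand-in for a branch that depends on an intermediate result."""
    return total > threshold


values = [0.1, 0.2, 0.3]
a = sum_forward(values)   # (0.1 + 0.2) + 0.3
b = sum_backward(values)  # (0.3 + 0.2) + 0.1

if __name__ == "__main__":
    print(a, takes_long_path(a))
    print(b, takes_long_path(b))
```

The two totals differ only in the last bit, yet one lands above the threshold and the other does not, so the two runs take different branches. Whether anything like this is behind the long-running tasks here is pure speculation.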