When only doing cubical transfer cal from the selfcal worker, dist_max_chunks is always set to 0 #1198
Comments
I know this does not help here, but IMHO any computation should be automatically optimized via at most two parameters per node, which are cores and memory.
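As a rough illustration of that idea (the function and the per-chunk memory estimate below are hypothetical, not an existing CARACal or CubiCal interface), the per-task parallelism settings discussed in this thread could in principle be derived from just those two numbers:

```python
def derive_parallelism(node_cores: int, node_mem_gib: float, chunk_mem_gib: float):
    """Toy sketch: derive how many chunks to solve concurrently from node
    cores and memory alone. chunk_mem_gib is an (assumed) estimate of the
    memory needed to hold one chunk of data + model + flags/weights."""
    by_memory = max(1, int(node_mem_gib // chunk_mem_gib))
    by_cores = max(1, node_cores - 1)      # keep one core free for the I/O process
    max_chunks = min(by_memory, by_cores)  # cap on concurrently processed chunks
    ncpu = max_chunks + 1                  # solver workers + one I/O process
    return ncpu, max_chunks

# Example: an 8-core, 64 GiB node with ~5 GiB chunks
print(derive_parallelism(8, 64, 5))        # -> (8, 7)
```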
I fully agree @gigjozsa. In fact, I was going to open a separate issue about that, but since we're here -- I'm not going to question which parameters should be available in cubical, of course, but as far as CARACal is concerned there's some confusion about the cross-talk between these parameters.
@paoloserra But I thought […] @molnard89 Your issue is a Cubical- or Stimela-related one though, because in principle 0 should mean all available memory. Have you set your shared_mem?
@PeterKamphuis (but also @o-smirnov for guidance!) My understanding is that: […]

Indeed, a run with […] confirms this. That's why I'm saying that there is some (to me confusing) cross-talk and possible redundancy between these parameters. There may be good reasons to have both parameters in Cubical, but we should discuss how to handle them in CARACal in a way that is understandable to a user.
I think we should aim for having these parameters set globally, not per task within a worker (and maybe not even per worker).
@PeterKamphuis I set shared_mem to 300 GB, and it complained about not being able to allocate 273 GB. I attached the log.
@molnard89 I am not too surprised that with 300 GB available it fails on a 273 GB allocation; at the very least, certain buffers should be taken into account when choosing these numbers. I wonder if this is an issue of Cubical not reading the Stimela limit, as this line seems to imply that Cubical thinks there is ~500 GB available: 'Total peak memory usage estimated at 25.32GiB: 5.74% of total system memory.' But I am no expert and someone else has to confirm this/look at this. Can you confirm the system actually has ~500 GB?
Of course I also have no idea why Cubical claims peak memory usage will be 25 GB and then assigns a 273 GB array in the next step.
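If the quoted log line is taken at face value, a quick check of the percentage (my arithmetic, not CubiCal's) is consistent with a ~500 GB node:

```python
# "Total peak memory usage estimated at 25.32GiB: 5.74% of total system memory."
estimated_peak_gib = 25.32
fraction_of_total = 0.0574
total_gib = estimated_peak_gib / fraction_of_total
print(f"{total_gib:.0f} GiB ~ {total_gib * 2**30 / 1e9:.0f} GB of system RAM")
# -> ~441 GiB ~ ~474 GB, consistent with the suspicion above that CubiCal is
#    sizing against the whole node rather than the Stimela/shared_mem limit.
```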
Some […]. Yes, it's true that the cause of the problem is the max-chunks 0, ncpu 56 setting. It tries to read in 55 chunks at once, which is an unreasonably large amount of data (1411575 rows, 13000 channels, 2 corrs works out to precisely a megashitload). But yes, the memory use seems way underestimated in this case; maybe @JSKenyon can have a look at the numbers. Note also that you have PA rotation/derotation enabled. I'll take a look at the implementation, but I suspect that in itself may double memory use in the I/O thread...
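For a sense of scale, a back-of-the-envelope check (assuming complex64 visibilities, i.e. 8 bytes each, and counting the data column only) reproduces the 273 GB allocation that failed above:

```python
# 55 chunks read at once: 1411575 rows x 13000 channels x 2 correlations
rows, chans, corrs = 1_411_575, 13_000, 2
bytes_per_vis = 8                         # complex64 (assumption)
total = rows * chans * corrs * bytes_per_vis
print(f"{total / 2**30:.1f} GiB")         # ~273.4 GiB -- the array that could not be allocated
```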
I've also filed this: ratt-ru/CubiCal#389. Had we had that (error message) implemented, this would have forced the disable-rotation option into Stimela a long time ago.
OK, this is a combination of different problems here: […]

I have fixed the first two problems in ratt-ru/CubiCal#393, and this now runs in a seemingly more reasonable <~300 GB peak use. So hopefully it will run through fine on @molnard89's fat box now. Test image pushed, please set […] to test.
Also, looking at how this transfer job runs on my test node, […]. On my system, […]. We can also use smaller chunks: something like […]. I'm not sure how to capture it all into sensible defaults just yet. But for sure, when transferring gains onto high-freq-res data, smaller time chunks and fewer workers should be the go-to heuristic.
Before you test again, let me fix this first: ratt-ru/CubiCal#396. Looks like the logging is slowing it down artificially.
OK, fixed. Use the […] test image.
OK, with the test image and […].
With apologies for the poor formatting, here are three plots showing CPU usage vs time as reported by […]. In all cases you should look at the red line. The y-axis range is 0-8 (the maximum for this machine), and the red horizontal lines correspond to 1-CPU steps. The top labels mark the start of a container, so only look at the […] containers. (The blue line is RAM, from 0 to 100% of the system RAM.)

My main conclusion is that, on this machine, reducing […]. That's not what I thought we said this morning, i.e., that cubical would use the number of CPUs set by […]. Let me know if there are any errors or missing info.

[Three CPU/RAM-vs-time plots, one per setting:]
- ncpu = 8, max-chunks = 8
- ncpu = 4, max-chunks = 4
- ncpu = 8, max-chunks = 2
Well, but the first two are exactly what you expect, right? And then in the last one it is not CPU = max-chunks. It is very hard to see from the plot, but is it possible that the CPU usage spikes are so quick that they average to 3 CPUs over a second, so all 8 fire for ~0.3 s? This line though: […] for that last test is completely out of line with what I understood from the discussion this morning.
Ah ok, this is a previous release of CubiCal. So that's a bit confusing, because I was thinking about the latest release this morning, which is slightly different from what this one does. What's happening here (previous release) is that it spins up as many workers as specified by […]. @SpheMakh needs to roll a new release, as this logic is a little more... logical in the new release. @paoloserra could you please also report wall time in your benchmarking? It doesn't matter if more or fewer CPUs are used if the process takes a similar time to finish, right?
10, 12 and 15 min, respectively.

Sure, it might not matter much if the wall times are similar, but the handling of resources needs to be transparent and understandable for a user.
@PeterKamphuis sure, that's what we wanted to test, right?
@paoloserra I meant in the output […]. It doesn't look like the CPU usage is limited by max-chunks, as it peaks above 2. I thought that this morning you said that it seemed like max-chunks was limiting the CPU usage. This looks more like it is limiting cubical in such a way that it never really needs/uses all the CPUs, in line with what @o-smirnov said above. This could be different for larger chunks in the new Cubical version?
Got it. My previous claim was based on monitoring the CPU usage "live". The plots I'm now making are better and, as you've noticed, clarify that […]. Not sure what the strategy is here, folks. So we wait for @SpheMakh to roll a new release of Stimela with the latest Cubical and test again?
Hi all! For what it is worth, I opened an issue about the new CubiCal release two weeks ago. @Athanaseus do you have an ETA on it?
If I understood @o-smirnov's explanation yesterday, my interpretation is a bit different. There is usually a dedicated CPU for I/O, which puts a bottleneck in the process, while calculating gains is instantaneous in comparison; so even if there are more than max-chunks = 2 CPUs at the disposal of the process, it only ever uses 3 (1 for I/O plus two to calculate the gains, which are then queued for writing), and the rest is idling. If this is true, ncpu = 3, max-chunks = 2 should be identical in terms of runtime, and in general ncpu should be at most max-chunks + 1 (in this particular case -- for other computations each worker assigned to a chunk could be multithreaded, and therefore use more than one core to work on its data chunk?).
So basically I agree with this, with the difference that ncpu can be larger than max-chunks if the workers/processes responsible for each chunk can multithread (and therefore use more than one CPU), which doesn't seem to be the case for cubical (or is not needed because it's such a quick calculation).
@molnard89 If I followed everything correctly, your explanation is indeed the one that applies to the current Cubical version we are using. But after the upgrade to the new version/release we should have 2 workers that are multithreaded + 1 I/O worker, right? Which then uses 8 CPUs. If I understand top correctly, the CPU usage is an average over 1 sec in the way @paoloserra runs it. Did I get all of that correctly? I have been out of caracal for a bit, so I'm playing catch-up here.
I agree with @molnard89 (for the Cubical version currently used in CARACal). At most max-chunks + 1 CPUs are used. So, for now, I think it would make sense to have a default setting […]. When the Cubical version at ratt-ru/Stimela-classic#661 becomes available we can test whether this behaviour has changed. Let me know whether you're happy with this.
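A toy model of the behaviour as I read this thread (a sketch only, not CubiCal's actual implementation): one process handles I/O while at most max-chunks single-threaded workers solve concurrently, so CPU usage saturates around max-chunks + 1 no matter how large ncpu is.

```python
import multiprocessing as mp

def solve_chunk(chunk_id: int) -> int:
    # Stand-in for the per-chunk gain solution, assumed single-threaded here.
    return chunk_id

def run(ncpu: int, max_chunks: int, n_chunks: int = 16) -> None:
    # The parent process plays the I/O role: it hands out chunks and collects
    # results. At most `max_chunks` chunks are in flight at any time, so the
    # number of busy CPUs stays <= max_chunks + 1 even when ncpu is larger.
    workers = min(max_chunks, max(1, ncpu - 1))
    with mp.Pool(processes=workers) as pool:
        for _ in pool.imap_unordered(solve_chunk, range(n_chunks)):
            pass  # "write back" the solved chunk (the I/O side of the loop)

if __name__ == "__main__":
    run(ncpu=8, max_chunks=2)  # in this picture only ~3 CPUs can ever be busy
```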
Original issue

At the end of a selfcal loop my pipeline run crashed when I tried to transfer the model to a higher-frequency-resolution dataset due to memory issues (cubical tried to grab more memory than available using the default `dist_max_chunks` parameter). So I set out to re-run the worker only doing the transfer-model step with cubical. However, it kept crashing, and it turned out that in this scenario `dist_max_chunks` is always set to 0 and there's no way to modify it. Following @KshitijT's advice, I hardcoded `dist_max_chunks = 4` in my selfcal worker, which temporarily fixed the problem for me. So a possible fix could be to allow the user to set this value in the config file.

Related to the above, just a comment: it's a bit counter-intuitive to me that settings under `cal_cubical` such as `cal_cubical` are linked to the transfer-model step even if `cal_cubical` is turned off.
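A sketch of what the suggested fix might look like (all names and the default value here are hypothetical, not the actual CARACal schema or worker code): read dist_max_chunks from the user config for the transfer-only path instead of pinning it to 0.

```python
# Hypothetical excerpt from the selfcal worker: forward a user-configurable
# dist_max_chunks to the CubiCal call instead of hardcoding 0 ("no limit").
def cubical_transfer_options(config: dict) -> dict:
    cubical_cfg = config.get("cal_cubical", {})          # assumed config section
    max_chunks = cubical_cfg.get("dist_max_chunks", 4)   # assumed conservative default
    return {
        "dist-max-chunks": max_chunks,  # cap on chunks processed concurrently
        # ... the rest of the transfer-model options would go here ...
    }
```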