
When only doing cubical transfer cal from the selfcal worker dist_max_chunks is always set to 0 #1198

Closed
molnard89 opened this issue Jul 6, 2020 · 25 comments

@molnard89
Collaborator

At the end of a selfcal loop my pipeline run crashed when I tried to transfer the model to a higher-frequency-resolution dataset, due to memory issues (CubiCal tried to grab more memory than available with the default dist_max_chunks parameter). So I set out to re-run the worker doing only the transfer-model step with CubiCal. However, it kept crashing, and it turned out that in this scenario dist_max_chunks is always set to 0 and there is no way to modify it. Following @KshitijT's advice, I hardcoded dist_max_chunks = 4 in my selfcal worker, which temporarily fixed the problem for me. A possible fix could be to allow the user to set this value in the config file.

Related to the above, just a comment: it's a bit counter-intuitive to me that settings under cal_cubical are linked to the transfer_model step even if the calibrate step itself is turned off.

@gigjozsa
Collaborator

gigjozsa commented Jul 7, 2020

I know this does not help here, but IMHO any computation should be automatically optimized via at most two parameters per node, which are cores and memory.

@paoloserra
Collaborator

paoloserra commented Jul 7, 2020

I fully agree, @gigjozsa. In fact, I was going to open a separate issue about that, but since we're here -- I'm of course not going to question which parameters should be available in CubiCal, but as far as CARACal is concerned there's some confusion about the cross-talk between ncpu and dist_max_chunks. I think that dist_max_chunks should not be a CARACal parameter and should instead be set based on ncpu. Let's discuss here.

@PeterKamphuis
Collaborator

@paoloserra But I thought dist-max-chunks relates to memory, not CPU; the two are not independent, but they are not the same thing either. In principle I agree, though, that memory and CPU maximums should be set automatically. Clearly that is not easy, since dist-max-chunks=0 means "use all available memory", which is exactly what goes wrong in this case.

@molnard89 Your issue is a CubiCal- or Stimela-related one, though, because in principle 0 should mean all available memory. Have you set your cal_cubical: shared_mem parameter? I thought that should limit Stimela's, and hence CubiCal's, memory usage. As for these parameters only being available under cal_cubical and not under calibrate and transfer_model: it makes sense to me that the maximum CPU and memory usage of CubiCal is always the same regardless of the task it is doing. Of course, for dist-max-chunks this is complicated, as the chunk size changes between the different steps.

@paoloserra
Collaborator

@PeterKamphuis (but also @o-smirnov for guidance!) My understanding is that:

  • the time chunk size sets how much data is shipped to a CPU (and that must be a multiple of the time solution interval)
  • max-chunks sets how many of the above chunks are loaded into memory at any given time; since each of these chunks is sent to a CPU, max-chunks effectively sets the maximum number of CPUs you use.

Indeed, a run with ncpu=0 (i.e., use all CPUs) and dist_max_chunks=4 uses 4 CPUs, not all of them.

That's why I'm saying that there is some (to me confusing) cross-talk and possible redundancy between these parameters. There may be good reasons to have both parameters in Cubical, but we should discuss how to handle them in CARACal in a way that is understandable to a user.
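
To make that concrete, here is a toy model of how I picture the interplay (my own simplification, not CubiCal's actual scheduler code):

def chunks_solved_in_parallel(ncpu, max_chunks):
    # one process is reserved for I/O, the rest are solver workers;
    # only max_chunks chunks are ever resident in memory, and each
    # resident chunk occupies exactly one solver worker
    solver_workers = max(1, ncpu - 1)
    return min(solver_workers, max_chunks)

print(chunks_solved_in_parallel(ncpu=8, max_chunks=4))  # -> 4
print(chunks_solved_in_parallel(ncpu=8, max_chunks=2))  # -> 2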

@paoloserra
Collaborator

it makes sense to me that the maximum CPU and memory usage of CubiCal is always the same regardless of the task it is doing. Of course, for dist-max-chunks this is complicated, as the chunk size changes between the different steps.

I think we should aim for having these parameters set globally, not per task within a worker (and maybe not even per worker).

@molnard89
Collaborator Author

@PeterKamphuis I set shared_mem to 300 GB, and it complained about not being able to allocate 273 GB. I've attached the log.

log-caracal_20200706.txt

@PeterKamphuis
Collaborator

@molnard89 I am not too surprised that, with 300 GB available, it fails on a 273 GB allocation; I think certain buffers should be taken into account when setting these numbers. I wonder if this is an issue of CubiCal not respecting the Stimela limit, as this line seems to imply that CubiCal thinks there is ~500 GB available:

'Total peak memory usage estimated at 25.32GiB: 5.74% of total system memory.'

But I am no expert, so someone else will have to confirm/look at this. Can you confirm the system actually has ~500 GB?

@PeterKamphuis
Collaborator

Of course, I also have no idea why CubiCal claims peak memory usage will be 25 GB and then allocates a 273 GB array in the next step.

@o-smirnov
Member

Of course, I also have no idea why CubiCal claims peak memory usage will be 25 GB and then allocates a 273 GB array in the next step.

Some cosmologists would call that a pretty decent estimate!

Yes, it's true that the cause of the problem is the max-chunks 0, ncpu 56 setting. It tries to read in 55 chunks at once, which is an unreasonably large amount of data (1411575 rows, 13000 channels, 2 corrs works out to precisely a megashitload).
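
For scale, a quick back-of-the-envelope (my own arithmetic, assuming a single complex64 visibility array at 8 bytes per value):

rows, channels, corrs = 1_411_575, 13_000, 2
gib = rows * channels * corrs * 8 / 2**30   # bytes -> GiB
print(f"{gib:.0f} GiB")                     # ~273 GiB, i.e. the allocation that failed above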

But yes, the memory use seems way underestimated in this case, maybe @JSKenyon can have a look at the numbers.

Note also that you have PA rotation/derotation enabled. I'll take a look at the implementation, but I suspect that in itself may double memory use in the I/O thread....

@o-smirnov
Member

I've also filed this: ratt-ru/CubiCal#389. Had that error message been implemented, it would have forced the disable-rotation option into Stimela a long time ago.

@o-smirnov
Member

OK, there is a combination of different problems here:

  • Running 56 workers for only 4 chunks (see do not use more worker processes than there are chunks per tile, ratt-ru/CubiCal#392). One would think this would be harmless, but because of the way the scheduler works, workers 0-3 get the first tile, workers 4-7 get the second tile, and so on. So while only 4 worker processes are active at any given time, the others sit there idle but hog some residual RAM (which normally gets recycled as soon as they're given more work to do, but in this case they're not, so it isn't). A toy sketch of this assignment pattern follows the list.

  • PA (de)rotation uses extra RAM.

  • RAM usage estimates are way under in this scenario (see memory use grossly underestimated in apply-only mode, ratt-ru/CubiCal#388). 100k rows by 13k channels by 2 corrs is a shitload, so it should be using a shitload of RAM.
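
To illustrate the first point, here is a toy sketch of the round-robin tile assignment (my simplification for illustration, not the actual CubiCal scheduler code):

n_workers, chunks_per_tile = 56, 4     # the settings in this run
for tile in range(5):                  # first few tiles, just for illustration
    start = tile * chunks_per_tile
    busy = list(range(start, start + chunks_per_tile))
    idle = n_workers - chunks_per_tile
    print(f"tile {tile}: workers {busy} busy, {idle} idle but still holding residual RAM")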

I have fixed the first two problems in ratt-ru/CubiCal#393, and this now runs at a seemingly more reasonable <~300 GB peak use. So hopefully it will run through fine on @molnard89's fat box now.

Test image pushed, please set

general:
  cabs:
    - name: cubical
      tag: 1.5.5-oms

to test.

@o-smirnov
Member

Also, looking at how this transfer job runs on my test node, dist-max-chunks 4 is way too many. It seems completely I/O-bound: interpolating solutions takes very little CPU time, and the rest of the time the workers just sit and wait for the I/O thread to serve up another tile.

On my system, dist-max-chunks 2 is optimal (going to 1 chunk it becomes CPU-bound), and uses half the memory. Maybe if you have a very fast filesystem, 3 or 4 would make sense.

We can also use smaller chunks: something like --dist-max-chunks 6 --data-time-chunk 2 --data-freq-chunk 7500 is a lot more economical on memory (peak <60 GB), and I reckon it will be just as fast if not faster. I mean, when you're transferring solutions onto a dataset with that many channels, there's really no benefit at all in using larger time chunks.

I'm not sure how to capture all of this in sensible defaults just yet. But for sure, when transferring gains onto high-freq-res data, smaller time chunks and fewer workers should be the go-to heuristic.

@o-smirnov
Member

I reckon it will be just as fast if not faster

Before you test again, let me fix this first: ratt-ru/CubiCal#396. Looks like the logging is slowing it down artificially.

@o-smirnov
Member

OK, fixed. Use the 1.5.5-oms tag for testing, as indicated previously.

@o-smirnov
Member

OK, with the test image and --dist-max-chunks 2 --data-time-chunk 4, the MS runs to completion in ~3 hours, and memory use stays under 100 GB.

@paoloserra
Collaborator

With apologies for the poor formatting, here are three plots showing CPU usage vs time as reported by top while running CubiCal with three different ncpu and max-chunks settings. The plots' time resolution is 1 second.

In all cases you should look at the red line. The y-axis range is 0-8 (maximum for this machine). Red horizontal lines correspond to 1 CPU step. The top labels mark the start of a container, so only look at the calibrate_cubical_field0_iter1 part.

(The blue line is RAM from 0 to 100% of the system RAM.)

My main conclusion is that, on this machine, reducing max-chunks to a value smaller than ncpu makes the run use fewer CPUs than implied by ncpu.

That's not what I thought we said this morning, i.e., that CubiCal would use the number of CPUs set by ncpu by simultaneously processing the fewer chunks (max-chunks < ncpu) with more threads.

Let me know if there are any errors or missing info.


ncpu = 8, max-chunks = 8

# INFO      00:21:43 - main               [0.1 0.1 0.2Gb] [dist] Parallelization and distribution options
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - ncpu .............................................. = 8
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - nworker ........................................... = 0
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - nthread ........................................... = 0
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - max-chunks ........................................ = 8
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - min-chunks ........................................ = 0
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - pin ............................................... = 0
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - pin-io ............................................ = False
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - pin-main .......................................... = io
# INFO      00:21:43 - main               [0.1 0.1 0.2Gb]  - safe .............................................. = True
...
# INFO      00:21:43 - main               [0.2 0.2 0.2Gb] multi-process mode: 7+1 workers, single thread

[Screenshot: CPU (red, 0-8 CPUs) and RAM (blue) usage vs time for this run]


ncpu = 4, max-chunks = 4

# INFO      09:19:04 - main               [0.1 0.1 0.0Gb] [dist] Parallelization and distribution options
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - ncpu .............................................. = 4
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - nworker ........................................... = 0
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - nthread ........................................... = 0
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - max-chunks ........................................ = 4
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - min-chunks ........................................ = 0
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - pin ............................................... = 0
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - pin-io ............................................ = False
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - pin-main .......................................... = io
# INFO      09:19:04 - main               [0.1 0.1 0.0Gb]  - safe .............................................. = True
...
# INFO      09:19:05 - main               [0.2 0.2 0.0Gb] multi-process mode: 3+1 workers, single thread

[Screenshot: CPU (red, 0-8 CPUs) and RAM (blue) usage vs time for this run]


ncpu = 8, max-chunks = 2

# INFO      09:34:06 - main               [0.1 0.1 0.1Gb] [dist] Parallelization and distribution options
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - ncpu .............................................. = 8
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - nworker ........................................... = 0
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - nthread ........................................... = 0
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - max-chunks ........................................ = 2
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - min-chunks ........................................ = 0
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - pin ............................................... = 0
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - pin-io ............................................ = False
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - pin-main .......................................... = io
# INFO      09:34:06 - main               [0.1 0.1 0.1Gb]  - safe .............................................. = True
...
# INFO      09:34:07 - main               [0.2 0.2 0.1Gb] multi-process mode: 7+1 workers, single thread

[Screenshot: CPU (red, 0-8 CPUs) and RAM (blue) usage vs time for this run]

@PeterKamphuis
Collaborator

Well, but the first two are exactly what you would expect, right? And then in the last one it is not CPU = max-chunks. It is very hard to see from the plot, but is it possible that the CPU usage spikes are so quick that they average to 3 CPUs over a sec, so all 8 fire for 0.3s?

This line though:

multi-process mode: 7+1 workers, single thread

for that last test is completely out of line with what I understood from the discussion this morning.

@o-smirnov
Member

Ah, OK, this is a previous release of CubiCal. That's a bit confusing, because I was thinking about the latest release this morning, which behaves slightly differently from this one. What's happening here (previous release) is that it spins up as many workers as specified by --dist-ncpu (hence "7+1 workers" regardless), but because max-chunks is small, only a subset of these workers receives jobs. Hence @paoloserra sees fewer CPUs used.

@SpheMakh needs to roll a new release, as this logic is a little more... logical in the new release.
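
Roughly, the difference between the two behaviours (my paraphrase, not the actual code):

def n_solver_workers(ncpu, max_chunks, new_release):
    requested = ncpu - 1                     # e.g. the "7+1 workers" line for ncpu=8
    if new_release:
        # new logic (cf. ratt-ru/CubiCal#392): don't spawn more workers
        # than there are chunks per tile
        return min(requested, max_chunks)
    return requested                         # old logic: spawn them all, most sit idle

print(n_solver_workers(8, 2, new_release=False))   # -> 7 spawned, only ~2 ever busy
print(n_solver_workers(8, 2, new_release=True))    # -> 2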

@paoloserra could you please also report wall time in your benchmarking? It doesn't matter if more or fewer CPUs are used if the process takes a similar time to finish, right?

@paoloserra
Collaborator

could you please also report wall time in your benchmarking?

10, 12 and 15 min, respectively

Sure, it might not matter much if the wall times are similar, but the handling of resources needs to be transparent and understandable for a user.

And then in the last one it is not CPU = max-chunks.

@PeterKamphuis sure, that's what we wanted to test right?

@PeterKamphuis
Collaborator

@paoloserra I meant in the output: it doesn't look like the CPU usage is limited by max-chunks, as it peaks above 2. I thought that this morning you said it seemed like max-chunks was limiting the CPU usage. This looks more like it is limiting CubiCal in such a way that it never really needs/uses all the CPUs, in line with what @o-smirnov said above. Could this be different for larger chunks in the new CubiCal version?

@paoloserra
Collaborator

Got it.

My previous claim was based on monitoring the CPU usage "live". The plots I'm now making are better and, as you've noticed, clarify that max-chunks does not set a hard limit on the number of CPUs used. Yet, it does make that number much lower than ncpu, which I think we should fix.

Not sure what the strategy is here, folks. Do we wait for @SpheMakh to roll a new release of Stimela with the latest CubiCal and test again?

@JSKenyon

JSKenyon commented Aug 5, 2020

Hi all! For what it is worth, I opened an issue about the new CubiCal release two weeks ago. @Athanaseus do you have an ETA on it?

@molnard89
Collaborator Author

@PeterKamphuis

is it possible that the CPU usage spikes are so quick that they average to 3 CPUs over a sec, so all 8 fire for 0.3s

If I understood @o-smirnov's explanation yesterday, my interpretation is a bit different. There is usually a dedicated CPU for I/O, which is the bottleneck in the process, while calculating the gains is instantaneous in comparison. So even if there are more than max-chunks = 2 CPUs at the disposal of the process, it only ever uses 3 (1 for I/O plus 2 that quickly calculate the gains, which are then queued for writing), and the rest sit idle. If this is true, ncpu = 3, max-chunks = 2 should be identical in terms of runtime, and in general ncpu should be at most max-chunks+1 (in this particular case; for other computations each worker assigned to a chunk could be multithreaded, and therefore use more than one core to work on its data chunk?).

This looks more like it is limiting CubiCal in such a way that it never really needs/uses all the CPUs

So basically I agree with this, with the difference that ncpu can be larger than max-chunks if the workers/processes responsible for each chunk can multithread (and therefore use more than one CPU), which doesn't seem to be the case for CubiCal (or is not needed, because it's such a quick calculation).

@PeterKamphuis
Collaborator

PeterKamphuis commented Aug 6, 2020

@molnard89 If I followed everything correctly, your explanation is indeed the right one for the CubiCal version we are currently using. But after the upgrade to the new version/release we should have 2 multithreaded workers + 1 I/O worker, right? Which would then use all 8 CPUs. If I understand top correctly, the CPU usage is an average over 1 second the way @paoloserra runs it. Did I get all of that right? I have been out of CARACal for a bit, so I'm playing catch-up here.

@paoloserra
Collaborator

I agree with @molnard89 (for the CubiCal version currently used in CARACal): at most dist_max_chunks+1 CPUs are used, independent of the value of ncpu.

So, for now, I think it would make sense to have a default setting dist_max_chunks = ncpu-1. Experienced users can always decouple those two parameters, but for us mortals the current setup is unnecessarily confusing. With this change, if I'm using too much RAM and cannot further shrink the chunk size, I will simply decrease ncpu. If there's agreement I can make this change in the branch https://github.com/caracal-pipeline/caracal/tree/chunkitup.
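
As a sketch of what I have in mind (hypothetical helper, not actual code in the branch):

def resolve_dist_max_chunks(ncpu, user_value=None):
    # default: couple dist_max_chunks to ncpu, leaving one CPU for the I/O process;
    # experienced users can still set user_value explicitly to decouple the two
    if user_value is not None:
        return user_value
    return max(1, ncpu - 1)

print(resolve_dist_max_chunks(ncpu=8))                 # -> 7
print(resolve_dist_max_chunks(ncpu=8, user_value=2))   # -> 2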

When the Cubical version at ratt-ru/Stimela-classic#661 becomes available we can test whether this behaviour has changed.

Let me know whether you're happy with this.

@o-smirnov o-smirnov modified the milestone: R1.0.4 Aug 11, 2020