
Remove const to pass Oracle compiler on Solaris #3294

Merged: mattdowle merged 6 commits into master from solaris on Feb 9, 2019
Conversation

@mattdowle (Member) commented Jan 19, 2019

Closes #3285

I went a bit further than the patch from Prof Ripley, which removed the const, with the following thought process:

  1. The MIN(getDTthreads(), nx) can be left to OpenMP. Either OpenMP will only start the number of threads it needs for very small loops, or the overhead of selecting threads from the pool is so small it isn't worth worrying about. Either way it isn't worth the small extra code complexity of MIN(), so just leave that to OpenMP. I don't actually know this is true, but of all the things we don't know, I'm willing to trust OpenMP on this and move on to something of higher value.
  2. Since the MIN() was probably why the value was held in a variable up front (to split up the lines), I removed the variable and put getDTthreads() back inline in the #pragma line. Several reasons. First, saving a variable avoids accidentally using the wrong one (e.g. nth vs threads). It's not a hard and fast rule to avoid creating new variables, but in this case I think it makes sense. Second, there are some greps in CRAN_Release.cmd that look for the pragma lines and check that num_threads() is specified, because missing that clause would make us sensitive to the global OpenMP setting shared by other packages. It's slightly easier for me when running CRAN_Release.cmd to see the consistent pattern num_threads(getDTthreads()) without a variable being there. Now that we know a const variable passed to num_threads() is the problem on Solaris, the CRAN_Release.cmd greps could be strengthened to ensure no variable is passed to num_threads(), because otherwise we'd have to find a way to ensure such a variable wasn't declared const.
  3. I collapsed the two pragmas into one to save lines and indentation, and because that's what I've done elsewhere. So far I've only split into two pragmas when necessary for some per-thread initialization. Where one pragma suffices I use one pragma, as a general rule (see the sketch below).
  4. I removed the schedule(static) because that's the default schedule. If we make that explicit, then for consistency it should be explicit everywhere (and all other defaults too?) in all parallel regions, which would distract the eye from the non-default code that matters more.
  5. I haven't been using the shared() clause, rather leaving that to OpenMP. I can see whether a variable is shared or private from where it is declared. Needing to look at the shared() clause and what it names is an extra step when reading and understanding the code. If using shared() is supposed to make the code clearer, then I find myself looking to check it really is shared and whether there are any other shared variables not named in it. So I prefer just to leave the shared() clause out.
  6. When assigning to variables of type bool, use true not 1, to signal to the reader's eye that it's a bool that can only take true/false. That is the very reason we are ok with assigning naked to a shared bool; naked in the sense of no atomic protection. In this special case the threads can only set the bool to true, so it doesn't matter if they overlap. Even if the bool is actually an int, a thread race doesn't matter, because any bits written anywhere in the variable make it true (non-zero). Since every thread only ever writes the same single value, a naked one-way write is safe.
  7. Our usage of bool in this way is not what volatile is for, iiuc, so I removed the volatile keyword. See this article: https://barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword. What we really want for this idiom is a qualifier meaning "naked one-value write", so the compiler would check for us that a single value (true) is only ever written to that variable from inside the parallel region. But I don't think such a qualifier exists in C. volatile, on the other hand, iiuc, is for reading a value that might have been changed by an external process (and therefore has to be read and re-read even when the optimizer might be sure no update could have happened). When we write to this bool we don't care what its value is now; we just want the write to get there eventually and stick. The important thing is that no thread could ever write false.

That's the way I think about these items currently.
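For concreteness, here is a minimal compilable sketch of the resulting pattern (hypothetical data and flag name, not the actual data.table source; getDTthreads() is stubbed with omp_get_max_threads() so the example stands alone):

#include <stdbool.h>
#include <stdio.h>
#include <omp.h>

static int getDTthreads(void) { return omp_get_max_threads(); }  // stand-in for data.table's real function

int main(void) {
  enum { nx = 1000000 };
  static double x[nx];
  for (int i = 0; i < nx; ++i) x[i] = (double)i;
  x[12345] = -1.0;  // plant one negative value for the flag to find

  bool anyNegative = false;  // plain bool, not volatile (item 7)
  // inline call (item 2), one combined pragma (item 3), default schedule (item 4), no shared() (item 5)
  #pragma omp parallel for num_threads(getDTthreads())
  for (int i = 0; i < nx; ++i) {
    if (x[i] < 0.0) anyNegative = true;  // naked one-way write (item 6): threads may race but only ever write true
  }
  printf("anyNegative=%d\n", (int)anyNegative);
  return 0;
}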

@mattdowle mattdowle added this to the 1.12.2 milestone Jan 19, 2019
@codecov codecov bot commented Jan 19, 2019

Codecov Report

Merging #3294 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3294      +/-   ##
==========================================
- Coverage   94.73%   94.72%   -0.01%     
==========================================
  Files          65       65              
  Lines       12131    12125       -6     
==========================================
- Hits        11492    11486       -6     
  Misses        639      639
Impacted Files        Coverage          Δ
src/reorder.c         97.46% <ø>        (-0.04%) ⬇️
src/frolladaptive.c   87.73% <100%>     (-0.45%) ⬇️
src/froll.c           98.11% <100%>     (-0.04%) ⬇️
src/frollR.c          100% <100%>       (ø) ⬆️
src/gsumm.c           90.9% <0%>        (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5171b6...3e4f316. Read the comment docs.

@mattdowle mattdowle requested a review from jangorecki January 19, 2019 01:21
@jangorecki (Member) left a comment

Very nice cleanup. A few comments.

  1. The MIN(getDTthreads(), nx) can be left to OpenMP.

Even if OpenMP does not handle that as well as it could - by detecting the number of iterations in a parallel loop and limiting the number of threads to initialize - it shouldn't be a big deal unless machines with hundreds or more cores are used, where the overhead could become noticeable.
Still, there is the following scenario to investigate (already investigated, see comment below), using R pseudo-code:

f = function(x) omp_for(i in 1:2) do(x[i])
g = function(X) omp_for(i in 1:2) f(X[i])

Let's say we have 4 CPUs and we want to use nested parallelism. If g's omp_for allocates all 4 available CPUs, then there might be no CPUs left to allocate inside f's omp_for. Limiting num_threads to min(getDTthreads(), 2) ensures that 2 CPUs are used by g and another two are utilized in f, so we use all 4 CPUs. This is not obvious once we remove the MIN.
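A compilable C sketch of that scenario (hypothetical work in place of frollmean's real loops; with nesting enabled, the inner region can use the CPUs the outer region left idle):

#include <stdio.h>
#include <omp.h>

static void f(int j) {  // inner parallel loop: 2 iterations
  #pragma omp parallel for num_threads(2)
  for (int i = 0; i < 2; ++i)
    printf("outer %d inner %d on thread %d at level %d\n", j, i, omp_get_thread_num(), omp_get_level());
}

int main(void) {
  omp_set_nested(1);  // enable nested regions (deprecated in OpenMP 5.0 in favour of omp_set_max_active_levels)
  #pragma omp parallel for num_threads(2)  // the outer loop g: 2 iterations on a 4-CPU machine
  for (int j = 0; j < 2; ++j)
    f(j);
  return 0;
}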

  4. I removed the schedule(static) because that's the default schedule.

It might not always be the default, I think; see https://stackoverflow.com/questions/21291581/openmp-parallel-for-what-is-default-schedule

  5. ... If using shared() is supposed to make the code clearer

No, I use it to be sure that OpenMP knows other threads can update that variable. Not sure if it is really necessary, but it can still be useful to explicitly "document" that a variable is shared. The same applies to volatile; I was taking the safe approach, because compiler details like the "naked one-value write" you mentioned are still not obvious to me.

@jangorecki (Member) commented Jan 20, 2019

Regarding the comment to 1: after investigation I can confirm that nested parallelism utilizes all available cores, so the outer omp for loop does not block access to the remaining cores for an inner omp for loop.

One thing I spotted, not related to this branch, is that running the following code on a machine with 4 CPUs

library(data.table)
n = 5e6
set.seed(108)
dt = setDT(lapply(1:4, function(i) rnorm(n)))
system.time(ans<-frollmean(dt, c(1e3, 2e3), algo="exact"))

spawns 15 threads, each using around 23-26% CPU. Not sure how that compares to running 4 threads at 99-100% each.

@mattdowle (Member, Author) commented
Let's tackle the nested parallelism first. Why do we have any? I experimented with it once and managed to not need it. I think we should avoid nested parallelism wherever possible. I can see just one use of omp_set_nested() and it's in frollR.c. Each thread gets a column. But then it's nested within each column too. Why does that section need to be nested?

schedule(static)

The default for def-sched-var in all implementations is static, I believe.
We could go through all 45 parallel regions and write schedule(static) for clarity, with a grep in CRAN_Release.cmd to ensure schedule(static) is present. Or we can say that all implementations do default to static now, and if they change that in future they probably have a very good reason to and know better than us. def-sched-var is not affected by any environment variable, so it's not as if users can change it. Although it is implementation defined, so an implementation might provide a way for a user to change it.

volatile

I understand you're trying to document. But my point was that that's not what volatile is for, iiuc. Using it in that context can give the wrong impression to someone who does know what volatile is truly for. How about defining our own qualifier, e.g. _onlyWriteTrue_ flag = false; via something like #define _onlyWriteTrue_ bool.
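A sketch of that suggestion (the macro name comes from the comment above; it is documentation only, the compiler would not enforce the one-way write):

#include <stdbool.h>
#include <stdio.h>

#define _onlyWriteTrue_ bool  // documents intent only; C offers no qualifier the compiler could check

int main(void) {
  _onlyWriteTrue_ flag = false;
  #pragma omp parallel for
  for (int i = 0; i < 100; ++i)
    if (i % 7 == 0) flag = true;  // the only write inside the region is of the single value true
  printf("flag=%d\n", (int)flag);
  return 0;
}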

shared()

Do you know whether, when you use shared(), you can still use a shared variable that isn't named in shared()? Or does the compiler catch that for us? If so, always using shared() might be a good standard (although an extra step). Personally, I think it just gets in the way. You know that variables are shared unless declared inside the thread's scope, and you can see that. I don't want to have to remember to also look at the parallel directive to see whether a variable has been declared shared or private there. In general. So far.
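For reference, a sketch of the distinction in question, as I understand the OpenMP spec: variables not named in shared() are still implicitly shared by default; it is the default(none) clause that forces every variable used in the region to be listed, turning an omission into a compile error. Hypothetical variables:

#include <stdio.h>

int main(void) {
  int a = 0, b = 0;
  // Without default(none), b could be used even though only a is listed: b would be implicitly shared.
  // With default(none), omitting b from the clause list is a compile-time error:
  //   #pragma omp parallel for default(none) shared(a)   // error: 'b' not specified in enclosing 'parallel for'
  #pragma omp parallel for default(none) shared(a, b)
  for (int i = 0; i < 10; ++i)
    if (i == 0) { a = 1; b = 1; }  // both must be listed under default(none)
  printf("a=%d b=%d\n", a, b);
  return 0;
}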

@jangorecki (Member) commented Feb 5, 2019

Let's tackle the nested parallelism first. Why do we have any?

By default (algo="fast") we use only one level, but for algo="exact" we calculate the mean of a window for every single observation, which can safely be done in parallel (at least in theory, as #3346 exposed, though that is not related to nested parallelism).

I experimented with it once and managed to not need it.

I remember, in parallel subset.

I think we should avoid nested parallelism wherever possible.

Agree. According to this SO answer it is not easy to control the maximum number of threads in nested parallelism; we would need to calculate the value to pass to the inner num_threads.
I will re-work that part to how it was initially: when algo="exact", the outer loop will not go parallel, only the inner loop will.
Anyway, it might be better to do that after merging this PR, to avoid conflicts. Maybe remove the omp_set_nested calls for now, so nested parallelism will be ignored, and I will submit a PR after this one is merged.

I can see just one use of omp_set_nested() and it's in frollR.c. Each thread gets a column. But then it's nested within each column too. Why does that section need to be nested?

That nested part is time consuming, and the outer loop might use just a few cores. If (number of columns * number of windows) is less than the number of threads, then the algorithm will not use all available threads, which matters because that inner loop is slow.

schedule(static)
...
is not affected by any environment variable so it's not like users can change it.

There are OMP_SCHEDULE=scheduling-type and omp_set_schedule, but changing the default doesn't make sense in our case; different loops have different preferences. IIUC, static fits best when there are many small iterations of similar time complexity, while dynamic fits when there are fewer iterations whose time complexity varies. That is why in frollmeanExact (the inner parallel loop), where we iterate over rows calculating a window mean, schedule(static) is used, while in frollfunR (the outer parallel loop) schedule(auto) is used, because one iteration might be calculating window size 30 and another 3000, so their times will be very different. We can generalize that iterations over rows should use static while iterations over columns should use auto/dynamic. Somewhere in the roll branch history there might be a commit (not sure if I actually committed it) that measures the static/dynamic difference on rows, and the difference was significant, but AFAIR I haven't checked static/dynamic for columns.
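A sketch of those two preferences (hypothetical stand-ins for frollmeanExact's row loop and frollfunR's column/window loop, not the actual code):

void over_rows(double *out, const double *x, int n) {
  // many small iterations of similar cost: static chunking keeps scheduling overhead low
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i)
    out[i] = x[i] * 2.0;
}

void over_columns(double **cols, const int *window, int ncol, int n) {
  // few iterations whose cost varies with window size: let the runtime balance them
  #pragma omp parallel for schedule(auto)
  for (int j = 0; j < ncol; ++j)
    for (int i = 0; i < n; ++i)
      cols[j][i] = (double)window[j] * i;  // stand-in for a window computation
}

int main(void) {
  double x[4] = {1, 2, 3, 4}, out[4];
  over_rows(out, x, 4);
  double c0[4], c1[4], *cols[2] = {c0, c1};
  int window[2] = {2, 4};
  over_columns(cols, window, 2, 4);
  return 0;
}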

volatile

Agree.

shared()

I understand your point. On the other hand, this clause nicely documents that there is interaction between threads happening in the loop. It can be spotted from where the variables are declared as well, but that is less explicit.

@mattdowle (Member, Author) commented

There are OMP_SCHEDULE=sheduling-type and omp_set_schedule

But those control the schedule only when schedule(runtime) is used. They don't control the schedule when the schedule clause is missing.
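A sketch of the distinction (hypothetical loops):

#include <omp.h>

void runtime_scheduled(double *x, int n) {
  // picks up e.g. OMP_SCHEDULE="dynamic,4" from the environment,
  // or a prior call such as omp_set_schedule(omp_sched_dynamic, 4)
  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < n; ++i) x[i] += 1.0;
}

void default_scheduled(double *x, int n) {
  // no schedule clause: uses the implementation's def-sched-var, unaffected by OMP_SCHEDULE
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) x[i] += 1.0;
}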

@mattdowle mattdowle merged commit 055047b into master Feb 9, 2019
@mattdowle mattdowle deleted the solaris branch February 9, 2019 01:33
@jangorecki (Member) commented

I found a good reference about OMP schedule options: https://www.dartmouth.edu/~rc/classes/intro_openmp/schedule_loops.html

Successfully merging this pull request may close these issues:

Error with Oracle compiler (likely Solaris only)