Remove const to pass Oracle compiler on Solaris #3294
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3294      +/-   ##
==========================================
- Coverage   94.73%   94.72%   -0.01%
==========================================
  Files          65       65
  Lines       12131    12125       -6
==========================================
- Hits        11492    11486       -6
  Misses        639      639
```
Continue to review full report at Codecov.
Very nice cleanup. A few comments.

> The `MIN(getDTthreads(), nx)` can be left to OpenMP.

Even if OpenMP does not handle that as well as it could (by detecting the number of iterations in a parallel loop and limiting the number of threads it initializes), it shouldn't be a big deal unless machines with hundreds or more cores are being used, where the overhead could become noticeable.
There is still the following scenario to investigate (already investigated, see comment below), using R pseudocode:

```r
# pseudocode: omp_for stands for an OpenMP-parallelized loop
f = function(x) omp_for(i in 1:2) do(x[i])
g = function(X) omp_for(i in 1:2) f(X[i])
```

Let's say we have 4 CPUs and we want to use nested parallelism. If `g`'s `omp_for` allocates all 4 available CPUs, then there might be no CPUs left to allocate inside `f`'s `omp_for`. Limiting `num_threads` to `min(getDTthreads(), 2)` will ensure that 2 CPUs are used by `g` and another two are then utilized in `f`; thus we use all 4 CPUs. This is not obvious in the case where we remove `MIN`.
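A minimal C sketch of this scenario (my illustration, not data.table code; the function names and the `MIN` macro here are stand-ins, and nested parallelism must be enabled):

```c
#include <stdio.h>
#include <omp.h>

#define MIN(a,b) ((a)<(b)?(a):(b))

// inner loop: cap at 2 threads so the other outer iteration's
// team can use the remaining CPUs
static void f(int i) {
  #pragma omp parallel for num_threads(MIN(omp_get_max_threads(), 2))
  for (int j = 0; j < 2; j++) {
    printf("g iteration %d, f iteration %d, thread %d\n",
           i, j, omp_get_thread_num());
  }
}

int main(void) {
  omp_set_max_active_levels(2);  // allow two levels of parallelism
  // outer loop: also capped at 2 threads; on a 4-CPU machine the
  // 2 outer threads x 2 inner threads can use all 4 CPUs
  #pragma omp parallel for num_threads(MIN(omp_get_max_threads(), 2))
  for (int i = 0; i < 2; i++) f(i);
  return 0;
}
```

The worry was that without the `MIN` cap the outer region claims all 4 threads up front; as confirmed below, OpenMP in fact still lets the inner regions use the remaining cores.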
> I removed the schedule(static) because that's the default schedule.

It might not always be the default, I think; see https://stackoverflow.com/questions/21291581/openmp-parallel-for-what-is-default-schedule
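For context, the spec leaves the no-clause schedule implementation-defined, so writing it out explicitly is the only portable way to pin it down. A toy sketch of doing so (my example, nothing data.table-specific):

```c
#include <omp.h>

void scale(double *x, int n, double k) {
  // explicit static schedule: iterations are divided into roughly
  // equal contiguous chunks, one per thread; omitting the clause
  // leaves the choice to the implementation
  #pragma omp parallel for schedule(static) num_threads(omp_get_max_threads())
  for (int i = 0; i < n; i++) x[i] *= k;
}
```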
> ... If using shared() is supposed to make the code clearer

No, I use it to be sure that OpenMP knows that other threads can update that variable. I'm not sure it is really necessary, but it can be useful to explicitly "document" that a variable is shared. The same applies to `volatile`; I was going for the safe approach because those compiler details you mentioned, like the "naked one-value write", are still not obvious to me.
Regarding comment 1: after investigation I can confirm that nested parallelism utilizes all available cores, so the outer omp for loop does not block access to the remaining cores for an inner omp for loop. One thing I spotted, not related to this branch, is that running the following code on a machine having 4 CPUs

```r
library(data.table)
n = 5e6
set.seed(108)
dt = setDT(lapply(1:4, function(i) rnorm(n)))
system.time(ans <- frollmean(dt, c(1e3, 2e3), algo="exact"))
```

spawns 15 threads, each using around 23-26% CPU. I'm not sure how that compares to running 4 threads using 99-100% each.
Let's tackle the nested parallelism first. Why do we have any? I experimented with it once and managed to not need it. I think we should avoid nested parallelism wherever possible. I can see just one use of …
The default for …
I understand you're trying to document. But my point was that's not what `volatile` is for, iiuc. Using it in that context can give the wrong impression to someone who does know what the true use of `volatile` is. How about defining our own qualifier: …
Do you know whether, when you use shared(), you can still use a shared variable that isn't named in shared()? So the compiler catches that for us? If so, always using shared() might be a good standard (although extra steps). Personally, I think it just gets in the way. You know that variables are shared unless declared inside the thread's scope, and you can see that. I don't want to have to remember to look at the parallel directive as well to see whether a variable has been declared shared or private there. In general. So far.
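For what it's worth, `default(none)` gives exactly that compiler check: every variable used in the region must then be listed explicitly, or compilation fails. A small sketch (my example, not from this patch):

```c
#include <omp.h>

double dot(const double *x, const double *y, int n) {
  double total = 0.0;
  // default(none): forgetting to declare any of x, y, n or total
  // here is a compile-time error rather than a silent default
  #pragma omp parallel for default(none) shared(x, y) firstprivate(n) \
          reduction(+:total) num_threads(omp_get_max_threads())
  for (int i = 0; i < n; i++) total += x[i] * y[i];
  return total;
}
```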
… by default (…)
I remember, in parallel subset.
Agreed. According to this SO answer, it is not easy to control the maximum number of threads in nested parallelism. We would need to calculate the value to pass to the inner …
That nested part is time consuming, and the outer loop might use just a few cores. If the (number of columns × number of windows) to calculate is less than the number of threads, the algorithm will not use all available threads, which matters because that inner loop is slow.
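A sketch of that calculation (a hypothetical helper, not code from this branch): divide the threads left over after the outer loop's allocation among the inner loops.

```c
#include <omp.h>

// hypothetical helper: given an outer loop of 'nouter' iterations,
// return how many threads each inner parallel loop may request so
// that outer * inner does not oversubscribe the machine
static int inner_num_threads(int nouter) {
  int total = omp_get_max_threads();
  int outer = nouter < total ? nouter : total;  // threads the outer loop uses
  int inner = total / outer;                    // what's left per iteration
  return inner > 1 ? inner : 1;
}
```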
There are …
Agree.
I understand your point. On the other hand, this clause nicely documents that there is interaction between threads happening in the loop. It can be spotted by looking at the variables as well, but that is not as explicit.
But those control the schedule when …
I found a good reference about omp schedule options: https://www.dartmouth.edu/~rc/classes/intro_openmp/schedule_loops.html
Closes #3285
I went a bit further than the patch from Prof Ripley, which removed the `const`, with the following thought process.

- `MIN(getDTthreads(), nx)` can be left to OpenMP. Either OpenMP will only start the number of threads it needs for very small loops, or the overhead of selecting threads from the pool is so small it's not worth worrying about. Either way it's not worth the small extra code complexity of MIN(), so just leave that to OpenMP. I don't actually know if this is true, but of all the things we don't know, I'm willing to trust OpenMP on this and move on to something of higher value.
- Removed the variable passed to num_threads() (variously named `nth` vs `threads`). It's not a hard and fast rule to avoid creating new variables, but in this case I think it makes sense. Second, there are some `grep`'s in CRAN_Release.cmd that look for the pragma lines and check that num_threads() is specified, because missing that directive would leave us sensitive to the global OpenMP setting shared by other packages. It's slightly easier for me when running CRAN_Release.cmd to see the consistent pattern num_threads(getDTthreads()) without a variable being there. Now that we know a `const` variable being passed to num_threads() is the problem on Solaris, the CRAN_Release.cmd greps could be strengthened to ensure no variable is passed to num_threads() at all, just because otherwise we'd have to find a way to guarantee such a variable wasn't declared const.
- Removed the schedule(static) because that's the default schedule.
- `true` not `1`, to signal to the reader's eye that it's a bool that can only take true/false. This is the very reason we are ok with assigning naked to a shared bool; naked in the sense of no atomic protection. In this special case the threads can only set the bool to true, so it doesn't matter if they overlap. Even if the bool is actually an int, it doesn't matter if a thread race happens, because any bits written anywhere in the variable will read as true (non-zero). However, it's safe to assume only the leading bit would ever be written, and a naked one-way write is safe (see the sketch after this list).
- That's not what `volatile` is for, iiuc, so I removed the `volatile` keyword. See this article: https://barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword. What we really want for this idiom is a qualifier like "naked one-value write", so the compiler would check for us that a single value (true) was only ever written to that variable from inside the parallel region; but I don't think such a qualifier exists in C. `volatile`, on the other hand, iiuc, is for reading a value that might have been changed by an external process (and therefore has to be read and re-read even when the optimizer might be sure no update could have happened to it). However, when we write to that bool we don't care what its value is now; we just want the write to get there eventually and stick. The important thing is that no thread could ever write `false`.

That's the way I think about these items currently.
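A sketch of the one-way write idiom described in the `true` not `1` item above (my illustration; the flag and the loop are hypothetical, not code from this patch):

```c
#include <stdbool.h>
#include <math.h>
#include <omp.h>

// Returns true if any element is NaN. 'found' is written naked
// (no atomic/critical) because every thread can only ever write the
// single value 'true'; no thread writes 'false', so overlapping
// writes cannot corrupt the outcome.
bool any_nan(const double *x, int n) {
  bool found = false;
  #pragma omp parallel for num_threads(omp_get_max_threads())
  for (int i = 0; i < n; i++) {
    if (isnan(x[i])) found = true;  // one-way write: true only
  }
  return found;
}
```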