Skip to content

Commit

Permalink
Revert and fix earlier windows NUMA patch
Browse files Browse the repository at this point in the history
revert 9048ac0 due to core spread problem and fix new OS compatibility with another method.

This code assumes that if one NUMA node has more than one processor groups, they are created equal(having equal amount of cores assigned to each of the groups), and also the total number of available cores contained in such groups are equal to the number of available cores within one NUMA node because of how best_node function works.

closes #3798
fixes #3787

No functional change.
  • Loading branch information
noobpwnftw authored and vondele committed Nov 22, 2021
1 parent a943b1d commit 7218ec4
Showing 1 changed file with 34 additions and 20 deletions.
54 changes: 34 additions & 20 deletions src/misc.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ typedef bool(*fun1_t)(LOGICAL_PROCESSOR_RELATIONSHIP,
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, PDWORD);
typedef bool(*fun2_t)(USHORT, PGROUP_AFFINITY);
typedef bool(*fun3_t)(HANDLE, CONST GROUP_AFFINITY*, PGROUP_AFFINITY);
typedef bool(*fun4_t)(USHORT, PGROUP_AFFINITY, USHORT, PUSHORT);
}
#endif

Expand Down Expand Up @@ -495,14 +496,14 @@ void bindThisThread(size_t) {}

#else

/// best_group() retrieves logical processor information using Windows specific
/// API and returns the best group id for the thread with index idx. Original
/// best_node() retrieves logical processor information using Windows specific
/// API and returns the best node id for the thread with index idx. Original
/// code from Texel by Peter Österlund.

int best_group(size_t idx) {
int best_node(size_t idx) {

int threads = 0;
int groups = 0;
int nodes = 0;
int cores = 0;
DWORD returnLength = 0;
DWORD byteOffset = 0;
Expand Down Expand Up @@ -530,8 +531,8 @@ int best_group(size_t idx) {

while (byteOffset < returnLength)
{
if (ptr->Relationship == RelationGroup)
groups += ptr->Group.MaximumGroupCount;
if (ptr->Relationship == RelationNumaNode)
nodes++;

else if (ptr->Relationship == RelationProcessorCore)
{
Expand All @@ -546,23 +547,23 @@ int best_group(size_t idx) {

free(buffer);

std::vector<int> core_groups;
std::vector<int> groups;

// Run as many threads as possible on the same group until core limit is
// reached, then move on filling the next group.
for (int n = 0; n < groups; n++)
for (int i = 0; i < cores / groups; i++)
core_groups.push_back(n);
// Run as many threads as possible on the same node until core limit is
// reached, then move on filling the next node.
for (int n = 0; n < nodes; n++)
for (int i = 0; i < cores / nodes; i++)
groups.push_back(n);

// In case a core has more than one logical processor (we assume 2) and we
// have still threads to allocate, then spread them evenly across available
// groups.
// nodes.
for (int t = 0; t < threads - cores; t++)
core_groups.push_back(t % groups);
groups.push_back(t % nodes);

// If we still have more threads than the total number of logical processors
// then return -1 and let the OS to decide what to do.
return idx < core_groups.size() ? core_groups[idx] : -1;
return idx < groups.size() ? groups[idx] : -1;
}


Expand All @@ -571,22 +572,35 @@ int best_group(size_t idx) {
void bindThisThread(size_t idx) {

// Use only local variables to be thread-safe
int group = best_group(idx);
int node = best_node(idx);

if (group == -1)
if (node == -1)
return;

// Early exit if the needed API are not available at runtime
HMODULE k32 = GetModuleHandle("Kernel32.dll");
auto fun2 = (fun2_t)(void(*)())GetProcAddress(k32, "GetNumaNodeProcessorMaskEx");
auto fun3 = (fun3_t)(void(*)())GetProcAddress(k32, "SetThreadGroupAffinity");
auto fun4 = (fun4_t)(void(*)())GetProcAddress(k32, "GetNumaNodeProcessorMask2");

if (!fun2 || !fun3)
return;

GROUP_AFFINITY affinity;
if (fun2(group, &affinity))
fun3(GetCurrentThread(), &affinity, nullptr);
if (!fun4) {
GROUP_AFFINITY affinity;
if (fun2(node, &affinity))
fun3(GetCurrentThread(), &affinity, nullptr);
} else {
// If a numa node has more than one processor group, we assume they are
// sized equal and we spread threads evenly across the groups.
USHORT elements, returnedElements;
elements = GetMaximumProcessorGroupCount();
GROUP_AFFINITY *affinity = (GROUP_AFFINITY*)malloc(
elements * sizeof(GROUP_AFFINITY));
if (fun4(node, affinity, elements, &returnedElements))
fun3(GetCurrentThread(), &affinity[idx % returnedElements], nullptr);
free(affinity);
}
}

#endif
Expand Down

10 comments on commit 7218ec4

@g3g6
Copy link

@g3g6 g3g6 commented on 7218ec4 Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this version it doesnt run on Xeon X5450 with OS Windwows Server 2003.

It is known behavior or this setup is just obsolete?

@vondele
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe the precise problem you have in a github issue (e.g. error message etc)?

@g3g6
Copy link

@g3g6 g3g6 commented on 7218ec4 Nov 29, 2021 via email

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vondele
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sopel97
Copy link
Member

@Sopel97 Sopel97 commented on 7218ec4 Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetMaximumProcessorGroupCount is only supported since windows 7/server 2008

@vondele
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do the same as we do shortly above that calll, (see if available in the dll, and early exit if not).

@noobpwnftw
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stockfish/src/misc.cpp

Lines 19 to 23 in e4a0c6c

#ifdef _WIN32
#if _WIN32_WINNT < 0x0601
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x0601 // Force to include needed API prototypes
#endif

https://docs.microsoft.com/en-us/cpp/porting/modifying-winver-and-win32-winnt?view=msvc-170

According to our existing definition, our minimum required Windows version is Windows 7/Windows Server 2008.

Theoretically we can have such a check, but is it necessary?

@g3g6
Copy link

@g3g6 g3g6 commented on 7218ec4 Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessary, but because it was change from day to day that its why I asked. Probably it would be good idea to officially mention supported platforms.

@vondele
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not support NUMA for old versions of SF, but would avoid that the code fails to run on those older systems, in the sense that ideally the released versions do not regress for a non-core feature.

@noobpwnftw
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will refine this with another PR.

Please sign in to comment.