Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blast patch causes binaries to null-pointer-exception/core-dump #53165

Open
jchorl opened this issue Jan 10, 2025 · 2 comments
Open

blast patch causes binaries to null-pointer-exception/core-dump #53165

jchorl opened this issue Jan 10, 2025 · 2 comments

Comments

@jchorl
Copy link

jchorl commented Jan 10, 2025

Hi!

I noticed that one of the blast binaries, blastn, throws a null-pointer-exception:

terminate called after throwing an instance of 'ncbi::CCoreException'                                                                        
  what():  NCBI C++ Exception:                                        
    T1 "/opt/conda/conda-bld/blast_1736479973134/work/c++/src/corelib/ncbiobj.cpp", line 1010: Critical: (CCoreException::eNullPtr) ncbi::COb
ject::ThrowNullPointerException() - Attempt to access NULL pointer.   
     Stack trace:                                                     
      /opt/conda/bin/../lib/ncbi-blast+/libxncbi.so ???:0 ncbi::CObject::ThrowNullPointerException() offset=0xC1 addr=0x786ad233e1d1         
      /opt/conda/bin/../lib/ncbi-blast+/libxblast.so ???:0 ncbi::blast::CBlastNode::~CBlastNode() offset=0xD0F addr=0x786ad40a7b4f           
      blastn ???:0 ncbi::blast::CBlastnNode::~CBlastnNode() offset=0xA addr=0x65145c6e59ea                                                   
      /opt/conda/bin/../lib/ncbi-blast+/libxncbi.so ???:0 ncbi::CThread::Wrapper(void*) offset=0x1B6 addr=0x786ad2377a96                     
      /lib/x86_64-linux-gnu/libc.so.6 ???:0  offset=0x891C4 addr=0x786ad1f611c4                                                              
      /lib/x86_64-linux-gnu/libc.so.6 ???:0 __clone offset=0x40 addr=0x786ad1fe0ac0

After a lot of debugging, I chased it down to this patch: https://github.com/bioconda/bioconda-recipes/blob/8e7728ea4d3ce0bf6c27fec18d0f0fd8f6be332d/recipes/blast/phonehome.patch

The root cause is fairly complex. Unfortunately I can't link out to blast code because it's not on github. I was using the latest version, 2.16.0.

In blast_node.cpp:

static void
s_UnregisterDataLoader(const string & dbloader_prefix)
{
    static CRef<CObjectManager> om = CObjectManager::GetInstance();
    CObjectManager::TRegisteredNames loader_names;
    // whoops! om is NULL!
    om->GetRegisteredNames(loader_names);

Diving further, the concurrency controls in ncbi_usage_report.cpp cause the object manager to stay alive and non-null while all threads unregister their data loaders. Only after all threads have unregistered, then the usage report code cleans itself up.

However, after applying this patch, the object manager gets cleaned up after the first call to s_UnregisterDataLoader. When the other threads call it, they hit the null-pointer-exception.

There is clearly tight coordination between the blast-node code and the usage-report code. I don't well understand all the concurrency controls in ncbi_usage_report code to understand how to safely turn off NCBI_USAGE_REPORT_SUPPORTED in a multi-threaded environment. There are plenty of guards, mutexes and signalling that must help the usage reporting coordinate with destruction. You may notice in ncbi_usage_report.hpp:

#if defined(NCBI_THREADS)
#  define NCBI_USAGE_REPORT_SUPPORTED 1
#endif

So clearly ncbi expects NCBI_USAGE_REPORT_SUPPORTED to be defined in a multi-threaded environment, and they sure do depend on it.

I don't think it's right to file a bug against them because their code does work as-is.

Perhaps NCBI_USAGE_REPORT_SUPPORTED is not the best way to disable reporting. Looking around their codebase, there seem to be other ways, such as env vars.

So, firstly, could we revert this patch as it causes blastn not to function correctly?

Alternatively, perhaps we can find a different way to patch the usage reporter to disable it. E.g. we could probably just override:

static bool gs_IsEnabled = NCBI_PARAM_TYPE(USAGE_REPORT, Enabled)::GetDefault(); // could override this to false

Thanks!

@aliciaaevans
Copy link
Contributor

There have been some past issues with packages that do a phone home causing our CI providers to block all our builds for "suspicious" activity. I'm not sure if that was the case here, but it could be. So, that's why I would be hesitant to just remove the patch. From the original PR for the patch it seems like maybe it was causing a timeout during the build because the request was blocked by the CI provider:

I only noticed, because the program took some time to terminate due to the timeout of the phone-home call, which was blocked by the firewall.

I think your suggestion to disable it in another way is a good option.

@jchorl
Copy link
Author

jchorl commented Jan 10, 2025

There have been some past issues with packages that do a phone home causing our CI providers to block all our builds for "suspicious" activity. I'm not sure if that was the case here, but it could be. So, that's why I would be hesitant to just remove the patch. From the original PR for the patch it seems like maybe it was causing a timeout during the build because the request was blocked by the CI provider:

I only noticed, because the program took some time to terminate due to the timeout of the phone-home call, which was blocked by the firewall.

I think your suggestion to disable it in another way is a good option.

Thank you for following up! I opened #53173 , where I did explore another mechanism to disable this. I was optimistic that reintroducing the NCBI_USAGE_REPORT_SUPPORTED define would bring back the concurrency controls required to safely shut down, while also disabling reporting a different way. Unfortunately that too didn't work - shutting down the usage reporter in a multi-threaded manner seems to require the same concurrency controls it uses to operate.

I do see that the build passed on that PR, which seems promising. Do you happen to be familiar with the mutexes/signaling in ncbi_usage_report.cpp? Perhaps if we could make sense of what's load-bearing there we could patch differently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants