[<Ray component: Core, Serve>] When distributed tracing is enabled, calls to Ray Serve endpoints hang and no response is returned #49728

Open
venkatkalluru opened this issue Jan 8, 2025 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
serve: Ray Serve Related Issue
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

venkatkalluru commented Jan 8, 2025

What happened + What you expected to happen

The Bug / Problem:
When distributed tracing is enabled in Ray, any REST call made to a Serve deployment, or to a generic endpoint such as /-/routes, hangs and no response is returned. Details below.

Expected Behavior
Calls to Serve REST endpoints and other generic endpoints should not hang and should return the expected response.

Useful Information / Details
When distributed tracing is enabled, part of the tracing functionality works and spans are reported to our Tempo instances. Only the REST calls fail. After debugging, we found that the failure happens when actor methods are called: tracing adds a _ray_trace_ctx keyword argument to the call, but the actor method's signature does not accept it. Detailed stack trace below.

I had to make the changes shown in this PR to get it working. Could you confirm whether these are the right changes? If not, please suggest the right approach.

Exception in callback <function LongPollClient._process_update.<locals>.chained at 0x12e36caf0>
handle: <Handle LongPollClient._process_update.<locals>.chained>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
  File "/lib/python3.10/site-packages/ray/serve/_private/long_poll.py", line 172, in chained
    callback(arg)
  File "/lib/python3.10/site-packages/ray/serve/_private/router.py", line 420, in update_running_replicas
    self._replica_scheduler.update_running_replicas(running_replicas)
  File "/lib/python3.10/site-packages/ray/serve/_private/replica_scheduler/replica_scheduler.py", line 33, in update_running_replicas
    return self.update_replicas(
  File "/lib/python3.10/site-packages/ray/serve/_private/replica_scheduler/pow_2_scheduler.py", line 294, in update_replicas
    r.push_proxy_handle(self._self_actor_handle)
  File "/lib/python3.10/site-packages/ray/serve/_private/replica_scheduler/replica_wrapper.py", line 64, in push_proxy_handle
    self._actor_handle.push_proxy_handle.remote(handle)
  File "/lib/python3.10/site-packages/ray/actor.py", line 202, in remote
    return self._remote(args, kwargs)
  File "/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 445, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/lib/python3.10/site-packages/ray/actor.py", line 345, in _remote
    return invocation(args, kwargs)
  File "/lib/python3.10/site-packages/ray/actor.py", line 326, in invocation
    return actor._actor_method_call(
  File "/lib/python3.10/site-packages/ray/actor.py", line 1452, in _actor_method_call
    list_args = signature.flatten_args(function_signature, args, kwargs)
  File "/lib/python3.10/site-packages/ray/_private/signature.py", line 110, in flatten_args
    raise TypeError(str(exc)) from None
TypeError: got an unexpected keyword argument '_ray_trace_ctx'
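
The final TypeError can be illustrated without Ray. The sketch below is a minimal stand-in, not Ray's actual code: untraced_method, traced_method, bind_call, and try_call are hypothetical names, and the trace context value is a placeholder. It assumes the failing step amounts to binding the call's keyword arguments against the method's signature, which is what the flatten_args frame in the traceback suggests.

```python
import inspect


def untraced_method(self, handle):
    """Hypothetical actor method whose signature has no tracing parameter."""
    return handle


def traced_method(self, handle, _ray_trace_ctx=None):
    """Hypothetical actor method whose signature accepts the tracing kwarg."""
    return handle


def bind_call(func, args, kwargs):
    """Bind a call against func's signature (illustrative, not Ray's code)."""
    try:
        bound = inspect.signature(func).bind(*args, **kwargs)
    except TypeError as exc:
        # Re-raise with only the message, as the traceback above does.
        raise TypeError(str(exc)) from None
    return dict(bound.arguments)


def try_call(func):
    """Simulate a caller that injects a trace context into every call."""
    injected = {"_ray_trace_ctx": {"traceparent": "placeholder"}}
    try:
        bind_call(func, (object(), "proxy-handle"), injected)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_call(untraced_method))
print(try_call(traced_method))
```

The first call fails because the signature has no parameter named _ray_trace_ctx; the second succeeds because it does. This only shows the mismatch. It does not say which side (the injecting caller or the receiving method's signature) should change in Ray itself.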

Versions / Dependencies

uname
Linux x86_64
ray --version
ray, version 2.40.0

Reproduction script

Enable distributed tracing on a cluster where Ray Serve deployments exist, then call the endpoints for inference. The problem can be reproduced this way.
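
A possible starting point for a script, written as a sketch and not verified against 2.40.0. Assumptions: tracing is enabled through Ray's bundled local tracing hook (ray.util.tracing.setup_local_tmp_tracing:setup_tracing) instead of the Tempo exporter used in our setup, Serve listens on its default port 8000, and ray[serve], requests, and the OpenTelemetry packages are installed. The Echo deployment and the reproduce function are hypothetical names.

```python
# Assumption: Ray's bundled local tracing hook stands in for a Tempo exporter.
TRACING_HOOK = "ray.util.tracing.setup_local_tmp_tracing:setup_tracing"
# Assumption: Serve's default HTTP port; /-/routes is the endpoint named above.
ROUTES_URL = "http://127.0.0.1:8000/-/routes"


def reproduce(timeout_s: float = 10.0) -> str:
    """Start Serve with tracing enabled and probe the routes endpoint."""
    # Imported here so the module loads even where Ray is not installed.
    import ray
    import requests
    from ray import serve

    @serve.deployment
    class Echo:
        def __call__(self, request) -> str:
            return "hello"

    ray.init(_tracing_startup_hook=TRACING_HOOK)
    try:
        serve.run(Echo.bind())
        try:
            resp = requests.get(ROUTES_URL, timeout=timeout_s)
            return f"responded: {resp.status_code}"
        except requests.exceptions.Timeout:
            # A timeout here would match the hang described in this issue.
            return "timed out"
    finally:
        serve.shutdown()
        ray.shutdown()
```

Calling reproduce() should return "timed out" if the hang occurs, or "responded: 200" if the endpoint answers normally.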

Issue Severity

Medium: It is a significant difficulty but I can work around it.

venkatkalluru added the bug and triage labels Jan 8, 2025
jcotant1 added the serve label Jan 9, 2025