[data] sort large dataset by ray.data.Dataset always fail #49679

Open
Yanghello opened this issue Jan 7, 2025 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@Yanghello

What happened + What you expected to happen

I ran a sort on a large dataset (1000 million rows x 1000 columns, about 1 TB in total) with the following test code:

import ray

ray.init()

ctx = ray.data.DataContext.get_current()
ctx.use_push_based_shuffle = True

data_path = "hdfs://ip:port/home/data/testdata/1y_rows_1000_columns/"

data = ray.data.read_csv(data_path)

data = data.sort("id")

data = data.materialize()
print(data.count())
print(data.schema())
print(data.take(10))

When I run this on my Ray cluster (40 workers, each with 16 CPUs and 64 GB of memory), it always fails with a worker-died error. Is there some configuration I am missing for sorting large data?
Screenshot 2025-01-07 18 59 26

Versions / Dependencies

ray == 2.10.0

Reproduction script

import ray

ray.init()

ctx = ray.data.DataContext.get_current()
ctx.use_push_based_shuffle = True

data_path = "hdfs://ip:port/home/data/testdata/1y_rows_1000_columns/"

data = ray.data.read_csv(data_path)

data = data.sort("id")

data = data.materialize()
print(data.count())
print(data.schema())
print(data.take(10))

Issue Severity

None

@Yanghello Yanghello added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 7, 2025
@jcotant1 jcotant1 added the data Ray Data-related issues label Jan 7, 2025
@Jay-ju
Contributor

Jay-ju commented Jan 8, 2025

Could it be a memory issue?

@Yanghello
Author

Could it be a memory issue?

It seems to be a memory issue, but I don't know how to fix it.
I also found that Ray has a nightly benchmark that sorts 1 TB of data and works well on a 16c64g x 20 cluster. I don't know why it doesn't work on my cluster, which has double the resources.
https://docs.ray.io/en/releases-2.10.0/data/shuffling-data.html#enabling-push-based-shuffle
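
For anyone comparing the two setups, here is a rough back-of-the-envelope sketch of the memory budget for both clusters. It is not a diagnosis. It assumes Ray's documented default of reserving about 30% of node memory for the object store (the cluster in this issue may be configured differently), and it treats "1TB" as 1000 GB of on-disk data. The `Cluster` class and `report` helper are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass

# Assumption: about 30% of each node's memory goes to the object store
# (Ray's documented default). The real clusters may override this.
OBJECT_STORE_FRACTION = 0.3

# Assumption: "total 1TB" from the report, taken as 1000 GB of on-disk CSV.
DATASET_GB = 1000.0


@dataclass
class Cluster:
    """Hypothetical helper describing a homogeneous cluster."""
    name: str
    nodes: int
    mem_gb_per_node: float

    def total_mem_gb(self) -> float:
        return self.nodes * self.mem_gb_per_node

    def object_store_gb(self) -> float:
        return self.total_mem_gb() * OBJECT_STORE_FRACTION


def report(cluster: Cluster, dataset_gb: float) -> dict:
    """Summarize how the dataset size compares with the cluster's memory."""
    store = cluster.object_store_gb()
    return {
        "cluster": cluster.name,
        "total_mem_gb": cluster.total_mem_gb(),
        "object_store_gb": store,
        "dataset_gb_per_node": dataset_gb / cluster.nodes,
        "fits_in_object_store": dataset_gb <= store,
    }


mine = Cluster("16c64g x 40 (this issue)", nodes=40, mem_gb_per_node=64)
bench = Cluster("16c64g x 20 (nightly benchmark)", nodes=20, mem_gb_per_node=64)

for cluster in (mine, bench):
    print(report(cluster, DATASET_GB))
```

Under these assumptions, neither cluster can hold the whole dataset in the object store, so both would depend on object spilling during the sort, and total memory alone does not explain why one run succeeds and the other fails. The in-memory size of the parsed CSV can also differ from the on-disk size, and the nightly benchmark may run a different Ray version and input format than ray 2.10.0 with `read_csv`, so those differences are worth checking as well.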
