[data] sort large dataset by ray.data.Dataset always fail #49679

Yanghello · 2025-01-07T11:02:40Z

What happened + What you expected to happen

I ran a sort on a large dataset (1,000 million rows x 1,000 columns, about 1 TB in total) with the following test code:

import ray

ray.init()

ctx = ray.data.DataContext.get_current()
ctx.use_push_based_shuffle = True

data_path = "hdfs://ip:port/home/data/testdata/1y_rows_1000_columns/"

data = ray.data.read_csv(data_path)

data = data.sort("id")

data = data.materialize()
print(data.count())
print(data.schema())
print(data.take(10))

When I run this on my Ray cluster (40 workers, each with 16 CPUs and 64 GB of memory), it always fails with a "worker died" error. Is there some configuration I am missing for sorting large data?

Versions / Dependencies

ray == 2.10.0

Reproduction script

import ray

ray.init()

ctx = ray.data.DataContext.get_current()
ctx.use_push_based_shuffle = True

data_path = "hdfs://ip:port/home/data/testdata/1y_rows_1000_columns/"

data = ray.data.read_csv(data_path)

data = data.sort("id")

data = data.materialize()
print(data.count())
print(data.schema())
print(data.take(10))

Issue Severity

None

Jay-ju · 2025-01-08T07:35:20Z

Could it be a memory issue?

Yanghello · 2025-01-08T13:03:12Z

> Could it be a memory issue?

It seems to be a memory issue, but I don't know how to fix it.
I also found that Ray has a nightly benchmark that sorts 1 TB of data, and it works well on a 16c64g x 20 cluster. I don't know why it doesn't work on my cluster, which has double the resources.

https://docs.ray.io/en/releases-2.10.0/data/shuffling-data.html#enabling-push-based-shuffle
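To make the memory question concrete, here is a rough back-of-the-envelope sketch of the object store budget for the cluster sizes mentioned in this thread. It rests on two assumptions that are not stated in the issue: that each node's object store uses roughly 30% of its memory (Ray's documented default, which may be configured differently on a given cluster), and that the in-memory size of the data is about the same as the 1 TB on-disk size (parsed CSV data can be larger or smaller). The function names are hypothetical helpers, not Ray APIs.

```python
# Rough memory budget for the clusters discussed in this thread.
# Assumption: the object store gets ~30% of each node's memory (Ray's default).
# Assumption: the dataset occupies ~1000 GB in memory, same as on disk.

def object_store_budget_gb(num_workers, mem_per_worker_gb, object_store_fraction=0.3):
    """Total object store capacity across the cluster, in GB."""
    return num_workers * mem_per_worker_gb * object_store_fraction


def fits_in_object_store(dataset_gb, num_workers, mem_per_worker_gb,
                         object_store_fraction=0.3):
    """True if the whole dataset could sit in the object store at once."""
    budget = object_store_budget_gb(num_workers, mem_per_worker_gb,
                                    object_store_fraction)
    return dataset_gb <= budget


DATASET_GB = 1000  # "total 1TB" from the issue

# Cluster from this issue: 40 workers x 64 GB.
my_budget = object_store_budget_gb(40, 64)
# Cluster from the nightly benchmark mentioned above: 20 workers x 64 GB.
benchmark_budget = object_store_budget_gb(20, 64)

print(f"40 x 64 GB cluster: ~{my_budget:.0f} GB of object store")
print(f"20 x 64 GB cluster: ~{benchmark_budget:.0f} GB of object store")
print("1 TB fits in the 40-node object store:",
      fits_in_object_store(DATASET_GB, 40, 64))
```

Under these assumptions, neither cluster can hold the full dataset in the object store, so the sort has to spill to disk in both cases. Since the smaller benchmark cluster reportedly completes, spilling alone probably does not explain the failure. Differences in input format, block sizes, or per-task heap memory may matter more.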

Yanghello added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, eg: priority, bug/not-bug, and owning component) labels Jan 7, 2025

jcotant1 added the data (Ray Data-related issues) label Jan 7, 2025
