[cherry-pick] Fixed stratified splitting with Dask (#1883) #2494

Merged 1 commit on Sep 14, 2022

@jppgks commented Sep 14, 2022

This approach assumes that every partition fits in memory.

Let $P$ be the set of partitions of dataset $D$, and $S$ the set of unique values in the stratify column. When every partition is stratified-split individually, then $\forall s \in S:$

$$
\begin{align}
&\sum_{p \in P} \%_{train} \times | p_{S = s} | & \%_{train} \texttt{ of every partition's } s \texttt{ values will be in that partition's train split} \\
= &\%_{train} \times \sum_{p \in P} | p_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the sum of all } s \texttt{ value counts over all partitions} \\
= &\%_{train} \times | D_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the total number of } s \texttt{ values}
\end{align}
$$

Hence $\%_{train}$ of the records in $D$ with value $s$ in the stratify column land in the training set. The same argument applies to the validation and test splits.
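
The argument can be checked on a toy example. The following is a minimal sketch in plain Python, not the Dask-based code from this PR: the helper `split_partition` and the three hand-written partitions are hypothetical, and the per-partition counts are chosen so that 70% of each is a whole number. With other counts, per-partition rounding can make the totals differ slightly from the exact fraction.

```python
import random
from collections import Counter, defaultdict


def split_partition(labels, train_frac, rng):
    """Stratified split of ONE in-memory partition (hypothetical helper).

    Returns the sorted row indices assigned to the train split.
    """
    rows_by_value = defaultdict(list)
    for row, value in enumerate(labels):
        rows_by_value[value].append(row)

    train_rows = []
    for rows in rows_by_value.values():
        rng.shuffle(rows)
        # Take train_frac of this value's rows within this partition only.
        n_train = round(train_frac * len(rows))
        train_rows.extend(rows[:n_train])
    return sorted(train_rows)


# Toy stand-in for the partitions P of a dataset D; each list is the
# stratify column of one partition.
partitions = [
    ["a"] * 10 + ["b"] * 20,
    ["a"] * 30 + ["b"] * 10,
    ["a"] * 20 + ["b"] * 50,
]
train_frac = 0.7
rng = random.Random(0)

total = Counter()
train = Counter()
for labels in partitions:
    total.update(labels)
    train.update(labels[row] for row in split_partition(labels, train_frac, rng))

total_counts = dict(total)
train_counts = dict(train)

# Per value s: rows in the global train split vs. rows in D.
for value in sorted(total_counts):
    print(value, train_counts[value], total_counts[value])
```

Summing the per-partition train counts for each value gives 42 of 60 rows for `a` and 56 of 80 rows for `b`, which is 70% of each value's total, as the derivation predicts.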

@github-actions

Unit Test Results

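6 files ±0, 6 suites ±0, duration 2h 37m 16s ⏱️ (- 14m 56s)

| | Total | Passed ✔️ | Skipped 💤 | Failed |
|---|---|---|---|---|
| Tests | 3 385 (+3) | 3 290 (+3) | 78 (±0) | 17 (±0) |
| Runs | 10 155 (+9) | 9 863 (+9) | 258 (±0) | 34 (±0) |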

For more details on these failures, see this check.

Results for commit 10b65a5. ± Comparison against base commit 627cd36.

@tgaddair merged commit 8e98ad1 into release-0.6 on Sep 14, 2022
@tgaddair deleted the cp-stratified-split branch on September 14, 2022 at 21:38