
non-equi join with .EACHI fails unexpectedly (seemingly related to values in join columns) #4489

Closed
Henrik-P opened this issue May 25, 2020 · 4 comments · Fixed by #4493
Labels: non-equi joins

Henrik-P commented May 25, 2020

I have some time series with values. Apart from the different values, the structure of the data sets is the same. A small example:

library(data.table)

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d
#    time v
# 1:    1 5
# 2:    2 7
# 3:    3 6
# 4:    4 1
# 5:    5 8
# 6:    6 4
# 7:    7 2
# 8:    8 3

For each time, I want to determine the number of rows after the focal time where the values are larger than the focal value. E.g. for time 2 (value 7) there is one subsequent value which is higher (8 at time 5).

There may be better ways to reach my goal, but here's the code I used, which gave me the desired result:

d[d, on = .(time > time, v > v), .N, by = .EACHI]
#    time v N
# 1:    1 5 3
# 2:    2 7 1
# 3:    3 6 1
# 4:    4 1 4
# 5:    5 8 0
# 6:    6 4 0
# 7:    7 2 1
# 8:    8 3 0
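
For comparison, the same counts can be double-checked with a brute-force sapply over the rows (just a simple sketch to confirm what the non-equi join is meant to compute):

# brute-force cross-check: for each row, count the later rows with a strictly larger value
d[, .(time, v, N = sapply(seq_len(.N), function(i) sum(time > time[i] & v > v[i])))]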

The code works for most data, but in some rare instances it errors. After digging back and forth, I thought I had found the cause: the error is triggered when the value 'v' is set equal to the time on two rows. E.g.:

d[time == 2L, v := 2L]
d[time == 7L, v := 7L]

d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]

# This chunk of the message is the same as in the code above:

# i.time has same type (integer) as x.time. No coercion needed.
# i.v has same type (integer) as x.v. No coercion needed.
# Non-equi join operators detected ... 
# forder took ... forder.c received 8 rows and 2 columns
# 0.020s elapsed (0.000s cpu) 
# Generating non-equi group ids ... done in 0.000s elapsed (0.000s cpu) 
# Recomputing forder with non-equi ids ... Assigning to all 8 rows
# RHS_list_of_columns == false
# RHS for item 1 has been duplicated because NAMED==7 MAYBE_SHARED==1, but then is being plonked. length(values)==8; length(cols)==1)
# forder.c received 8 rows and 3 columns
# done in 0.000s elapsed (0.000s cpu)

# From here there are some differences in the message:
 
# Found 3 non-equi group(s) ...     # Above: Found 4 non-equi group(s) ...
# Starting bmerge ...
# forder.c received 8 rows and 2 columns
# bmerge done in 0.000s elapsed (0.000s cpu) 
# forder.c received 14 rows and 3 columns     # Above: forder.c received 11 rows and 3 columns
# Constructing irows for '!byjoin || nqbyjoin'

# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
# Join results in 18 rows; more than 16 = nrow(x)+nrow(i).
# Check for duplicate key values in i each of which join to the same group in x over and over again.
# If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.

However, if the value at time 5 is changed instead of the one at time 7, it works:

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d[time == 2L, v := 2L]
d[time == 5L, v := 5L]

d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]

What's going on here? (Most parsimonious guess: user error ;)) I think I'm slightly confused by the error message. Although "Join results in 18 rows; more than 16 = nrow(x)+nrow(i)." is true for the join per se, I use .EACHI to aggregate the data to fewer rows than the nrow(x)+nrow(i) limit. In addition, "try by=.EACHI" is indeed a great suggestion, but I already use it. I have clearly misunderstood something here.


Tried on Windows with R 3.6.3 and data.table 1.12.9 IN DEVELOPMENT (built 2020-02-20), and on Windows with R 4.0.0 and data.table 1.12.8.

Henrik-P added the non-equi joins label May 25, 2020
jangorecki (Member) commented May 25, 2020

d[d, on = .(time > time, v > v), .N, by = .EACHI, allow.cartesian=TRUE]

solves the problem, which looks to be an instance of #4383.
#4370 addresses that issue by providing the alternative option datatable.join.many, but unfortunately it doesn't work for non-equi joins as of now.
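
For reference, here is that workaround applied to the failing example from above (the count in the comment is worked out by hand from the modified data):

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d[time == 2L, v := 2L]
d[time == 7L, v := 7L]

# With allow.cartesian = TRUE the non-equi join completes; e.g. time = 2
# (now v = 2) has 5 later rows with a larger value.
d[d, on = .(time > time, v > v), .N, by = .EACHI, allow.cartesian = TRUE]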

Henrik-P (Author) commented May 25, 2020

Aha! Thanks a lot, Jan, for your rapid reply.

I (naively) thought that, simply stated, we should either use allow.cartesian=TRUE to allow results with more rows than nrow(x)+nrow(i), or use by = .EACHI to aggregate the result down to fewer rows, like here, where nrow(result) = nrow(x) = nrow(i) < nrow(x)+nrow(i). So allow.cartesian=FALSE applies its limit to the intermediate join result, not to the actual end result after .EACHI?
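
For contrast, a toy equi join of the kind I had in mind (my own illustrative example): with by = .EACHI the large intermediate join is never materialised, so no allow.cartesian error is raised even though there are 25 matches.

X = data.table(id = rep(1L, 5), v = 1:5)
Y = data.table(id = rep(1L, 5))

# X[Y, on = "id"] alone would error: the join would have 25 rows, more than
# 10 = nrow(X) + nrow(Y), so it requires allow.cartesian = TRUE.

# With by = .EACHI, j is evaluated per row of Y and only 5 aggregated rows
# are returned, so the equi join runs without allow.cartesian:
X[Y, on = "id", .N, by = .EACHI]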

arunsrinivasan (Member) commented May 25, 2020

For non-equi joins, the allow.cartesian condition is hard to check correctly (it requires rearranging the matching indices for some datasets, depending on how the non-equi groups were computed). We might have to rethink the logic of issuing this error once the logic for non-equi joins is corrected.

Perhaps, in the meantime, allow.cartesian could implicitly always be TRUE for non-equi joins alone. I am not sure when I'll be able to look into it next.
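
At the user level, that interim behaviour can be approximated with a small wrapper (a purely hypothetical sketch, not anything that exists in data.table): detect non-equi operators in a character on= spec and force allow.cartesian = TRUE.

# Hypothetical helper, not part of data.table: count matches per row of i,
# forcing allow.cartesian = TRUE whenever the character 'on' spec contains a
# non-equi operator such as >, >=, < or <=.
count_matches = function(x, i, on) {
  nonequi = any(grepl("[<>]", on))
  x[i, on = on, .N, by = .EACHI, allow.cartesian = nonequi]
}

count_matches(d, d, on = c("time>time", "v>v"))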

Henrik-P (Author) commented May 25, 2020

Arun, thanks a lot for your feedback. When I saw your "rethink the logic of issuing error", I realized I had forgotten to mention some thoughts on the error message in the OP. I have added them now, just so that they are "archived" somewhere, FWIW. Cheers.
