
non-equi join with .EACHI fails unexpectedly (seemingly related to values in join columns) #4489

Closed
Henrik-P opened this issue May 25, 2020 · 4 comments · Fixed by #4493
Labels: non-equi joins

Henrik-P commented May 25, 2020

I have some time series with values. Apart from the different values, the structure of the data sets is the same. A small example:

library(data.table)

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d
#    time v
# 1:    1 5
# 2:    2 7
# 3:    3 6
# 4:    4 1
# 5:    5 8
# 6:    6 4
# 7:    7 2
# 8:    8 3

For each time, I want to determine the number of rows after the focal time where the values are larger than the focal value. E.g. for time 2 (value 7) there is one subsequent value which is higher (8 at time 5).

There may be better ways to reach my goal, but here's the code I used, which gave me the desired result:

d[d, on = .(time > time, v > v), .N, by = .EACHI]
#    time v N
# 1:    1 5 3
# 2:    2 7 1
# 3:    3 6 1
# 4:    4 1 4
# 5:    5 8 0
# 6:    6 4 0
# 7:    7 2 1
# 8:    8 3 0
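
For comparison, the same counts can be double-checked with a brute-force sapply over the rows (just a simple sketch to confirm what the non-equi join is meant to compute):

# brute-force cross-check: for each row, count the later rows with a strictly larger value
d[, .(time, v, N = sapply(seq_len(.N), function(i) sum(time > time[i] & v > v[i])))]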

The code works for most data, but in some rare instances it errors. After digging back and forth, I thought I had found the cause: the error is triggered when the value 'v' is set equal to the time on two rows. E.g.:

d[time == 2L, v := 2L]
d[time == 7L, v := 7L]

d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]

# This chunk of the message is the same as in the code above:

# i.time has same type (integer) as x.time. No coercion needed.
# i.v has same type (integer) as x.v. No coercion needed.
# Non-equi join operators detected ... 
# forder took ... forder.c received 8 rows and 2 columns
# 0.020s elapsed (0.000s cpu) 
# Generating non-equi group ids ... done in 0.000s elapsed (0.000s cpu) 
# Recomputing forder with non-equi ids ... Assigning to all 8 rows
# RHS_list_of_columns == false
# RHS for item 1 has been duplicated because NAMED==7 MAYBE_SHARED==1, but then is being plonked. length(values)==8; length(cols)==1)
# forder.c received 8 rows and 3 columns
# done in 0.000s elapsed (0.000s cpu)

# From here there are some differences in the message:
 
# Found 3 non-equi group(s) ...     # Above: Found 4 non-equi group(s) ...
# Starting bmerge ...
# forder.c received 8 rows and 2 columns
# bmerge done in 0.000s elapsed (0.000s cpu) 
# forder.c received 14 rows and 3 columns     # Above: forder.c received 11 rows and 3 columns
# Constructing irows for '!byjoin || nqbyjoin'

# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
# Join results in 18 rows; more than 16 = nrow(x)+nrow(i).
# Check for duplicate key values in i each of which join to the same group in x over and over again.
# If that's ok, try by=.EACHI to run j for each group to avoid the large allocation.

However, if the value at time 5 is changed instead of the one at time 7, it works:

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d[time == 2L, v := 2L]
d[time == 5L, v := 5L]

d[d, on = .(time > time, v > v), .N, by = .EACHI, verbose = TRUE]

What's going on here? (Most parsimonious guess: user error ;)) I think I'm slightly confused by the error message. Although "Join results in 18 rows; more than 16 = nrow(x)+nrow(i)." is true for the join per se, I use .EACHI to aggregate the data to fewer rows than the nrow(x)+nrow(i) limit. In addition, "try by=.EACHI" is indeed a great suggestion, but I already use it. I have clearly misunderstood something here.


Tried on Windows with R 3.6.3 and data.table 1.12.9 IN DEVELOPMENT (built 2020-02-20), and on Windows with R 4.0.0 and data.table 1.12.8.

Henrik-P added the non-equi joins label May 25, 2020
jangorecki (Member) commented May 25, 2020

d[d, on = .(time > time, v > v), .N, by = .EACHI, allow.cartesian=TRUE]

solves the problem, which looks to be an instance of #4383.
#4370 addresses that issue by providing the alternative option datatable.join.many, but unfortunately it doesn't work for non-equi joins as of now.
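
For reference, here is that workaround applied to the failing example from above (the count in the comment is worked out by hand from the modified data):

set.seed(2)
d = data.table(time = 1:8, v = sample(8))
d[time == 2L, v := 2L]
d[time == 7L, v := 7L]

# With allow.cartesian = TRUE the non-equi join completes; e.g. time = 2
# (now v = 2) has 5 later rows with a larger value.
d[d, on = .(time > time, v > v), .N, by = .EACHI, allow.cartesian = TRUE]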

Henrik-P (Author) commented May 25, 2020

Aha! Thanks a lot, Jan, for your rapid reply.

I (naively) thought that, simply stated, we should either use allow.cartesian=TRUE to allow results with more rows than nrow(x)+nrow(i), or use by = .EACHI to aggregate the result down to fewer rows, like here, where nrow(result) = nrow(x) = nrow(i) < nrow(x)+nrow(i). So allow.cartesian=FALSE applies its limit to the intermediate join result, not to the actual end result after .EACHI?
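
For contrast, a toy equi join of the kind I had in mind (my own illustrative example): with by = .EACHI the large intermediate join is never materialised, so no allow.cartesian error is raised even though there are 25 matches.

X = data.table(id = rep(1L, 5), v = 1:5)
Y = data.table(id = rep(1L, 5))

# X[Y, on = "id"] alone would error: the join would have 25 rows, more than
# 10 = nrow(X) + nrow(Y), so it requires allow.cartesian = TRUE.

# With by = .EACHI, j is evaluated per row of Y and only 5 aggregated rows
# are returned, so the equi join runs without allow.cartesian:
X[Y, on = "id", .N, by = .EACHI]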

arunsrinivasan (Member) commented May 25, 2020

For non-equi joins, the allow.cartesian condition is hard to check correctly (it requires rearranging the matching indices for some datasets, depending on how the non-equi groups were computed). We might have to rethink the logic of issuing this error once the logic for non-equi joins is corrected.

Perhaps, in the meantime, allow.cartesian could implicitly always be TRUE for non-equi joins alone. I am not sure when I'll be able to look into it next.
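
At the user level, that interim behaviour can be approximated with a small wrapper (a purely hypothetical sketch, not anything that exists in data.table): detect non-equi operators in a character on= spec and force allow.cartesian = TRUE.

# Hypothetical helper, not part of data.table: count matches per row of i,
# forcing allow.cartesian = TRUE whenever the character 'on' spec contains a
# non-equi operator such as >, >=, < or <=.
count_matches = function(x, i, on) {
  nonequi = any(grepl("[<>]", on))
  x[i, on = on, .N, by = .EACHI, allow.cartesian = nonequi]
}

count_matches(d, d, on = c("time>time", "v>v"))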

Henrik-P (Author) commented May 25, 2020

Arun, thanks a lot for your feedback. When I saw your "rethink the logic of issuing error", I realized I had forgotten to mention some thoughts on the error message in the OP. I have added them now, just so that they are "archived" somewhere, FWIW. Cheers.
