Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setDF should drop indices #4889

Closed
OfekShilon opened this issue Feb 3, 2021 · 11 comments · Fixed by #4893
Closed

setDF should drop indices #4889

OfekShilon opened this issue Feb 3, 2021 · 11 comments · Fixed by #4893
Labels
Milestone

Comments

@OfekShilon
Copy link
Contributor

ff2.zip

Attached is a small zip archive of a saved data.table that demonstrates a problem where sometimes the [ operator gives misleading results.

load("./ff2")   # from the path you extracted the archive to
dim(ff[symbol == "ZM" & date == 20210128])
#  [1] 0 2     <---- wrong results, this subset is not empty

One way to show the correct result:

> dim(ff[symbol == "ZM"][date == 20210128])
# [1] 98  2

Many seemingly no-ops fix the data.table state. One example:

ff[, symbol:=symbol]
dim(ff[symbol == "ZM" & date == 20210128])
# [1] 98  2

Tested on both Windows and Ubuntu, and data.table versions 1.13.6 and 1.11.8 correspondingly. (the sessionInfo below is of the windows machine)

# Output of sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.13.6

loaded via a namespace (and not attached):
[1] bit_4.0.4 compiler_3.6.1 tools_3.6.1 bit64_4.0.5

@tlapak

This comment has been minimized.

@OfekShilon

This comment has been minimized.

@tlapak

This comment has been minimized.

@OfekShilon

This comment has been minimized.

@jangorecki

This comment has been minimized.

@OfekShilon

This comment has been minimized.

@jangorecki

This comment has been minimized.

@jangorecki jangorecki added the High label Feb 4, 2021
@jangorecki jangorecki added this to the 1.13.7 milestone Feb 4, 2021
@jangorecki jangorecki changed the title [ gives wrong results [.<- invalidates index Feb 4, 2021
@tdeenes
Copy link
Member

tdeenes commented Feb 4, 2021

I think the problem is with setDF and not with the assignment. From the manual of setDF:

All data.table attributes including any keys of the input data.table are stripped off.

The correct behaviour (and the adjusted doc) should read as:

All data.table attributes including any keys and indices of the input data.table are stripped off.

d <- data.table(x = 1)
setindex(d, "x")
indices(d)
# [1] "x"

setDF(d)
indices(d)
# [1] "x" -> should have been stripped off

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Feb 4, 2021

I agree with @tdeenes, raised this in the 2nd part of my comment above:

(2)
setDF clears a few DT-only attributes, but not index. That's very risky:

dt <- data.table(***)
# ... do stuff that creates indices
setDF(dt)
# ... do stuff that changes data, breaking the index
setDT(dt)
#  index attribute is still there, but the data doesn't match.

The 1st half still worries me too, but it might be a different bug.

Also @jangorecki note that the update that uses <- is performed on a data.frame, not data.table.

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Feb 4, 2021

Self contained repro (without loading anything, suitable for unit tests):

d <- data.table(a=1:100, b=1:100)
invisible(d[a == 50])   

setDF(d)
d[1:50, "a"] <- d[51:100, "a"]
setDT(d)

print(d[a == 99])    # one line, wrong result

# manual fix:
setindex(d,NULL)
print(d[a == 99])    # two lines, right result

@jangorecki jangorecki changed the title [.<- invalidates index setDF should drop indices Feb 4, 2021
@OfekShilon
Copy link
Contributor Author

I created a single line PR for this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants