mention options in importing vign #3578

jangorecki · 2019-05-22T06:50:01Z

No description provided.

MichaelChirico · 2019-05-22T07:00:13Z

vignettes/datatable-importing.Rmd

+## Avoiding package options
+
+Packages commonly expose global customization through the `options` function, and `data.table` is no exception. Avoid setting options of a dependency inside your own package, and take extra care when setting them as an end user, because options apply globally: to your package, to other packages, and to the user's code.
+Consider the case where `data.table` is imported by `pkgX`, and `pkgX` computes a join. If an end user then sets `options(datatable.nomatch=NULL)`, the join performed by `pkgX` becomes an inner join instead of an outer join. Options are generally safe when you work with `data.table` alone, or when the package you are developing is an internal package that works only with `data.table`. Remember that global options should be well documented.


A safe alternative is to set/reset options when needed, but this is strictly dominated for data.table because all options could be set explicitly in the function calls (I think?)

Discovered a case where it appears impossible to avoid using options() here:

https://twitter.com/michael_chirico/status/1126737388393230336
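The set/reset pattern mentioned above can be sketched in base R roughly as follows. `with_temp_options` is a hypothetical helper name used only for illustration; it is not part of data.table, and the option used in the usage line is just an example.

```r
# Hypothetical helper (not a data.table function): evaluate `expr` with some
# options temporarily changed, then restore the previous values on exit.
with_temp_options <- function(new, expr) {
  old <- options(new)                # options() returns the previous values
  on.exit(options(old), add = TRUE)  # restore even if `expr` raises an error
  force(expr)                        # `expr` is lazy, so it is evaluated only now
}

# Usage: the option is changed only while the expression runs.
with_temp_options(list(datatable.print.nrows = 5L),
                  getOption("datatable.print.nrows"))
```

Because the restore is registered with `on.exit`, the caller's settings come back even when the wrapped expression fails, which is the main reason to prefer this over a manual set-then-reset.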

codecov · 2019-05-22T07:01:46Z

Codecov Report

Merging #3578 into master will increase coverage by 0.6%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master    #3578     +/-   ##
=========================================
+ Coverage   97.58%   98.19%   +0.6%     
=========================================
  Files          66       66             
  Lines       12695    12922    +227     
=========================================
+ Hits        12389    12689    +300     
+ Misses        306      233     -73

Impacted Files	Coverage Δ
R/fcast.R	`99.4% <0%> (-0.01%)`	⬇️
src/rbindlist.c	`100% <0%> (ø)`	⬆️
src/froll.c	`100% <0%> (ø)`	⬆️
src/freadR.c	`100% <0%> (ø)`	⬆️
src/uniqlist.c	`100% <0%> (ø)`	⬆️
src/assign.c	`100% <0%> (ø)`	⬆️
src/wrappers.c	`100% <0%> (ø)`	⬆️
R/xts.R	`100% <0%> (ø)`	⬆️
src/init.c	`100% <0%> (ø)`	⬆️
src/vecseq.c	`100% <0%> (ø)`	⬆️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c68e95e...49485a8. Read the comment docs.

mattdowle · 2019-05-22T18:24:29Z

I fear it's not sufficient to just document this. There should be something more robust. I've been aware of this issue for some time and have not created any new options that affect behavior, only options to manage migration temporarily, with a statement in NEWS that those options will eventually be removed (e.g. datatable.old.unique.by.key). We have only one long-standing option that changes results: datatable.nomatch. If a user changes that option for their own usage (because they prefer nomatch=0L as the default), it could affect packages' behavior, even for packages that didn't set any datatable options and were completely compliant with the text in this PR.

datatable.naturaljoin is now the 2nd one. Can we find an acceptable way for natural join to be explicit so we don't need the new option?

Here are all the options as of now in dev: https://github.com/Rdatatable/data.table/blob/master/R/onLoad.R#L44. Other than those two (datatable.nomatch and datatable.naturaljoin), they all affect printing or optimization, or are temporary for migration.

This does mean we should try to remove datatable.nomatch too, or find a way for code inside packages to ignore its value.
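The concern in this comment can be illustrated with a toy stand-in written in base R. `toy_lookup`, `pkg_implicit` and `pkg_explicit` are hypothetical names, and the function only mimics the idea of an argument whose default is read from a global option; it is not data.table's actual join code.

```r
# Toy stand-in for a function whose default comes from a global option.
# nomatch = NA keeps unmatched keys as NA; nomatch = 0L or NULL drops them.
toy_lookup <- function(x, keys, nomatch = getOption("datatable.nomatch", NA)) {
  idx <- match(keys, names(x))
  if (is.null(nomatch) || identical(nomatch, 0L)) idx <- idx[!is.na(idx)]
  unname(x[idx])
}

x <- c(a = 1, b = 2)

# Two hypothetical "package" functions: one relies on the default,
# the other states the behaviour it needs explicitly.
pkg_implicit <- function() toy_lookup(x, c("a", "z"))
pkg_explicit <- function() toy_lookup(x, c("a", "z"), nomatch = NA)

old <- options(datatable.nomatch = 0L)  # an end user changes the global option
pkg_implicit()  # 1     (unmatched key silently dropped)
pkg_explicit()  # 1 NA  (unaffected by the user's option)
options(old)    # restore the previous setting
```

Under these assumptions, only code that passes the argument explicitly is protected, which is why documentation alone does not cover packages that simply relied on the default.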

mattdowle · 2019-05-22T22:23:26Z

The thing with on=.NATURAL, is the user needs to know what a natural join is. That name comes from SQL. A lot of them don't seem to know SQL and find SQL difficult. Also .NATURAL is a bit long to type.
How about on=.NAMES ?

MichaelChirico · 2019-05-23T00:08:28Z

suggestion of .NATURAL was to make the cognitive burden of new symbols hopefully minimal... you're right that to blank slate users "natural" is probably a big ?

I would suggest .SHARED might come across easier...

jangorecki · 2019-05-23T03:40:01Z

Agree that options are risky. There is a way to detect whether a call was made from a package, topenv(parent.frame(1)) (applied), so we could potentially use this. On the other hand, it introduces some inconsistency that could cause confusion, so it would probably be best to raise a message when such behaviour is detected. Filed #3585.

There are many potential options already mentioned...
.COMMON
.NATURAL
.NAMES
.SHARED
I think .NATURAL is the best because it is already used in an existing standard. That standard, SQL, happens to be the most widely recognised standard for data manipulation. So, as Michael said, we can avoid introducing new symbols. At least those who know SQL will pick it up easily.
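The package-detection idea from the first paragraph of this comment can be sketched with base R's `topenv()` and `isNamespace()`. `called_from_package` is a hypothetical helper name, and this is only an assumption about how such a check might look, not the code applied in data.table.

```r
# Hypothetical helper: TRUE when the calling code's top-level environment
# is a package namespace rather than, for example, the global environment.
called_from_package <- function(env = parent.frame()) {
  isNamespace(topenv(env))
}

# Passing environments explicitly shows the two cases:
called_from_package(globalenv())           # FALSE: user code
called_from_package(asNamespace("stats"))  # TRUE: code living in a package
```

A function could use a check like this to emit a message when a behaviour-changing option is in effect for a call that originates inside a package.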

mattdowle · 2019-05-23T17:39:38Z

Yes but to me, what is "natural" is X["ip576"] using the first column of X's key, like super-charged rownames. Anything else doesn't feel natural.
If I see X[.(id="ip576")] I know it's going to ignore the id= part and return the same as X["ip576"]. But if I see X[.(id="ip576"), on=.NAMES] then it feels natural that it's going to use the column names to join (id to id) in this case.

mattdowle · 2019-05-23T17:44:11Z

Blue sky idea: passing a DT to on= isn't used for anything iirc. How about X[on=Y, ...] doing what X[Y, on=.NATURAL] does now in dev. When on= is passed a data.table, it wouldn't be allowed to use i= too.
In the example I gave in previous comment, it would be X[on=.(id="ip576")].

MichaelChirico · 2019-05-23T17:47:57Z

> Blue sky idea

it's neat and could work, but are you proposing a replacement to the new feature? The fundamental use case of "X[Y] might reasonably be expected to do a natural join" remains...

mattdowle · 2019-05-23T17:55:33Z

> but are you proposing a replacement to the new feature?

just accessing the new feature with a different api, to avoid the new symbol and new option, and currently I'm quite liking it too even if it didn't avoid the new symbol and option.

> The fundamental use case of "X[Y] might reasonably be expected to do a natural join"

Fundamental to data.table is that X[Y] is a join using X's key like super-charged rownames, and that data.table has no rownames because key(X) replaces rownames. Even if newcomers do expect a natural join from X[Y], I can't see that changing in data.table because we like and use it a lot. There is an error message when X has no key and that suggests on=.

mention options in importing vign

87d8364

MichaelChirico reviewed May 22, 2019

jangorecki mentioned this pull request May 23, 2019

deprecate datatable.nomatch; message first step #3585

Closed

mattdowle mentioned this pull request May 29, 2019

add message that datatable.nomatch will be deprecated #3612

Merged

revised new section in vignette

49485a8

mattdowle added this to the 1.12.4 milestone May 31, 2019

mattdowle merged commit 06a5ed6 into master May 31, 2019

mattdowle deleted the vign-ops branch May 31, 2019 02:17

mattdowle mentioned this pull request May 31, 2019

X[on=Y] natural join #3621

Closed

MichaelChirico mentioned this pull request Jul 23, 2019

fwrite gains scipen #3716

Merged
