Metadata in data.tables #4804
Comments
Seems related: #623 |
Yes, very related. Have thumbs-upped that issue. Although
|
@gayyaM I'm not really following how this request extends the linked one. Could you please provide example code and expected results? |
Sure. Let me know if this is clear. The idea is that
|
I've been using
You can even attach the attribute to a column:
Maybe this helps? |
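A minimal sketch of attaching metadata as an attribute, at table level and at column level. It assumes data.table's `setattr()` (base R's `attr<-` would also work, possibly at the cost of a copy); the attribute name "metadata" and its contents are purely illustrative.

```r
library(data.table)

DT <- as.data.table(iris)

# Table-level metadata: set by reference, no copy of DT is made.
setattr(DT, "metadata", list(source = "datasets::iris"))

# Column-level metadata: the attribute lives on the column vector itself.
setattr(DT$Species, "metadata", list(Description = "iris species, as a factor"))

attr(DT, "metadata", exact = TRUE)$source
attr(DT$Species, "metadata", exact = TRUE)$Description
```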
I like the idea @fcocquemas. I've done this in the past, but often find the metadata gets lost after basic operations. Reprex:
require(data.table)
# Define some metadata.
md = list(x = "some metadata about iris::Species", y = "some more")
# Coerce iris to a data.table.
DT = data.table(iris)
# Assign some metadata at a variable level
attr(DT$Species, "metadata") = list(Description = md$x)
str(DT)
# Classes ‘data.table’ and 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..- attr(*, "metadata")=List of 1
# .. ..$ Description: chr "some metadata about iris::Species"
# - attr(*, ".internal.selfref")=<externalptr>
# Basic coercion.
DT[, Species := as.character(Species)]
str(DT)
# Classes ‘data.table’ and 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
# - attr(*, ".internal.selfref")=<externalptr>
# Metadata gone :( |
Yes, that's fair: when attached to a column, it will disappear when you transform the data. My use case is mostly long-term storage, so it's not a big issue. You could probably overload |
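One way to work around the loss shown in the reprex is to capture the attribute before the transformation and reattach it afterwards. The helper below is a sketch under that assumption; `transform_keep_attr` is a hypothetical name, not a data.table function.

```r
library(data.table)

# Hypothetical helper: replace a column by reference with fun(column),
# then reattach the named attribute that the transformation dropped.
transform_keep_attr <- function(DT, col, fun, attr_name = "metadata") {
  md <- attr(DT[[col]], attr_name, exact = TRUE)   # remember the metadata
  set(DT, j = col, value = fun(DT[[col]]))         # transform the column
  if (!is.null(md)) setattr(DT[[col]], attr_name, md)
  invisible(DT)
}

DT <- as.data.table(iris)
setattr(DT$Species, "metadata", list(Description = "some metadata about iris::Species"))

transform_keep_attr(DT, "Species", as.character)
class(DT$Species)
attr(DT$Species, "metadata", exact = TRUE)
```

This only covers transformations routed through the helper; any other operation that rebuilds the column will still drop the attribute.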
I've used There are so many times when this would be useful, especially with factor handling. If (for instance) you have a model for 50 states, it would be nice to automatically apply the same levels to a prediction dataset that has one state that doesn't happen to be Alabama. |
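The factor-level case could be sketched as follows: remember the training levels as an attribute and reuse them when building the prediction data. The attribute name "state_levels" and the toy data are assumptions made for illustration.

```r
library(data.table)

# Training data: the factor levels are what the model was fitted on.
train <- data.table(state = factor(c("Alabama", "Alaska", "Texas")))

# Store the levels as table-level metadata (hypothetical attribute name).
setattr(train, "state_levels", levels(train$state))

# Prediction data containing a single state that is not the first level.
newdata <- data.table(state = "Texas")
newdata[, state := factor(state, levels = attr(train, "state_levels", exact = TRUE))]

levels(newdata$state)      # same levels as the training data
as.integer(newdata$state)  # integer code matches the training encoding
```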
Another approach is to use an ancillary data.table as a "codebook"; optionally, that codebook can be an attribute of your main data.table. It's especially useful when you need to label graphs with human-readable labels instead of variable codes. Over the long run I've found it somewhat easier to keep track of variable definitions, units, and imputations in an entirely separate object. Unfortunately there's never been any widely used standard to annotate statistical datasets. My own code often includes things like:
library(data.table)
dt <- data.table(x = 1:5)
dt[, `:=`(y = 2 * x, z = x * x)]
dt.meta <- fread("
code, label, unit, type, description
x, weight, kg, numeric, long description
y, double weight, kg, numeric, long description with formula
z, squared weight, kg, numeric, long description with formula
")
Another way is to use Hmisc |
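A sketch of the "codebook as an attribute" variant mentioned above, assuming the codebook is attached with `setattr()` under the (hypothetical) attribute name "codebook" and queried with a small hypothetical helper, `label_for`.

```r
library(data.table)

dt <- data.table(x = 1:5)
dt[, `:=`(y = 2 * x, z = x * x)]

# Codebook built directly rather than parsed from text, to keep the sketch short.
dt.meta <- data.table(
  code  = c("x", "y", "z"),
  label = c("weight", "double weight", "squared weight"),
  unit  = "kg"
)

# Attach the codebook to the main table by reference.
setattr(dt, "codebook", dt.meta)

# Hypothetical helper: look up the human-readable label of a variable code.
label_for <- function(DT, var) {
  cb <- attr(DT, "codebook", exact = TRUE)
  cb[code == var, label]
}

label_for(dt, "y")
```

The label returned by `label_for` can then be passed to plotting code as an axis title instead of the raw variable code.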
Might also want to take a look at how labels, formats, etc. are implemented in the haven package. They're implemented as vectors with attributes but have some helper functions. These objects seem to behave reasonably well as columns in a data.table. |
That's an interesting approach, @mbacou. I've used the |
To add my two cents, I've been using the |
I'm now using R and data.table for the first time. My current workaround for storing metadata about an experiment is to define a method that preserves the attributes on a data.table when I add a new set of experimental observations:
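A sketch of what such a method might look like; this is not the commenter's own code. `rbind_keep_attrs`, the attribute name "description", and the toy data are hypothetical.

```r
library(data.table)

# Hypothetical helper: append new observations, then copy user-defined
# attributes from the original table onto the combined result.
rbind_keep_attrs <- function(DT, new_obs,
                             skip = c("names", "row.names", "class",
                                      ".internal.selfref", "sorted", "index")) {
  out <- rbindlist(list(DT, new_obs), use.names = TRUE)
  keep <- setdiff(names(attributes(DT)), skip)  # structural attributes are left alone
  for (a in keep) setattr(out, a, attr(DT, a, exact = TRUE))
  out
}

experiment <- data.table(run = 1:2, value = c(0.1, 0.2))
setattr(experiment, "description", "PRNG validation, batch A")

experiment <- rbind_keep_attrs(experiment, data.table(run = 3L, value = 0.3))
attr(experiment, "description", exact = TRUE)
```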
BTW I'd happily take suggestions on how to improve my coding style in R. My only prior experience with statistical experimentation was in the early 1990s, using S, to experimentally validate my PRNG package mrandom! |
AFAIK most of the design-energy around And... I believe I do understand why a data-analyst would want to have metadata on their I'll close this with an explanation of why I think the top-level documentation for Thanks for reading through this long explanation of my newbie-difficulties with |
TLDR
I want to avoid column names like `Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards)`, without relying on code comments and in a more native way (e.g. comments can't be saved in a `DT.rds` file on disk). Hence, a request to introduce a `metadata` function that could retrieve metadata optionally supplied by the user.
Problem
Pre-modelling, a large amount of time is spent collating and joining data from various sources and transforming some columns to make them just right for the model. Thus, a column may have been transformed multiple times, with various edge cases dealt with on a case-by-case basis.
Current solution
Use descriptive column names or comments. This can easily get unwieldy with column names like `Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards)`. Comments, on the other hand, can't be shared as part of the data.table.
Proposed solution
Introduce a `metadata` function that returns info stored by the user that optionally describes each column. This would probably also involve a `setmetadata` function or a `metadata<-` function? Not sure.
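To make the proposal concrete, here is a rough user-level sketch of what `metadata` and `setmetadata` could look like. Neither function exists in data.table; the names, signatures, and the choice to store everything in one table-level "metadata" attribute keyed by column name are all assumptions.

```r
library(data.table)

# Hypothetical setter: store a description for one column, by reference.
setmetadata <- function(DT, col, value) {
  md <- attr(DT, "metadata", exact = TRUE)
  if (is.null(md)) md <- list()
  md[[col]] <- value
  setattr(DT, "metadata", md)
  invisible(DT)
}

# Hypothetical getter: all metadata, or the entry for a single column.
metadata <- function(DT, col = NULL) {
  md <- attr(DT, "metadata", exact = TRUE)
  if (is.null(col)) md else md[[col]]
}

DT <- data.table(temp_z = c(-0.5, 0.1, 1.2))
setmetadata(DT, "temp_z",
            "Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards)")
metadata(DT, "temp_z")

# Attributes are serialised with the object, so the description survives an .rds round trip.
f <- tempfile(fileext = ".rds")
saveRDS(DT, f)
DT2 <- readRDS(f)
metadata(DT2, "temp_z")
```

Keeping the descriptions on the table rather than on each column means they are not tied to the column vector being replaced, although operations that build a new table may still drop the attribute.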