Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PosDefException with simple linear regression when x/y ranges are small #375

Open
mortenpi opened this issue Jun 5, 2020 · 5 comments
Open
Labels
Milestone

Comments

@mortenpi
Copy link

mortenpi commented Jun 5, 2020

I'm getting a PosDefException: matrix is not positive definite; Cholesky factorization failed when doing a simple lm(@formula(y ~ x), df) when the range of x and y values is tiny compared to their absolute value. Should be possible to replicate with this dataset:

δ(n; q=3) = mod(n, q)/(q-1) - 1
function dataset(N)
    x0, Δx, δx = 13.5, 1e-8, 1e-8
    y0, Δy, δy = 200, 1e-7, 2e-7
    DataFrame(
        x = x0 .+ Δx .* (-N:N) .+ δx .* δ.(1:2N+1; q=3),
        y = y0 .+ Δy .* (-N:N) .+ δy .* δ.(3:2N+3; q=5),
    )
end
df = dataset(10)
lm(@formula(y ~ x), df)

I have a full MWE here as a gist.

I can see how this is numerically an edge case, but I feel that it should ideally just work. For my purposes, I can pretty easily work around the issue by just shifting and scaling the x and y values, fitting and then scaling the coefficients back:

function linreg(df::DataFrame, x::Symbol, y::Symbol)
    xs, ys = df[:,x], df[:,y]
    xmin, xmax = extrema(xs)
    ymin, ymax = extrema(ys)
    _df = DataFrame(x=(xs .- xmin) ./ (xmax - xmin), y=(ys .- ymin) ./ (ymax - ymin))
    m = lm(@formula(y ~ x), _df)
    _b, _k = coef(m)
    k = _k * (ymax - ymin) / (xmax - xmin)
    b = _b * (ymax - ymin) + ymin - xmin*k
    return b, k
end

Seems to give a reasonable result:

fit

@mortenpi mortenpi changed the title PosDefException with simple linear regression when data variance is small PosDefException with simple linear regression when x/y ranges are small Jun 5, 2020
@andreasnoack
Copy link
Member

I think this is one of the relatively rare cases where you'd need the extra precision that the QR factorization provides relative to the Cholesky. The functionality for using QR is there but we don't expose it in a nice way so let's keep this one open to track the progress of that.

@mortenpi
Copy link
Author

mortenpi commented Jun 8, 2020

Would it make sense for the logic to be "Try Cholesky. If error, try QR"? I.e so that the user wouldn't have to do anything?

@andreasnoack
Copy link
Member

So far, we have aimed for type stability and that wouldn't be possible with the approach you are suggesting. I'd prefer that we just provide an easy way to switch to QR and we could maybe also consider suggesting the QR factorization version in the error message thrown when the Cholesky fails.

@casasgomezuribarri
Copy link

casasgomezuribarri commented Dec 17, 2020

Hi, sorry for reviving this. I am having a similar issue with LinearBinaryClassifier, and I am really lost about why it could be. The error I get when I try to fit the machine (I am using MLJ) is the same as described here:

ERROR: LoadError: TaskFailedException:
PosDefException: matrix is not positive definite; Cholesky factorization failed.

My dataset has 100 rows and 3500 columns, and all features are of similar magnitude. The minimum value across the entire X array is -0.016700 and the maximum value is 0.354000, so I wouldn't say the range is small compared to their absolute value.

Any idea of why this could be happening?

EDIT: I think I know what the reason might be now. Since my X dataset contains the infrared absorbance at several (3500) wavenumbers, it happens that some pairs of wavenumbers have the exact same absorbance for all samples (making them essentially the same variable). This has been solved in a few other issues by setting the third argument of lm() to true.

However, since my model is embedded inside a pipeline, my implementation is to wrap the pipeline in a machine and use fit!().

Is there a way of allowing collinearity in this context?

@load LinearBinaryClassifier pkg=GLM
LinearBinaryClassifierPipe = @pipeline(Standardizer(),
                                       OneHotEncoder(drop_last = true),
                                       LinearBinaryClassifier())
LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)

@nalimilan
Copy link
Member

@casasgomezuribarri You have much more variables than observations so it doesn't make since to fit a GLM. You probably want to apply some kind of dimensionality reduction algorithm like partial least squares.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants