Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clean missing data module #22

Merged

Conversation

imatiach-msft
Copy link
Contributor

@imatiach-msft imatiach-msft commented Jun 9, 2017

Add clean missing data module with modes:
1.) Replace with mean - replaces missings with mean of fit column
2.) Replace with median - replaces missings with approximate median of fit column
3.) Replace with custom value - replaces missings with custom value specified by user
As well as several smoke tests
Also added HasInputCols and HasOutputCols as traits for multiple columns (which will be needed when imputation will be added).
Resolved issue: Add support for missing value cleaner #11

@msftclas
Copy link

msftclas commented Jun 9, 2017

@imatiach-msft,
Thanks for your contribution as a Microsoft full-time employee or intern. You do not need to sign a CLA.
Thanks,
Microsoft Pull Request Bot

@rastala
Copy link
Contributor

rastala commented Jun 9, 2017

LGTM, let's remember to add example notebook once this gets into build.

@mmlspark-bot
Copy link
Contributor

createMockDataset
}

override def schemaForDataset: StructType = ???
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why leave this unimplemented?

@@ -121,6 +121,18 @@ trait HasOutputCol extends Wrappable {
def getOutputCol: String = $(outputCol)
}

trait HasInputCols extends Wrappable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great thanks but we should be very careful about versioning Core now that we are public. Is there a way to add a @since tag or would that be too superfluous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can definitely do this later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can do this later

Copy link
Contributor

@drdarshan drdarshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new estimator looks great, I'm concerned about changes to Core at this point even though they look useful. Do you see any concerns with adding new APIs? Could you please discuss with Eli/AK? Thanks!

@imatiach-msft
Copy link
Contributor Author

discussed, we can add the "since" tag later, we should add it to all APIs, similar to Spark

@imatiach-msft imatiach-msft force-pushed the ilmat/missing-value-cleaner branch 2 times, most recently from ee412f5 to 38abb47 Compare June 15, 2017 17:25
@imatiach-msft
Copy link
Contributor Author

Updated PR - added support for string, boolean types for custom value and added test cases

@mmlspark-bot
Copy link
Contributor

drdarshan
drdarshan previously approved these changes Jun 19, 2017
Copy link
Contributor

@drdarshan drdarshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@mmlspark-bot
Copy link
Contributor

[PASS Pass! — The build has succeeded.

MMLSpark 0.5.dev14+1.g44a495e

  • Source: imatiach-msft/mmlspark,
    ilmat/missing-value-cleaner at revision 44a495e
    (by Ilya Matiach [email protected]).

  • Build: MMLSpark, 119329
    (built by elbarzil-vm on elbarzil-vm, 2017-06-19 15:22)

  • Info: 0.5.dev14+1.g44a495e: imatiach-msft/mmlspark/ilmat/missing-value-cleaner@44a495ee; MMLSpark#119329

  • Queued by:
    Eli Barzilay for Eli Barzilay

Copy the link to this HDInsight Script Action to setup this build on a cluster.

This is a build for github PR #22

Associated Changes


@elibarzilay elibarzilay merged commit b2bea9c into microsoft:master Jun 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants