Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct the objective param when using fobj #1292

Merged

Conversation

sinnfashen
Copy link
Contributor

@sinnfashen sinnfashen commented Dec 8, 2021

Currently there will be no objective param if a fobj is provided.

That caused the LightGBM mistakenly think it's regression during initial setup as that's the [default value] for it(https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/config.h#L139).

It may seem harmless at first glance as we are providing grad and hess during iteration. But it actually makes the output incorrect by having the is_constant_hessian to true all the time (which is the case for unweighted regression), and cause the calculated sum_hessian split incorrect, which eventually leads to wrong model output.

The fix I propose here is simply to pass objective=custom rather than no objective in param when provided fobj.

I'm pretty new to Scala and Java as this is the first Scala project I've seen. I've tested things in sbt console. But not sure how to setup the environment properly as something fails when I run sbt test on master without any change. Would be really appreciate if anyone can point me to any updated doc I can reference. (the runme link in Contribute.md leads to 404).

Could also include code similar with the code below (which I used to debug the mismatch) as a test case if needed. You may also use the same code to verify the bug.

import com.microsoft.azure.synapse.ml.lightgbm.dataset.LightGBMDataset
import com.microsoft.azure.synapse.ml.lightgbm.params.FObjTrait
import com.microsoft.azure.synapse.ml.lightgbm._
import org.apache.spark.ml.feature.VectorAssembler
import scala.math.exp
import scala.util.Random

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
import spark.implicits._

val rng = new Random(1206)
val x:Seq[Double] = for (i <- 0 to 99) yield (rng.nextDouble)
val y = for (i <- 0 to 99) yield (if (rng.nextDouble > x(i)) 1 else 0)
val df = (x.zip(y).map {case (x,y) => (x,y,0)}).toDF("feature", "label", "init_score")
val converted_df = new VectorAssembler().setInputCols(Array("feature")).setOutputCol("features").transform(df)
converted_df.show()
converted_df.registerTempTable("temp_table")

def baseModel: LightGBMClassifier = {
  new LightGBMClassifier()
    .setFeaturesCol("features")
    .setNumLeaves(2)
    .setLambdaL2(0)
    .setLearningRate(0.1)
    .setMinGainToSplit(0)
    .setLabelCol("label")
    .setNumTasks(1)
    .setMaxDepth(1)
    .setVerbosity(1)
    .setNumThreads(1)
    .setMinDataInLeaf(20)
    .setInitScoreCol("init_score")
}
class CrossEntropy extends FObjTrait {
      override def getGradient(predictions: Array[Array[Double]],
                               trainingData: LightGBMDataset): (Array[Float], Array[Float]) = {
        // Get the labels
        val labels = trainingData.getLabel()
        // Compute gradient and hessian
        val probabilities = predictions.map(rowPrediction =>
          rowPrediction.map(prediction => 1.0 / (1.0 + exp(-prediction))))
        val grad =  probabilities.zip(labels).map {
          case (prob: Array[Double], label: Float) => (prob(0)-label).toFloat
        }
        val hess = probabilities.map(probabilityArray => (probabilityArray(0) * (1 - probabilityArray(0))).toFloat)
        (grad, hess)
      }
    }

baseModel.setNumIterations(1).setObjective("binary").fit(converted_df).transform(converted_df).show(10, false)

baseModel.setNumIterations(1).setFObj(new CrossEntropy()).fit(converted_df).transform(converted_df).show(10, false)

val native_model = baseModel.setNumIterations(3).setObjective("binary").fit(converted_df)
val native_res = native_model.transform(converted_df)
native_res.show(10, false)

// For binary classification: cross entropy is equivalent with binary.
val fobj_model = baseModel.setNumIterations(3).setFObj(new CrossEntropy()).fit(converted_df)
val fobj_res = fobj_model.transform(converted_df)
fobj_res.show(10, false)

native_res.select("rawPrediction").except(fobj_res.select("rawPrediction")).count == 0

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Copy link
Collaborator

I put some instructions in your raised issue and but I will also put the localbuild instructions here

https://microsoft.github.io/SynapseML/docs/reference/developer-readme/

@codecov-commenter
Copy link

codecov-commenter commented Dec 9, 2021

Codecov Report

Merging #1292 (084ce7c) into master (a261480) will increase coverage by 7.73%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1292      +/-   ##
==========================================
+ Coverage   75.62%   83.36%   +7.73%     
==========================================
  Files         273      273              
  Lines       13453    13453              
  Branches      689      689              
==========================================
+ Hits        10174    11215    +1041     
+ Misses       3279     2238    -1041     
Impacted Files Coverage Δ
...azure/synapse/ml/lightgbm/params/TrainParams.scala 100.00% <100.00%> (+12.50%) ⬆️
.../execution/streaming/continuous/HTTPSourceV2.scala 92.08% <0.00%> (-0.72%) ⬇️
...ynapse/ml/lightgbm/dataset/DatasetAggregator.scala 95.52% <0.00%> (+2.48%) ⬆️
...ure/synapse/ml/lightgbm/dataset/DatasetUtils.scala 61.11% <0.00%> (+2.77%) ⬆️
...icrosoft/azure/synapse/ml/io/http/HTTPSchema.scala 88.65% <0.00%> (+2.83%) ⬆️
...soft/azure/synapse/ml/lightgbm/LightGBMUtils.scala 74.50% <0.00%> (+3.92%) ⬆️
...ure/synapse/ml/io/http/SimpleHTTPTransformer.scala 92.18% <0.00%> (+6.25%) ⬆️
...osoft/azure/synapse/ml/core/contracts/Params.scala 95.65% <0.00%> (+6.52%) ⬆️
...re/synapse/ml/cognitive/CognitiveServiceBase.scala 75.96% <0.00%> (+6.97%) ⬆️
...re/synapse/ml/lightgbm/params/LightGBMParams.scala 84.46% <0.00%> (+7.57%) ⬆️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a261480...084ce7c. Read the comment docs.

Copy link
Contributor

@imatiach-msft imatiach-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find and fix!

@imatiach-msft
Copy link
Contributor

Interesting I don't see "custom" value for objective mentioned in the lightgbm docs here:
https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective

but I do see it for metric:
https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric
which has an alias for "None", "na" and "null"

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants