Releases: microsoft/SynapseML
v0.11.2-spark3.3
chore: bump to spark 3.3.1
v0.11.1-spark3.3
chore: make it so custom versions are possible
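The two tags above appear to follow a `v<version>-spark<spark version>` pattern. A minimal sketch of splitting such a tag into its parts, assuming that pattern holds; the helper name `parse_release_tag`, the `ReleaseTag` type, and the regular expression are illustrative and not part of SynapseML:

```python
import re
from typing import NamedTuple, Optional, Tuple


class ReleaseTag(NamedTuple):
    """Parsed pieces of a tag such as 'v0.11.2-spark3.3' (hypothetical type)."""
    version: Tuple[int, int, int]
    spark: Optional[str]


# Assumed tag shape: 'v<major>.<minor>.<patch>' with an optional '-spark<X.Y>' suffix.
_TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-spark(\d+\.\d+))?$")


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a release tag into a numeric version and an optional Spark suffix."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized release tag: {tag!r}")
    major, minor, patch, spark = match.groups()
    return ReleaseTag((int(major), int(minor), int(patch)), spark)
```

Tags without the suffix, such as `v0.10.1`, parse with `spark` set to `None`, so plain and Spark-specific builds can be compared on the numeric version alone.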
SynapseML v0.11.1
Bug Fixes 🐞
- set default values for aadToken & url for internal Synapse (#1918)
- ONNX model shape inference cannot handle batch with shape [-1] (#1906)
- forgot to add getPValue to python side (#1909)
- generate random dir for each test (#1908)
- add back diagnosticsInfo for MVAD (#1892)
- DML run get timeout if big dataset has more feature columns (Workaround Synapse Spark optimizer issue) (#1903)
- fix date parsing in FaceSuite test (#1896)
- fix Build pipeline (#1904)
- Retry OnnxHub call to improve test reliability (#1889)
- Normalize line-endings (#1883)
- Remove case matching for erased generic type (#1880)
- fix bug #1869, DML .setFitIntercept should be set to true (#1876)
- Remove extraneous "Foo" type from Py codegen (#1867)
- Allow variable size in ONNX inputs (#1851)
- Abstain from CodeQL for markdown-only changes (#1865)
- fix style
- update OpenAIEmbedding internalServiceType
Build 🏭
- bump peter-evans/create-or-update-comment from 2 to 3 (#1907)
- bump ossf/scorecard-action from 2.1.2 to 2.1.3 (#1898)
- bump amannn/action-semantic-pull-request from 5.1.0 to 5.2.0 (#1878)
- bump @sideway/formula from 3.0.0 to 3.0.1 in /website (#1874)
- bump webpack from 5.75.0 to 5.76.1 in /website (#1870)
Documentation 📘
- Fix installation instruction in the webpage for the build.sbt file (#1921)
- note discrete treatment data type (#1905)
- add custom chatbot creation to form demo (#1888)
- add overview page for simple DNN and fix some typos (#1879)
- Fix a typo in installation docs
- fix link issue in CONTRIBUTING.md (#1864)
- fix a few issues in cognitive service demo (#1861)
Features 🌈
- add streaming API for MVAD (#1893)
- [DistributionBalanceMeasure] Add implementation + unit tests for custom reference distribution (#1885)
- Add ChatGPT through the `OpenAIChatCompletion` transformer (#1887)
- support new api version of form recognizer (#1882)
- Add a new function to DMLModel, getPValue (#1863)
- update default internal endpoint for cog services (#1859)
Maintenance 🔧
- bump to v0.11.1 (#1933)
- Adding telemetry for the dataset metadata. This one is specially for … (#1917)
- fix r tests (#1927)
- fix build issues (#1916)
- disable test until Synapse is fixed (#1915)
- add .bloop to .gitignore (#1897)
- clean up old/missed search indexes in SearchWriterSuite (#1901)
- Add utility to clean azure search indexes
- update website docs to point to correct developer API docs (#1877)
- Update pipeline.yaml for Azure Pipelines (#1866)
- make sure nightly build has new commit
Changes:
- 866261c chore: bump to v0.11.1 (#1933)
- 3c09702 chore: Adding telemetry for the dataset metadata. This one is specially for … (#1917)
- 0d0d10c feat: add streaming API for MVAD (#1893)
- 1b71c1d chore: fix r tests (#1927)
- 0df97ad chore: fix build issues (#1916)
- 78695fb Update Regression - Vowpal Wabbit vs. LightGBM vs. Linear Regressor.ipynb (#1922)
- 87d5bc5 docs: Fix installation instruction in the webpage for the build.sbt file (#1921)
- 8320b2b fix: set default values for aadToken & url for internal Synapse (#1918)
- 4912ae4 chore: disable test until Synapse is fixed (#1915)
- 469445b fix: ONNX model shape inference cannot handle batch with shape [-1] (#1906)
- 3fa001e build: bump peter-evans/create-or-update-comment from 2 to 3 (#1907)
- f51327e Update LightGBM version to 3.3.5 (#1910)
- b1e584e fix: forgot to add getPValue to python side (#1909)
- a09a6f7 docs: note discrete treatment data type (#1905)
- 0fa3f2a fix: generate random dir for each test (#1908)
- 736c317 fix: add back diagnosticsInfo for MVAD (#1892)
- 13afff6 fix: DML run get timeout if big dataset has more feature columns (Workaround Synapse Spark optimizer issue) (#1903)
- 7546e7f build: bump ossf/scorecard-action from 2.1.2 to 2.1.3 (#1898)
- f227f02 fix: fix date parsing in FaceSuite test (#1896)
- 0f02626 fix: fix Build pipeline (#1904)
- ce9fe41 chore: add .bloop to .gitignore (#1897)
- 7ffa970 chore: clean up old/missed search indexes in SearchWriterSuite (#1901)
- 9a6cf03 chore: Add utility to clean azure search indexes
- 52919ce fix: Retry OnnxHub call to improve test reliability (#1889)
- 979c629 feat: [DistributionBalanceMeasure] Add implementation + unit tests for custom reference distribution (#1885)
- 412620a docs: add custom chatbot creation to form demo (#1888)
- 9f634a6 feat: Add ChatGPT through the `OpenAIChatCompletion` transformer (#1887)
- 7657089 fix: Normalize line-endings (#1883)
- c156792 feat: support new api version of form recognizer (#1882)
- ed842a5 docs: add overview page for simple DNN and fix some typos (#1879)
- 87e1c78 fix: Remove case matching for erased generic type (#1880)
- cd72bc9 build: bump amannn/action-semantic-pull-request from 5.1.0 to 5.2.0 (#1878)
- 564d047 fix: fix bug #1869, DML .setFitIntercept should be set to true (#1876)
- 392dbbf chore: update website docs to point to correct developer API docs (#1877)
- 129abde build: bump @sideway/formula from 3.0.0 to 3.0.1 in /website (#1874)
- 4d1c560 build: bump webpack from 5.75.0 to 5.76.1 in /website (#1870)
- 62c79d8 docs: Fix a typo in installation docs
- 1f63dab feat: Add a new function to DMLModel, getPValue (#1863)
- 83f8260 fix: Remove extraneous "Foo" type from Py codegen (#1867)
- a5bec45 fix: Allow variable size in ONNX inputs (#1851)
- 23c9b0a chore: Update pipeline.yaml for Azure Pipelines (#1866)
- dedcbda docs: fix link issue in CONTRIBUTING.md (#1864)
- a7f31d5 fix: Abstain from CodeQL for markdown-only changes (#1865)
- a5f38b1 Update DoubleMLEstimator test CI verification (#1862)
- a44f917 fix: fix style
- cc931af fix: update OpenAIEmbedding internalServiceType
- 424d586 feat: update default internal endpoint for cog services (#1859)
- e4a0e2c docs: fix ...
SynapseML v0.11.0
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.11.0 (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, Java, .NET, C#, and F#.
Highlights
| ChatGPT and GPT-4 at Scale | Simple Deep Learning | LightGBM v2 |
| --- | --- | --- |
| Intelligent chat and embeddings. Simplified Prompting APIs. | Train custom image and text classifiers with ease | Higher performance, >10x lower memory footprint, same API |
| View Notebook | Learn More | Try an example |

| ONNX Model Hub | Causal Learning | Vowpal Wabbit v2 |
| --- | --- | --- |
| Embed >150 state of the art deep networks into your pipelines | Discover and measure causal treatment effects | New second generation integration |
| Learn More | View Docs | Explore Samples |
New Features
General ✨
- R Support is no longer Beta! (#1586)
- Support for Spark 3.2.3
Open AI 🤖
- Add OpenAI Prompt Template support (#1843)
- Add Azure OpenAI embedding support (#1832)
- Add Azure Active Directory authentication for OpenAI (#1829)
- Add Null-value handling for OpenAI models (#1854)
Deep Learning 🕸
- Remove CNTK functionality and replace with ONNX (#1593)
- Add the `DeepTextClassifier`, a simple API for fine-tuning a wide array of Hugging Face 🤗 text transformers using PyTorch Lightning (#1591)
- Add the `DeepVisionClassifier`, a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)
Azure Cognitive Services for Big Data 🧠
- Add `SpeakerEmotionInference` transformer to generate emotion annotation tags for emotive reading in `SpeechToText` (#1691)
- Add new AnalyzeText API (#1760)
- Support Azure Active Directory (AAD) authentication for the cognitive services (#1778, #1797)
- Move different cognitive services into sub packages (#1746)
- Add audiobook generation example (#1852)
- Add a notebook for advanced cognitive service usage (#1825)
- Upgrade MVAD to v1.1 (#1788)
- Remove MVAD's dependence on hardwired credentials and azure SDKs (#1629)
- Add word-level timing to `SpeechToTextSDK` and `ConversationTranscription` (#1801)
- Add the `descriptionExcludes` parameter to AnalyzeImage (#1590)
Causal Learning 📈
- Add the causal `DoubleMLEstimator` for learning causal treatment effects from data (#1715)
- Add a DoubleMLEstimator document and sample notebook (#1730)
- Fix DML regression bug, should remove both treatment and outcome columns as feature columns (#1820)
- Add TreatmentCol type checking (#1816)
- Update test to validate ATE value should be positive (#1821)
- Fix issue with missing causal test coverage (#1799)
LightGBM 🌳
- Add LightGBM streaming execution mode for more reliable performance with orders of magnitude less memory. (#1580)
- Add maxNumClasses param to LightGBMClassifier for multi-class (#1841)
- Added the `passThroughArgs` feature which allows users to set low level LGBM parameters before they are wrapped in SparkML (#1749)
Vowpal Wabbit 🐇
- Vowpal Wabbit v2 (#1579):
  - Support Vowpal Wabbit input format using VowpalWabbitGeneric model
  - Support additional algorithms & label types (multi-class, cost sensitive one against all): sample notebook
  - Progressive validation (aka 1-step ahead) using VowpalWabbitGenericProgressive
  - New Contextual Bandit Offline Policy Evaluation Notebook
  - Data parallel training independent of cluster size
Additional Updates
Bug Fixes 🐞
- Support grayscale images in `toNDArray` (#1592)
- Adjust learning rate in VW example notebook (#1853)
- Correct copy/paste error in acr cleanup (#1838)
- Fix synapse test config, and isolation forest notebook (#1833)
- Add spark config to fix ArrayStoreException (#1757)
- Fix breeze NoSuchMethodError (#1807)
- Fix `modelVersion` param in TextAnalytics (#1756)
- Make logging infrastructure consistent and add logging checks (#1755)
- Fix website sidebars and vulnerabilities in packages (#1753)
- Remove Vowpal Wabbit exclusion, add Interpretability exclusion (#1708)
- Update isolation forest notebook (#1696)
- Remove error on invalid columns in DropColumns (#1695)
- Fix PyArrow failure in deeplearning test (#1689)
- Fix linked service setters on cog service base class (#1685)
- KernelSHAP throws error when the key type in the ZipMap output is LongType (#1656)
- Fix flaky translate tests (#1643)
- Fix speechToTextSuite serialization Fuzzing failure (#1626)
- Fix translator endpoint and update all endpoints for gov regions (#1623)
- Binder runtime issues (#1598)
- Clean up cluster if Databricks tests pass ([#1599](https://github....
SynapseML v0.10.2
Bug Fixes 🐞
- remove Vowpal Wabbit exclusion, add Interpretability exclusion (#1708)
- remove synapse E2E testing exclusion - cyber ml (#1699)
- update isolation forest notebook (#1696)
- don't throw on invalid columns in DropColumns (#1695)
- fix pyarrow failure in deeplearning test (#1689)
- fix linked service on cog service base (#1685)
- fix Uplift Modelling style
- KernelSHAP throws error when the key type in the ZipMap output is LongType (#1656)
- fix flaky translate tests (#1643)
- update ubuntu to 20.04 in pipeline (#1624)
Build 🏭
- bump actions/checkout from 2 to 3 (#1737)
- bump loader-utils from 2.0.2 to 2.0.3 in /website (#1709)
- bump amannn/action-semantic-pull-request from 5.0.1 to 5.0.2 (#1688)
- bump amannn/action-semantic-pull-request from 4 to 5.0.1 (#1680)
Documentation 📘
- update developer readme instruction on python env creation (#1693)
- fix multiple typos and update error hintings in ai-samples-timeseries notebook (#1663)
- improve error msg to make it clearer for users and fix typos (#1662)
- simplify data downloading and add mlflow to uplift modelling (#1659)
- move magic command forward since it restarts interpreter
- remove unused docs and fix links
- improve example notebooks
- add aisample uplift modelling (#1640)
- fix command to launch jupyter notebook (#1649)
- add mlflow in ai samples time series forecasting (#1645)
- add mlflow logging and loading (#1641)
- update spark version in Readme
- improve readme overview
- add aisample on text classification (#1617)
Features 🌈
- add simple deep learning text classifier (#1591)
- Add SpeakerEmotionInference transformer for generating SSML t… (#1691)
- Deprecate CNTK objects (#1712)
- Remove CNTK functionality and replace with ONNX (#1593)
- R test generation (#1586)
Maintenance 🔧
- bump version to 0.10.2 (#1738)
- fix style (#1736)
- automate clean-acr with github action workflow (#1735)
- autodelete old models (#1729)
- Making secrets optional and cached (#1726)
- add secret scanning infrastructure (#1724)
- Move new ImageFeaturizer to onnx namespace (#1711)
- ScalaStyle fixes (#1716)
- update scalatest and scalactic (#1706)
- remove synapse test exclusions (#1698)
- pin az and python versions (#1705)
- fix ado integration (#1704)
- remove notebooks (#1703)
- fix reopen comment action
- fix reopen on comment workflow
- fix typo in issue reopen yaml
- re open github issues after a comment (#1676)
- clean up github workflows and add issue label remover (#1674)
- turn off failing synapse tests temporarily (#1658)
- added `synapse-internal` to platform detector function (#1651)
- publish test jars
- improve test coverage (#1631)
- Remove MVAD's dependence on hardwired credentials and azure SDKs (#1629)
- clean up TextAnalytics cog service APIs (#1622)
Testing 💚
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.
Changes:
- cd1d2ea chore: bump version to 0.10.2 (#1738)
- fd78889 build: bump actions/checkout from 2 to 3 (#1737)
- c806ba7 chore: fix style (#1736)
- e6b5a90 feat: add simple deep learning text classifier (#1591)
- 1de2d55 chore: automate clean-acr with github action workflow (#1735)
- 952d1bd clarify date comparisons when deleting old models/groups (#1733)
- 6ea02bd chore: autodelete old models (#1729)
- 8b02e1d chore: Making secrets optional and cached (#1726)
- c62c6ad test: Additional E2E testing infrastructure (#1727)
- aeb2ff7 feat: Add SpeakerEmotionInference transformer for generating SSML t… (#1691)
- 0b96cc5 chore: add secret scanning infrastructure (#1724)
- 2a7a67b feat: Deprecate CNTK objects (#1712)
- e38e3ad chore: Move new ImageFeaturizer to onnx namespace (#1711)
- 0ff6802 test: Improve ONNXtests reliability (#1713)
- fe4c5d2 chore: ScalaStyle fixes (#1716)
- 050b541 build: bump loader-utils from 2.0.2 to 2.0.3 in /website (#1709)
- f2e88fd feat: Remove CNTK functionality and replace with ONNX (#1593)
- abdfe19 fix: remove Vowpal Wabbit exclusion, add Interpretability exclusion (#1708)
- 6a1f994 chore: update scalatest and scalactic (#1706)
- 144674f chore: remove synapse test exclusions (#1698)
- 32c654b chore: pin az and python versions (#1705)
- c8fba28 chore: fix ado integration (#1704)
- 92d4095 chore: remove notebooks (#1703)
- a953780 fix: remove synapse E2E testing exclusion - cyber ml (#1699)
- b257c70 fix: update isolation forest notebook (#1696)
- 9120b05 using predictionCol for isolation forest (#1686) [ #1060 ]
- 448f6b7 Remove trident.mlflow APIs. (#1687)
- f4af33f fix: don't throw on invalid columns in DropColumns (#1695)
- c531bbb docs: update developer readme instruction on python env creation (#1693)
- 467e651 build: bump amannn/action-semantic-pull-request from 5.0.1 to 5.0.2 (#1688)
- 302831f fix: fix pyarrow failure in deeplearning test (#1689)
- e857511 fix: fix linked service on cog service base (#1685)
- f29318a build: bump amannn/action-semantic-pull-request from 4 to 5.0.1 (#1680)
- 50ac0c8 Update reopen-issue-on-comment.yml
- c9278b5 chore: fix reopen comment action
- b3a9ba9 chore: fix reopen on comment workflow
- 9fe273b chore: fix typo in issue reopen yaml
- a7c50de chore: re open github issues after a comment (#1676)
- 8914750 chore: clean up github workflows and add issue label remover (#1674)
- 965231a docs: fix multiple typos and update error hintings in ai-samples-timeseries notebook (#1663)
- 4fa7249 docs: improve error msg to make it clearer for users and fix typos (#1...
SynapseML v0.10.1
Bug Fixes 🐞
- fix speechToTextSuite serializationFuzzing failure (#1626)
- fix translator endpoint and update all endpoints for gov regions (#1623)
- binder runtime issues (#1598)
- clean up cluster if databricks tests pass (#1599)
- fix deep-learning test flakiness (#1600)
- update dotnetTestBase assembly version (#1601)
- fix flaky forms test (#1584)
Build 🏭
- bump EnricoMi/publish-unit-test-result-action from 1 to 2 (#1609)
- bump actions/setup-node from 2 to 3 (#1610)
- bump actions/setup-python from 2.3.2 to 4.2.0 (#1611)
- bump actions/setup-java from 2 to 3 (#1612)
- simplify e2e test pipeline with test matrix
Documentation 📘
- add aisample notebooks into community folder (#1606)
- add aisample time series forecasting (#1614)
- fix .NET logo on website (#1604)
- improve OpenAI notebook (#1596)
- pin mybinder to v0.10.0 to avoid thrashing
- add demo into videos on website (#1581)
- update installation guidance of v0.10.0 (#1578)
- add more .net samples (#1570)
- add dotnet installation & example doc (#1567)
- Update issue template
Features 🌈
- add stale bot for issues (#1602)
- Support grayscale images in `toNDArray` (#1592)
- Add the `descriptionExcludes` parameter to AnalyzeImage (#1590)
- Added the `DeepVisionClassifier`, a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)
Maintenance 🔧
- bump to v0.10.1 (#1628)
- deprecate old Text analytics APIs to prepare for refactoring (#1627)
- remove deprecated lime APIs (#1620)
- update openai service to the official deployment, and disable test due to outage (#1619)
- Auto update GitHub actions with dependabot (#1608)
- hotfix binder badge
- pin binder version for users (#1607)
- Bump spark to 3.2.2
- bump spark version
- Format welcome message with emojis (#1583)
- Add welcome message to new PRs/Issues (#1573)
- Add GH workflow to label new/reopened issues (#1571)
- update website (#1566)
Testing 💚
- stabilize unit tests (#1576)
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.
Changes:
- 0f54bc6 chore: bump to v0.10.1 (#1628)
- 3d0f3f4 chore: deprecate old Text analytics APIs to prepare for refactor (#1627)
- 2052e13 chore: remove deprecated lime APIs (#1620)
- 09213b0 fix: fix speechToTextSuite serializationFuzzing failure (#1626)
- 9f78bf0 fix: fix translator endpoint and update all endpoints for gov regions (#1623)
- 7e90d19 docs: add aisample notebooks into community folder (#1606)
- ac40e5a chore: update openai service to official, and disable test due to outage (#1619)
- f54f7f6 docs: add aisample time series forecasting (#1614)
- 7b4b0e1 build: bump EnricoMi/publish-unit-test-result-action from 1 to 2 (#1609)
- 43b0d17 build: bump actions/setup-node from 2 to 3 (#1610)
- c48a07a build: bump actions/setup-python from 2.3.2 to 4.2.0 (#1611)
- b1a331c build: bump actions/setup-java from 2 to 3 (#1612)
- 78e40cb chore: Auto update github actions with dependabot (#1608)
- 69d2d20 chore: hotfix binder badge
- 93d7ccf chore: pin binder version for users (#1607)
- c7a61ec fix: binder runtime issues (#1598)
- c960c06 docs: fix .NET logo on website (#1604)
- 28a35b4 fix: clean up cluster if databricks tests pass (#1599)
- 5a28740 fix: fix deep-learning test flakiness (#1600)
- adf1a61 fix: update dotnetTestBase assembly version (#1601)
- c659b33 feat: add stale bot for issues (#1602)
- 05a4202 docs: improve OpenAI notebook (#1596)
- e019756 feat: Support gray scale images in `toNDArray` (#1592)
- 51beaa0 feat: Add the `descriptionExcludes` parameter to AnalyzeImage (#1590)
- b9ac22a docs: pin mybinder to v0.10.0 to avoid thrashing
- 1808a0f chore: Bump spark to 3.2.2
- 8e7d453 build: simplify e2e test pipeline with test matrix
- 8e34c7b chore: bump spark version
- 44c8ed5 feat: Added the `DeepVisionClassifier` a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)
- e4f0883 fix: fix flaky forms test (#1584)
- 7da5f49 chore: Format welcome message with emojis (#1583)
- 0e6bb35 Serena/update issue template (#1582)
- a6a2718 docs: add demo into videos on website (#1581)
- 7c34fc4 test: stabilize unit tests (#1576)
- 49f3a58 chore: Add welcome message to new PRs/Issues (#1573)
- 4868e8b Add back LightGBM library initialization in booster (#1575)
- d427b88 docs: update installation guidance of v0.10.0 (#1578)
- 55a60c9 docs: add more .net samples (#1570)
- 39fe2d8 chore: Add GH workflow to label new/reopened issues (#1571)
- 0febe3c docs: add dotnet installation & example doc (#1567)
- db95a10 chore: update website (#1566)
This list of changes was auto generated.
v0.10.0
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.10.0 (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, Java, .NET, C#, and F#.
Highlights
| OpenAI Language Models | .NET, C#, and F# Support | Full MLFlow Support | Live Demos in Browser |
| --- | --- | --- | --- |
| Embed 175-billion parameter models into your databases with ease | Use or train any SynapseML model from .NET | Quick and easy MLOps, model management, and autologging | Explore the SynapseML library with zero setup |
| Learn More | Getting Started Guide | Explore the Docs | Run in Browser |
New Features
General ✨
- SynapseML now supports .NET, C#, F#, and other .NET ecosystem languages in addition to Scala, Python, and R. Please see our Setup Guide and LightGBM from .NET example for more details. (#1539, #1156, #1443)
- SynapseML is now usable from your browser with zero setup using Binder. Quickly explore our demos in Binder. (#1487, #1493)
Azure Cognitive Services for Big Data 🧠
- Added OpenAI GPT-3 Sentence Completion Transformer. Use this feature to embed 175-billion parameter language models into distributed pipelines and databases to solve a variety of general purpose NLP tasks across natural language and code. (#1495, #1541)
- Added an example of Sentence Completion with GPT-3 (#1564)
- Added support for Form Recognizer V3.0 (#1269)
- Improved MVAD usability with async training and better data validation (#1477)
- Upgraded the univariate anomaly detection version to v1.1-preview (#1440)
- Added a multivariate anomaly detection sample notebook (#1365)
- Added a Text to Speech example to cognitive service overview (#1350)
- Added opinion mining to TextSentiment Models (#1449)
- Fixed Azure Maps schemas (#1553)
- Removed modelID param validators in FormRecognizerV3 (#1551)
- Fixed form recognizer and form ontology learner issues (#1506)
- Fixed `setServiceName` python method in OpenAI (#1498)
- Fixed error in Text Analytics Analyze schema
- Improved error handling for MVAD (#1448, #1391)
- Removed unused concurrency parameter for MVAD (#1383)
- Improved robustness of flood risk notebook by adding polling (#1427)
Responsible AI at Scale 😇
- Added partial dependence plots (PDP) to allow for understanding how independent variables affect a model's prediction (#1426)
- Updated ICE/PDP documentation with PDP-based feature importance and additional examples (#1441, #1352)
- Added a notebook for ICE and PDP feature explainers (#1318)
- Updated data balance documentation to better describe how it can be used to ensure model fairness (#1540)
MLFlow 🔃
- Added documentation for MLFlow autologging (#1508)
- Added documentation on the SynapseML-MLFlow integration (#1428)
LightGBM on Spark 🌳
- Added the ability to pass in generic argument strings to LightGBM enabling many complex parameterizations (#1444)
- Added seed parameters to LightGBM (#1387)
- Added a method to get LightGBM native model string directly (#1515)
- Fixed issue with validation data creation during `useSingleDataset` mode (#1527)
- Fixed multiclass training with initial scores (#1526)
- Fixed saving LightGBM model iterations with early stopping (#1497)
- Fixed issue where chunk size parameter was incorrectly specified during data copy (#1490)
- Fixed issue where an empty partition is chosen as the main worker in `singleDatasetMode` (#1458)
- Fixed bug with data repartitioning in `LightGBMRanker` (#1368)
- Fixed outdated docs for `useSingleDatasetMode` (#1562)
- Refactored LightGBM class structure to improve logging and debugging (#1557)
Vowpal Wabbit 🐇
- Fixed issues with the `saveNativeModel` for the VWRegressionModel #1364 (#1366)
- Fixed issues with building quadratic interaction terms (#1460)
Isolation Forests 🌲
Additional Updates
Maintenance 🔧
- Removed unused debugging code (#1546)
- Remove Synapse test exclusion for Explanation Dashboard notebook (#1531)
- Made python style checks verbose (#1532)
- Fixed library checking while installing library on Databricks cluster (#1488)
- Upgraded and fixed Dockerfiles (#1472)
- Added Developer Docker Image build to pipeline (#1480)
- Fixed ADO area path in Issue Linker (#1464)
- Fix master version badge display
- Improved Databricks error reporting
- Updated azure cli to stop build errors
- Fixed SSL handshake flakiness
- Added `itsdangerous` as a dependency to ADB tests (#1412)
- Turned on debug for pr to work item workflow
- Pointed pr linker to official implementation
- Changed GitHub action trigger from pull_request_target to pull_request (#1413)
- Fixed issue where Unit Tests were not executing ([#1409](https://github.com/Microsoft/SynapseML/issu...
SynapseML v0.9.5
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.9.5 (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.
Highlights
New Features
Geospatial Intelligence 🗺️
- Added support for distributed geospatial queries backed by the Azure Maps API
- Added the geospatial usage overview (#1339)
- Explore how to use the geospatial intelligence services to analyze flood risks. (#1339)
- Added the `AddressGeocoder` transformer to map informal addresses to standardized addresses with latitude and longitude (#1294)
- Added the `ReverseGeocoder` transformer to map latitude and longitude measurements to standardized addresses. (#1339)
- Added `CheckPointInPolygon` to detect if latitude and longitude queries lie inside regions of interest (#1339)
Azure Cognitive Services for Big Data 🧠
- Added the Healthcare Analytics Transformer for extracting medical information, entities, and relationships from text. [Example Usage] (#1329)
- Added the `FitMultivariateAnomaly` estimator for training custom anomaly detection models on DataFrames of multivariate time series data (#1272)
- Added example notebook for Multivariate Anomaly Detector
- See how to train a custom Multivariate Anomaly detector in the Estimators reference docs (#1323)
- Added simplified Text Analytics transformers that support auto-batching (#1329)
- Added the `TextToSpeech` Transformer for transforming Dataframes of text to audio files with neural voice synthesis (#1320)
- Added the `TextAnalyze` transformer to support executing multiple text analytics workloads within a single API call (#1267, #1312)
Responsible AI at Scale 😇
- Added Individual Conditional Expectation explanations and Partial Dependence Plots with the `ICETransformer`. This tool gives detailed explanations of how features in opaque-box models affect the model prediction. (#1284)
- Learn about how to use the ICETransformer through an example with the Adult Census dataset
MLFlow 🔃
LightGBM on Spark 🌳
- Improved LightGBM training performance 4x-10x by setting num_threads to be cores-1 (#1282)
- Added the predict_disable_shape_check in LightGBM (#1273)
- Reduced temporary file bloat by creating the LightGBM native temp directory lazily (#1326)
- Added logging for number of columns and rows when creating datasets, set useSingleDatasetMode=True by default (#1222)
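The thread-count rule in the first item of this list (one thread fewer than the available cores) can be sketched as below. This is an illustration rather than SynapseML's implementation: the helper name `default_num_threads` is hypothetical, and the floor of one thread for single-core executors is an assumption added here.

```python
def default_num_threads(executor_cores: int) -> int:
    """Return cores - 1, leaving one core free for the rest of the executor.

    The lower bound of 1 is an assumption so that a single-core executor
    still gets a worker thread.
    """
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return max(1, executor_cores - 1)
```

For an 8-core executor this yields 7 threads; a 1-core executor falls back to a single thread.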
Infrastructure 🏭
- SynapseML now installable from Maven Central!
- SynapseML now supports spark v3.2.x
Additional Updates
Bug Fixes 🐞
- Allowed FlattenBatch to propagate non-array values (#1286)
- Fixed flaky tests (#1342)
- Fixed website bugs and migrated docSearch (#1331)
- Fixed issue where IsolationForestModel does not properly exchange params with the inner model (#1330)
- Corrected the objective param when using fobj (#1292)
- Fixed issue where broadcasted sum in breeze 1.0 breaks in Spark 3.2.0 (#1299)
- Hotfixes for R test runners (#1283)
- fix installation instruction (#1268)
- Removing broadcast hint (#1255)
- fix install instructions (#1259)
Build 🏭
- bump algoliasearch-helper from 3.6.1 to 3.6.2 in /website (#1270)
- remove some deps that cause sec issues (#1264)
Documentation 📘
- Fixed broken link to CyberML notebook (#1322)
- Added website announcement bar (#1263)
- Updated and improved readme (#1262)
- Removed references to runme in contributing.md
- Supported Math expressions in website markdown (#1278)
- Corrected Synapse typo in website (#1335)
Maintenance 🔧
- Stopped lightGBM tests from timing out (#1315)
- Fixed r test flakiness (#1314)
- Updated VerifyLightGBMClassifier.scala (#1313)
- Update speech SDK test results
- Add in missing tests in build (#1300)
- Fix flaky build steps (#1298)
- Fix website telemetry (#1261)
- Add website telemetry (#1260)
- Added missing test classes to pipeline
Contributor Spotlight
We are excited to highlight the contributions of the following SynapseML contributors:
SynapseML v0.9.4
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.
Highlights
New Features
General ✨
- Renamed and rebranded! Microsoft ML for Apache Spark is now SynapseML
- New modular library sub-packages for standalone install of each major set of features
- Support Spark 3.1.2 and Scala 2.12
- Support `pip install synapseml` for python bindings
ONNX on Spark 🕸
- ONNX model inference on Spark (#1152)
- Add documentation and notebooks for ONNXModel evaluation (#1164)
Cognitive Services for Big Data🧠
- Added Multilingual Translation APIs (#1108) (Tutorial)
- Added FormRecognition APIs (Invoice, IDs, BusinessCards, Layouts, Custom Models) (#1099) (Tutorial)
- Added the FormOntologyLearner to extract meaningful "ontologies" of objects from collections of forms
- Add notebook to Create a Multilingual Search Engine from Forms
- Updated Text Analytics API to V3.1 (#1193)
- Add redactedText to PIIV3 (#1247)
- Added Personally Identifying Information (PII) identification
- Added Read API
- Added Conversation Transcription API
- Cognitive services now support data exfiltration protected (DEP) VNETs, allowing for individualized security solutions on Synapse Analytics (Learn More)
- Added support for the m4a codec in Speech to Text models
- Added predictive maintenance notebook
- Added Cognitive Service overview notebook
- Added support for linked service authentication in Synapse Analytics
- Simple no-code support in Synapse Analytics
Responsible AI at Scale 😇
- Added Additive Shapley Explanations (SHAP) for understanding the predictions of opaque-box models (#1077)
- New API for Locally Interpretable Model-Agnostic Explanations (LIME), which now supports background distributions and text models and has the same API as SHAP (#1077)
- Added Measure transformers for Data Balance Analysis (#1218)
- Add more notebook samples for documentation (#1043)
- Documentation and notebooks for Interpretability on Spark
- Introduce Responsible AI section on website (Interpretability + DataBalanceAnalysis) (#1241)
- Adding document and notebook for Data Balance Analysis (#1226)
- Explainable Boosting Machines for performant and interpretable ML (Private preview on Synapse Analytics only)
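For readers new to Shapley values, the additive attribution behind SHAP can be illustrated with a brute-force sketch in plain Python. This is not SynapseML's explainer API: the `shapley_values` helper, the toy model, and the all-zero baseline are hypothetical, and exact subset enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance by enumerating feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    # Hypothetical opaque-box model: two linear terms plus one interaction.
    a, b, c = features
    return 2.0 * a + 3.0 * b + a * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Additivity: the attributions sum to f(x) - f(baseline).
total = toy_model(x) - toy_model(baseline)
```

The interaction term `a * c` is split evenly between the two features involved, which is why the attributions add up to the difference between the prediction and the baseline prediction.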
Vowpal Wabbit 🐇
- Added ContextualBandit reinforcement learning (#896)
- Added Vowpal Wabbit Overview Notebook
LightGBM 🌳
- Added matrix type parameter and improve logic to automatically infer dataset sparsity (#1052)
- Added several parameters related to dart boosting type (#1045)
- Added chunk size parameter for copying java data to native (#1041)
- Added number of threads parameter (#1055)
- Added custom objective function to LightGBM learners (#1054)
- Added singleton dataset mode for faster performance and reduced memory usage (#1066)
- Add num iteration and start iteration parameters to LightGBM model (#1024)
- Added the average precision metric (#1034)
- Added overview notebook for LightGBM
- Moved to new streaming API for dense data to reduce memory usage
- Tuned chunking code for faster performance
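The average precision metric mentioned in the list above summarizes a ranked list of predictions as the mean of the precision values at each positive example. A minimal pure-Python sketch, assuming binary labels and no tied scores; the `average_precision` helper is hypothetical and is not the LightGBM or SynapseML implementation:

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive label occurs."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    # With no positives the metric is undefined; this sketch returns 0.0.
    return precision_sum / hits if hits else 0.0


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
```

Here the positives sit at ranks 1 and 3, so the result is the mean of 1/1 and 2/3.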
Build and Infrastructure Improvements 🏭
- New Docusaurus website generation system
- E2E Tests on Synapse Analytics (#1014)
- Split library into separately installable subprojects (#1073)
- Added a unified logging and telemetry system (#1019)
- Modernized R wrapper generation
- New Automated Python test generation (#998)
- New extensible code generation system
- New two-tiered security for build secrets
- Update ubuntu version to 18.04
- Automated back-up ACR images
Additional Updates
Bug Fixes 🐞
- Enable backwards compatibility for mmlspark python namespace imports (#1244)
- Fix publishing to maven and pypi (#1242)
- Fix broken link to notebook in Data Balance Analysis doc (#1240)
- min_data_in_leaf missing from dataset parameters in lightgbm (#1239)
- Fix performance issue in interpretability notebooks (#1238)
- Fixed cognitive service errors (#1176)
- Fixed flaky tests
- Rename NERPii to PII
- Fixed cog service test flakes
- Fixed setLinkedService issues in Synapse (#1177)
- Improved LGBM error message for invalid slot names (#1160)
- Fixed generated python code (#1121)
- Updated notebookUtils class path (#1118)
- Fixed LIME NaN weight output (#1117, #1112)
- Fixed Guava version issue in Azure Synapse and Databricks (#1103)
- Fixed flakiness in spark session stopping
- Fixed result parsing for forms
- Fixed explainers returning wrong results when targetClassesCol is specified
- Fixed CNTKModel issue due to catalyst bug on databricks (#1076)
- Fixed null handling in bing image response (#1067)
- Avoided strange issue with databricks json parser
- Fixed dependency exclusions and build secret querying
- Fixed issue in tabular lime sampler (#1058)
- Updated Bing search URLs (#1048)
- Refactored python wrappers to use common class (#758)
- Updated java params patch (#1027)
- Added missing returns in new python lightGBM model methods
- Stop R binding generation from failing silently
- Fixed conversation transcription participant column functionality
- Reduce verbosity to...
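The backwards-compatibility fix for mmlspark namespace imports listed above keeps old import paths working after the rename. One general way to achieve this in Python is to register the new module under the legacy name in `sys.modules`. The sketch below illustrates that technique only; it is not necessarily how #1244 implements it, and `new_pkg`, `old_pkg`, and `alias_legacy_namespace` are hypothetical names.

```python
import sys
import types

# Stand-in for the renamed package (hypothetical name and contents).
new_pkg = types.ModuleType("new_pkg")
new_pkg.greet = lambda: "hello from new_pkg"
sys.modules["new_pkg"] = new_pkg


def alias_legacy_namespace(old_name, new_name):
    """Make `import old_name` resolve to the already-loaded module `new_name`."""
    sys.modules[old_name] = sys.modules[new_name]


alias_legacy_namespace("old_pkg", "new_pkg")

# The import system checks sys.modules first, so the legacy name now works.
import old_pkg

legacy_greeting = old_pkg.greet()
```

Because both names point at the same module object, state and classes are shared instead of being loaded twice.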
SynapseML v0.9.2
v0.9.2
Bug Fixes 🐞
- fix publish to central maven (#1233)
- fix website (#1234)
- fix typo in sbt install
- lightgbm default params should not be specified if optional (#1232)
- fix website broken links (#1230)
- improve azure search writer error message in Array[Array[]] case
- update baseUrl and fix static images (#1217)
- Fixing flaky unit tests (#1215)
- Docker image should install openjdk-8-jre as opposed to default-… (#1211)
- Fixing flaky test
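The fix for optional LightGBM parameters in the list above is about passing only what the user explicitly set, so that the native library can apply its own defaults. A minimal sketch of that idea, assuming parameters are handed over as space-separated key=value pairs and that an unset parameter is represented by None; `build_param_string` and the example dictionary are hypothetical and not SynapseML code:

```python
def build_param_string(params):
    """Join explicitly set parameters as space-separated key=value pairs.

    Entries whose value is None are treated as "not set" and omitted, so the
    consumer can fall back to its own defaults.
    """
    return " ".join(
        f"{name}={value}" for name, value in params.items() if value is not None
    )


# Hypothetical wrapper state: only objective and num_leaves were set by the user.
user_params = {"objective": "binary", "num_leaves": 31, "max_depth": None}
param_string = build_param_string(user_params)
```

Omitting unset entries avoids overriding a default with a placeholder value chosen by the wrapper.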
Documentation 📘
- add explanation dashboard integration example notebook (#1236)
- fix links to developer readme and R setup (#1229)
Feat
- Build our new website (#1190)
Features 🌈
- support direct pip install (#1223)
- Measure transformers for Data Balance Analysis (#1218)
- Add the FormOntologyLearner
Maintenance 🔧
- release synapseml 0.9.2 (#1237)
Performance Improvements 🚀
- website enhancement (#1221)
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.
Changes:
- 81f5f80 chore: release synapseml 0.9.2 (#1237)
- 127c70a docs: add explanation dashboard integration example notebook (#1236)
- 9b9c2fb fix: fix publish to central maven (#1233)
- 7059573 fix: fix website (#1234)
- d47f014 fix: fix typo in sbt install
- 336eff5 fix: lightgbm default params should not be specified if optional (#1232)
- 3d92dd7 feat: support direct pip install (#1223)
- 2771853 docs: fix links to developer readme and R setup (#1229)
- ea91189 fix: fix website broken links (#1230)
- bbd8744 perf: website enhancement (#1221)
- c5e1742 feat: Measure transformers for Data Balance Analysis (#1218)
- 73c6a65 fix: improve azure search writer error message in Array[Array[]] case
- d8344c5 feat: Add the FormOntologyLearner
- 2d81b50 fix: update baseUrl and fix static images (#1217)
- e23041f fix: Fixing flaky unit tests (#1215)
- 5d31e3e fix: Docker image should install openjdk-8-jre as opposed to default-… (#1211)
- 9623b3e Feat: Build our new website (#1190)
- 3f74133 fix: Fixing flaky test
This list of changes was auto generated.