docs: add overview page for simple DNN and fix some typos (#1879)

* docs: add overview page for simple DNN and fix some typos
* docs: fixup docs for acrolinx
* fix acrolinx
* update
* update
* update
microsoft · Mar 21, 2023 · ed842a5
1 parent 87e1c78

Showing 13 changed files with 615 additions and 209 deletions.
diff --git a/website/.gitignore b/website/.gitignore
@@ -26,6 +26,7 @@
 !/docs/features/simple_deep_learning
 /docs/features/simple_deep_learning/*
 !/docs/features/simple_deep_learning/about.md
+!/docs/features/simple_deep_learning/installation.md
 !/docs/features/spark_serving
 /docs/features/spark_serving/*
 !/docs/features/spark_serving/about.md

diff --git a/website/docs/features/simple_deep_learning/about.md b/website/docs/features/simple_deep_learning/about.md
@@ -1,42 +1,76 @@
 ---
-title: Deep Vision Classification on Databricks
-sidebar_label: Deep Vision Classification on Databricks
+title: Simple Deep Learning with SynapseML
+sidebar_label: About
 ---
 
-:::note
-This is for databricks 10.4.x-gpu-ml-scala2.12 runtime
-:::
+### Why Simple Deep Learning
+Creating a Spark-compatible deep learning system can be challenging for users who may not have a 
+thorough understanding of deep learning and distributed systems. Additionally, writing custom deep learning 
+scripts may be a cumbersome and time-consuming task.
+SynapseML aims to simplify this process by building on top of the [Horovod](https://github.com/horovod/horovod) Estimator, a general-purpose 
+distributed deep learning model that is compatible with SparkML, and [PyTorch Lightning](https://github.com/Lightning-AI/lightning),
+a lightweight wrapper around the popular PyTorch deep learning framework.
 
-## 1. Reinstall horovod using our prepared script
+SynapseML's simple deep learning toolkit makes it easy to use modern deep learning methods in Apache Spark.
+By providing a collection of Estimators, SynapseML enables users to perform distributed transfer learning on
+Spark clusters to solve custom machine learning tasks without requiring in-depth domain expertise.
+Whether you're a data scientist, data engineer, or business analyst, this project aims to make modern deep-learning methods easy to use for new domain-specific problems.
 
-We build on top of torchvision, horovod and pytorch_lightning, so we need to reinstall horovod by building on specific versions of those packages.
-Download our [horovod installation script](https://mmlspark.blob.core.windows.net/publicwasb/horovod_installation.sh) and upload
-it to databricks dbfs.
+### SynapseML's Simple DNN
+SynapseML goes beyond the limited support for deep networks in SparkML and provides out-of-the-box solutions for various common scenarios:
+- Visual Classification: Users can apply transfer learning for image classification tasks, using pretrained models and fine-tuning them to solve custom classification problems.
+- Text Classification: SynapseML simplifies the process of implementing natural language processing tasks such as sentiment analysis, text classification, and language modeling by providing prebuilt models and tools.
+- And more coming soon
 
-Add the path of this script to `Init Scripts` section when configuring the spark cluster.
-Restarting the cluster will automatically install horovod v0.25.0 with pytorch_lightning v1.5.0 and torchvision v0.12.0.
+### Why Horovod
+Horovod is a distributed deep learning framework developed by Uber, which has become popular for its ability to scale
+deep learning tasks across multiple GPUs and compute nodes efficiently. It's designed to work with TensorFlow, Keras, PyTorch, and Apache MXNet.
+- Scalability: Horovod uses efficient communication algorithms like ring-allreduce and hierarchical allreduce, which allow it to scale the training process across multiple GPUs and nodes without significant performance degradation.
+- Easy Integration: Horovod can be easily integrated into existing deep learning codebases with minimal changes, making it a popular choice for distributed training.
+- Fault Tolerance: Horovod provides fault tolerance features like elastic training. It can dynamically adapt to changes in the number of workers or recover from failures.
+- Community Support: Horovod has an active community and is widely used in the industry, which ensures that the framework is continually updated and improved.
 
-## 2. Install SynapseML Deep Learning Component
-
-You could install the single synapseml-deep-learning wheel package to get the full functionality of deep vision classification.
-Run the following command:
-```powershell
-pip install https://mmlspark.blob.core.windows.net/pip/$SYNAPSEML_SCALA_VERSION/synapseml_deep_learning-$SYNAPSEML_PYTHON_VERSION-py2.py3-none-any.whl
-```
+### Why PyTorch Lightning
+PyTorch Lightning is a lightweight wrapper around the popular PyTorch deep learning framework, designed to make it 
+easier to write clean, modular, and scalable deep learning code. PyTorch Lightning has several advantages that 
+make it an excellent choice for SynapseML's Simple Deep Learning:
+- Code Organization: PyTorch Lightning promotes a clean and organized code structure by separating the research code from the engineering code. This property makes it easier to maintain, debug, and share deep learning models.
+- Flexibility: PyTorch Lightning retains the flexibility and expressiveness of PyTorch while adding useful abstractions to simplify the training loop and other boilerplate code.
+- Built-in Best Practices: PyTorch Lightning incorporates many best practices for deep learning, such as automatic optimization, gradient clipping, and learning rate scheduling, making it easier for users to achieve optimal performance.
+- Compatibility: PyTorch Lightning is compatible with a wide range of popular tools and frameworks, including Horovod, which allows users to easily use distributed training capabilities.
+- Rapid Development: With PyTorch Lightning, users can quickly experiment with different model architectures and training strategies without worrying about low-level implementation details.
 
-An alternative is installing the SynapseML jar package in library management section, by adding:
-```
-Coordinate: com.microsoft.azure:synapseml_2.12:SYNAPSEML_SCALA_VERSION
-Repository: https://mmlspark.azureedge.net/maven
-```
+### Sample usage with DeepVisionClassifier
+DeepVisionClassifier incorporates all models supported by [torchvision](https://github.com/pytorch/vision). 
 :::note
-If you install the jar package, you need to follow the first two cells of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md/#environment-setup----reinstall-horovod-based-on-new-version-of-pytorch)
-to make horovod recognize our module.
+The current version is based on pytorch_lightning v1.5.0 and torchvision v0.12.0
 :::
+Given a Spark DataFrame that contains an image column and a label column, you can call `fit` on it directly
+with DeepVisionClassifier.
+```python
+train_df = spark.createDataFrame([
+    ("PATH_TO_IMAGE_1.jpg", 1),
+    ("PATH_TO_IMAGE_2.jpg", 2)
+], ["image", "label"])
 
-## 3. Try our sample notebook
+deep_vision_classifier = DeepVisionClassifier(
+    backbone="resnet50", # Put your backbone here
+    store=store, # Corresponding store
+    callbacks=callbacks, # Optional callbacks
+    num_classes=17,
+    batch_size=16,
+    epochs=epochs,
+    validation=0.1,
+)
 
-You could follow the rest of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md) and have a try on your own dataset.
+deep_vision_model = deep_vision_classifier.fit(train_df)
+```
+DeepVisionClassifier runs distributed training on Spark with Horovod under the hood. After fitting, it returns
+a DeepVisionModel, which you can use for inference directly:
+```python
+pred_df = deep_vision_model.transform(test_df)
+```
 
-Supported models (`backbone` parameter for `DeepVisionClassifer`) should be string format of [Torchvision-supported models](https://github.com/pytorch/vision/blob/v0.12.0/torchvision/models/__init__.py);
-You could also check by running `backbone in torchvision.models.__dict__`.
+## Examples
+- [DeepLearning - Deep Vision Classification](../DeepLearning%20-%20Deep%20Vision%20Classification)
+- [DeepLearning - Deep Text Classification](../DeepLearning%20-%20Deep%20Text%20Classification)
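Note on the `about.md` changes above: the "Why Horovod" section names ring-allreduce without showing what it does. The following is a minimal single-process sketch of the idea, not Horovod's implementation. The function name `ring_allreduce` and the list-of-lists representation of workers are assumptions made for illustration; real workers exchange chunks over the network and typically average rather than sum.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length vectors across simulated workers arranged in a ring.

    Hypothetical helper for illustration only; returns one result vector per worker.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold vectors of the same length")

    # Split every vector into n contiguous chunks (some may be empty).
    bounds = [(k * length) // n for k in range(n + 1)]
    data = [list(g) for g in worker_grads]  # work on copies

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    def exchange(pick_chunk, combine):
        # Snapshot all outgoing payloads first so sends happen "simultaneously".
        sends = []
        for i in range(n):
            k = pick_chunk(i)
            sends.append((k, [data[i][j] for j in chunk(k)]))
        # Worker i always sends to its right-hand neighbour (i + 1) % n.
        for i, (k, payload) in enumerate(sends):
            dst = (i + 1) % n
            for j, value in zip(chunk(k), payload):
                data[dst][j] = combine(data[dst][j], value)

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        exchange(lambda i: (i - step) % n, lambda old, new: old + new)

    # Phase 2 (allgather): circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda i: (i + 1 - step) % n, lambda old, new: new)

    return data


print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Each worker sends and receives only one chunk per step, so the data moved per worker stays roughly constant as the ring grows, which is the property the scalability bullet refers to.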
diff --git a/website/docs/features/simple_deep_learning/installation.md b/website/docs/features/simple_deep_learning/installation.md
@@ -0,0 +1,42 @@
+---
+title: Installation Guidance
+sidebar_label: Installation Guidance for Deep Vision Classification
+---
+
+:::note
+This is a sample for the Databricks 10.4.x-gpu-ml-scala2.12 runtime.
+:::
+
+## 1. Reinstall horovod using our prepared script
+
+We build on top of torchvision, horovod and pytorch_lightning, so we need to reinstall horovod by building on specific versions of those packages.
+Download our [horovod installation script](https://mmlspark.blob.core.windows.net/publicwasb/horovod_installation.sh) and upload
+it to Databricks DBFS.
+
+Add the path of this script to the `Init Scripts` section when configuring the Spark cluster.
+Restarting the cluster automatically installs horovod v0.25.0 with pytorch_lightning v1.5.0 and torchvision v0.12.0.
+
+## 2. Install SynapseML Deep Learning Component
+
+You could install the single synapseml-deep-learning wheel package to get the full functionality of deep vision classification.
+Run the following command:
+```powershell
+pip install synapseml==0.11.0
+```
+
+Alternatively, install the SynapseML jar package in the library management section by adding:
+```
+Coordinate: com.microsoft.azure:synapseml_2.12:0.11.0
+Repository: https://mmlspark.azureedge.net/maven
+```
+:::note
+If you install the jar package, follow the first two cells of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md/#environment-setup----reinstall-horovod-based-on-new-version-of-pytorch)
+to ensure horovod recognizes SynapseML.
+:::
+
+## 3. Try our sample notebook
+
+You can follow the rest of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md) and try it on your own dataset.
+
+Supported models (the `backbone` parameter for `DeepVisionClassifier`) must be given as the string name of a [Torchvision-supported model](https://github.com/pytorch/vision/blob/v0.12.0/torchvision/models/__init__.py).
+You can also check a name by running `backbone in torchvision.models.__dict__`.
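Note on the `installation.md` backbone check above: a plain membership test against `torchvision.models.__dict__` also matches submodule names such as `resnet`, which are not model builders. The sketch below is one possible stricter check. The helper name `is_supported_backbone` is hypothetical, and a `SimpleNamespace` stands in for `torchvision.models` so the example runs without torchvision installed.

```python
from types import SimpleNamespace


def is_supported_backbone(name, models_namespace):
    """Return True if `name` is a public, callable attribute of the namespace.

    Hypothetical helper; `models_namespace` can be a module or any object
    with a `__dict__`.
    """
    if not isinstance(name, str) or name.startswith("_"):
        return False
    return callable(vars(models_namespace).get(name))


# Stand-in for `torchvision.models` (an assumption made for this sketch).
fake_models = SimpleNamespace(
    resnet50=lambda **kwargs: "resnet50 model",  # plays the role of a builder function
    resnet=object(),  # plays the role of a submodule, which is not callable
)

print(is_supported_backbone("resnet50", fake_models))
print(is_supported_backbone("resnet", fake_models))
print(is_supported_backbone("not_a_model", fake_models))
```

With torchvision installed, the same helper could be called as `is_supported_backbone("resnet50", torchvision.models)`. Note that model classes such as `ResNet` are callable too, so this check narrows the candidates but does not guarantee that a name is accepted as a `backbone`.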
diff --git a/...oned_docs/version-0.11.0/documentation/estimators/causal/_causalInferenceDML.md b/...oned_docs/version-0.11.0/documentation/estimators/causal/_causalInferenceDML.md
@@ -18,29 +18,30 @@ values={[
 ```python
 from synapse.ml.causal import *
 from pyspark.ml.classification import LogisticRegression
-from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType
+from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, BooleanType
 
 schema = StructType([
-    StructField("Treatment", IntegerType()),
-    StructField("Outcome", IntegerType()),
+    StructField("Treatment", BooleanType()),
+    StructField("Outcome", BooleanType()),
     StructField("col2", DoubleType()),
     StructField("col3", DoubleType()),
     StructField("col4", DoubleType())
     ])
 
+
 df = spark.createDataFrame([
-      (0, 1, 0.30, 0.66, 0.2),
-      (1, 0, 0.38, 0.53, 1.5),
-      (0, 1, 0.68, 0.98, 3.2),
-      (1, 0, 0.15, 0.32, 6.6),
-      (0, 1, 0.50, 0.65, 2.8),
-      (1, 1, 0.40, 0.54, 3.7),
-      (0, 1, 0.78, 0.97, 8.1),
-      (1, 0, 0.12, 0.32, 10.2),
-      (0, 1, 0.35, 0.63, 1.8),
-      (1, 0, 0.45, 0.57, 4.3),
-      (0, 1, 0.75, 0.97, 7.2),
-      (1, 1, 0.16, 0.32, 11.7)], schema
+      (False, True, 0.30, 0.66, 0.2),
+      (True, False, 0.38, 0.53, 1.5),
+      (False, True, 0.68, 0.98, 3.2),
+      (True, False, 0.15, 0.32, 6.6),
+      (False, True, 0.50, 0.65, 2.8),
+      (True, True, 0.40, 0.54, 3.7),
+      (False, True, 0.78, 0.97, 8.1),
+      (True, False, 0.12, 0.32, 10.2),
+      (False, True, 0.35, 0.63, 1.8),
+      (True, False, 0.45, 0.57, 4.3),
+      (False, True, 0.75, 0.97, 7.2),
+      (True, True, 0.16, 0.32, 11.7)], schema
 )
 
 dml = (DoubleMLEstimator()
@@ -63,18 +64,18 @@ import com.microsoft.azure.synapse.ml.causal._
 import org.apache.spark.ml.classification.LogisticRegression
 
 val df = (Seq(
-  (0, 1, 0.50, 0.60, 0),
-  (1, 0, 0.40, 0.50, 1),
-  (0, 1, 0.78, 0.99, 2),
-  (1, 0, 0.12, 0.34, 3),
-  (0, 1, 0.50, 0.60, 0),
-  (1, 0, 0.40, 0.50, 1),
-  (0, 1, 0.78, 0.99, 2),
-  (1, 0, 0.12, 0.34, 3),
-  (0, 0, 0.50, 0.60, 0),
-  (1, 1, 0.40, 0.50, 1),
-  (0, 1, 0.78, 0.99, 2),
-  (1, 0, 0.12, 0.34, 3))
+  (false, true, 0.50, 0.60, 0),
+  (true, false, 0.40, 0.50, 1),
+  (false, true, 0.78, 0.99, 2),
+  (true, false, 0.12, 0.34, 3),
+  (false, true, 0.50, 0.60, 0),
+  (true, false, 0.40, 0.50, 1),
+  (false, true, 0.78, 0.99, 2),
+  (true, false, 0.12, 0.34, 3),
+  (false, false, 0.50, 0.60, 0),
+  (true, true, 0.40, 0.50, 1),
+  (false, true, 0.78, 0.99, 2),
+  (true, false, 0.12, 0.34, 3))
   .toDF("Treatment", "Outcome", "col2", "col3", "col4"))
 
 val dml = (new DoubleMLEstimator()
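Note on the `_causalInferenceDML.md` changes above: the samples now pass booleans rather than 0/1 integers for "Treatment" and "Outcome". If existing rows still use integer flags, a small conversion step like the following may help before building the DataFrame. This is a sketch in plain Python; the helper name `to_boolean_rows` and its `flag_positions` parameter are assumptions, not part of SynapseML.

```python
def to_boolean_rows(rows, flag_positions=(0, 1)):
    """Convert 0/1 integer flags at the given tuple positions to booleans.

    Hypothetical helper; raises ValueError on any flag other than 0 or 1.
    """
    converted = []
    for row in rows:
        for pos in flag_positions:
            if row[pos] not in (0, 1):
                raise ValueError(
                    "expected 0 or 1 at position {}, got {!r}".format(pos, row[pos])
                )
        converted.append(
            tuple(
                bool(value) if index in flag_positions else value
                for index, value in enumerate(row)
            )
        )
    return converted


# First two rows of the old integer-based Python sample.
old_rows = [
    (0, 1, 0.30, 0.66, 0.2),
    (1, 0, 0.38, 0.53, 1.5),
]
print(to_boolean_rows(old_rows))
```

The converted rows can then be passed to `spark.createDataFrame` together with the `BooleanType` schema shown in the diff.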

diff --git a/website/versioned_docs/version-0.11.0/features/causal_inference/about.md b/website/versioned_docs/version-0.11.0/features/causal_inference/about.md
@@ -45,7 +45,12 @@ dml = (DoubleMLEstimator()
       .setOutcomeCol("Outcome")
       .setOutcomeModel(LogisticRegression())
       .setMaxIter(20))
-dmlModel = dml.fit(df)
+dmlModel = dml.fit(dataset)
+```
+> Note: all columns except "Treatment" and "Outcome" in your dataset will be used as confounders.  
+
+After fitting the model, you can get the average treatment effect and its confidence interval:
+```python
 dmlModel.getAvgTreatmentEffect()
 dmlModel.getConfidenceInterval()
 ```

diff --git a/...ion-0.11.0/features/cognitive_services/CognitiveServices - Create Audiobooks.md b/...ion-0.11.0/features/cognitive_services/CognitiveServices - Create Audiobooks.md
@@ -3,7 +3,7 @@ title: CognitiveServices - Create Audiobooks
 hide_title: true
 status: stable
 ---
-# Create Audiobooks using Neural Speech to Text
+# Create audiobooks using neural text to speech
 
 ## Step 1: Load libraries and add service information
 
@@ -38,11 +38,6 @@ spark.sparkContext._jsc.hadoopConfiguration().set(spark_key_setting, storage_key
 ```
 
 
-```python
-import os
-```
-
-
 ```python
 import os
 from os.path import exists, join

diff --git a/...versioned_docs/version-0.11.0/features/other/AzureSearchIndex - Met Artworks.md b/...versioned_docs/version-0.11.0/features/other/AzureSearchIndex - Met Artworks.md
@@ -0,0 +1,108 @@
+---
+title: AzureSearchIndex - Met Artworks
+hide_title: true
+status: stable
+---
+<h1>Creating a searchable Art Database with The MET's open-access collection</h1>
+
+In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using SynapseML. We use a subset of The MET's open-access collection and enrich it by passing it through 'Describe Image' and a custom 'Image Similarity' skill. The results are then written to a searchable index.
+
+
+```python
+import os, sys, time, json, requests
+from pyspark.ml import Transformer, Estimator, Pipeline
+from pyspark.ml.feature import SQLTransformer
+from pyspark.sql.functions import lit, udf, col, split
+```
+
+
+```python
+from pyspark.sql import SparkSession
+
+# Bootstrap Spark Session
+spark = SparkSession.builder.getOrCreate()
+
+from synapse.ml.core.platform import running_on_synapse, find_secret
+
+if running_on_synapse():
+    from notebookutils.visualization import display
+```
+
+
+```python
+cognitive_key = find_secret("cognitive-api-key")
+cognitive_loc = "eastus"
+azure_search_key = find_secret("azure-search-key")
+search_service = "mmlspark-azure-search"
+search_index = "test"
+```
+
+
+```python
+data = (
+    spark.read.format("csv")
+    .option("header", True)
+    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv")
+    .withColumn("searchAction", lit("upload"))
+    .withColumn("Neighbors", split(col("Neighbors"), ",").cast("array<string>"))
+    .withColumn("Tags", split(col("Tags"), ",").cast("array<string>"))
+    .limit(25)
+)
+```
+
+<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png" width="800" />
+
+
+```python
+from synapse.ml.cognitive import AnalyzeImage
+from synapse.ml.stages import SelectColumns
+
+# define pipeline
+describeImage = (
+    AnalyzeImage()
+    .setSubscriptionKey(cognitive_key)
+    .setLocation(cognitive_loc)
+    .setImageUrlCol("PrimaryImageUrl")
+    .setOutputCol("RawImageDescription")
+    .setErrorCol("Errors")
+    .setVisualFeatures(
+        ["Categories", "Description", "Faces", "ImageType", "Color", "Adult"]
+    )
+    .setConcurrency(5)
+)
+
+df2 = (
+    describeImage.transform(data)
+    .select("*", "RawImageDescription.*")
+    .drop("Errors", "RawImageDescription")
+)
+```
+
+<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png" width="800" />
+
+Before writing the results to a Search Index, you must define a schema that specifies the name, type, and attributes of each field in your index. Refer to [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information.
+
+
+```python
+from synapse.ml.cognitive import *
+
+df2.writeToAzureSearch(
+    subscriptionKey=azure_search_key,
+    actionCol="searchAction",
+    serviceName=search_service,
+    indexName=search_index,
+    keyCol="ObjectID",
+)
+```
+
+The Search Index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. For more information on querying, refer to [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents).
+
+
+```python
+url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06".format(
+    search_service, search_index
+)
+requests.post(
+    url, json={"search": "Glass"}, headers={"api-key": azure_search_key}
+).json()
+```