-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support writing to Databricks delta table using a module #1483
base: branch-24.06
Are you sure you want to change the base?
Conversation
delta_table_write_mode = module_config.get("DELTA_WRITE_MODE", "append") | ||
databricks_host=module_config.get("DATABRICKS_HOST", None) | ||
databricks_token=module_config.get("DATABRICKS_TOKEN", None) | ||
databricks_cluster_id=module_config.get("DATABRICKS_CLUSTER_ID", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add error checking here for any config parameter that might cause the builder.remote call to fail later, this gives us more immediate feedback during module loading instead of runtime.
@@ -0,0 +1,113 @@ | |||
# Copyright (c) 2022-2023, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Copyright (c) 2022-2023, NVIDIA CORPORATION. | |
# Copyright (c) 2024, NVIDIA CORPORATION. |
except ImportError as import_exc: | ||
IMPORT_EXCEPTION = import_exc | ||
|
||
def _extract_schema_from_pandas_dataframe(df: pd.DataFrame) -> "sql_types.StructType": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add at least one unit test for this helper function; mocking out everything to deal with the spark internals in the module's work unit probably isn't worth it right now, but having a test file gives us a spot for sprouting new tests later.
from databricks.connect import DatabricksSession | ||
from pyspark.sql import types as sql_types | ||
except ImportError as import_exc: | ||
IMPORT_EXCEPTION = import_exc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 'IMPORT_EXCEPTION' used anywhere? If we're not handling the failure to import via some kind of work around, then lets log an error here and re-raise the ImportErrror
module_config = builder.get_current_module_config() | ||
"""module_config contains all the required configuration parameters, that would otherwise be passed to the stage. | ||
Parameters | ||
---------- | ||
config : morpheus.config.Config | ||
Pipeline configuration instance. | ||
delta_path : str, default None | ||
Path of the delta table where the data need to be written or updated. | ||
databricks_host : str, default None | ||
URL of Databricks host to connect to. | ||
databricks_token : str, default None | ||
Access token for Databricks cluster. | ||
databricks_cluster_id : str, default None | ||
Databricks cluster to be used to query the data as per SQL provided. | ||
delta_table_write_mode: str, default "append" | ||
Delta table write mode for storing data. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module_config = builder.get_current_module_config() | |
"""module_config contains all the required configuration parameters, that would otherwise be passed to the stage. | |
Parameters | |
---------- | |
config : morpheus.config.Config | |
Pipeline configuration instance. | |
delta_path : str, default None | |
Path of the delta table where the data need to be written or updated. | |
databricks_host : str, default None | |
URL of Databricks host to connect to. | |
databricks_token : str, default None | |
Access token for Databricks cluster. | |
databricks_cluster_id : str, default None | |
Databricks cluster to be used to query the data as per SQL provided. | |
delta_table_write_mode: str, default "append" | |
Delta table write mode for storing data. | |
""" | |
""" | |
Parameters | |
---------- | |
config : morpheus.config.Config | |
Pipeline configuration instance. | |
delta_path : str, default None | |
Path of the delta table where the data need to be written or updated. | |
databricks_host : str, default None | |
URL of Databricks host to connect to. | |
databricks_token : str, default None | |
Access token for Databricks cluster. | |
databricks_cluster_id : str, default None | |
Databricks cluster to be used to query the data as per SQL provided. | |
delta_table_write_mode: str, default "append" | |
Delta table write mode for storing data. | |
""" | |
module_config = builder.get_current_module_config() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else; if you're feeling ambitious, you might also look at updating the delta lake writer stage so that it just loads this module; that way we dont' have duplicated code between the two.
/ok to test |
Adds support to write to delta lake table using module, this will close #1482