This is an Amazon EMR project for Python development with CDK.
The cdk.json file tells the CDK Toolkit how to execute your app.
This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the .venv
directory. To create the virtualenv it assumes that there is a python3 (or python
for Windows) executable in your path with access to the venv
package. If for any reason the automatic creation of the virtualenv fails,
you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:
$ python3 -m venv .venv
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
$ source .venv/bin/activate
If you are on a Windows platform, activate the virtualenv like this:
% .venv\Scripts\activate.bat
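The manual creation and activation steps above can also be scripted with Python's standard venv module. The following is a minimal sketch, not part of this project: the helper names create_virtualenv and activation_command are hypothetical, and the activation paths are simply the ones listed above.

```python
import os
import venv
from pathlib import Path


def create_virtualenv(env_dir: str = ".venv", with_pip: bool = True) -> Path:
    """Create a virtualenv, roughly equivalent to `python3 -m venv .venv`."""
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir)


def activation_command(env_dir: str = ".venv", os_name: str = os.name) -> str:
    """Return the command that activates the virtualenv on this platform."""
    if os_name == "nt":  # Windows
        return f"{env_dir}\\Scripts\\activate.bat"
    # macOS and Linux
    return f"source {env_dir}/bin/activate"
```

Calling create_virtualenv() from the project root and then running the string returned by activation_command() in your shell should leave you in the same state as the manual steps.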
Once the virtualenv is activated, you can install the required dependencies.
(.venv) $ pip install -r requirements.txt
At this point you can synthesize the CloudFormation template for this code.
To use an existing VPC, pass a context variable such as vpc_name=<your vpc name> (e.g. vpc_name='default').
(.venv) $ cdk synth -c vpc_name="your-vpc-name"
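If you drive the CDK Toolkit from a script, the -c key=value context arguments can be assembled programmatically. The sketch below is an illustration only: cdk_command is a hypothetical helper, and it assumes the cdk executable is on your PATH when the resulting command is eventually run.

```python
import shlex
from typing import Dict, List


def cdk_command(action: str, context: Dict[str, str]) -> List[str]:
    """Build an argv list for the CDK Toolkit with one -c flag per context entry."""
    argv = ["cdk", action]
    for key, value in context.items():
        argv += ["-c", f"{key}={value}"]
    return argv


synth = cdk_command("synth", {"vpc_name": "default"})
print(shlex.join(synth))  # cdk synth -c vpc_name=default
```

The same helper works for the deploy step by passing "deploy" as the action; the list form can be handed directly to subprocess.run without shell quoting concerns.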
Before deployment, you should create the default IAM roles EMR_EC2_DefaultRole and EMR_DefaultRole, which can be used when creating the cluster:
(.venv) $ aws emr create-default-roles
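To check whether the two default roles already exist before deploying, you could inspect the JSON printed by aws iam list-roles --output json. The following sketch assumes that output has a top-level "Roles" list whose entries carry a "RoleName" field; missing_default_roles is a hypothetical helper and the sample payload is made up for illustration.

```python
import json
from typing import List

# Role names taken from the step above.
REQUIRED_ROLES = ("EMR_DefaultRole", "EMR_EC2_DefaultRole")


def missing_default_roles(list_roles_json: str) -> List[str]:
    """Return the required EMR roles that are absent from the list-roles output."""
    roles = json.loads(list_roles_json).get("Roles", [])
    existing = {role["RoleName"] for role in roles}
    return [name for name in REQUIRED_ROLES if name not in existing]


# Hypothetical, trimmed-down output of `aws iam list-roles --output json`.
sample = json.dumps({"Roles": [{"RoleName": "EMR_DefaultRole"}]})
print(missing_default_roles(sample))  # ['EMR_EC2_DefaultRole']
```

An empty result suggests both roles are present; otherwise run aws emr create-default-roles as shown above.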
Use the cdk deploy command to create the stack shown above.
(.venv) $ cdk deploy -c vpc_name="your-vpc-name"
To add additional dependencies, for example other CDK libraries, just add
them to your setup.py file and rerun the pip install -r requirements.txt command.
$ aws ec2-instance-connect ssh --instance-id {Primary Node Instance Id (e.g., i-073e3a822dd3351a1)} --os-user hadoop
Last login: Wed Nov 22 06:34:50 2023
[Amazon Linux 2 ASCII-art banner]
Amazon Linux 2
AL2 End of Life is 2025-06-30.
A newer version of Amazon Linux is available!
Amazon Linux 2023, GA and supported until 2028-03-15.
https://aws.amazon.com/linux/amazon-linux-2023/
16 package(s) needed for security, out of 24 available
Run "sudo yum update" to apply all updates.
[EMR ASCII-art banner]
[hadoop@ip-172-31-3-191 ~]$
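The SSH command above needs the primary node's EC2 instance ID. One way to find it is to parse the output of aws emr list-instances --cluster-id your-emr-cluster-id --instance-group-types MASTER. The sketch below assumes the cluster uses instance groups (not instance fleets) and that the JSON has an "Instances" list with an "Ec2InstanceId" field per entry; primary_instance_id is a hypothetical helper and the sample payload is illustrative.

```python
import json
from typing import Optional


def primary_instance_id(list_instances_json: str) -> Optional[str]:
    """Return the first EC2 instance ID found in the list-instances output, if any."""
    instances = json.loads(list_instances_json).get("Instances", [])
    for instance in instances:
        ec2_id = instance.get("Ec2InstanceId")
        if ec2_id:
            return ec2_id
    return None


# Hypothetical, trimmed-down output reusing the example instance ID from above.
sample_instances = json.dumps({"Instances": [{"Ec2InstanceId": "i-073e3a822dd3351a1"}]})
print(primary_instance_id(sample_instances))  # i-073e3a822dd3351a1
```

The returned ID can then be passed to the --instance-id option of aws ec2-instance-connect ssh.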
Before cleaning up the EMR cluster, you need to turn off EMR termination protection.
(.venv) $ aws emr modify-cluster-attributes --cluster-id your-emr-cluster-id --no-termination-protected
(.venv) $ cdk destroy --force --all
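Because the order of the two cleanup commands matters, a small wrapper can make sure cdk destroy only runs after termination protection has been switched off successfully. This is a sketch under the assumption that both the aws and cdk executables are on your PATH; cleanup_commands and run_cleanup are hypothetical helpers.

```python
import subprocess
from typing import List


def cleanup_commands(cluster_id: str) -> List[List[str]]:
    """Return the cleanup commands in the order they must be executed."""
    return [
        ["aws", "emr", "modify-cluster-attributes",
         "--cluster-id", cluster_id, "--no-termination-protected"],
        ["cdk", "destroy", "--force", "--all"],
    ]


def run_cleanup(cluster_id: str) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for argv in cleanup_commands(cluster_id):
        subprocess.run(argv, check=True)
```

Calling run_cleanup("your-emr-cluster-id") raises subprocess.CalledProcessError if disabling termination protection fails, so the stack is not destroyed in that case.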
- cdk ls: list all stacks in the app
- cdk synth: emits the synthesized CloudFormation template
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare deployed stack with current state
- cdk docs: open CDK documentation
- Application versions in Amazon EMR 6.x releases
- Application versions in Amazon EMR 5.x releases
- Amazon EMR Best Practices Guides
- aws emr create-default-roles
- When I create an Amazon EMR cluster, I get an "EMR_DefaultRole is invalid" or "EMR_EC2_DefaultRole is invalid" error
- aws ec2-instance-connect ssh - Connect to your EC2 instance using your OpenSSH client.
- Spark 3.2.1 Configuration
- (video) Modern Data Lake Storage Layers - Hudi vs. Iceberg vs. DeltaLake
- An Introduction to Modern Data Lake Storage Layers - Hudi vs. Iceberg vs. DeltaLake
Enjoy!