This tutorial is inspired by the Kaggle competition RSNA-MICCAI Brain Tumor Radiogenomic Classification. We will use Cloudera Data Engineering (CDE) on Cloudera Data Platform (CDP) to transform the DICOM files produced by an MRI into PNG images.
In a future tutorial, we will use the PNG images to train a machine learning model to detect the presence of a protein found in certain brain cancers.
There are two (2) options for getting the assets for this tutorial:
It only contains the necessary files for this tutorial. Unzip tutorial-files.zip and remember its location.
It provides assets used in this and other tutorials; organized by tutorial title.
In addition to the files above, you will also need to download the test and train datasets from the Kaggle competition, RSNA-MICCAI Brain Tumor Radiogenomic Classification.
NOTE: The datasets use approximately 137 GB of storage. It will take some time to download and unzip the file.
Using AWS CLI, copy the train directory to your S3 bucket, defined by your environment’s storage.location.base attribute.
For example, if the property storage.location.base has the value s3a://usermarketing-cdp-demo, copy the train folder using the command:
aws s3 cp train s3://usermarketing-cdp-demo/train --recursive
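Note that storage.location.base uses the s3a:// scheme, while the AWS CLI command above uses s3://. As a minimal sketch, the destination can be derived from the property value like this. The variable names storage_location_base and bucket_url are only illustrative, and the value shown is the example bucket from this tutorial; substitute your own.

```shell
# Value of storage.location.base, copied from your environment
# (this is the example value used in this tutorial).
storage_location_base="s3a://usermarketing-cdp-demo"

# Strip the s3a:// scheme and prepend s3://, which the AWS CLI expects.
bucket_url="s3://${storage_location_base#s3a://}"

# Print the resulting copy command so it can be reviewed before running it.
echo "aws s3 cp train ${bucket_url}/train --recursive"
```

Printing the command first, rather than running it directly, gives you a chance to confirm the bucket name before starting a long upload.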
There are two (2) variables in the file spark-etl.py that need to be updated. Their values depend on the S3 location where you stored the data:
If your environment doesn’t already have a CDE Service enabled, let’s enable it.
Make other configuration changes (optional)
If you don’t already have a CDE virtual cluster created, let’s create it.
The prerequisites for this tutorial require you to already have the CDE CLI configured. If you need help configuring it, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.
On the command line, issue the following commands to create a CDE resource and upload the requirements.txt file to install required libraries in a new Python environment:
cde resource create --name rsna-etl --type python-env
cde resource upload --local-path requirements.txt --name rsna-etl
The Python environment will take a few minutes to build. You can issue the following command to check its status. When the status becomes ready, the environment can be used and you can submit jobs.
cde resource list --filter 'name[rlike]rsna'
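If you would rather not re-run the status command by hand, the check can be wrapped in a small polling loop. This is only a sketch: the helper names env_is_ready and wait_for_env are hypothetical, and it assumes the listing is printed as JSON containing a "status" field whose value becomes "ready". Adjust the pattern if your CLI version prints something different.

```shell
# Hypothetical helper: succeeds when the text on stdin contains a "ready" status.
# Assumption: the listing is JSON with a "status" field. If the filter matches
# more than one resource, any one of them being ready makes this succeed.
env_is_ready() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ready"'
}

# Hypothetical helper: poll every 30 seconds until the environment reports ready.
# There is no timeout, so interrupt it with Ctrl+C if the build appears stuck.
wait_for_env() {
  until cde resource list --filter 'name[rlike]rsna' | env_is_ready; do
    echo "Python environment still building..."
    sleep 30
  done
  echo "Python environment is ready."
}
```

After defining both functions in your shell, run wait_for_env and submit the job once it returns.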
Now that we have our Python environment set up, let’s run the Spark job, spark-etl.py, to transform the DICOM files produced by the MRI into PNG images.
IMPORTANT: Restrict access to the virtual cluster only to users that are allowed to access the AWS credentials used in the job.
In the command prompt, create two (2) environment variables to hold your AWS credentials, which are needed to write the PNG images into S3.
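As a sketch, the two variables can be created with export. The variable names match the ones referenced by the job submission command in this tutorial; the values shown are placeholders, not real credentials, and must be replaced with your own.

```shell
# Placeholders only: replace both values with your own AWS credentials.
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
```

The variables last only for the current shell session, so submit the job from the same terminal window.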
Run the job using the command:
cde spark submit --python-env-resource-name rsna-etl \
  --conf spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  --conf spark.executorEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --conf spark.executorEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  spark-etl.py
When the job completes, you can review the output using the command:
cde run logs --type "driver/stdout" --id #
Replace # with the job ID.
Finally, you can verify that the DICOM images have been transformed into PNG images using the following command. The files are located in the same S3 folder you specified, with _processed_images appended to the folder name.
aws s3 ls s3://usermarketing-cdp-demo/train_processed_images --recursive
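Because the recursive listing can be very long, it may be easier to count the PNG files than to read through them. The following is a sketch only: count_pngs is a hypothetical helper, and it assumes the output files end in .png.

```shell
# Hypothetical helper: count the lines on stdin that end in .png.
# Assumption: the job writes files with a .png extension.
count_pngs() {
  grep -c '\.png$'
}
```

For example, pipe the listing into it: aws s3 ls s3://usermarketing-cdp-demo/train_processed_images --recursive | count_pngs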