When implementing machine studying (ML) workflows in Amazon SageMaker Canvas, organizations would possibly want to contemplate exterior dependencies required for his or her particular use instances. Though SageMaker Canvas offers highly effective no-code and low-code capabilities for speedy experimentation, some initiatives would possibly require specialised dependencies and libraries that aren’t included by default in SageMaker Canvas. This put up offers an instance of the best way to incorporate code that depends on exterior dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides customers via each stage of the ML journey, from preliminary knowledge preparation to closing mannequin deployment. With out writing a single line of code, customers can discover datasets, rework knowledge, construct fashions, and generate predictions.
SageMaker Canvas affords complete knowledge wrangling capabilities that assist you to put together your knowledge, together with:
- Over 300 built-in transformation steps
- Characteristic engineering capabilities
- Information normalization and cleaning features
- A customized code editor supporting Python, PySpark, and SparkSQL
On this put up, we show the best way to incorporate dependencies saved in Amazon Easy Storage Service (Amazon S3) inside an Amazon SageMaker Information Wrangler circulate. Utilizing this method, you may run customized scripts that rely upon modules not inherently supported by SageMaker Canvas.
Answer overview
To showcase the combination of customized scripts and dependencies from Amazon S3 into SageMaker Canvas, we discover the next instance workflow.
The answer follows three predominant steps:
- Add customized scripts and dependencies to Amazon S3
- Use SageMaker Information Wrangler in SageMaker Canvas to rework your knowledge utilizing the uploaded code
- Practice and export the mannequin
The next diagram is the structure for the answer.
On this instance, we work with two complementary datasets obtainable in SageMaker Canvas that include transport data for laptop display deliveries. By becoming a member of these datasets, we create a complete dataset that captures varied transport metrics and supply outcomes. Our aim is to construct a predictive mannequin that may decide whether or not future shipments will arrive on time based mostly on historic transport patterns and traits.
Stipulations
As a prerequisite, you want entry to Amazon S3 and Amazon SageMaker AI. Should you don’t have already got a SageMaker AI area configured in your account, you additionally want permissions to create a SageMaker AI area.
Create the info circulate
To create the info circulate, comply with these steps:
- On the Amazon SageMaker AI console, within the navigation pane, below Purposes and IDEs, choose Canvas, as proven within the following screenshot. You would possibly must create a SageMaker area if you happen to haven’t carried out so already.
- After your area is created, select Open Canvas.
- In Canvas, choose the Datasets tab and choose canvas-sample-shipping-logs.csv, as proven within the following screenshot. After the preview seems, select + Create an information circulate.
The preliminary knowledge circulate will open with one supply and one knowledge sort.
- On the prime proper of the display, and choose Add knowledge → tabular. Select Canvas Datasets because the supply and choose canvas-sample-product-descriptions.csv.
- Select Subsequent as proven within the following screenshot. Then select Import.
- After each datasets have been added, choose the plus signal. From the dropdown menu, select choose Mix knowledge. From the subsequent dropdown menu, select Be a part of.
- To carry out an interior be part of on the ProductID column, within the right-hand menu, below Be a part of sort, select Internal be part of. Underneath Be a part of keys, select ProductId, as proven within the following screenshot.
- After the datasets have been joined, choose the plus signal. Within the dropdown menu, choose + Add rework. A preview of the dataset will open.
The dataset incorporates XShippingDistance (lengthy) and YShippingDistance (lengthy) columns. For our functions, we need to use a customized perform that can discover the full distance utilizing the X and Y coordinates after which drop the person coordinate columns. For this instance, we discover the full distance utilizing a perform that depends on the mpmath library.
- To name the customized perform, choose + Add rework. Within the dropdown menu, choose Customized rework. Change the editor to Python (Pandas) and attempt to run the next perform from the Python editor:
Working the perform produces the next error: ModuleNotFoundError: No module named ‘mpmath’, as proven within the following screenshot.
This error happens as a result of mpmath isn’t a module that’s inherently supported by SageMaker Canvas. To make use of a perform that depends on this module, we have to method the usage of a customized perform in another way.
Zip the script and dependencies
To make use of a perform that depends on a module that isn’t natively supported in Canvas, the customized script should be zipped with the module(s) it depends on. For this instance, we used our native built-in growth surroundings (IDE) to create a script.py that depends on the mpmath library.
The script.py file incorporates two features: one perform that’s appropriate with the Python (Pandas) runtime (perform calculate_total_distance
), and one that’s appropriate with the Python (Pyspark) runtime (perform udf_total_distance
).
To ensure the script can run, set up mpmath into the identical listing as script.py by operating pip set up mpmath
.
Run zip -r my_project.zip
to create a .zip file containing the perform and the mpmath set up. The present listing now incorporates a .zip file, our Python script, and the set up our script relies on, as proven within the following screenshot.
Add to Amazon S3
After creating the .zip file, add it to an Amazon S3 bucket.
After the zip file has been uploaded to Amazon S3, it’s accessible in SageMaker Canvas.
Run the customized script
Return to the info circulate in SageMaker Canvas and change the prior customized perform code with the next code and select Replace.
This instance code unzips the .zip file and provides the required dependencies to the native path so that they’re obtainable to the perform at run time. As a result of mpmath was added to the native path, now you can name a perform that depends on this exterior library.
The previous code runs utilizing the Python (Pandas) runtime and calculate_total_distance perform. To make use of the Python (Pyspark) runtime, replace the function_name variable to name the udf_total_distance perform as a substitute.
Full the info circulate
As a final step, take away irrelevant columns earlier than coaching the mannequin. Comply with these steps:
- On the SageMaker Canvas console, choose + Add rework. From the dropdown menu, choose Handle columns
- Underneath Rework, select Drop column. Underneath Columns to drop, add ProductId_0, ProductId_1, and OrderID, as proven within the following screenshot.
The ultimate dataset ought to include 13 columns. The whole knowledge circulate is pictured within the following picture.
Practice the mannequin
To coach the mannequin, comply with these steps:
- On the prime proper of the web page, choose Create mannequin and identify your dataset and mannequin.
- Choose Predictive evaluation as the issue sort and OnTimeDelivery because the goal column, as proven within the screenshot under.
When constructing the mannequin you may select to run a Fast construct or a Customary construct. A Fast construct prioritizes pace over accuracy and produces a skilled mannequin in lower than 20 minutes. An ordinary construct prioritizes accuracy over latency however the mannequin takes longer to coach.
Outcomes
After the mannequin construct is full, you may view the mannequin’s accuracy, together with metrics like F1, precision and recall. Within the case of a normal construct, the mannequin achieved 94.5% accuracy.
After the mannequin coaching is full, there are 4 methods you need to use your mannequin:
- Deploy the mannequin immediately from SageMaker Canvas to an endpoint
- Add the mannequin to the SageMaker Mannequin Registry
- Export your mannequin to a Jupyter Pocket book
- Ship your mannequin to Amazon QuickSight to be used in dashboard visualizations
Clear up
To handle prices and forestall extra workspace fees, select Log off to signal out of SageMaker Canvas while you’re carried out utilizing the appliance, as proven within the following screenshot. You may also configure SageMaker Canvas to mechanically shut down when idle.
Should you created an S3 bucket particularly for this instance, you may also need to empty and delete your bucket.
Abstract
On this put up, we demonstrated how one can add customized dependencies to Amazon S3 and combine them into SageMaker Canvas workflows. By strolling via a sensible instance of implementing a customized distance calculation perform with the mpmath library, we confirmed the best way to:
- Bundle customized code and dependencies right into a .zip file
- Retailer and entry these dependencies from Amazon S3
- Implement customized knowledge transformations in SageMaker Information Wrangler
- Practice a predictive mannequin utilizing the remodeled knowledge
This method implies that knowledge scientists and analysts can lengthen SageMaker Canvas capabilities past the greater than 300 included features.
To strive customized transforms your self, seek advice from the Amazon SageMaker Canvas documentation and check in to SageMaker Canvas at this time. For added insights into how one can optimize your SageMaker Canvas implementation, we advocate exploring these associated posts:
Concerning the Writer
Nadhya Polanco is an Affiliate Options Architect at AWS based mostly in Brussels, Belgium. On this function, she helps organizations seeking to incorporate AI and Machine Studying into their workloads. In her free time, Nadhya enjoys indulging in her ardour for espresso and exploring new locations.