Integrating customized dependencies in Amazon SageMaker Canvas workflows

When implementing machine studying (ML) workflows in Amazon SageMaker Canvas, organizations would possibly want to contemplate exterior dependencies required for his or her particular use instances. Though SageMaker Canvas offers highly effective no-code and low-code capabilities for speedy experimentation, some initiatives would possibly require specialised dependencies and libraries that aren’t included by default in SageMaker Canvas. This put up offers an instance of the best way to incorporate code that depends on exterior dependencies into your SageMaker Canvas workflows.

Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides customers via each stage of the ML journey, from preliminary knowledge preparation to closing mannequin deployment. With out writing a single line of code, customers can discover datasets, rework knowledge, construct fashions, and generate predictions.

SageMaker Canvas affords complete knowledge wrangling capabilities that assist you to put together your knowledge, together with:

Over 300 built-in transformation steps
Characteristic engineering capabilities
Information normalization and cleaning features
A customized code editor supporting Python, PySpark, and SparkSQL

On this put up, we show the best way to incorporate dependencies saved in Amazon Easy Storage Service (Amazon S3) inside an Amazon SageMaker Information Wrangler circulate. Utilizing this method, you may run customized scripts that rely upon modules not inherently supported by SageMaker Canvas.

Answer overview

To showcase the combination of customized scripts and dependencies from Amazon S3 into SageMaker Canvas, we discover the next instance workflow.

The answer follows three predominant steps:

Add customized scripts and dependencies to Amazon S3
Use SageMaker Information Wrangler in SageMaker Canvas to rework your knowledge utilizing the uploaded code
Practice and export the mannequin

The next diagram is the structure for the answer.

On this instance, we work with two complementary datasets obtainable in SageMaker Canvas that include transport data for laptop display deliveries. By becoming a member of these datasets, we create a complete dataset that captures varied transport metrics and supply outcomes. Our aim is to construct a predictive mannequin that may decide whether or not future shipments will arrive on time based mostly on historic transport patterns and traits.

Stipulations

As a prerequisite, you want entry to Amazon S3 and Amazon SageMaker AI. Should you don’t have already got a SageMaker AI area configured in your account, you additionally want permissions to create a SageMaker AI area.

Create the info circulate

To create the info circulate, comply with these steps:

On the Amazon SageMaker AI console, within the navigation pane, below Purposes and IDEs, choose Canvas, as proven within the following screenshot. You would possibly must create a SageMaker area if you happen to haven’t carried out so already.
After your area is created, select Open Canvas.

In Canvas, choose the Datasets tab and choose canvas-sample-shipping-logs.csv, as proven within the following screenshot. After the preview seems, select + Create an information circulate.

The preliminary knowledge circulate will open with one supply and one knowledge sort.

On the prime proper of the display, and choose Add knowledge → tabular. Select Canvas Datasets because the supply and choose canvas-sample-product-descriptions.csv.
Select Subsequent as proven within the following screenshot. Then select Import.

After each datasets have been added, choose the plus signal. From the dropdown menu, select choose Mix knowledge. From the subsequent dropdown menu, select Be a part of.

To carry out an interior be part of on the ProductID column, within the right-hand menu, below Be a part of sort, select Internal be part of. Underneath Be a part of keys, select ProductId, as proven within the following screenshot.

After the datasets have been joined, choose the plus signal. Within the dropdown menu, choose + Add rework. A preview of the dataset will open.

The dataset incorporates XShippingDistance (lengthy) and YShippingDistance (lengthy) columns. For our functions, we need to use a customized perform that can discover the full distance utilizing the X and Y coordinates after which drop the person coordinate columns. For this instance, we discover the full distance utilizing a perform that depends on the mpmath library.

To name the customized perform, choose + Add rework. Within the dropdown menu, choose Customized rework. Change the editor to Python (Pandas) and attempt to run the next perform from the Python editor:

from mpmath import sqrt  # Import sqrt from mpmath

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):

    # Use mpmath's sqrt to calculate the full distance for every row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)
    
    # Drop the unique x and y columns
    df = df.drop(columns=[x_col, y_col])
    
    return df

df = calculate_total_distance(df)

Working the perform produces the next error: ModuleNotFoundError: No module named ‘mpmath’, as proven within the following screenshot.

This error happens as a result of mpmath isn’t a module that’s inherently supported by SageMaker Canvas. To make use of a perform that depends on this module, we have to method the usage of a customized perform in another way.

Zip the script and dependencies

To make use of a perform that depends on a module that isn’t natively supported in Canvas, the customized script should be zipped with the module(s) it depends on. For this instance, we used our native built-in growth surroundings (IDE) to create a script.py that depends on the mpmath library.

The script.py file incorporates two features: one perform that’s appropriate with the Python (Pandas) runtime (perform calculate_total_distance), and one that’s appropriate with the Python (Pyspark) runtime (perform udf_total_distance).

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from npmath import sqrt  # Import sqrt from npmath

    # Use npmath's sqrt to calculate the full distance for every row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the unique x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

def udf_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from pyspark.sql import SparkSession
    from pyspark.sql.features import udf
    from pyspark.sql.varieties import FloatType

    spark = SparkSession.builder 
        .grasp("native") 
        .appName("DistanceCalculation") 
        .getOrCreate()

    def calculate_distance(x, y):
        import sys

        # Add the trail to npmath
        mpmath_path = "/tmp/maths"
        if mpmath_path not in sys.path:
            sys.path.insert(0, mpmath_path)

        from mpmath import sqrt
        return float(sqrt(x**2 + y**2))

    # Register and apply UDF
    distance_udf = udf(calculate_distance, FloatType())
    df = df.withColumn(new_col, distance_udf(df[x_col], df[y_col]))
    df = df.drop(x_col, y_col)

    return df

To ensure the script can run, set up mpmath into the identical listing as script.py by operating pip set up mpmath.

Run zip -r my_project.zip to create a .zip file containing the perform and the mpmath set up. The present listing now incorporates a .zip file, our Python script, and the set up our script relies on, as proven within the following screenshot.

Add to Amazon S3

After creating the .zip file, add it to an Amazon S3 bucket.

After the zip file has been uploaded to Amazon S3, it’s accessible in SageMaker Canvas.

Run the customized script

Return to the info circulate in SageMaker Canvas and change the prior customized perform code with the next code and select Replace.

import zipfile
import boto3
import sys
from pathlib import Path
import shutil
import importlib.util


def load_script_and_dependencies(bucket_name, zip_key, extract_to):
    """
    Downloads a zipper file from S3, unzips it, and ensures dependencies can be found.

    Args:
        bucket_name (str): Title of the S3 bucket.
        zip_key (str): Key for the .zip file within the bucket.
        extract_to (str): Listing to extract recordsdata to.

    Returns:
        str: Path to the extracted folder containing the script and dependencies.
    """
    
    s3_client = boto3.shopper("s3")
    
    # Native path for the zip file
    zip_local_path="/tmp/dependencies.zip"
    
    # Obtain the .zip file from S3
    s3_client.download_file(bucket_name, zip_key, zip_local_path)
    print(f"Downloaded zip file from S3: {zip_key}")

    # Unzip the file
    strive:
        with zipfile.ZipFile(zip_local_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Extracted recordsdata to {extract_to}")
    besides Exception as e:
        elevate RuntimeError(f"Didn't extract zip file: {e}")

    # Add the extracted folder to Python path
    if extract_to not in sys.path:
      sys.path.insert(0, extract_to)
          
    return extract_to
    


def call_function_from_script(script_path, function_name, df):
    """
    Dynamically hundreds a perform from a Python script utilizing importlib.
    """
    strive:
        # Get the script identify from the trail
        module_name = script_path.break up('/')[-1].change('.py', '')
        
        # Load the module specification
        spec = importlib.util.spec_from_file_location(module_name, script_path)
        if spec is None:
            elevate ImportError(f"Couldn't load specification for module {module_name}")
            
        # Create the module
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        
        # Execute the module
        spec.loader.exec_module(module)
        
        # Get the perform from the module
        if not hasattr(module, function_name):
            elevate AttributeError(f"Perform '{function_name}' not discovered within the script.")
            
        loaded_function = getattr(module, function_name)

        # Clear up: take away module from sys.modules after execution
        del sys.modules[module_name]
        
        # Name the perform
        return loaded_function(df)
        
    besides Exception as e:
        elevate RuntimeError(f"Error loading or executing perform: {e}")


bucket_name="canvasdatabuckett"  # S3 bucket identify
zip_key = 'features/my_project.zip'  # S3 path to the zip file with our customized dependancy
script_name="script.py"  # Title of the script within the zip file
function_name="udf" # Title of perform to name from our script
extract_to = '/tmp/maths' # Native path to our customized script and dependancies

# Step 1: Load the script and dependencies
extracted_path = load_script_and_dependencies(bucket_name, zip_key, extract_to)

# Step 2: Name the perform from the script
script_path = f"{extracted_path}/{script_name}"
df = call_function_from_script(script_path, function_name, df)

This instance code unzips the .zip file and provides the required dependencies to the native path so that they’re obtainable to the perform at run time. As a result of mpmath was added to the native path, now you can name a perform that depends on this exterior library.

The previous code runs utilizing the Python (Pandas) runtime and calculate_total_distance perform. To make use of the Python (Pyspark) runtime, replace the function_name variable to name the udf_total_distance perform as a substitute.

Full the info circulate

As a final step, take away irrelevant columns earlier than coaching the mannequin. Comply with these steps:

On the SageMaker Canvas console, choose + Add rework. From the dropdown menu, choose Handle columns
Underneath Rework, select Drop column. Underneath Columns to drop, add ProductId_0, ProductId_1, and OrderID, as proven within the following screenshot.

The ultimate dataset ought to include 13 columns. The whole knowledge circulate is pictured within the following picture.

Practice the mannequin

To coach the mannequin, comply with these steps:

On the prime proper of the web page, choose Create mannequin and identify your dataset and mannequin.
Choose Predictive evaluation as the issue sort and OnTimeDelivery because the goal column, as proven within the screenshot under.

When constructing the mannequin you may select to run a Fast construct or a Customary construct. A Fast construct prioritizes pace over accuracy and produces a skilled mannequin in lower than 20 minutes. An ordinary construct prioritizes accuracy over latency however the mannequin takes longer to coach.

Outcomes

After the mannequin construct is full, you may view the mannequin’s accuracy, together with metrics like F1, precision and recall. Within the case of a normal construct, the mannequin achieved 94.5% accuracy.

After the mannequin coaching is full, there are 4 methods you need to use your mannequin:

Deploy the mannequin immediately from SageMaker Canvas to an endpoint
Add the mannequin to the SageMaker Mannequin Registry
Export your mannequin to a Jupyter Pocket book
Ship your mannequin to Amazon QuickSight to be used in dashboard visualizations

Clear up

To handle prices and forestall extra workspace fees, select Log off to signal out of SageMaker Canvas while you’re carried out utilizing the appliance, as proven within the following screenshot. You may also configure SageMaker Canvas to mechanically shut down when idle.

Should you created an S3 bucket particularly for this instance, you may also need to empty and delete your bucket.

Abstract

On this put up, we demonstrated how one can add customized dependencies to Amazon S3 and combine them into SageMaker Canvas workflows. By strolling via a sensible instance of implementing a customized distance calculation perform with the mpmath library, we confirmed the best way to:

Bundle customized code and dependencies right into a .zip file
Retailer and entry these dependencies from Amazon S3
Implement customized knowledge transformations in SageMaker Information Wrangler
Practice a predictive mannequin utilizing the remodeled knowledge

This method implies that knowledge scientists and analysts can lengthen SageMaker Canvas capabilities past the greater than 300 included features.

To strive customized transforms your self, seek advice from the Amazon SageMaker Canvas documentation and check in to SageMaker Canvas at this time. For added insights into how one can optimize your SageMaker Canvas implementation, we advocate exploring these associated posts:

Concerning the Writer

Nadhya Polanco is an Affiliate Options Architect at AWS based mostly in Brussels, Belgium. On this function, she helps organizations seeking to incorporate AI and Machine Studying into their workloads. In her free time, Nadhya enjoys indulging in her ardour for espresso and exploring new locations.

Integrating customized dependencies in Amazon SageMaker Canvas workflows

Japanese-Chinese language Translation with GenAI: What Works and What Doesn’t

Information Science: From College to Work, Half III

Information Science: From College to Work, Half III

Leave a Reply Cancel reply

Popular News

How Aviva constructed a scalable, safe, and dependable MLOps platform utilizing Amazon SageMaker

Diffusion Mannequin from Scratch in Pytorch | by Nicholas DiSalvo | Jul, 2024

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Growth Assist Program

Proton launches ‘Privacy-First’ AI Email Assistant to Compete with Google and Microsoft

Streamlit fairly styled dataframes half 1: utilizing the pandas Styler

About Us

Category

Recent Posts