
7 Essential Python Itertools for Feature Engineering

By admin
April 23, 2026
In Artificial Intelligence


In this article, you'll learn how to use Python's itertools module to simplify common feature engineering tasks with clear, efficient patterns.

Topics we will cover include:

  • Generating interaction, polynomial, and cumulative features with itertools.
  • Building lookup grids, lag windows, and grouped aggregates for structured data workflows.
  • Using iterator-based tools to write cleaner, more composable feature engineering code.

On we go.

7 Essential Python Itertools for Feature Engineering

Image by Editor

Introduction

Feature engineering is where much of the real work in machine learning happens. A good feature often improves a model more than switching algorithms. Yet this step usually produces messy code: nested loops, manual indexing, hand-built combinations, and the like.

Python's itertools module is a standard library toolkit that most data scientists know exists but rarely reach for when building features. That's a missed opportunity, because itertools is designed for working with iterators efficiently, and much of feature engineering, at its core, is structured iteration: over pairs of variables, sliding windows, grouped sequences, or every possible subset of a feature set.

In this article, you'll work through seven itertools functions that solve common feature engineering problems. We'll spin up sample e-commerce data and cover interaction features, lag windows, category combinations, and more. By the end, you'll have a set of patterns you can drop directly into your own feature engineering pipelines.

You can get the code on GitHub.

1. Generating Interaction Features with combinations

Interaction features capture the relationship between two variables: something neither variable expresses alone. Manually listing every pair from a multi-column dataset is tedious. combinations in the itertools module does it in one line.

Let's code an example that creates interaction features using combinations:

import itertools

import pandas as pd

df = pd.DataFrame({
    "avg_order_value":    [142.5, 89.0, 210.3, 67.8, 185.0],
    "discount_rate":      [0.10,  0.25, 0.05,  0.30, 0.15],
    "days_since_signup":  [120,   45,   380,   12,   200],
    "items_per_order":    [3.2,   1.8,  5.1,   1.2,  4.0],
    "return_rate":        [0.05,  0.18, 0.02,  0.22, 0.08],
})

numeric_cols = df.columns.tolist()

for col_a, col_b in itertools.combinations(numeric_cols, 2):
    feature_name = f"{col_a}_x_{col_b}"
    df[feature_name] = df[col_a] * df[col_b]

interaction_cols = [c for c in df.columns if "_x_" in c]
print(df[interaction_cols].head())

Truncated output:

   avg_order_value_x_discount_rate  avg_order_value_x_days_since_signup
0                           14.250                              17100.0
1                           22.250                               4005.0
2                           10.515                              79914.0
3                           20.340                                813.6
4                           27.750                              37000.0

   avg_order_value_x_items_per_order  avg_order_value_x_return_rate
0                             456.00                          7.125
1                             160.20                         16.020
2                            1072.53                          4.206
3                              81.36                         14.916
4                             740.00                         14.800

...

   days_since_signup_x_return_rate  items_per_order_x_return_rate
0                             6.00                          0.160
1                             8.10                          0.324
2                             7.60                          0.102
3                             2.64                          0.264
4                            16.00                          0.320
combinations(numeric_cols, 2) generates every unique pair exactly once, with no duplicates. With 5 columns, that's 10 pairs; with 10 columns, it's 45. This approach scales as you add columns.
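As a quick sanity check, the pair count matches the binomial coefficient C(n, 2), which math.comb computes directly:

```python
import itertools
import math

cols = ["avg_order_value", "discount_rate", "days_since_signup",
        "items_per_order", "return_rate"]

pairs = list(itertools.combinations(cols, 2))
print(len(pairs))        # pair count for 5 columns
print(math.comb(10, 2))  # pair count for 10 columns
```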

2. Building Cross-Category Feature Grids with product

itertools.product gives you the Cartesian product of two or more iterables: every possible combination across them, including repeats across different groups.

In the e-commerce sample we're working with, this is useful when you want to build a feature matrix across customer segments and product categories.

import itertools

import numpy as np
import pandas as pd

customer_segments = ["new", "returning", "vip"]
product_categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]

# All segment × category × channel combinations
combos = list(itertools.product(customer_segments, product_categories, channels))

grid_df = pd.DataFrame(combos, columns=["segment", "category", "channel"])

# Simulate a conversion rate lookup per combination
np.random.seed(7)
grid_df["avg_conversion_rate"] = np.round(
    np.random.uniform(0.02, 0.18, size=len(grid_df)), 3
)

print(grid_df.head(12))
print(f"\nTotal combinations: {len(grid_df)}")

Output:

      segment     category  channel  avg_conversion_rate
0         new  electronics   mobile                0.032
1         new  electronics  desktop                0.145
2         new      apparel   mobile                0.090
3         new      apparel  desktop                0.136
4         new   home_goods   mobile                0.176
5         new   home_goods  desktop                0.106
6         new       beauty   mobile                0.100
7         new       beauty  desktop                0.032
8   returning  electronics   mobile                0.063
9   returning  electronics  desktop                0.100
10  returning      apparel   mobile                0.129
11  returning      apparel  desktop                0.149

Total combinations: 24

This grid can then be merged back onto your main transaction dataset as a lookup feature, so that every row gets the expected conversion rate for its specific segment × category × channel bucket. product ensures you haven't missed any valid combination when building that grid.
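As a sketch of that merge (the small transactions table here is hypothetical, invented just to show the join), the lookup is a plain left merge on the three key columns:

```python
import itertools

import numpy as np
import pandas as pd

customer_segments = ["new", "returning", "vip"]
product_categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]

combos = list(itertools.product(customer_segments, product_categories, channels))
grid_df = pd.DataFrame(combos, columns=["segment", "category", "channel"])
np.random.seed(7)
grid_df["avg_conversion_rate"] = np.round(
    np.random.uniform(0.02, 0.18, size=len(grid_df)), 3
)

# Hypothetical transactions to enrich with the lookup feature
transactions = pd.DataFrame({
    "order_id": ["ORD-1", "ORD-2"],
    "segment":  ["new", "vip"],
    "category": ["apparel", "beauty"],
    "channel":  ["mobile", "desktop"],
})

enriched = transactions.merge(grid_df, on=["segment", "category", "channel"], how="left")
print(enriched)
```

Because the grid covers every combination, the left merge can never produce a missing conversion rate.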

3. Flattening Multi-Source Feature Sets with chain

In most pipelines, features come from multiple sources: a customer profile table, a product metadata table, and a browsing history table. You often need to flatten these into a single feature list for column selection or validation.

import itertools

customer_features = [
    "customer_age", "days_since_signup", "lifetime_value",
    "total_orders", "avg_order_value"
]

product_features = [
    "category", "brand_tier", "avg_rating",
    "review_count", "is_sponsored"
]

behavioral_features = [
    "pages_viewed_last_7d", "search_queries_last_7d",
    "cart_abandonment_rate", "wishlist_size"
]

# Flatten all feature groups into one list
all_features = list(itertools.chain(
    customer_features,
    product_features,
    behavioral_features
))

print(f"Total features: {len(all_features)}")
print(all_features)

Output:

Total features: 14
['customer_age', 'days_since_signup', 'lifetime_value', 'total_orders', 'avg_order_value', 'category', 'brand_tier', 'avg_rating', 'review_count', 'is_sponsored', 'pages_viewed_last_7d', 'search_queries_last_7d', 'cart_abandonment_rate', 'wishlist_size']

This may look like using + to concatenate lists, and for simple cases it is. But chain is especially useful when you have many sources, when sources are generators rather than lists, or when you're building the feature list conditionally, with some feature groups optional depending on data availability. It keeps the code readable and composable.
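A minimal sketch of the conditional case, with hypothetical availability flags deciding which groups make it into the final list (chain.from_iterable flattens the list of possibly empty groups):

```python
import itertools

core_features = ["customer_age", "total_orders"]

# Hypothetical flags describing which data sources are available
has_browsing_data = True
has_product_metadata = False

optional_groups = [
    ["pages_viewed_last_7d", "wishlist_size"] if has_browsing_data else [],
    ["brand_tier", "avg_rating"] if has_product_metadata else [],
]

# Empty groups simply contribute nothing to the flattened result
all_features = core_features + list(itertools.chain.from_iterable(optional_groups))
print(all_features)
```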

4. Creating Windowed Lag Features with islice

Lag features matter in many datasets. In e-commerce, for example, what a customer spent last month, their order count over the last 3 purchases, and their average basket size over the last 5 transactions can all be important features. Building these manually with index arithmetic is error-prone.

islice lets you slice an iterator without converting it to a list first. This is useful when processing ordered transaction histories row by row.

import itertools

import pandas as pd

# Transaction history for customer C-10482, ordered chronologically
transactions = [
    {"order_id": "ORD-8821", "amount": 134.50, "items": 3},
    {"order_id": "ORD-8934", "amount":  89.00, "items": 2},
    {"order_id": "ORD-9102", "amount": 210.75, "items": 5},
    {"order_id": "ORD-9341", "amount":  55.20, "items": 1},
    {"order_id": "ORD-9488", "amount": 178.90, "items": 4},
    {"order_id": "ORD-9601", "amount": 302.10, "items": 7},
]

# Build lag-3 features for each transaction (using the 3 most recent prior orders)
window_size = 3
features = []

for i in range(window_size, len(transactions)):
    window = list(itertools.islice(transactions, i - window_size, i))
    current = transactions[i]

    lag_amounts = [t["amount"] for t in window]
    features.append({
        "order_id":         current["order_id"],
        "current_amount":   current["amount"],
        "lag_1_amount":     lag_amounts[-1],
        "lag_2_amount":     lag_amounts[-2],
        "lag_3_amount":     lag_amounts[-3],
        "rolling_mean_3":   round(sum(lag_amounts) / len(lag_amounts), 2),
        "rolling_max_3":    max(lag_amounts),
    })

print(pd.DataFrame(features).to_string(index=False))

Output:

order_id  current_amount  lag_1_amount  lag_2_amount  lag_3_amount  rolling_mean_3  rolling_max_3
ORD-9341            55.2        210.75         89.00        134.50          144.75         210.75
ORD-9488           178.9         55.20        210.75         89.00          118.32         210.75
ORD-9601           302.1        178.90         55.20        210.75          148.28         210.75

islice(transactions, i - window_size, i) gives you exactly the preceding window_size transactions without building intermediate lists for the full history.
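Because islice works on any iterator, the same slicing applies to a generator that can't be indexed at all. A small sketch with a hypothetical stream of order amounts:

```python
import itertools

def amounts_stream():
    # Hypothetical generator yielding order amounts one at a time
    yield from [134.50, 89.00, 210.75, 55.20, 178.90]

# Take amounts 2 through 4 without materializing the whole stream
middle = list(itertools.islice(amounts_stream(), 1, 4))
print(middle)  # [89.0, 210.75, 55.2]
```

Note that islice consumes the iterator as it goes, so for a true one-pass stream you get each element only once.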

5. Aggregating Per-Category Features with groupby

groupby lets you group a sorted iterable and compute per-group statistics cleanly.

Going back to our example, a customer's behavior often varies significantly by product category. Their average spend on electronics might be 4× their spend on accessories. Treating all orders as one pool loses that signal.

Here's an example:
Right here’s an instance:

import itertools

import pandas as pd

orders = [
    {"customer": "C-10482", "category": "electronics", "amount": 349.99},
    {"customer": "C-10482", "category": "electronics", "amount": 189.00},
    {"customer": "C-10482", "category": "apparel",     "amount":  62.50},
    {"customer": "C-10482", "category": "apparel",     "amount":  88.00},
    {"customer": "C-10482", "category": "apparel",     "amount":  45.75},
    {"customer": "C-10482", "category": "home_goods",  "amount": 124.30},
]

# Must be sorted by the grouping key before using groupby
orders_sorted = sorted(orders, key=lambda x: x["category"])

category_features = {}
for category, group in itertools.groupby(orders_sorted, key=lambda x: x["category"]):
    amounts = [o["amount"] for o in group]
    category_features[category] = {
        "order_count":  len(amounts),
        "total_spend":  round(sum(amounts), 2),
        "avg_spend":    round(sum(amounts) / len(amounts), 2),
        "max_spend":    max(amounts),
    }

cat_df = pd.DataFrame(category_features).T
cat_df.index.name = "category"
print(cat_df)

Output:

             order_count  total_spend  avg_spend  max_spend
category
apparel              3.0       196.25      65.42      88.00
electronics          2.0       538.99     269.50     349.99
home_goods           1.0       124.30     124.30     124.30

These per-category aggregates become features on the customer row: electronics_avg_spend, apparel_order_count, and so on. The important thing to remember with itertools.groupby is that you must sort by the key first. Unlike pandas groupby, it only groups consecutive elements.
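A quick demonstration of that pitfall: on unsorted data, groupby splits a repeated key into separate groups.

```python
import itertools

categories = ["apparel", "electronics", "apparel"]

# Unsorted: 'apparel' shows up as two separate groups
unsorted_groups = [(k, len(list(g))) for k, g in itertools.groupby(categories)]
print(unsorted_groups)  # [('apparel', 1), ('electronics', 1), ('apparel', 1)]

# Sorted first: each key appears exactly once
sorted_groups = [(k, len(list(g))) for k, g in itertools.groupby(sorted(categories))]
print(sorted_groups)    # [('apparel', 2), ('electronics', 1)]
```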

6. Building Polynomial Features with combinations_with_replacement

Polynomial features (squares, cubes, and cross-products) are a standard way to give linear models the ability to capture non-linear relationships.

Scikit-learn's PolynomialFeatures does this, but combinations_with_replacement gives you the same result with full control over which features get expanded and how.

import itertools

import pandas as pd

df_poly = pd.DataFrame({
    "avg_order_value":  [142.5, 89.0, 210.3, 67.8],
    "discount_rate":    [0.10,  0.25, 0.05,  0.30],
    "items_per_order":  [3.2,   1.8,  5.1,   1.2],
})

cols = df_poly.columns.tolist()

# Degree-2: includes col^2 and col_a × col_b
for col_a, col_b in itertools.combinations_with_replacement(cols, 2):
    feature_name = f"{col_a}^2" if col_a == col_b else f"{col_a}_x_{col_b}"
    df_poly[feature_name] = df_poly[col_a] * df_poly[col_b]

poly_cols = [c for c in df_poly.columns if "^2" in c or "_x_" in c]
print(df_poly[poly_cols].round(3))

Output:

   avg_order_value^2  avg_order_value_x_discount_rate
0           20306.25                           14.250
1            7921.00                           22.250
2           44226.09                           10.515
3            4596.84                           20.340

   avg_order_value_x_items_per_order  discount_rate^2
0                             456.00            0.010
1                             160.20            0.062
2                            1072.53            0.003
3                              81.36            0.090

   discount_rate_x_items_per_order  items_per_order^2
0                            0.320              10.24
1                            0.450               3.24
2                            0.255              26.01
3                            0.360               1.44

The difference from combinations is in the name: combinations_with_replacement allows the same element to appear twice. That's what gives you the squared terms (avg_order_value^2). Use this when you want polynomial expansion without pulling in scikit-learn just for preprocessing.
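As a sanity check on the term count: degree-2 expansion of n features yields C(n + 1, 2) terms, and the squared terms appear as repeated pairs:

```python
import itertools
import math

cols = ["avg_order_value", "discount_rate", "items_per_order"]
terms = list(itertools.combinations_with_replacement(cols, 2))

print(len(terms))                                  # 6 terms for 3 features
print(len(terms) == math.comb(len(cols) + 1, 2))   # True
print(terms[0])  # ('avg_order_value', 'avg_order_value') -> the squared term
```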

7. Accumulating Cumulative Behavioral Features with accumulate

itertools.accumulate computes running aggregates over a sequence without needing pandas or NumPy.

Cumulative features such as running total spend, cumulative order count, and running average basket size are useful signals for lifetime value modeling and churn prediction. A customer's cumulative spend at order 5 says something different than their spend at order 15. Here's a useful example:

import itertools

import pandas as pd

# Customer C-20917: chronological order amounts
order_amounts = [56.80, 123.40, 89.90, 245.00, 67.50, 310.20, 88.75]

# Cumulative spend
cumulative_spend = list(itertools.accumulate(order_amounts))

# Cumulative max spend (highest single order so far)
cumulative_max = list(itertools.accumulate(order_amounts, func=max))

# Cumulative order count (just using addition on 1s)
cumulative_count = list(itertools.accumulate([1] * len(order_amounts)))

features_df = pd.DataFrame({
    "order_number":    range(1, len(order_amounts) + 1),
    "order_amount":    order_amounts,
    "cumulative_spend": cumulative_spend,
    "cumulative_max_order": cumulative_max,
    "order_count_so_far":   cumulative_count,
})

features_df["avg_spend_so_far"] = (
    features_df["cumulative_spend"] / features_df["order_count_so_far"]
).round(2)

print(features_df.to_string(index=False))

Output:

order_number  order_amount  cumulative_spend  cumulative_max_order  order_count_so_far  avg_spend_so_far

            1         56.80             56.80                  56.8                   1             56.80

            2        123.40            180.20                 123.4                   2             90.10

            3         89.90            270.10                 123.4                   3             90.03

            4        245.00            515.10                 245.0                   4            128.78

            5         67.50            582.60                 245.0                   5            116.52

            6        310.20            892.80                 310.2                   6            148.80

            7         88.75            981.55                 310.2                   7            140.22

accumulate takes an optional func argument: any two-argument function. The default is addition, but max, min, operator.mul, or a custom lambda all work. In this example, each row of the output is a snapshot of the customer's history at that point in time, which is useful when building features for sequential models, or for training data where you need to avoid leakage.
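To illustrate the func argument, here is a short sketch using operator.mul for a running product, plus a hypothetical decayed-spend lambda that down-weights the running total by a factor of 0.9 at each step:

```python
import itertools
import operator

# Running product via operator.mul
print(list(itertools.accumulate([1, 2, 3, 4], operator.mul)))  # [1, 2, 6, 24]

# Hypothetical decayed running spend: new = 0.9 * previous + current amount
amounts = [56.80, 123.40, 89.90]
decayed = list(itertools.accumulate(amounts, lambda acc, x: 0.9 * acc + x))
print([round(v, 2) for v in decayed])  # [56.8, 174.52, 246.97]
```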

Wrapping Up

I hope you found this article on using Python's itertools module for feature engineering helpful. Here's a quick reference for when to reach for each function:

Function                        Feature Engineering Use Case
combinations                    Pairwise interaction features
product                         Cross-category feature grids
chain                           Merging feature lists from multiple sources
islice                          Lag and rolling window features
groupby                         Per-group aggregation features
combinations_with_replacement   Polynomial / squared features
accumulate                      Cumulative behavioral features

A useful habit to build here is recognizing when a feature engineering problem is, at its core, an iteration problem. When it is, itertools almost always has a cleaner answer than a custom function with hard-to-maintain loops. In the next article, we'll focus on building features for time series data. Until then, happy coding!

Tags: Engineering, Essential, Feature, Itertools, Python