In this article, you'll learn to use Python's itertools module to simplify common feature engineering tasks with clean, efficient patterns.
Topics we will cover include:
- Generating interaction, polynomial, and cumulative features with itertools.
- Building lookup grids, lag windows, and grouped aggregates for structured data workflows.
- Using iterator-based tools to write cleaner, more composable feature engineering code.
On we go.
7 Essential Python Itertools for Feature Engineering
Image by Editor
Introduction
Feature engineering is where much of the real work in machine learning happens. A good feature often improves a model more than switching algorithms. Yet this step usually leads to messy code with nested loops, manual indexing, hand-built combinations, and the like.
Python's itertools module is a standard library toolkit that most data scientists know exists but rarely reach for when building features. That's a missed opportunity, as itertools is designed for working with iterators efficiently. A lot of feature engineering, at its core, is structured iteration: over pairs of variables, sliding windows, grouped sequences, or every possible subset of a feature set.
In this article, you'll work through seven itertools functions that solve common feature engineering problems. We'll spin up sample e-commerce data and cover interaction features, lag windows, category combinations, and more. By the end, you'll have a set of patterns you can drop directly into your own feature engineering pipelines.
You can get the code on GitHub.
1. Generating Interaction Features with combinations
Interaction features capture the relationship between two variables, something neither variable expresses alone. Manually listing every pair from a multi-column dataset is tedious. combinations in the itertools module does it in a single line.
Let's code an example to create interaction features using combinations:
```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "avg_order_value": [142.5, 89.0, 210.3, 67.8, 185.0],
    "discount_rate": [0.10, 0.25, 0.05, 0.30, 0.15],
    "days_since_signup": [120, 45, 380, 12, 200],
    "items_per_order": [3.2, 1.8, 5.1, 1.2, 4.0],
    "return_rate": [0.05, 0.18, 0.02, 0.22, 0.08],
})

numeric_cols = df.columns.tolist()

# Create a product feature for every unique pair of columns
for col_a, col_b in itertools.combinations(numeric_cols, 2):
    feature_name = f"{col_a}_x_{col_b}"
    df[feature_name] = df[col_a] * df[col_b]

interaction_cols = [c for c in df.columns if "_x_" in c]
print(df[interaction_cols].head())
```
Truncated output:
```
   avg_order_value_x_discount_rate  avg_order_value_x_days_since_signup
0                           14.250                              17100.0
1                           22.250                               4005.0
2                           10.515                              79914.0
3                           20.340                                813.6
4                           27.750                              37000.0

   avg_order_value_x_items_per_order  avg_order_value_x_return_rate
0                             456.00                          7.125
1                             160.20                         16.020
2                            1072.53                          4.206
3                              81.36                         14.916
4                             740.00                         14.800
...

   days_since_signup_x_return_rate  items_per_order_x_return_rate
0                             6.00                          0.160
1                             8.10                          0.324
2                             7.60                          0.102
3                             2.64                          0.264
4                            16.00                          0.320
```
combinations(numeric_cols, 2) generates every unique pair exactly once, with no duplicates. With 5 columns, that's 10 pairs; with 10 columns, it's 45. This approach scales as you add columns.
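That quadratic growth is easy to sanity-check against the closed-form count n-choose-2; a quick sketch using math.comb (not part of the example above):

```python
import itertools
import math

# The number of pairwise interaction features is n choose 2
for n in (5, 10, 20):
    cols = [f"col_{i}" for i in range(n)]
    pairs = list(itertools.combinations(cols, 2))
    # len(pairs) always matches the closed-form count
    print(n, len(pairs), math.comb(n, 2))
```

With 20 columns you would already be adding 190 interaction features, which is worth knowing before you run the loop on a wide dataset.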
2. Building Cross-Category Feature Grids with product
itertools.product gives you the Cartesian product of two or more iterables: every possible combination across them, including repeats across different groups.
In the e-commerce example we're working with, this is useful when you want to build a feature matrix across customer segments and product categories.
```python
import itertools
import numpy as np
import pandas as pd

customer_segments = ["new", "returning", "vip"]
product_categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]

# All segment × category × channel combinations
combos = list(itertools.product(customer_segments, product_categories, channels))

grid_df = pd.DataFrame(combos, columns=["segment", "category", "channel"])

# Simulate a conversion rate lookup per combination
np.random.seed(7)
grid_df["avg_conversion_rate"] = np.round(
    np.random.uniform(0.02, 0.18, size=len(grid_df)), 3
)

print(grid_df.head(12))
print(f"\nTotal combinations: {len(grid_df)}")
```
Output:
```
      segment     category  channel  avg_conversion_rate
0         new  electronics   mobile                0.032
1         new  electronics  desktop                0.145
2         new      apparel   mobile                0.090
3         new      apparel  desktop                0.136
4         new   home_goods   mobile                0.176
5         new   home_goods  desktop                0.106
6         new       beauty   mobile                0.100
7         new       beauty  desktop                0.032
8   returning  electronics   mobile                0.063
9   returning  electronics  desktop                0.100
10  returning      apparel   mobile                0.129
11  returning      apparel  desktop                0.149

Total combinations: 24
```
This grid can then be merged back onto your main transaction dataset as a lookup feature, as every row gets the expected conversion rate for its specific segment × category × channel bucket. product ensures you haven't missed any valid combination when building that grid.
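Here is a minimal sketch of that merge step; the transactions frame, its column names, and the hand-picked conversion rates are all invented for illustration:

```python
import itertools
import pandas as pd

segments = ["new", "vip"]
categories = ["electronics", "apparel"]
channels = ["mobile", "desktop"]

# Build the lookup grid from the full Cartesian product
grid = pd.DataFrame(
    list(itertools.product(segments, categories, channels)),
    columns=["segment", "category", "channel"],
)
# Stand-in conversion rates, one per bucket
grid["avg_conversion_rate"] = [round(0.02 + 0.01 * i, 2) for i in range(len(grid))]

# Hypothetical transactions to enrich with the lookup feature
transactions = pd.DataFrame({
    "order_id": ["A1", "A2"],
    "segment": ["new", "vip"],
    "category": ["apparel", "electronics"],
    "channel": ["mobile", "desktop"],
})

# Left join: every transaction picks up the rate for its bucket
enriched = transactions.merge(grid, on=["segment", "category", "channel"], how="left")
print(enriched)
```

Because the grid covers every combination, the left join never produces missing values for valid segment/category/channel triples.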
3. Flattening Multi-Source Feature Sets with chain
In most pipelines, features come from multiple sources: a customer profile table, a product metadata table, and a browsing history table. You often need to flatten these into a single feature list for column selection or validation.
```python
import itertools

customer_features = [
    "customer_age", "days_since_signup", "lifetime_value",
    "total_orders", "avg_order_value"
]

product_features = [
    "category", "brand_tier", "avg_rating",
    "review_count", "is_sponsored"
]

behavioral_features = [
    "pages_viewed_last_7d", "search_queries_last_7d",
    "cart_abandonment_rate", "wishlist_size"
]

# Flatten all feature groups into one list
all_features = list(itertools.chain(
    customer_features, product_features, behavioral_features
))

print(f"Total features: {len(all_features)}")
print(all_features)
```
Output:
```
Total features: 14
['customer_age', 'days_since_signup', 'lifetime_value', 'total_orders', 'avg_order_value', 'category', 'brand_tier', 'avg_rating', 'review_count', 'is_sponsored', 'pages_viewed_last_7d', 'search_queries_last_7d', 'cart_abandonment_rate', 'wishlist_size']
```
This may look like using + to concatenate lists, and for simple cases it is. But chain is especially useful when you have many sources, when sources are generators rather than lists, or when you're building the feature list conditionally, where some feature groups are optional depending on data availability. It keeps the code readable and composable.
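Here is one way the conditional and generator cases might look; the availability flag and the prefixing helper are assumptions made up for this sketch:

```python
import itertools

customer_features = ["customer_age", "lifetime_value"]
behavioral_features = ["pages_viewed_last_7d", "wishlist_size"]

# Hypothetical flag: browsing data may be missing for some training snapshots
has_behavioral_data = False

def prefixed(prefix, names):
    # A generator source: chain consumes it lazily, no intermediate list
    for name in names:
        yield f"{prefix}__{name}"

# Assemble sources conditionally, mixing lists and generators
sources = [customer_features]
if has_behavioral_data:
    sources.append(behavioral_features)
sources.append(prefixed("embed", ["title", "description"]))

all_features = list(itertools.chain.from_iterable(sources))
print(all_features)
```

chain.from_iterable is the natural fit here because the number of sources is only known at runtime.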
4. Creating Windowed Lag Features with islice
Lag features matter in many datasets. In e-commerce, for example, what a customer spent last month, their order count over the last 3 purchases, and their average basket size over the last 5 transactions can all be important features. Building these manually with index arithmetic is error-prone.
islice lets you slice an iterator without converting it to a list first. This is useful when processing ordered transaction histories row by row.
```python
import itertools
import pandas as pd

# Transaction history for customer C-10482, ordered chronologically
transactions = [
    {"order_id": "ORD-8821", "amount": 134.50, "items": 3},
    {"order_id": "ORD-8934", "amount": 89.00, "items": 2},
    {"order_id": "ORD-9102", "amount": 210.75, "items": 5},
    {"order_id": "ORD-9341", "amount": 55.20, "items": 1},
    {"order_id": "ORD-9488", "amount": 178.90, "items": 4},
    {"order_id": "ORD-9601", "amount": 302.10, "items": 7},
]

# Build lag-3 features for each transaction (using the 3 most recent prior orders)
window_size = 3
features = []

for i in range(window_size, len(transactions)):
    window = list(itertools.islice(transactions, i - window_size, i))
    current = transactions[i]

    lag_amounts = [t["amount"] for t in window]
    features.append({
        "order_id": current["order_id"],
        "current_amount": current["amount"],
        "lag_1_amount": lag_amounts[-1],
        "lag_2_amount": lag_amounts[-2],
        "lag_3_amount": lag_amounts[-3],
        "rolling_mean_3": round(sum(lag_amounts) / len(lag_amounts), 2),
        "rolling_max_3": max(lag_amounts),
    })

print(pd.DataFrame(features).to_string(index=False))
```
Output:
```
order_id  current_amount  lag_1_amount  lag_2_amount  lag_3_amount  rolling_mean_3  rolling_max_3
ORD-9341            55.2        210.75         89.00        134.50          144.75         210.75
ORD-9488           178.9         55.20        210.75         89.00          118.32         210.75
ORD-9601           302.1        178.90         55.20        210.75          148.28         210.75
```
islice(transactions, i - window_size, i) gives you exactly the preceding window_size transactions without building intermediate lists of the full history.
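The same slicing works on a pure generator, which is where islice really pays off; a small sketch with a made-up event stream:

```python
import itertools

def event_stream():
    # Hypothetical unbounded stream of order amounts
    n = 0
    while True:
        yield 100 + n * 10
        n += 1

# Take events 3..5 (0-indexed) without ever materializing the stream
window = list(itertools.islice(event_stream(), 3, 6))
print(window)  # [130, 140, 150]
```

A plain list slice would require converting the generator to a list first, which is impossible for an unbounded stream.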
5. Aggregating Per-Category Features with groupby
groupby lets you group a sorted iterable and compute per-group statistics cleanly.
Going back to our example, a customer's behavior often varies significantly by product category. Their average spend on electronics might be 4× their spend on accessories. Treating all orders as one pool loses that signal.
Here's an example:
```python
import itertools
import pandas as pd

orders = [
    {"customer": "C-10482", "category": "electronics", "amount": 349.99},
    {"customer": "C-10482", "category": "electronics", "amount": 189.00},
    {"customer": "C-10482", "category": "apparel", "amount": 62.50},
    {"customer": "C-10482", "category": "apparel", "amount": 88.00},
    {"customer": "C-10482", "category": "apparel", "amount": 45.75},
    {"customer": "C-10482", "category": "home_goods", "amount": 124.30},
]

# Must be sorted by the grouping key before using groupby
orders_sorted = sorted(orders, key=lambda x: x["category"])

category_features = {}
for category, group in itertools.groupby(orders_sorted, key=lambda x: x["category"]):
    amounts = [o["amount"] for o in group]
    category_features[category] = {
        "order_count": len(amounts),
        "total_spend": round(sum(amounts), 2),
        "avg_spend": round(sum(amounts) / len(amounts), 2),
        "max_spend": max(amounts),
    }

cat_df = pd.DataFrame(category_features).T
cat_df.index.name = "category"
print(cat_df)
```
Output:
```
             order_count  total_spend  avg_spend  max_spend
category
apparel              3.0       196.25      65.42      88.00
electronics          2.0       538.99     269.50     349.99
home_goods           1.0       124.30     124.30     124.30
```
These per-category aggregates become features on the customer row: electronics_avg_spend, apparel_order_count, and so on. The important thing to remember with itertools.groupby is that you must sort by the key first. Unlike pandas groupby, it only groups consecutive elements.
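That consecutive-only behavior is easy to demonstrate with a tiny unsorted key list:

```python
import itertools

cats = ["apparel", "electronics", "apparel"]

# Without sorting, "apparel" shows up as two separate groups
unsorted_keys = [k for k, _ in itertools.groupby(cats)]
print(unsorted_keys)  # ['apparel', 'electronics', 'apparel']

# After sorting, each key appears exactly once
sorted_keys = [k for k, _ in itertools.groupby(sorted(cats))]
print(sorted_keys)  # ['apparel', 'electronics']
```

If you skip the sort, your per-category aggregates will silently be computed over fragments of each group rather than the whole group.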
6. Building Polynomial Features with combinations_with_replacement
Polynomial features (squares, cubes, and cross-products) are a standard way to give linear models the ability to capture non-linear relationships.
Scikit-learn's PolynomialFeatures does this, but combinations_with_replacement gives you the same result with full control over which features get expanded and how.
```python
import itertools
import pandas as pd

df_poly = pd.DataFrame({
    "avg_order_value": [142.5, 89.0, 210.3, 67.8],
    "discount_rate": [0.10, 0.25, 0.05, 0.30],
    "items_per_order": [3.2, 1.8, 5.1, 1.2],
})

cols = df_poly.columns.tolist()

# Degree-2: includes col^2 and col_a × col_b
for col_a, col_b in itertools.combinations_with_replacement(cols, 2):
    feature_name = f"{col_a}^2" if col_a == col_b else f"{col_a}_x_{col_b}"
    df_poly[feature_name] = df_poly[col_a] * df_poly[col_b]

poly_cols = [c for c in df_poly.columns if "^2" in c or "_x_" in c]
print(df_poly[poly_cols].round(3))
```
Output:
```
   avg_order_value^2  avg_order_value_x_discount_rate
0           20306.25                           14.250
1            7921.00                           22.250
2           44226.09                           10.515
3            4596.84                           20.340

   avg_order_value_x_items_per_order  discount_rate^2
0                             456.00            0.010
1                             160.20            0.062
2                            1072.53            0.003
3                              81.36            0.090

   discount_rate_x_items_per_order  items_per_order^2
0                            0.320            10.24
1                            0.450             3.24
2                            0.255            26.01
3                            0.360             1.44
```
The difference from combinations is in the name: combinations_with_replacement allows the same element to appear twice. That's what gives you the squared terms (avg_order_value^2). Use this when you want polynomial expansion without pulling in scikit-learn just for preprocessing.
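The difference is easy to see side by side on a two-column list:

```python
import itertools

cols = ["a", "b"]

# combinations: distinct pairs only, no element paired with itself
print(list(itertools.combinations(cols, 2)))
# [('a', 'b')]

# combinations_with_replacement: also pairs each element with itself
print(list(itertools.combinations_with_replacement(cols, 2)))
# [('a', 'a'), ('a', 'b'), ('b', 'b')]
```

The self-pairs ('a', 'a') and ('b', 'b') are exactly the ones that become squared features in the loop above.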
7. Accumulating Cumulative Behavioral Features with accumulate
itertools.accumulate computes running aggregates over a sequence without needing pandas or NumPy.
Cumulative features (running total spend, cumulative order count, and running average basket size) are useful signals for lifetime value modeling and churn prediction. A customer's cumulative spend at order 5 says something different than their spend at order 15. Here's a worked example:
```python
import itertools
import pandas as pd

# Customer C-20917: chronological order amounts
order_amounts = [56.80, 123.40, 89.90, 245.00, 67.50, 310.20, 88.75]

# Cumulative spend
cumulative_spend = list(itertools.accumulate(order_amounts))

# Cumulative max spend (highest single order so far)
cumulative_max = list(itertools.accumulate(order_amounts, func=max))

# Cumulative order count (just using addition on 1s)
cumulative_count = list(itertools.accumulate([1] * len(order_amounts)))

features_df = pd.DataFrame({
    "order_number": range(1, len(order_amounts) + 1),
    "order_amount": order_amounts,
    "cumulative_spend": cumulative_spend,
    "cumulative_max_order": cumulative_max,
    "order_count_so_far": cumulative_count,
})

features_df["avg_spend_so_far"] = (
    features_df["cumulative_spend"] / features_df["order_count_so_far"]
).round(2)

print(features_df.to_string(index=False))
```
Output:
```
 order_number  order_amount  cumulative_spend  cumulative_max_order  order_count_so_far  avg_spend_so_far
            1         56.80             56.80                  56.8                   1             56.80
            2        123.40            180.20                 123.4                   2             90.10
            3         89.90            270.10                 123.4                   3             90.03
            4        245.00            515.10                 245.0                   4            128.78
            5         67.50            582.60                 245.0                   5            116.52
            6        310.20            892.80                 310.2                   6            148.80
            7         88.75            981.55                 310.2                   7            140.22
```
accumulate takes an optional func argument: any two-argument function. The default is addition, but max, min, operator.mul, or a custom lambda all work. In this example, each row of the output is a snapshot of the customer's history at that point in time. This is useful when building features for sequential models, or training data where you need to avoid leakage.
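As a sketch of the custom-func case, here is a running product via operator.mul and an exponentially decayed running spend; the 0.5 decay factor is an arbitrary choice for illustration:

```python
import itertools
import operator

amounts = [100.0, 50.0, 200.0]

# Running product via operator.mul
print(list(itertools.accumulate([1, 2, 3, 4], operator.mul)))
# [1, 2, 6, 24]

# Custom lambda: decayed running total, so recent orders weigh more
decayed = list(itertools.accumulate(amounts, lambda acc, x: 0.5 * acc + x))
print(decayed)
# [100.0, 100.0, 250.0]
```

The lambda receives the running value and the next element, so any recurrence of that shape (decayed totals, running ratios, and so on) fits accumulate directly.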
Wrapping Up
I hope you found this article on using Python's itertools module for feature engineering useful. Here's a quick reference for when to reach for each function:
| Function | Feature Engineering Use Case |
|---|---|
| combinations | Pairwise interaction features |
| product | Cross-category feature grids |
| chain | Merging feature lists from multiple sources |
| islice | Lag and rolling window features |
| groupby | Per-group aggregation features |
| combinations_with_replacement | Polynomial / squared features |
| accumulate | Cumulative behavioral features |
A useful habit to build here is recognizing when a feature engineering problem is, at its core, an iteration problem. When it is, itertools almost always has a cleaner answer than a custom function with hard-to-maintain loops. In the next article, we'll focus on building features for time series data. Until then, happy coding!

