Pandas Isn’t Going Anywhere: Why It’s Still My Go-To for Data Wrangling

When I started learning data science in 2020, Pandas was one of the most popular tools. Although newer tools focus on improving Pandas’ weaknesses in handling very large datasets, I still use Pandas for many data cleaning, processing, and analysis tasks. Sure, Pandas gives me a hard time when working with billions of rows, but it’s definitely more than enough for anything below that.

I see Pandas being used not only for EDA or in notebooks but also in production systems.

In this article, I’ll go over some data cleaning and processing operations to show how capable Pandas is.

Let’s start with the dataset, which contains stock keeping units (SKUs) and search API responses for these SKUs.

import pandas as pd

search_results = pd.read_csv("search_results.csv")

search_results.head()

The search result is a list of dictionaries and looks like this:

search_results.loc[0, "search_result"]

"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}}, 
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}}, 
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}}, 
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}}, 
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}}, 
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}}, 
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}] 
... and 5 entities remaining"

As we see in the output, it’s not a proper list of dictionaries because of the last part ("... and 5 entities remaining"). Also, it’s stored as a single string.

In order to make better use of it, we need to convert it to a proper list of dictionaries. The following line of code removes the last part by splitting the string at "..." and taking the first piece.

search_results.loc[0, "search_result"].split("...")[0].strip()

However, the output is still a single string. We can use Python’s built-in ast module to convert it to a list:

import ast

res = ast.literal_eval(search_results.loc[0, "search_result"].split("...")[0].strip())

res

[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
 {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
 {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
 {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
 {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
 {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
 {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]

We now have the search results as a proper list of dictionaries. This was only for a single row. We need to apply the same operation to all SKUs (i.e. the entire search_result column).
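The single-row steps can be bundled into a small helper. This is only a sketch: the function name parse_search_result and the shortened sample string are made up for illustration. It also shows why ast.literal_eval is a reasonable fit here: the text uses single-quoted Python literals, which json.loads rejects.

```python
import ast
import json

# Hypothetical sample in the same shape as the article's data (values shortened).
raw = (
    "[{'my_id': 'A1', 'distance': 1.0, 'entity': {}}, "
    "{'my_id': 'B2', 'distance': 0.9, 'entity': {}}] "
    "... and 5 entities remaining"
)


def parse_search_result(raw_value: str) -> list:
    """Drop the trailing '... and N entities remaining' note and parse the rest."""
    list_part = raw_value.split("...")[0].strip()
    return ast.literal_eval(list_part)


parsed = parse_search_result(raw)
print(parsed[0]["my_id"])

# json.loads fails on this text because JSON requires double-quoted strings.
try:
    json.loads(raw.split("...")[0].strip())
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print(json_ok)
```

ast.literal_eval only evaluates Python literals (strings, numbers, lists, dicts and so on), so it does not execute arbitrary code the way eval would.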

One option is to go over all the rows in a for loop and perform the same operation. However, this isn’t the best option. We should prefer vectorized operations when we can. A vectorized operation basically means executing the code on all rows at once.
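As a rough illustration of the difference, here is the same whitespace cleanup written both ways on a made-up toy column. It only shows that the two forms give the same values; it makes no claim about timing.

```python
import pandas as pd

# Toy column; the values are made up for illustration.
s = pd.Series(["  abc ", " de", "f  "])

# Row by row: an explicit Python loop over the values.
looped = []
for value in s:
    looped.append(value.strip())

# Column-wise: a single call through the .str accessor.
vectorized = s.str.strip()

print(vectorized.tolist() == looped)
```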

On a single row, I used splitting to get rid of the last part of the string, but it didn’t work in a vectorized operation. A more robust option seems to be using a regex.

search_results.loc[:, 'search_result'] = search_results['search_result'].str.replace(r"\.\.\..*", "", regex=True).str.strip()

This code matches "..." and everything that comes after it and replaces the match with nothing. In other words, it removes the "... and 5 entities remaining" part.
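A minimal sketch of this cleanup on made-up strings follows. One possible reason a plain split misbehaves column-wise is that Series.str.split may interpret a multi-character pattern such as "..." as a regular expression, in which each dot matches any character. Passing regex=False (an assumption here: it requires pandas 1.4 or later) forces a literal split and gives the same result as the escaped regex.

```python
import pandas as pd

# Made-up strings in the same shape as the search_result column.
s = pd.Series([
    "[{'my_id': 'A1', 'distance': 1.0, 'entity': {}}] ... and 5 entities remaining",
    "[{'my_id': 'B2', 'distance': 0.9, 'entity': {}}] ... and 12 entities remaining",
    "[{'my_id': 'C3', 'distance': 0.8, 'entity': {}}]",  # no trailing note
])

# Escaped dots match three literal periods; ".*" then consumes the rest of the line.
cleaned_regex = s.str.replace(r"\.\.\..*", "", regex=True).str.strip()

# Literal alternative: regex=False stops "..." from being read as a pattern.
cleaned_split = s.str.split("...", regex=False).str[0].str.strip()

print(cleaned_regex.tolist())
```

The single dots inside numbers such as 0.9 are left alone because the pattern needs three consecutive periods.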

Every row in the search_result column now has the form of a proper list of dictionaries.

search_results.loc[10, "search_result"]

"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
 {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
 {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
 {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
 {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
 {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
 {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]"

They’re still stored as strings, but I can easily convert them to lists using the ast module, which I’ll do in the next step.

What I’m interested in is the SKUs returned in the search results. I’ll create a new column by extracting the SKUs from the dictionaries. I can access them using the "my_id" key of each dictionary.

There are 3 parts to this operation:

Convert the search result string to a list using the literal_eval function
Extract the SKU from the my_id key of each dictionary
Do this in a list comprehension to get the SKUs from all the dictionaries in the list

We can do all these operations by applying a lambda function to all rows as follows:

search_results.loc[:, "result_skus"] = search_results["search_result"].apply(
    lambda x: [item["my_id"] for item in ast.literal_eval(x)]
)

search_results.head()
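The same extraction can be sketched on a tiny made-up dataframe, with the lambda written out as a named function. The name extract_skus and the sample values are hypothetical.

```python
import ast

import pandas as pd

# Made-up rows in the same shape as the cleaned search_results dataframe.
df = pd.DataFrame({
    "sku": ["S1", "S2"],
    "search_result": [
        "[{'my_id': 'A1', 'distance': 1.0, 'entity': {}}, "
        "{'my_id': 'B2', 'distance': 0.9, 'entity': {}}]",
        "[{'my_id': 'C3', 'distance': 1.0, 'entity': {}}]",
    ],
})


def extract_skus(raw_value: str) -> list:
    """Parse the stringified list and keep only the my_id of each hit."""
    return [item["my_id"] for item in ast.literal_eval(raw_value)]


df["result_skus"] = df["search_result"].apply(extract_skus)
print(df["result_skus"].tolist())
```

A named function is a little easier to test on its own than a lambda, but the behavior is the same.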

Each row in the result_skus column contains a list of 10 SKUs. Let’s say I need to have these 10 SKUs in different rows. For each row in the sku column, there will be 10 rows created from the list in the result_skus column. There’s a very simple way of doing this in Pandas, which is the explode function.

data = search_results[["sku", "result_skus"]].explode("result_skus", ignore_index=True)

data.head()

We created a new dataframe with the sku and result_skus columns. The drawing below demonstrates what the explode function does:

Consider the opposite. We have a dataframe as shown above but want to have all the results for an sku in a single row.

We can use the groupby function to group the rows by sku and then apply the list function to the result_skus column:

new_data = data.groupby("sku", as_index=False)["result_skus"].apply(list)

new_data.head()

This will take us back to the previous step:
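The round trip can be sketched on a toy dataframe with made-up values. Note two assumptions in this sketch: it uses agg(list) rather than apply(list) as a variant of the same idea, and it passes sort=False so that the groups keep their order of first appearance instead of being sorted by sku.

```python
import pandas as pd

# Toy input: one list of result SKUs per sku (values are made up).
nested = pd.DataFrame({
    "sku": ["S2", "S1"],
    "result_skus": [["D4", "E5"], ["A1", "B2", "C3"]],
})

# explode: one output row per list element; the sku value is repeated.
long_form = nested.explode("result_skus", ignore_index=True)
print(long_form)

# Inverse: collect the values of each sku back into a list.
# sort=False keeps groups in order of first appearance rather than sorting by sku.
regrouped = long_form.groupby("sku", as_index=False, sort=False)["result_skus"].agg(list)
print(regrouped)
```

Without sort=False, groupby sorts by the grouping key, so the rows may come back in a different order than the original dataframe.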

Using the explode function, we created a dataframe with a separate row for each sku in the result_skus column. What if we need to have them separated into different columns instead of rows?

One option is to apply the pd.Series function to the result_skus column and concatenate the resulting columns to the original dataframe.

new_cols = new_data["result_skus"].apply(pd.Series)

new_data = pd.concat([new_data, new_cols], axis=1)

new_data.head()

Columns 0 to 9 contain the 10 SKUs from the result_skus column. This code, which uses the apply function, is not a vectorized operation.

We have another option, which is vectorized and much faster.

new_cols = pd.DataFrame(new_data["result_skus"].tolist())

new_data = pd.concat([new_data, new_cols], axis=1)

This code will give us the same dataframe as above, but much faster.
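A small check on made-up data that the two approaches produce the same columns. Passing index=df.index is an addition in this sketch, not part of the code above: it keeps the new columns aligned with the original rows when the dataframe does not have the default 0..n-1 index.

```python
import pandas as pd

# Toy input with equal-length lists (values are made up).
df = pd.DataFrame({
    "sku": ["S1", "S2"],
    "result_skus": [["A1", "B2", "C3"], ["D4", "E5", "F6"]],
})

# Option 1: build one Series per row.
via_apply = df["result_skus"].apply(pd.Series)

# Option 2: build the whole frame from a list of lists in one call.
via_tolist = pd.DataFrame(df["result_skus"].tolist(), index=df.index)

wide = pd.concat([df, via_tolist], axis=1)
print(wide)
```

Both options label the new columns 0, 1, 2 and so on, so they can be renamed afterwards if more descriptive names are needed.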

I demonstrated a typical data cleaning and processing task that a data scientist or analyst may encounter in their job. I’ve been in the field for over 5 years, and Pandas has always been enough to do what I need except when working with very large datasets (e.g. billions of rows).

The tools that are a better fit for such large datasets have similar syntax to Pandas. For example, PySpark is like a combination of Pandas and SQL. Polars is very similar to Pandas in terms of syntax. Thus, learning and practicing Pandas is still a highly valuable skill for anyone working in the data science and AI domain.

Thank you for reading.
