Automationscribe.com
  • Home
  • AI Scribe
  • AI Tools
  • Artificial Intelligence
  • Contact Us
No Result
View All Result
Automation Scribe
  • Home
  • AI Scribe
  • AI Tools
  • Artificial Intelligence
  • Contact Us
No Result
View All Result
Automationscribe.com
No Result
View All Result

From Textual content to Tables: Function Engineering with LLMs for Tabular Information

admin by admin
March 12, 2026
in Artificial Intelligence
0
From Textual content to Tables: Function Engineering with LLMs for Tabular Information
399
SHARES
2.3k
VIEWS
Share on FacebookShare on Twitter


On this article, you’ll discover ways to use a pre-trained giant language mannequin to extract structured options from textual content and mix them with numeric columns to coach a supervised classifier.

Matters we are going to cowl embody:

  • Making a toy dataset with blended textual content and numeric fields for classification
  • Utilizing a Groq-hosted LLaMA mannequin to extract JSON options from ticket textual content with a Pydantic schema
  • Coaching and evaluating a scikit-learn classifier on the engineered tabular dataset

Let’s not waste any extra time.

From Text to Tables: Feature Engineering with LLMs for Tabular Data

From Textual content to Tables: Function Engineering with LLMs for Tabular Information
Picture by Editor

Introduction

Whereas giant language fashions (LLMs) are usually used for conversational functions in use circumstances that revolve round pure language interactions, they’ll additionally help with duties like function engineering on complicated datasets. Particularly, you may leverage pre-trained LLMs from suppliers like Groq (for instance, fashions from the Llama household) to undertake information transformation and preprocessing duties, together with turning unstructured information like textual content into totally structured, tabular information that can be utilized to gas predictive machine studying fashions.

On this article, I’ll information you thru the total means of making use of function engineering to structured textual content, turning it into tabular information appropriate for a machine studying mannequin — particularly, a classifier skilled on options created from textual content by utilizing an LLM.

Setup and Imports

First, we are going to make all the mandatory imports for this sensible instance:

import pandas as pd

import json

from pydantic import BaseModel, Area

from openai import OpenAI

from google.colab import userdata

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report

from sklearn.preprocessing import StandardScaler

Notice that in addition to widespread libraries for machine studying and information preprocessing like scikit-learn, we import the OpenAI class — not as a result of we are going to immediately use an OpenAI mannequin, however as a result of many LLM APIs (together with Groq’s) have adopted the identical interface type and specs as OpenAI. This class subsequently helps you work together with a wide range of suppliers and entry a variety of LLMs by way of a single consumer, together with Llama fashions through Groq, as we are going to see shortly.

Subsequent, we arrange a Groq consumer to allow entry to a pre-trained LLM that we will name through API for inference throughout execution:

groq_api_key = userdata.get(‘GROQ_API_KEY’)

consumer = OpenAI(

    base_url=“https://api.groq.com/openai/v1”,

    api_key=groq_api_key

)

Essential notice: for the above code to work, you should outline an API secret key for Groq. In Google Colab, you are able to do this by way of the “Secrets and techniques” icon on the left-hand aspect bar (this icon appears like a key). Right here, give your key the identify 'GROQ_API_KEY', then register on the Groq web site to get an precise key, and paste it into the worth discipline.

Making a Toy Ticket Dataset

The following step generates an artificial, partly random toy dataset for illustrative functions. When you have your personal textual content dataset, be happy to adapt the code accordingly and use your personal.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

import random

import time

 

random.seed(42)

classes = [“access”, “inquiry”, “software”, “billing”, “hardware”]

 

templates = {

    “entry”: [

        “I’ve been locked out of my account for {days} days and need urgent help!”,

        “I can’t log in, it keeps saying bad password.”,

        “Reset my access credentials immediately.”,

        “My 2FA isn’t working, please help me get into my account.”

    ],

    “inquiry”: [

        “When will my new credit card arrive in the mail?”,

        “Just checking on the status of my recent order.”,

        “What are your business hours on weekends?”,

        “Can I upgrade my current plan to the premium tier?”

    ],

    “software program”: [

        “The app keeps crashing every time I try to view my transaction history.”,

        “Software bug: the submit button is greyed out.”,

        “Pages are loading incredibly slowly since the last update.”,

        “I’m getting a 500 Internal Server Error on the dashboard.”

    ],

    “billing”: [

        “I need a refund for the extra charges on my bill.”,

        “Why was I billed twice this month?”,

        “Please update my payment method, the old card expired.”,

        “I didn’t authorize this $49.99 transaction.”

    ],

    “{hardware}”: [

        “My hardware token is broken, I can’t log in.”,

        “The screen on my physical device is cracked.”,

        “The card reader isn’t scanning properly anymore.”,

        “Battery drains in 10 minutes, I need a replacement unit.”

    ]

}

 

information = []

for _ in vary(100):

    cat = random.selection(classes)

    # Injecting a random variety of days into particular templates to foster selection

    textual content = random.selection(templates[cat]).format(days=random.randint(1, 14))

    

    information.append({

        “textual content”: textual content,

        “account_age_days”: random.randint(1, 2000),

        “prior_tickets”: random.selections([0, 1, 2, 3, 4, 5], weights=[40, 30, 15, 10, 3, 2])[0],

        “label”: cat

    })

 

df = pd.DataFrame(information)

The dataset generated comprises buyer help tickets, combining textual content descriptions with structured numeric options like account age and variety of prior tickets, in addition to a category label spanning a number of ticket classes. These labels will later be used for coaching and evaluating a classification mannequin on the finish of the method.

Extracting LLM Options

Subsequent, we outline the specified tabular options we wish to extract from the textual content. The selection of options is domain-dependent and totally customizable, however you’ll use the LLM afterward to extract these fields in a constant, structured format:

class TicketFeatures(BaseModel):

    urgency_score: int = Area(description=“Urgency of the ticket on a scale of 1 to five”)

    is_frustrated: int = Area(description=“1 if the consumer expresses frustration, 0 in any other case”)

For instance, urgency and frustration usually correlate with particular ticket sorts (e.g. entry lockouts and outages are usually extra pressing and emotionally charged than basic inquiries), so these alerts might help a downstream classifier separate classes extra successfully than uncooked textual content alone.

The following perform is a key aspect of the method, because it encapsulates the LLM integration wanted to remodel a ticket’s textual content right into a JSON object that matches our schema.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

def extract_features(textual content: str) -> dict:

    # Sleep for two.5 seconds for safer use underneath the constraints of the 30 RPM free-tier restrict

    time.sleep(2.5)

    

    schema_instructions = json.dumps(TicketFeatures.model_json_schema())

    response = consumer.chat.completions.create(

        mannequin=“llama-3.3-70b-versatile”,

        messages=[

            {

                “role”: “system”,

                “content”: f“You are an extraction assistant. Output ONLY valid JSON matching this schema: {schema_instructions}”

            },

            {“role”: “user”, “content”: text}

        ],

        response_format={“kind”: “json_object”},

        temperature=0.0

    )

    return json.hundreds(response.selections[0].message.content material)

Why does the perform return JSON objects? First, JSON is a dependable solution to ask an LLM to supply structured outputs. Second, JSON objects may be simply transformed into Pandas Sequence objects, which may then be seamlessly merged with different columns of an present DataFrame to turn out to be new ones. The next directions do the trick and append the brand new options, saved in engineered_features, to the remainder of the unique dataset:

print(“1. Extracting structured options from textual content utilizing LLM…”)

engineered_features = df[“text”].apply(extract_features)

features_df = pd.DataFrame(engineered_features.tolist())

 

X_raw = pd.concat([df.drop(columns=[“text”, “label”]), features_df], axis=1)

y = df[“label”]

 

print(“n2. Ultimate Engineered Tabular Dataset:”)

print(X_raw)

Here’s what the ensuing tabular information appears like:

          account_age_days  prior_tickets  urgency_score  is_pissed off

0                564              0              5              1

1               1517              3              4              0

2                 62              0              5              1

3                408              2              4              0

4                920              1              5              1

..               ...            ...            ...            ...

95                91              2              4              1

96               884              0              4              1

97              1737              0              5              1

98               837              0              5              1

99               862              1              4              1

 

[100 rows x 4 columns]

Sensible notice on price and latency: Calling an LLM as soon as per row can turn out to be gradual and costly on bigger datasets. In manufacturing, you’ll often wish to (1) batch requests (course of many tickets per name, in case your supplier and immediate design enable it), (2) cache outcomes keyed by a secure identifier (or a hash of the ticket textual content) so re-runs don’t re-bill the identical examples, and (3) implement retries with backoff to deal with transient fee limits and community errors. These three practices usually make the pipeline quicker, cheaper, and way more dependable.

Coaching and Evaluating the Mannequin

Lastly, right here comes the machine studying pipeline, the place the up to date, totally tabular dataset is scaled, cut up into coaching and check subsets, and used to coach and consider a random forest classifier.

print(“n3. Scaling and Coaching Random Forest…”)

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X_raw)

 

# Cut up the information into coaching and check

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=42)

 

# Prepare a random forest classification mannequin

clf = RandomForestClassifier(random_state=42)

clf.match(X_train, y_train)

 

# Predict and Consider

y_pred = clf.predict(X_test)

print(“n4. Classification Report:”)

print(classification_report(y_test, y_pred, zero_division=0))

Listed below are the classifier outcomes:

Classification Report:

              precision    recall  f1–rating   help

 

      entry       0.22      0.18      0.20        11

     billing       0.29      0.33      0.31         6

    {hardware}       0.29      0.25      0.27         8

     inquiry       1.00      1.00      1.00         8

    software program       0.44      0.57      0.50         7

 

    accuracy                           0.45        40

   macro avg       0.45      0.47      0.45        40

weighted avg       0.44      0.45      0.44        40

If you happen to used the code for producing an artificial toy dataset, chances are you’ll get a reasonably disappointing classifier consequence when it comes to accuracy, precision, recall, and so forth. That is regular: for the sake of effectivity and ease, we used a small, partly random set of 100 cases — which is often too small (and arguably too random) to carry out effectively. The important thing right here is the method of turning uncooked textual content into significant options by way of the usage of a pre-trained LLM through API, which ought to work reliably.

Abstract

This text takes a delicate tour by way of the method of turning uncooked textual content into totally tabular options for downstream machine studying modeling. The important thing trick proven alongside the best way is utilizing a pre-trained LLM to carry out inference and return structured outputs through efficient prompting.

Tags: DataEngineeringFeatureLLMsTablestabularText
Previous Post

An Intuitive Information to MCMC (Half I): The Metropolis-Hastings Algorithm

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Popular News

  • Greatest practices for Amazon SageMaker HyperPod activity governance

    Greatest practices for Amazon SageMaker HyperPod activity governance

    405 shares
    Share 162 Tweet 101
  • Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

    403 shares
    Share 161 Tweet 101
  • Construct a serverless audio summarization resolution with Amazon Bedrock and Whisper

    403 shares
    Share 161 Tweet 101
  • Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Growth Assist Program

    403 shares
    Share 161 Tweet 101
  • Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

    403 shares
    Share 161 Tweet 101

About Us

Automation Scribe is your go-to site for easy-to-understand Artificial Intelligence (AI) articles. Discover insights on AI tools, AI Scribe, and more. Stay updated with the latest advancements in AI technology. Dive into the world of automation with simplified explanations and informative content. Visit us today!

Category

  • AI Scribe
  • AI Tools
  • Artificial Intelligence

Recent Posts

  • From Textual content to Tables: Function Engineering with LLMs for Tabular Information
  • An Intuitive Information to MCMC (Half I): The Metropolis-Hastings Algorithm
  • Speed up customized LLM deployment: Superb-tune with Oumi and deploy to Amazon Bedrock
  • Home
  • Contact Us
  • Disclaimer
  • Privacy Policy
  • Terms & Conditions

© 2024 automationscribe.com. All rights reserved.

No Result
View All Result
  • Home
  • AI Scribe
  • AI Tools
  • Artificial Intelligence
  • Contact Us

© 2024 automationscribe.com. All rights reserved.