
How to Combine LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline

by admin
March 14, 2026
in Artificial Intelligence


In this article, you'll learn how to fuse dense LLM sentence embeddings, sparse TF-IDF features, and structured metadata into a single scikit-learn pipeline for text classification.

Topics we will cover include:

  • Loading and preparing a text dataset alongside synthetic metadata features.
  • Building parallel feature pipelines for TF-IDF, LLM embeddings, and numeric metadata.
  • Fusing all feature branches with ColumnTransformer and training an end-to-end classifier.

Let’s break it down.

How to Combine LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline
Image by Editor

Introduction

Data fusion, or combining diverse pieces of information into a single pipeline, sounds ambitious enough. If we talk not just about two, but about three complementary feature sources, then the challenge (and the potential payoff) goes up a level. The most exciting part is that scikit-learn lets us unify all of them cleanly within a single, end-to-end workflow. Do you want to see how? This article walks you step by step through building a complete fusion pipeline from scratch for a downstream text classification task, combining dense semantic information from LLM-generated embeddings, sparse lexical features from TF-IDF, and structured metadata signals. Keep reading.

Step-by-Step Pipeline Building Process

First, we make all the required imports for the pipeline-building process. If you are working in a local environment, you may need to pip install some of them first:

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD

from sentence_transformers import SentenceTransformer

Let's look closely at this (practically endless!) list of imports. I bet one element has caught your attention: fetch_20newsgroups. This is a freely available text dataset in scikit-learn that we will use throughout this article: it contains text extracted from news articles belonging to a wide variety of categories.

To keep our dataset manageable in practice, we will pick the news articles belonging to a subset of categories we specify. The following code does the trick:

categories = [
    "rec.sport.baseball",
    "sci.space",
    "comp.graphics",
    "talk.politics.misc"
]

dataset = fetch_20newsgroups(
    subset="all",
    categories=categories,
    remove=("headers", "footers", "quotes")
)

X_raw = dataset.data
y = dataset.target

print(f"Number of samples: {len(X_raw)}")

We called this freshly created dataset X_raw to emphasize that it is a raw, far-from-final version of the dataset we will gradually assemble for downstream tasks like using machine learning models for predictive purposes. It is fair to say the "raw" suffix is also apt because here we have the raw text, from which three different data components (or streams) will be generated and later merged.

As for the structured metadata associated with the retrieved news articles: in real-world contexts, this metadata might already be available or provided by the dataset owner. That is not the case with this publicly available dataset, so we will synthetically create some simple metadata features based on the text, including features describing character length, word count, average word length, uppercase ratio, and digit ratio.

def generate_metadata(texts):
    lengths = [len(t) for t in texts]
    word_counts = [len(t.split()) for t in texts]

    avg_word_lengths = []
    uppercase_ratios = []
    digit_ratios = []

    for t in texts:
        words = t.split()
        if words:
            avg_word_lengths.append(np.mean([len(w) for w in words]))
        else:
            avg_word_lengths.append(0)

        denom = max(len(t), 1)

        uppercase_ratios.append(
            sum(1 for c in t if c.isupper()) / denom
        )

        digit_ratios.append(
            sum(1 for c in t if c.isdigit()) / denom
        )

    return pd.DataFrame({
        "text": texts,
        "char_length": lengths,
        "word_count": word_counts,
        "avg_word_length": avg_word_lengths,
        "uppercase_ratio": uppercase_ratios,
        "digit_ratio": digit_ratios
    })

# Call the function to build a structured dataset containing raw text + metadata
df = generate_metadata(X_raw)
df["target"] = y

df.head()

Before getting fully into the pipeline-building process, we split the data into train and test subsets:

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Crucially, splitting the data into training and test sets must be done before extracting the LLM embeddings and TF-IDF features. Why? Because these two extraction processes become part of the pipeline, and they involve fitting transformations with scikit-learn, which are learning processes (for example, learning the TF-IDF vocabulary and inverse document frequency (IDF) statistics). The scikit-learn logic to enforce this is as follows: any data transformation must be fitted (learn the transformation logic) only on the training data and then applied to the test data using the learned logic. This way, no information from the test set can influence or bias feature construction or downstream model training.
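To make that fit-then-transform contract concrete, here is a minimal sketch using TfidfVectorizer on a couple of made-up toy documents (hypothetical data, separate from the 20 newsgroups corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, just to illustrate the contract
train_docs = ["the cat sat on the mat", "the dog ran in the park"]
test_docs = ["a brand new document"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # vocabulary and IDF learned from training data only
X_te = vec.transform(test_docs)       # reuses the learned vocabulary; unseen words are dropped

print(X_tr.shape[1] == X_te.shape[1])  # True: both live in the same feature space
```

Note that the test document contributes nothing to the vocabulary: words it introduces are simply ignored at transform time, which is exactly the leakage-free behavior our pipeline will inherit.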

Now comes a key stage: defining a class that encapsulates a pre-trained sentence transformer (a language model like all-MiniLM-L6-v2 capable of producing text embeddings from raw text) to produce our custom LLM embeddings.

class EmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model = None

    def fit(self, X, y=None):
        self.model = SentenceTransformer(self.model_name)
        return self

    def transform(self, X):
        embeddings = self.model.encode(
            X.tolist(),
            show_progress_bar=False
        )
        return np.array(embeddings)

Now we build the three main data branches (or parallel pipelines) we are interested in, one by one. First, the pipeline for TF-IDF feature extraction, in which we use scikit-learn's TfidfVectorizer class to extract these features seamlessly:

tfidf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("svd", TruncatedSVD(n_components=300, random_state=42))
])

Next comes the LLM embeddings pipeline, aided by the custom class we defined earlier:

embedding_pipeline = Pipeline([
    ("embed", EmbeddingTransformer())
])

Last, we define the branch pipeline for the metadata features, in which we aim to standardize these attributes due to their disparate ranges:

metadata_features = [
    "char_length",
    "word_count",
    "avg_word_length",
    "uppercase_ratio",
    "digit_ratio"
]

metadata_pipeline = Pipeline([
    ("scaler", StandardScaler())
])

Now we have three parallel pipelines, but nothing to connect them, at least not yet. Here comes the main, overarching pipeline that will orchestrate the fusion process among all three data branches, using a very helpful and versatile scikit-learn artifact for fusing heterogeneous data flows: a ColumnTransformer.

preprocessor = ColumnTransformer(
    transformers=[
        ("tfidf", tfidf_pipeline, "text"),
        ("embedding", embedding_pipeline, "text"),
        ("metadata", metadata_pipeline, metadata_features),
    ],
    remainder="drop"
)
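Notice that the same "text" column feeds two different branches; ColumnTransformer simply stacks each branch's output side by side. A minimal sketch of the same routing idea, with a hypothetical toy frame and two lightweight vectorizers standing in for the heavier branches:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Toy data, just to show one column feeding two parallel branches
toy = pd.DataFrame({"text": ["red cat", "blue dog", "red dog"]})

ct = ColumnTransformer([
    # Passing the column name as a string (not a list) hands each branch a
    # 1-D Series, which is what text vectorizers expect
    ("tfidf", TfidfVectorizer(), "text"),
    ("counts", CountVectorizer(), "text"),
])

fused = ct.fit_transform(toy)
print(fused.shape)  # (3, 8): 4 TF-IDF columns + 4 count columns
```

The string-versus-list distinction is the same reason the real pipeline passes "text" unquoted by brackets while the metadata branch receives a list of column names (yielding a 2-D selection for StandardScaler).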

And the icing on the cake: a full, end-to-end pipeline that combines the fusion pipeline with an example of a machine learning-driven downstream task. Specifically, here's how to combine the entire data fusion pipeline we have just architected with the training of a logistic regression classifier to predict the news category:

full_pipeline = Pipeline([
    ("features", preprocessor),
    ("clf", LogisticRegression(max_iter=2000))
])
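Before fitting, it helps to know what the classifier will actually see. Assuming all-MiniLM-L6-v2's 384-dimensional output (its documented embedding size), the width of the fused design matrix is easy to tally:

```python
# Back-of-the-envelope width of the fused design matrix
tfidf_dims = 300   # TruncatedSVD n_components in the TF-IDF branch
embed_dims = 384   # all-MiniLM-L6-v2 embedding dimensionality
meta_dims = 5      # the five engineered metadata columns

total_dims = tfidf_dims + embed_dims + meta_dims
print(total_dims)  # 689 features reach the logistic regression
```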

The following instruction does all the heavy lifting we have been designing so far. The LLM embeddings part in particular will take a few minutes (especially if the model needs to be downloaded), so be patient. This step undertakes the whole threefold process of data preprocessing, fusion, and model training:

full_pipeline.fit(X_train, y_train)

To finalize, we can make predictions on the test set and see how our fusion-driven classifier performs.

y_pred = full_pipeline.predict(X_test)

 

print(classification_report(y_test, y_pred, target_names=dataset.target_names))

And for a visual wrap-up, here's what the entire pipeline looks like:

Text data fusion pipeline with scikit-learn

Wrapping Up

This article guided you through the process of building a complete machine learning-oriented workflow focused on fusing multiple information sources derived from raw text data, so that everything can be put together in downstream predictive tasks like text classification. We have seen how scikit-learn provides a set of helpful classes and methods to make the process easier and more intuitive.
