Subject Modeling Strategies for 2026: Seeded Modeling, LLM Integration, and Information Summaries

By: Martin Feldkircher (Vienna College of Worldwide Research), Márton Kardos (Aarhus College, Denmark), and Petr Koráb (Textual content Mining Tales)

1.

Subject modelling has lately progressed in two instructions. The improved statistical strategies stream of Python packages focuses on extra strong, environment friendly, and preprocessing-free fashions, producing fewer junk subjects (e.g., FASTopic). The opposite depends on the facility of generative language fashions to extract intuitively comprehensible subjects and their descriptions (e.g., TopicGPT [6], LlooM [5]).

Due to analysis on statistical strategies for modelling textual content representations from transformers, junk subjects are the exception reasonably than the norm in newer fashions. In the meantime, novel, LLM-based approaches are difficult our long-standing views about what a subject mannequin is and what it will probably do. Human-readable subject names and descriptions at the moment are turning into increasingly an anticipated results of a well-designed subject modelling pipeline.

As thrilling as these developments are, subject modelling is way from being a solved downside. Neural subject fashions may be reasonably unstable and typically onerous for customers to belief due to their black-box nature. LLM-powered strategies produce spectacular outcomes, however can at occasions elevate questions on belief, as a consequence of hallucinations and sensitivity to semantically irrelevant adjustments in enter. That is particularly an issue for the banking sector, the place (un)certainty is important. Operating massive language fashions can also be an enormous infrastructural and computational burden, and would possibly find yourself costing massive sums of cash even for smaller datasets.

Our earlier tutorial offers an in depth introduction to how LLMs improve conventional subject modeling by mechanically labeling subject names. On this article, we mix present subject modeling strategies with focused LLM help. In our view, a mix of current advances in language modeling and classical machine studying can present customers with the perfect of each worlds: a pipeline that mixes the capabilities of huge language fashions with the computational effectivity, trustworthiness, and stability of probabilistic ML.

This text explains three contemporary topic-modelling methods that needs to be a part of the NLP toolkit in 2026. We’ll determine:

How one can use textual content prompts to specify what subject fashions ought to give attention to (i.e., seeded subject fashions).
How LLM-generated summaries could make subject fashions extra correct.
How generative fashions can be utilized to label subjects and supply their descriptions.
How these methods can be utilized to realize insights from central banking communication.

We illustrate these on the central financial institution communication speeches corpus from the European Central Financial institution. Such a textual content is lengthy, fastidiously structured, and extremely repetitive — precisely the sort of knowledge the place normal subject fashions wrestle and the place interpretability is important. By combining seeded subject modelling with LLM-assisted doc summarization and evaluation, we present how one can extract targeted, secure, and economically significant subjects with out compromising transparency or scalability.

2. Instance Information

We use the press convention communications of the European Central Financial institution (ECB) as instance textual content knowledge. Since 2002, the ECB’s Governing Council has met on the primary Thursday of every month, and its communication of the assembly’s end result follows the two-step construction ([2]).

The way it works: First, at 13:45 CET, the ECB releases a quick financial coverage resolution (MPD) assertion, which incorporates solely restricted textual info. Second, at 14:30 CET, the ECB President delivers an introductory assertion throughout a press convention. This fastidiously ready doc explains the rationale behind coverage selections, outlines the ECB’s evaluation of financial circumstances, and offers steerage on future coverage issues. The introductory assertion usually lasts about quarter-hour and is adopted by a 45-minute Q&A session.

For this text, we use the introductory statements, scraped straight from the ECB web site (launched with a versatile knowledge licence). The dataset incorporates 279 statements, and here’s what it seems like:

_{Picture 1: ECB communication dataset. Supply: Picture by authors.}

3. Seeded Subject Modelling

Historically, subject fashions give attention to figuring out essentially the most informative subjects in a dataset. A naive method practitioners take is to suit a bigger mannequin, then, normally manually, filter out subjects irrelevant to their knowledge query.

What should you may situation a subject mannequin to solely extract related subjects to your knowledge query? That is exactly what seeded subject modelling is used for.

In some strategies, this implies choosing a set of key phrases that replicate your query. However within the framework we discover on this article, you possibly can specify your curiosity in free-text utilizing a seed phrase that tells the mannequin what to give attention to.

3.1 KeyNMF Mannequin

We’ll use the cutting-edge contextual KeyNMF subject mannequin ([3]). It’s, in lots of features, similar to older subject fashions, because it formulates subject discovery by way of matrix factorization. In different phrases, when utilizing this mannequin, you assume that subjects are latent elements, that your paperwork comprise to a lesser or better extent, which decide and clarify the content material of these paperwork.

KeyNMF is contextual as a result of, not like older fashions, it makes use of context-sensitive transformer representations of textual content. To grasp how seeded modelling works, we have to achieve a primary understanding of the mannequin. The modelling course of occurs within the following steps:

We encode our paperwork into dense vectors utilizing a sentence-transformer.
We encode the vocabulary of those paperwork into the identical embedding house.
For every doc, we extract the highest N key phrases by taking the phrases which have the very best cosine similarity to the doc embedding.
Phrase significance for a given doc is then the cosine similarity, pruned at zero. These scores are organized right into a key phrase matrix, the place every row is a doc, and columns correspond to phrases.
The key phrase matrix is decomposed right into a topic-term matrix and a document-topic matrix utilizing Nonnegative Matrix Factorization.

_{Picture 2: KeyNMF mannequin structure. Supply: Turftopic’s documentation}

3.2 Seeded KeyNMF

The overall KeyNMF, whereas completely enough for locating subjects in a corpus, shouldn’t be essentially the most appropriate selection if we have to use the mannequin for a selected query. To make this occur, we first should specify a seed phrase, a phrase that minimally signifies what we’re occupied with. For instance, when analysing the ECB communication dataset, this might be “Enlargement of the Eurozone”.

As sentence-transformers can encode this seed phrase, we are able to use it to retrieve paperwork which might be related to our query:

We encode the seed phrase into the identical embedding house as our paperwork and vocabulary.
To make our mannequin extra attentive to paperwork that comprise related info, we compute a doc relevance rating by computing cosine similarity to the seed embedding. We prune, once more, at zero.
To magnify the seed’s significance, one can apply a seed exponent. This includes elevating the doc relevance scores to the facility of this exponent.
We multiply the key phrase matrix’s entries by the doc relevance.
We then, as earlier than, use NMF to decompose this, now conditioned, key phrase matrix.

The benefits of this method are that it’s:

1) extremely versatile, and
2) can save quite a lot of handbook work.

Watch out: some embedding fashions may be delicate to phrasing and would possibly retrieve totally different document-importance scores for a similar doc with a barely totally different seed phrase. To cope with this, we suggest that you simply use one of many paraphrase fashions from sentence-transformers, as a result of they’ve intentionally been skilled to be phrasing invariant, and produce high-quality subjects with KeyNMF.

3.3 How one can use Seeded KeyNMF

KeyNMF and its seeded model can be found on PyPI within the Turftopic package deal, in a scikit-learn-compatible type. To specify what you have an interest in, merely initialize the mannequin with a seed phrase:

from sentence-transformers import SentenceTransformer
from turftopic import KeyNMF

# Encode paperwork utilizing a sentence-transformer
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
embeddings = encoder.encode(paperwork, show_progress_bar=True)

# Initialize KeyNMF with 4 subjects and a seed phrase
mannequin = KeyNMF(
 n_components=4,
 encoder=encoder,
 seed_phrase="Enlargement of the Eurozone",
 seed_exponent=3.0,
)

# Match mannequin 
mannequin.match(corpus)

# Print modelled subjects
mannequin.print_topics()

We are able to see that the mannequin returns subject IDs with typical key phrases which might be clearly associated to the Euro and the Eurozone:

Subject Id	Highest Rating
0	gdp, financial, euro, progress, economic system, evaluation, evaluation, macroeconomic, measures, anticipated
1	inflation, charges, inflationary, stability, euro, fee, economic system, financial, ecb, expectations
2	euro, fiscal, stability, nations, reforms, query, governing, coverage, european, insurance policies
3	ecb, germany, frankfurt, communicationssonnemannstrasse, 7455media, europa, query, eu, 1344, 2060314

_{Picture 3: Seed KeyNMF mannequin output.} _{Supply: picture by authors.}

4. LLM-assisted Subject Modeling

Discovering interpretable subjects from a corpus is a tough process, and it typically requires greater than only a statistical mannequin that finds patterns within the uncooked knowledge. LLMs serve subject modelling in two primary areas:

Studying a doc and figuring out the best features within the textual content based mostly on a selected knowledge query.
Decoding the subject mannequin’s output within the related context.

Within the following textual content, we are going to now discover 1) how LLMs enhance processing paperwork for a subject mannequin and a couple of) how generative fashions enhance understanding and decoding the mannequin outcomes.

_{Picture 4: Subject modelling pipeline prolonged with LLM parts. Supply: Turftopic’s docs.}

4.1. Doc Summarization

One of many Achilles’ heels of the sentence transformers we regularly use for subject evaluation is their quick context size. Encoder fashions that may learn significantly longer contexts have hardly ever been evaluated for his or her efficiency in subject modeling. Subsequently, we didn’t know whether or not or how these bigger transformer fashions work in a subject modelling pipeline. One other problem is that they produce higher-dimensional embeddings, which regularly negatively have an effect on unsupervised machine studying fashions ([4]). It might probably both be as a result of Euclidean distances get inflated in higher-dimensional house, or as a result of the variety of parameters surges with enter dimensionality, making parameter restoration tougher.

We are able to resolve these points by:

Chunking paperwork into smaller sections that match into the context window of a sentence transformer. Sadly, chunking can lead to textual content chunks which might be wildly out of context, and it would take appreciable effort to chunk paperwork at semantically wise boundaries.
Utilizing generative fashions to summarize the contents of those paperwork. LLMs excel at this process and can even take away all varieties of tokenization-based noise and irrelevant info from texts that may hinder our subject mannequin.

Let’s now summarise the trade-offs of utilizing LLM-generated summaries in subject modelling within the following picture.

_{Picture 5: Advantages and disadvantages of LLM-assisted doc processing within the subject modelling pipeline. Supply: picture by authors.}

The really helpful technique for LLM-assisted doc preprocessing is a two-step:

Prepare a subject mannequin with easy preprocessing, or no preprocessing in any respect.
Once you discover that subject fashions have a tough time decoding your corpus, utilizing LLM-based summarisation is usually a sensible choice if the trade-offs work positively in your particular venture.

4.1.1. Doc Summarization in Code

Let’s now take a look at how we are able to summarize paperwork utilizing an LLM. On this instance, we are going to use GPT-5-nano, however Turftopic additionally permits working domestically run open LLMs. We suggest utilizing open LLMs domestically, if doable, as a consequence of decrease prices and higher knowledge privateness.

import pandas as pd
from tqdm import tqdm
from turftopic.analyzers import OpenAIAnalyzer, LLMAnalyzer

# Loading the info
knowledge = pd.read_parquet("knowledge/ecb_data.parquet")
content material = record(knowledge["content"])

# We write a immediate that may extract the related info
# We ask the mannequin to separate info to key factors in order that 
# they turn out to be simpler to mannequin

summary_prompt="Summarize the next press convention from 
the European Central Financial institution right into a set of key factors separated by 
two newline characters. Reply with the abstract solely, nothing else. 
n {doc}"

# Formalize a summarized
summarizer = OpenAIAnalyzer("gpt-5-nano", summary_prompt=summary_prompt)
summaries = []

# Summarize dataframe, observe code execution 
for doc in tqdm(knowledge["content"], desc="Summarising paperwork..."):
    abstract = summarizer.summarize_document(doc)
    # We print summaries as we go as a sanity test, to ensure 
    # the immediate works
    print(abstract)
    summaries.append(abstract)

# Acquire summaries right into a dataframe
summary_df = pd.DataFrame(
    {
        "id": knowledge["id"],
        "date": knowledge["date"],
        "writer": knowledge["author"],
        "title": knowledge["title"],
        "abstract": summaries,
    }
)

Subsequent, we are going to match a easy KeyNMF mannequin on the important thing factors in these summaries, and let the mannequin uncover the variety of subjects utilizing the Bayesian Info Criterion. This method works very nicely on this case, however watch out that automated subject quantity detection has its shortcomings. Try the Subject Mannequin Leaderboard to realize extra info on how fashions carry out at detecting the variety of subjects.

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF

# Create corpus from textual content summaries (not unique texts)
corpus = record(summary_df["summary"])

# Acquire key factors by segmenting at double line breaks
factors = []

for doc in corpus:
  _points = doc.break up("nn")
  doc_points = [p for p in _points if len(p.strip().removeprefix(" - "))]
  factors.prolong(doc_points)

# Inform KeyNMF to mechanically detect the variety of subjects utilizing BIC
mannequin = KeyNMF("auto", encoder="paraphrase-mpnet-base-v2")
doc_topic = mannequin.fit_transform(factors)

# Print subject IDs with prime phrases
mannequin.print_topics()

Listed here are the KeyNMF outcomes skilled on doc summaries:

Subject ID	Highest Rating
0	inflation, hicp, expectations, anticipated, wage, power, costs, value, medium, pressures
1	ecb, charges, unchanged, stored, key, fee, liquidity, banks, market, alternate
2	m3, credit score, financial, progress, lending, liquidity, loans, monetary, cash, m1
3	euro, space, economic system, banknotes, anticipated, alternate, foreign money, nations, convergence, exterior
4	reforms, fiscal, progress, structural, consolidation, markets, potential, productiveness, market, important
5	dangers, draw back, progress, balanced, outlook, upside, costs, tensions, potential, world
6	stability, value, medium, pact, stays, dangers, expectations, time period, upside, fiscal
7	coverage, financial, fee, fiscal, charges, selections, transmission, measures, stays, stance
8	gdp, progress, actual, financial, projections, quarter, demand, q2, anticipated, economic system
9	council, governing, fee, resolution, assembly, refinancing, consensus, unanimous, guarantee, charges

_{Picture 6: KeyNMF 10-topic outcomes skilled on doc summaries. Supply: picture by authors.}

4.3. Subject Evaluation with LLMs

In a typical topic-analysis pipeline, a consumer would first prepare a subject mannequin, then spend time decoding what the mannequin has found, label subjects manually, and eventually present a quick description of the varieties of paperwork the subject incorporates. That is time-consuming, particularly in corpora with many recognized subjects.

This half can now be carried out by LLMs that may simply generate human-readable subject names and descriptions. We’ll use the identical Analyzer API from Turftopic to attain this:

from turftopic.analyzers import OpenAIAnalyzer

analyzer = OpenAIAnalyzer()
analysis_result = mannequin.analyze_topics(analyzer, use_documents=True)

print(analysis_result.to_df())

We apply the analyzer to the introductory statements issued by the ECB, which accompany every financial coverage resolution. These statements are ready fastidiously and observe a comparatively normal construction. Listed here are the labelled subject names with their descriptions and prime phrases printed from analysis_result:

_{Picture 7: Subject Evaluation utilizing GPT-5-nano in Turftopic. Supply: picture by authors.}

Subsequent, let’s present the prevalence of the labelled KeyNMF’ subject names over time. It’s how intensely these subjects have been mentioned within the ECB press conferences over the last 25 years:

from datetime import datetime
import plotly.categorical as px
from scipy.sign import savgol_filter

# create dataframe from labelled subjects, 
# mix with timestamp from date column
time_df = pd.DataFrame(
    dict(
        date=timestamps,
        **dict(zip(analysis_result.topic_names, doc_topic.T /
                                                doc_topic.sum(axis=1)))
    )
).set_index("date")

# group dataframe to month-to-month frequency
time_df = time_df.groupby(by=[time_df.index.month, time_df.index.year]).imply()
time_df.index = [datetime(year=y, month=m, day=1) for m, y in time_df.index]
time_df = time_df.sort_index()

# show dataframe with Plotly

for col in time_df.columns:
    time_df[col] = savgol_filter(time_df[col], 12, 2)
fig = px.line(
    time_df.sort_index(),
    template="plotly_white",
)
fig.present()

Right here is the labelled subject mannequin dataframe displayed in yearly frequency:

_{Picture 8: Subject Evaluation utilizing GPT-5-nano in Turftopic over time. Supply: Picture by authors.}

Mannequin leads to context: The financial union subject was most distinguished within the early 2000s (see [5] for extra info). The financial coverage and fee resolution subject peaks on the finish of the worldwide monetary disaster round 2011, a interval throughout which the ECB (some commentators argue mistakenly) raised rates of interest. The timing of the inflation and inflation expectations subject additionally corresponds with financial developments: it rises sharply round 2022, when power costs pushed inflation into double-digit territory within the euro space for the primary time since its creation.

5. Abstract

Let’s now summarize the important thing factors of the article. The necessities and code for this tutorial are on this repo.

Seeded KeyNMF subject mannequin combines textual content prompts with the newest subject mannequin to pay attention modelling on a sure downside.
Summarizing knowledge for subject modeling reduces coaching time, nevertheless it has drawbacks that needs to be thought-about in a venture.
The Tutftopic Python package deal implements systematic descriptions and labels with current LLMs into a subject modelling pipeline.

References

[1] Taejin Park, Fernando Perez-Cruz and Hyun Track Shin. 2025. Mapping the house of central bankers’ concepts. In: BIS Working Papers, No. 1299, 16 October 2025, 26 pp.

[2] Carlo Altavilla, Luca Brugnolini, Refet S. Gürkaynak, Roberto Motto and Giuseppe Ragusa. 2019. Measuring euro space financial coverage. In: Journal of Financial Economics, Quantity 108, pp 162-179.

[3] Ross Deans Kristensen-McLachlan, Rebecca M.M. Hicke, Márton Kardos, and Mette Thunø. 2024. Context is Key(NMF): Modelling Topical Info Dynamics in Chinese language Diaspora Media. In: CHR 2024: Computational Humanities Analysis Convention, December 4–6, 2024, Aarhus, Denmark.

[4] Márton Kardos, Jan Kostkan, Kenneth Enevoldsen, Arnault-Quentin Vermillet, Kristoffer Nielbo, and Roberta Rocca. 2025. S³ – Semantic Sign Separation. In: Proceedings of the 63rd Annual Assembly of the Affiliation for Computational Linguistics (Quantity 1: Lengthy Papers), pages 633–666, Vienna, Austria. Affiliation for Computational Linguistics.

[5] Martin Feldkircher, Petr Koráb and Viktoriya Teliha. 2025. What do central bankers speak about? Proof from the BIS archive. In: CAMA Working Paper Nr. 35/2025.

[6] Michelle S. Lam, Janice Teoh, James A. Landay, Jeffrey Heer, and Michael S. Bernstein. 2024. Idea Induction: Analyzing Unstructured Textual content with Excessive-Degree Ideas Utilizing LLooM. In: Proceedings of the 2024 CHI Convention on Human Components in Computing Techniques (CHI ’24). Affiliation for Computing Equipment, New York, NY, USA, Article 766, 1–28. https://doi.org/10.1145/3613904.3642830.

[7] Chau Minh Pham, Alexander Hoyle, Simeng Solar, Philip Resnik, and Mohit Iyyer. 2024. TopicGPT: A Immediate-based Subject Modeling Framework. In Proceedings of the 2024 Convention of the North American Chapter of the Affiliation for Computational Linguistics: Human Language Applied sciences (Quantity 1: Lengthy Papers), pages 2956–2984, Mexico Metropolis, Mexico. Affiliation for Computational Linguistics.

Subject Modeling Strategies for 2026: Seeded Modeling, LLM Integration, and Information Summaries

Securing Amazon Bedrock cross-Area inference: Geographic and international

5 Python Libraries for Superior Time Collection Forecasting

5 Python Libraries for Superior Time Collection Forecasting

Leave a Reply Cancel reply

Popular News

Greatest practices for Amazon SageMaker HyperPod activity governance

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Growth Assist Program

The Good-Sufficient Fact | In direction of Knowledge Science

About Us

Category

Recent Posts