As a Developer Advocate, it’s hard to keep up with user forum messages and understand the big picture of what users are saying. There’s a lot of valuable content, but how can you quickly spot the key conversations? In this tutorial, I’ll show you an AI hack to perform semantic clustering simply by prompting LLMs!
TL;DR 🔄 This blog post is about how to go from (data science + code) → (AI prompts + LLMs) for the same results, just faster and with less effort! 🤖⚡ It’s organized as follows:
- Inspiration and Data Sources
- Exploring the Data with Dashboards
- LLM Prompting to Produce KNN Clusters
- Experimenting with Custom Embeddings
- Clustering Across Multiple Discord Servers
Inspiration and Data Sources
First, I’ll give props to the December 2024 paper Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants to analyze and surface aggregated usage patterns across millions of conversations. Reading this paper inspired me to try this.
Data. I used only publicly available Discord messages, specifically “forum threads”, where users ask for tech help. In addition, I aggregated and anonymized content for this blog. Per thread, I formatted the data into conversation turn format, with user roles identified as either “user” (the person asking the question) or “assistant” (anyone answering the user’s initial question). I also added a simple, hard-coded binary sentiment score (0 for “not happy” and 1 for “happy”) based on whether the user said thank you at any point in their thread. For vectorDB vendors I used Zilliz/Milvus, Chroma, and Qdrant.
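The role assignment described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: it assumes each raw message is a dict with hypothetical keys `author` and `content` (the real Discord export fields may differ), and that the first poster in a thread is the one asking the question.

```python
def to_turns(thread_id, messages):
    """Label each message in one thread as 'user' or 'assistant'.

    Assumption: `messages` is in chronological order, so the first
    author is the person who asked the original question.
    """
    if not messages:
        return []
    asker = messages[0]["author"]
    return [
        {
            "thread_id": thread_id,
            "role_name": "user" if m["author"] == asker else "assistant",
            "message_content": m["content"],
        }
        for m in messages
    ]


# Toy thread with made-up authors and text.
raw = [
    {"author": "alice", "content": "How do I create an index?"},
    {"author": "bob", "content": "Which version are you on?"},
    {"author": "alice", "content": "Latest. Thanks!"},
]
turns = to_turns(1, raw)
print([t["role_name"] for t in turns])
```

Everyone other than the original asker collapses into a single “assistant” role, which keeps the later scoring and summarizing steps simple.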
The first step was to convert the data into a pandas data frame. Below is an excerpt. You can see that for thread_id=2, a user asked only 1 question. But for thread_id=3, a user asked 4 different questions in the same thread (the other 2 questions have later timestamps and are not shown below).

I added a naive sentiment 0|1 scoring function.
def calc_score(df):
    # Define the target words
    target_words = ["thanks", "thank you", "thx", "🙂", "😉", "👍"]
    # Helper function to check if any target word is in the concatenated message content
    def contains_target_words(messages):
        concatenated_content = " ".join(messages).lower()
        return any(word in concatenated_content for word in target_words)
    # Group by 'thread_id' and calculate score for each group (user messages only)
    thread_scores = (
        df[df['role_name'] == 'user']
        .groupby('thread_id')['message_content']
        .apply(lambda messages: int(contains_target_words(messages)))
    )
    # Map the calculated scores back to the original DataFrame
    df['score'] = df['thread_id'].map(thread_scores)
    return df
...
if __name__ == "__main__":
    # Load parameters from YAML file
    config_path = "config.yaml"
    params = load_params(config_path)
    input_data_folder = params['input_data_folder']
    processed_data_dir = params['processed_data_dir']
    threads_data_file = os.path.join(processed_data_dir, "thread_summary.csv")
    # Read data from Discord Forum JSON files into a pandas df.
    clean_data_df = process_json_files(
        input_data_folder,
        processed_data_dir)
    # Calculate score based on specific words in message content
    clean_data_df = calc_score(clean_data_df)
    # Generate reports and plots
    plot_all_metrics(processed_data_dir)
    # Concat thread messages & save as CSV for prompting.
    thread_summary_df, avg_message_len, avg_message_len_user = (
        concat_thread_messages_df(clean_data_df, threads_data_file)
    )
    # Sanity check: one summary row per thread
    assert thread_summary_df.shape[0] == clean_data_df.thread_id.nunique()
Exploring the Data with Dashboards
From the processed data above, I built traditional dashboards:
- Message Volumes: One-off peaks in vendors like Qdrant and Milvus (possibly due to marketing events).
- User Engagement: Bar charts of top users and scatterplots of response time vs. number of user turns show that, in general, more user turns mean higher satisfaction. However, satisfaction does NOT look correlated with response time. The dark dots in the scatterplot seem random with regard to the y-axis (response time). Maybe users are not in production, so their questions are not very urgent? Outliers exist, such as in Qdrant and Chroma, which may have bot-driven anomalies.
- Satisfaction Trends: Around 70% of users appear happy to have any interaction. Data note: make sure to check emojis per vendor, since sometimes users reply using emojis instead of words! Examples: Qdrant and Chroma.

LLM Prompting to Produce KNN Clusters
For prompting, the next step was to aggregate data by thread_id. For LLMs, you need the texts concatenated together. I separated out user messages from whole-thread messages, to see if one or the other would produce better clusters. I ended up using just user messages.
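As a rough illustration of that aggregation step, here is a dependency-free sketch. It is a stand-in for the `concat_thread_messages_df` helper called earlier, whose body is not shown in this post, so the function name `concat_by_thread` and the exact output layout are assumptions. The column names follow the ones used in the prompts.

```python
from collections import defaultdict


def concat_by_thread(rows):
    """Concatenate message text per thread_id.

    Returns one dict per thread with all messages joined
    ('message_content') and only the user's messages joined
    ('message_content_user').
    """
    all_msgs = defaultdict(list)
    user_msgs = defaultdict(list)
    for row in rows:
        all_msgs[row["thread_id"]].append(row["message_content"])
        if row["role_name"] == "user":
            user_msgs[row["thread_id"]].append(row["message_content"])
    return [
        {
            "thread_id": tid,
            "message_content": " ".join(msgs),
            "message_content_user": " ".join(user_msgs[tid]),
        }
        for tid, msgs in all_msgs.items()
    ]


# Toy rows with made-up text.
rows = [
    {"thread_id": 2, "role_name": "user", "message_content": "How do I filter by metadata?"},
    {"thread_id": 2, "role_name": "assistant", "message_content": "Use a where clause."},
    {"thread_id": 3, "role_name": "user", "message_content": "Is hybrid search supported?"},
]
threads = concat_by_thread(rows)
print(threads[0]["message_content_user"])
```

With pandas, the same idea is a `groupby("thread_id")` on the `message_content` column followed by `.agg(" ".join)`, once on all rows and once on the rows where `role_name == "user"`.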

With a CSV file for prompting, you’re ready to get started using an LLM to do data science!
!pip install -q google-generativeai
import os
import google.generativeai as genai
# Get API key from local system
api_key = os.environ.get("GOOGLE_API_KEY")
# Configure API key
genai.configure(api_key=api_key)
# List all the model names that support text generation
for m in genai.list_models():
    if 'generateContent' in m.supported_generation_methods:
        print(m.name)
# Try different models and prompts
GEMINI_MODEL_FOR_SUMMARIES = "gemini-2.0-pro-exp-02-05"
model = genai.GenerativeModel(GEMINI_MODEL_FOR_SUMMARIES)
# Combine the prompt and CSV data.
# `prompt` is one of the prompts below; `csv_data` is the CSV file read as text.
full_input = prompt + "\n\nCSV Data:\n" + csv_data
# Inference call to Gemini LLM
response = model.generate_content(full_input)
# Save response.text as .json file...
# Check token counts and compare to model limit: 2 million tokens
print(response.usage_metadata)

Unfortunately, the Gemini API kept cutting short the response.text. I had better luck using AI Studio directly.

My 5 prompts to Gemini Flash & Pro (temperature set to 0) are below.
Prompt#1: Get thread summaries:
Given this .csv file, per row, add 3 columns:
– thread_summary = 205 characters or less summary of the row’s column ‘message_content’
– user_thread_summary = 126 characters or less summary of the row’s column ‘message_content_user’
– thread_topic = 3–5 word super high-level category
Make sure the summaries capture the main content without losing too much detail. Make user thread summaries straight to the point, capture the main content without losing too much detail, skip the intro text. If a shorter summary is good enough prefer the shorter summary. Make sure the topic is general enough that there are fewer than 20 high-level topics for all the data. Prefer fewer topics. Output JSON columns: thread_id, thread_summary, user_thread_summary, thread_topic.
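Since LLM output can be cut short or drift from the instructions, it may be worth checking the returned JSON against the constraints stated in Prompt#1 before clustering. The sketch below is one possible check, not part of the original workflow. It assumes the response is a top-level JSON array of objects with the four requested columns; the function name `check_summaries` is hypothetical.

```python
import json

MAX_THREAD_SUMMARY = 205  # character limits taken from Prompt#1
MAX_USER_SUMMARY = 126
MAX_TOPICS = 20


def check_summaries(response_text, expected_thread_ids):
    """Return a list of problems found in the LLM's JSON output."""
    try:
        rows = json.loads(response_text)
    except json.JSONDecodeError as err:
        # A response that was cut short usually fails to parse.
        return [f"invalid JSON: {err.msg}"]

    problems = []
    seen = {row["thread_id"] for row in rows}
    missing = set(expected_thread_ids) - seen
    if missing:
        problems.append(f"missing thread_ids: {sorted(missing)}")
    for row in rows:
        if len(row["thread_summary"]) > MAX_THREAD_SUMMARY:
            problems.append(f"thread {row['thread_id']}: thread_summary too long")
        if len(row["user_thread_summary"]) > MAX_USER_SUMMARY:
            problems.append(f"thread {row['thread_id']}: user_thread_summary too long")
    topics = {row["thread_topic"] for row in rows}
    if len(topics) >= MAX_TOPICS:
        problems.append(f"{len(topics)} topics, expected fewer than {MAX_TOPICS}")
    return problems


# Toy response with made-up content.
good = json.dumps([{
    "thread_id": 1,
    "thread_summary": "User asks how to build an index.",
    "user_thread_summary": "How to build an index?",
    "thread_topic": "Indexing",
}])
print(check_summaries(good, [1]))       # complete response
print(check_summaries(good[:40], [1]))  # simulated truncated response
print(check_summaries(good, [1, 2]))    # a thread is missing
```

An empty list means the output satisfies the prompt’s stated limits; anything else is a reason to re-run the prompt or split the CSV into smaller pieces.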
Prompt#2: Get cluster stats:
Given this CSV file of messages, use column=’user_thread_summary’ to perform semantic clustering of all the rows. Use technique = Silhouette, with linkage method = ward, and distance_metric = Cosine Similarity. Just give me the stats for the method Silhouette analysis for now.
Prompt#3: Perform initial clustering:
Given this CSV file of messages, use column=’user_thread_summary’ to perform semantic clustering of all the rows into N=6 clusters using the Silhouette method. Use column=”thread_topic” to summarize each cluster topic in 1–3 words. Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic.
Silhouette Score measures how similar an object is to its own cluster (cohesion) versus other clusters (separation). Scores range from -1 to 1. A higher average silhouette score generally indicates better-defined clusters with good separation. For more details, check out the scikit-learn silhouette score documentation.
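To make that definition concrete, here is a minimal dependency-free sketch of the per-point silhouette, s = (b - a) / max(a, b), using cosine distance. It runs on toy 2-D vectors and only illustrates the metric; it says nothing about what the LLM does internally. In practice, scikit-learn’s `silhouette_score(X, labels, metric="cosine")` computes the average for you.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def silhouette_scores(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b).

    a = mean distance to the other points in the same cluster (cohesion)
    b = smallest mean distance to the points of any other cluster (separation)
    Requires at least two clusters. A point that is alone in its cluster
    gets s = 0, a common convention.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(cosine_distance(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(cosine_distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy vectors: two obvious groups pointing in different directions.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

The sensible labeling scores close to 1, while the scrambled labeling goes negative, which is the signal used to compare different cluster counts.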
Applying it to Chroma Data. Below, I show results from Prompt#2, as a plot of silhouette scores. I chose N=6 clusters as a compromise between a high score and fewer clusters. Most LLMs these days take CSV as input and output JSON for data analysis.
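I made the “high score vs. fewer clusters” call by eye, but the same trade-off can be written as a simple rule: take the smallest N whose score is within some tolerance of the best score. The sketch below is one such rule. The `tolerance` parameter and the scores are made up for illustration and are not the values from the plot.

```python
def pick_n_clusters(scores_by_n, tolerance=0.02):
    """Smallest N whose silhouette score is within `tolerance` of the best."""
    best = max(scores_by_n.values())
    return min(n for n, s in scores_by_n.items() if s >= best - tolerance)


# Made-up silhouette scores per cluster count, for illustration only.
scores_by_n = {2: 0.11, 4: 0.15, 6: 0.19, 8: 0.20, 10: 0.20}
print(pick_n_clusters(scores_by_n))  # → 6
```

Setting `tolerance=0.0` falls back to picking the smallest N that ties for the highest score.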

From the plot above, you can see we’re finally getting into the meat of what users are saying!
Prompt#4: Get hierarchical cluster stats:
Given this CSV file of messages, use the column=’thread_summary_user’ to perform semantic clustering of all the rows into Hierarchical Clustering (Agglomerative) with 2 levels. Use Silhouette score. What is the optimal number of next Level0 and Level1 clusters? How many threads per Level1 cluster? Just give me the stats for now, we’ll do the actual clustering later.
Prompt#5: Perform hierarchical clustering:
Accept this clustering with 2-levels. Add cluster topics that summarize text column “thread_topic”. Cluster topics should be as short as possible without losing too much detail in the cluster meaning.
– Level0 cluster topics ~1–3 words.
– Level1 cluster topics ~2–5 words.
Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic, level1_cluster_id, level1_cluster_topic.
I also prompted the LLM to generate Streamlit code to visualize the clusters (since I’m not a JS expert 😄). Results for the same Chroma data are shown below.

I found this very insightful. For Chroma, clustering revealed that while users were happy with topics like Query, Distance, and Performance, they were unhappy about areas such as Data, Client, and Deployment.
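One way to arrive at a happy/unhappy view per topic is to join the cluster JSON from Prompt#5 with the 0|1 sentiment score and average per cluster. The sketch below assumes the cluster output has been parsed into a list of dicts and that `score_by_thread` maps thread_id to the score; the function name and the toy rows are made up for illustration.

```python
from collections import defaultdict


def happy_rate_by_topic(cluster_rows, score_by_thread):
    """Share of threads with score == 1 for each Level0 cluster topic."""
    totals = defaultdict(lambda: [0, 0])  # topic -> [happy threads, all threads]
    for row in cluster_rows:
        tally = totals[row["level0_cluster_topic"]]
        tally[0] += score_by_thread[row["thread_id"]]
        tally[1] += 1
    return {topic: happy / n for topic, (happy, n) in totals.items()}


# Toy rows, not the real Chroma results.
cluster_rows = [
    {"thread_id": 1, "level0_cluster_topic": "Query"},
    {"thread_id": 2, "level0_cluster_topic": "Query"},
    {"thread_id": 3, "level0_cluster_topic": "Deployment"},
    {"thread_id": 4, "level0_cluster_topic": "Deployment"},
]
score_by_thread = {1: 1, 2: 1, 3: 0, 4: 1}
rates = happy_rate_by_topic(cluster_rows, score_by_thread)
print(rates)
```

Sorting the result by rate gives a quick ranking of which topics leave users least satisfied.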
Experimenting with Custom Embeddings
I repeated the above clustering prompts, using just the numerical embedding (“user_embedding”) in the CSV instead of the raw text summaries (“user_text”). I’ve explained embeddings in detail in earlier blogs, along with the risks of overfit models on leaderboards. OpenAI has reliable embeddings which are extremely affordable per API call. Below is an example code snippet showing how to create embeddings.
import os
from openai import OpenAI
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 512 # 512 or 1536 possible
# Initialize client with API key
openai_client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
# Function to create embeddings
def get_embedding(text, embedding_model=EMBEDDING_MODEL,
                  embedding_dim=EMBEDDING_DIM):
    response = openai_client.embeddings.create(
        input=text,
        model=embedding_model,
        dimensions=embedding_dim
    )
    return response.data[0].embedding
# Function to call per pandas df row in .apply()
def generate_row_embeddings(row):
    return {
        'user_embedding': get_embedding(row['user_thread_summary']),
    }
# Generate embeddings using pandas apply
embeddings_data = df.apply(generate_row_embeddings, axis=1)
# Add embeddings back into df as separate columns
df['user_embedding'] = embeddings_data.apply(lambda x: x['user_embedding'])
# display() is available in Jupyter/IPython notebooks
display(df.head())
# Save as CSV ...
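One practical detail when saving embeddings to CSV: a CSV cell can only hold text, so each vector ends up stored as a string and has to be parsed again on load. The sketch below shows one way to round-trip a vector using JSON inside the cell. It uses the standard `csv` module and an in-memory buffer as a stand-in for a file; the 3-dimensional vector is a toy value, not a real embedding.

```python
import csv
import io
import json

# Toy row with a made-up 3-dim vector (real ones have 512 or 1536 dims).
rows = [{"thread_id": 1, "user_embedding": [0.12, -0.03, 0.5]}]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["thread_id", "user_embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({
        "thread_id": row["thread_id"],
        # Serialize the list as JSON text so it survives the CSV cell.
        "user_embedding": json.dumps(row["user_embedding"]),
    })

# Read it back and turn the JSON text into a list of floats again.
buffer.seek(0)
loaded = [
    {
        "thread_id": int(r["thread_id"]),
        "user_embedding": json.loads(r["user_embedding"]),
    }
    for r in csv.DictReader(buffer)
]
print(loaded[0]["user_embedding"])
```

When the CSV is pasted into a prompt instead, the LLM sees the same long strings of numbers, which may be part of why the raw text summaries worked better for me.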

Interestingly, both Perplexity Pro and Gemini 2.0 Pro sometimes hallucinated cluster topics (e.g., misclassifying a question about slow queries as “Personal Matter”).
Conclusion: When performing NLP with prompts, let the LLM generate its own embeddings. Externally generated embeddings seem to confuse the model.

Clustering Across Multiple Discord Servers
Finally, I broadened the analysis to include Discord messages from three different VectorDB vendors. The resulting visualization highlighted common issues, such as both Milvus and Chroma facing authentication problems.
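Besides reading it off the visualization, shared pain points can be listed directly from the per-vendor cluster topics. The sketch below is a minimal illustration. Apart from the Milvus/Chroma authentication overlap mentioned above, the topic lists are made up, and the function name `shared_topics` is hypothetical.

```python
def shared_topics(topics_by_vendor, min_vendors=2):
    """Topics that appear for at least `min_vendors` vendors."""
    vendors_by_topic = {}
    for vendor, topics in topics_by_vendor.items():
        for topic in set(topics):  # count each topic once per vendor
            vendors_by_topic.setdefault(topic, []).append(vendor)
    return {
        topic: sorted(vendors)
        for topic, vendors in vendors_by_topic.items()
        if len(vendors) >= min_vendors
    }


# Mostly made-up topic lists, for illustration only.
topics_by_vendor = {
    "Milvus": ["Authentication", "Indexing"],
    "Chroma": ["Authentication", "Deployment"],
    "Qdrant": ["Indexing", "Filtering"],
}
common = shared_topics(topics_by_vendor)
print(common)
```

This only matches topics with identical labels, so it works best when all vendors were clustered in a single prompt and share one set of topic names.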

Summary
Here’s a summary of the steps I followed to perform semantic clustering using LLM prompts:
- Extract Discord threads.
- Format data into conversation turns with roles (“user”, “assistant”).
- Score sentiment and save as CSV.
- Prompt Google Gemini 2.0 Flash for thread summaries.
- Prompt Perplexity Pro or Gemini 2.0 Pro for clustering based on thread summaries using the same CSV.
- Prompt Perplexity Pro or Gemini 2.0 Pro to write Streamlit code to visualize clusters (because I’m not a JS expert 😆).
By following these steps, you can quickly transform raw forum data into actionable insights. What used to take days of coding can now be done in just one afternoon!
References
- Clio: Privacy-Preserving Insights into Real-World AI Use, https://arxiv.org/abs/2412.13678
- Anthropic blog about Clio, https://www.anthropic.com/research/clio
- Milvus Discord Server, last accessed Feb 7, 2025
- Chroma Discord Server, last accessed Feb 7, 2025
- Qdrant Discord Server, last accessed Feb 7, 2025
- Gemini models, https://ai.google.dev/gemini-api/docs/models/gemini
- Blog about Gemini 2.0 models, https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/
- Scikit-learn Silhouette Score
- OpenAI Matryoshka embeddings
- Streamlit