Comparing LLM-Based vs Traditional Clustering for Support Conversations

Research conducted with OpenAI's ChatGPT DeepResearch

Voice of the Customer (VoC) analysis often involves grouping customer support conversations into meaningful topics or issues. Historically, this has been done by vectorizing conversation text (e.g. with TF‑IDF or sentence embeddings) and applying clustering algorithms like K-means or DBSCAN. Recent approaches leverage Large Language Models (LLMs) – for example, OpenAI’s GPT series (including smaller variants like o3-mini) – to cluster or categorize conversations using their deep semantic understanding. Below, we compare these approaches in terms of clustering accuracy, scalability, and quality of clusters, and highlight conceptual differences in methodology.

Clustering Quality and Accuracy

Traditional clustering (embedding + algorithm) can identify distinct groups of support conversations, but its effectiveness depends heavily on the chosen features and parameters. For instance, one study clustering 200k chat transcripts found that using TF‑IDF with K-means or HDBSCAN produced the best silhouette scores and Adjusted Rand Index (ARI) among tested methods (Chat Clustering Study). This indicates that with well-tuned embeddings (TF‑IDF unigrams in that case) and clustering, classical methods can capture major issue categories. However, challenges remain – e.g. DBSCAN/HDBSCAN tended to label a large fraction of chats as “outliers” (up to 90% in some cases) when conversations were very similar, whereas K-means forced every chat into a cluster (including some less meaningful groupings) (DBSCAN Limitations) (K-means Limitations). In other words, traditional methods might either over-cluster (grouping dissimilar issues together) or leave many chats unassigned if parameters weren’t optimal (DBSCAN Clustering Issues) (K-means Assignment Issues).
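
The embed-then-cluster pipeline described above can be sketched in a few lines. This is a minimal toy example assuming scikit-learn; the four sample chats and the choice of k=2 are illustrative, not from the study.

```python
# Minimal embed-then-cluster sketch (toy data; assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

chats = [
    "I was charged twice on my invoice",
    "Refund the duplicate charge please",
    "I cannot log in to my account",
    "Password reset link never arrives",
]

# Vectorize with TF-IDF unigrams, the setup that scored best in the study.
X = TfidfVectorizer().fit_transform(chats)

# K-means forces every chat into one of k clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                       # one cluster id per conversation
print(silhouette_score(X, km.labels_))  # internal cohesion/separation score
```

The silhouette score here is an internal metric (no ground truth needed); ARI, by contrast, requires human labels to compare against.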

LLM-based clustering approaches have shown higher clustering accuracy and coherence on many text datasets. Multiple recent studies report that incorporating LLMs yields state-of-the-art clustering performance, consistently outperforming classical algorithms like K-means on clustering tasks (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering). For example, ClusterLLM (EMNLP 2023) uses an instruction-tuned LLM (ChatGPT) to refine cluster assignments. It was tested on 14 datasets and “consistently improves clustering quality” over traditional unsupervised methods, with minimal human guidance (at an average cost of only ~$0.6 per dataset) ([2305.14871] ClusterLLM: Large Language Models as a Guide for Text Clustering). Similarly, Viswanathan et al. (2024) demonstrated that even with only a few user-provided examples, LLM-guided clustering can significantly boost cluster purity and alignment with true topics (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering). In practical terms, this means LLM-based methods can better group conversations by actual issue type – for instance, separating “billing error” vs “account login problem” conversations more accurately than embeddings+KMeans would. Improved cluster quality is often reflected in higher ARI or NMI (Normalized Mutual Information) scores in evaluations (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering). These gains stem from LLMs’ ability to capture subtle semantic differences that fixed embeddings might miss. (Notably, if an embedding model wasn’t well-tuned to the domain – e.g. a general multilingual SBERT on industry-specific support tickets – its clusters may not correspond to real categories (Domain-Specific Training Issues) (BERT Model Limitations), whereas an LLM can interpret the text on the fly with domain knowledge.)
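
ARI and NMI, the metrics cited above, compare a predicted clustering against human labels. A small sketch of how they are computed in practice, assuming scikit-learn and made-up labels:

```python
# How ARI / NMI score predicted clusters against human-labeled categories.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = ["billing", "billing", "login", "login", "login"]
pred_labels = [0, 0, 1, 1, 0]  # one "login" chat mis-clustered

# Both metrics are invariant to cluster naming; 1.0 means perfect agreement.
print(adjusted_rand_score(true_labels, pred_labels))
print(normalized_mutual_info_score(true_labels, pred_labels))
```

Because both metrics ignore cluster names, they work even when one side uses strings and the other integers, which is the usual situation when comparing unsupervised output to an annotated sample.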

It’s worth mentioning that if labeled data is available, fine-tuning an LLM for direct classification can yield very high accuracy. For example, a fine-tuned 7B LLM on support call logs was able to classify call reasons with about 90% accuracy on a test set (far above an un-tuned model’s performance) (Fine-Tuning Zephyr-7B to Analyze Customer Support Call Logs - Predibase - Predibase). This is a supervised approach rather than clustering, but it highlights the ceiling of performance when LLMs are tailored to the task. In unsupervised scenarios, LLM-based clustering narrows the gap with supervised methods, often producing clusters that align closely with human-labeled categories of issues.

Scalability Considerations

Scalability is a major differentiator between traditional and LLM-based approaches. Clustering algorithms like K-means are very efficient on large datasets – each iteration costs time roughly linear in the number of points, and they can be run in mini-batches or with distributed computing. Techniques like mini-batch K-means can cluster millions of conversations quickly, and methods like DBSCAN (with indexing) can handle large volumes as well. The pipeline of “embed then cluster” is straightforward to scale: one can vectorize each conversation (potentially using accelerated libraries or GPUs for transformer embeddings), then apply clustering whose runtime grows roughly linearly with the number of points n. In the earlier chat clustering study, for example, a simple TF-IDF vectorization led to fast clustering, whereas more complex embedding or distance schemes (like word mover’s distance) slowed it down significantly (Vectorization Performance) (Embedding Complexity). Traditional methods thus excel in throughput – they can churn through large datasets without requiring heavy compute beyond the initial embedding step.
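
The mini-batch variant mentioned above updates centroids on small random batches instead of full passes over the data. A sketch with scikit-learn's MiniBatchKMeans on stand-in embeddings (random vectors here; real inputs would be sentence embeddings):

```python
# Mini-batch K-means scales by updating centroids on small random batches.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for 100k conversation embeddings (e.g. 32-dim sentence vectors).
X = rng.normal(size=(100_000, 32)).astype(np.float32)

mbk = MiniBatchKMeans(n_clusters=20, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)  # streams over mini-batches instead of full passes
```

Batch size trades convergence quality against speed; 1024 is a common default-scale choice, not a recommendation from the study.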

LLM-based clustering, on the other hand, can face scalability challenges because of the computational cost of large models. Naively asking an LLM to compare or label every pair of conversations would be infeasible for thousands of chats. Recent research therefore focuses on making LLM clustering efficient with minimal calls or by combining LLMs with smaller models. Many advanced LLM clustering methods rely on iterative querying or fine-tuning, which “introduces instability… and limits scalability for big data” if done for every data point (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering). For instance, an approach that queries ChatGPT for many pairwise decisions will scale poorly as data grows. However, novel frameworks aim to mitigate this. ClusterLLM significantly limits API calls by only querying the LLM on a small subset of informative examples (hard triplets and a few pairwise questions) and then uses those signals to adjust a lightweight embedder for all data ([2305.14871] ClusterLLM: Large Language Models as a Guide for Text Clustering). This achieved quality gains without a proportional increase in cost – LLM usage did not scale with dataset size in their experiments (LLM Efficiency Study). Another example is k-LLMmeans (2024), which integrates an LLM into the centroid-update step of K-means. It periodically prompts the LLM to summarize each cluster’s contents into a new centroid vector, rather than averaging embeddings. By doing this sparingly (every few iterations), it improved clustering results while keeping LLM calls fixed regardless of dataset size (k-LLMmeans Efficiency). In short, researchers are making LLM-based methods more scalable by using the LLM as a smart “coach” or feature improver, rather than brute-force clustering everything via the LLM.
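
The k-LLMmeans control flow described above can be sketched as an ordinary K-means loop whose centroid update is occasionally replaced by the embedding of an LLM-written cluster summary. In this sketch, `embed` and `summarize` are crude stand-ins (a real pipeline would call a sentence-embedding model and a chat LLM); only the loop structure reflects the paper's idea.

```python
# Sketch of a k-LLMmeans-style loop: every few iterations, a cluster's centroid
# becomes the embedding of a summary of its members instead of their mean.
import numpy as np

def embed(texts):
    # Stand-in embedder: hash tokens into a small bag-of-words vector.
    vecs = np.zeros((len(texts), 16))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % 16] += 1.0
    return vecs

def summarize(texts):
    # Stand-in for an LLM call like "Summarize these support chats in one line".
    return " ".join(texts)[:80]

def k_llmmeans(texts, k=2, iters=6, llm_every=3, seed=0):
    X = embed(texts)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(texts), size=k, replace=False)]
    for it in range(iters):
        # Assign each text to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = [t for t, l in zip(texts, labels) if l == c]
            if not members:
                continue
            if (it + 1) % llm_every == 0:
                # "LLM" step: centroid = embedding of the cluster summary.
                centroids[c] = embed([summarize(members)])[0]
            else:
                # Ordinary K-means step: centroid = mean of member embeddings.
                centroids[c] = X[labels == c].mean(axis=0)
    return labels

labels = k_llmmeans(["refund my charge", "charge was wrong",
                     "cannot log in", "reset my password"])
```

Because the summary step runs once per cluster every few iterations, the number of LLM calls depends on k and the iteration count, not on the dataset size, which is the scalability property the paper emphasizes.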

A hybrid strategy is to use LLMs to generate a schema or training data, then rely on classical models for scale. For example, the TnT-LLM framework (2024) first uses an LLM in a multi-step prompt to generate a taxonomy of topics and to assign pseudo-labels to a portion of data, then trains a fast classifier (like a smaller neural network) to label the rest of the dataset at scale (TnT-LLM: Text Mining at Scale with Large Language Models). Applied to conversation data (Bing Chat logs), this approach produced more accurate and relevant topic labels than state-of-the-art clustering baselines, while achieving a good balance of accuracy and efficiency for large-scale deployment (TnT-LLM: Text Mining at Scale with Large Language Models). This underscores an emerging theme: use LLMs to do the “heavy lifting” in understanding the data’s structure, then use traditional ML to mass-produce the results. By contrast, pure LLM clustering (especially with very large models) may become cost-prohibitive as the number of conversations grows into the tens of thousands, unless one has access to cheaper or distilled versions (like an o3-mini model) running on local hardware. Scalability thus tilts in favor of classical methods, but with careful design LLM-based pipelines can be made to scale sufficiently for typical VoC datasets (often a few thousand to a few hundred thousand feedback items).
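
The LLM-labels-then-cheap-classifier handoff can be sketched as follows. Here the LLM step is faked with a fixed list of pseudo-labels, and the taxonomy names and texts are illustrative; only the division of labor mirrors the TnT-LLM pattern.

```python
# Sketch of the hybrid handoff: an LLM pseudo-labels a small seed sample
# (faked below), then a fast classifier labels the rest with no LLM calls.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: pretend an LLM assigned taxonomy labels to a seed sample.
seed_texts = ["charged twice this month", "refund the extra charge",
              "cannot sign in anymore", "password reset not working"]
seed_labels = ["billing", "billing", "login", "login"]

# Step 2: train a cheap classifier on the pseudo-labeled sample.
vec = TfidfVectorizer().fit(seed_texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(seed_texts), seed_labels)

# Step 3: the classifier labels the full corpus at scale.
rest = ["please refund the duplicate charge", "sign in page rejects my password"]
print(clf.predict(vec.transform(rest)))
```

The expensive model runs only on the seed sample, so total LLM cost is bounded by the sample size rather than the corpus size.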

Methodology and Conceptual Differences

Beyond metrics, LLM-based and traditional clustering represent different paradigms. Traditional clustering is fundamentally a numerical optimization problem: conversations are converted into points in a vector space, and algorithms group them based on distance/similarity. K-means, for example, seeks cluster assignments that minimize intra-cluster variance; DBSCAN groups points that are densely packed and labels others as noise. These methods require parameter choices (number of clusters k, or density thresholds) and operate blindly on the features given. They excel at finding patterns like “these conversations use many similar keywords” or “this subset has a distinct embedding profile,” but they do not understand the content in a human sense. They also have fixed behavior – e.g. K-means will always partition the data into k groups (even if one category is actually much more varied or important than others), and will always assign every point to a cluster, whereas DBSCAN will drop anything that doesn’t fit well into a cluster. The theoretical guarantees (convergence, complexity bounds) of these algorithms make their behavior predictable and stable, which is a strength (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering). However, their rigidity means if the true structure of conversations is complex (e.g. overlapping themes or a need for hierarchical grouping), standard clustering can struggle without extensive feature engineering or manual tuning.
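
The assignment-behavior contrast above (K-means always assigns; DBSCAN drops poor fits as noise) is easy to demonstrate on toy data. This sketch assumes scikit-learn, where DBSCAN's noise points get the label -1:

```python
# K-means forces every point into a cluster; DBSCAN marks poor fits as noise.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))   # dense group near (0, 0)
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))   # dense group near (5, 5)
outlier = np.array([[2.5, 2.5]])                        # far from both blobs
X = np.vstack([blob_a, blob_b, outlier])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(km_labels[-1])  # the outlier still gets forced into cluster 0 or 1
print(db_labels[-1])  # -1: DBSCAN refuses to assign it
```

In a support-conversation corpus, that -1 bucket is exactly where the "up to 90% outliers" problem from the earlier study shows up when density parameters are mis-tuned.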

LLM-based approaches treat clustering more as a cognitive task – closer to how a human analyst might read through transcripts and sort them by topic. An LLM can leverage contextual understanding, world knowledge, and even instructions from the user. For example, Wang et al. (2023) propose goal-driven clustering where you provide a goal like “cluster customer feedback by the reason for dissatisfaction” and the LLM generates clusters with natural language descriptions of each theme (Goal-Driven Explainable Clustering via Language Descriptions | OpenReview). In their workflow, the model brainstorms possible cluster explanations in free-form text, then assigns each feedback item to the best-fitting explanation, effectively clustering the data. They report this method produced more accurate, goal-aligned clusters and explanations than prior automated clustering techniques (Goal-Driven Explainable Clustering via Language Descriptions | OpenReview). This illustrates a key conceptual difference: LLMs can incorporate high-level criteria or business context during clustering, whereas traditional methods only consider the low-level features. If you want clusters to focus on, say, root causes of support tickets (as opposed to surface-level similarity), an LLM can be prompted accordingly – it might group “password reset” with “account unlock” issues together under a theme like “login/access problems,” even if those share fewer keywords, because it understands both relate to account access. Traditional clustering would likely split those if the vocabulary differs.
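
The goal-driven flow (propose explanations for a stated goal, then assign each item to the best-fitting one) can be sketched with a stubbed LLM. The `llm` function below returns canned responses and a crude keyword match in place of real model judgment; the prompts and theme names are illustrative, not from the paper.

```python
# Sketch of goal-driven clustering: (1) ask an "LLM" to propose cluster
# explanations for a stated goal, (2) assign each item to the best explanation.
def llm(prompt):
    # Stand-in for a chat-completion call.
    if "propose" in prompt:
        return ["login/access problems", "billing and charges"]
    # Assignment step: keyword match in place of real LLM judgment.
    item = prompt.split("ITEM: ")[1]
    return 0 if any(w in item for w in ("login", "password", "unlock")) else 1

goal = "cluster customer feedback by the reason for dissatisfaction"
items = ["password reset loops forever", "charged twice for one order"]

themes = llm(f"Given the goal '{goal}', propose cluster explanations for: {items}")
assignments = [llm(f"Pick the best theme in {themes} for ITEM: {it}") for it in items]
print(list(zip(items, [themes[a] for a in assignments])))
```

Note that the cluster definitions here are free-form text generated for the goal, not centroids in a feature space, which is the paradigm shift the paragraph above describes.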

Moreover, LLM-based clustering often involves an interactive or iterative process rather than a one-shot algorithm. Approaches like ClusterLLM use an LLM to evaluate or refine cluster boundaries: e.g. ask “Should conversation A and B be in the same category?” for a few uncertain pairs, or “Is conversation X more similar to cluster Y or Z?” ([2305.14871] ClusterLLM: Large Language Models as a Guide for Text Clustering). The answers are then used to adjust the clustering (by fine-tuning embeddings or merging/splitting clusters). This is akin to having an expert in the loop, except the “expert” is the LLM. The outcome is clusters that better align with semantic reality and often fewer mislabeled items. However, because the LLM is effectively changing the clustering rules on the fly (based on its understanding or the provided instructions), the process lacks the neat objective function of K-means or DBSCAN. There’s a theoretical trade-off: injecting LLM reasoning can yield more meaningful clusters, but one sacrifices some mathematical transparency and stability. As one paper notes, many LLM-based methods require careful prompt tuning and can be unstable if the LLM’s outputs change, whereas a modified algorithm like k-LLMmeans tries to “preserve the core behavior” of K-means to retain its guarantees (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering). In summary, traditional clustering is data-driven and formulaic, while LLM-based clustering is knowledge-driven and adaptive. Conceptually, LLM clustering can be seen as clustering by “conceptual similarity” (using the model’s internal language understanding) rather than purely by numeric similarity in a predetermined feature space.
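
One way to pick the "uncertain" cases worth asking the LLM about is to select points whose two nearest centroids are nearly tied. This is a simplified stand-in for ClusterLLM's hard-triplet selection, on toy embeddings; the exact selection criterion in the paper differs.

```python
# Sketch of ambiguity-based triplet selection: query the LLM only about points
# whose two nearest centroids are nearly equidistant.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))                    # stand-in conversation embeddings
centroids = X[rng.choice(50, size=3, replace=False)]

# Distance from every point to every centroid.
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
two_nearest = np.sort(d, axis=1)[:, :2]
margin = two_nearest[:, 1] - two_nearest[:, 0]  # small margin = ambiguous point

anchor = int(margin.argmin())
c1, c2 = (int(i) for i in np.argsort(d[anchor])[:2])
question = (f"Is conversation {anchor} more similar to cluster {c1} "
            f"or cluster {c2}?")  # the query the LLM would adjudicate
print(question)
```

Spending the LLM budget only on low-margin points is what keeps the number of queries small relative to the dataset.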

Interpretability and VoC Insights

For VoC use-cases, the end goal is not just clusters for their own sake, but actionable insights – understanding the themes and issues customers talk about. Here, LLM-based approaches offer a significant advantage in interpretability. Because an LLM works in natural language, it can provide human-readable descriptions of each cluster. For example, an LLM that groups conversations could output labels or summaries like “Cluster 1: Billing Issues (customers complaining about incorrect charges or fees)”. These sorts of summaries can be generated by prompting the LLM on the cluster’s content. Researchers have leveraged this by having the LLM serve as a “cluster interpreter.” In one experiment, ChatGPT was fed sets of support tickets obtained by different clustering algorithms, and asked to describe the cluster’s theme. Marketing experts found that the quality of these AI-generated explanations depended more on how much context the LLM was given than on which algorithm produced the cluster (Chat Clustering Study). In other words, with sufficient background, ChatGPT could articulate cluster topics quite well, helping domain experts quickly grasp what each group of customer comments meant. The ability to explain clusters in plain language is extremely valuable for VoC programs – it bridges the gap between raw data and insights that business stakeholders can act on.
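
Since the experiment above found that explanation quality hinged on the context given to the model, the prompt-building step matters. A minimal sketch of a cluster-interpreter prompt; the wording and the context string are illustrative, and a real pipeline would send this to a chat API:

```python
# Sketch of a "cluster interpreter" prompt: sample a few conversations from a
# cluster and ask for a connecting theme, with business context up front.
def build_label_prompt(cluster_texts, context, n_samples=3):
    samples = "\n".join(f"- {t}" for t in cluster_texts[:n_samples])
    return (
        f"Context: {context}\n"
        f"Here are example support conversations from one cluster:\n{samples}\n"
        "In one short phrase, what issue connects them?"
    )

prompt = build_label_prompt(
    ["I was billed twice", "extra charge on my card", "invoice amount is wrong"],
    context="SaaS product; VoC analysis of billing-period complaints",
)
print(prompt)
```

Sampling a handful of representative texts keeps the prompt cheap while still giving the model enough signal to name the theme.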

Traditional clustering, conversely, yields clusters that must be interpreted manually or with additional tools. Analysts often have to inspect frequent words in each cluster or apply a separate topic model to each group to figure out the theme. In the aforementioned chat analysis project, after clustering, the authors applied extractive summarization (TextRank, LSA, etc.) to each cluster to get a representative summary (Summarization Methods). This two-step process (cluster then summarize) can work, but it’s less seamless than an LLM that inherently does both. Modern LLM-based pipelines can effectively cluster and label simultaneously. For instance, the goal-driven clustering method produces an explanation with each cluster as it goes, as part of the output (Goal-Driven Clustering) (Explainable Clustering). Similarly, k-LLMmeans uses cluster centroid summaries – the textual centroid not only guides the algorithm but also serves as an interpretable label for the cluster (k-LLMmeans Interpretability). This means a VoC analyst using LLM-based clustering might receive clusters of conversations each accompanied by a short description of the issue connecting them, making it immediately clear why that cluster is important.

From a VoC insights perspective, accuracy and interpretability go hand-in-hand. If a clustering method is very accurate in a mathematical sense but yields opaque clusters that an analyst can’t make sense of, it has limited value. LLM-based methods strive to provide both high clustering accuracy and intelligible results. They align clustering with how humans categorize meaning, often resulting in clusters that correspond to intuitive topics (with less “junk” mixing). However, it’s important to validate LLM-generated clusters and labels – sometimes an LLM might hallucinate a theme that sounds plausible but isn’t truly reflected by all items in the cluster. Ensuring the clusters really match customer utterances still requires human oversight. In practice, many organizations may use a hybrid approach: use LLMs to draft cluster labels or groupings, then have analysts review and adjust as needed.

Conclusion

In summary, LLM-based clustering approaches have begun to outshine traditional vector+algorithm techniques in accuracy and quality of groupings, especially for complex, nuanced conversation data. They can capture the true Voice of the Customer more faithfully, distinguishing subtle differences in intent and providing rich, explainable cluster labels ([2305.14871] ClusterLLM: Large Language Models as a Guide for Text Clustering) (Goal-Driven Explainable Clustering via Language Descriptions | OpenReview). Traditional methods like K-means/DBSCAN remain faster, more scalable, and easier to implement, and they perform well when conversations are shorter or vocabulary-driven (and they benefit from domain-specific embedding tuning) (Domain Tuning Benefits) (BERT Model Performance). Conceptually, the choice is between a data-centric approach (fast but literal grouping of text similarity) and a knowledge-centric approach (slower but semantically nuanced grouping). For VoC applications, where understanding themes and root causes in customer feedback is the end goal, LLM-based techniques provide a powerful boost in insight generation – often worth the additional complexity. As research continues to improve the efficiency of LLM clustering (through few-shot learning, smarter prompts, and hybrid models), we can expect these AI-driven methods to become increasingly practical at scale, turning mountains of support conversations into clear, accurate Voice-of-the-Customer insights.

Sources:

Recent research and case studies comparing LLM-guided clustering with traditional methods (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering) ([2305.14871] ClusterLLM: Large Language Models as a Guide for Text Clustering) (TnT-LLM: Text Mining at Scale with Large Language Models), as well as theoretical discussions on their differences ([2305.14871] ClusterLLM: Large Language Models as a Guide for Text Clustering) (k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering) and real-world evaluations on customer feedback data (Chat Clustering Study) (Goal-Driven Explainable Clustering via Language Descriptions | OpenReview).

Discussion

After reading the identified papers, we decided to split the difference and go with a hybrid approach. The LLM approach passes the vibes test and offers a kind of interpretability (in the informal, not the formal, sense) that suits our customers’ needs, though it does feel like it lacks the rigor of traditional clustering. Traditional clustering is a more mature field with extensive research and established best practices, but we found it still suffers from interpretability problems: you end up trying to ascribe a story to each cluster that the LLM approach would have written for you.

We chose a data model that supports both modes. Research is ongoing into the exact composition of the traditional ML clustering approaches and the new LLM clustering.