{"id":238,"date":"2025-11-30T01:39:35","date_gmt":"2025-11-30T01:39:35","guid":{"rendered":"http:\/\/ijeesoo.com\/?page_id=238"},"modified":"2025-11-30T02:19:10","modified_gmt":"2025-11-30T02:19:10","slug":"clustering-for-information-retrieval-recommender-systems","status":"publish","type":"page","link":"http:\/\/ijeesoo.com\/?page_id=238","title":{"rendered":"Clustering for Information Retrieval, Recommender Systems"},"content":{"rendered":"\n<div class=\"wp-block-group alignfull has-accent-background-color has-background has-global-padding is-layout-constrained wp-container-core-group-is-layout-73128380 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-a0d91a25 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-stretch is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<div class=\"wp-block-group is-vertical is-content-justification-stretch is-layout-flex wp-container-core-group-is-layout-c042bc37 wp-block-group-is-layout-flex\" style=\"min-height:100%\">\n<p class=\"has-heading-font-family has-x-large-font-size wp-block-paragraph\" style=\"line-height:1.2\"><br>What is Clustering?<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<p class=\"wp-block-paragraph\">\u2022 <strong>Definition:<\/strong> The fundamental task of grouping a set of objects (documents) such that objects in the same group (cluster) are <strong>more similar<\/strong> to each other than to those in other groups.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 It is the most prevalent form of <strong>unsupervised 
learning<\/strong> in IR.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Goal:<\/strong> To achieve high <strong>intra-cluster similarity<\/strong> (documents inside a cluster are highly similar) and high <strong>inter-cluster dissimilarity<\/strong> (documents in different clusters are highly dissimilar).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Contrast with Classification:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Classification<\/strong> is supervised: documents are assigned to predefined, known classes (labels).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Clustering<\/strong> is unsupervised: the classes (clusters) are discovered <em>from<\/em> the data.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-base-color has-contrast-background-color has-text-color has-background has-link-color wp-elements-0094ebd230c07d9ccdbbb64f9a920d8c has-global-padding is-layout-constrained wp-container-core-group-is-layout-36f65c2d wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-07826717f6d4e79c9fe4182e7338b7c0\">What is Clustering?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Definition:<\/strong> The fundamental task of grouping a set of objects (documents) such that objects in the same group (cluster) are <strong>more similar<\/strong> to 
each other than to those in other groups.<\/li>\n\n\n\n<li>It is the most prevalent form of <strong>unsupervised learning<\/strong> in IR.<\/li>\n\n\n\n<li><strong>Goal:<\/strong> To achieve high <strong>intra-cluster similarity<\/strong> (documents inside a cluster are highly similar) and high <strong>inter-cluster dissimilarity<\/strong> (documents in different clusters are highly dissimilar).<\/li>\n\n\n\n<li><strong>Contrast with Classification:<\/strong><\/li>\n\n\n\n<li><strong>Classification<\/strong> is supervised: documents are assigned to predefined, known classes (labels).<\/li>\n\n\n\n<li><strong>Clustering<\/strong> is unsupervised: the classes (clusters) are discovered <em>from<\/em> the data.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-82d4eb3e8ebc0ddce583e560e979fc6a\">The Cluster Hypothesis<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"871\" height=\"311\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24.png\" alt=\"\" class=\"wp-image-240\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24.png 871w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24-300x107.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24-768x274.png 768w\" sizes=\"auto, (max-width: 871px) 100vw, 871px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3b63c4511fd66ba061a8f9ac29ff653e\">Why Cluster Documents? (I: User Interface)<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-ff531f0748bcb7724f9b701a0c5aac72 wp-block-paragraph\"><strong>1. Organizing Search Results (Post-Retrieval Clustering):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Clustering is performed on the top-ranked documents (e.g., the top 50\u2013100) retrieved by a standard ranking function.<\/li>\n\n\n\n<li><strong>Purpose:<\/strong> To provide a <strong>summary<\/strong> or <strong>topical categorization<\/strong> of the result set.<\/li>\n\n\n\n<li><strong>Benefit:<\/strong> Users can quickly grasp the diverse themes\/topics present and navigate to the most relevant group.<\/li>\n\n\n\n<li><strong>Example:<\/strong> Tools like <strong>Scatter\/Gather<\/strong> or clustering search engines (e.g., Viv\u00edsimo) group results with auto-generated labels (e.g., <em>\u201cAmazon River,\u201d \u201cAmazon E-Commerce,\u201d \u201cAmazon Rainforest\u201d<\/em> for the query \u201cAmazon\u201d).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-749df5c17558c23e17717a451ca1487c\">Why Cluster Documents? 
(II: System Efficiency &amp; Effectiveness)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"545\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-1024x545.png\" alt=\"\" class=\"wp-image-242\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-1024x545.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-300x160.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-768x408.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25.png 1365w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3b8831ecb2f87d5caf710d0a04e0309f\">Document Representation for Clustering<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"579\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-1024x579.png\" alt=\"\" class=\"wp-image-245\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-1024x579.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-300x169.png 300w, 
http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-768x434.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26.png 1301w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ce76688d347d17ac10c649e45cea282c\">Measuring Document Similarity<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"588\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-1024x588.png\" alt=\"\" class=\"wp-image-246\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-1024x588.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-300x172.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-768x441.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27.png 1337w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-8739ea6499debf4a0092119d47af807b\">Types of Clustering Structures<\/h3>\n\n\n\n<p 
class=\"has-contrast-1-color has-text-color has-link-color wp-elements-a465cc69c7a6bdc81d6e0d0c14035e96 wp-block-paragraph\">Clustering methods can be broadly categorized based on the structure of the resulting partition:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>1. Partitional (Flat) Clustering:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Creates a single level of <strong>K<\/strong> non-overlapping, exhaustive clusters.<\/li>\n\n\n\n<li>Every document belongs to exactly one cluster.<\/li>\n\n\n\n<li>Requires specifying the number of clusters <strong>K<\/strong> <em>a priori<\/em>.<\/li>\n\n\n\n<li><strong>Algorithms: K-Means<\/strong>, EM (Expectation-Maximization).<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\"><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>2. Hierarchical Clustering:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Creates a nested sequence of partitions, organized into a tree structure called a <strong>Dendrogram<\/strong>.<\/li>\n\n\n\n<li>Does <em>not<\/em> require specifying <strong>K<\/strong>. 
The final partition is chosen by cutting the dendrogram at a specific level.<\/li>\n\n\n\n<li><strong>Algorithms: Agglomerative (Bottom-Up)<\/strong> and <strong>Divisive (Top-Down)<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-cc59cbe4a9f347081e045a378c9c92fe\">Partitional Clustering: K-Means Introduction<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"523\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-1024x523.png\" alt=\"\" class=\"wp-image-248\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-1024x523.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-300x153.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-768x392.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28.png 1334w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-62d927f4fd906fdadb1297757981a166\">The K-Means Algorithm 
Steps<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"500\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-1024x500.png\" alt=\"\" class=\"wp-image-250\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-1024x500.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-300x146.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-768x375.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29.png 1334w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-7e8768fb14f07eaf4218b54d1051adcc\">Challenges in K-Means Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f9646766b098a7abfb7f83e556f28a9a wp-block-paragraph\">While fast and simple, K-Means has significant drawbacks, especially in the context of high-dimensional IR data:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>1. Sensitivity to Initial Centroids:<\/strong> The final clustering result often depends heavily on the initial random selection of <strong>K<\/strong> centroids. 
Poor initialization can lead to a locally optimal but globally suboptimal solution (high SSE).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SSE = sum of squared distances between each point and its assigned cluster centroid.<\/li>\n\n\n\n<li>High SSE \u2192 clusters are not tight, points are far from their centroids \u2192 poor clustering quality.<\/li>\n\n\n\n<li><em>Mitigation:<\/em> Use more sophisticated initialization methods like <strong>K-Means++<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>2. Requirement to Pre-specify K:<\/strong> There is no <em>a priori<\/em> method to determine the &#8220;correct&#8221; number of clusters, <strong>K<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <em>Mitigation:<\/em> Use techniques like the <strong>Elbow Method<\/strong> or <strong>Silhouette Analysis<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>3. Assumes Spherical Clusters:<\/strong> K-Means minimizes squared Euclidean distance, inherently assuming clusters are convex and roughly spherical (isotropically distributed). It struggles with irregularly shaped or intertwined clusters.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-b3e864ab6d80c31f3bd2447df00a555f wp-block-paragraph\"><strong>4. 
Outlier Sensitivity:<\/strong> Each centroid is the mean of the points assigned to its cluster, so a few outliers can pull the cluster center far from the bulk of the data.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-324d75742c13aab6376c0c65a7213319\">Hierarchical Clustering: Introduction<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-4a243d9fa6b0320376e01a7b44d095bd wp-block-paragraph\">\u2022 <strong>Concept:<\/strong> Creates a nested sequence of clusters, resulting in a tree-like structure known as a <strong>Dendrogram<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Advantage:<\/strong> Does not require specifying the number of clusters (<strong>K<\/strong>) beforehand. 
The resulting clusters can be viewed at various levels of granularity by &#8220;cutting&#8221; the dendrogram.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Two Main Approaches:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Agglomerative (Bottom-Up):<\/strong> Starts with individual documents as clusters and merges the most similar clusters iteratively.<\/li>\n\n\n\n<li><strong>Divisive (Top-Down):<\/strong> Starts with all documents in one cluster and recursively splits the cluster into two until only individual documents remain.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-da27aa0aefb7d5b77d63a329bf9d3f3c\">Agglomerative Hierarchical Clustering (HAC)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"510\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-1024x510.png\" alt=\"\" class=\"wp-image-253\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-1024x510.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-300x149.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-768x382.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30.png 1338w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator 
has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ed199116519ffa91815a538623b9ff1f\">Defining Cluster Proximity: Linkage Criteria<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"735\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-1024x735.png\" alt=\"\" class=\"wp-image-257\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-1024x735.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-300x215.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-768x551.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31.png 1352w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-d94ad99fa26f521bd088241c88e444e6\">The Dendrogram<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-ed2fcbb09bf27d47c72fc7a08730c2d6 wp-block-paragraph\">\u2022 <strong>Visualization Tool:<\/strong> The dendrogram visually represents the entire history of merges (in HAC) or splits (in divisive clustering) that created the hierarchy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Interpretation:<\/strong><\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">      \u2022 <strong>X-axis:<\/strong> Represents the documents.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">      \u2022 <strong>Y-axis:<\/strong> Represents the <strong>distance\/dissimilarity<\/strong> (or time\/iteration step) at which the merges occurred.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-95cca824abfdcb14d118538ce252ad72 wp-block-paragraph\">\u2022 <strong>Determining K:<\/strong> To obtain <strong>K<\/strong> flat clusters, one &#8220;cuts&#8221; the dendrogram horizontally at a height that intersects exactly <strong>K<\/strong> vertical lines. Clusters joined at lower heights are more similar.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3002007c6851e612f327b09cb279eb14\">Divisive Hierarchical Clustering (DHC)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"571\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-1024x571.png\" alt=\"\" class=\"wp-image-259\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-1024x571.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-300x167.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-768x429.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32.png 1326w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-4a6cc91b6d9e0da18861852a76189390\">Density-Based Clustering (DBSCAN)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-1024x536.png\" alt=\"\" class=\"wp-image-260\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-1024x536.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-300x157.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-768x402.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33.png 1321w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\"\/>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-cfca68bae0eb193ce522d29732fd7189\"><br>Model-Based Clustering (GMM)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"568\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-1024x568.png\" alt=\"\" class=\"wp-image-261\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-1024x568.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-300x167.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-768x426.png 768w, 
http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34.png 1288w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-2caec8c2e6280ac75800950588e940a5\">Clustering of Terms<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"472\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-1024x472.png\" alt=\"\" class=\"wp-image-263\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-1024x472.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-300x138.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-768x354.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35.png 1323w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-43a7713afc466073f6679b0c9ad81637\">Dimensionality Reduction for Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color 
wp-elements-45ac576ef1e23242b269a26887f8933f wp-block-paragraph\">\u2022 Document vectors are extremely high-dimensional (vocabulary size <strong>M<\/strong> is typically tens of thousands). This presents a challenge:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>The Curse of Dimensionality:<\/strong> In high dimensions, pairwise distances tend to become nearly equal, making the concepts of &#8220;nearness&#8221; and &#8220;farness&#8221; less meaningful for clustering.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Computational Cost:<\/strong> Clustering algorithms slow down considerably as <strong>M<\/strong> increases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Common Techniques:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>1. Latent Semantic Indexing (LSI) \/ Singular Value Decomposition (SVD):<\/strong> Decomposes the Term-Document matrix to project data into a lower-dimensional latent space (e.g., k \u2248 300).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>2. 
Principal Component Analysis (PCA):<\/strong> Finds the directions (components) in the data that account for the most variance.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-1e5e4d658ed36ecc5ff72aa9e9484681 wp-block-paragraph\">\u2022 <strong>Goal:<\/strong> Preserve the underlying cluster structure while removing noise and improving computational efficiency.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-36b01850e62aa6c5b55c5f2b5c6fad1a\">Scalability Challenges in IR Clustering<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"526\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-1024x526.png\" alt=\"\" class=\"wp-image-266\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-1024x526.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-300x154.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-768x394.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36.png 1305w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3df644a139c39461c73c039f9bc330f5\">Scalability Solutions: Sampling and Bisection<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8e787451a0e8bcf059aaf407d29df8ef wp-block-paragraph\">\u2022 <strong>1. Random Sampling:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 Cluster only a <strong>representative sample<\/strong> of the documents (e.g., 1% of the corpus).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 Use the resulting centroids\/structure to guide the assignment of the remaining 99%.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Risk:<\/strong> The sample may not perfectly capture the true distribution of topics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>2. Bisecting K-Means:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 A hybrid approach blending partitional and divisive hierarchical concepts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022 <strong>Method:<\/strong> Start with one cluster containing all documents. 
In each step, choose a cluster to split (for example, the largest one) and use <strong>K-Means with K=2<\/strong> to divide it into two sub-clusters, repeating until the desired number of clusters is reached.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-89edfece83b8221e8732384750f3d508 wp-block-paragraph\">\u2022<strong>Advantage:<\/strong> Often faster than standard K-Means because each split runs K-Means on a smaller subset, and it yields a hierarchy as a by-product. It is typically less sensitive to initialization, though not deterministic unless the random seeds are fixed.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ff9670c43acb4a7006d7e61be74f5d07\">Scalability Solutions: Centroid-Based Optimization<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"462\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-1024x462.png\" alt=\"\" class=\"wp-image-271\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-1024x462.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-300x135.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-768x347.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37.png 1327w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-e1f431f2d9d2594d99d2ab2b516bcceb\">Soft Clustering and Multilingual IR<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-56a294f42b17a4b643005fe3773f494c wp-block-paragraph\">\u2022<strong>Soft (Fuzzy) Clustering:<\/strong> Documents are assigned a <strong>degree of membership<\/strong> (a probability or weight) in <em>every<\/em> cluster.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Benefit in IR:<\/strong> A document covering multiple topics (e.g., &#8220;AI ethics and law&#8221;) is appropriately represented in both the &#8220;AI&#8221; and &#8220;Law&#8221; clusters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Common Algorithm: Expectation-Maximization (EM) for Gaussian Mixture Models (GMMs)<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Multilingual Clustering:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Documents in different languages covering the same topic should cluster together.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-55e7c2444e36a1312d3292aa7ef24a0e wp-block-paragraph\">\u2022<strong>Requirement:<\/strong> Needs a <strong>language-independent representation<\/strong>, such as Cross-Lingual LSI (CL-LSI) or machine translation to a single intermediary language.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-6f3500a7158e70499e5a28ab7de2b816\">Application I: Search Result Clustering (SRC)<\/h3>\n\n\n\n<p 
class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f82ef5fcea3d8c01f2f7dc54746320e1 wp-block-paragraph\">\u2022<strong>Type:Post-Retrieval Clustering<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Process:<\/strong>The system first executes the query using a standard ranking model (e.g., BM25) and retrieves the top <strong>N<\/strong>documents (<strong>N <\/strong>\u2248<strong>50-200<\/strong>). It then clusters only this small result set.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>User Benefit:<\/strong>Improves the <strong>browsing<\/strong>experience. Instead of a long ranked list, the user sees a few meaningful cluster labels.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Challenges:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Speed:<\/strong>Must be executed <strong>in real-time<\/strong>(milliseconds) to avoid user latency. Requires very fast algorithms.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f5f6edbfd19b194f577e1ce56e66eaaa wp-block-paragraph\">\u2022<strong>Labeling:<\/strong>Clusters must be accurately and concisely <strong>labeled<\/strong>using terms extracted from the cluster documents (e.g., using phrases or named entities).<\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-d445b0947fe58c3e4205811f0b0098fb\">Application II: Pre-Retrieval Clustering (Corpus-Based)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" 
decoding=\"async\" width=\"1024\" height=\"440\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-1024x440.png\" alt=\"\" class=\"wp-image-274\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-1024x440.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-300x129.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-768x330.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38.png 1332w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-9dcc8543249657a61868c508b519e62a\">The Cluster Labeling Problem<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-5e71919c454aafa650b4b9b7fb5aebaa wp-block-paragraph\">\u2022<strong>Crucial for SRC:<\/strong>A cluster is useless to a user without a descriptive, human-readable label.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Goal:<\/strong>Generate a short phrase (2\u20134 words) that accurately summarizes the topic of all documents in the cluster.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Methods:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>1. Centroid Terms:<\/strong>Use the terms with the highest weight (TF-IDF) in the cluster centroid vector.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>2. 
Frequent Phrases:<\/strong> Extract frequent, non-stop-word phrases (N-grams) that appear across the cluster documents.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>3. Named Entities:<\/strong> Use recognized names (people, organizations, locations) as labels.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-b6c1389a26ece9f5c4a84725c9f14ef0 wp-block-paragraph\">\u2022<strong>Filtering:<\/strong> Labels must be filtered to remove overly general or non<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-bc526ba9a1f4bcadd720e5ad614a2be1\">Determining the Optimal Number of Clusters (K)<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f1a6ec4bf9a4fdf4128d05f2233a54cb wp-block-paragraph\"><strong>The K Problem:<\/strong> Partitional algorithms such as K-Means require <strong>K<\/strong> as input. Choosing <strong>K<\/strong> is non-trivial.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>1. Elbow Method:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Plot the <strong>SSE<\/strong> (Sum of Squared Errors) against the number of clusters <strong>K<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Rationale:<\/strong> As <strong>K<\/strong> increases, the SSE naturally drops. A good <strong>K<\/strong> is often found at the &#8220;elbow&#8221;, the point where the rate of decrease slows sharply.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>2. 
Silhouette Analysis:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Measures how similar a document is to its own cluster compared to other clusters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Goal:<\/strong>Maximize the average <strong>Silhouette Coefficient<\/strong>over all documents.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-6b4a2f2e1b2d2765acad980629bf44dc wp-block-paragraph\">\u2022<strong>3. Domain Knowledge:<\/strong>Sometimes, <strong>K<\/strong>is fixed based on known categories or user interface constraints (e.g., &#8220;show 10 clusters&#8221;).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-4f0dc2b08fba7fd8bbddc9e555848e54\">Evaluation of Clustering: Internal Metrics<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-4007657c8ad3be3f43a968123de75b21 wp-block-paragraph\">\u2022<strong>Concept:<\/strong>Assess the quality of the clustering structure <em>without<\/em>reference to external, pre-defined class labels. Based purely on the data and the distance metric.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>1. Cohesion (Intra-Cluster Similarity):<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Measures how tightly related the documents within a cluster are.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Metrics:<\/strong>Average similarity of all documents to the cluster centroid (or the mean pairwise distance). <strong>We want high cohesion.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>2. 
Separation (Inter-Cluster Dissimilarity):<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Measures how distinct clusters are from each other.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Metrics:<\/strong>Distance between cluster centroids (e.g., Euclidean distance). <strong>We want high separation.<\/strong><\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f73d3909aaa6e6edd0ad829c127fb616 wp-block-paragraph\">\u2022<strong>3. Silhouette Coefficient:<\/strong>Combines cohesion and separation into a single score.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-b6aff349445a4ad9a51f5dbbc36abc5e\">Evaluation of Clustering: External Metrics<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"550\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-1024x550.png\" alt=\"\" class=\"wp-image-278\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-1024x550.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-300x161.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-768x412.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39.png 1364w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p 
class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-cbed2b5900c628f879a6db930162f367\">Evaluation Pitfall: The Subjectivity of Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-fe65414e989a399070eaab8c76fa2e4a wp-block-paragraph\">\u2022<strong>Clustering is Subjective:<\/strong>Unlike classification, there is no single, objective &#8220;correct&#8221; clustering for a dataset. Different goals require different clustering structures.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022For <strong>efficiency<\/strong>, a flat, coarse clustering (low <strong>K<\/strong>) might be best.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022For <strong>user browsing<\/strong>, a fine-grained, meaningful clustering (higher <strong>K<\/strong>) is better.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Impact of Metrics:<\/strong>Internal metrics (cohesion\/separation) and external metrics (purity\/F-measure) may disagree.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022A clustering that optimizes <strong>Purity<\/strong>might fail to be useful in a search interface if the clusters are too small or lack good labels.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-5e5548625f1cc566f40a749a4151bdf0 wp-block-paragraph\">\u2022<strong>Conclusion:<\/strong>Clustering evaluation must consider the <strong>application context<\/strong>and human judgment (user studies) alongside numerical&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color 
has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-bffa56d1123752369376cfc756e559fd\">Improving Retrieval Effectiveness via Clustering<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"522\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-1024x522.png\" alt=\"\" class=\"wp-image-282\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-1024x522.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-300x153.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-768x391.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40.png 1346w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-9f5fba91fbb6209e0ba6d24555e10f10\">Case Study: Scatter\/Gather Interface<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Concept:<\/strong>A user-driven interactive clustering interface for browsing large document collections.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Steps:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1.The user selects an initial set of documents or submits a query.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2.The result set is quickly clustered (the <strong>Scatter<\/strong>step).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">3.The system presents the cluster 
labels and sizes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">4.The user selects one or more clusters to merge, refine, or drill down into (the <strong>Gather<\/strong>step).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">5.The system re-clusters the gathered subset, repeating the process.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-0be91fba7f38a9f6a320fa0830604e68 wp-block-paragraph\">\u2022<strong>Advantage:<\/strong>Provides an effective topical overview and allows the user to dynamically navigate the space without multiple query reformulations.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-245a5ef89919c05f937692bfa3ec730a\"><br>Clustering and Personalization<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-906452c8aff63c23176366af549d7751 wp-block-paragraph\">\u2022<strong>User Profiles:<\/strong>Documents retrieved by a system are clustered based not just on content, but also on the user&#8217;s past interaction history.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Method:<\/strong>Documents are weighted by a user-interest vector. 
Clustering then groups documents that are topically similar <em>and<\/em>relevant to the user&#8217;s inferred interests.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Application: News and Recommendation Systems:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Grouping news articles by topic, then prioritizing clusters where the user has shown previous interest.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-91321743b2e9c82cd41c4f58c3e2dacd wp-block-paragraph\">\u2022Helps filter out irrelevant content and improve the perceived&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-9b31ab1a8a8624ce75bcbab1c1c0dd23\">Clustering as an Alternative to Indexing<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-d8f546237e2926a7fa496078ad9523e3 wp-block-paragraph\">\u2022<strong>Inverted Indices<\/strong>are the standard method for fast retrieval in IR.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Clustering<\/strong>offers an alternative access structure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Comparison:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Indexing:High Precision\/High Recall<\/strong>. Comparison cost is <strong><em>O(query terms)<\/em><\/strong>. Highly accurate.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Clustering:Reduced Cost\/Potentially Lower Recall<\/strong>. Comparison cost is <strong><em>O(number of clusters)<\/em><\/strong>. 
Fast, but lossy.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f7eed903fc410ae0f00ed8aad996bdc5 wp-block-paragraph\">\u2022<strong>Hybrid Systems:<\/strong> The most common approach is to use clustering <strong>in conjunction<\/strong> with an inverted index, e.g., using clustering to find the&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-5ae78264f8a1001b9618de1d0fa6f082\">The Impact of Sparsity<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Sparsity:<\/strong> Document vectors in IR are extremely <strong>sparse<\/strong> (most dimensions\/terms have a zero weight).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Consequence for Distance:<\/strong> Sparse vectors often have few co-occurring non-zero dimensions, leading to small or zero dot products and therefore low Cosine Similarity for most document pairs, which makes related documents hard to tell apart from unrelated ones.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Mitigation:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Dimensionality Reduction (LSI):<\/strong> Creating dense vectors in a latent space mitigates the sparsity problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Careful Metric Selection:<\/strong> Prefer metrics that handle high-dimensional sparsity well (like Cosine Similarity) over Euclidean distance.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color 
has-link-color wp-elements-71ad6a7f17e0c36747ca839099798ec0\">Challenges: Noise and Outliers<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Noise:<\/strong> Documents that are short, contain little content, or have irrelevant content (e.g., boilerplate text, spam) can distort cluster centers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Outliers:<\/strong> Documents that are entirely dissimilar to the rest of the collection.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>K-Means Problem:<\/strong> Outliers can hijack a cluster centroid, pulling it away from the dense core of the cluster.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-9e9ed2cfa1ec4a9a803ce480461d7609 wp-block-paragraph\">\u2022<strong>DBSCAN Solution:<\/strong> Density-based methods such as DBSCAN explicitly mark outliers as &#8220;noise&#8221; points that do not belong to any cluster, making the resulting clusters cleaner.<\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-45926f92eea044d0e0414c3a69fd0c32\">Summary: Clustering in IR<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-30cbb5e73218fe4d1c249a5f3fe66307 wp-block-paragraph\">\u2022<strong>Objective:<\/strong> Group similar documents to improve organization, user browsing, efficiency, and retrieval effectiveness (Recall).<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">\u2022<strong>Foundation:<\/strong>The <strong>Cluster Hypothesis<\/strong>\u2014similar documents are relevant to similar queries.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Key Algorithms:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>K-Means:<\/strong>Fast, scalable, requires pre-defined <strong>K<\/strong>, assumes spherical clusters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Hierarchical (HAC\/DHC):<\/strong>Creates a rich structural hierarchy (Dendrogram), more complex\/slower.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-3024ed1c3fe6433e27e579747518263d wp-block-paragraph\">\u2022<strong>Evaluation:<\/strong>Requires both <strong>internal metrics<\/strong>(Cohesion\/Separation) and <strong>external metrics<\/strong>(Purity\/F-Measure) relative to the final application goal.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-61dbda33c4685e1b5a6e70e45c0c265c\">Future Directions in IR Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-c01ada41b3e3a30ee12de4b571be49e9 wp-block-paragraph\">\u2022<strong>1. Dynamic and Streaming Clustering:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Clustering documents (e.g., news feeds, social media posts) that arrive continuously and rapidly. The cluster structure must be maintained without re-clustering the entire history.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>2. 
Integrated Topic Modeling:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Moving beyond traditional geometric clustering by using <strong>probabilistic models<\/strong> (like Latent Dirichlet Allocation, LDA) to define clusters based on latent topics, leading to more semantically meaningful cluster labels.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>3. GPU-Accelerated Clustering:<\/strong><\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f669cab7f6911ab09a05e37de8f42d32 wp-block-paragraph\">\u2022Leveraging parallel processing hardware to reach the real-time clustering performance needed for high-volume web search result organization.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-c2a21dc913fa7b4afe672d8302ba919a\">Conclusion<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-cfc15af2e3dd0222cbddeea844a5246e wp-block-paragraph\">\u2022Clustering is a vital component of modern Information Retrieval systems, transforming vast, undifferentiated document collections into organized, navigable topical structures.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-cae6d3d84263d6dc10cbf57c6097ff50 wp-block-paragraph\">\u2022<strong>Key Takeaway:<\/strong> The choice of clustering algorithm, similarity metric, and evaluation method must always be dictated by the specific IR goal: efficiency, effectiveness, or user experience.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-36f65c2d wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-columns alignwide are-vertically-aligned-center is-layout-flex wp-container-core-columns-is-layout-ca2b64f2 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<h2 class=\"wp-block-heading\"><br>Recommender Systems<\/h2>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-base-color has-contrast-background-color has-text-color has-background has-link-color wp-elements-a9ec71512fd96df22145675da7477204 has-global-padding is-layout-constrained wp-container-core-group-is-layout-36f65c2d wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-97ca590194584ede35adb84d167aec9c\">Introduction to Recommender Systems<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"799\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-1024x799.png\" alt=\"\" class=\"wp-image-305\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-1024x799.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-300x234.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-768x599.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41.png 1120w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-975c360dd1e9406397f8f7aceec9ef0b\">Types of Feedback and Data<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-97e74c2369ff174f7897fff0767d4b6e wp-block-paragraph\">\u2022<strong>Explicit Feedback:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Direct input from users regarding their interest in items.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Examples:<\/strong>Star ratings (1-5 stars on Amazon), Like\/Dislike (YouTube), Numerical scores.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Pros:<\/strong>High 
precision; clear signal of user preference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Cons:<\/strong> Sparse data (users rarely rate); reporting bias (users usually rate only things they love or hate).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Implicit Feedback:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Inferences made from user behavior.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Examples:<\/strong> Purchase history, video watch time, page views, click-through rate, listening history (Spotify).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Pros:<\/strong> Abundant data; no extra effort required from the user.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Cons:<\/strong> Noisy data (a click doesn&#8217;t guarantee satisfaction); binary interpretation (interacted vs. didn&#8217;t interact) is often required.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>The Utility Matrix:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022The central data structure is a matrix with Users as rows and Items as columns.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-9cee760ce492c140f56c6d1eda17c9d7 wp-block-paragraph\">\u2022The matrix is extremely <strong>sparse<\/strong> (often &gt;99% empty). The goal of a recommender system (RS) is effectively &#8220;Matrix Completion.&#8221;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-1e3af6fc9028adbe03ca9158f504c165\">Taxonomy of Recommendation Approaches<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-5b02941103362a39ab9e85e22f78d35c wp-block-paragraph\">\u2022<strong>1. 
Content-Based Filtering (CBF):<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Recommends items similar to those the user liked in the past.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Relies on <strong>Item Features<\/strong> (keywords, genre, director, etc.) and a <strong>User Profile<\/strong> built from those features.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>2. Collaborative Filtering (CF):<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Recommends items based on the preferences of similar users.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Relies on the <strong>Interaction Matrix<\/strong> (ratings\/behavior) rather than item metadata.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Key Idea:<\/strong> &#8220;Users who agreed in the past will agree in the future.&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>3. Hybrid Methods:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Combine CBF and CF to overcome limitations like the Cold Start problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>4. Knowledge-Based:<\/strong><\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-22f80e1c6abafb260dfb1617e15b352d wp-block-paragraph\">\u2022Uses explicit domain knowledge and requirements (e.g., &#8220;I need a laptop with 16 GB RAM for under $1000&#8221;). 
Common in complex domains like real estate or electronics.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-da5233fcd58e1defd7003335fa4a95d9\">Content-Based Filtering: Mechanics<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"668\" height=\"386\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-42.png\" alt=\"\" class=\"wp-image-310\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-42.png 668w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-42-300x173.png 300w\" sizes=\"auto, (max-width: 668px) 100vw, 668px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-04f0dcddc65a9d7db14d5a8f8f4d31bc\">Collaborative Filtering: User-Based<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"624\" height=\"378\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-43.png\" alt=\"\" 
class=\"wp-image-312\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-43.png 624w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-43-300x182.png 300w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-4b9c92610cc81ce8c18387a504d05a4a\">Collaborative Filtering: Item-Based<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"732\" height=\"376\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-44.png\" alt=\"\" class=\"wp-image-314\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-44.png 732w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-44-300x154.png 300w\" sizes=\"auto, (max-width: 732px) 100vw, 732px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-e080f61a47d6d6184d1d48fea44b3d8b\">Model-Based CF: Matrix Factorization<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"552\" height=\"428\" 
src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-45.png\" alt=\"\" class=\"wp-image-318\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-45.png 552w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-45-300x233.png 300w\" sizes=\"auto, (max-width: 552px) 100vw, 552px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ec103ab635a02c65dc55b8e29d08122c\">The Cold Start Problem &amp; Hybridization<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"728\" height=\"384\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-46.png\" alt=\"\" class=\"wp-image-320\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-46.png 728w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-46-300x158.png 300w\" sizes=\"auto, (max-width: 728px) 100vw, 728px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color 
has-text-color has-link-color wp-elements-0df20751683ad4e6a5463535ff77e1fc\">Evaluation Metrics (Prediction vs. Ranking)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"742\" height=\"376\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-47.png\" alt=\"\" class=\"wp-image-322\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-47.png 742w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-47-300x152.png 300w\" sizes=\"auto, (max-width: 742px) 100vw, 742px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-5d56367ef63d7d8d40dbefe9a8c2cd5e\">Challenges and Advanced Topics<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-d9a68c48f67866226536e647391e9f07 wp-block-paragraph\">\u2022<strong>Scalability:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Real-world systems have millions of users and items. 
O(N<sup>2<\/sup>) pairwise neighbor calculations are too slow at that scale.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Solutions:<\/strong> Clustering, Dimensionality Reduction (SVD), Approximate Nearest Neighbor (ANN) search (e.g., Faiss).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Data Sparsity:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022When the interaction matrix is &lt; 0.1% full, finding overlaps between users is difficult.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Solutions:<\/strong> Matrix Factorization, incorporating trust networks (Trust-aware Recommenders).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>The &#8220;Filter Bubble&#8221;:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022The system keeps recommending what the user already agrees with or likes, reducing diversity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Evaluation:<\/strong> Need metrics for <strong>Novelty<\/strong> (recommending unknown items) and <strong>Serendipity<\/strong> (recommending surprisingly interesting items).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022<strong>Shilling Attacks:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2022Malicious users create fake profiles to rate items highly (push attacks) or poorly (nuke attacks) to manipulate the system.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-9b8db5c8fd5bd065457a22d72d04db57 wp-block-paragraph\">\u2022<strong>Robustness:<\/strong> Designing algorithms resistant to outliers and fake profiles.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" 
aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>What is Clustering? \u2022Definition: The fundamental task of grouping a set of objects (documents) such that objects in the same group (cluster) are more similar to each other than to those in other groups. \u2022It is the most prevalent form of unsupervised learning in IR. \u2022Goal: To achieve high intra-cluster similarity (documents inside a cluster are highly similar) and high [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":113,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-238","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/238","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=238"}],"version-history":[{"count":73,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/238\/revisions"}],"predecessor-version":[{"id":336,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/238\/revisions\/336"}],"up":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/113"}],"wp:attachment":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=238"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}