{"id":238,"date":"2025-11-30T01:39:35","date_gmt":"2025-11-30T01:39:35","guid":{"rendered":"http:\/\/ijeesoo.com\/?page_id=238"},"modified":"2025-11-30T02:19:10","modified_gmt":"2025-11-30T02:19:10","slug":"clustering-for-information-retrieval-recommender-systems","status":"publish","type":"page","link":"http:\/\/ijeesoo.com\/?page_id=238","title":{"rendered":"\u00a0Clustering for Information Retrieval, Recommender Systems"},"content":{"rendered":"\n<div class=\"wp-block-group alignfull has-accent-background-color has-background has-global-padding is-layout-constrained wp-container-core-group-is-layout-669513ed wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-1da03c2a wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-stretch is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<div class=\"wp-block-group is-vertical is-content-justification-stretch is-layout-flex wp-container-core-group-is-layout-444d5ee8 wp-block-group-is-layout-flex\" style=\"min-height:100%\">\n<p class=\"has-heading-font-family has-x-large-font-size\" style=\"line-height:1.2\"><br>What is Clustering?<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<p>\u2022<strong>Definition:<\/strong> The fundamental task of grouping a set of objects (documents) such that objects in the same group (cluster) are <strong>more similar<\/strong> to each other than to those in other groups.<\/p>\n\n\n\n<p>\u2022It is the most prevalent form of <strong>unsupervised learning<\/strong> in IR.<\/p>\n\n\n\n<p>\u2022<strong>Goal:<\/strong> To achieve high 
<strong>intra-cluster similarity<\/strong> (documents inside a cluster are highly similar) and high <strong>inter-cluster dissimilarity<\/strong> (documents in different clusters are highly dissimilar).<\/p>\n\n\n\n<p>\u2022<strong>Contrast with Classification:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>Classification<\/strong> is supervised: documents are assigned to predefined, known classes (labels).<\/p>\n\n\n\n<p>\u2022<strong>Clustering<\/strong> is unsupervised: the classes (clusters) are discovered <em>from<\/em> the data.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-base-color has-contrast-background-color has-text-color has-background has-link-color wp-elements-0094ebd230c07d9ccdbbb64f9a920d8c has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-07826717f6d4e79c9fe4182e7338b7c0\">What is Clustering?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Definition:<\/strong> The fundamental task of grouping a set of objects (documents) such that objects in the same group (cluster) are <strong>more similar<\/strong> to each other than to those in other groups.<\/li>\n\n\n\n<li>It is the most prevalent form of <strong>unsupervised learning<\/strong> in IR.<\/li>\n\n\n\n<li><strong>Goal:<\/strong> To achieve high 
<strong>intra-cluster similarity<\/strong> (documents inside a cluster are highly similar) and high <strong>inter-cluster dissimilarity<\/strong> (documents in different clusters are highly dissimilar).<\/li>\n\n\n\n<li><strong>Contrast with Classification:<\/strong><\/li>\n\n\n\n<li><strong>Classification<\/strong> is supervised: documents are assigned to predefined, known classes (labels).<\/li>\n\n\n\n<li><strong>Clustering<\/strong> is unsupervised: the classes (clusters) are discovered <em>from<\/em> the data.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-82d4eb3e8ebc0ddce583e560e979fc6a\">The Cluster Hypothesis<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"871\" height=\"311\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24.png\" alt=\"\" class=\"wp-image-240\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24.png 871w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24-300x107.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-24-768x274.png 768w\" sizes=\"auto, (max-width: 871px) 100vw, 871px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3b63c4511fd66ba061a8f9ac29ff653e\">Why 
Cluster Documents? (I: User Interface)<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-ff531f0748bcb7724f9b701a0c5aac72\"><strong>1. Organizing Search Results (Post-Retrieval Clustering):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Clustering is performed on the top 50\u2013100 documents retrieved by a standard ranking function.<\/li>\n\n\n\n<li><strong>Purpose:<\/strong> To provide a <strong>summary<\/strong> or <strong>topical categorization<\/strong> of the result set.<\/li>\n\n\n\n<li><strong>Benefit:<\/strong> Users can quickly grasp the diverse themes\/topics present and navigate to the most relevant group.<\/li>\n\n\n\n<li><strong>Example:<\/strong> Tools like <strong>Scatter\/Gather<\/strong> or modern clustered search engines (e.g., Viv\u00edsimo) group results with auto-generated labels (e.g., <em>\u201cAmazon River,\u201d \u201cAmazon E-Commerce,\u201d \u201cAmazon Rainforest\u201d<\/em> for the query \u201cAmazon\u201d).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-749df5c17558c23e17717a451ca1487c\">Why Cluster Documents? 
(II: System Efficiency &amp; Effectiveness)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"545\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-1024x545.png\" alt=\"\" class=\"wp-image-242\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-1024x545.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-300x160.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25-768x408.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-25.png 1365w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3b8831ecb2f87d5caf710d0a04e0309f\">Document Representation for Clustering<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"579\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-1024x579.png\" alt=\"\" class=\"wp-image-245\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-1024x579.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-300x169.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26-768x434.png 768w, 
http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-26.png 1301w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ce76688d347d17ac10c649e45cea282c\">Measuring Document Similarity<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"588\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-1024x588.png\" alt=\"\" class=\"wp-image-246\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-1024x588.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-300x172.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27-768x441.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-27.png 1337w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-8739ea6499debf4a0092119d47af807b\">Types of Clustering Structures<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-a465cc69c7a6bdc81d6e0d0c14035e96\">Clustering methods 
can be broadly categorized based on the structure of the resulting partition:<\/p>\n\n\n\n<p><strong>1. Partitional (Flat) Clustering:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Creates a single level of <strong>K<\/strong> non-overlapping, exhaustive clusters.<\/li>\n\n\n\n<li>Every document belongs to exactly one cluster.<\/li>\n\n\n\n<li>Requires specifying the number of clusters <strong>K<\/strong> <em>a priori<\/em>.<\/li>\n\n\n\n<li><strong>Algorithms: K-Means<\/strong>, EM (Expectation-Maximization).<\/li>\n<\/ul>\n\n\n\n<p><strong>2. Hierarchical Clustering:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Creates a nested sequence of partitions, organized into a tree structure called a <strong>Dendrogram<\/strong>.<\/li>\n\n\n\n<li>Does <em>not<\/em> require specifying <strong>K<\/strong>. The final partition is chosen by cutting the dendrogram at a specific level.<\/li>\n\n\n\n<li><strong>Algorithms: Agglomerative (Bottom-Up)<\/strong> and <strong>Divisive (Top-Down)<\/strong>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-cc59cbe4a9f347081e045a378c9c92fe\">Partitional Clustering: K-Means Introduction<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"523\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-1024x523.png\" alt=\"\" class=\"wp-image-248\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-1024x523.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-300x153.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28-768x392.png 
768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-28.png 1334w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-62d927f4fd906fdadb1297757981a166\">The K-Means Algorithm Steps<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"500\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-1024x500.png\" alt=\"\" class=\"wp-image-250\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-1024x500.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-300x146.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29-768x375.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-29.png 1334w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color 
wp-elements-7e8768fb14f07eaf4218b54d1051adcc\">Challenges in K-Means Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f9646766b098a7abfb7f83e556f28a9a\">While fast and simple, K-Means has significant drawbacks, especially in the context of high-dimensional IR data:<\/p>\n\n\n\n<p><strong>1. Sensitivity to Initial Centroids:<\/strong> The final clustering result often depends heavily on the initial random selection of <strong>K<\/strong> centroids. Poor initialization can lead to a locally optimal but globally suboptimal solution (high SSE).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SSE = sum of squared distances between each point and its assigned cluster centroid.<\/li>\n\n\n\n<li>High SSE \u2192 clusters are not tight, points are far from their centroids \u2192 poor clustering quality.<\/li>\n\n\n\n<li><em>Mitigation:<\/em> Use more sophisticated initialization methods like <strong>K-Means++<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p><strong>2. Requirement to Pre-specify K:<\/strong> There is no <em>a priori<\/em> method to determine the &#8220;correct&#8221; number of clusters, <strong>K<\/strong>.<\/p>\n\n\n\n<p>\u2022<em>Mitigation:<\/em> Use techniques like the <strong>Elbow Method<\/strong> or <strong>Silhouette Analysis<\/strong>.<\/p>\n\n\n\n<p><strong>3. Assumes Spherical Clusters:<\/strong> K-Means minimizes squared Euclidean distance, inherently assuming clusters are convex and roughly spherical (isotropically distributed). It struggles with irregularly shaped or intertwined clusters.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-b3e864ab6d80c31f3bd2447df00a555f\"><strong>4. 
Outlier Sensitivity:<\/strong> Centroids are the mean of the data points, making them highly sensitive to outliers, which can significantly skew the cluster center.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-324d75742c13aab6376c0c65a7213319\">Hierarchical Clustering: Introduction<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-4a243d9fa6b0320376e01a7b44d095bd\">\u2022<strong>Concept:<\/strong> Creates a nested sequence of clusters, resulting in a tree-like structure known as a <strong>Dendrogram<\/strong>.<\/p>\n\n\n\n<p>\u2022<strong>Advantage:<\/strong> Does not require specifying the number of clusters (<strong>K<\/strong>) beforehand. 
The resulting clusters can be viewed at various levels of granularity by &#8220;cutting&#8221; the dendrogram.<\/p>\n\n\n\n<p>\u2022<strong>Two Main Approaches:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Agglomerative (Bottom-Up):<\/strong> Starts with individual documents as clusters and merges the most similar clusters iteratively.<\/li>\n\n\n\n<li><strong>Divisive (Top-Down):<\/strong> Starts with all documents in one cluster and recursively splits the cluster into two until only individual documents remain.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-da27aa0aefb7d5b77d63a329bf9d3f3c\">Agglomerative Hierarchical Clustering (HAC)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"510\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-1024x510.png\" alt=\"\" class=\"wp-image-253\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-1024x510.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-300x149.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30-768x382.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-30.png 1338w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color 
has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ed199116519ffa91815a538623b9ff1f\">Defining Cluster Proximity: Linkage Criteria<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"735\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-1024x735.png\" alt=\"\" class=\"wp-image-257\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-1024x735.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-300x215.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31-768x551.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-31.png 1352w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-d94ad99fa26f521bd088241c88e444e6\">The Dendrogram<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-ed2fcbb09bf27d47c72fc7a08730c2d6\">\u2022<strong>Visualization Tool:<\/strong> The dendrogram visually represents the entire history of merges (in HAC) or splits (in Divisive) that created the hierarchy.<\/p>\n\n\n\n<p>\u2022<strong>Interpretation:<\/strong><\/p>\n\n\n\n<p>      \u2022<strong>X-axis:<\/strong> Represents the documents.<\/p>\n\n\n\n<p>      
\u2022<strong>Y-axis:<\/strong> Represents the <strong>distance\/dissimilarity<\/strong> (or time\/iteration step) at which the merges occurred.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-95cca824abfdcb14d118538ce252ad72\">\u2022<strong>Determining K:<\/strong> To obtain <strong>K<\/strong> flat clusters, one &#8220;cuts&#8221; the dendrogram horizontally at a height that intersects exactly <strong>K<\/strong> vertical lines. Clusters joined at lower heights are more similar.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3002007c6851e612f327b09cb279eb14\">Divisive Hierarchical Clustering (DHC)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"571\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-1024x571.png\" alt=\"\" class=\"wp-image-259\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-1024x571.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-300x167.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32-768x429.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-32.png 1326w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-4a6cc91b6d9e0da18861852a76189390\">Density-Based 
Clustering (DBSCAN)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"536\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-1024x536.png\" alt=\"\" class=\"wp-image-260\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-1024x536.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-300x157.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33-768x402.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-33.png 1321w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-wide\"\/>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-cfca68bae0eb193ce522d29732fd7189\"><br>Model-Based Clustering (GMM)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"568\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-1024x568.png\" alt=\"\" class=\"wp-image-261\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-1024x568.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-300x167.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34-768x426.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-34.png 1288w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator 
has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-2caec8c2e6280ac75800950588e940a5\">Clustering of Terms<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"472\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-1024x472.png\" alt=\"\" class=\"wp-image-263\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-1024x472.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-300x138.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35-768x354.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-35.png 1323w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-43a7713afc466073f6679b0c9ad81637\">Dimensionality Reduction for Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-45ac576ef1e23242b269a26887f8933f\">\u2022Document vectors are extremely high-dimensional (vocabulary size <strong>M <\/strong>is typically tens of thousands). 
This presents two challenges:<\/p>\n\n\n\n<p>\u2022<strong>The Curse of Dimensionality:<\/strong> In high dimensions, pairwise distances tend to become nearly equal, making the concept of &#8220;nearness&#8221; and &#8220;farness&#8221; less meaningful for clustering.<\/p>\n\n\n\n<p>\u2022<strong>Computational Cost:<\/strong> Clustering algorithms slow down considerably as <strong>M<\/strong> increases.<\/p>\n\n\n\n<p>\u2022<strong>Common Techniques:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>1. Latent Semantic Indexing (LSI) \/ Singular Value Decomposition (SVD):<\/strong> Decomposes the Term-Document matrix to project data into a lower-dimensional latent space (k \u2248 300).<\/p>\n\n\n\n<p>\u2022<strong>2. Principal Component Analysis (PCA):<\/strong> Finds the directions (components) in the data that account for the most variance.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-1e5e4d658ed36ecc5ff72aa9e9484681\">\u2022<strong>Goal:<\/strong> Preserve the underlying cluster structure while removing noise and improving computational efficiency.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-36b01850e62aa6c5b55c5f2b5c6fad1a\">Scalability Challenges in IR Clustering<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"526\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-1024x526.png\" alt=\"\" class=\"wp-image-266\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-1024x526.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-300x154.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36-768x394.png 
768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-36.png 1305w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-3df644a139c39461c73c039f9bc330f5\">Scalability Solutions: Sampling and Bisection<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8e787451a0e8bcf059aaf407d29df8ef\">\u2022<strong>1. Random Sampling:<\/strong><\/p>\n\n\n\n<p>\u2022Cluster only a <strong>representative sample<\/strong> of the documents (e.g., 1% of the corpus).<\/p>\n\n\n\n<p>\u2022Use the resulting centroids\/structure to guide the assignment of the remaining 99%.<\/p>\n\n\n\n<p>\u2022<strong>Risk:<\/strong> The sample may not perfectly capture the true distribution of topics.<\/p>\n\n\n\n<p>\u2022<strong>2. Bisecting K-Means:<\/strong><\/p>\n\n\n\n<p>\u2022A hybrid approach blending partitional and divisive hierarchical concepts.<\/p>\n\n\n\n<p>\u2022<strong>Method:<\/strong> Start with one cluster containing all documents. 
In each step, choose a cluster to split and use <strong>K-Means with K=2<\/strong> to divide it into two sub-clusters.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-89edfece83b8221e8732384750f3d508\">\u2022<strong>Advantage:<\/strong> Often faster than traditional K-Means because each split runs K-Means on a smaller subset, and the sequence of splits yields a single hierarchy as a by-product. Each split still depends on how its two centroids are initialized, so the result is not strictly deterministic.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ff9670c43acb4a7006d7e61be74f5d07\">Scalability Solutions: Centroid-Based Optimization<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"462\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-1024x462.png\" alt=\"\" class=\"wp-image-271\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-1024x462.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-300x135.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37-768x347.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-37.png 1327w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 
class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-e1f431f2d9d2594d99d2ab2b516bcceb\">Soft Clustering and Multilingual IR<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-56a294f42b17a4b643005fe3773f494c\">\u2022<strong>Soft (Fuzzy) Clustering:<\/strong> Each document is assigned a <strong>degree of membership<\/strong> (a probability or weight) in <em>every<\/em> cluster.<\/p>\n\n\n\n<p>\u2022<strong>Benefit in IR:<\/strong> A document covering multiple topics (e.g., &#8220;AI ethics and law&#8221;) is appropriately represented in both the &#8220;AI&#8221; and &#8220;Law&#8221; clusters.<\/p>\n\n\n\n<p>\u2022<strong>Common Algorithm: Expectation-Maximization (EM) for Gaussian Mixture Models (GMMs)<\/strong>.<\/p>\n\n\n\n<p>\u2022<strong>Multilingual Clustering:<\/strong><\/p>\n\n\n\n<p>\u2022Documents in different languages covering the same topic should cluster together.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-55e7c2444e36a1312d3292aa7ef24a0e\">\u2022<strong>Requirement:<\/strong> Needs a <strong>language-independent representation<\/strong>, such as Cross-Lingual LSI (CL-LSI) or machine translation to a single intermediary language.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-6f3500a7158e70499e5a28ab7de2b816\">Application I: Search Result Clustering (SRC)<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f82ef5fcea3d8c01f2f7dc54746320e1\">\u2022<strong>Type: Post-Retrieval Clustering<\/strong><\/p>\n\n\n\n<p>\u2022<strong>Process:<\/strong> The system first executes the query using a standard 
ranking model (e.g., BM25) and retrieves the top <strong>N<\/strong> documents (<strong>N<\/strong> \u2248 <strong>50-200<\/strong>). It then clusters only this small result set.<\/p>\n\n\n\n<p>\u2022<strong>User Benefit:<\/strong> Improves the <strong>browsing<\/strong> experience. Instead of a long ranked list, the user sees a few meaningful cluster labels.<\/p>\n\n\n\n<p>\u2022<strong>Challenges:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>Speed:<\/strong> Must be executed <strong>in real time<\/strong> (milliseconds) so the user does not notice added latency. Requires very fast algorithms.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f5f6edbfd19b194f577e1ce56e66eaaa\">\u2022<strong>Labeling:<\/strong> Clusters must be accurately and concisely <strong>labeled<\/strong> using terms extracted from the cluster documents (e.g., using phrases or named entities).<\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-d445b0947fe58c3e4205811f0b0098fb\">Application II: Pre-Retrieval Clustering (Corpus-Based)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"440\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-1024x440.png\" alt=\"\" class=\"wp-image-274\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-1024x440.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-300x129.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38-768x330.png 768w, 
http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-38.png 1332w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-9dcc8543249657a61868c508b519e62a\">The Cluster Labeling Problem<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-5e71919c454aafa650b4b9b7fb5aebaa\">\u2022<strong>Crucial for SRC:<\/strong> A cluster is useless to a user without a descriptive, human-readable label.<\/p>\n\n\n\n<p>\u2022<strong>Goal:<\/strong> Generate a short phrase (2\u20134 words) that accurately summarizes the topic of all documents in the cluster.<\/p>\n\n\n\n<p>\u2022<strong>Methods:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>1. Centroid Terms:<\/strong> Use the terms with the highest weight (TF-IDF) in the cluster centroid vector.<\/p>\n\n\n\n<p>\u2022<strong>2. Frequent Phrases:<\/strong> Extract frequent, non-stop-word phrases (N-grams) that appear across the cluster documents.<\/p>\n\n\n\n<p>\u2022<strong>3. 
Named Entities:<\/strong> Use recognized names (people, organizations, locations) as labels.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-b6c1389a26ece9f5c4a84725c9f14ef0\">\u2022<strong>Filtering:<\/strong> Labels must be filtered to remove overly general or non<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-bc526ba9a1f4bcadd720e5ad614a2be1\">Determining the Optimal Number of Clusters (K)<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f1a6ec4bf9a4fdf4128d05f2233a54cb\"><strong>The K Problem:<\/strong> Only some partitional algorithms (like K-Means) require <strong>K<\/strong> as input. Choosing <strong>K<\/strong> is non-trivial.<\/p>\n\n\n\n<p>\u2022<strong>1. Elbow Method:<\/strong><\/p>\n\n\n\n<p>\u2022Plot the <strong>SSE<\/strong> (Sum of Squared Errors) against the number of clusters <strong>K<\/strong>.<\/p>\n\n\n\n<p>\u2022<strong>Rationale:<\/strong> As <strong>K<\/strong> increases, the SSE naturally drops. The optimal <strong>K<\/strong> is often found at the &#8220;elbow,&#8221; where the rate of decrease slows sharply.<\/p>\n\n\n\n<p>\u2022<strong>2. 
Silhouette Analysis:<\/strong><\/p>\n\n\n\n<p>\u2022Measures how similar a document is to its own cluster compared to other clusters.<\/p>\n\n\n\n<p>\u2022<strong>Goal:<\/strong> Maximize the average <strong>Silhouette Coefficient<\/strong> over all documents.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-6b4a2f2e1b2d2765acad980629bf44dc\">\u2022<strong>3. Domain Knowledge:<\/strong> Sometimes, <strong>K<\/strong> is fixed based on known categories or user interface constraints (e.g., &#8220;show 10 clusters&#8221;).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-4f0dc2b08fba7fd8bbddc9e555848e54\">Evaluation of Clustering: Internal Metrics<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-4007657c8ad3be3f43a968123de75b21\">\u2022<strong>Concept:<\/strong> Assess the quality of the clustering structure <em>without<\/em> reference to external, pre-defined class labels. The assessment is based purely on the data and the distance metric.<\/p>\n\n\n\n<p>\u2022<strong>1. Cohesion (Intra-Cluster Similarity):<\/strong><\/p>\n\n\n\n<p>\u2022Measures how tightly related the documents within a cluster are.<\/p>\n\n\n\n<p>\u2022<strong>Metrics:<\/strong> Average similarity of all documents to the cluster centroid (or the mean pairwise similarity within the cluster). <strong>We want high cohesion.<\/strong><\/p>\n\n\n\n<p>\u2022<strong>2. Separation (Inter-Cluster Dissimilarity):<\/strong><\/p>\n\n\n\n<p>\u2022Measures how distinct clusters are from each other.<\/p>\n\n\n\n<p>\u2022<strong>Metrics:<\/strong> Distance between cluster centroids (e.g., Euclidean distance). 
<strong>We want high separation.<\/strong><\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f73d3909aaa6e6edd0ad829c127fb616\">\u2022<strong>3. Silhouette Coefficient:<\/strong>Combines cohesion and separation into a single score.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-b6aff349445a4ad9a51f5dbbc36abc5e\">Evaluation of Clustering: External Metrics<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"550\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-1024x550.png\" alt=\"\" class=\"wp-image-278\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-1024x550.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-300x161.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39-768x412.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-39.png 1364w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" 
style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-cbed2b5900c628f879a6db930162f367\">Evaluation Pitfall: The Subjectivity of Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-fe65414e989a399070eaab8c76fa2e4a\">\u2022<strong>Clustering is Subjective:<\/strong> Unlike classification, there is no single, objective &#8220;correct&#8221; clustering for a dataset. Different goals require different clustering structures.<\/p>\n\n\n\n<p>\u2022For <strong>efficiency<\/strong>, a flat, coarse clustering (low <strong>K<\/strong>) might be best.<\/p>\n\n\n\n<p>\u2022For <strong>user browsing<\/strong>, a fine-grained, meaningful clustering (higher <strong>K<\/strong>) is better.<\/p>\n\n\n\n<p>\u2022<strong>Impact of Metrics:<\/strong> Internal metrics (cohesion\/separation) and external metrics (purity\/F-measure) may disagree.<\/p>\n\n\n\n<p>\u2022A clustering that optimizes <strong>Purity<\/strong> might still be of little use in a search interface if the clusters are too small or lack good labels.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-5e5548625f1cc566f40a749a4151bdf0\">\u2022<strong>Conclusion:<\/strong> Clustering evaluation must consider the <strong>application context<\/strong> and human judgment (user studies) alongside numerical&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-bffa56d1123752369376cfc756e559fd\">Improving Retrieval Effectiveness via Clustering<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img 
loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"522\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-1024x522.png\" alt=\"\" class=\"wp-image-282\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-1024x522.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-300x153.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40-768x391.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-40.png 1346w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-9f5fba91fbb6209e0ba6d24555e10f10\">Case Study: Scatter\/Gather Interface<\/h3>\n\n\n\n<p>\u2022<strong>Concept:<\/strong> A user-driven interactive clustering interface for browsing large document collections.<\/p>\n\n\n\n<p>\u2022<strong>Steps:<\/strong><\/p>\n\n\n\n<p>1. The user selects an initial set of documents or submits a query.<\/p>\n\n\n\n<p>2. The result set is quickly clustered (the <strong>Scatter<\/strong> step).<\/p>\n\n\n\n<p>3. The system presents the cluster labels and sizes.<\/p>\n\n\n\n<p>4. The user selects one or more clusters to merge, refine, or drill down into (the <strong>Gather<\/strong> step).<\/p>\n\n\n\n<p>5. The system re-clusters the gathered subset, repeating the process.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-0be91fba7f38a9f6a320fa0830604e68\">\u2022<strong>Advantage:<\/strong> Provides an effective topical overview and allows the user to dynamically navigate the space without multiple query reformulations.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color 
wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-245a5ef89919c05f937692bfa3ec730a\"><br>Clustering and Personalization<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-906452c8aff63c23176366af549d7751\">\u2022<strong>User Profiles:<\/strong> Documents retrieved by a system are clustered based not just on content, but also on the user&#8217;s past interaction history.<\/p>\n\n\n\n<p>\u2022<strong>Method:<\/strong> Documents are weighted by a user-interest vector. Clustering then groups documents that are topically similar <em>and<\/em> relevant to the user&#8217;s inferred interests.<\/p>\n\n\n\n<p>\u2022<strong>Application: News and Recommendation Systems:<\/strong><\/p>\n\n\n\n<p>\u2022Grouping news articles by topic, then prioritizing clusters where the user has shown previous interest.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-91321743b2e9c82cd41c4f58c3e2dacd\">\u2022Helps filter out irrelevant content and improve the perceived&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-9b31ab1a8a8624ce75bcbab1c1c0dd23\">Clustering as an Alternative to Indexing<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-d8f546237e2926a7fa496078ad9523e3\">\u2022<strong>Inverted Indices<\/strong> are the standard method for fast retrieval in 
IR.<\/p>\n\n\n\n<p>\u2022<strong>Clustering<\/strong> offers an alternative access structure.<\/p>\n\n\n\n<p>\u2022<strong>Comparison:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>Indexing: High Precision\/High Recall<\/strong>. Comparison cost is <strong><em>O(query terms)<\/em><\/strong>. Highly accurate.<\/p>\n\n\n\n<p>\u2022<strong>Clustering: Reduced Cost\/Potentially Lower Recall<\/strong>. Comparison cost is <strong><em>O(number of clusters)<\/em><\/strong>. Fast, but lossy.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f7eed903fc410ae0f00ed8aad996bdc5\">\u2022<strong>Hybrid Systems:<\/strong> The most common approach is to use clustering <strong>in conjunction<\/strong> with an inverted index, e.g., using clustering to find the&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-5ae78264f8a1001b9618de1d0fa6f082\">The Impact of Sparsity<\/h3>\n\n\n\n<p>\u2022<strong>Sparsity:<\/strong> Document vectors are extremely <strong>sparse<\/strong> (most dimensions\/terms have a zero weight) in IR.<\/p>\n\n\n\n<p>\u2022<strong>Consequence for Distance:<\/strong> Sparse vectors often share few non-zero dimensions, so the dot product, and therefore the Cosine Similarity, between most document pairs is small or zero. Most documents then look almost equally dissimilar, which makes true neighbors hard to distinguish.<\/p>\n\n\n\n<p>\u2022<strong>Mitigation:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>Dimensionality Reduction (LSI):<\/strong> Creating dense vectors in a latent space mitigates the sparsity problem.<\/p>\n\n\n\n<p>\u2022<strong>Careful Metric Selection:<\/strong> Prefer metrics that handle high-dimensional sparsity well, such as Cosine Similarity, over Euclidean distance.<\/p>\n\n\n\n<hr class=\"wp-block-separator 
has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-71ad6a7f17e0c36747ca839099798ec0\">Challenges: Noise and Outliers<\/h3>\n\n\n\n<p>\u2022<strong>Noise:<\/strong> Documents that are short, contain little content, or have irrelevant content (e.g., boilerplate text, spam) can distort cluster centers.<\/p>\n\n\n\n<p>\u2022<strong>Outliers:<\/strong> Documents that are entirely dissimilar to the rest of the collection.<\/p>\n\n\n\n<p>\u2022<strong>K-Means Problem:<\/strong> Outliers can hijack a cluster centroid, pulling it away from the true cluster mean.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-9e9ed2cfa1ec4a9a803ce480461d7609\">\u2022<strong>DBSCAN Solution:<\/strong> Density-based methods explicitly mark outliers as &#8220;noise&#8221; points that do not belong to any cluster, making the resulting clusters cleaner.<\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-45926f92eea044d0e0414c3a69fd0c32\">Summary: Clustering in IR<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-30cbb5e73218fe4d1c249a5f3fe66307\">\u2022<strong>Objective:<\/strong> Group similar 
documents to improve organization, user browsing, efficiency, and retrieval effectiveness (Recall).<\/p>\n\n\n\n<p>\u2022<strong>Foundation:<\/strong> The <strong>Cluster Hypothesis<\/strong>: similar documents tend to be relevant to the same queries.<\/p>\n\n\n\n<p>\u2022<strong>Key Algorithms:<\/strong><\/p>\n\n\n\n<p>\u2022<strong>K-Means:<\/strong> Fast, scalable, requires pre-defined <strong>K<\/strong>, assumes spherical clusters.<\/p>\n\n\n\n<p>\u2022<strong>Hierarchical (HAC\/DHC):<\/strong> Creates a rich structural hierarchy (Dendrogram), but is more complex and slower.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-3024ed1c3fe6433e27e579747518263d\">\u2022<strong>Evaluation:<\/strong> Requires both <strong>internal metrics<\/strong> (Cohesion\/Separation) and <strong>external metrics<\/strong> (Purity\/F-Measure), interpreted relative to the final application goal.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-61dbda33c4685e1b5a6e70e45c0c265c\">Future Directions in IR Clustering<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-c01ada41b3e3a30ee12de4b571be49e9\">\u2022<strong>1. Dynamic and Streaming Clustering:<\/strong><\/p>\n\n\n\n<p>\u2022Clustering documents (e.g., news feeds, social media posts) that arrive continuously and rapidly. The cluster structure must be maintained without re-clustering the entire history.<\/p>\n\n\n\n<p>\u2022<strong>2. 
Integrated Topic Modeling:<\/strong><\/p>\n\n\n\n<p>\u2022Moving beyond traditional geometric clustering by using <strong>probabilistic models<\/strong> (like Latent Dirichlet Allocation, LDA) to define clusters based on latent topics, leading to more semantically meaningful cluster labels.<\/p>\n\n\n\n<p>\u2022<strong>3. GPU-Accelerated Clustering:<\/strong><\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-f669cab7f6911ab09a05e37de8f42d32\">\u2022Leveraging parallel processing hardware to reach the real-time clustering speed needed to organize high-volume web search results.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-c2a21dc913fa7b4afe672d8302ba919a\">Conclusion<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-cfc15af2e3dd0222cbddeea844a5246e\">\u2022Clustering is a vital component of modern Information Retrieval systems, transforming vast, undifferentiated document collections into organized, navigable topical structures.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-cae6d3d84263d6dc10cbf57c6097ff50\">\u2022<strong>Key Takeaway:<\/strong> The choice of clustering algorithm, similarity metric, and evaluation method must always be dictated by the specific IR goal: efficiency, effectiveness, or user experience.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<div 
style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-columns alignwide are-vertically-aligned-center is-layout-flex wp-container-core-columns-is-layout-47c06fe3 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<h2 class=\"wp-block-heading\"><br>Recommender Systems<\/h2>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-base-color has-contrast-background-color has-text-color has-background has-link-color wp-elements-a9ec71512fd96df22145675da7477204 has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide 
has-base-2-color has-text-color has-link-color wp-elements-97ca590194584ede35adb84d167aec9c\">Introduction to Recommender Systems<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"799\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-1024x799.png\" alt=\"\" class=\"wp-image-305\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-1024x799.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-300x234.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41-768x599.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-41.png 1120w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-975c360dd1e9406397f8f7aceec9ef0b\">Types of Feedback and Data<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-97e74c2369ff174f7897fff0767d4b6e\">\u2022<strong>Explicit Feedback:<\/strong><\/p>\n\n\n\n<p>\u2022Direct input from users regarding their interest in items.<\/p>\n\n\n\n<p>\u2022<strong>Examples:<\/strong> Star ratings (1-5 stars on Amazon), Like\/Dislike (YouTube), numerical scores.<\/p>\n\n\n\n<p>\u2022<strong>Pros:<\/strong> High precision; clear signal of user preference.<\/p>\n\n\n\n<p>\u2022<strong>Cons:<\/strong> Sparse data (users rarely rate); reporting bias (users tend to rate only things they love or hate).<\/p>\n\n\n\n<p>\u2022<strong>Implicit 
Feedback:<\/strong><\/p>\n\n\n\n<p>\u2022Inferences made from user behavior.<\/p>\n\n\n\n<p>\u2022<strong>Examples:<\/strong> Purchase history, video watch time, page views, click-through rate, listening history (Spotify).<\/p>\n\n\n\n<p>\u2022<strong>Pros:<\/strong> Abundant data; no extra effort required from the user.<\/p>\n\n\n\n<p>\u2022<strong>Cons:<\/strong> Noisy data (a click doesn&#8217;t guarantee satisfaction); binary interpretation (interacted vs. didn&#8217;t interact) is often required.<\/p>\n\n\n\n<p>\u2022<strong>The Utility Matrix:<\/strong><\/p>\n\n\n\n<p>\u2022The central data structure is a matrix with Users as rows and Items as columns, whose entries are the known ratings or interactions.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-9cee760ce492c140f56c6d1eda17c9d7\">\u2022The matrix is extremely <strong>sparse<\/strong> (often &gt;99% empty). The goal of a recommender system is effectively &#8220;Matrix Completion&#8221;: predicting the missing entries.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-1e3af6fc9028adbe03ca9158f504c165\">Taxonomy of Recommendation Approaches<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-5b02941103362a39ab9e85e22f78d35c\">\u2022<strong>1. Content-Based Filtering (CBF):<\/strong><\/p>\n\n\n\n<p>\u2022Recommends items similar to those the user liked in the past.<\/p>\n\n\n\n<p>\u2022Relies on <strong>Item Features<\/strong> (keywords, genre, director, etc.) and a <strong>User Profile<\/strong> built from those features.<\/p>\n\n\n\n<p>\u2022<strong>2. 
Collaborative Filtering (CF):<\/strong><\/p>\n\n\n\n<p>\u2022Recommends items based on the preferences of similar users.<\/p>\n\n\n\n<p>\u2022Relies on the <strong>Interaction Matrix<\/strong> (ratings\/behavior) rather than item metadata.<\/p>\n\n\n\n<p>\u2022<strong>Key Idea:<\/strong> &#8220;Users who agreed in the past will agree in the future.&#8221;<\/p>\n\n\n\n<p>\u2022<strong>3. Hybrid Methods:<\/strong><\/p>\n\n\n\n<p>\u2022Combines CBF and CF to overcome limitations like the Cold Start problem.<\/p>\n\n\n\n<p>\u2022<strong>4. Knowledge-Based:<\/strong><\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-22f80e1c6abafb260dfb1617e15b352d\">\u2022Uses explicit domain knowledge and requirements (e.g., &#8220;I need a laptop with 16GB RAM for under $1000&#8221;). Common in complex domains like real estate or electronics.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-da5233fcd58e1defd7003335fa4a95d9\">Content-Based Filtering: Mechanics<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"668\" height=\"386\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-42.png\" alt=\"\" class=\"wp-image-310\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-42.png 668w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-42-300x173.png 300w\" sizes=\"auto, (max-width: 668px) 100vw, 668px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" 
aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignwide is-layout-flow wp-block-group-is-layout-flow\">\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-04f0dcddc65a9d7db14d5a8f8f4d31bc\">Collaborative Filtering: User-Based<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"624\" height=\"378\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-43.png\" alt=\"\" class=\"wp-image-312\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-43.png 624w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-43-300x182.png 300w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-4b9c92610cc81ce8c18387a504d05a4a\">Collaborative Filtering: Item-Based<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"732\" height=\"376\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-44.png\" alt=\"\" class=\"wp-image-314\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-44.png 732w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-44-300x154.png 300w\" sizes=\"auto, 
(max-width: 732px) 100vw, 732px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-e080f61a47d6d6184d1d48fea44b3d8b\">Model-Based CF: Matrix Factorization<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"552\" height=\"428\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-45.png\" alt=\"\" class=\"wp-image-318\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-45.png 552w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-45-300x233.png 300w\" sizes=\"auto, (max-width: 552px) 100vw, 552px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-ec103ab635a02c65dc55b8e29d08122c\">The Cold Start Problem &amp; Hybridization<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"728\" height=\"384\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-46.png\" alt=\"\" class=\"wp-image-320\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-46.png 728w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-46-300x158.png 
300w\" sizes=\"auto, (max-width: 728px) 100vw, 728px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-0df20751683ad4e6a5463535ff77e1fc\">Evaluation Metrics (Prediction vs. Ranking)<\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"742\" height=\"376\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-47.png\" alt=\"\" class=\"wp-image-322\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-47.png 742w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-47-300x152.png 300w\" sizes=\"auto, (max-width: 742px) 100vw, 742px\" \/><\/figure>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-8a6244d0256af1d148be9c6f8727b5fe\"><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<h3 class=\"wp-block-heading alignwide has-base-2-color has-text-color has-link-color wp-elements-5d56367ef63d7d8d40dbefe9a8c2cd5e\">Challenges and Advanced Topics<\/h3>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-d9a68c48f67866226536e647391e9f07\">\u2022<strong>Scalability:<\/strong><\/p>\n\n\n\n<p>\u2022Real-world systems 
have millions of users and items. O(N\u00b2) neighbor calculations are too slow.<\/p>\n\n\n\n<p>\u2022<strong>Solutions:<\/strong> Clustering, Dimensionality Reduction (SVD), Approximate Nearest Neighbor (ANN) search (e.g., Faiss).<\/p>\n\n\n\n<p>\u2022<strong>Data Sparsity:<\/strong><\/p>\n\n\n\n<p>\u2022When the interaction matrix is &lt; 0.1% full, finding overlaps between users is difficult.<\/p>\n\n\n\n<p>\u2022<strong>Solutions:<\/strong> Matrix Factorization, incorporating trust networks (Trust-aware Recommenders).<\/p>\n\n\n\n<p>\u2022<strong>The &#8220;Filter Bubble&#8221;:<\/strong><\/p>\n\n\n\n<p>\u2022The system keeps recommending what the user already agrees with or likes, reducing diversity.<\/p>\n\n\n\n<p>\u2022<strong>Evaluation:<\/strong> Need metrics for <strong>Novelty<\/strong> (recommending unknown items) and <strong>Serendipity<\/strong> (recommending surprisingly interesting items).<\/p>\n\n\n\n<p>\u2022<strong>Shilling Attacks:<\/strong><\/p>\n\n\n\n<p>\u2022Malicious users create fake profiles to rate items highly (push attacks) or poorly (nuke attacks) to manipulate the system.<\/p>\n\n\n\n<p class=\"has-contrast-1-color has-text-color has-link-color wp-elements-9b8db5c8fd5bd065457a22d72d04db57\">\u2022<strong>Robustness:<\/strong> Designing algorithms resistant to outliers and fake profiles.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-base-color has-alpha-channel-opacity has-base-background-color has-background is-style-wide\" style=\"margin-top:var(--wp--preset--spacing--30);margin-bottom:var(--wp--preset--spacing--30)\"\/>\n\n\n\n<div style=\"margin-top:var(--wp--preset--spacing--10);margin-bottom:0;height:var(--wp--preset--spacing--10)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>What is Clustering? 
\u2022Definition: The fundamental task of grouping a set of objects (documents) such that objects in the same group (cluster) are more similar to each other than to those in other groups. \u2022It is the most prevalent form of unsupervised learning in IR. \u2022Goal: To achieve high intra-cluster similarity (documents inside a cluster are highly similar) and high [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":113,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-238","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/238","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=238"}],"version-history":[{"count":73,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/238\/revisions"}],"predecessor-version":[{"id":336,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/238\/revisions\/336"}],"up":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/113"}],"wp:attachment":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=238"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}