{"id":115,"date":"2025-11-23T04:07:39","date_gmt":"2025-11-23T04:07:39","guid":{"rendered":"http:\/\/ijeesoo.com\/?page_id=115"},"modified":"2025-11-24T00:52:58","modified_gmt":"2025-11-24T00:52:58","slug":"web-crawling","status":"publish","type":"page","link":"http:\/\/ijeesoo.com\/?page_id=115","title":{"rendered":"Web Crawling"},"content":{"rendered":"\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-490fa1db wp-block-group-is-layout-constrained\" style=\"padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-container-core-group-is-layout-650790e1 wp-block-group-is-layout-constrained\">\n<h1 class=\"wp-block-heading has-text-align-center has-x-large-font-size\">Definition and Role<\/h1>\n\n\n\n<div style=\"height:1.25rem\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"has-text-align-center\"><strong>Definition:<\/strong> Web crawling (or &#8220;spidering&#8221;) is the process of systematically browsing the World Wide Web, typically for the purpose of web indexing.<\/p>\n\n\n\n<p><strong>The Crawler:<\/strong> A software bot (also called a spider or web robot) that visits web pages to gather information, starting from a list of &#8220;seed&#8221; URLs.<\/p>\n\n\n\n<p><strong>Role in Search Engines:<\/strong><\/p>\n\n\n\n<p><strong>Discovery:<\/strong> Finds new and updated pages to add to the search engine&#8217;s index.<\/p>\n\n\n\n<p><strong>Indexing:<\/strong> Extracts content (text, links, images) from pages to build the searchable index.<\/p>\n\n\n\n<p class=\"has-text-align-center\"><strong>Freshness:<\/strong> Revisits pages to check for changes, keeping the index up to date.<\/p>\n<\/div>\n\n\n\n<figure class=\"wp-block-image alignwide size-large is-style-rounded 
is-style-rounded--1\"><img decoding=\"async\" src=\"https:\/\/www.mdpi.com\/applsci\/applsci-10-03837\/article_deploy\/html\/images\/applsci-10-03837-g001-550.jpg\" alt=\"\"\/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-base-2-background-color has-background has-global-padding is-layout-constrained wp-container-core-group-is-layout-669513ed wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-52b864f0 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Core Crawling Strategies<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-4f2ccadb\"><\/div>\n\n\n\n<p class=\"has-text-align-center\"><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d0bbbce0 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-container-core-column-is-layout-e3d1c41b wp-block-column-is-layout-flow\">\n<h3 class=\"wp-block-heading has-text-align-left is-style-asterisk has-body-font-family has-medium-font-size\" style=\"font-style:normal;font-weight:600\">Depth-First Crawling<\/h3>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"457\" height=\"481\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-6.png\" alt=\"\" class=\"wp-image-137\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-6.png 457w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-6-285x300.png 285w\" sizes=\"auto, 
(max-width: 457px) 100vw, 457px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-container-core-column-is-layout-e3d1c41b wp-block-column-is-layout-flow\">\n<h3 class=\"wp-block-heading has-text-align-left is-style-asterisk has-body-font-family has-medium-font-size\" style=\"font-style:normal;font-weight:600\">Breadth-First Crawling<\/h3>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"453\" height=\"476\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-7.png\" alt=\"\" class=\"wp-image-138\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-7.png 453w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-7-286x300.png 286w\" sizes=\"auto, (max-width: 453px) 100vw, 453px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-container-core-column-is-layout-e3d1c41b wp-block-column-is-layout-flow\">\n<h3 class=\"wp-block-heading has-text-align-left is-style-asterisk has-body-font-family has-medium-font-size\" style=\"font-style:normal;font-weight:600\">Focused Crawling<\/h3>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"454\" height=\"483\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-8.png\" alt=\"\" class=\"wp-image-139\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-8.png 454w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-8-282x300.png 282w\" sizes=\"auto, (max-width: 454px) 100vw, 454px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-left\"><\/p>\n\n\n\n<p 
class=\"has-text-align-left\"><\/p>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--20)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d0bbbce0 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h2 class=\"wp-block-heading\">Crawling Strategies: Comparison<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"415\" src=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-3-1024x415.png\" alt=\"\" class=\"wp-image-122\" srcset=\"http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-3-1024x415.png 1024w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-3-300x122.png 300w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-3-768x311.png 768w, http:\/\/ijeesoo.com\/wp-content\/uploads\/2025\/11\/image-3.png 1532w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Features a 
Crawler Must Provide<\/h2>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Feature 1: <strong>Politeness &amp; Compliance<\/strong><\/h3>\n\n\n\n<p><strong>Goal:<\/strong> To minimize server load, respect site owners&#8217; wishes, and maintain a good relationship with the web community.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>robots.txt:<\/strong> The crawler must first check for and obey the Robots Exclusion Protocol. A Disallow: rule tells the crawler which parts of a site <em>not<\/em> to visit.<\/li>\n\n\n\n<li><strong>Crawl-Delay:<\/strong> Adhering to Crawl-Delay directives in robots.txt or using adaptive throttling to avoid overloading a web server. 
This is critical to avoid being banned.<\/li>\n\n\n\n<li><strong>User-Agent Handling:<\/strong> Clearly identifying the crawler via its User-Agent string (e.g., Googlebot\/2.1) allows site administrators (webmasters) to set specific rules for specific bots.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Feature 2: Robustness and Error Handling<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Fault Tolerance:<\/strong> The system must be designed to withstand component failures (network outages, machine crashes) without losing track of which URLs have been processed or are pending.<\/li>\n\n\n\n<li><strong>Retry Logic:<\/strong> Smart, time-delayed retries (often using <strong>exponential backoff<\/strong>) for transient HTTP errors (like 503 Service Unavailable), so the page can still be fetched once the server is available again.<\/li>\n\n\n\n<li><strong>Handling Malformed HTML:<\/strong> The parser must be robust enough to handle &#8220;tag soup&#8221; (broken tags, missing elements, and non-standard markup) so that it can still extract content and outgoing links from messy real-world web pages.<br><\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Feature 3: <strong>Scalability for the Global Web<\/strong><\/h3>\n\n\n\n<ul 
style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Large-Scale Operation:<\/strong> A modern crawler must be able to fetch billions of pages from millions of different web hosts.<\/li>\n\n\n\n<li><strong>Distributed Systems:<\/strong> This is not feasible on a single machine. Scalability is achieved by distributing the crawl across many machines (nodes).<\/li>\n\n\n\n<li><strong>Key Components:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>URL Frontier:<\/strong> Manages the list of all URLs to be crawled.<\/li>\n\n\n\n<li><strong>Fetchers:<\/strong> Parallel processes that download pages.<\/li>\n\n\n\n<li><strong>Parsers:<\/strong> Parallel processes that extract new links.<br><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<figure class=\"wp-block-image size-large is-style-rounded is-style-rounded--2\"><img decoding=\"async\" src=\"https:\/\/media.geeksforgeeks.org\/wp-content\/uploads\/20240510092325\/High-Level-Design-.webp\" alt=\"\"\/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Feature 4:<strong> URL Normalization and Deduplication<\/strong><\/h3>\n\n\n\n<p><strong>Goal:<\/strong> Avoid redundant fetches of the same content. 
URLs are normalized to a canonical (standard) form <em>before<\/em> being checked for duplication.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>URL Normalization:<\/strong> Converts syntactically different URLs that point to the <em>same<\/em> resource into a single, <strong>canonical form<\/strong>.\n<ul class=\"wp-block-list\">\n<li><em>Examples:<\/em> Converting hostnames to lowercase, removing default ports (:80), and sorting query parameters.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>URL Deduplication:<\/strong> Using fast, space-efficient data structures (like Bloom filters) to check the normalized URL against the set of already crawled pages, preventing redundant fetches and endless loops.<\/li>\n\n\n\n<li><strong>Content Deduplication:<\/strong> Comparing the content served at different URLs to detect and eliminate duplicate copies, keeping the index efficient and free of redundant content.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<table>\n<thead style=\"background-color: lightblue;\">\n<tr>\n<td><b>Normalization Type<\/b><\/td>\n<td><b>Original URL<\/b><\/td>\n<td><b>Normalized URL<\/b><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"background-color: #f2f2f2;\">\n<td>Lowercase Host<\/td><td>http:\/\/ExAmPlE.com\/<\/td>\n<td>http:\/\/example.com\/<\/td><\/tr>\n<tr style=\"background-color: lightblue;\">\n<td>Remove Default Port<\/td>\n<td>http:\/\/example.com:80\/<\/td>\n<td>http:\/\/example.com\/<\/td><\/tr>\n<tr style=\"background-color: #f2f2f2;\">\n<td>Resolve Path Traversal<\/td>\n<td>http:\/\/example.com\/a\/b\/..\/c\/<\/td>\n<td>http:\/\/example.com\/a\/c\/<\/td><\/tr>\n<tr style=\"background-color:lightblue;\">\n<td>Sort Query Parameters<\/td>\n<td>http:\/\/example.com\/?b=2&#038;a=1<\/td><td>http:\/\/example.com\/?a=1&#038;b=2<\/td><\/tr>\n<\/tbody>\n<\/table>\n\n\n\n<ul 
style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Feature 5:<strong> Content Parsing and Link Extraction<\/strong><\/h3>\n\n\n\n<p><strong>1. Content Parsing (Goal: Prepare for Indexing)&nbsp;<\/strong><\/p>\n\n\n\n<p>The goal here is to isolate the <em>meaningful<\/em> content of the page from the markup and surrounding noise so it can be added to the search engine&#8217;s index.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>Extraction<\/strong>: Separating the textual content (the actual articles, descriptions, or information) from the HTML tags, JavaScript code, and CSS styles.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; 
font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>Normalization<\/strong>: Cleaning up the extracted text, resolving character encodings, and handling specific document types (like converting a PDF into searchable text).<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><span style=\"font-family: var(--wp--preset--font-family--body); font-size: var(--wp--preset--font-size--medium);\"><strong>Metadata Harvesting<\/strong>: Collecting important information about the page, such as the title (&lt;title&gt;), meta description, and keywords, which are crucial for ranking and snippet generation.<\/span><\/p><\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<h2 class=\"wp-block-heading\">Content Parsing and Link Extraction<\/h2>\n\n\n\n<p><strong>2. 
Link Extraction (Goal: Feed the Crawler)&nbsp;<\/strong><\/p>\n\n\n\n<p>The goal here is to identify and process every outgoing hyperlink on the current page to ensure the crawler can continue traversing the web.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Discovery:<\/strong> Finding 
all anchor tags (&lt;a href=&quot;&#8230;&quot;&gt;) and other elements that contain links (like image sources, if needed) within the parsed document.<\/li>\n\n\n\n<li><strong>Resolution:<\/strong> Converting <strong>relative URLs<\/strong> (e.g., \/products\/new.html) into <strong>absolute, fully qualified URLs<\/strong> (e.g., http:\/\/example.com\/products\/new.html) so they can be accurately added to the URL frontier.<\/li>\n\n\n\n<li><strong>Filtering:<\/strong> Checking the extracted links against rules (like those found in robots.txt or nofollow attributes) to ensure the crawler only queues URLs it&#8217;s allowed to visit.<\/li>\n<\/ul>\n\n\n\n<p>In essence, parsing and extraction serve as the bridge between downloading the raw bytes and making decisions about <strong>what to index<\/strong> and <strong>where to go next<\/strong>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Features a Crawler Should Provide<\/h2>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed 
wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Focused Crawling<\/h3>\n\n\n\n<p><strong>Goal:<\/strong> To selectively crawl and index pages that are relevant to a specific, pre-defined topic (e.g., &#8220;health,&#8221; &#8220;finance&#8221;), saving resources.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Mechanism:<\/strong> It replaces the simple FIFO (Breadth-First) queue with a <strong>Priority Queue<\/strong>.<\/li>\n\n\n\n<li><strong>Prioritization:<\/strong> A <strong>machine learning classifier<\/strong> scores unvisited URLs based on their predicted relevance to the topic (using features like anchor text, keywords in the link, etc.).<\/li>\n\n\n\n<li><strong>Process:<\/strong> High-scoring links are prioritized in the queue, guiding the crawler to stay on-topic and avoid irrelevant parts of the web.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><strong style=\"font-family: var(--wp--preset--font-family--body); font-size: var(--wp--preset--font-size--medium);\">Use Case:<\/strong> <span style=\"font-family: var(--wp--preset--font-family--body); font-size: var(--wp--preset--font-size--medium);\">Vertical search engines (e.g., a medical search engine), academic research, building specialized knowledge 
bases.<\/span><\/p><\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Freshness &amp; Recrawl Policies<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>The Challenge<\/strong>: The web is dynamic. Content is constantly added, updated, and deleted. 
A &#8220;stale&#8221; index is a low-quality index.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>The Goal<\/strong>: To maintain a &#8220;fresh&#8221; index by revisiting pages to check for changes.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>Problem<\/strong>: It is both infeasible and impolite to recrawl the entire web constantly.<\/li>\n\n\n\n<li><strong>Solution (Recrawl Policy):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Uniform Policy:<\/strong> Revisit all pages at the same rate (inefficient).<\/li>\n\n\n\n<li><strong>Proportional Policy (Smarter):<\/strong> Revisit pages based on their estimated <strong>frequency of change<\/strong>. 
News sites are recrawled frequently (minutes\/hours), while static pages are recrawled rarely (weeks\/months).<\/li>\n\n\n\n<li><strong>Prioritization:<\/strong> High-importance pages (e.g., high PageRank) are also prioritized for recrawling to ensure the most important content is the freshest.<p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading is-style-asterisk\">Spam &amp; Trap Avoidance<\/h3>\n\n\n\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-block-group-is-layout-constrained\">\n<ul class=\"wp-block-list\">\n<li><strong>The Adversarial Web:<\/strong> Not all web content is benign; some is designed to deceive or trap crawlers.<\/li>\n\n\n\n<li><strong>Crawler Traps (Spider Traps):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Infinite Loops:<\/strong> Dynamically generated pages that create an endless &#8220;tree&#8221; of links (e.g., a calendar with a &#8220;next day&#8221; link that goes on forever).<\/li>\n\n\n\n<li><strong>Honeypots:<\/strong> Pages built specifically to trap bots with a massive number of (often invisible) links.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Heuristics for Avoidance:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Setting a maximum crawl depth or URL length.<\/li>\n\n\n\n<li>Detecting path repetition (e.g., 
\/shop\/shop\/shop\/&#8230;).<\/li>\n\n\n\n<li>Using URL normalization to collapse similar, &#8220;spammy&#8221; URLs.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Cloaking:<\/strong> Detecting when a server provides different content to the crawler (based on User<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Metadata Extraction<\/h3>\n\n\n\n<p><strong>Goal<\/strong>: To extract data about a page, which is often as important as the content on the page for indexing and ranking.<br><strong>Key Metadata:<\/strong><\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>Anchor Text:<\/strong> The clickable text of an incoming link (e.g., &lt;a href=&quot;page_B.html&quot;&gt;This is anchor text&lt;\/a&gt;). 
This text, found on Page A, is stored as a powerful, objective descriptor for Page B.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>Page Titles<\/strong>: The content of the &lt;title&gt; tag. A primary signal for the page&#8217;s main topic.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><\/p><strong>HTTP Headers<\/strong>: Caching information (Last-Modified, ETag), content type (Content-Type), and redirect status (301, 302) are all stored to manage the crawl.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Support for Dynamic Content<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Challenge (The Modern Web)<\/strong>: Many pages load as a minimal HTML &#8220;shell&#8221; and then use JavaScript (JS) to fetch and display all the real 
content (via AJAX, etc.).<\/li>\n\n\n\n<li><strong>Problem<\/strong>: A simple &#8220;fetch-and-parse&#8221; crawler only sees the initial, often-empty, HTML file. It misses all the content loaded by JavaScript.<\/li>\n\n\n\n<li><p style=\"margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-style: normal; font-variant-caps: normal; font-width: normal; font-size: 12px; line-height: normal; font-family: Helvetica; font-size-adjust: none; font-kerning: auto; font-variant-alternates: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-east-asian: normal; font-variant-position: normal; font-variant-emoji: normal; font-feature-settings: normal; font-optical-sizing: auto; font-variation-settings: normal; min-height: 14px;\"><strong style=\"font-family: var(--wp--preset--font-family--body); font-size: var(--wp--preset--font-size--medium);\">Solution (JavaScript Rendering):<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>The crawler must act like a full web browser.<\/li>\n\n\n\n<li>It loads the page into a <strong>headless browser<\/strong> (e.g., Chromium).<\/li>\n\n\n\n<li>It executes the JavaScript, waits for network calls to complete, and then parses the <strong>final, rendered HTML DOM<\/strong> (Document Object Model).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Limitation:<\/strong> This is <em>extremely<\/em> resource-intensive (CPU, memory, time) and is a major bottleneck for modern crawlers.<br><\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">The Modern Crawler: Key Challenges<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li>The modern web crawler has evolved from a simple &#8220;fetcher&#8221; into a complex, intelligent, and resource-intensive system.<\/li>\n\n\n\n<li>It must balance multiple competing goals:\n<ul 
class=\"wp-block-list\">\n<li><strong>Efficiency vs. Completeness:<\/strong> Using simple HTML parsing for most pages but knowing when to deploy expensive JavaScript (JS) rendering.<\/li>\n\n\n\n<li><strong>Speed vs. Politeness:<\/strong> Crawling as fast as possible to maintain freshness, but respecting robots.txt and Crawl-Delay to remain a &#8220;good&#8221; bot.<\/li>\n\n\n\n<li><strong>Trust vs. Verification:<\/strong> Ingesting content while actively filtering out spam, traps, and deceptive cloaking techniques.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>These advanced features are what separate a simple script from a petabyte-scale search engine.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Crawler Architecture<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-9d1f8957\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns alignfull is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow 
wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Modular Crawler Architecture<\/h3>\n\n\n\n<p><strong>Goal:<\/strong> To build a large-scale system that is scalable, fault-tolerant, and maintainable. This is achieved through a &#8220;separation of concerns.&#8221;<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Four Core Modules:<\/strong><\/li>\n\n\n\n<li><strong>Scheduler (URL Frontier):<\/strong> The &#8220;brain.&#8221; Decides <em>which<\/em> page to crawl next and <em>when<\/em>.<\/li>\n\n\n\n<li><strong>Fetcher:<\/strong> The &#8220;legs and hands.&#8221; Responsible for downloading the raw content (HTML, PDF, etc.) from a given URL.<\/li>\n\n\n\n<li><strong>Parser:<\/strong> The &#8220;eyes and mind.&#8221; Responsible for reading the downloaded content, extracting the usable text, and discovering all new (outgoing) links.<\/li>\n\n\n\n<li><strong>Storage:<\/strong> The &#8220;memory.&#8221; A set of databases that store the crawled content (document index), link relationships (link database), and metadata.<\/li>\n\n\n\n<li><strong>The Flow:<\/strong> The Scheduler gives a URL to the Fetcher. The Fetcher&#8217;s content goes to the Parser. The Parser&#8217;s extracted links go back to the Scheduler, and the extracted text goes to Storage.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">The URL Frontier (Scheduler)<\/h3>\n\n\n\n<p><strong>Definition:<\/strong> The central data structure that manages all URLs the crawler <em>knows about<\/em> but <em>hasn&#8217;t crawled yet<\/em>. 
It can contain billions of URLs.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Key Responsibilities:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Prioritization:<\/strong> It&#8217;s not a simple queue. It&#8217;s a <strong>Priority Queue<\/strong> that uses <strong>scoring functions<\/strong> to decide what to crawl next.<\/li>\n\n\n\n<li><strong>Politeness:<\/strong> It must manage per-host queues to ensure it doesn&#8217;t hit the same web server too rapidly (obeying Crawl-Delay).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Scoring Inputs (How Priority is Set):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Importance:<\/strong> Does the page have a high PageRank? (Crawl it sooner.)<\/li>\n\n\n\n<li><strong>Freshness:<\/strong> Is this a news site that changes often? (Recrawl it often.)<\/li>\n\n\n\n<li><strong>Topic:<\/strong> Is it relevant to a focused crawl? (Score it higher.)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Content Filters (The Gatekeeper)<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Goal:<\/strong> To save processing (CPU) and storage resources by discarding unwanted content as early as possible.<\/li>\n\n\n\n<li><strong>Filter 1: MIME Type Filtering (Pre-Parse):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>What:<\/strong> Checks the Content-Type header (e.g., text\/html, application\/pdf, image\/jpeg).<\/li>\n\n\n\n<li><strong>Action:<\/strong> If text\/html, send to the HTML 
parser. If application\/pdf, send to the PDF-to-text converter. If video\/mp4 or application\/zip, discard.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Filter 2: Language Detection (Post-Parse):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>What:<\/strong> Analyzes the extracted text to identify its natural language.<\/li>\n\n\n\n<li><strong>Action:<\/strong> If the search engine only serves English-speaking users, pages identified as entirely Russian or Mandarin may be discarded.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Other Filters:<\/strong> Spam detection, duplicate content detection, etc.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Threading &amp; Concurrency (The Engine)<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Problem:<\/strong> Crawling is <strong>I\/O-bound<\/strong>. The single slowest part is waiting for a remote server to respond over the network. The CPU sits idle for the vast majority of that time.<\/li>\n\n\n\n<li><strong>The Solution: Concurrency.<\/strong> Do thousands of things at once.<\/li>\n\n\n\n<li><strong>Approach 1: Multi-threaded Fetchers:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Run hundreds or thousands of &#8220;fetcher&#8221; threads in parallel.<\/li>\n\n\n\n<li>While Thread 1 is <em>waiting<\/em> for Server A, Thread 2 is <em>requesting<\/em> from Server B, and Thread 3 is <em>processing<\/em> data from Server C.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Approach 2: Asynchronous (Non-Blocking) I\/O:<\/strong>\n<ul class=\"wp-block-list\">\n<li>The modern, more efficient model.<\/li>\n\n\n\n<li>A single process can manage thousands of open connections. It issues a request, then immediately moves to another task, handling responses as they arrive (event-driven). 
This maximizes CPU and network efficiency.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Architecture in Motion (The Full Flow)<\/h3>\n\n\n\n<p>The steps below show how all the modules work together:<\/p>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li>The <strong>URL Frontier<\/strong> selects a high-priority URL whose host is not in a politeness delay.<\/li>\n\n\n\n<li>A free <strong>Fetcher Thread<\/strong> (using async I\/O) makes the HTTP request and downloads the raw content.<\/li>\n\n\n\n<li>The <strong>MIME Type Filter<\/strong> checks the content. If it&#8217;s text\/html&#8230;<\/li>\n\n\n\n<li>A <strong>Parser<\/strong> is assigned to read the HTML, extract the clean text, and find all &lt;a&gt; (link) tags.<\/li>\n\n\n\n<li>The <strong>Language Detection Filter<\/strong> checks the clean text. 
If it&#8217;s an accepted language&#8230;<\/li>\n\n\n\n<li>The <strong>Clean Text &amp; Metadata<\/strong> are sent to <strong>Storage<\/strong> (the Index).<\/li>\n\n\n\n<li>The <strong>Extracted Links<\/strong> are sent back to the <strong>URL Frontier<\/strong> to be scored, prioritized, and added to the queue for future crawling.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">DNS Resolution<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-9d1f8957\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">DNS Resolution: The Crawler&#8217;s &#8220;Address Book&#8221;<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Role in Crawling:<\/strong> The crawler&#8217;s main job is to fetch content from URLs. 
A URL contains a <strong>hostname<\/strong> (e.g., www.example.com), but the internet works on <strong>IP addresses<\/strong> (e.g., 93.184.216.34).<\/li>\n\n\n\n<li><strong>The &#8220;First Step&#8221;:<\/strong> Before a crawler can open a network connection to a web server, it <em>must<\/em> use the Domain Name System (DNS) to resolve the hostname into its corresponding IP address.<\/li>\n\n\n\n<li><strong>Process:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Crawler takes URL: http:\/\/www.example.com\/page.html<\/li>\n\n\n\n<li>Extracts hostname: www.example.com<\/li>\n\n\n\n<li>Performs DNS Lookup: www.example.com -> 93.184.216.34<\/li>\n\n\n\n<li>Opens connection to 93.184.216.34 and requests \/page.html.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Conclusion:<\/strong> DNS resolution is a fundamental, non-negotiable prerequisite for fetching content from any new host.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Caching Strategies to Reduce Latency<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Bottleneck:<\/strong> Performing a full DNS lookup for <em>every single URL<\/em> would be disastrously slow. At the scale of billions of pages, this latency would kill crawl performance.<\/li>\n\n\n\n<li><strong>The Solution: Caching.<\/strong> A high-performance crawler maintains its own large, internal DNS cache (often in-memory).<\/li>\n\n\n\n<li><strong>How it Works:<\/strong>\n<ul class=\"wp-block-list\">\n<li>The crawler looks up example.com and gets 93.184.216.34. 
It stores this mapping in its cache.<\/li>\n\n\n\n<li>For the <em>next 10,000 URLs<\/em> from example.com, the crawler looks in its local cache first, getting the IP address almost instantly (in nanoseconds).<\/li>\n\n\n\n<li>This bypasses the need for any external network lookups.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Time-To-Live (TTL):<\/strong> The cache must still respect the TTL (e.g., &#8220;this IP is valid for 1 hour&#8221;) set by the domain. This ensures the crawler adapts if a site&#8217;s IP address changes.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Handling DNS Failures and Timeouts<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Internet is Unreliable:<\/strong> DNS servers can fail, packets get lost, and domains expire. A robust crawler must handle this gracefully.<\/li>\n\n\n\n<li><strong>Common Failures:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Timeout:<\/strong> The DNS server did not respond in time. This is a <em>transient<\/em> error.<\/li>\n\n\n\n<li><strong>NXDOMAIN (Non-Existent Domain):<\/strong> A <em>permanent<\/em> error. The domain name is invalid or has expired.<\/li>\n\n\n\n<li><strong>SERVFAIL (Server Failure):<\/strong> The DNS server is misconfigured or down. 
This is <em>transient<\/em>.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Crawler&#8217;s Retry Logic:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>NXDOMAIN:<\/strong> Mark the domain as &#8220;dead&#8221; and remove all its URLs from the frontier (or de-prioritize them for a long period, e.g., 1 year).<\/li>\n\n\n\n<li><strong>Timeouts \/ SERVFAIL:<\/strong> Do <em>not<\/em> delete the URLs. This is a temporary problem. The crawler should log the error, de-prioritize the host, and schedule a retry for several hours or days later.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Impact on Crawl Throughput<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Throughput = Pages per Second.<\/strong> This is the #1 metric for a crawler&#8217;s efficiency.<\/li>\n\n\n\n<li><strong>Direct Impact:<\/strong> DNS is often the <em>main<\/em> bottleneck before the fetcher threads can even start their work.<\/li>\n\n\n\n<li><strong>High Throughput (Good):<\/strong> A large, fast, in-memory DNS cache makes hostname-to-IP resolution nearly instantaneous. This allows the fetcher threads to run at full capacity, limited only by network speed and politeness delays.<\/li>\n\n\n\n<li><strong>Low Throughput (Bad):<\/strong> A small or non-existent cache forces fetchers to <em>wait<\/em> for DNS lookups. 
Throughput for the entire crawl drops sharply.<\/li>\n\n\n\n<li><strong>Example:<\/strong> At scale, a crawler&#8217;s internal DNS resolver may handle millions of queries per second, while a standard DNS server may handle only thousands.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Impact on Crawl Reliability<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Reliability = Completeness and Uptime.<\/strong> Does the crawler find all the content and never stop running?<\/li>\n\n\n\n<li><strong>DNS Failures:<\/strong> How the crawler handles DNS failures directly impacts its reliability.<\/li>\n\n\n\n<li><strong>Unreliable Crawler (Bad):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Treats a temporary DNS timeout as a permanent failure.<\/li>\n\n\n\n<li>Deletes a valid website from its frontier just because the DNS server was down for 10 minutes.<\/li>\n\n\n\n<li>The resulting index is <em>incomplete<\/em> and missing large chunks of the web.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Reliable Crawler (Good):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Differentiates between permanent (NXDOMAIN) and temporary (timeout, SERVFAIL) errors.<\/li>\n\n\n\n<li>Uses smart retry logic to revisit temporarily failed hosts.<\/li>\n\n\n\n<li>The resulting index is <em>more complete<\/em> and accurately reflects the state of the web, even the parts that are temporarily hard to reach.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" 
style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">The URL Frontier<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-9d1f8957\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">What Is the URL Frontier?<\/h3>\n\n\n\n<p><strong>Definition:<\/strong> The URL Frontier is the central data structure that manages all the URLs a crawler intends to crawl or recrawl. 
It&#8217;s the crawler&#8217;s main &#8220;to-do list.&#8221;<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Core Task:<\/strong> It acts as the scheduler, deciding <em>which<\/em> URL a free fetcher thread should crawl <em>next<\/em>.<\/li>\n\n\n\n<li><strong>Key Challenges:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Scale:<\/strong> It must manage <em>billions<\/em> of URLs.<\/li>\n\n\n\n<li><strong>Efficiency:<\/strong> It must select and dispatch URLs with high throughput.<\/li>\n\n\n\n<li><strong>Prioritization:<\/strong> It must decide which pages are more important to crawl first.<\/li>\n\n\n\n<li><strong>Politeness:<\/strong> It must ensure the crawler doesn&#8217;t hit the same server too quickly.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Frontier Data Structures<\/h3>\n\n\n\n<p><strong>The &#8220;Queue&#8221; is not one, but many.<\/strong> A large-scale frontier is a complex set of queues.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>1. Per-Host Queues (For Politeness):<\/strong>\n<ul class=\"wp-block-list\">\n<li>The frontier is first divided into many separate <strong>FIFO (First-In, First-Out)<\/strong> queues, one for each hostname.<\/li>\n\n\n\n<li>When the crawler picks a host, it pulls the next URL from that host&#8217;s FIFO queue. Because each host has its own queue, politeness delays are easy to enforce (e.g., &#8220;wait 1 second before pulling from this queue again&#8221;).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>2. 
Priority Queue (For Scheduling):<\/strong>\n<ul class=\"wp-block-list\">\n<li>A main <strong>Priority Queue<\/strong> is used to decide <em>which<\/em> per-host queue to visit next.<\/li>\n\n\n\n<li>This queue is not based on time, but on a <strong>score<\/strong> (e.g., PageRank, freshness). It ensures the crawler&#8217;s time is spent on high-value hosts.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>3. Bloom Filters (For Memory):<\/strong>\n<ul class=\"wp-block-list\">\n<li>A probabilistic data structure used to quickly answer: &#8220;Have I <em>ever<\/em> seen this URL before?&#8221;<\/li>\n\n\n\n<li>It prevents duplicate URLs from being added to the frontier, saving memory and preventing re-work. The cost is a small false-positive rate: a new URL is occasionally mistaken for one already seen.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">URL Scoring &amp; Prioritization<\/h3>\n\n\n\n<p><strong>Goal:<\/strong> To crawl a &#8220;better&#8221; version of the web faster by assigning a priority score to each URL. This score determines its position in the priority queue.<\/p>\n\n\n\n<p><strong>Key Scoring Inputs:<\/strong><\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Link Popularity (e.g., PageRank):<\/strong> High-PageRank URLs are considered more important and are given a higher priority. 
This is based on the idea of crawling the &#8220;best&#8221; pages first.<\/li>\n\n\n\n<li><strong>Freshness:<\/strong> Pages that are known to change frequently (like news homepages or stock tickers) are given a higher priority for <em>recrawling<\/em>.<\/li>\n\n\n\n<li><strong>Relevance (for Focused Crawls):<\/strong> In a topical crawler, a machine learning classifier scores URLs based on their predicted relevance to the topic. On-topic links get a high-priority score.<\/li>\n\n\n\n<li><strong>History:<\/strong> Pages that have historically returned errors or spam are given a very low priority.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Duplicate Detection &amp; Canonicalization<\/h3>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>This is Step 0.<\/strong> Before a URL is <em>ever<\/em> added to the frontier, it must be &#8220;cleaned&#8221; and &#8220;checked.&#8221;<\/li>\n\n\n\n<li><strong>Canonicalization (Normalization):<\/strong>\n<ol class=\"wp-block-list\">\n<li>The process of converting a URL into its single, standard (canonical) form.<\/li>\n\n\n\n<li><strong>Examples:<\/strong>\n<ol class=\"wp-block-list\">\n<li>HTTP:\/\/Example.com\/ -> http:\/\/example.com\/ (lowercase scheme and host)<\/li>\n\n\n\n<li>http:\/\/example.com:80\/ -> http:\/\/example.com\/ (remove default port)<\/li>\n\n\n\n<li>http:\/\/example.com\/a\/..\/b\/ -> http:\/\/example.com\/b\/ (resolve path)<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li><strong>Duplicate Detection:<\/strong>\n<ul class=\"wp-block-list\">\n<li>After canonicalization, the crawler checks if this standardized URL already exists (in the frontier or in the &#8220;already crawled&#8221; database).<\/li>\n\n\n\n<li>This is where <strong>Bloom Filters<\/strong> or <strong>Hash 
Sets<\/strong> are used for a very fast, memory-efficient check to prevent redundant work.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Frontier Expansion Strategies<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The frontier&#8217;s logic <em>defines<\/em> the crawl strategy.<\/strong> How you store, prioritize, and retrieve URLs dictates the &#8220;shape&#8221; of your crawl.<\/li>\n\n\n\n<li><strong>Breadth-First (BFS):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>How:<\/strong> Using a simple, single <strong>FIFO Queue<\/strong> (First-In, First-Out).<\/li>\n\n\n\n<li><strong>Result:<\/strong> The crawler explores the web level by level. It finds all pages 1 click from the seed, then all pages 2 clicks away, etc. This is the most common and &#8220;safest&#8221; strategy.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Depth-First (DFS):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>How:<\/strong> Using a <strong>LIFO Stack<\/strong> (Last-In, First-Out).<\/li>\n\n\n\n<li><strong>Result:<\/strong> The crawler follows a single link path as deep as it can go. 
This is rarely used because the crawler gets &#8220;lost&#8221; in deep sites and crawler traps.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Focused:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>How:<\/strong> Using a <strong>Priority Queue<\/strong> where the score is <em>topic relevance<\/em>.<\/li>\n\n\n\n<li><strong>Result:<\/strong> The crawler prioritizes links that <em>seem<\/em> relevant, ignoring paths that appear to be off-topic.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Distributing Indexes<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-9d1f8957\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Why Distribute an Index?<\/h3>\n\n\n\n<p><strong>The Problem:<\/strong> A single machine cannot handle the scale of a modern search index.<\/p>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list 
is-style-checkmark-list\">\n<li><strong>Storage Limits (The &#8220;Size&#8221; Problem):<\/strong>\n<ul class=\"wp-block-list\">\n<li>The inverted index for billions (or trillions) of documents is <em>petabytes<\/em> in size.<\/li>\n\n\n\n<li>This vastly exceeds the disk space and RAM of any single server.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Query Throughput (The &#8220;Speed&#8221; Problem):<\/strong>\n<ul class=\"wp-block-list\">\n<li>A large search engine must handle tens of thousands of queries per second (QPS).<\/li>\n\n\n\n<li>A single machine cannot serve this many requests, especially when each query requires posting-list lookups and scoring.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>The Solution:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Horizontal Scaling:<\/strong> Instead of one &#8220;supercomputer,&#8221; use a cluster of thousands of cheaper &#8220;commodity&#8221; machines working in parallel.<\/li>\n\n\n\n<li>This requires <strong>partitioning<\/strong> (splitting) the index and query work across the cluster.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Partitioning Strategies<\/h3>\n\n\n\n<p><strong>Goal:<\/strong> To split the giant inverted index into smaller, manageable pieces.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Strategy 1: Document-Based Partitioning (or &#8220;Sharding&#8221;)<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>How<\/strong>: The collection of documents is split into N partitions (shards). Each shard holds a complete, independent index for its subset of documents.<\/li>\n\n\n\n<li><strong>Query<\/strong>: The query is sent to all N shards in parallel. 
A broker node then gathers and merges the results from all shards.<\/li>\n\n\n\n<li><strong>Pro<\/strong>: Simple, easy to add new documents. This is the model used by Elasticsearch and Solr.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Strategy 2: Term-Based Partitioning<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>How<\/strong>: The dictionary of terms is split. Shard A holds terms a-f, Shard B holds g-m, etc. Each shard holds the full posting lists for its terms.<\/li>\n\n\n\n<li><strong>Query<\/strong>: A query (e.g., &#8220;web search&#8221;) is split by term. Each term is routed to the shard that owns it, so &#8220;web&#8221; and &#8220;search&#8221; go to different shards.<\/li>\n\n\n\n<li><strong>Pro<\/strong>: Very efficient for querying, as you only contact the shards that hold your terms.<\/li>\n\n\n\n<li><strong>Con<\/strong>: More complex to manage and update.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Sharding and Replication<\/h3>\n\n\n\n<p>These are two distinct but related concepts for horizontal scaling.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Sharding (Partitioning):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>What:<\/strong> The process of splitting your data (the index) across multiple machines. This is what <em>solves the storage (size) problem<\/em>.<\/li>\n\n\n\n<li><strong>Benefit:<\/strong> Allows the index to grow well beyond the capacity of one machine by simply adding more machines (shards). 
It also parallelizes query work.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Replication:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>What:<\/strong> The process of making <em>copies<\/em> (replicas) of your shards on <em>different<\/em> machines.<\/li>\n\n\n\n<li><strong>Benefit 1: Fault Tolerance:<\/strong> If a machine holding a primary shard fails, its replica on another machine can be promoted to take over, so the data and query capability survive the failure.<\/li>\n\n\n\n<li><strong>Benefit 2: High Throughput:<\/strong> Query load can be balanced across all replicas, allowing the cluster to handle a much higher volume of queries per second (QPS).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">MapReduce for Index Construction<\/h3>\n\n\n\n<p><strong>Problem<\/strong>: How do you build a petabyte-scale inverted index in parallel?<\/p>\n\n\n\n<p><strong>MapReduce<\/strong> is a programming model well suited to this task.<\/p>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>MAP Phase (Parallel Tokenization):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Input: (DocID, Document Content)<\/li>\n\n\n\n<li>Many &#8220;Map&#8221; workers run in parallel, each on a small chunk of documents.<\/li>\n\n\n\n<li>Each worker tokenizes its documents and emits (key, value) pairs of the form (term, DocID).<\/li>\n\n\n\n<li>Example: (\u2018search\u2019, D1), (\u2018engine\u2019, D1), (\u2018search\u2019, D2)<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>SHUFFLE Phase:<\/strong> The system automatically sorts and groups all pairs by key (the term).<\/li>\n\n\n\n<li><strong>REDUCE Phase (Posting List Generation):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Input: (term, [list_of_DocIDs])<\/li>\n\n\n\n<li>Each &#8220;Reduce&#8221; worker takes one term (e.g., 
\u2018search\u2019) and its entire list of document IDs ([D1, D2, D50, &#8230;]).<\/li>\n\n\n\n<li>It then sorts, compresses, and writes the final posting list for that term to disk.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Index Consistency &amp; Synchronization<\/h3>\n\n\n\n<p><strong>The Challenge:<\/strong> The web is not static. How do you add new documents to a massive, read-only distributed index <em>without<\/em> rebuilding it every time?<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The &#8220;Main + Auxiliary&#8221; Model:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Main Index:<\/strong> The massive, static index built by MapReduce. It is read-only and very fast for queries.<\/li>\n\n\n\n<li><strong>Auxiliary Index (or &#8220;Delta&#8221;):<\/strong> A second, smaller index (often in-memory) that is dynamic and writable. 
All <em>new<\/em> documents are added here first.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Synchronization:<\/strong>\n<ol class=\"wp-block-list\">\n<li>A query must now search <em>both<\/em> the Main Index and the Auxiliary Index, then merge the results.<\/li>\n\n\n\n<li>Periodically (e.g., every few minutes), the small in-memory Auxiliary Index is &#8220;flushed&#8221; to a new, permanent, read-only file on disk (called a &#8220;segment&#8221;).<\/li>\n\n\n\n<li>In the background, a &#8220;compaction&#8221; process merges these smaller segments into the Main Index.<\/li>\n<\/ol>\n<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">The Trade-offs<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li>Distributing an index is not free; it&#8217;s a constant balance of competing goals.<\/li>\n\n\n\n<li><strong>Latency vs. Update Frequency:<\/strong>\n<ul class=\"wp-block-list\">\n<li>A static, read-only index has the <em>lowest query latency<\/em> (fastest response).<\/li>\n\n\n\n<li>Allowing for high-frequency updates (e.g., &#8220;near-real-time&#8221;) requires the &#8220;Main + Auxiliary&#8221; model, which adds a small amount of latency to <em>every query<\/em> (the cost of merging).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Throughput vs. Consistency:<\/strong>\n<ul class=\"wp-block-list\">\n<li>High query throughput is achieved with <em>many replicas<\/em>.<\/li>\n\n\n\n<li>However, the more replicas you have, the harder and slower it is to synchronize updates across all of them, leading to weaker <em>consistency<\/em> (a query to Replica A might show different results than Replica B for a few seconds).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Cost vs. 
Performance:<\/strong> More shards and more replicas mean better performance, fault tolerance, and storage capacity, but at a linear (or greater) increase in hardware and operational cost.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Real-World Systems<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Google (Caffeine, GFS, Bigtable):<\/strong>\n<ul class=\"wp-block-list\">\n<li>The &#8220;web-scale&#8221; model. Google continuously crawls and updates its index in small, &#8220;percolating&#8221; batches rather than one large one.<\/li>\n\n\n\n<li>The index is global and massively distributed, using a hybrid of term and document partitioning.<\/li>\n\n\n\n<li>A single query may touch thousands of machines to get an answer in milliseconds.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Elasticsearch &amp; Apache Solr:<\/strong>\n<ul class=\"wp-block-list\">\n<li>The &#8220;enterprise-scale&#8221; model, both built on the <strong>Apache Lucene<\/strong> library.<\/li>\n\n\n\n<li>They primarily use <strong>document-based partitioning<\/strong> (&#8220;sharding&#8221;).<\/li>\n\n\n\n<li>They manage sharding and replication automatically.<\/li>\n\n\n\n<li>They excel at <strong>&#8220;Near-Real-Time&#8221; (NRT) search<\/strong>, using the &#8220;Main + Auxiliary&#8221; (segment-based) model to make new documents searchable within seconds, not hours.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<p>  <\/p>\n\n\n\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical 
is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Connectivity Servers<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-9d1f8957\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Definition &amp; Role of a Connectivity Server<\/h3>\n\n\n\n<p><strong>Definition<\/strong>: A Connectivity Server is a specialized, large-scale database designed to store and serve the <strong>Web Graph<\/strong>.<\/p>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>What is the Web Graph?<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Nodes<\/strong>: Web pages (Documents).<\/li>\n\n\n\n<li><strong>Directed Edges<\/strong>: Hyperlinks (&lt;a href=&quot;&#8230;&quot;>) from one page to another.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Primary Role:<\/strong>\n<ol class=\"wp-block-list\">\n<li>Store: To maintain a massive, up-to-date copy of this link structure (trillions of edges).<\/li>\n\n\n\n<li>Serve: To provide other parts of the search engine (like the ranker and crawler) with fast answers to &#8220;who links to whom?&#8221;<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>It is the search engine&#8217;s central &#8220;database of links.&#8221;<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">The &#8220;Why&#8221;: Supporting Link Analysis<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" 
class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Core Idea:<\/strong> The content of a page is not the only signal of its quality. The <em>links pointing to it<\/em> are a powerful, objective &#8220;vote&#8221; or &#8220;recommendation&#8221; from other authors.<\/li>\n\n\n\n<li><strong>The Problem:<\/strong> Raw content matching (keyword frequency) is easily spammed and doesn&#8217;t understand <em>authority<\/em>.<\/li>\n\n\n\n<li><strong>The Solution: Link Analysis.<\/strong> By analyzing the <em>entire<\/em> graph structure, a search engine can determine the relative <strong>authority, importance, and trustworthiness<\/strong> of every page.<\/li>\n\n\n\n<li>The Connectivity Server is the system that provides the raw data (the graph) to run these analyses.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Key Algorithms: PageRank, HITS, TrustRank<\/h3>\n\n\n\n<p>The Connectivity Server provides the data to compute:<\/p>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>PageRank (Google):<\/strong>\n<ul class=\"wp-block-list\">\n<li>A measure of <strong>global, static authority<\/strong>.<\/li>\n\n\n\n<li>It&#8217;s a &#8220;random surfer&#8221; model: a page is important if other <em>important<\/em> pages link to it. 
The score is passed recursively through the graph.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>HITS (Hyperlink-Induced Topic Search):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Identifies two types of pages for a specific query:<\/li>\n\n\n\n<li><strong>Authorities:<\/strong> Pages with the best information (many in-links from good hubs).<\/li>\n\n\n\n<li><strong>Hubs:<\/strong> Pages that are good <em>lists<\/em> of authorities (many out-links to good authorities).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>TrustRank:<\/strong>\n<ul class=\"wp-block-list\">\n<li>A spam-fighting algorithm. It starts a PageRank-like &#8220;walk&#8221; from a seed set of <em>known, trusted<\/em> sites (e.g., universities, government sites) to measure &#8220;trust&#8221; instead of &#8220;popularity.&#8221;<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Anchor Text Indexing<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Problem<\/strong>: Page A links to Page B. The link looks like this:<\/li>\n\n\n\n<li>&lt;a href=&quot;http:\/\/pageB.com&quot;>Click here for the best widgets&lt;\/a><\/li>\n\n\n\n<li><strong>The Solution (Anchor Text): <\/strong>The clickable text (&#8220;Click here for the best widgets&#8221;) is a description of Page B written by the author of Page A.<\/li>\n\n\n\n<li><strong>Role of Connectivity Server:<\/strong>\n<ul class=\"wp-block-list\">\n<li>It doesn&#8217;t just store (PageA -> PageB).<\/li>\n\n\n\n<li>It stores the link context: (Source: PageA, Target: PageB, AnchorText: &#8220;Click here for the best widgets&#8221;)<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Impact: <\/strong>This allows Page B to rank for the term &#8220;best widgets&#8221; even if that phrase never appears in Page B&#8217;s own content. 
It&#8217;s a powerful, objective, third-party signal.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Efficient Storage of the Link Graph<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Challenge<\/strong>: Storing a graph with billions of nodes and trillions of edges is a massive storage problem.<\/li>\n\n\n\n<li><strong>Solution: Adjacency Lists<\/strong>. The graph is stored as a set of lists, not a giant matrix.<\/li>\n\n\n\n<li><strong>Two Key Lists are Needed:<\/strong>\n<ol class=\"wp-block-list\">\n<li><strong>Forward Adjacency List (Out-links):<\/strong>\n<ul class=\"wp-block-list\">\n<li>DocA -> [DocB, DocC, DocF]<\/li>\n\n\n\n<li>Use: For HITS (finding Hubs) and crawling (where to go next).<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Reverse Adjacency List (In-links):<\/strong>\n<ul class=\"wp-block-list\">\n<li>DocB -> [DocA, DocE, DocG]<\/li>\n\n\n\n<li>Use: Essential for calculating PageRank and TrustRank (who &#8220;votes&#8221; for this page?).<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li><strong>Compression<\/strong>: These lists are heavily compressed (e.g., delta-encoding sorted DocIDs) to fit as much of the graph as possible in RAM for high-speed lookups.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Application 1: Ranking &amp; Spam Detection<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li>The Connectivity Server is a primary input to the final ranking 
formula.<\/li>\n\n\n\n<li><strong>Ranking Signals:<\/strong>\n<ol class=\"wp-block-list\">\n<li><strong>Static Score<\/strong>: The pre-calculated PageRank or TrustRank of a document is used as a strong signal of its general, baseline quality.<\/li>\n\n\n\n<li><strong>Anchor Text<\/strong>: The query processor checks the anchor text index. If a user searches for &#8220;best widgets,&#8221; any page that has &#8220;best widgets&#8221; in its incoming anchor text gets a significant ranking boost.<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li><strong>Spam Detection:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Pages with extremely low TrustRank (far from the trusted seed set) can be heavily demoted or removed from results.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Application 2: Query Expansion &amp; Disambiguation<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li>The link graph helps the search engine <em>understand<\/em> the <em>context<\/em> of a query.<\/li>\n\n\n\n<li><strong>Query Disambiguation:<\/strong>\n<ul class=\"wp-block-list\">\n<li>User searches for &#8220;Jaguar.&#8221;<\/li>\n\n\n\n<li>The server sees two distinct clusters of top pages:\n<ol class=\"wp-block-list\">\n<li>One cluster is linked to by &#8220;car,&#8221; &#8220;Ford,&#8221; and &#8220;dealership&#8221; sites (it&#8217;s the car).<\/li>\n\n\n\n<li>Another cluster is linked to by &#8220;zoo,&#8221; &#8220;big cat,&#8221; and &#8220;jungle&#8221; sites (it&#8217;s the animal).<\/li>\n<\/ol>\n<\/li>\n\n\n\n<li>By analyzing the <strong>co-citation<\/strong> (what pages are linked to <em>together<\/em>), the engine can disambiguate the query 
and show results for both, or pick the most popular context.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Query Expansion:<\/strong> This same context helps find related terms (e.g., &#8220;Jaguar&#8221; is related to &#8220;Ford,&#8221; &#8220;Jaguar&#8221; is also related to &#8220;Leopard&#8221;).<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Integration with the Full Pipeline<\/h3>\n\n\n\n<p>The Connectivity Server is the &#8220;glue&#8221; between the other components.<\/p>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Crawling (Input):<\/strong>\n<ul class=\"wp-block-list\">\n<li>The <strong>Crawler<\/strong> is the <em>producer<\/em> of data.<\/li>\n\n\n\n<li>As it fetches and parses pages, it extracts all links and anchor text and sends them as a stream of updates to the Connectivity Server.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Indexing (Offline Process):<\/strong>\n<ul class=\"wp-block-list\">\n<li>The <strong>Indexer<\/strong> (often using MapReduce) takes a <em>snapshot<\/em> of the entire link graph from the server to run the massive, offline calculations for PageRank and TrustRank.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Querying (Real-time Consumer):<\/strong>\n<ul class=\"wp-block-list\">\n<li>The <strong>Query Processor<\/strong> is the <em>consumer<\/em> of data.<\/li>\n\n\n\n<li>It requests specific data (e.g., &#8220;Give me all anchor text for DocID123&#8221;) in real-time to score and rank the final results list.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-d89aad35 wp-block-group-is-layout-constrained\" 
style=\"margin-top:0;margin-bottom:0;padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-container-core-group-is-layout-19e250f3 wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-6329a8f3 wp-block-group-is-layout-flex\">\n<h2 class=\"wp-block-heading has-text-align-center is-style-asterisk\">Advanced Topics<\/h2>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer wp-container-content-9d1f8957\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Semantic Indexing &amp; Latent Semantic Analysis (LSA)<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>The Problem (Keyword Matching Fails):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Synonymy:<\/strong> User searches for &#8220;laptop&#8221; but the best page only says &#8220;notebook.&#8221; A simple keyword index will fail to match them.<\/li>\n\n\n\n<li><strong>Polysemy:<\/strong> User searches for &#8220;jaguar.&#8221; Does it mean the car, the animal, or the operating system? 
A simple index can&#8217;t tell the context.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>The Solution: Semantic Indexing (Index the <\/strong><strong><em>Meaning<\/em><\/strong><strong>).<\/strong><\/li>\n\n\n\n<li><strong>Latent Semantic Analysis (LSA):<\/strong> A mathematical technique (based on a singular value decomposition of the term-document matrix) that analyzes the entire document collection to find which terms <em>co-occur<\/em> (appear together) frequently.\n<ul class=\"wp-block-list\">\n<li>It learns that &#8220;laptop,&#8221; &#8220;notebook,&#8221; &#8220;CPU,&#8221; and &#8220;RAM&#8221; all belong to a hidden (&#8220;latent&#8221;) <em>concept<\/em> of &#8220;computing.&#8221;<\/li>\n\n\n\n<li>It creates a mathematical &#8220;concept space&#8221; where &#8220;laptop&#8221; and &#8220;notebook&#8221; are very close together.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Result:<\/strong> A query for &#8220;laptop&#8221; can now retrieve documents about &#8220;notebooks,&#8221; leading to far more relevant results. This idea is a precursor to modern vector and neural search.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Privacy-Preserving &amp; Federated Search<\/h3>\n\n\n\n<ol style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Privacy-Preserving Indexing (The &#8220;Encrypted Web&#8221;):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>The Challenge:<\/strong> How do you search on highly sensitive data (e.g., medical records, financial data) <em>without<\/em> the search engine (or its admin) being able to read the content?<\/li>\n\n\n\n<li><strong>The Solution:<\/strong> Use <strong>Searchable Encryption<\/strong>.<\/li>\n\n\n\n<li><strong>How:<\/strong> The index itself is built from <em>encrypted data<\/em>. When a user searches, their query is also encrypted. 
Cryptographic schemes (such as searchable symmetric encryption or, at much higher cost, homomorphic encryption) allow the server to find matching documents <em>while they remain encrypted<\/em>. The server returns the right results while learning little or nothing about the content.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Federated Search (The &#8220;Siloed Web&#8221;):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>The Challenge:<\/strong> Data is often in separate, &#8220;siloed&#8221; systems (a hospital DB, a university library, a company&#8217;s internal wiki) that cannot be centrally crawled.<\/li>\n\n\n\n<li><strong>The Solution:<\/strong> A &#8220;broker&#8221; system. The user sends <em>one<\/em> query. The broker forwards this query to <em>all<\/em> the independent systems, waits for their individual results, and then merges them into a single, unified list for the user. The index remains physically distributed.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:var(--wp--preset--spacing--40)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-d1c656ed wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Indexing Dynamic &amp; Deep Web Content<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>1. The Dynamic Web (JavaScript-Rendered Content):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>The Problem:<\/strong> Many modern websites load as an empty &#8220;shell.&#8221; All the content (articles, products) is then loaded dynamically using <strong>JavaScript (JS)<\/strong>. A simple crawler only sees the empty shell.<\/li>\n\n\n\n<li><strong>The Solution: Headless-Browser Rendering.<\/strong> The crawler must act as a full web browser (using a &#8220;headless&#8221; browser like Chromium). 
It loads the page, executes all the JavaScript, waits for the dynamic content to appear, and then indexes the <em>final rendered HTML<\/em>. This is very slow and computationally expensive.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>2. The Deep Web (Content Behind Forms):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>The Problem:<\/strong> This is content that is <em>not<\/em> accessible via hyperlinks (e.g., flight search results, library catalogs, your bank account).<\/li>\n\n\n\n<li><strong>The Solution (Difficult):<\/strong> There is no perfect solution.\n<ul class=\"wp-block-list\">\n<li><strong>Form Submission:<\/strong> Programmatically &#8220;guess&#8221; and submit common queries to forms (e.g., try all airport codes).<\/li>\n\n\n\n<li><strong>Data Feeds\/APIs:<\/strong> Partner with the data provider to get access to their database directly, bypassing the web interface.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:40%\">\n<h3 class=\"wp-block-heading is-style-asterisk\">Real-World Systems: Caffeine, Elasticsearch, Solr<\/h3>\n\n\n\n<ul style=\"line-height:1.75\" class=\"wp-block-list is-style-checkmark-list\">\n<li><strong>Google Caffeine (The &#8220;Freshness&#8221; System):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Advanced Problem:<\/strong> The old web was indexed in massive, slow, <em>batch<\/em> processes (e.g., update the index once a month). This is too slow for breaking news.<\/li>\n\n\n\n<li><strong>Caffeine&#8217;s Solution:<\/strong> A continuous, <em>incremental<\/em> indexing system. It &#8220;percolates&#8221; updates, analyzing the web in small, continuous chunks.<\/li>\n\n\n\n<li><strong>Result:<\/strong> A new blog post or news story can be crawled, indexed, and made searchable in <em>seconds or minutes<\/em>, not weeks. 
This is the &#8220;real-time&#8221; web.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Elasticsearch &amp; Apache Solr (The &#8220;Enterprise&#8221; Systems):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Advanced Problem:<\/strong> How can a normal company get Google-like search?<\/li>\n\n\n\n<li><strong>Their Solution:<\/strong> They are open-source <em>platforms<\/em> (built on Apache Lucene) that make distributed indexing &#8220;easy.&#8221;<\/li>\n\n\n\n<li><strong>Key Features:<\/strong> They automatically handle sharding, replication, and fault tolerance. They excel at <strong>&#8220;Near Real-Time&#8221; (NRT) Search<\/strong>, using a segment-based model to make new documents (like application logs or product updates) searchable within about a second.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Definition and Role Definition:Web crawling (or &#8220;spidering&#8221;) is the process of systematically browsing the World Wide Web, typically for the purpose of web indexing. The Crawler:A software bot (also called a spider or web robot) that visits web pages to gather information, starting from a list of &#8220;seed&#8221; URLs. 
Role in Search Engines: Discovery:Finds new [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":113,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-115","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/115","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=115"}],"version-history":[{"count":65,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/115\/revisions"}],"predecessor-version":[{"id":164,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/115\/revisions\/164"}],"up":[{"embeddable":true,"href":"http:\/\/ijeesoo.com\/index.php?rest_route=\/wp\/v2\/pages\/113"}],"wp:attachment":[{"href":"http:\/\/ijeesoo.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=115"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}