
Web Sources


Learn how to manage web sources in Inkeep

Overview

A general web source allows Inkeep to crawl and ingest content from websites. This source type is ideal for documentation sites, marketing pages, blogs, and other web-based content.

The following instructions also apply to Docusaurus, ReadMe, Redocly, GitBook, and Zendesk Help Center sources.

When you configure a general web source, Inkeep's crawler will systematically discover and index content from your website.

Note

Websites with complex JavaScript or dynamic content may not be fully indexed.

Crawler Configuration Fields

All URLs in the crawler fields below must be on the same domain as the URL field.

URL

The URL field specifies the root URL of the website being crawled. This serves as the primary domain and starting point for content discovery.

Example: https://docs.yourcompany.com
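
If you want to sanity-check that the other URLs you plan to enter share this domain, a quick hostname comparison works (a minimal Python sketch; treating "same domain" as "same hostname" is an assumption, and all URLs are placeholders):

  from urllib.parse import urlparse

  ROOT_URL = "https://docs.yourcompany.com"  # the value of the URL field (placeholder)

  def same_domain(url: str, root: str = ROOT_URL) -> bool:
      # Compare hostnames; adapt this check if your setup treats subdomains
      # (e.g. blog.yourcompany.com) as part of the same domain.
      return urlparse(url).hostname == urlparse(root).hostname

  print(same_domain("https://docs.yourcompany.com/getting-started"))  # True
  print(same_domain("https://example.org/some-page"))                 # False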

Ingestion URLs

Ingestion URLs is a list of specific URLs that you want Inkeep to scrape. Use this when you want to target specific pages.

Example:

  • https://docs.yourcompany.com/getting-started
  • https://docs.yourcompany.com/api-reference
  • https://docs.yourcompany.com/tutorials

Crawler Sitemap URLs

Crawler Sitemap URLs allows you to specify XML sitemaps that contain URLs to scrape. Sitemaps provide an efficient way to tell Inkeep exactly which pages to index.

Benefits:

  • Faster discovery of content
  • More comprehensive coverage
  • Better control over what gets indexed

Example:

  • https://docs.yourcompany.com/sitemap.xml
  • https://blog.yourcompany.com/sitemap.xml

Sitemaps can often be found in the robots.txt file, e.g. https://docs.yourcompany.com/robots.txt.
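
Before configuring a sitemap source, you can confirm which sitemaps a site publishes and peek at what they contain (a minimal Python sketch using only the standard library; the URL is a placeholder, and the quick <loc> scan is not a full sitemap parser):

  import re
  from urllib.request import urlopen

  ROBOTS_URL = "https://docs.yourcompany.com/robots.txt"  # placeholder

  with urlopen(ROBOTS_URL) as resp:
      robots_txt = resp.read().decode("utf-8", errors="replace")

  # robots.txt lists sitemaps as lines of the form "Sitemap: <url>"
  sitemaps = [
      line.split(":", 1)[1].strip()
      for line in robots_txt.splitlines()
      if line.lower().startswith("sitemap:")
  ]
  print(sitemaps)

  # Optionally peek at the first sitemap to confirm it lists the pages you expect.
  if sitemaps:
      with urlopen(sitemaps[0]) as resp:
          urls = re.findall(r"<loc>(.*?)</loc>", resp.read().decode("utf-8", errors="replace"))
      print(len(urls), "URLs, e.g.", urls[:3])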

Crawler Start URLs

Crawler Start URLs defines the URLs where the crawler should begin if there is no sitemap available. The crawler will start from these URLs and follow links to discover additional content. Crawls are restricted to subpaths of the input start URLs.

Example:

  • https://docs.yourcompany.com/
  • https://docs.yourcompany.com/guides/
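
As a rough illustration of the subpath restriction (an approximation, not Inkeep's actual crawler logic; the URLs are placeholders), a discovered link is only followed if it sits under one of the start URLs:

  START_URLS = [
      "https://docs.yourcompany.com/",
      "https://docs.yourcompany.com/guides/",
  ]

  def in_scope(url: str) -> bool:
      # A discovered link is crawled only if it is a subpath of a start URL.
      return any(url.startswith(start) for start in START_URLS)

  print(in_scope("https://docs.yourcompany.com/guides/setup"))  # True
  print(in_scope("https://blog.yourcompany.com/launch-post"))   # False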

Filtering

URL Matching Patterns

URL Exclude Patterns allow you to specify strings or patterns to exclude certain URLs from being scraped. This helps you avoid indexing irrelevant or sensitive content.

URL Include Patterns allow you to specify patterns to only include certain URLs in the scrape. This is useful when you want to be very selective about what content gets indexed.

Pattern Types:

  • Exact Match: The URL must exactly match the specified string
  • Regex: Searches for the pattern in the URL path.

Examples:

  • Exact Match: https://docs.yourcompany.com/docs/page-to-include
  • Regex: docs\.yourcompany\.com/docs
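
To see how the two pattern types behave, the sketch below checks the example URL with Python's re module (an illustration only; the matching engine Inkeep uses may differ in detail):

  import re

  url = "https://docs.yourcompany.com/docs/page-to-include"

  # Exact match: the URL must equal the configured string.
  exact_pattern = "https://docs.yourcompany.com/docs/page-to-include"
  print(url == exact_pattern)  # True

  # Regex: the pattern is searched for within the URL.
  regex_pattern = r"docs\.yourcompany\.com/docs"
  print(bool(re.search(regex_pattern, url)))  # True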

Title Matching Patterns

Title Exclude Patterns allow you to exclude pages based on their titles.

Title Include Patterns allow you to only include pages with specific titles in the scrape.

Pattern Types:

  • Exact Match: The page title must exactly match the specified string
  • Regex: Searches for the pattern in the title of the page.

Examples:

  • Exact Match: 404 Not Found
  • Regex: ^[Pp]rivacy [Pp]olicy.*
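
The same idea applies to page titles; for instance, the regex above matches any title that starts with "Privacy Policy" (a minimal sketch with Python's re module; the titles are placeholders):

  import re

  title_pattern = r"^[Pp]rivacy [Pp]olicy.*"

  print(bool(re.search(title_pattern, "Privacy Policy - YourCompany")))  # True
  print(bool(re.search(title_pattern, "Getting Started")))               # False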

Best Practices

  • Start with sitemaps when available for comprehensive coverage. Crawls without sitemaps are not guaranteed to find all content.
  • Use URL patterns to focus on relevant content areas, particularly if you have a large number of pages to crawl.
  • Exclude unnecessary content like user accounts, admin panels, and downloads.
  • Test patterns on a small set of URLs to ensure they are working as expected (see the sketch below).
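
For the last point, one way to test patterns before saving them is to run them against a handful of known URLs (a hypothetical sketch; the patterns and URLs are placeholders, and Inkeep's matching may differ in detail):

  import re

  include_patterns = [r"docs\.yourcompany\.com/docs"]  # placeholder
  exclude_patterns = [r"/admin", r"\.pdf$"]            # placeholder

  sample_urls = [
      "https://docs.yourcompany.com/docs/getting-started",
      "https://docs.yourcompany.com/admin/settings",
      "https://docs.yourcompany.com/docs/manual.pdf",
  ]

  for url in sample_urls:
      included = any(re.search(p, url) for p in include_patterns)
      excluded = any(re.search(p, url) for p in exclude_patterns)
      print(url, "-> indexed" if included and not excluded else "-> skipped")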
