Indexing Data

Crawling a website

First, create a Site Search Collection for your main domain. The crawler will start indexing the pages under that domain. You can Browse your Collection to check whether your pages have been correctly indexed.

The speed of the initial crawl depends largely on the size and speed of the site. It can take anywhere from a few seconds to several hours to complete.

Managing sites

Often the content on your website is spread across multiple domains, whether it's a blog subdomain like blog.example.com or a completely different domain like another-example.com. In that case, you can add additional domains to your Collection.

Add a domain to a Collection

  1. Navigate to Sites.
  2. Select Add Site from the top right of the page.
  3. Enter the domain to add.
  4. Ensure the Index checkbox is checked.
  5. Select Add.

The crawler will immediately begin indexing pages from the new domain.

Domain options

Each domain can be configured using three high-level options (a short sketch of how they combine follows the list):

  • Stored in collection: Store pages from this domain in your Collection. If this is turned off, no new records are added to your Collection.

  • Crawling active: Run ongoing indexing for pages on this domain. If enabled, the crawler will periodically visit pages on this domain and update your Collection with any changes. If turned off, the pages on this domain will not be updated in your Collection.

  • Authorized for search: Authorize search requests coming from this domain. Any search interface embedded on this domain will be authorized to make search requests to this Collection. A common case where this is the only option enabled is when you want to allow searches from staging websites or testing environments (e.g. Netlify or CodeSandbox).
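
As a rough sketch of how these options combine (the flag and function names below are invented for illustration and are not the product's API):

    // Illustrative sketch only: the names here are made up to show how the
    // three per-domain options gate different behaviors.
    const domain = {
      storedInCollection: true,  // keep records for pages on this domain
      crawlingActive: true,      // periodically re-visit pages and pick up changes
      authorizedForSearch: true, // accept search requests sent from this domain
    };

    function onPageCrawled(url) {
      if (!domain.crawlingActive) return; // pages on this domain are never visited
      if (domain.storedInCollection) {
        console.log("add or update record for", url);
      }
    }

    function onSearchRequest(originDomain) {
      // e.g. a staging site can be authorized for search without being crawled
      return domain.authorizedForSearch;
    }

    onPageCrawled("https://www.example.com/");
    console.log("search allowed:", onSearchRequest("staging.example.com"));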

Remove a domain from a Collection

  1. Log in to the Console and select the relevant Collection.
  2. Navigate to the Sites section.
  3. For the domain you want to remove, disable all three options mentioned above: Stored in collection, Crawling active, and Authorized for search. Yes, we know this is clunky; we'll add a delete button soon!

How the crawler works

The crawler visits the domains you have added and enabled for crawling. It checks for sitemaps and then indexes the page found at the root of each domain (e.g. www.example.com).
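
As a rough illustration of that discovery order, here is a simplified crawl loop in plain JavaScript. It is a sketch of the general idea only, not the crawler's actual implementation:

    // Sketch: seed the queue with the sitemap (if any) and the root page,
    // then index pages and follow links on the same domain.
    async function crawlDomain(domain) {
      const queue = [`https://${domain}/`]; // the root page is always visited

      // Check for a sitemap first and seed the queue with any URLs it lists.
      try {
        const xml = await (await fetch(`https://${domain}/sitemap.xml`)).text();
        for (const match of xml.matchAll(/<loc>([^<]+)<\/loc>/g)) {
          queue.push(match[1]);
        }
      } catch {
        // No sitemap found; rely on link discovery from the root page.
      }

      const seen = new Set();
      while (queue.length > 0) {
        const url = queue.shift();
        if (seen.has(url) || !url.includes(domain)) continue; // same-domain pages only
        seen.add(url);
        const html = await (await fetch(url)).text();
        console.log("indexing", url); // a real crawler stores or updates the record here
        for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
          queue.push(match[1]);
        }
      }
    }

    crawlDomain("www.example.com");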

You can help the crawler discover content on your website in a number of ways:

  • adding a sitemap to your website
  • setting up Instant Indexing
  • manually pointing it at specific URLs

Note: The crawler will only visit pages from domains that have Crawling active enabled.

Using Sitemaps

A sitemap is a web standard that provides a list of URLs available for crawling. The crawler looks for sitemaps on domains that are being indexed and will visit the URLs in any sitemap it finds. If the crawler does not find your sitemap for some reason, you can point it at the sitemap file manually:

  1. Navigate to Sites > Diagnose.
  2. Enter the URL of the sitemap, e.g. www.example.com/sitemap.xml, and press "Diagnose".
  3. Press "Add to Index".
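
For reference, sitemaps follow the sitemaps.org standard: an XML file with one <url> entry per page. A minimal example looks like this (the URLs and date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2024-01-15</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/about</loc>
      </url>
    </urlset>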

Instant Indexing

The best way to manage crawling on your site is to set up Instant Indexing. Instant Indexing ensures that new and updated pages are available immediately after they are visited, without waiting for a full crawl cycle to complete.

It is enabled by adding a small snippet of JavaScript, also known as ping-back code, to the pages on your site. When a page is visited by an end user, it triggers a lightweight background request to the crawler, which checks whether the page is new or updated and needs to be reindexed.

You can find the snippet tailored to your Collection in the Instant Indexing section in the Console.
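
The exact snippet is generated for your Collection, but a generic ping-back works along the following lines. This is an illustration only: the endpoint and payload below are hypothetical.

    <!-- Illustration only: the real snippet, tailored to your Collection, is
         available in the Console. The endpoint below is hypothetical. -->
    <script>
      // After the page loads, send a lightweight background request so the
      // crawler can check whether this URL is new or has changed.
      window.addEventListener("load", function () {
        fetch("https://ping.example-crawler.com/v1/pingback", {
          method: "POST",
          body: JSON.stringify({ url: window.location.href }),
          keepalive: true, // let the ping finish even if the user navigates away
        });
      });
    </script>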

Popularity

The ping-back code also records popularity metrics for each page, which can then be used in the search algorithm to prioritize popular content.

Check or add URLs manually

The Diagnose tool provides information on the status of URLs on your sites, including:

  • whether the URL has already been crawled
  • whether it redirects to another URL
  • when it was last visited by the crawler
  • any crawling errors

URLs that are not in your Collection can also be added using the Diagnose tool, and existing URLs can be manually reindexed.

  1. Navigate to Sites > Diagnose.
  2. Enter the URL you want to diagnose.
  3. Press "Add to Index" to crawl the URL.
  4. Check the status of the page by re-diagnosing the URL.

Note: The status might be "Pending" if a large number of indexing operations are running. The URL is usually indexed almost instantly, but in some cases it might take a few minutes.
