Indexing Data

Advanced crawler configuration

The crawler handles ongoing indexing of website content: it adds, updates, and deletes pages from your Collection as content is added, updated, and removed from the associated sites. This page outlines the more advanced features of the crawler; see Crawling a website for a quick way to get up and running.

What data is indexed by default?

The crawler extracts metadata from each page and condenses it into a standard set of fields to be added to the search index.

Note: JavaScript-rendered elements are not indexed. Scripts that change content after the DOM loads (e.g. Google Optimize running via Google Tag Manager) are also not taken into account.

Page metadata

The crawler uses page metadata and content to construct a standardized set of fields:

  • URL (url). The full URL of the page
  • Title (title). The meta-title of the page
  • Image (image). URL for the page image
  • Language (lang). Language of the page content (en, fr, de, ...)
  • Description (description). The meta description of the page
  • Keywords (keywords). List of keywords for the page
  • Modified Time (modified_time). The time when the page was last modified
  • Published Time (published_time). The time when the page was first published
  • Headings (headings). List of headings from the body of the page

Fields derived from the URL are also included for common queries (e.g. limiting to a domain or particular sub-URL structure of a site):

  • Domain (domain). The domain of the URL
  • First directory (dir1). The first directory of the URL, or empty if none
  • Secondary directory (dir2). The second directory of the URL, or empty if none
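The URL-derived fields above can be sketched in a few lines. This is an illustrative approximation using Python's standard urllib.parse, not the crawler's actual implementation; the field names match the list above, but the extraction logic is an assumption:

```python
from urllib.parse import urlparse

def url_fields(url):
    """Derive domain, dir1, and dir2 from a URL
    (illustrative sketch, not the crawler's actual code)."""
    parts = urlparse(url)
    # Path segments, with empty strings from leading/trailing slashes removed
    segments = [seg for seg in parts.path.split("/") if seg]
    return {
        "domain": parts.netloc,
        "dir1": segments[0] if len(segments) > 0 else "",
        "dir2": segments[1] if len(segments) > 1 else "",
    }

print(url_fields("https://example.com/blog/2021/post.html"))
# {'domain': 'example.com', 'dir1': 'blog', 'dir2': '2021'}
```

For a page at the domain root, dir1 and dir2 are empty, which is why these fields are useful for queries that limit results to a particular section of a site.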

In addition to the above, the following metadata is also extracted if available:

  • All meta tags within head
  • OpenGraph tags
  • Custom SJ tags
  • Body content (<body>)

Note: The page <body> is indexed but not added as a field. This means that queries can match against the content, but the text cannot be returned in a search result.

When multiple metadata types provide a value for the same field, the crawler prefers OpenGraph values over others:

  • Page title: og:title, or <title>
  • Page description: og:description, or <meta name="description">

Body content

The page <body> is summarised to provide a more concise base for searching. This process discards text inside <head>, <script>, <header>, and <footer> elements.
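A minimal sketch of this kind of filtering, using Python's standard html.parser. The tag list and logic here are an assumption for illustration; the crawler's real summarisation may differ:

```python
from html.parser import HTMLParser

# Tags whose text content is discarded when summarising the page
# (assumed from the description above; the crawler may do more)
SKIP_TAGS = {"head", "script", "header", "footer"}

class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = """<html><head><title>Ignored</title></head>
<body><header>Nav links</header><p>Main article text.</p>
<footer>Copyright</footer></body></html>"""
print(body_text(page))  # → Main article text.
```

Note how the <title>, header navigation, and footer text are all dropped, leaving only the article body as the base for searching.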

To test which content of a webpage is indexed, use our Page debug tool.

Customizing indexed data

Add "Custom Tags" to HTML elements in your pages to override the value of any field, and add new fields.

Custom fields are defined in HTML by adding data attributes to elements. To avoid name clashes with other systems, all data attributes used by the crawler have the prefix data-sj-.

Note: Any new fields encountered by the crawler are created as STRING fields by default. If you need a different type (INTEGER, FLOAT, TIMESTAMP, etc.), first create the field using the Schema tab in the Console.

Defining custom fields in <head> elements

By default the crawler reads <meta> tags within <head>, but only keeps standard fields (title, description, keywords, etc). Add a data-sj-field="fieldname" attribute to override this behaviour and create a custom field from the meta tag's content attribute. This example shows an otherwise ignored <meta> tag being converted into a custom field fieldname="fieldvalue":

<meta
  property="custom meta field"
  data-sj-field="fieldname"
  content="fieldvalue"
/>

Defining custom fields in <body> elements

To capture data already rendered within an element, add data-sj-field="fieldname" to it:

<span data-sj-field="random">This text is the value</span>

This will set custom field random="This text is the value".

If you don't want the data rendered on the page, set the field value with the data-sj-value attribute instead:

<span data-sj-field="fieldname" data-sj-value="fieldvalue">
  This text is not used because the data attribute has a value
</span>

Localization

Problem: I have locally targeted content and want to recommend local content based on the site visitor's location.

Solution: On each locally targeted content page, add two pieces of meta information, e.g.:

<span data-sj-field="lat" data-sj-value="-33.867487"></span>
<span data-sj-field="lng" data-sj-value="181.3615434"></span>

In the above case, the data-sj-field attribute indicates that this information is specific to the page: data-sj-field="lat" declares that this page has a property called "lat" with the corresponding value -33.867487.

Processed metadata vs raw metadata

Processed metadata is the metadata that is stored in the index. Raw metadata is read by the crawler but is not necessarily stored. An example of raw metadata is the links on a webpage: they help the crawler find linked pages, but do not need to be recorded in the search index.

Crawler details

Crawling frequency

All indexed pages are recrawled every 3-6 days. See instant-indexing for detecting changes and updating them immediately.

Canonicals and redirects

Canonicals and redirects are followed.

Indexing non-linked pages

It is common to find pages that are not linked from the header, footer, navigation, or anywhere else on the website. To make sure such pages are also added to the search index, submit them to the crawler explicitly, for example by listing them in your sitemap.

Preventing pages from being indexed

To stop a page from being indexed, add the attribute data-sj-noindex to an HTML element on the page.

<meta name="noindex" content="noindex" data-sj-noindex />

Note: although this prevents our crawler from indexing the page, it does not stop other crawlers. Add the attribute to the standard robots noindex meta tag to prevent all crawlers from indexing the page:

<meta name="robots" content="noindex" data-sj-noindex />

Debugging a page

The 'Page debug' tool allows you to see how data is extracted from your pages by our crawler.

After diagnosing a page, click 'See extended debug information' to open the Page debug tool. The tool crawls your webpage or document and shows all of the metadata, content, OpenGraph data, and schema.org data extracted from the page.

The Page debug tool helps you identify issues with your pages that degrade the quality of your search data, such as missing metadata, missing canonicals, incorrect mark-up, lack of content, and incorrect redirects.

Page debug screenshot

Site Search Health Report

Another tool you can use to check for errors across your whole domain, rather than a specific web page, is the Search Health Report.

The Search Health Report contains helpful information about your content, metadata, URL structure, query parameters, and server configuration. This report is also emailed to you when you add a new domain or create a new collection using the Sajari Console.