Consider your typical online search engine. The engine crawls millions of websites, grabbing the text from every page and putting it into an index. When someone types in a search term, the engine matches that term against the index and returns a list of pages containing that text.

But what about a search box that lives on your own website? A number of third-party products (e.g., Google Mini, Zoom) allow you to host your own crawler that indexes only your site. These products often include default implementations of the search box and search results pages, making it easy to integrate a search feature into your site.

Similarly, many Content Management Systems (CMSs) include a built-in indexing engine, allowing developers to create a custom search interface without purchasing or implementing a separate third-party product. There is one distinct difference, however: rather than crawling your site’s pages, these search indices catalog the individual content items that make up your site.

This subtle difference between crawling pages and indexing content can introduce a disconnect between the term a user searches for and the list of results that user expects. Left unaddressed, it can produce incomplete or incorrect search results. To understand the issue, we should first review how content is stored in the CMS and how the content indexer indexes it.

Pages and Content Items

Content Management Systems (CMSs) offer powerful tools that let users build pages in an intuitive and contextual manner, especially when combined with a slick, preview-as-you-go “page builder” interface. Rooted in this approach is one of the classic strengths of the modern CMS: “Edit once, update everywhere”. If your site uses a particular promotion or callout on a dozen different pages, you wouldn’t want to create or update that content a dozen times, so it’s best to encapsulate it in a single item and simply reference it on all pages.

Consider a typical page constructed using a CMS page editor. Everything in a CMS is a content item. Think of a content item as any element on a webpage: a block of text, a callout box, etc. Each page has a single content item representing that page. This item might contain a title, a main rich text description, and other metadata associated specifically with the page itself.

Additional elements may then be added to the page using an assortment of widgets. For example, one might add a callout/teaser pointing to another page, or an additional block of rich text appearing below all other page elements. Although implementations differ, the concept is generally the same: you select a widget (e.g., via a drag-and-drop or click-and-select interface), then associate that widget with a content item. If your site has common content used on multiple pages (as most sites do), it’s easier to create a single content item and store it in a common location, which makes that item available for use throughout your site.

So on any particular page, each block of content might come from a content item unique to that page, or it might be retrieved from common content items that live in a shared location.

Fig 1. Typical CMS-driven page

Fig 2. Same page with delineation of content source (Green = content associated with the page, Orange = content from other content items)
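
To make the page/content-item relationship concrete, here is a minimal sketch in Python. It does not reflect any particular CMS; the names `ContentItem`, `Page`, `shared_items`, and `render_text` are hypothetical and chosen only for illustration. Each page owns one content item and references any number of shared items by ID.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    # Hypothetical content item: an ID plus named text fields.
    item_id: str
    fields: dict


@dataclass
class Page:
    # A page has exactly one item of its own, plus references
    # (by ID) to items placed on it through widgets.
    url: str
    page_item: ContentItem
    widget_item_ids: list = field(default_factory=list)


def render_text(page, library):
    """Collect the text a visitor would see on the rendered page."""
    parts = list(page.page_item.fields.values())
    for item_id in page.widget_item_ids:
        parts.extend(library[item_id].fields.values())
    return " ".join(parts)


# Shared location for common content (illustrative data).
shared_items = {
    "promo": ContentItem("promo", {"text": "Visit our office"}),
}

home = Page("/", ContentItem("home", {"title": "Welcome"}), ["promo"])
about = Page("/about", ContentItem("about", {"title": "About us"}), ["promo"])

# "Edit once, update everywhere": one change to the shared item
# is reflected on every page that references it.
shared_items["promo"].fields["text"] = "Visit our New York office"
```

Note that the text "New York" now appears on both rendered pages, yet it is stored in neither page's own content item. That detail matters once indexing enters the picture.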

Indexing

Most CMSs have a built-in content index that can be used to develop search interfaces. Unlike a Google-like page crawler, these search indices* do not crawl the website’s individual pages. Rather, the indexer catalogs each piece of content along with its associated fields. This goes beyond simple indexing of text and rich text fields; the index also includes taxonomy and metadata fields, which in turn enables richer and more complex search interfaces, including categorical/faceted searches and related-content searches.

*Although some CMSs also offer a page-crawler search engine, this article refers to the more common content index.
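
As a rough illustration of field-level indexing, the sketch below builds a tiny inverted index keyed by field name and token. It is an assumption-laden toy, not how any specific CMS indexer works: the sample items, the `category` taxonomy field, and the `build_index` and `search` helpers are all hypothetical.

```python
from collections import defaultdict

# Illustrative content items: item ID -> field name -> value.
items = {
    "home": {"title": "Welcome", "category": "page"},
    "promo-nyc": {"text": "Visit our New York office", "category": "callout"},
    "news-1": {"body": "New York office opens", "category": "news"},
}


def build_index(items):
    """Map (field name, lowercase token) -> set of item IDs."""
    index = defaultdict(set)
    for item_id, fields in items.items():
        for field_name, value in fields.items():
            for token in value.lower().split():
                index[(field_name, token)].add(item_id)
    return index


def search(index, term, field_name=None):
    """Return IDs of items containing the token, optionally in one field."""
    term = term.lower()
    hits = set()
    for (fname, token), item_ids in index.items():
        if token == term and (field_name is None or fname == field_name):
            hits |= item_ids
    return hits


index = build_index(items)
text_hits = search(index, "york")                    # any field
body_hits = search(index, "york", "body")            # one field only
facet_hits = search(index, "callout", "category")    # taxonomy lookup
```

Because taxonomy and metadata fields sit in the same index as the text, a faceted query is just a lookup restricted to one field. Notice also that every result is a content item ID; nothing in the index says which page, if any, displays that item.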

The Issue Surfaces

Unique and shared content items present a challenge for a simple content-based search engine. Depending on the implementation, searching for a text string like “new york” would bring back a list of all content items containing that term in their text fields. Compared with a page crawler, this has two undesirable outcomes:
  1. The indexer will only tell you which content items contain the search term, but will not tell you on what pages that content item is used.
  2. Certain vestigial content items which aren’t actually used on any page may appear in search results, often lacking information on how to display or link to that content.
For example, if your site’s home page contains “New York” in its content, you’d expect a link to the home page to show up in the results when you search for “new york”. But what if that text on the home page was included via a callout, a rich text block, or an image caption that comes from a distinct, standalone piece of content? In this situation, the home page would not show up in search results, because the text isn’t in the home page content item. Instead, it lives in a common piece of content that is not explicitly associated with the home page.

The opposite condition holds true as well. What if there is a standalone piece of content with the text “New York” which can only be included on a page via a specific widget, but doesn’t correspond to a page by itself? The search index would pick up that content item and try to list it in the search results. However, since the item isn’t meant to be referenced on its own, the rendering engine may not know what to do with it. It may be unable to process the resulting search results link, or it may display the item in an unintended, unpolished, or unstructured presentation.
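
Both failure modes can be reproduced with a few lines of Python. This is only a sketch under simplified assumptions: content items are reduced to plain strings, matching is a case-insensitive substring test, and the item IDs, the `pages` mapping, and both search helpers are hypothetical.

```python
# Content items, reduced to their text (illustrative data).
items = {
    "home": "Welcome to our company",            # the home page's own item
    "promo-nyc": "Visit our New York office",    # shared callout on the home page
    "old-banner": "New York sale ends Friday",   # vestigial item, used nowhere
}

# Which items each page displays.
pages = {"/": ["home", "promo-nyc"]}


def search_items(term):
    """Content-index style: match against each item on its own."""
    return {i for i, text in items.items() if term.lower() in text.lower()}


def search_pages(term):
    """Crawler style: match against the full text rendered on each page."""
    hits = set()
    for url, item_ids in pages.items():
        page_text = " ".join(items[i] for i in item_ids)
        if term.lower() in page_text.lower():
            hits.add(url)
    return hits


item_hits = search_items("new york")
page_hits = search_pages("new york")
```

Here `page_hits` contains the home page, as a visitor would expect. `item_hits` contains the shared callout and the unused banner but not the home page item, so a results list built directly from it would omit the home page and offer a link to content that has no page of its own.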

Mitigating the Issue

In my next post, I’ll describe possible ways to mitigate this inherent issue with content indices, as well as describe some of the solutions Primacy has developed.
