Consider your typical online search engine. The engine crawls millions of web sites, grabbing the text from every page and putting it into an index. When someone types in a search term, it matches that term against the index and returns a list of pages containing that text.

But what about a search box that lives on your web site? There are a number of 3rd party products (e.g., Google mini, Zoom) which allow you to host your own crawler to index your site only. These products often contain default implementations of the search box and search results pages, making it easy to integrate a search feature for your site. Similarly, many Content Management Systems (CMS) include a built-in indexing engine, allowing developers to create a custom search interface without the need to purchase or implement a separate 3rd party product.

There is one distinct difference, however: rather than crawling your site’s pages, these search indices catalog the individual content items which comprise your site. This subtle difference of crawling pages vs. indexing content can introduce a disconnect between the term a user searches for and an expected list of search results. Left unaddressed, this has the potential to cause incomplete or incorrect search results. In order to understand the issue, first we should review how content is stored in the CMS and how it is indexed by the content indexer.
Pages and Content Items

Content Management Systems (CMS) offer powerful tools allowing users to build pages in an intuitive and contextual manner, especially combined with a slick, preview-as-you-go, “page builder” interface. Rooted in this approach is one of the classic strengths of the modern CMS: “Edit once, update everywhere”. If your site uses a particular promotion or callout on a dozen different pages, you wouldn’t want to create or update that content a dozen times, so it’s best to encapsulate it in a single item and simply reference it on all pages.

Consider a typical page constructed using a CMS page editor. Everything in a CMS is a content item. Think of a content item as any element on a webpage – a block of text, a callout box, etc. Each page has a single content item representing that page. This item might contain a title, main rich text description, and include other metadata associated specifically with the page itself. Additional elements may then be added to the page using an assortment of widgets. For example, one might add a callout/teaser pointing to another page or an additional block of rich text appearing below all other page elements.

Although implementations differ, generally the concept is the same: you select a widget (e.g., via a drag-and-drop or click-and-select interface), then associate that widget with a new content item. If your site has common content used on multiple pages (as most sites do), it’s easier to create a single content item and store it in a common location, which then facilitates using that item throughout your site. So on any particular page, each block of content might come from a content item unique to that page OR it may be retrieved from common content items that live in a shared location.
Fig 1. Typical CMS-driven page
Fig 2. Same page with delineation of content source (Green = content associated with the page, Orange = content from other content items)
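The relationship between pages, page-specific items, and shared items can be sketched in a few lines of Python. This is a minimal illustration only, not any particular CMS's data model: the `ContentItem` and `Page` classes, the item IDs, and the sample text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContentItem:
    item_id: str
    text: str


@dataclass
class Page:
    url: str
    page_item_id: str  # the single item representing the page itself
    widget_item_ids: List[str] = field(default_factory=list)  # items placed via widgets


# Hypothetical content repository; "promo" lives in a shared location.
items: Dict[str, ContentItem] = {
    "home": ContentItem("home", "Welcome to our site"),
    "about": ContentItem("about", "About our company"),
    "promo": ContentItem("promo", "Spring sale: 20% off"),
}

# Both pages reference the same shared callout instead of copying it.
pages = [
    Page("/", "home", ["promo"]),
    Page("/about", "about", ["promo"]),
]


def render_text(page: Page) -> str:
    """Assemble the text a visitor would see on the page."""
    ids = [page.page_item_id] + page.widget_item_ids
    return "\n".join(items[i].text for i in ids)


# "Edit once, update everywhere": one edit to the shared item reaches both pages.
items["promo"].text = "Summer sale: 30% off"
for page in pages:
    print(page.url, "->", render_text(page).replace("\n", " | "))
```

Note that the page's visible text only exists once the page is assembled; nothing in the repository stores it as a single unit.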
Indexing

Most CMSs have a built-in content index which can be used for developing search interfaces. Unlike a Google-like page crawler, these search indices* do not crawl the web site’s individual pages. Rather, these indexers itemize each piece of content and its associated fields. Note that this goes beyond simple indexing of text and rich text fields; the index also includes taxonomy and metadata fields, which in turn enables richer and more complex search interfaces, including categorical/faceted searches as well as related content searches.

*Although some CMSs do make available a page crawler search engine as well, this article refers to the more common content index.
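To make the idea of a field-level content index concrete, here is a rough sketch, assuming a toy in-memory index keyed by field name and token. Real CMS indexers are far more sophisticated; the field names (`title`, `body`, `category`), the sample items, and the `build_index` and `search` helpers are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical content items: text fields plus a taxonomy field.
content_items: List[Dict[str, str]] = [
    {"id": "a1", "title": "Visit New York", "body": "Museums and parks", "category": "travel"},
    {"id": "a2", "title": "Pizza guide", "body": "Best slices in New York", "category": "food"},
    {"id": "a3", "title": "Hiking basics", "body": "Trails for beginners", "category": "travel"},
]


def build_index(items: List[Dict[str, str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (field, token) -> ids of items whose field contains that token."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for item in items:
        for field_name, value in item.items():
            if field_name == "id":
                continue
            for token in value.lower().split():
                index[(field_name, token)].add(item["id"])
    return index


def search(index, term: str, category: Optional[str] = None) -> List[str]:
    """Ids of items containing every word of `term` in title or body,
    optionally narrowed to one category (a simple facet)."""
    result: Optional[Set[str]] = None
    for word in term.lower().split():
        hits = index.get(("title", word), set()) | index.get(("body", word), set())
        result = hits if result is None else result & hits
    result = result or set()
    if category is not None:
        result = result & index.get(("category", category.lower()), set())
    return sorted(result)


index = build_index(content_items)
print(search(index, "new york"))                   # ['a1', 'a2']
print(search(index, "new york", category="food"))  # ['a2']
```

Because the taxonomy field is indexed alongside the text fields, the category filter comes almost for free. The results, however, are item ids, not page URLs.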
The Issue Surfaces

Unique and shared content items present a challenge for a simple content-based search engine. Depending on the implementation, searching for a particular text string like “new york” would bring back a list of all content items containing that search term in their text fields. This would have two undesirable outcomes when compared to a page crawler:
- The indexer will only tell you which content items contain the search term, but will not tell you on which pages those content items are used.
- Certain vestigial content items which aren’t actually used on any page may appear in search results, often lacking information on how to display or link to that content.
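Both outcomes can be reproduced with a small sketch that contrasts an item-level search with what a page crawler would return. Everything here is hypothetical: the item ids, the page URLs, and the two helper functions are stand-ins rather than any CMS's actual API.

```python
from typing import Dict, List

# Hypothetical content repository: item id -> text.
items: Dict[str, str] = {
    "home": "Welcome to our store",
    "locations": "Find a store near you",
    "promo": "Now open in New York",        # shared callout used on two pages
    "old-banner": "New York launch party",  # vestigial: not used on any page
}

# Hypothetical page layout: page URL -> ids of the items rendered on it.
pages: Dict[str, List[str]] = {
    "/": ["home", "promo"],
    "/locations": ["locations", "promo"],
}


def search_items(term: str) -> List[str]:
    """What a simple content index returns: ids of matching items."""
    return sorted(i for i, text in items.items() if term.lower() in text.lower())


def search_pages(term: str) -> List[str]:
    """What a page crawler would return: URLs whose rendered text matches."""
    return sorted(
        url
        for url, ids in pages.items()
        if any(term.lower() in items[i].lower() for i in ids)
    )


print(search_items("new york"))  # ['old-banner', 'promo']
print(search_pages("new york"))  # ['/', '/locations']
```

The item-level search returns `promo` without saying that it appears on two pages, and it also returns `old-banner`, which no visitor could ever reach.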