Introduction

imagesCANRN13KYou’ve probably seen the option to install Full-Text Search during a SQL Server installation without really understanding what it is and how it can be used. As with other database systems SQL Server has a full spectrum of features that typically reaches far beyond what is needed in any one software development project. I look at Full-Text Search Services as one of those step child type features I never felt was worth spending time on since it seemed so trivial at first glance. I couldn’t ever really recall a time when I had a problem searching text per say. If you are like me you probably thought about why would you need that feature for a second and then went on your merry way. Honestly I didn’t really know what the full-text search feature was in the scheme of things and never really had a situation where I felt like it was something I really needed.

Recently this feature became a topic of interest to me for a project I was working on for a client who wanted to create an intermediate interface for creating new CRM accounts. They didn’t want CRM to be just another system of independently created and maintained account information which would only add to their existing problem. The solution was to create a system that managed and linked duplicate customer account information together from various systems. The premise behind this system was to allow a user to look for an existing company record before creating a new CRM account. Instead of displaying the standard CRM new account form, the intermediate interface would be shown allowing the user to search for an existing company. If that company already existing in the CRM system they would simply be redirected to that record else the user could elect to create a new record either based on an existing record or from scratch. While it is true that the company information itself would effectively be duplicated in CRM for a new record, it would automatically be linked to existing company records in other systems while also reducing the amount of keying required to add the account information to CRM. In addition to the internal account information being this system they also added public databases of company information from other sources such as the Dun & Bradstreet database. By including public sources of company information the user could search for a company that may not be found in their existing systems but still allowing the user to create a new CRM account without having to rekey all that data in manually.

In theory this sounds great until I came to see the amount of data that needed to be searched. I ended up having upwards of 17 million rows to start off in my primary table. The user needed to be able to search on not only company name, but any information related to a company such as address, tax Id, duns number, etc. I couldn’t just simply throw an index on the company name field and call it a day. I also didn’t want to try creating a covering indexes across tables. I needed something much more comprehensive. As I found myself having to search through this significant amount of text data than normal I felt like I had to pursue another option. I wanted to research any opportunities to increase search performance of text fields in my database feeling that normal indexes were not going to cut it. With a full-text index I could create a new kind of index that would cover all the fields in my company table that indexed the things I wanted to search on in the first place – the words, not the character data in the fields themselves. Using full-text search I created an interface that allowed the user to perform their company searches over millions of rows for data that could be in one of a number of fields in my table.

If you think about the world we live in today we are increasingly storing more textual data in databases as we move towards a paperless society. You can expect that the amount of text database systems are going to be required to search through will only increase over time and standard indexes are not going to get the job done. This being the case it makes sense for you to have the ability create a special kind of index that can provide better performance and functionality on the level of words and phrases rather than simply on the characters that make up the words themselves.

Full-Text Search Overview

If you take a cursory look at what you can find online about full-text search you’ll most likely find at the top of your search some MSDN documentation from Microsoft on how to setup full-text search. At this point I wasn’t so much interested in knowing how to set it up versus why I would want to setup it in the first place.

So what is this full-text search feature? The name itself seems rather descriptive but it doesn’t give you a glance into how it can actually benefit you. Let’s think of this special type of index in terms of a book. A book will typically have a table of contents containing an index of chapters and possibly an appendix containing supplementary information on words or phrases and where to find them within the book. The table of contents index is typically based on the context of subject matter found in each chapter. This makes sense from a reader’s perspective who may want to skip ahead to read a particular topic. This is not easy to do if the reader only had an appendix which would be used more for a specific lookup of a word or phrase of interest.

I like to think of a typical index on a text field like the appendix of a book. The appendix is simply based on the content of the book which is valid for certain purposes but not necessarily the context. Full-text search from my perspective provides a higher level of indexing as compared to an appendix. A typical index on a text field indexes character data whereas a full-text index tracks the occurrences of words and word-forms in the blocks of unstructured text which makes a full-text index structure very different from standard indexes.

When it comes to searching text you can think of full-text search as a super index. Most developers are probably most familiar with searching through text data using the LIKE operator. Depending on the amount of data being searched the LIKE operator will often times give good enough performance, but if you think about it how often are you actually searching text data by the character? Most likely you are probably looking for a word or phrase. The biggest performance gains with a full-text index over a standard index comes when as you find yourself having to search larger datasets. This is because the full-text index is based on words and not simply on the individual characters that make up the words. When the full-text search engine indexes text it breaks the words into tokens which become the basis of the index.

Here’s what Microsoft has to say about full-text index performance:

“The performance benefit of using full-text search can be best realized when querying against a large amount of unstructured text data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned. “

Besides the performance gains the full-text search engine allows you to not only search on words, but phrases, and inflectional forms of words. You can search for a word such as “drive” and specify that other forms of the word should quality in the search like “drives”, “driving” or “driven”. You also can specify searches on synonymous forms of a specific word as well as a word or phrase that begin with specific text.

You can think of full-text search as SQL Server’s own little search text search engine. You have the ability to rank results and limit the results returned. As a matter of fact a blog post by Microsoft MVP Michael Coles gives an example of how to create your own Google type search using the full-text search engine at http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/.

In his blog Michael describes how to construct a query engine that uses similar syntax to Google for searching text inside SQL Server. One of the benefits to this is that most people are already familiar with using search engines so you can parlay this experience to reduce the learning curve of an application.

Some of the other nice things you can do with a full-text search are:

  • Search for all the difference tenses of a verb or the singular and plural forms of a noun. In the example from MSDN the forms of “foot” or “feet” are given as a possible search pattern.
     
  • The ability to search for words or phrases in close proximity to another word or phrase.

     
  • Search for words or phrases and specify a weighting value.

Differentiating Characteristics of a Full-Text Index

So you may still be asking yourself, but what are the actual characteristics of a full-text search index that make it unique? Here are a few features of full-text search that goes beyond a normal index:

  • Search Filters
    Search filters are full-text components that interpret binary data based on the structure and format. By creating a special column in your table that will store an identifier for the data in the form of a file extension the Full-Text engine can use it as a context for processing the data. An example would be a type column that stores a value of “doc” which would indicate that the engine is processing a Microsoft Word document and therefore use an MSWord Filter.

     
  • Stop Words
    There are common words in languages that provide no meaningful benefit for the search process. Words in the English language that meet this criteria are "the”, “a”, and “an”. These types of words are called stop words because they bloat the index and can interfere with the query processing. Excluding stop words makes the index more efficient.

     
  • Stop Lists
    A stop lists is a collection of stop words. Previous to SQL Server 2008 you were limited to a predefined English language stop word list. Starting in SQL Server 2008 you have the ability to create your own customized stop lists. This is useful if there are words outside of the normal language stop words that you would like to prevent from being indexed. A use case for creating your own stop list could be that for a particular industry there are common words used that you may consider a stop word that you would like to excluding from indexing. In this case you could create your own customized stop word list and add these words.

     
  • Word Breakers
    Languages contain certain characters or structures that delimit words and are considered word breakers. SQL Server uses word breakers to determine how to identify individual words while indexing. These characters or structures are particular to a language and are automatically loaded for each supposed language in SQL Server (NOTE: Multiple languages for full-text search is not supported previous to SQL Server 2008).

     
  • Word Stemmer
    Most language comes from a base of common words or concepts and through word forms can can be created into new words. A word stem example would be the word “repair” with variants such as “repairing” or “repaired”. Finding the common stems of words is critical for determining context. The set of rules that determine how these word forms are identified are called stemmers. Word stemmers can be especially handy when you would like to do a search including all variants of a particular word you may be looking for in a search. If I wanted to find all text containing the word stem of “repair” I could specify this in my query.

     
  • Population
    The process of loading a full-text index structure is called population. There are three modes of population which are full, incremental and update. Full population repopulates the entire index, incremental populates only the incremental changes based on timestamp values and update populates the index using a change tracking record of modified data since the last index population. Full-text search indexes can be populated in batch using Transact-SQL or on a scheduled job to alleviate real-time performance hits.

Summary

SQL Server’s full-text search engine is a great feature that helps you when dealing with large amounts of text data. I admittedly took it for granted thinking that it was just some fluffy feature that I didn’t really see as a necessity in my development efforts. Now that I know what full-text search is and what it can do for me I will always consider if the solution I’m working on can benefit from a full-text search index. Now you can do the same.

Resources:

MSDN Full-Text Search Documentation: http://msdn.microsoft.com/en-us/library/ms142571(v=SQL.90).aspx