Indexing data that's not in your database with Sphinx and Thinking Sphinx

October 13, 2010

At AboutUs where I work we recently re-implemented the site’s search feature. Behind the scenes the new search uses Sphinx (a super-fast and scalable search engine) and Thinking Sphinx (which allows easy configuration and querying of Sphinx from within Rails).

One thing that makes Sphinx so much faster than alternatives like Solr is that it communicates directly with your database when it is building its search index. This is far faster than talking with your application layer, and indexing the result of your application models’ methods. In our case we’re able to reindex about 40 million records in around 4 hours. Solr used to take days.

There is a major disadvantage to this approach though. Since Sphinx is talking directly to the database, you can’t index the results of method calls on your ruby objects. I’ve discovered though that this is not nearly as limiting as it may sound, since you can index the results of SQL functions. Even better, manipulating and transforming your data with SQL functions as you index it, is virtually free. I haven’t noticed any impact on our index times, after adding many functions to our indexing process.

Here are a couple real examples.

In our app, we deal a lot with domain names, and websites. One feature of our search displays domain names that match your query. Our stakeholders wanted these to be sorted by the importance of the tld, with .com at the top, followed by .org, .gov, etc. Since this sort order is basically arbitrary I replace each tld with an integer, using mySQL’s REPLACE function, store this attribute in Sphinx, and sort on it when querying.

This is what it looks like in Thinking Sphinx’s define_index block:
  define_index do
    # Attributes that allow for fast sorting of domain names by tld, favoring .com's etc.
    # Useful in "exact domain match" search
    replace_str = "LOWER(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(page_title,'com','1'),'org','2'),'gov','3'),'edu','4'),'net','5'),'co.uk','6'),'de','7'),'co.jp','8'),'info','9'))" 
    has replace_str, :as => :page_title_sortable_by_tld, :type => :string
  end

It’s not pretty, but it’s far prettier than trying to do this in ruby, at query time. Now we just add an :order => "page_title_sortable_by_tld ASC" to our calls to Page.search.

Another place we use this is to convert a flag stored in a serialized YAML structure into a boolean attribute that Sphinx can sort on and filter by.
  define_index do
    # Allow favoring in search results based on whether a page is
    # "pro_followed" (i.e. it has its external links followed).
    #
    # Since this data is stored in a serialized YAML structure we need some SQL
    # function fanciness to access this data as a boolean value.  If the below
    # YAML key value appear in the text field LOCATE will return an integer
    # greater than 0.  If not it will return 0.  Coercing this to 1 or 0 (with
    # "> 0") allows sphinx to interpret this as a boolean value.
    join page_data_latest.page_datum
    has 'LOCATE(":follow_external_links: true", `page_data`.`content`) > 0',
      :as => :pro_followed, :type => :boolean
  end

This code searches a field in the database for the some literal YAML text, and stores whether it was found as a true/false attribute in the search index. This is way easier than deserializing the YAML and storing this value in a boolean field in mySQL.

This has proved to be a very useful technique for us, and is far easier and more efficient than denormalizing some of our data to a “search text” column, or doing a bunch of post-querying processing within Ruby on Rails.