Google App Engine Basic Text Search

MAY 19, 2010
This post is part of the Who's @ Google I/O, a series of blog posts that give a closer look at developers who'll be speaking or demoing at Google I/O. This guest post is written by Brian Dorry from LTech who is demoing as part of the Developer Sandbox.

Having trouble implementing search on your App Engine Data Store? Try this technique for a basic search until official full text support is ready.

Since adding Google App Engine to our technical tool belt in 2008, we at LTech have utilized the platform for a wide range of products and customer solutions. It is cost effective, easy to use, and will automatically scale your application on Google's immense infrastructure. Knowing your applications will be running on the same technologies that Google's own systems take advantage of make it the easy choice again and again.

From our own experiences and participation in the developer community, the biggest complaint we hear is the lack of a full text search in the datastore. Google has marked this issue as "Started", but has not announced a release date yet, so alternative approaches are still going to be in play for the short term. We are big fans of Lucene (http://lucene.apache.org/), an open source indexing and search engine, but without the ability to save the index file to disk, it becomes a non-starter.

We need a quick, non-CPU taxing solution that still takes advantage of the Google infrastructure.

Problem
Taking advantage of the App Engine Datastore, we can issue an inequalities query to perform a basic "starts with" search. This can be a good solution for searching users, tags, and domains and works well for implementing a search box auto-complete feature.

Solution
Our example solution uses JDO to generate a query that instructs the DataStore to return all records that start with the search string. This is accomplished by issuing a greater than or equal condition against the search term, and a less than condition against the search input concatenated with the unicode replacement character ('\ufffd'). The resulting query limits results to items that start with the search input, as well as any other unicode characters that follow.

This code uses JDO behind the scenes, but this trick will work with straight GQL as well. Let's take a look at the sample:

import java.util.List;
import javax.jdo.PersistenceManager;
import javax.jdo.Query;

(...)

public static List searchGreeting(String query) {

// let's clean up the input
query = ( query != null ? query.toLowerCase() : "").trim();

PersistenceManager pm = PMF.get().getPersistenceManager();
Query q = pm.newQuery(Greeting.class);

// set the filter and params
q.setFilter("content >= :1 && content < :2");

// run query with param values and return results
return (List) q.execute(query, (query + "\ufffd"));

}

This code snippet is going to search the JDO defined Employee entity on the name column and return the full Employee payload for each match. Let's focus on the last two lines of code.

q.setFilter("name >= :1 && name < :2");

Here we set up the inequality. We are asking the data store to return all matches where name is between a set of two values. But how does that define a search?

return (List) q.execute(query, (query + "\ufffd"));

When we set our parameters, we pass the same query value to both with an extra character on the end of the second one. This is essentially telling the data store to return all records that start with the query term. In terms of sets, the first part of the query returns the set of all words greater than the query term, including words that don't even start with the query term. The second part of the query returns the set of all words less than the query term including any that start with the query term. The intersection of the two sets is the search result for all words starting with the search term.

This simple to implement technique will solve many basic search problems until a full text solution is available. It will work outside of JDO as well with regular GQL statements. For a python implementation, please see our friend Graeme's blog.

Posted by Brian Dorry, LTech team