Google App Engine Basic Text Search
This post is part of Who's @ Google I/O, a series of blog posts that give a closer look at developers who'll be speaking or demoing at Google I/O. This guest post is written by Brian Dorry from LTech, who is demoing as part of the Developer Sandbox.
Having trouble implementing search on your App Engine Datastore? Try this technique for a basic search until official full text support is ready.
Since adding Google App Engine to our technical tool belt in 2008, we at LTech have utilized the platform for a wide range of products and customer solutions. It is cost effective, easy to use, and will automatically scale your application on Google's immense infrastructure. Knowing your applications will be running on the same technologies that Google's own systems take advantage of makes it the easy choice again and again.
From our own experiences and participation in the developer community, the biggest complaint we hear is the lack of a full text search in the datastore. Google has marked this issue as "Started", but has not announced a release date yet, so alternative approaches are still going to be in play for the short term. We are big fans of Lucene (http://lucene.apache.org/), an open source indexing and search engine, but without the ability to save the index file to disk on App Engine, it becomes a non-starter.
We need a quick solution that is light on CPU and still takes advantage of the Google infrastructure.
Problem
Taking advantage of the App Engine Datastore, we can issue an inequality query to perform a basic "starts with" search. This can be a good solution for searching users, tags, and domains, and works well for implementing a search box auto-complete feature.
Solution
Our example solution uses JDO to generate a query that instructs the Datastore to return all records that start with the search string. This is accomplished by issuing a greater than or equal condition against the search term, and a less than condition against the search term concatenated with the Unicode replacement character ('\ufffd'). The resulting query limits results to items that consist of the search input followed by zero or more other Unicode characters.
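To make the two conditions concrete without a datastore, here is a minimal sketch in plain Java. The helper name inPrefixRange and the sample names are our own illustration, not part of JDO or App Engine, and Java's String.compareTo is only a stand-in for the datastore's ordering (we assume the two agree for ordinary text).

```java
import java.util.List;

class PrefixRange {
    // Hypothetical helper: true when value lies in the half-open range
    // [prefix, prefix + U+FFFD), i.e. it passes both inequality conditions.
    static boolean inPrefixRange(String value, String prefix) {
        String upper = prefix + "\ufffd";
        return value.compareTo(prefix) >= 0 && value.compareTo(upper) < 0;
    }

    public static void main(String[] args) {
        List<String> names = List.of("al", "alice", "alan", "bob", "aki", "am");
        for (String name : names) {
            // For this sample data the range check agrees with startsWith.
            System.out.println(name + " in range: " + inPrefixRange(name, "al")
                    + ", startsWith: " + name.startsWith("al"));
        }
    }
}
```

Note that the comparison is case-sensitive, so the stored values and the search input need to be normalized the same way for matches to line up.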
The sample below uses JDO behind the scenes, but this trick will work with straight GQL as well. Let's take a look:
import java.util.List;
import javax.jdo.PersistenceManager;
import javax.jdo.Query;
(...)
public static List searchGreeting(String query) {
    // let's clean up the input
    query = (query != null ? query.toLowerCase() : "").trim();
    PersistenceManager pm = PMF.get().getPersistenceManager();
    Query q = pm.newQuery(Greeting.class);
    // set the filter and params
    q.setFilter("content >= :1 && content < :2");
    // run query with param values and return results
    return (List) q.execute(query, (query + "\ufffd"));
}
This code snippet is going to search the JDO defined Greeting entity on the content field and return the full Greeting payload for each match. Let's focus on the last two statements.
q.setFilter("content >= :1 && content < :2");
Here we set up the inequality. We are asking the datastore to return all matches where content is between a set of two values. But how does that define a search?
return (List) q.execute(query, (query + "\ufffd"));
When we set our parameters, we pass the same query value to both, with an extra character on the end of the second one. This is essentially telling the datastore to return all records that start with the query term. In terms of sets, the first part of the query returns the set of all words greater than or equal to the query term, including words that don't even start with the query term. The second part of the query returns the set of all words less than the query term followed by '\ufffd', which includes every word that starts with the query term as well as every word that sorts before it. The intersection of the two sets is the search result for all words starting with the search term.
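The set reasoning can be tried out with an in-memory sorted set. This is only a sketch: the TreeSet stands in for the datastore's sorted index, and the class name, method names, and sample words are hypothetical.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class PrefixSets {
    // Set 1: every word greater than or equal to the query term.
    static NavigableSet<String> atOrAbove(NavigableSet<String> index, String term) {
        return new TreeSet<>(index.tailSet(term, true));
    }

    // Set 2: every word less than the query term followed by U+FFFD.
    static NavigableSet<String> below(NavigableSet<String> index, String term) {
        return new TreeSet<>(index.headSet(term + "\ufffd", false));
    }

    // Intersection of the two sets: the words that start with the term.
    static NavigableSet<String> startsWith(NavigableSet<String> index, String term) {
        NavigableSet<String> result = atOrAbove(index, term);
        result.retainAll(below(index, term));
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> index =
                new TreeSet<>(List.of("aki", "al", "alan", "alice", "am", "bob"));
        System.out.println(atOrAbove(index, "al"));  // [al, alan, alice, am, bob]
        System.out.println(below(index, "al"));      // [aki, al, alan, alice]
        System.out.println(startsWith(index, "al")); // [al, alan, alice]
    }
}
```

Neither set is useful on its own: the first still contains "am" and "bob", and the second still contains "aki". Only the words present in both start with "al".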
This simple-to-implement technique will solve many basic search problems until a full text solution is available. It will work outside of JDO as well with regular GQL statements. For a Python implementation, please see our friend Graeme's blog.
Posted by Brian Dorry, LTech team