In this article, I will show how to implement auto-completion using Hibernate Search.
The same can be achieved using Solr or ElasticSearch, but I decided to use Hibernate Search because it's the simplest to get started with, integrates easily with an existing application, and leverages the same core library, Lucene. And we get all of this without the overhead of managing a Solr/ElasticSearch cluster. All in all, I find Hibernate Search to be the go-to search engine for simple use cases.
For our use case, we build product-title-based auto-completion, since user queries are often searches for a product title. While typing, users should immediately see titles matching their input, and Hibernate Search should do the hard work of filtering the relevant documents in near real time.
Let's start with the following JPA-annotated Product entity class.
@Entity
@Table(name = "item_master")
public class Product {

    @Id
    @Column(name = "sku")
    private String sku;

    @Column(name = "upc")
    private String upc;

    @Column(name = "title")
    private String title;

    ....
}
We are interested in returning suggestions based on the 'title' field, which will be indexed using two strategies: N-Gram and Edge N-Gram.
Edge N-Gram - matches only from the left edge of the suggestion text. For this we use KeywordTokenizerFactory (which emits the entire input as a single token) and EdgeNGramFilterFactory, along with some regex cleansing.
N-Gram - generates grams from every word, so you can get right-truncated suggestions for any word in the text, not only the first one. The main difference from Edge N-Gram is the tokenizer: StandardTokenizerFactory, combined with NGramFilterFactory.
Using these strategies, if the document field is "A brown fox" and the query is
a) "A bro" - will match (via the Edge N-Gram field)
b) "bro" - will match (via the N-Gram field, since "brown" produces the gram "bro"), as the sketch below illustrates.
Implementation: in the entity defined above, we map the 'title' property multiple times using the above strategies. Below are the annotations that instruct Hibernate Search to index 'title' with each analyzer.
@Entity
@Table(name = "item_master")
@Indexed(index = "Products")
@AnalyzerDefs({

    // Edge N-Gram analyzer: the whole input is one token, grams anchored at the left edge
    @AnalyzerDef(name = "autocompleteEdgeAnalyzer",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
            // Replace anything that is not a letter, digit or dot with a space
            @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                @Parameter(name = "pattern", value = "([^a-zA-Z0-9\\.])"),
                @Parameter(name = "replacement", value = " "),
                @Parameter(name = "replace", value = "all") }),
            // Normalize token text to lowercase, as the user is unlikely to
            // care about casing when searching for matches
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = StopFilterFactory.class),
            // Index partial words starting at the front, so we can provide
            // autocomplete functionality
            @TokenFilterDef(factory = EdgeNGramFilterFactory.class, params = {
                @Parameter(name = "minGramSize", value = "3"),
                @Parameter(name = "maxGramSize", value = "50") }) }),

    // N-Gram analyzer: standard tokenization, grams generated from every word
    @AnalyzerDef(name = "autocompleteNGramAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = WordDelimiterFilterFactory.class),
            // Normalize token text to lowercase
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class, params = {
                @Parameter(name = "minGramSize", value = "3"),
                @Parameter(name = "maxGramSize", value = "5") }),
            // Replace anything that is not a letter, digit or dot with a space
            @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                @Parameter(name = "pattern", value = "([^a-zA-Z0-9\\.])"),
                @Parameter(name = "replacement", value = " "),
                @Parameter(name = "replace", value = "all") })
        }),

    // Standard analyzer for ordinary full-text matching on the same field
    @AnalyzerDef(name = "standardAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = WordDelimiterFilterFactory.class),
            // Normalize token text to lowercase
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            // Replace anything that is not a letter, digit or dot with a space
            @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                @Parameter(name = "pattern", value = "([^a-zA-Z0-9\\.])"),
                @Parameter(name = "replacement", value = " "),
                @Parameter(name = "replace", value = "all") })
        })
})
public class Product {
    ....
}
Explanation: two custom analyzers, autocompleteEdgeAnalyzer and autocompleteNGramAnalyzer, are defined following the strategies in the previous section, plus a standardAnalyzer for ordinary full-text matching.
Next, we apply these analyzers to the 'title' property, indexing it three different ways (standard, Edge N-Gram, and N-Gram). Here is how we do it:
@Column(name = "title")
@Fields({
    @Field(name = "title", index = Index.YES, store = Store.YES,
        analyze = Analyze.YES, analyzer = @Analyzer(definition = "standardAnalyzer")),
    @Field(name = "edgeNGramTitle", index = Index.YES, store = Store.NO,
        analyze = Analyze.YES, analyzer = @Analyzer(definition = "autocompleteEdgeAnalyzer")),
    @Field(name = "nGramTitle", index = Index.YES, store = Store.NO,
        analyze = Analyze.YES, analyzer = @Analyzer(definition = "autocompleteNGramAnalyzer"))
})
private String title;
Start indexing:
public void index() throws InterruptedException {
    // Use the mass indexer to (re)build the whole index from the database
    getFullTextSession().createIndexer().startAndWait();
}
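The getFullTextSession() helper used above (and in the query below) is not shown in the original code; a minimal sketch, assuming a Hibernate SessionFactory is available (the sessionFactory field here is an assumption of this sketch), could look like this:

private FullTextSession getFullTextSession() {
    // sessionFactory is assumed to be injected elsewhere in this class
    return Search.getFullTextSession(sessionFactory.getCurrentSession());
}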
Once indexing completes, inspect the index using Luke and you should see the title analyzed and stored as N-Grams and Edge N-Grams.
Search Query:
private static final String TITLE_EDGE_NGRAM_INDEX = "edgeNGramTitle";
private static final String TITLE_NGRAM_INDEX = "nGramTitle";

@Transactional(readOnly = true)
public synchronized List<Product> getSuggestions(final String searchTerm) {
    QueryBuilder titleQB = getFullTextSession().getSearchFactory()
            .buildQueryBuilder().forEntity(Product.class).get();

    // Phrase query over both n-gram fields; the edge n-gram field is boosted
    // so that left-anchored (prefix) matches rank higher
    Query query = titleQB.phrase().withSlop(2).onField(TITLE_NGRAM_INDEX)
            .andField(TITLE_EDGE_NGRAM_INDEX).boostedTo(5)
            .sentence(searchTerm.toLowerCase()).createQuery();

    FullTextQuery fullTextQuery = getFullTextSession().createFullTextQuery(
            query, Product.class);
    fullTextQuery.setMaxResults(20);

    @SuppressWarnings("unchecked")
    List<Product> results = fullTextQuery.list();
    return results;
}
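A quick, hypothetical usage example (suggestionService stands in for whatever bean hosts getSuggestions(), and the usual getter on Product is assumed):

List<Product> suggestions = suggestionService.getSuggestions("bro");
for (Product product : suggestions) {
    System.out.println(product.getTitle());  // titles matching the partial query
}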
And we have a working suggester. What next? Expose the functionality via a REST API and integrate it with jQuery; examples of both are easy to find. A rough sketch of such an endpoint follows.
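For instance, a minimal Spring MVC sketch (the controller, its mapping, and the SuggestionService are assumptions for illustration, not part of the original code):

@RestController
public class SuggestionController {

    @Autowired
    private SuggestionService suggestionService; // hypothetical bean wrapping getSuggestions()

    // Returns up to 20 matching titles for the partial query typed so far
    @RequestMapping(value = "/suggest", method = RequestMethod.GET)
    public List<String> suggest(@RequestParam("q") String query) {
        List<String> titles = new ArrayList<String>();
        for (Product product : suggestionService.getSuggestions(query)) {
            titles.add(product.getTitle());
        }
        return titles;
    }
}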
You can also use the same strategies with Solr and ElasticSearch.