In pursuit of happiness!

Sunday, January 25, 2015

Easy Tricks for JPA, Spring and Hibernate

Java frameworks have evolved, making us write less code and ship faster! Here, I will discuss some neat tricks to address common concerns in Hibernate and Spring.

1. Auto Scan JPA entities. This is an old trick!

Listing entities (via <class> element) in persistence.xml isn't needed any more. You may drop it all together and use Spring's packagesToScan feature. Sample spring configuration below.


<bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
 <property name="persistenceUnitName" value="HelloService" />
 <property name="packagesToScan" value="com.x.y.z" />
 <property name="dataSource" ref="dataSource" />
 <property name="jpaVendorAdapter">
  <bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter">
   <property name="showSql" value="true" />
   <property name="databasePlatform" value="org.hibernate.dialect.MySQL5Dialect" />
  </bean>
 </property>
</bean>

2.Entities often have audit columns like the following:



@Column(name = "created_by", updatable = false)
protected String createdBy;

@Column(name = "creation_date", updatable = false)
protected Date createdOn;

@Column(name = "last_updated_by")
protected String lastUpdatedBy;

@Column(name = "last_updated")
protected Date lastUpdatedOn;

And providing those values require either writing pre-insert/update Hibernate listeners or writing setters. How about sprinkling some annotations to get the job done? I mean ...


@Column(name = "created_by", updatable = false)
@ModifiedBy
protected String createdBy;

@Column(name = "creation_date", updatable = false)
@CreationTimestamp
protected Date createdOn;

@Column(name = "last_updated_by")
@ModifiedBy
protected String lastUpdatedBy;

@Column(name = "last_updated")
@UpdateTimestamp
protected Date lastUpdatedOn;

It is quite simple. Read this: http://docs.jboss.org/hibernate/orm/4.3/topical/html/generated/GeneratedValues.html

3. Applications are deployed in different environments and configuring those are often messy and by messy I mean repetitive. Spring property place holder has a neat trick by using property of property to dynamically create the property key for each environment.

Sample common properties file for all environments.


# ec2.properties is a properties for all environment.  
awsAccountId=${${env}.awsAccountId}
availabilityZone=${${env}.availabilityZone}
keyName=${${env}.dev}
expiry=${${env}.expiry}
instanceType=${${env}.instanceType}

dev.awsAccountId = 11111111111
dev.availabilityZone = us-east-1a
dev.keyName = dev
dev.expiry = 360000
dev.instanceType = m1.small

beta.awsAccountId = 222222222222
beta.availabilityZone = us-east-1a
beta.keyName = beta
beta.expiry = 360000
beta.instanceType = m1.large

Here is how you will use this file in spring context configuration -


<bean id="ec2Settings"  class="x.y.z.EC2Settings">  
 <property name="accountId" value="${awsAccountId}" />  
 <property name="zone" value="${availabilityZone}" />  
 <property name="key" value="${keyName}" />  
 <property name="expiry" value="${expiry}" />
 <property name="instanceType" value="${instanceType}" />  
</bean>

Sunday, December 21, 2014

Codahale Metrics and Spring

This is a revision of my previous post on Coda Hale Metrics available here.

Goal: Integrate Metrics (v 3.1.0) and Spring (v 4.1.x) in a JEE environment.

Metrics-spring resides here. Code snippets to help you get started below:

1. Add the following dependency in pom.xml


<dependency>
     <groupid>io.dropwizard.metrics</groupid>
     <artifactid>metrics-servlets</artifactid>
     <version>3.1.0</version>
</dependency>
<dependency>
     <groupid>com.ryantenney.metrics</groupid>
     <artifactid>metrics-spring</artifactid>
     <version>3.0.3</version>
</dependency>
..
Other metrics dependencies

2. Add the following in web.xml


<servlet>
 <servlet-name>metrics-admin</servlet-name>
 <servlet-class>com.codahale.metrics.servlets.AdminServlet</servlet-class>
</servlet>
<servlet-mapping>
 <servlet-name>metrics-admin</servlet-name>
 <url-pattern>/metrics/admin/*</url-pattern>
</servlet-mapping>

3. Define metrics.xml and include it in your main spring configuration file.


<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:metrics="http://www.ryantenney.com/schema/metrics"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
            http://www.springframework.org/schema/beans/spring-beans.xsd
               http://www.ryantenney.com/schema/metrics
               http://www.ryantenney.com/schema/metrics/metrics-3.0.xsd">

    <!-- Registry should be defined in only one context XML file -->
    <metrics:metric-registry id="metrics" />

    <metrics:health-check-registry id="healthCheck" />

    <!-- annotation-driven must be included in all context files -->
    <metrics:annotation-driven metric-registry="metrics"
        health-check-registry="healthCheck" />

    <!-- (Optional) Registry should be defined in only one context XML file -->
    <metrics:reporter type="console" metric-registry="metrics"
        period="1m" />

    <bean
        class="org.springframework.web.context.support.ServletContextAttributeExporter">
        <property name="attributes">
            <map>
                <entry key="com.codahale.metrics.servlets.MetricsServlet.registry">
                    <ref bean="metrics" />
                </entry>
                <entry key="com.codahale.metrics.servlets.HealthCheckServlet.registry">
                    <ref bean="healthCheck" />
                </entry>
            </map>
        </property>
    </bean>
</beans>

4. Define HealthCheck classes and annotate methods with @Timed as per Metrics documentation. And we are done!

Navigate to: http://hostname:port/<webappname>/metrics/admin to view "Operational Menu"

The output metrics is dumped in JSON format. You can parse it and pass it to your favorite graphing library.

Note: All metrics are reported via Console. You can easily enable other reporters in metrics.xml.

Friday, October 17, 2014

Stuff Software Engineers should know about product management!

Success will most likely come your way if you have a great Product Manager in the team. However, not all teams are that lucky. In some team setups, senior tech leads or architects also double up as product managers. Even otherwise, it is good to know some product management to take sound decisions in your architecture, code, deployment and metrics.

Here are some tips to build a great product when software engineers manage the product:

1. Know your customers and their problems.

Who is the end user? Meet and talk to them even if they are sitting in a warehouse!

Ask if you should include feature X before they can use it? Is this an incremental value addition or a game changer? This will get you to MVP quickly.

2. What is your delivery channel?
Mobile, web or multichannel presence. Do your research and pick one to start with.

3. List and estimate non-functional requirements

How many concurrent users? Acceptable latency? Downtime? You get the gist.

4. Release fast and release often

Take customer feedback and iterate. Building the most awesome product takes time. And time is often a luxury.

Releasing product often keeps you on track. Intensity and quality of the feedback helps you prioritize the next set of features to code. Feature prioritization is the key. It also ensure that the team is productive.

If your customers say "I want to use it today", you are on track. Make the feature available to them in pre-production.

Such interaction also saves you from big software re-architectures and re-implementation.

Caution: Ensure that your customers spend serious time interacting with your software.

5. Measure your success (or failure!)

In helps you analyze what's working and what's not. Get rid of later and reinforce the former. Set goals and strive towards it.

If you can measure metrics like conversion, sales, traffic, API calls etc in real time, nothing like it!

6. Experiment
Fear not! Capture those new ideas from the team.Let the numbers and your intuition find its way to newer and successful experiments.

That one new feature can be the game changer!

7. Provide good customer service!

Monitor your application and support it. FAQs, Documentation, Tickets, Training etc are some options to chose from.

And may you become a great product manager!

Wednesday, October 15, 2014

Software stack deployment automation the Amazon CloudFormation way!

Deploying a complete software stack is breeze with Amazon's CloudFormation. It provides a repeatable and predictable mechanism to launch a stack comprising of EC2 instances, load balancers, databases and other Amazon resources.

There are many sample CloudFormation (CF) recipes to get you started, LAMP recipe being the simplest one. Once the hardware is setup, you can use CloudFormation to bootstrap applications. This is where is gets a little tricky.

Multiple options are available including CloudInit. You can pass executable actions to instances at launch time through EC2 user-data attribute. Other options are Chef and Puppet.

Some scenarios where CloudFormation can be of great help:

You own a simple web application running on Tomcat and fronted by a load balancer. New engineers joining your team can simply run the CF template to create their own developer environment on the cloud.
You own an application that builds on multi-instance architectures. Multi-instance architectures separate software instances (or hardware systems) operate on behalf of different client organizations. Launching a new stack for a new client becomes just too simple!
Continuous delivery is simplified via use of CF. Launch a new stack for QA for Feature 1. However, Feature 2 will have to wait deployment till QA has certified Feature 1. Launch a new stack for Feature 2 and build parallel continuous delivery pipelines.

Do remember to delete the stack once it is not needed!

Monday, August 4, 2014

Pray!

Right off YouTube -

"Please pray for my friend's 4 year old son. I just found out that 4 minutes of his life was not documented on Facebook."

Saturday, June 7, 2014

Collocation extraction using NLTK

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful.

Collocations are important for a number of applications: natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided), computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry), parsing (so that preference can be given to parses with natural collocations), and corpus linguistic research.

2 Collocation algorithms:

a. The simplest method for ﬁnding collocations in a text corpus is counting. Pass the candidate phrases through a part-of-speech ﬁlter which only lets through those patterns that are likely to be “phrases”.

b. Mean and variance based methods work by looking at the pattern of varying distance between two words.

We also want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing. The options are t-test, chi-square, PMI, likelihood ratios, Poisson-Stirling measure and Jaccard index.

Likelihood ratios is an effective approach to hypothesis testing for extracting collocations. You can read more about it here.

Now, to some code to see how collocations can be extracted. Collocation processing modules are available in Apache Mahout, NLTK and Stanford NLP. In this article, we will use NLTK library written in Python to extract bigram collocations from a set of documents.


'''
@author: Nishant
'''

import nltk
from nltk.collocations import *

from com.nishant import DataLoader

if __name__ == '__main__':

    tokens = []
    stopwords = nltk.corpus.stopwords.words('english')

    '''Load multiple documents and return it as a list. 
       Provide your implementation here'''
    data = DataLoader.load("data.xml")

    print 'Total documents loaded: %d' % (data.len)

    ''' Apply stop words '''
    for d in data:
        data_tokens = nltk.wordpunct_tokenize(d)
        filtered_tokens = [w for w in data_tokens if w.lower() not in stopwords]
        tokens = tokens + filtered_tokens

    print 'Total tokens loaded: %d' % (tokens.len)
    print 'Calculating Collocations'

    ''' Extract only bigrams within a window of 5. 
        Implementation for trigram also available. NLTK
        has utility functions for frequency, ngram and word filter''' 
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens, 5)
    finder.apply_freq_filter(5)

    ''' Return the 1000 n-grams with the highest likelihood ratio.
        You can also use PMI or other methods and compare results '''
    print 'Printing Top 1000 Collocations''
    print finder.nbest(bigram_measures.likelihood_ratio, 1000)


Output >>
('side', 'neck'), ('stomach', 'pain'), ('doctor', 'suggested'), ('loss', 'appetite') .....

NLTK documentation for collocations are available here.
Must read on this topic - http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

Monday, April 21, 2014

The sweetness of developing REST services using Dropwizard

Jersey is my goto software for developing REST services and then I came across Dropwizard. It is a simple and neat framework written on top of Jersey and glues together all essential libraries for creating production ready services.

Before externalizing a web service, it must be operationally ready to take real world traffic and provide HA. So many engineers end up writing health checks, enabling the required logs and metrics metrics etc. All of these features are available out of the box in Dropwizard. Furthermore, it has nice features such as HTTP client for invoking other services, authentication, integration with Hibernate and DI.

So, while the documentation is straight forward and 'Getting Started' guide does get you started, integration with Spring and JPA via Hibernate requires some work. This article will help you with that assuming you have a working service written using Dropwizard.

1. Wiring up Spring and Dropwizard. HelloWorldApplication is the entry point to the Dropwizard application. The run method is where we initialize Spring.



   @Override
   public void run(HelloWorldConfiguration configuration,
   Environment environment) {

  //init Spring context
        //before we init the app context, we have to create a parent context with all the config objects others rely on to get initialized
        AnnotationConfigWebApplicationContext parent = new AnnotationConfigWebApplicationContext();
        AnnotationConfigWebApplicationContext ctx = new AnnotationConfigWebApplicationContext();

        parent.refresh();
        parent.getBeanFactory().registerSingleton("configuration", configuration);
        parent.registerShutdownHook();
        parent.start();

        //the real main app context has a link to the parent context
        ctx.setParent(parent);
        ctx.register(MyAppSpringConfiguration.class);
        ctx.refresh();
        ctx.registerShutdownHook();
        ctx.start();

        //now that Spring is started, let's get all the beans that matter into DropWizard

        //health checks
        Map healthChecks = ctx.getBeansOfType(HealthCheck.class);
        for(Map.Entry entry : healthChecks.entrySet()) {
            environment.healthChecks().register("template", entry.getValue());
        }

        //resources
        Map resources = ctx.getBeansWithAnnotation(Path.class);
        for(Map.Entry entry : resources.entrySet()) {
            environment.jersey().register(entry.getValue());
        }

        //last, but not least,let's link Spring to the embedded Jetty in Dropwizard
        environment.servlets().addServletListeners(new SpringContextLoaderListener(ctx));
}

It is a standard way to initialize Spring. Two important points to note here. One, no need to register resources explicitly. Spring will look for classes annotated with @PATH and register it with Dropwizard. Second is to include MyAppSpringConfiguration which provides additional resources to be included.

2. Definition for MyAppSpringConfiguration is below:



/**
Main Spring Configuration
 */
@Configuration
@ImportResource({ "classpath:spring/applicationContext.xml", "classpath:spring/dao.xml" })
@ComponentScan(basePackages = {"com.nishant.example"})
public class MyAppSpringConfiguration {

}

3. Next, annotate the resource class with @Service annotation and we are ready!


@Service
@Path("/hello-world")
@Produces(MediaType.APPLICATION_JSON)
public class HelloWorldResource {

 @Autowired
 private HelloWorldConfiguration configuration;

    private final AtomicLong counter = new AtomicLong();

    @GET
    @Timed
    public Saying sayHello(@QueryParam("name") Optional name) {

        final String value = String.format(configuration.getTemplate(), name.or(configuration.getDefaultName()));
        return new Saying(counter.incrementAndGet(), value);
    }
}

For complete working example, see https://github.com/nchandra/SampleDropwizardService

Tuesday, September 24, 2013

Hibernate Search based Autocomplete Suggester

In this article, I will show how to implement auto-completion using Hibernate Search.

The same can be achieved using Solr or ElasticSearch. But I decided to use Hibernate Search as its the simplest to get started with, easily integrates with an existing application and leverages the same core - Lucene. And we get all of this without the overhead of managing Solr/ElasticSearch cluster. In all, I found Hibernate Search to be the go-to search engine for simple use cases.

For our use case, we build a product title based auto-completion where often, the user queries are searches for product title. While typing, users should immediately see titles matching their requests, and Hibernate Search should do the hard work to filter the relevant documents in near real-time.

Lets have the following JPA annotated Product entity class.


public class Product {

  @Id
 @Column(name = "sku")
 private String sku;

  @Column(name = "upc")
 private String upc;

  @Column(name = "title")
 private String title;

....
}

We are interested in returning suggestions based on the 'title' field. Title will be indexed based on 2 strategies - N-Gram and Edge N-Gram.

Edge N-Gram - This will match only from the left edge of the suggestion text. For this we use KeywordTokenizerFactory (emits the entire input as a single token) and EdgeNGramFilterFactory along with some regex cleansing.

N-Gram matches from the start of every word, so that you can get right-truncated suggestions for any word in the text, not only from the first word. The main difference from N-gram is the tokenizer which is StandardTokenizerFactory along with NGramFilterFactory.

Using these strategies, if the document field is "A brown fox" and the query is
a) "A bro"- Will match
b) "bro" - Will match

Implementation: In the entity defined above, we can map 'title' property twice with the above strategies. Below are the annotations to instruct Hibernate to index 'title' twice.



@Entity
@Table(name = "item_master")
@Indexed(index = "Products")
@AnalyzerDefs({

@AnalyzerDef(name = "autocompleteEdgeAnalyzer",

// Split input into tokens according to tokenizer
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),

filters = {
 // Normalize token text to lowercase, as the user is unlikely to
 // care about casing when searching for matches
 @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
   @Parameter(name = "pattern",value = "([^a-zA-Z0-9\\.])"),
   @Parameter(name = "replacement", value = " "),
   @Parameter(name = "replace", value = "all") }),
 @TokenFilterDef(factory = LowerCaseFilterFactory.class),
 @TokenFilterDef(factory = StopFilterFactory.class),
 // Index partial words starting at the front, so we can provide
 // Autocomplete functionality
 @TokenFilterDef(factory = EdgeNGramFilterFactory.class, params = {
   @Parameter(name = "minGramSize", value = "3"),
   @Parameter(name = "maxGramSize", value = "50") }) }),

@AnalyzerDef(name = "autocompleteNGramAnalyzer",

// Split input into tokens according to tokenizer
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),

filters = {
 // Normalize token text to lowercase, as the user is unlikely to
 // care about casing when searching for matches
 @TokenFilterDef(factory = WordDelimiterFilterFactory.class),
 @TokenFilterDef(factory = LowerCaseFilterFactory.class),
 @TokenFilterDef(factory = NGramFilterFactory.class, params = {
   @Parameter(name = "minGramSize", value = "3"),
   @Parameter(name = "maxGramSize", value = "5") }),
 @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
   @Parameter(name = "pattern",value = "([^a-zA-Z0-9\\.])"),
   @Parameter(name = "replacement", value = " "),
   @Parameter(name = "replace", value = "all") })
}),

@AnalyzerDef(name = "standardAnalyzer",

// Split input into tokens according to tokenizer
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),

filters = {
 // Normalize token text to lowercase, as the user is unlikely to
 // care about casing when searching for matches
 @TokenFilterDef(factory = WordDelimiterFilterFactory.class),
 @TokenFilterDef(factory = LowerCaseFilterFactory.class),
 @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
   @Parameter(name = "pattern", value = "([^a-zA-Z0-9\\.])"),
   @Parameter(name = "replacement", value = " "),
   @Parameter(name = "replace", value = "all") })
}) // Def
})
public class Product {

....
}

Explanation: 2 custom analyzers - autocompleteEdgeAnalyzer and autocompleteNGramAnalyzer have been defined as per theory in the previous section. Next, we apply these analyzers on the 'title' field to create 2 different indexes. Here is how we do it:



@Column(name = "title")
@Fields({
  @Field(name = "title", index = Index.YES, store = Store.YES,
analyze = Analyze.YES, analyzer = @Analyzer(definition = "standardAnalyzer")),
  @Field(name = "edgeNGramTitle", index = Index.YES, store = Store.NO,
analyze = Analyze.YES, analyzer = @Analyzer(definition = "autocompleteEdgeAnalyzer")),
  @Field(name = "nGramTitle", index = Index.YES, store = Store.NO,
analyze = Analyze.YES, analyzer = @Analyzer(definition = "autocompleteNGramAnalyzer"))
})
private String title;

Start indexing:


 public void index() throws InterruptedException {
  getFullTextSession().createIndexer().startAndWait();
 }

Once indexed, inspect the index using Luke and you should be able to see title analyzed and stored as N-Grams and Edge N-Grams.

Search Query:


 private static final String TITLE_EDGE_NGRAM_INDEX = "edgeNGramTitle";
 private static final String TITLE_NGRAM_INDEX = "nGramTitle";

 @Transactional(readOnly = true)
 public synchronized List getSuggestions(final String searchTerm) {

 QueryBuilder titleQB = getFullTextSession().getSearchFactory()
   .buildQueryBuilder().forEntity(Product.class).get();

 Query query = titleQB.phrase().withSlop(2).onField(TITLE_NGRAM_INDEX)
   .andField(TITLE_EDGE_NGRAM_INDEX).boostedTo(5)
   .sentence(searchTerm.toLowerCase()).createQuery();

 FullTextQuery fullTextQuery = getFullTextSession().createFullTextQuery(
    query, Product.class);
 fullTextQuery.setMaxResults(20);

 @SuppressWarnings("unchecked")
 List<product> results = fullTextQuery.list();
 return results;
}

And we have a working suggester.

What next? Expose the functionality via a REST API and integrate it with jQuery, examples of which can be easily found.

You can also use the same strategies with Solr and ElasticSearch.

Friday, June 28, 2013

The story of 'Big Data'

Bob, a dairy farm owner, was unhappy. He owned many cows but could not milk them efficiently. Business was good though.

One morning, his neighbor, an owner of fodder factory, came and told him that he had found a way to efficiently milk the cows by letting them sleep for 24 hours and then milking them. Bob ran towards his farm, shouting "big cow, big cow"...

Bob and his neighbor were very happy. Bob got more cows to earn more profit. Few months passed.

One fine morning, his neighbor came and told him that he had found a way to milk the cows round the clock, without going through 24 hours of sleep period - by giving a rapid burst of low voltage electric shock to cows. Bob ran towards his farm, shouting "real-time big cow, real-time big cow"...

Now we have big cows, real-time big cows and fodder factories everywhere.

Wednesday, June 12, 2013

Graph Analytics: Discovering the undiscovered!

I was researching graph analytics and it's probable applications. Graph analysis and big data are overlapping areas and then I came across this piece of text which beautifully summarizes the difficulty of discovering the unknown.

"Many Big Data problems are about searching for things you know that you want to find. It's challenging because the volumes of data make it like searching for a needle in a haystack. But that’s easy because a needle and a piece of hay, though similar, do not look exactly alike.

However, discovering problems are about finding what you don't know. Imagine trying to find a needle in a stack of needles. That's even harder. How can you find the right needle if you don't know what it looks like? How can you discover something new if you don't know what you're looking for?

In order to find the unknown, you often have to know the right question to ask. It takes time and effort to ask every question and you keep learning as you continue to ask questions ....."

This happens to be true for any evolving technology that solves some existing problem in a novel way, and due to its sheer sophistication, opens door for us to ask for more, to be inquisitive and finally discover.

Original source: YarcData

Monday, June 10, 2013

Architecture: Cloud based Log Management Service

This article discusses the architecture of a cloud-based log management service. The LaaS (Logging-as-a-Service), code named ‘Trolly’, will provide log management and data analysis. We will cover technical requirements, high level architecture, data flow and technology selection based on open source software.

1.0 Modelling

The proposed architecture is a layered architecture, and each layer has components. Within layers, pipeline and event-driven architecture are employed.

1.1 Layered architecture

Multi-layered software architecture is a software architecture that uses many layers for allocating the responsibilities.

1.2 Pipeline architecture

A pipeline consists of a chain of processing elements (processes, threads etc.), arranged so that the output of each element is the input of the next.

A pipe is a message queue. A message can be anything. A filter is a process, thread, or other component that perpetually reads messages from an input pipe, one at a time, processes each message, then writes the result to an output pipe. Thus, it is possible to form pipelines of filters connected by pipes.

1.3 Event driven architecture

Event-driven architecture (EDA) is a software architecture pattern promoting the production, detection, consumption of, and reaction to events.

An event can be defined as "a significant change in state". For this system, a line in a log file represents an event.

2. Motivation

Layered architecture helps in:

· Clear separation of responsibilities – search, alerting and visualization.

· Ability to replace one or several layers implementation with minimum effort and side effects.

An event driven architecture is extremely loosely coupled and well distributed. Event driven architectures have loose coupling within space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows.

3. Requirements

3.1 In scope

· Data Visualization

· Search

· Custom Alerts

3.2 Technical requirements

· Near-real time processing

· Multi-tenant – Support multiple clients

· High availability

· Scalable – The system must scale for following reasons:

o New clients sign up

o Existing clients upgrade

3.3 Not covered in this article

· Client Registration

· Billing, Reporting and Archival

· User Management

· Third party integration

- Self-service – Automatic provisioning of hardware and software

· Multi-data center deployment and redundancy

· Client and Service geographic proximity

4.0 Trolly Architecture

Trolly High Level Architecture

Trolly has the following components:

i. Data collection

a. Adaptors for accepting input log streams

ii. Data access

a. Portal

b. API gateway

iii. Data pre-processing

a. Core processor

iv. Alerting

a. Complex event processing engine

v. Search service

a. Distributed indexer and search

vi. Log storage

a. Distributed data store

Each component is explained in the following sections.

Data Flow and Components Diagram

5. Concepts

5.1.1 Definitions

Client: Users of Trolly and uniquely identified by a Client Id.

Source: Log source e.g. Apache HTTP logs. Client can configure multiple sources.

Stream: A log stream sent by a client and originating from a pre-defined source. Stream is keyed by Stream Id. Stream uniquely maps to a source.

5.1.2 Data Collection

The API gateway will accept logs from multiple end points like rsyslog, REST over HTTP/HTTPS etc. It will authenticate client and queue the log event for further preprocessing.

RabbitMQ will be used for messaging. A cluster setup with mirrored queues provides high availability.

5.1.3 Data pre-processing

The core processor cleans data, applies regex based on the input format and converts it to JSON format. Each processed log event is forked to three different locations - CEP engine, indexing engine and data store.

The JSON file is:

i. Sent to CEP engine for alerting.

ii. Sent to indexing service.

iii. Stored in a distributed datastore.

5.1.4 Custom Alerts

Here we choose software that provides low-latency and high-throughput complex event processing. The software must support HA configuration. Esper is a complex event processing library. It is stream oriented, supports HA and can process 100,000 - 500,000 events / sec at an average latency of 3 micro seconds and well suited for our requirements.

The system will create a new queue for every Stream defined by a client. This will act as a source for Esper. Custom user supplied alerts are converted to Espers DSL (EPL). The subscribers receive output events that is -

i. Persisted it in MySQL database and

ii. Published to 3^rd party integration end-points e.g. SMS gateway.

5.1.5 Indexing and Search

The design must support near-real time indexing, high availability (HA) and multi-tenancy. To address scalability and HA, we use a distributed cluster based design. ElasticSearch is a good fit for above requirements.

ElasticSearch supports multi-tenancy. Multiple indexes can be stored in a single ElasticSearch server, and data of multiple indexes can be searched with a single query.

5.1.5.1 Index Setup

The system will use indices based on time. For example, the index name can be the date in a format like <Client Id>-DD-MM-YYYY. Client Id is a unique identifier assigned to every client. The timeframe depends on how long the client intends to keep logs.

For storing a week worth of logs, system will maintain one index per day and for a year worth of logs, system will maintain one index per month. This will help in narrowing searches to the relevant indices. Often the searches will be for recent logs, and a “quick search” option in GUI can only looks in the last index.

Further, the BookKeeping module will rotate and optimize logs as well as delete old indexes.

5.1.5.2 High Availability

Each index will be divided into 5 or more shrads. Each shrad will have one replica. Replicas will help the cluster continue running when one or more nodes go down. It will also be used for searching.

5.1.5.3 Performance considerations

· Empirically determine number of shrads and replicas

· Refresh indexes asynchronously and automatically every few seconds

· Enable bulk indexing


Proposed Elastic Search Topology

5.1.6 Log Analysis and Storage

If the log volume is low, the original log line can be compressed and stored in the search index. But this seems unlikely and the raw logs will be written to a distributed highly-available data store.

5.1.6.1 Log Analysis Software Stack

Hadoop is an open source framework that allows users to process very large data in parallel. The Hadoop core is mainly divided into two modules:

o HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.

o Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bonded with HDFS.

o The database will be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets - HBase can save very large tables having million of rows. It's a distributed database and can also keep multiple versions of a single row.

o The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language.

5.1.6.2 Data Flow

As discussed above, raw log lines are processed and stored as JSON in HBase. JSON will serve as a common input format for all visualization tools.

5.1.7 Data Access

5.1.7.1 Reporting and Visualization

For basic reporting, Google chart or High Charts are a good fit. For data visualization, D3 or InfoViz toolkit can be used.

o For analyzing recent log events, HBase will be directly queried. For advanced analysis, HBase will have the data processed by Pig scripts ready for reporting and further slicing and dicing can be performed.

o The data-access Web service is a REST-based service that eases the access and integrations with data clients.

6. Summary

The proposed architecture attempts to show how large amounts of unstructured log events can be easily analyzed using the Apache Hadoop, ElasticSearch and Esper framework.

Feel free to comment and share your thoughts about the proposed architecture.

Download Android App