This article discusses the architecture of a cloud-based log management service. The LaaS (Logging-as-a-Service), code-named ‘Trolly’, will provide log management and data analysis. We will cover technical requirements, high-level architecture, data flow and technology selection based on open source software.
1. Modelling
The proposed architecture is a layered architecture in which each layer is composed of components. Within the layers, pipeline and event-driven architectures are employed.
1.1 Layered architecture
A multi-layered software architecture uses multiple layers to allocate responsibilities.
1.2 Pipeline architecture
A pipeline consists of a chain of processing elements (processes, threads, etc.), arranged so that the output of each element is the input of the next. A pipe is a message queue, and a message can be anything. A filter is a process, thread, or other component that perpetually reads messages from an input pipe, one at a time, processes each message, then writes the result to an output pipe. Thus, it is possible to form pipelines of filters connected by pipes.
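To make the pattern concrete, here is a minimal pipe-and-filter sketch in Java. It is illustrative only; the queue and class names are hypothetical. Blocking queues act as pipes and a dedicated thread acts as a filter:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PipelineSketch {
        public static void main(String[] args) throws InterruptedException {
            // Pipes: bounded in-memory message queues.
            BlockingQueue<String> rawPipe = new LinkedBlockingQueue<>(1024);
            BlockingQueue<String> cleanPipe = new LinkedBlockingQueue<>(1024);

            // Filter: perpetually reads one message from the input pipe,
            // processes it, and writes the result to the output pipe.
            Thread cleaner = new Thread(() -> {
                try {
                    while (true) {
                        String line = rawPipe.take();
                        cleanPipe.put(line.trim());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            cleaner.setDaemon(true);
            cleaner.start();

            rawPipe.put("  2015-01-07 GET /index.html 200  ");
            System.out.println("[" + cleanPipe.take() + "]");
        }
    }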
1.3 Event-driven architecture
Event-driven architecture (EDA) is a software architecture
pattern promoting the production, detection, consumption of, and reaction to
events.
An event can be defined as "a significant change in
state". For this system, a line in a log file represents an event.
2. Motivation
A layered architecture helps in:
· Clear separation of responsibilities – search, alerting and visualization.
· The ability to replace the implementation of one or several layers with minimal effort and side effects.
An event-driven architecture is extremely loosely coupled and well distributed. Event-driven architectures are loosely coupled in space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows.
3. Requirements
3.1 In scope
· Data Visualization
· Search
· Custom Alerts
3.2 Technical requirements
· Near-real-time processing
· Multi-tenant – support multiple clients
· High availability
· Scalable – the system must scale for the following reasons:
o New clients sign up
o Existing clients upgrade
3.3 Not covered in this article
· Client Registration
· Billing, Reporting and Archival
· User Management
· Third party integration
· Self-service – automatic provisioning of hardware and software
· Multi-data center deployment and redundancy
· Client and Service geographic proximity
4. Trolly Architecture
[Figure: Trolly High Level Architecture]
Trolly has the following components:
i. Data collection
a. Adaptors for accepting input log streams
ii. Data access
a. Portal
b. API gateway
iii. Data pre-processing
a. Core processor
iv. Alerting
a. Complex event processing engine
v. Search service
a. Distributed indexer and search
vi. Log storage
a. Distributed data store
Each component is explained in the following sections.
[Figure: Data Flow and Components Diagram]
5. Concepts
5.1.1 Definitions
Client: A user of Trolly, uniquely identified by a Client Id.
Source: A log source, e.g. Apache HTTP logs. A client can configure multiple sources.
Stream: A log stream sent by a client and originating from a pre-defined source. A stream is keyed by a Stream Id and uniquely maps to a source.
5.1.2 Data Collection
The API gateway will accept logs from multiple endpoints such as rsyslog, REST over HTTP/HTTPS, etc. It will authenticate the client and queue the log event for further preprocessing. RabbitMQ will be used for messaging; a cluster setup with mirrored queues provides high availability.
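A minimal sketch of how the gateway could queue a log event using the RabbitMQ Java client (the broker host and queue name are assumptions; queue mirroring itself is configured on the broker side):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class GatewayPublisher {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.internal");   // assumed broker host
            Connection conn = factory.newConnection();
            Channel channel = conn.createChannel();

            // Durable queue so messages survive a broker restart.
            channel.queueDeclare("trolly.ingest", true, false, false, null);

            String logEvent =
                "127.0.0.1 - - [07/Jan/2015:13:55:36 -0700] \"GET / HTTP/1.1\" 200 512";
            channel.basicPublish("", "trolly.ingest", null, logEvent.getBytes("UTF-8"));

            conn.close();
        }
    }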
5.1.3 Data pre-processing
The core processor cleans the data, applies regexes based on the input format, and converts each log event to JSON. Each processed log event is then forked to three different locations. The JSON document is:
i. Sent to the CEP engine for alerting.
ii. Sent to the indexing service.
iii. Stored in a distributed datastore.
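A simplified sketch of the pre-processing step, assuming the input format is the Apache common log format (the JSON field names are assumptions; a production system would use a JSON library instead of string formatting):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CoreProcessor {
        // Apache common log format: host ident user [time] "request" status bytes
        private static final Pattern COMMON_LOG = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

        public static String toJson(String line) {
            Matcher m = COMMON_LOG.matcher(line.trim());
            if (!m.matches()) {
                return null; // unparseable lines could go to a dead-letter queue
            }
            // Minimal hand-rolled JSON for illustration only.
            return String.format(
                "{\"host\":\"%s\",\"timestamp\":\"%s\",\"request\":\"%s\"," +
                "\"status\":%s,\"bytes\":\"%s\"}",
                m.group(1), m.group(4), m.group(5), m.group(6), m.group(7));
        }

        public static void main(String[] args) {
            System.out.println(toJson(
                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] " +
                "\"GET /index.html HTTP/1.0\" 200 2326"));
        }
    }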
5.1.4 Custom Alerts
Here we choose software that provides low-latency, high-throughput complex event processing; it must also support an HA configuration. Esper is a complex event processing library. It is stream oriented, supports HA, and can process 100,000 - 500,000 events/sec at an average latency of 3 microseconds, which makes it well suited to our requirements.
The system will create a new queue for every Stream defined by a client; this queue will act as a source for Esper. Custom, user-supplied alerts are converted to Esper's DSL (EPL). Subscribers receive output events that are:
i. Persisted in a MySQL database, and
ii. Published to 3rd-party integration endpoints, e.g. an SMS gateway.
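As a sketch of the alerting step, here is an Esper (5.x-style API) engine with an illustrative EPL statement; the event type and alert rule are assumptions:

    import com.espertech.esper.client.Configuration;
    import com.espertech.esper.client.EPServiceProvider;
    import com.espertech.esper.client.EPServiceProviderManager;
    import com.espertech.esper.client.EPStatement;

    public class AlertEngine {
        public static class LogEvent {
            private final int status;
            public LogEvent(int status) { this.status = status; }
            public int getStatus() { return status; }
        }

        public static void main(String[] args) {
            Configuration config = new Configuration();
            config.addEventType("LogEvent", LogEvent.class);
            EPServiceProvider engine =
                EPServiceProviderManager.getDefaultProvider(config);

            // Example user alert compiled to EPL: fire when more than 10
            // HTTP 5xx responses are seen within a 60-second window.
            EPStatement stmt = engine.getEPAdministrator().createEPL(
                "select count(*) as errors from LogEvent(status >= 500)" +
                ".win:time(60 sec) having count(*) > 10");
            stmt.addListener((newEvents, oldEvents) -> {
                // Here the output event would be persisted to MySQL and
                // pushed to 3rd-party endpoints such as an SMS gateway.
                System.out.println("ALERT: " + newEvents[0].get("errors"));
            });

            engine.getEPRuntime().sendEvent(new LogEvent(502));
        }
    }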
5.1.5 Indexing and Search
The design must support near-real-time indexing, high availability (HA) and multi-tenancy. To address scalability and HA, we use a distributed, cluster-based design. ElasticSearch is a good fit for the above requirements. ElasticSearch supports multi-tenancy: multiple indexes can be stored in a single ElasticSearch server, and data from multiple indexes can be searched with a single query.
5.1.5.1 Index Setup
The system will use time-based indices. For example, the index name can include the date in a format like <Client Id>-DD-MM-YYYY, where Client Id is a unique identifier assigned to every client. The timeframe depends on how long the client intends to keep logs: to store a week's worth of logs, the system will maintain one index per day; for a year's worth, one index per month. This helps narrow searches to the relevant indices. Searches will often be for recent logs, and a “quick search” option in the GUI can look only in the last index.
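A small sketch of the index-naming scheme described above (the client id is hypothetical):

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class IndexNames {
        // Daily indices for short retention, monthly for long retention.
        static String dailyIndex(String clientId, LocalDate date) {
            return clientId + "-" + date.format(DateTimeFormatter.ofPattern("dd-MM-yyyy"));
        }

        static String monthlyIndex(String clientId, LocalDate date) {
            return clientId + "-" + date.format(DateTimeFormatter.ofPattern("MM-yyyy"));
        }

        public static void main(String[] args) {
            LocalDate d = LocalDate.of(2015, 1, 7);
            System.out.println(dailyIndex("acme42", d));   // acme42-07-01-2015
            System.out.println(monthlyIndex("acme42", d)); // acme42-01-2015
        }
    }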
Further, the BookKeeping module will rotate and optimize indices as well as delete old ones.
5.1.5.2 High Availability
Each index will be divided into five or more shards, and each shard will have one replica. Replicas help the cluster continue running when one or more nodes go down, and they are also used for searching.
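Shard and replica counts are fixed at index-creation time. A minimal sketch using ElasticSearch's index-creation REST API over plain Java HTTP (the node address and counts are assumptions):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CreateIndex {
        public static void main(String[] args) throws Exception {
            // Assumed local node; in production this would target the cluster.
            URL url = new URL("http://localhost:9200/acme42-07-01-2015");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            String settings =
                "{\"settings\":{\"number_of_shards\":5,\"number_of_replicas\":1}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(settings.getBytes("UTF-8"));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }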
5.1.5.3 Performance considerations
· Empirically determine the number of shards and replicas
· Refresh indexes asynchronously and automatically every few seconds
· Enable bulk indexing
[Figure: Proposed Elastic Search Topology]
5.1.6 Log Analysis and Storage
If the log volume is low, the original log line can be compressed and stored in the search index. But this seems unlikely, so the raw logs will be written to a distributed, highly available data store.
5.1.6.1 Log Analysis Software Stack
Hadoop is an open source framework that allows users to process very large data sets in parallel. The Hadoop core is mainly divided into two modules:
o Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bundled with HDFS. (A minimal mapper sketch follows this list.)
o The database will be a NoSQL database such as HBase. The advantage of a NoSQL database is that it also provides scalability for the reporting module, as we can keep historical processed data for reporting purposes. HBase is an open source columnar (NoSQL) database that uses HDFS and can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets - HBase can store very large tables with millions of rows. It is a distributed database and can also keep multiple versions of a single row.
o The Pig framework is an open source platform for analyzing large data sets, implemented as a layered language over the Hadoop Map-Reduce framework. It eases the work of developers: Map-Reduce code must otherwise be written in Java, whereas Pig lets users write it in a scripting language.
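Here is the minimal mapper sketch referenced above: a hypothetical job that counts log events per HTTP status code (the field position assumes the common log format):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (statusCode, 1) for each log line; a summing reducer
    // (e.g. IntSumReducer) would produce counts per status code.
    public class StatusCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text status = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(" ");
            if (fields.length > 8) {       // common log format: status is field 9
                status.set(fields[8]);
                context.write(status, ONE);
            }
        }
    }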
5.1.6.2 Data Flow
As discussed above, raw log lines are processed and stored as JSON in HBase. JSON
will serve as a common input format for all visualization tools.
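A sketch of the HBase write (the table name, column family and row-key layout are assumptions; prefixing the row key with the client id keeps a client's events contiguous):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LogStore {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("trolly_logs"))) {
                // Row key: <clientId>#<epochMillis> (assumed layout).
                String rowKey = "acme42#" + System.currentTimeMillis();
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("d"),        // column family
                              Bytes.toBytes("json"),     // qualifier
                              Bytes.toBytes("{\"status\":200}"));
                table.put(put);
            }
        }
    }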
5.1.7 Data Access
5.1.7.1 Reporting and Visualization
For basic reporting, Google Charts or Highcharts are a good fit. For data visualization, D3 or the InfoVis Toolkit can be used.
o For analyzing recent log events, HBase will be queried directly. For advanced analysis, HBase will hold data already processed by Pig scripts and ready for reporting, and further slicing and dicing can be performed.
o The data-access Web service is a REST-based service that eases access and integration with data clients.
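A sketch of what a data-access endpoint could look like with JAX-RS (the path, parameters and response shape are assumptions):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.MediaType;

    @Path("/clients/{clientId}/logs")
    public class LogResource {

        // GET /clients/acme42/logs?q=status:500&limit=100
        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public String search(@PathParam("clientId") String clientId,
                             @QueryParam("q") String query,
                             @QueryParam("limit") int limit) {
            // In a real implementation this would query ElasticSearch/HBase
            // and stream back matching JSON log events.
            return "{\"clientId\":\"" + clientId + "\",\"hits\":[]}";
        }
    }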
6. Summary
The proposed architecture attempts to show how large amounts of unstructured log events can be easily analyzed using the Apache Hadoop, ElasticSearch and Esper frameworks.
Feel free to comment and share your thoughts about the proposed architecture.