Log Management System - Phase 1: Planning & Specification

Overview

This is the first post of a four-part series documenting our Log Management System (LMS) evaluation project. Phase 1 covers everything that happened before we touched a single server — the requirements, the tool candidates, the evaluation framework, and the timeline. Posts 2–4 will cover the actual setup, testing, and final results.

The Problem

The starting condition was straightforward: log management was broken, and we knew it.

Every server was managing its own logs independently. No central collection, no standardized format, no automated backup. When something went wrong, investigating it meant SSHing into servers one by one and grepping manually. We had real cases where logs were just gone by the time we needed them — rotated out, disk issues, whatever — and that made incident response a lot harder than it needed to be.

On top of the operational pain, there’s a compliance angle. Our environment is expected to align with ISO/IEC 27001, ISO/IEC 27035, NIST SP 800-92, and local government regulations on information security (specifically BSSN Regulation No. 4/2021). None of those frameworks are particularly lenient about log gaps.

So the project goal was clear: build a centralized LMS that actually works. Phase 1 was about figuring out what “actually works” means before committing to any tool.

Current State

Before writing requirements, we mapped what we were actually dealing with.

The environment runs Debian and CentOS servers spread across on-premise infrastructure and a national data center. Web traffic goes through Apache and NGINX. The network perimeter is a Palo Alto firewall. We already had Wazuh deployed as a baseline SIEM, which was good — whatever we build needs to play nicely with it.

The problems were:

No centralized log collection. Correlating events across servers required direct access to each one.
No backup mechanism. If a disk died or a log rotated too aggressively, that data was gone.
Audit gaps. During incident response, the absence of reliable log history slowed everything down.

The goal wasn’t to rip everything out. It was to build a proper collection and analysis layer on top of what we had.

System Requirements

Rather than going straight to tool selection, we formalized requirements first. This ended up being the most important output of Phase 1: with clear requirements in hand, the evaluation criteria we defined later were much easier to justify.

The functional requirements we landed on:

F-01 — Centralized Log Collection. Receive logs from servers, applications, and network devices via agent or syslog. Everything into one place.

F-02 — Log Normalization. Convert incoming logs to a uniform format (JSON or GELF) so they’re consistently indexable regardless of source.

F-03 — Storage and Retention. Active retention minimum 90 days, archive minimum 1 year. That’s the floor for useful forensic and audit work.

F-04 — Search and Analysis. Fast search by keyword, source, timestamp, and severity. Not “fast on a test dataset” — fast at real scale.

F-05 — Visualization and Reporting. Dashboards, trend graphs, automated daily and monthly reports. The ops team should be able to understand system health at a glance.

F-06 — Alerting and Notification. Automatic alerts on defined conditions such as repeated failed logins, error spikes, or specific patterns. The system tells you about problems; you don’t go looking for them.

F-07 — Audit Trail and Access Control. All user activity logged, access gated by role. Non-negotiable for compliance.

F-08 — Log Agent Health Monitoring. The system needs to detect when an agent goes silent. An unmonitored agent is a blind spot you don’t know you have.
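F-06 and F-08 both come down to simple checks over event timestamps. Here is a minimal sketch of what those checks could look like. It is not how any of the candidate tools implement them. The function names, the (timestamp, source, outcome) event shape, and the thresholds (5 failures in 5 minutes, 10 minutes of silence) are all assumptions for illustration and would need tuning for a real environment.

```python
from collections import deque
from datetime import datetime, timedelta


def failed_login_alerts(events, threshold=5, window=timedelta(minutes=5)):
    """Yield (source, timestamp) when a source reaches `threshold` failed
    logins inside a sliding `window`.

    `events` is an iterable of (timestamp, source, outcome) tuples and is
    assumed to be sorted by timestamp (hypothetical event shape).
    """
    recent = {}  # source -> deque of failure timestamps still in the window
    for ts, source, outcome in events:
        if outcome != "failure":
            continue
        q = recent.setdefault(source, deque())
        q.append(ts)
        # Drop failures that have aged out of the sliding window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            yield source, ts
            q.clear()  # reset so one burst produces one alert


def silent_agents(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the agents whose latest log is older than `max_silence`.

    `last_seen` maps agent name -> timestamp of its most recent log.
    """
    return sorted(name for name, ts in last_seen.items() if now - ts > max_silence)


if __name__ == "__main__":
    now = datetime(2025, 10, 5, 12, 0, 0)
    # Hypothetical agents: one reporting normally, one that went quiet.
    seen = {"web-01": now - timedelta(minutes=2), "db-01": now - timedelta(minutes=45)}
    print(silent_agents(seen, now))
```

The point of the second function is that silence has to be checked actively: an agent that stops sending produces no event to alert on, so something has to compare "last seen" against the clock on a schedule.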

Tool Candidates

With requirements in place, we identified four platforms to evaluate in the PoC phase. The selection covers the spectrum from quick-to-deploy proprietary to maximum-flexibility open-source.

ManageEngine EventLog Analyzer (ELA)

ELA is a proprietary web-based log management platform. It handles syslog, SNMP, and Windows/Linux agents, ships with compliance report templates (ISO 27001, PCI-DSS, HIPAA, NIST), and includes real-time dashboards and event-based notifications out of the box.

The appeal is speed to value — relatively fast to get running, no stack assembly required. The downside: it’s proprietary, requires significant hardware (8 cores, 16 GB RAM, 500 GB+ storage), and the free tier has real limitations. It’s the “just works” option if budget isn’t a concern.

Graylog

Graylog is the open-source option with the most polished out-of-the-box experience. It runs on Elasticsearch (or OpenSearch in recent releases) for storage and MongoDB for metadata, supports Syslog and GELF natively, and has a genuinely usable web interface. It supports rule-based alerting, stream-based log routing, and LDAP integration, and it works with both Wazuh and Grafana, which makes it a natural fit for our existing stack.

The downside is that it inherits Elasticsearch’s operational complexity. Scaling it for high-volume ingestion requires careful tuning, and the hardware requirements end up similar to ELA.
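To make the normalization requirement (F-02) concrete, here is a rough sketch of turning one Apache/NGINX combined-format access log line into a GELF 1.1 payload. The `version`, `host`, `short_message`, `timestamp`, and `level` fields and the underscore prefix for custom fields follow the GELF payload format. The regex, the `to_gelf` helper, the custom field names, and the status-to-severity mapping are our own assumptions, not something Graylog prescribes.

```python
import json
import re
from datetime import datetime

# Leading part of the Apache/NGINX "combined" access log format.
COMBINED = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def to_gelf(line, host):
    """Convert one access log line to a GELF 1.1 dict, or None if it
    does not look like a combined-format line."""
    m = COMBINED.match(line)
    if m is None:
        return None
    status = int(m["status"])
    # %b expects English month abbreviations under the default C locale.
    ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    # ASSUMPTION: map HTTP status to syslog severity
    # (3 = error, 4 = warning, 6 = informational).
    level = 3 if status >= 500 else 4 if status >= 400 else 6
    return {
        "version": "1.1",
        "host": host,
        "short_message": f'{m["method"]} {m["path"]} -> {status}',
        "timestamp": ts.timestamp(),  # seconds since the Unix epoch
        "level": level,
        # Custom fields must start with an underscore in GELF.
        "_client_ip": m["client"],
        "_http_method": m["method"],
        "_http_path": m["path"],
        "_http_status": status,
    }


if __name__ == "__main__":
    sample = (
        '203.0.113.9 - - [05/Oct/2025:13:55:36 +0700] '
        '"GET /index.html HTTP/1.1" 200 612 "-" "curl/8.5.0"'
    )
    print(json.dumps(to_gelf(sample, "web-01"), indent=2))
```

Returning None for unparseable lines is deliberate in this sketch: a real pipeline would want to count or quarantine those rather than silently drop them, since parse failures are themselves a log gap.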

ELK Stack (Elasticsearch, Logstash, Kibana)

The industry reference implementation. Logstash handles ingestion and transformation, Elasticsearch handles search and indexing, Kibana handles visualization. Filebeat or Fluentd sit upstream as lightweight shippers.

ELK is the most capable option for complex high-volume environments. It’s also the most demanding — 8–16 cores, 16–32 GB RAM, 1 TB SSD recommended — and the hardest to configure and maintain. It’s the right answer if you have the resources and expertise to back it up.

Grafana Loki

Loki takes a different approach from the other three. Rather than full-text indexing every log line (which is what makes Elasticsearch so storage-heavy), it indexes only metadata labels and stores log content compressed. The result: much better storage efficiency and a lighter overall footprint.

The trade-off is that you lose indexed full-text search across log content. LogQL selects log streams by label first; anything that matches on the content itself (line filters, fields parsed at query time) is a scan over the selected streams rather than an index lookup. For observability use cases, especially if you’re already running Grafana for metrics, Loki is excellent. For deep forensic search across large volumes of unstructured log content, it’s limited.
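The indexing trade-off is easier to see in a toy model. The sketch below is not Loki’s implementation or API. `ToyLabelStore` and its methods are made-up names that only mimic the idea: label lookups go through an index, while anything involving line content has to read every line in the streams the labels selected.

```python
from collections import defaultdict


class ToyLabelStore:
    """Toy model of label-only indexing. NOT Loki's real implementation."""

    def __init__(self):
        self.streams = defaultdict(list)     # frozenset of labels -> log lines
        self.label_index = defaultdict(set)  # (key, value) -> stream ids
        self.lines_scanned = 0               # how many lines queries had to read

    def push(self, labels, line):
        stream_id = frozenset(labels.items())
        self.streams[stream_id].append(line)
        # Only the labels are indexed; the line content is stored as-is.
        for pair in stream_id:
            self.label_index[pair].add(stream_id)

    def query(self, selector, contains=None):
        # Step 1: resolve the label selector through the index.
        candidates = [self.label_index.get(pair, set()) for pair in selector.items()]
        stream_ids = set.intersection(*candidates) if candidates else set(self.streams)
        # Step 2: there is no content index, so every line in the selected
        # streams is scanned to apply the substring filter.
        hits = []
        for sid in stream_ids:
            for line in self.streams[sid]:
                self.lines_scanned += 1
                if contains is None or contains in line:
                    hits.append(line)
        return hits


if __name__ == "__main__":
    store = ToyLabelStore()
    store.push({"job": "nginx", "host": "web-01"}, "GET /login 401")
    store.push({"job": "nginx", "host": "web-01"}, "GET /index.html 200")
    store.push({"job": "sshd", "host": "db-01"}, "Failed password for root")
    print(store.query({"job": "nginx"}, contains="401"), store.lines_scanned)
```

In this model a narrow label selector keeps the scan small, and a broad one (or none at all) means reading everything. That is the shape of the cost we would expect to matter for forensic searches where we don’t know in advance which source to look at.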

Comparison at a Glance

Aspect	ELA	Graylog	ELK Stack	Grafana Loki
License	Proprietary	Open-source	Open-source	Open-source
Storage Backend	PostgreSQL	Elasticsearch	Elasticsearch	Object store
Query Language	SQL-like	Lucene	KQL	LogQL
Storage Efficiency	Medium	Medium	High	Very High
Alerting	Yes	Yes	Yes	Limited
Setup Complexity	Easy	Medium	Complex	Easy
Scalability	Medium	High	Very High	High
Resource Usage	Medium	Medium	High	Low

Evaluation Framework

Before running any tests, we defined exactly how we’d score the results. The framework uses six weighted categories totaling 100%:

Category	Weight	What We’re Measuring
A. Functional Fit	30%	Multi-source collection, search quality, normalization, dashboards, retention, alerting
B. Integration & Compatibility	20%	Debian/CentOS support, Palo Alto syslog ingestion, Apache/NGINX parsing, agent compatibility
C. Security & Auditability	15%	TLS encryption, RBAC, audit trail completeness, LDAP/SSO support
D. Performance & Scalability	15%	Indexing speed, query latency on large datasets, storage efficiency, CPU/RAM usage
E. Usability & Administration	10%	Interface clarity, setup complexity, documentation quality
F. Cost & Maintenance	10%	Licensing costs, operational overhead, long-term sustainability

Each parameter gets scored 1–5:

Score	Meaning
5	Fully meets the requirement
4	Meets most of it
3	Partially meets
2	Needs significant improvement
1	Does not meet

The final score is the weighted sum of the six category scores, A through F:

Final Score = (A × 0.30) + (B × 0.20) + (C × 0.15) + (D × 0.15) + (E × 0.10) + (F × 0.10)

Highest score gets the primary recommendation. The runner-up gets documented as an alternative.
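The formula and ranking rule above can be sketched in a few lines of Python. One thing the formula leaves open is how the 1–5 parameter scores roll up into a category score; the sketch assumes a plain average, which is our assumption here and not part of the framework as written. The tool names and scores in the usage example are placeholders, not PoC results.

```python
from statistics import mean

# Category weights from the evaluation framework table.
WEIGHTS = {"A": 0.30, "B": 0.20, "C": 0.15, "D": 0.15, "E": 0.10, "F": 0.10}


def category_score(parameter_scores):
    """ASSUMPTION: a category's score is the mean of its 1-5 parameter scores."""
    if not all(1 <= s <= 5 for s in parameter_scores):
        raise ValueError("parameter scores must be between 1 and 5")
    return mean(parameter_scores)


def final_score(categories):
    """Weighted sum over categories A-F, matching the formula above."""
    if set(categories) != set(WEIGHTS):
        raise ValueError("expected exactly the categories A-F")
    return sum(WEIGHTS[c] * category_score(p) for c, p in categories.items())


def rank(tools):
    """Return tool names ordered from highest to lowest final score."""
    return sorted(tools, key=lambda name: final_score(tools[name]), reverse=True)


if __name__ == "__main__":
    # Placeholder scores for two hypothetical tools, NOT evaluation results.
    tools = {
        "tool_x": {"A": [4, 4, 4], "B": [5, 5], "C": [3, 3], "D": [4, 4], "E": [2], "F": [5]},
        "tool_y": {"A": [3, 3, 3], "B": [3, 3], "C": [3, 3], "D": [3, 3], "E": [3], "F": [3]},
    }
    for name in rank(tools):
        print(name, round(final_score(tools[name]), 2))
```

Because every category score stays on the 1–5 scale and the weights sum to 1, the final score also lands between 1 and 5, which keeps results comparable across tools.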

The reason we spent time on this upfront: evaluation without predefined criteria tends to drift toward whichever tool the evaluator is most familiar with. Having weighted parameters forces the comparison to stay grounded in actual requirements.

Timeline

The full PoC runs from late September through early November 2025:

Phase	Period	Activity
1 — Scope Definition	Sep 19–25	Finalize asset inventory, log sources, and evaluation criteria
2 — Environment Setup	Sep 26–Oct 4	Provision test infrastructure, install tools, configure integrations
3 — Technical Testing	Oct 5–17	Run PoC scenarios: ingestion, retention, search, performance, security
4 — Results Documentation	Oct 18–25	Capture test data, note issues, record observations per tool
5 — Report Writing	Oct 26–Nov 1	Compile results into formal evaluation reports with weighted scores
6 — Recommendation	Nov 2–3	Deliver final recommendation to stakeholders

Each phase feeds directly into the next. The scope definition from Phase 1 drives the environment setup. The test results from Phase 3 get documented in Phase 4 and compiled in Phase 5. Nothing gets skipped.

What’s Next

Phase 1 closed with four documents: the requirements specification, the tool technical profiles, the evaluation parameter framework, and the PoC timeline. Those four documents are what everything else in this project is built on.

The next post covers Phase 2 — actually provisioning the test environments and getting all four platforms running. That’s where things get less clean.

Part 1 of 4. Parts 2–4 cover environment setup, integration and security testing, and the final evaluation results.