Why? What is that good for?
Saving license costs and moving the ingest-pipeline load somewhere else! :smiley:
At my workplace we currently parse the events mostly with Logstash, but configuring all those grok filters so that the messages are ECS-conformant is a lot of work. Yes, I know it is possible to convert the ingest pipelines that ship with Beats into Logstash-compatible filters, but a few things will not work (for more info have a look at https://www.elastic.co/guide/en/logstash/current/ingest-converter.html).
The idea behind doing the parsing with Logstash was to keep the parsing load away from the Elasticsearch cluster. But things evolve, and our users want to use more and more features that require the data to be stored in the ECS schema. That is hardly manageable with Logstash in the long term :pensive:.
So I thought: why not use a dedicated "Ingest-Cluster" into which we simply send all the wonderful logs via the different Beats? The Beats ship with the necessary ingest pipelines, and for this we can even use the basic license. The "Ingest-Cluster" would keep the load away from the expensively licensed one, and we would not need additional licenses for extra ingest nodes.
The only caveat is that the events arrive in the final Elasticsearch cluster with a delay of about one minute.
Time to set up a PoC
Prerequisites
For that we need two VMs with Docker and docker-compose. I quickly set them up with Fedora Server 34, each with about 30 GB of disk space and 8 GB RAM. The first machine in this HowTo has the IP 192.168.56.107, the second one 192.168.56.108. On both machines a single-node Elasticsearch cluster and Kibana are installed and started. Both Elastic services listen on all interfaces, and for testing no credentials or other security features are enabled.
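In case you want to run those two single-node setups with Docker as well, a minimal sketch per VM could look like the following (image versions, heap size and ports are my assumptions; security stays disabled as described above – this is not necessarily how I installed mine):

version: '3.8'
services:

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.14.0
    container_name: elasticsearch
    environment:
      # single-node cluster, no security for this PoC
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.14.0
    container_name: kibana
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    ports:
      - "5601:5601"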
The plan looks like the following (sorry for the ASCII art ;)):
logs --- sent via beats ---> Logstash ---> Elasticsearch*
                                                 |
                                                 |
                                                 |
                      Elasticsearch** <--- Logstash
* doing the ingest-pipeline load (uses basic-license)
** used for searching (uses some costly license)
The "Ingest-Cluster"
Oh dear – in my haste to get this done I forgot the config of the first Logstash instance :flushed: – I'll hand that in ASAP. For a PoC this is OK, though. In case you already want to do it with the first Logstash service, the Elasticsearch output must be configured similarly to what is described here: https://www.elastic.co/guide/en/logstash/current/use-ingest-pipelines.html
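Until then, here is a minimal, untested sketch of how that first Logstash pipeline could look, following the linked guide – the Beats port 5044 and the buffer-cluster address are assumptions based on this PoC:

input {
  # Beats (e.g. Filebeat with output.logstash) send their events here
  beats {
    port => 5044
  }
}
output {
  # forward to the "Ingest-Cluster" and let it run the ingest pipeline the Beat asks for
  elasticsearch {
    hosts => "http://192.168.56.107:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{[@metadata][version]}"
    pipeline => "%{[@metadata][pipeline]}"
  }
}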
In my real use case I also have some Beats running directly on the nodes where Elasticsearch is running, for scraping the Elasticsearch logs and so on. That is not a bad thing, because those Beats will set up the needed ILM resources automatically.
For the ingest cluster I created the following folder structure on the first VM:
.
├── docker-compose.yml
└── etc
└── filebeat
└── filebeat.yml
It only configures an nginx log generator and Filebeat.
The docker-compose.yml file has the following content:
version: '3.8'
services:

  nginx-loggen:
    labels:
      co.elastic.logs/module: nginx
    image: kscarlett/nginx-log-generator
    container_name: nginx-loggen
    environment:
      - RATE=20

  filebeat:
    container_name: filebeat
    image: docker.elastic.co/beats/filebeat:7.14.0
    # Need to override user so we can access the log files, and docker.sock
    user: root
    environment:
      KIBANA_HOST: http://192.168.56.107:5601
      ELASTICSEARCH_HOSTS: http://192.168.56.107:9200
    restart: unless-stopped
    volumes:
      - filebeat:/usr/share/filebeat/data
      - /var/run/docker.sock:/var/run/docker.sock
      # This is needed for filebeat to load container log path as specified in filebeat.yml
      - /var/lib/docker/containers/:/var/lib/docker/containers/:ro
      - type: bind
        source: ./etc/filebeat/filebeat.yml
        target: /usr/share/filebeat/filebeat.yml
    # disable strict permission checks
    command: ["--strict.perms=false"]

volumes:
  filebeat:
As you can see, I use an Nginx log-generating service (https://github.com/kscarlett/nginx-log-generator) and Filebeat to ingest the logs into the first Elasticsearch cluster. That cluster will parse the logs nicely (except for a few events that do not seem to match the grok pattern – TODO: further investigation needed).
The etc/filebeat/filebeat.yml looks like this:
##########
# load modules - we only need nginx
filebeat.modules:
  - module: nginx
    access:
      enabled: true
    error:
      enabled: true

#==========
name: docker-filebeat

#========================== Filebeat autodiscover ==============================
# See this URL on how to run Filebeat modules on Docker:
# https://www.elastic.co/guide/en/beats/filebeat/current/running-on-docker.html
filebeat.autodiscover:
  providers:
    - type: docker
      # https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/lib/docker/containers/${data.container.id}/*.log

#================================ Processors ===================================
processors:
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_locale:
      format: offset
  - add_host_metadata:
      netinfo.enabled: true

#========================== Elasticsearch output ===============================
output.elasticsearch:
  hosts: ["192.168.56.107:9200"]

#========================= Elasticsearch templates setup =======================
# more info at https://www.elastic.co/guide/en/beats/filebeat/current/configuration-template.html
# options are legacy, component or index
setup.template.type: index

# we only have a single node - so no replicas
setup.template.settings:
  index.number_of_replicas: 0

#============================== Dashboards =====================================
setup.dashboards:
  enabled: true
setup.dashboards.retry.enabled: true
setup.dashboards.retry.interval: 5s

#============================== Kibana =========================================
setup.kibana:
  host: "http://192.168.56.107:5601"
So you see – nothing really exciting :blush:
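With both files in place, the ingest side can be brought up and checked roughly like this (the index/alias name filebeat-7.14.0 assumes the Filebeat version used above):

docker-compose up -d
# give Filebeat a minute to set up the template, ILM policy and dashboards, then:
curl 'http://192.168.56.107:9200/_cat/indices/filebeat-*?v'
curl 'http://192.168.56.107:9200/filebeat-7.14.0/_count?pretty'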
The "Costly Licensed Cluster"
Now comes the "serious" part :wink:. We want those nicely ingested events in our main cluster, which has all the fancy licensed stuff. Again I will use docker-compose on the second VM.
The Filebeat index template and ILM policy can simply be copied over to the second cluster via the Dev Console in Kibana.
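Roughly like this in the Dev Console – assuming the default names (a composable index template called filebeat-7.14.0 because of setup.template.type: index, and an ILM policy called filebeat); treat this as a sketch:

# run against the first cluster's Kibana (192.168.56.107) and copy the responses
GET _index_template/filebeat-7.14.0
GET _ilm/policy/filebeat

# then run against the second cluster's Kibana (192.168.56.108),
# pasting the inner "index_template" resp. "policy" objects from the responses as the request bodies
PUT _index_template/filebeat-7.14.0
PUT _ilm/policy/filebeat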
.
├── docker-compose.yml
└── etc
└── logstash
├── config
│ └── pipelines.yml
├── pipeline
│ ├── check_es.conf
│ └── from_es.conf
└── query_template
└── check_es_query_template.json
The docker-compose.yml just sets up two Logstash containers, which scrape the events from the "Ingest-Cluster" and ingest them into the next Elasticsearch cluster almost without any filter rules, since the data is already in the needed format – this is defined in the from_es.conf pipeline. The events are additionally forwarded to the next pipeline, called check_es.conf. This pipeline checks whether the event was really ingested and sets a tag. The output part of this pipeline then checks for this tag and, if it is present, deletes the event from the first Elasticsearch cluster, as we already have it where we want it in the end.
The docker-compose.yml is not that long:
version: '3.8'
services:

  logstash-1:
    container_name: logstash-1
    image: docker.elastic.co/logstash/logstash:7.14.0
    labels:
      co.elastic.logs/module: logstash
    environment:
      LS_JAVA_OPTS: -Xmx1G -Xms1G
    restart: unless-stopped
    volumes:
      - ./etc/logstash/config/pipelines.yml:/usr/share/logstash/config/pipelines.yml:ro
      - ./etc/logstash/query_template/check_es_query_template.json:/usr/share/logstash/query_template/check_es_query_template.json:ro
      - ./etc/logstash/pipeline/:/usr/share/logstash/pipeline:ro

  logstash-2:
    container_name: logstash-2
    image: docker.elastic.co/logstash/logstash:7.14.0
    labels:
      co.elastic.logs/module: logstash
    environment:
      LS_JAVA_OPTS: -Xmx1G -Xms1G
    restart: unless-stopped
    volumes:
      - ./etc/logstash/config/pipelines.yml:/usr/share/logstash/config/pipelines.yml:ro
      - ./etc/logstash/query_template/check_es_query_template.json:/usr/share/logstash/query_template/check_es_query_template.json:ro
      - ./etc/logstash/pipeline/:/usr/share/logstash/pipeline:ro
The etc/logstash/config/pipelines.yml
As we will do some pipeline-to-pipeline communication (see https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html) in Logstash, it will look like:
# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
# https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

- pipeline.id: from_es
  path.config: "/usr/share/logstash/pipeline/from_es.conf"

- pipeline.id: check_es
  path.config: "/usr/share/logstash/pipeline/check_es.conf"
The etc/logstash/pipeline/from_es.conf
This one has the following content:
input {
  # we get ALL events from the buffer Elasticsearch
  # every minute we run the query with a 5m scroll - sooner or later we should receive all events from there - no matter what
  elasticsearch {
    id => "elasticsearch_input_1"
    hosts => "http://192.168.56.107:9200"
    # must be changed accordingly - this is a caveat!
    index => "filebeat-7.14.0"
    query => '{"query":{"match_all":{}}}'
    scroll => "5m"
    docinfo => true
    schedule => "* * * * *"
  }
}
filter {
  # we need the metadata fields in the next pipeline as well
  # Logstash removes the @metadata stuff automatically
  mutate {
    id => "mutate_1"
    copy => { "@metadata" => "metadata" }
  }
}
output {
  # we ingest the events into the final Elasticsearch - the document_ids stay the same (document_type is deprecated and probably not really needed)
  elasticsearch {
    hosts => "http://192.168.56.108:9200"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
    # the final Elasticsearch should have the same index templates and ILM policies available, else this will make a mess!
    # in case of an upgrade, make sure you update this Elasticsearch cluster first, before updating the buffer Elasticsearch
    ilm_pattern => "{now/d}-000001"
    ilm_enabled => "true"
    ilm_policy => "%{[agent][type]}"
    ilm_rollover_alias => "%{[agent][type]}-%{[agent][version]}"
    manage_template => false
  }
  # we now send the events on to the next pipeline,
  # which will check whether the events got ingested and delete them from the buffer,
  # so that they are not ingested again and again.
  pipeline {
    id => "pipeline_output_1"
    send_to => [check_es]
  }
}
The etc/logstash/pipeline/check_es.conf and etc/logstash/query_template/check_es_query_template.json
These two files are also built quite simply:
input {
  pipeline {
    address => check_es
  }
}
filter {
  mutate {
    id => "mutate_1"
    copy => { "metadata" => "@metadata" }
  }
  # we query whether the id got ingested - if so, a tag is set which is later used in the output
  elasticsearch {
    hosts => ["192.168.56.108:9200"]
    # we query the index alias for the agent - e.g. if Filebeat 7.13.4 sent an event, there should be
    # an index alias filebeat-7.13.4 in which the _id should be present
    index => "%{[agent][type]}-%{[agent][version]}"
    # seems to have problems if "-" is at the start of the id, e.g. id:-234fjkg89245
    # escaping the quotes does not work either, e.g. \"%{[...\"
    #query => "_id:%{[@metadata][_id]}"
    # but it works in a DSL query_template
    query_template => "/usr/share/logstash/query_template/check_es_query_template.json"
    add_tag => [ "ingested" ]
  }
}
output {
  if "ingested" in [tags] {
    # we delete the original document from the buffer Elasticsearch cluster
    elasticsearch {
      hosts => "192.168.56.107:9200"
      index => "%{[@metadata][_index]}"
      document_type => "%{[@metadata][_type]}"
      document_id => "%{[@metadata][_id]}"
      action => "delete"
    }
    # for debugging
    #stdout {
    #  codec => line { format => "deleted document: %{[@metadata][_id]}" }
    #}
  }
  # else we do nothing - the document will be picked up again on the next run (if not already ingested) and deleted once it is confirmed
}
You may have noticed my "query" comment. As a workaround I use a query_template JSON file:
{
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "_id": "%{[@metadata][_id]}"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}
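To check that the template really matches, you can run the same query manually in the Dev Console against the second cluster with a concrete _id substituted (the _id below is just the placeholder value from the comment above), and watch the document count in the buffer cluster shrink between the scheduled runs:

# second cluster: does the document exist?
GET filebeat-7.14.0/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              { "match": { "_id": "-234fjkg89245" } }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}

# first cluster: should trend towards 0 as the deletes happen
GET filebeat-7.14.0/_count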
That should be everything, if I am not mistaken. Is there anything I missed? Feel free to leave comments.
Todos
LOADTESTS :expressionless: I am not sure what the performance impact on the Elasticsearch clusters will be, since Logstash always has to check whether the events got ingested
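A first, very crude load test could simply crank up the generator rate on the ingest VM and watch the indexing/search stats on both clusters while Logstash does its per-event existence checks (the rate value is only an example, and I have not verified the exact semantics of RATE):

# on the first VM: raise RATE in docker-compose.yml (e.g. from 20 to 500), then
docker-compose up -d nginx-loggen

# on both clusters: watch indexing and search pressure
curl 'http://192.168.56.107:9200/_nodes/stats/indices/indexing,search?pretty'
curl 'http://192.168.56.108:9200/_nodes/stats/indices/indexing,search?pretty'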