<?xml version="1.0" encoding="utf-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:media="http://search.yahoo.com/mrss/"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
<channel>
  <title>Srijan Choudhary, all posts tagged: influxdb</title>
  <link>https://srijan.ch/feed/all/tag:influxdb</link>
  <lastBuildDate>Thu, 11 Mar 2021 15:30:00 +0000</lastBuildDate>
  <image>
    <url>https://srijan.ch/assets/favicon/favicon-32x32.png</url>
    <title>Srijan Choudhary, all posts tagged: influxdb</title>
    <link>https://srijan.ch/feed/all/tag:influxdb</link>
  </image>
  <sy:updatePeriod>daily</sy:updatePeriod>
  <sy:updateFrequency>1</sy:updateFrequency>
  <generator>Kirby</generator>
  <atom:link href="https://srijan.ch/feed/all.xml/tag:influxdb" rel="self" type="application/rss+xml" />
  <description>Srijan Choudhary&#039;s Articles and Notes Feed for tag: influxdb</description>
  <item>
    <title>Advanced PostgreSQL monitoring using Telegraf, InfluxDB, Grafana</title>
    <description><![CDATA[My experience with advanced monitoring for PostgreSQL database using Telegraf, InfluxDB, and Grafana, using a custom postgresql plugin for Telegraf.]]></description>
    <link>https://srijan.ch/advanced-postgresql-monitoring-using-telegraf</link>
    <guid isPermaLink="false">603cefe38527ef00014f776d</guid>
    <category><![CDATA[devops]]></category>
    <category><![CDATA[postgresql]]></category>
    <category><![CDATA[monitoring]]></category>
    <category><![CDATA[telegraf]]></category>
    <category><![CDATA[influxdb]]></category>
    <category><![CDATA[ansible]]></category>
    <dc:creator>Srijan Choudhary</dc:creator>
    <pubDate>Thu, 11 Mar 2021 15:30:00 +0000</pubDate>
    <media:content url="https://srijan.ch/media/pages/blog/advanced-postgresql-monitoring-using-telegraf/d28e269c6f-1699621096/grafana-postgresql-monitoring.png" medium="image" />
    <content:encoded><![CDATA[<figure data-ratio="auto">
    <img src="https://srijan.ch/media/pages/blog/advanced-postgresql-monitoring-using-telegraf/54e29f97da-1699621096/photo-1564760055775-d63b17a55c44.jpeg" alt="Advanced PostgreSQL monitoring using Telegraf, InfluxDB, Grafana">
  
  </figure>
<h2>Introduction</h2>
<p>This post will go 
through my experience with setting up some advanced monitoring for 
PostgreSQL database using Telegraf, InfluxDB, and Grafana (also known as
 the TIG stack), the problems I faced, and what I ended up doing at the 
end.</p> <p>What do I mean by advanced? I liked <a href="https://www.datadoghq.com/blog/postgresql-monitoring/#key-metrics-for-postgresql-monitoring" rel="noreferrer">this Datadog article</a> about some key metrics for PostgreSQL monitoring. Also, this <a href="https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/db/postgresql" rel="noreferrer">PostgreSQL monitoring template for Zabbix</a>
 has some good pointers. I didn’t need everything mentioned in these 
links, but they acted as a good reference. I also prioritized monitoring
 for issues which I’ve myself faced in the past.</p> <p>Some key things that I planned to monitor:</p><ul><li>Active (and idle) connections vs. max connections configured</li><li>Size of databases and tables</li><li><a href="https://www.datadoghq.com/blog/postgresql-monitoring/#read-query-throughput-and-performance" rel="noreferrer">Read query throughput and performance</a> (sequential vs. index scans, rows fetched vs. returned, temporary data written to disk)</li><li><a href="https://www.datadoghq.com/blog/postgresql-monitoring/#write-query-throughput-and-performance" rel="noreferrer">Write query throughput and performance</a> (rows inserted/updated/deleted, locks, deadlocks, dead rows)</li></ul><p>There
 are a lot of resources online about setting up the data collection 
pipeline from Telegraf to InfluxDB, and creating dashboards on Grafana. 
So, I’m not going into too much detail on this part. This is what the 
pipeline looks like:</p><figure data-ratio="auto">
    <img src="https://srijan.ch/media/pages/blog/advanced-postgresql-monitoring-using-telegraf/7c74d1bdcd-1699621096/pg_telegraf_influx_grafana.png" alt="PostgreSQL to Telegraf to InfluxDB to Grafana">
  
    <figcaption class="text-center">
    PostgreSQL to Telegraf to InfluxDB to Grafana. <a href="https://www.planttext.com/?text=TP9RRu8m5CVV-oawdp2PCfCzBTkY8d4cA0OmcqzD1nqsmPRqacc6ttr5A7Etyz2UzlpE_vnUnb9XeVI-05UKfONEY1O5t2bLoZlN5VXzc5ErqwzQ4f5ofWXJmvJltOYcM6HyHKb92jUx7QmBpDHc6RY250HBueu6DsOVUIO9KqR4iAoh19Djk4dGyo9vGe4_zrSpfm_0b6kMON5qkBo6lJ3kzU47WCRYerHaZ_o3SfJHpGL-Cq3IkXtsXJgKbLePPb7FS5tedB9U_oT53YJD3ENNCrmBdX8fkVYNvrerik7P-SrrJaGADBDTs3BmWco0DjBfMk84EhMBiwVbo32UbehlRRTjGYqNMRc6go2KAgCCmke22XeLsr9b45FT4k04WBbKmZ8eQBvJe7g0tyoiasD9O0Mg-tWR9_uIJUV82uCmUgp3q3vAUpTdq7z9_6Wr2T0V6UUaCBR7CRmfthG0ncOml-KJ" target="_blank" rel="noreferrer">View Source</a>  </figcaption>
  </figure>
<p>And here’s what my final Grafana dashboard looks like</p><figure data-ratio="auto">
    <img src="https://srijan.ch/media/pages/blog/advanced-postgresql-monitoring-using-telegraf/d28e269c6f-1699621096/grafana-postgresql-monitoring.png" alt="Grafana dashboard sample for postgresql monitoring">
  
    <figcaption class="text-center">
    Grafana dashboard sample for PostgreSQL monitoring  </figcaption>
  </figure>
<h2>Research on existing solutions</h2>
<p>I found several solutions and articles online about monitoring PostgreSQL using Telegraf:</p><h3>1. Telegraf PostgreSQL input plugin</h3>
<p>Telegraf has a <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/postgresql" rel="noreferrer">PostgreSQL input plugin</a> which provides some built-in metrics from the <code>pg_stat_database</code> and <code>pg_stat_bgwriter</code>
 views. But this plugin cannot be configured to run any custom SQL 
script to gather the data that we want. And the built-in metrics are a 
good starting point, but not enough. So, I rejected it.</p><h3>2. Telegraf postgresql_extensible input plugin</h3>
<p>Telegraf has another PostgreSQL input plugin called <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/postgresql_extensible" rel="noreferrer">postgresql_extensible</a>.
 At first glance, this looks promising: it can run any custom query, and
 multiple queries can be defined in its configuration file.</p> <p>However, there is an <a href="https://github.com/influxdata/telegraf/issues/5009" rel="noreferrer">open issue</a>
 due to which this plugin does not run the specified query against all 
databases, but only against the database name specified in the 
connection string.</p> <p>One way this can still work is to specify multiple input blocks in the Telegraf config file, one for each database.</p><figure>
  <pre><code class="language-toml">[[inputs.postgresql_extensible]]
  address = &quot;host=localhost user=postgres dbname=database1&quot;
  [[inputs.postgresql_extensible.query]]
    script=&quot;db_stats.sql&quot;

[[inputs.postgresql_extensible]]
  address = &quot;host=localhost user=postgres dbname=database2&quot;
  [[inputs.postgresql_extensible.query]]
    script=&quot;db_stats.sql&quot;</code></pre>
  </figure>
<p>But, <strong>configuring this does not scale</strong>, especially if the database names are dynamic or we don’t want to hardcode them in the config.</p> <p>But I really liked the configuration method of this plugin, and I think this will work very well for my use case once the <a href="https://github.com/influxdata/telegraf/issues/5009" rel="noreferrer">associated Telegraf issue</a> gets resolved.</p><h3>3. Using a monitoring package like pgwatch2</h3>
<p>Another method I found was to use a package like <a href="https://github.com/cybertec-postgresql/pgwatch2" rel="noreferrer">pgwatch2</a>. This is a self-contained solution for PostgreSQL monitoring and includes dashboards as well.</p> <p>Its main components are</p><ol><li><u>A metrics collector service</u>.
 This can either be run centrally and “pull” metrics from one or more 
PostgreSQL instances, or alongside each PostgreSQL instance (like a 
sidecar) and “push” metrics to a metrics storage backend.</li><li><u>Metrics storage backend</u>. pgwatch2 supports multiple metrics storage backends like bare PostgreSQL, TimescaleDB, InfluxDB, Prometheus, and Graphite.</li><li><u>Grafana dashboards</u></li><li><u>A configuration layer</u> and associated UI to configure all of the above.</li></ol><p>I
 really liked this tool as well, but felt like this might be too complex
 for my needs. For example, it monitors a lot more than what I want to 
monitor, and it has some complexity to handle multiple PostgreSQL 
versions and multiple deployment configurations.</p> <p>But I will definitely keep this in mind for a more “batteries included” approach to PostgreSQL monitoring for future projects.</p><h2>My solution: custom Telegraf plugin</h2>
<p>Telegraf supports writing an external custom plugin, and running it via the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/execd" rel="noreferrer">execd plugin</a>. The <code>execd</code> plugin runs an external program as a long-running daemon.</p> <p>This
 approach enabled me to build the exact features I wanted, while also 
keeping things simple enough to someday revert to using the Telegraf 
built-in plugin for PostgreSQL.</p> <p>The custom plugin code can be found at <a href="https://github.com/srijan/telegraf-execd-pg-custom" rel="noreferrer">this Github repo</a>. Note that I’ve also included the <code>line_protocol.py</code> file from influx python sdk so that I would not have to install the whole sdk just for line protocol encoding.</p> <p>What this plugin (and included configuration) does:</p><ol><li>Runs as a daemon using Telegraf execd plugin.</li><li>When
 Telegraf asks for data (by sending a newline on STDIN), it runs the 
queries defined in the plugin’s config file (against the configured 
databases), converts the results into Influx line format, and sends it 
to Telegraf.</li><li>Queries can be defined to run either on a single database, or on all databases that the configured pg user has access to.</li></ol><p>This
 plugin solves the issue with Telegraf’s postgresql_extensible plugin 
for me—I don’t need to manually define the list of databases to be able 
to run queries against all of them.</p> <p>This is what the custom plugin configuration looks like</p><figure>
  <pre><code class="language-toml">[postgresql_custom]
address=&quot;&quot;

[[postgresql_custom.query]]
sqlquery=&quot;select pg_database_size(current_database()) as size_b;&quot;
per_db=true
measurement=&quot;pg_db_size&quot;

[[postgresql_custom.query]]
script=&quot;queries/backends.sql&quot;
per_db=true
measurement=&quot;pg_backends&quot;

[[postgresql_custom.query]]
script=&quot;queries/db_stats.sql&quot;
per_db=true
measurement=&quot;pg_db_stats&quot;

[[postgresql_custom.query]]
script=&quot;queries/table_stats.sql&quot;
per_db=true
tagvalue=&quot;table_name,schema&quot;
measurement=&quot;pg_table_stats&quot;</code></pre>
  </figure>
<p>Any queries defined with <code>per_db=true</code> will be run against all databases. Queries can be specified either inline, or using a separate file.</p> <p>The <a href="https://github.com/srijan/telegraf-execd-pg-custom" rel="noreferrer">repository for this plugin</a>
 has the exact queries configured above. It also has the Grafana 
dashboard JSON which can be imported to get the same dashboard as above.</p><h2>Future optimizations</h2>
<ul><li>Monitoring related to replication is not added yet, but can be added easily</li><li>No need to use superuser account in PostgreSQL 10+</li><li>This does not support running different queries depending on version of the target PostgreSQL system.</li></ul><hr />
<p>Let me know in the comments below if you have any doubts or suggestions to make this better.</p>]]></content:encoded>
    <comments>https://srijan.ch/advanced-postgresql-monitoring-using-telegraf#comments</comments>
    <slash:comments>3</slash:comments>
  </item><item>
    <title>Telegraf: dynamically adding custom tags</title>
    <description><![CDATA[Adding a custom tag to data coming in from an input plugin for telegraf]]></description>
    <link>https://srijan.ch/telegraf-dynamic-tags</link>
    <guid isPermaLink="false">6030d3dab5e0920001f557d7</guid>
    <category><![CDATA[devops]]></category>
    <category><![CDATA[telegraf]]></category>
    <category><![CDATA[influxdb]]></category>
    <dc:creator>Srijan Choudhary</dc:creator>
    <pubDate>Wed, 14 Oct 2020 00:00:00 +0000</pubDate>
    <media:content url="https://srijan.ch/media/pages/blog/telegraf-dynamic-tags/4aa8784b8f-1699621096/telegraf-plugin-interactions.png" medium="image" />
    <content:encoded><![CDATA[<h3>Background</h3>
<p>For a recent project, I wanted to add a custom tag to data coming in from a built-in input plugin for <a href="https://www.influxdata.com/time-series-platform/telegraf/" rel="noreferrer">telegraf</a>.</p> <p>The input plugin was the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/procstat" rel="noreferrer">procstat plugin</a>, and the custom data was information from <a href="https://clusterlabs.org/pacemaker/doc/" rel="noreferrer">pacemaker</a>
 (a clustering solution for linux). I wanted to add a tag indicating if 
the current host was the "active" host in my active/passive setup.</p> <p>For this, the best solution I came up with was to use a <a href="https://www.influxdata.com/blog/telegraf-1-15-starlark-nginx-go-redfish-new-relic-mongodb/" rel="noreferrer">recently released</a> <a href="https://github.com/influxdata/telegraf/tree/master/plugins/processors/execd" rel="noreferrer">execd processor</a> plugin for telegraf.</p><h3>How it works</h3>
<p>The execd processor plugin runs an external program as a separate 
process and pipes metrics in to the process's STDIN and reads processed 
metrics from its STDOUT.</p><figure data-ratio="auto">
    <img src="https://srijan.ch/media/pages/blog/telegraf-dynamic-tags/4aa8784b8f-1699621096/telegraf-plugin-interactions.png" alt="Telegraf plugins interaction diagram">
  
    <figcaption class="text-center">
    Telegraf plugins interaction. <a href="https://www.planttext.com/?text=TP9RRu8m5CVV-oawdp2PCfCzBTkY8d4cA0OmcqzD1nqsmPRqacc6ttr5A7Etyz2UzlpE_vnUnb9XeVI-05UKfONEY1O5t2bLoZlN5VXzc5ErqwzQ4f5ofWXJmvJltOYcM6HyHKb92jUx7QmBpDHc6RY250HBueu6DsOVUIO9KqR4iAoh19Djk4dGyo9vGe4_zrSpfm_0b6kMON5qkBo6lJ3kzU47WCRYerHaZ_o3SfJHpGL-Cq3IkXtsXJgKbLePPb7FS5tedB9U_oT53YJD3ENNCrmBdX8fkVYNvrerik7P-SrrJaGADBDTs3BmWco0DjBfMk84EhMBiwVbo32UbehlRRTjGYqNMRc6go2KAgCCmke22XeLsr9b45FT4k04WBbKmZ8eQBvJe7g0tyoiasD9O0Mg-tWR9_uIJUV82uCmUgp3q3vAUpTdq7z9_6Wr2T0V6UUaCBR7CRmfthG0ncOml-KJ" target="_blank" rel="noreferrer">View Source</a>  </figcaption>
  </figure>
<p>Telegraf's <a href="https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#metric-filtering" rel="noreferrer">filtering parameters</a> can be used to select or limit data from which input plugins will go to this processor.</p><h3>The external program</h3>
<p>The external program I wrote does the following:</p><ol><li>Get pacemaker status and cache it for 10 seconds</li><li>Read a line from stdin</li><li>Append the required information as a tag in the data</li><li>Write it to stdout</li></ol><p>The caching is just an optimization - it was more to do with decreasing polluting the logs than actual speed improvements.</p> <p>Also, I've done the Influxdb lineprotocol parsing in my code directly
 (because my usecase is simple), but this can be substituted by an 
actual library meant for handling lineprotocol.</p><figure>
  <pre><code class="language-python">#!/usr/bin/python

from __future__ import print_function
from sys import stderr
import fileinput
import subprocess
import time

cache_value = None
cache_time = 0
resource_name = &quot;VIP&quot;

def get_crm_status():
    global cache_value, cache_time, resource_name
    ctime = time.time()
    if ctime - cache_time &gt; 10:
        # print(&quot;Cache busted&quot;, file=stderr)
        try:
            crm_node = subprocess.check_output([&quot;sudo&quot;, &quot;/usr/sbin/crm_node&quot;, &quot;-n&quot;]).rstrip()
            crm_resource = subprocess.check_output([&quot;sudo&quot;, &quot;/usr/sbin/crm_resource&quot;, &quot;-r&quot;, resource_name, &quot;-W&quot;]).rstrip()
            active_node = crm_resource.split(&quot; &quot;)[-1]
            if active_node == crm_node:
                cache_value = &quot;active&quot;
            else:
                cache_value = &quot;inactive&quot;
        except (OSError, IOError) as e:
            print(&quot;Exception: %s&quot; % e, file=stderr)
            # Don&#039;t report active/inactive if crm commands are not found
            cache_value = None
        except Exception as e:
            print(&quot;Exception: %s&quot; % e, file=stderr)
            # Report as inactive in other cases by default
            cache_value = &quot;inactive&quot;
        cache_time = ctime
    return cache_value

def lineprotocol_add_tag(line, key, value):
    first_comma = line.find(&quot;,&quot;)
    first_space = line.find(&quot; &quot;)
    if first_comma &gt;= 0 and first_comma &lt;= first_space:
        split_str = &quot;,&quot;
    else:
        split_str = &quot; &quot;
    parts = line.split(split_str)
    first, rest = parts[0], parts[1:]
    first_new = first + &quot;,&quot; + key + &quot;=&quot; + value
    return split_str.join([first_new] + rest)

for line in fileinput.input():
    line = line.rstrip()
    crm_status = get_crm_status()
    if crm_status:
        try:
            new_line = lineprotocol_add_tag(line, &quot;crm_status&quot;, crm_status)
        except Exception as e:
            print(&quot;Exception: %s, Input: %s&quot; % (e, line), file=stderr)
            new_line = line
    else:
        new_line = line

    print(new_line)</code></pre>
    <figcaption class="text-center">pacemaker_status.py</figcaption>
  </figure>
<h3>Telegraf configuration</h3>
<p>Here's
 a sample telegraf configuration that routes data from "system" plugin 
to execd processor plugin, and finally outputs to influxdb.</p><figure>
  <pre><code class="language-toml">[agent]
  interval = &quot;30s&quot;

[[inputs.cpu]]

[[inputs.system]]

[[processors.execd]]
  command = [&quot;/usr/bin/python&quot;, &quot;/etc/telegraf/scripts/pacemaker_status.py&quot;]
  namepass = [&quot;system&quot;]

[[outputs.influxdb]]
  urls = [&quot;http://127.0.0.1:8086&quot;]
  database = &quot;telegraf&quot;</code></pre>
    <figcaption class="text-center">telegraf.conf</figcaption>
  </figure>
<h3>Other types of dynamic tags</h3>
<p>In this example, we wanted to get the value of the tag from an 
external program. If the tag can be calculated from the incoming data 
itself, then things are much simpler. There are <a href="https://github.com/influxdata/telegraf/tree/release-1.15/plugins/processors" rel="noreferrer">a lot of processor plugins</a>, and many things can be achieved using just those.</p>]]></content:encoded>
    <comments>https://srijan.ch/telegraf-dynamic-tags#comments</comments>
    <slash:comments>0</slash:comments>
  </item></channel>
</rss>
