<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Blog of Hamed Karbasi]]></title><description><![CDATA[The Blog of Hamed Karbasi]]></description><link>https://hamedkarbasi.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1755435950446/783f2f2f-268b-4a6d-be46-8452d3965f5b.png</url><title>The Blog of Hamed Karbasi</title><link>https://hamedkarbasi.com</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 06:16:14 GMT</lastBuildDate><atom:link href="https://hamedkarbasi.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Visualizing Avro Kafka Data in Grafana: Streaming Real-Time Schemas]]></title><description><![CDATA[Hey everyone, if you've been following along from my last post on streaming JSON into Grafana with the Kafka Datasource plugin, you know how awesome it is to watch real-time data light up your dashboards. 
Today, we're leveling up to Avro; that compac...]]></description><link>https://hamedkarbasi.com/visualizing-avro-kafka-data-in-grafana-streaming-real-time-schemas</link><guid isPermaLink="true">https://hamedkarbasi.com/visualizing-avro-kafka-data-in-grafana-streaming-real-time-schemas</guid><category><![CDATA[Grafana]]></category><category><![CDATA[streaming]]></category><category><![CDATA[avro]]></category><category><![CDATA[kafka]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[realtime]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Sun, 15 Feb 2026 13:21:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/-QyXCuVHwfE/upload/9597ddca2f82bfa080b5312c3316daa1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey everyone, if you've been following along from <a target="_blank" href="https://hamedkarbasi.com/visualizing-kafka-data-in-grafana-consuming-real-time-messages-for-dashboards">my last post</a> on streaming JSON into Grafana with the <a target="_blank" href="https://grafana.com/grafana/plugins/hamedkarbasi93-kafka-datasource/">Kafka Datasource plugin</a>, you know how awesome it is to watch real-time data light up your dashboards. Today, we're leveling up to <a target="_blank" href="https://avro.apache.org/docs/">Avro</a>; that compact, schema-evolved format so many Kafka setups swear by. I'll walk you through using the plugin's Avro support, whether you're pulling from a <a target="_blank" href="https://github.com/confluentinc/schema-registry">Schema Registry</a> or embedding schemas inline. It's straightforward, and you'll be visualizing those binary messages in no time.</p>
<h2 id="heading-before-we-jump-in">Before we jump in</h2>
<p>If you haven't checked out <a target="_blank" href="https://hamedkarbasi.com/visualizing-kafka-data-in-grafana-consuming-real-time-messages-for-dashboards">my previous article</a> on the JSON feature, I'd highly recommend giving it a quick read first. It covers the foundational concepts of how this plugin works: connecting to Kafka, picking topics, authentication, configuring partitions and offsets, and mapping message fields to Grafana panels. The Avro support builds on the exact same mental model: Kafka topic → live stream → fields in Grafana. So you'll feel right at home.</p>
<p>The only real difference? Avro messages are binary and need a schema to be decoded into something Grafana can work with. Once decoded, everything else (queries, panels, alerts) works identically to JSON.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766099914314/48a6091d-2757-432f-b4a6-8fae7df90c27.png" alt /></p>
<h2 id="heading-what-is-avro-and-why-teams-use-it">What is Avro (and why teams use it)?</h2>
<p>Apache Avro is a schema-based serialization format that encodes data compactly, often much smaller than JSON, and relies on a schema to describe what each message contains. Think of it as a contract between your producer and consumer: the schema defines field names, types, and structure, and the binary payload is just the raw values packed efficiently.</p>
<p>Here's a quick example to see the difference. Below is an Avro schema for a sensor's measurements:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
  <span class="hljs-attr">"name"</span>: <span class="hljs-string">"SensorReading"</span>,
  <span class="hljs-attr">"fields"</span>: [
    {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"sensor_id"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>},
    {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"temperature"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"double"</span>},
    {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"humidity"</span>, <span class="hljs-attr">"type"</span>: [<span class="hljs-string">"null"</span>, <span class="hljs-string">"int"</span>], <span class="hljs-attr">"default"</span>: <span class="hljs-literal">null</span>},
    {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"timestamp"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"long"</span>}
  ]
}
</code></pre>
<p><strong>The data in Avro binary</strong> (~29 bytes, strongly typed):</p>
<pre><code class="lang-plaintext">\x10sensor-42C@33333\x02\xc8\x01\xd0\xf4\x8b\x9a\xb2\x30
</code></pre>
<p><strong>Same data in JSON</strong> (~91 bytes, no type safety):</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"sensor_id"</span>: <span class="hljs-string">"sensor-42"</span>,
  <span class="hljs-attr">"temperature"</span>: <span class="hljs-number">23.4</span>,
  <span class="hljs-attr">"humidity"</span>: <span class="hljs-number">68</span>,
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-number">1708027200000</span>
}
</code></pre>
<p><em>↑ Field names repeated in every message plus UTF-8 text encoding make JSON ~3x larger than Avro's compact binary format. JSON types are also implicit: you won't know whether <code>temperature</code> is a float, a double, or a string until runtime.</em></p>
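<p>If you're curious where those byte savings come from: Avro writes ints and longs as zig-zag-encoded base-128 varints, so small values cost a single byte and field names never appear on the wire. Here's a tiny Python sketch of that encoding (illustrative only; this is not the plugin's actual code):</p>
<pre><code class="lang-python">def zigzag64(n):
    # Avro's zig-zag mapping for 64-bit longs: 0, -1, 1, -2, 2, ...
    # become 0, 1, 2, 3, 4, ... so small magnitudes stay small.
    # n // 2 ** 63 is 0 for non-negative n and -1 for negative n,
    # mirroring an arithmetic right shift in two's complement.
    return ((n * 2) ^ (n // 2 ** 63)) % 2 ** 64

def varint(u):
    # Base-128 varint: 7 data bits per byte, high bit set on every
    # byte except the last.
    out = bytearray()
    while True:
        byte = u % 128
        u = u // 128
        if u:
            out.append(byte + 128)
        else:
            out.append(byte)
            return bytes(out)

# The humidity value 68 costs only two bytes on the wire:
print(varint(zigzag64(68)))  # b'\x88\x01'
</code></pre>
<p>Compare those two bytes with the dozen-plus characters <code>"humidity": 68</code> takes in JSON, and the size gap in the example above starts to make sense.</p>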
<p><strong>Decoded by Grafana plugin</strong> (what you see in panels):</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"sensor_id"</span>: <span class="hljs-string">"sensor-42"</span>,
  <span class="hljs-attr">"temperature"</span>: <span class="hljs-number">23.4</span>,
  <span class="hljs-attr">"humidity"</span>: <span class="hljs-number">68</span>,
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-number">1708027200000</span>
}
</code></pre>
<p><em>↑ Ready to map to Grafana fields:</em> <code>temperature</code> <em>becomes a time series,</em> <code>sensor_id</code> <em>a label</em></p>
<p>The big wins are <strong>efficiency and type safety</strong>: Avro doesn't repeat field names in every message (huge savings at scale), and the schema enforces types at write time, no surprise strings where you expected numbers. Plus, <strong>schema evolution</strong> lets you add fields and maintain backward/forward compatibility when you follow the rules, which is why Avro is commonly paired with a Schema Registry in Kafka ecosystems. Teams love it because streaming millions of messages per second becomes cheaper, safer, and more manageable than with JSON.</p>
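<p>To make schema evolution concrete, here's a backward-compatible next version of the earlier <code>SensorReading</code> schema. The new <code>battery_pct</code> field is a hypothetical addition for illustration; because it's nullable with a default, consumers on the new schema can still decode old messages:</p>
<pre><code class="lang-json">{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "humidity", "type": ["null", "int"], "default": null},
    {"name": "timestamp", "type": "long"},
    {"name": "battery_pct", "type": ["null", "double"], "default": null}
  ]
}
</code></pre>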
<h2 id="heading-big-picture-what-the-plugin-does-for-avro">Big picture: what the plugin does for Avro</h2>
<p>This Grafana Kafka datasource plugin can consume Kafka messages and turn them into Grafana-friendly fields in real time, supporting both JSON and Avro payloads. For Avro specifically, the plugin handles the heavy lifting:</p>
<ul>
<li><p><strong>Fetches schemas</strong> from a Schema Registry (like Confluent's) automatically based on topic/subject naming conventions</p>
</li>
<li><p><strong>Accepts inline schemas</strong> you paste directly into the query or upload the schema file (great for quick tests or environments without a registry)</p>
</li>
<li><p><strong>Deserializes binary messages</strong> on the fly and flattens nested records into dot-notation fields that Grafana understands</p>
</li>
</ul>
<p>So whether your Avro messages come from IoT sensors, clickstream events, or service logs, the plugin decodes them, and you build dashboards just like you would with JSON data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771104774238/8a222ceb-2a06-4bb8-a9d4-450f548d064c.png" alt class="image--center mx-auto" /></p>
<p><em>↑ A sample of the decoded Avro data in Grafana using the plugin</em></p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we dive into configs, make sure you have:</p>
<ul>
<li><p><strong>Grafana 10.2+</strong> installed and running</p>
</li>
<li><p><strong>Plugin version 1.2.0+</strong> (Avro support landed here): install via <code>grafana-cli plugins install hamedkarbasi93-kafka-datasource</code> or grab the latest zip from the <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource/releases">GitHub releases</a></p>
</li>
<li><p><strong>Kafka broker</strong> (v0.9+ works great)</p>
</li>
<li><p><strong>(Optional) Schema Registry</strong> running at something like <code>http://localhost:8081</code> if you want the registry approach</p>
</li>
</ul>
<p>No Schema Registry? Totally fine! Inline schemas work perfectly for demos, local dev, and quick debugging.</p>
<h2 id="heading-data-source-setup-for-avro">Data source setup for Avro</h2>
<p>Head to <strong>Connections &gt; Data sources &gt; Add new data source</strong> and search for "Kafka". Fill in the basics:</p>
<ul>
<li><p><strong>Bootstrap Servers</strong>: e.g., <code>localhost:9092</code></p>
</li>
<li><p><strong>SASL/SSL toggles</strong>: if your cluster requires auth, flip these on and add credentials</p>
</li>
<li><p><strong>Schema Registry URL &amp; its username/password</strong>: e.g., <code>http://schema-registry:8081</code></p>
</li>
</ul>
<p>For inline schema mode, you don't need to set the registry URL at the datasource level; you'll paste the schema directly in the data query panel. But if you're using a registry in production, configuring it once here means all your Avro queries auto-fetch schemas without extra config.</p>
<p>Hit <strong>Save &amp; Test</strong>. A green check means you're connected and the plugin can reach Kafka.</p>
<h2 id="heading-two-ways-to-decode-avro">Two ways to decode Avro</h2>
<h3 id="heading-schema-registry-recommended-for-production">Schema Registry (recommended for production)</h3>
<p>If you already have a Schema Registry (common with Confluent-style setups), point the datasource at the registry URL and let it resolve schemas automatically. After hitting <strong>Test Connection</strong> and receiving the "<em>The schema registry is accessible</em>" confirmation, you know the plugin can reach your registry. The plugin follows Kafka's subject naming convention (e.g., <code>&lt;topic-name&gt;-value</code> for message values), fetches the latest schema version, and deserializes each message.</p>
<p>This keeps dashboards clean because you're not pasting schemas into Grafana queries, and it naturally supports teams that evolve schemas over time: just register a new version and the plugin picks it up.</p>
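<p>Under the hood, the registry lookup is just an HTTP call. Here's a rough Python sketch of that flow (not the plugin's actual implementation; the URL and topic below are simply this post's examples):</p>
<pre><code class="lang-python">import json
from urllib.request import urlopen

def value_subject(topic):
    # Kafka's default subject naming strategy for message values:
    # the topic name plus a "-value" suffix.
    return topic + "-value"

def fetch_latest_schema(registry_url, topic):
    # GET /subjects/{subject}/versions/latest is the standard
    # Confluent Schema Registry endpoint; the response carries the
    # Avro schema (as a JSON string), its version, and its global id.
    url = registry_url + "/subjects/" + value_subject(topic) + "/versions/latest"
    with urlopen(url) as resp:
        body = json.load(resp)
    return json.loads(body["schema"]), body["version"]

print(value_subject("server-metrics"))  # server-metrics-value
</code></pre>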
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771105462707/4207d8ac-cdda-4a30-9b1d-ba92f2b3bd30.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-inlineonline-schema-great-for-demos-and-quick-debugging">Inline schema (great for demos and quick debugging)</h3>
<p>If you don't have a registry (or you're just prototyping), you can paste the full Avro schema JSON directly into the query editor, and the plugin will use it to decode messages. Alternatively, you can upload an Avro schema file (<code>.avsc</code>) for convenience. The plugin validates the inline schema automatically to ensure it's properly formatted before attempting to deserialize messages.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771105654939/05792b50-31ef-43ae-a5e7-91ba271233cf.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-building-your-first-avro-query">Building your first Avro query</h2>
<p>Create a new panel, pick your Kafka datasource, and configure the query:</p>
<ol>
<li><p><strong>Format</strong>: Switch from JSON to <strong>Avro</strong></p>
</li>
<li><p><strong>Topic</strong>: Type or autocomplete your Avro topic name (e.g., <code>server-metrics</code>)</p>
</li>
<li><p><strong>Partitions</strong>: Click "Fetch" to list partitions, then pick specific ones or select all</p>
</li>
<li><p><strong>Offset</strong>: Choose "Latest" for live streams, or "Last N" to replay recent messages</p>
</li>
<li><p><strong>Schema Mode</strong>:</p>
<ul>
<li><p><strong>Schema Registry</strong>: The plugin auto-derives the subject (e.g., <code>server-metrics-value</code>) and fetches the schema</p>
</li>
<li><p><strong>Inline Schema</strong>: Paste the full Avro schema JSON in the text box</p>
</li>
</ul>
</li>
</ol>
<p>Once you save, the plugin starts consuming, deserializing on the fly, and populating fields in the panel's data frame.</p>
<h2 id="heading-understanding-field-flattening-with-nested-schemas">Understanding field flattening with nested schemas</h2>
<p>One of the plugin's handiest features is how it handles nested Avro records: it flattens them into dot-notation field names, allowing Grafana to work with them naturally. Let's use a real example to see this in action.</p>
<p>Here's a nested schema representing server metrics with host info, multi-level metrics, nullable fields, and arrays:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"NestedMessage"</span>,
    <span class="hljs-attr">"fields"</span>: [
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"host"</span>,
            <span class="hljs-attr">"type"</span>: {
                <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
                <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Host"</span>,
                <span class="hljs-attr">"fields"</span>: [
                    {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"name"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>},
                    {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"ip"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>}
                ]
            }
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"metrics"</span>,
            <span class="hljs-attr">"type"</span>: {
                <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
                <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Metrics"</span>,
                <span class="hljs-attr">"fields"</span>: [
                    {
                        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"cpu"</span>,
                        <span class="hljs-attr">"type"</span>: {
                            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
                            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"CPU"</span>,
                            <span class="hljs-attr">"fields"</span>: [
                                {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"load"</span>, <span class="hljs-attr">"type"</span>: [<span class="hljs-string">"null"</span>, <span class="hljs-string">"double"</span>], <span class="hljs-attr">"default"</span>: <span class="hljs-literal">null</span>},
                                {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"temp"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"double"</span>}
                            ]
                        }
                    },
                    {
                        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"mem"</span>,
                        <span class="hljs-attr">"type"</span>: {
                            <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
                            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Memory"</span>,
                            <span class="hljs-attr">"fields"</span>: [
                                {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"used"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"int"</span>},
                                {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"free"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"int"</span>}
                            ]
                        }
                    }
                ]
            }
        },
        {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"value1"</span>, <span class="hljs-attr">"type"</span>: [<span class="hljs-string">"null"</span>, <span class="hljs-string">"double"</span>], <span class="hljs-attr">"default"</span>: <span class="hljs-literal">null</span>},
        {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"value2"</span>, <span class="hljs-attr">"type"</span>: [<span class="hljs-string">"null"</span>, <span class="hljs-string">"double"</span>], <span class="hljs-attr">"default"</span>: <span class="hljs-literal">null</span>},
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"tags"</span>,
            <span class="hljs-attr">"type"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"array"</span>, <span class="hljs-attr">"items"</span>: <span class="hljs-string">"string"</span>}
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"alerts"</span>,
            <span class="hljs-attr">"type"</span>: {
                <span class="hljs-attr">"type"</span>: <span class="hljs-string">"array"</span>,
                <span class="hljs-attr">"items"</span>: {
                    <span class="hljs-attr">"type"</span>: <span class="hljs-string">"record"</span>,
                    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"Alert"</span>,
                    <span class="hljs-attr">"fields"</span>: [
                        {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"type"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>},
                        {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"severity"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>},
                        {<span class="hljs-attr">"name"</span>: <span class="hljs-string">"value"</span>, <span class="hljs-attr">"type"</span>: <span class="hljs-string">"double"</span>}
                    ]
                }
            }
        },
        {
            <span class="hljs-attr">"name"</span>: <span class="hljs-string">"processes"</span>,
            <span class="hljs-attr">"type"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"array"</span>, <span class="hljs-attr">"items"</span>: <span class="hljs-string">"string"</span>}
        }
    ]
}
</code></pre>
<p>When the plugin decodes a message with this schema, you'll see flattened fields like:</p>
<ul>
<li><p><code>host.name</code> (string) → label/dimension</p>
</li>
<li><p><code>host.ip</code> (string) → label/dimension</p>
</li>
<li><p><code>metrics.cpu.load</code> (double, nullable) → time series value</p>
</li>
<li><p><code>metrics.cpu.temp</code> (double) → time series value</p>
</li>
<li><p><code>metrics.mem.used</code> (int) → time series value</p>
</li>
<li><p><code>metrics.mem.free</code> (int) → time series value</p>
</li>
<li><p><code>value1</code>, <code>value2</code> (nullable doubles) → time series values</p>
</li>
<li><p><code>tags</code> → JSON string <code>["prod","edge"]</code></p>
</li>
<li><p><code>alerts</code> → JSON string <code>[{"severity":"warning","type":"cpu_high","value":98.99}]</code></p>
</li>
<li><p><code>processes</code> → JSON string <code>["nginx","mysql","redis"]</code></p>
</li>
</ul>
<p>This means you can directly reference <code>metrics.cpu.temp</code> in a Graph panel or use <code>host.name</code> as a grouping dimension in a Table; no manual parsing is needed.</p>
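<p>If you want a feel for the flattening logic, here's a minimal Python sketch (the real work happens inside the plugin; this only mirrors the behavior described above):</p>
<pre><code class="lang-python">import json

def flatten(record, prefix="", depth=5):
    # Recursively flatten nested dicts into dot-notation keys;
    # lists (and records nested deeper than the depth limit)
    # become compact JSON strings, as in the field list above.
    flat = {}
    for key, value in record.items():
        name = prefix + "." + key if prefix else key
        if isinstance(value, dict) and depth:
            flat.update(flatten(value, name, depth - 1))
        elif isinstance(value, (dict, list)):
            flat[name] = json.dumps(value, separators=(",", ":"))
        else:
            flat[name] = value
    return flat

reading = {
    "host": {"name": "web-1", "ip": "10.0.0.5"},
    "metrics": {"cpu": {"load": None, "temp": 51.2}},
    "tags": ["prod", "edge"],
}
print(flatten(reading)["metrics.cpu.temp"])  # 51.2
</code></pre>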
<h3 id="heading-practical-tips-for-nested-schemas">Practical tips for nested schemas</h3>
<ul>
<li><p><strong>Nullable unions</strong> (like <code>["null", "double"]</code>) are handled gracefully; null values just show up as gaps in the time series.</p>
</li>
<li><p><strong>Arrays</strong> are serialized as JSON strings in the data frame (e.g., <code>["nginx","mysql"]</code> or <code>[{"severity":"warning","type":"cpu_high","value":98.99}]</code>). You can parse them further with Grafana's JSON transform or display them as-is in Table panels.</p>
</li>
<li><p><strong>Deep nesting</strong> is flattened up to depth 5 by default; beyond that, the plugin won't flatten further to avoid performance issues.</p>
</li>
<li><p>You can adjust both the <strong>flatten depth</strong> and the <strong>maximum field limit</strong> (defaults: depth <code>5</code>, fields <code>1000</code>) in the <strong>Advanced Settings</strong> section of the <strong>Config Editor</strong> if your schema needs more room.</p>
</li>
<li><p>Use field aliases in Grafana's Transform tab to rename <code>metrics.cpu.temp</code> to something friendlier, like <code>"CPU Temperature"</code> in legends.</p>
</li>
</ul>
<h2 id="heading-testing-it-out-with-live-data">Testing it out with live data</h2>
<p>The repo includes a <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource/tree/main/example/go">Go-based producer</a> that can publish Avro messages to your local Kafka. Fire it up:</p>
<pre><code class="lang-bash">go run ./example/go \
  -broker localhost:9094 \
  -topic server-metrics \
  -interval 500 \
  -format avro \
  -schema-registry http://localhost:8081
</code></pre>
<p>This pushes a message every 500ms with randomized metrics. Now create a panel in Grafana, point it at the <code>server-metrics</code> topic with Avro format, and watch the fields populate in real time. Build a Stat panel for <code>metrics.cpu.temp</code>, a Graph for <code>metrics.mem.used</code> over time, or a Table grouped by <code>host.name</code>; it all just works.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771108505476/72783508-1009-443e-9899-c9e6a79fdf77.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-wrapping-up">Wrapping up</h2>
<p>There you have it! Avro streaming unlocked in Grafana. Whether you're running a full Confluent Platform with Schema Registry or just need to decode some binary messages locally, this plugin makes it dead simple to visualize Avro data alongside your other telemetry.</p>
<p>Questions? Found a bug? Feature idea? Hit up the <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource/issues">GitHub issues</a> or drop a comment below. And if this saved you a few hours of head-scratching, toss the <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource">repo</a> a star; it helps more folks find it!</p>
<p>Happy streaming! 🚀</p>
]]></content:encoded></item><item><title><![CDATA[Visualizing Kafka Data in Grafana: Consuming Real-Time Messages for Dashboards]]></title><description><![CDATA[Ever tried visualizing data with Grafana? Whether you’re building a simple dashboard with Prometheus to track resource usage or running complex queries on Loki or Elasticsearch, Grafana is one of the popular go-to tools.
Grafana offers open-source an...]]></description><link>https://hamedkarbasi.com/visualizing-kafka-data-in-grafana-consuming-real-time-messages-for-dashboards</link><guid isPermaLink="true">https://hamedkarbasi.com/visualizing-kafka-data-in-grafana-consuming-real-time-messages-for-dashboards</guid><category><![CDATA[kafka]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[open source]]></category><category><![CDATA[streaming]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Fri, 19 Dec 2025 12:25:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/7cLqEYJws8E/upload/79d29542e354db06633c8d72838c1a64.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever tried visualizing data with <a target="_blank" href="https://grafana.com/">Grafana</a>? Whether you’re building a simple dashboard with <a target="_blank" href="https://prometheus.io/">Prometheus</a> to track resource usage or running complex queries on <a target="_blank" href="https://grafana.com/oss/loki/">Loki</a> or <a target="_blank" href="https://www.elastic.co/elasticsearch">Elasticsearch</a>, Grafana is one of the most popular go-to tools.</p>
<p>Grafana offers <a target="_blank" href="https://grafana.com/oss/grafana/">open-source</a> and self-hosted options, as well as enterprise and <a target="_blank" href="https://grafana.com/products/cloud/">cloud solutions</a>, all designed to help you gain insights from almost anything. And by <em>anything</em>, I really mean it: from time-series databases like Prometheus and InfluxDB, to relational databases such as PostgreSQL, and even cloud-native metrics from services like AWS CloudWatch and Google Cloud Monitoring.</p>
<p>However, in modern data platforms, a huge amount of data doesn’t necessarily sit in databases; it flows through streaming systems. And one name shows up almost everywhere: <a target="_blank" href="https://kafka.apache.org/">Kafka</a>!</p>
<h2 id="heading-got-kafka-lets-talk-about-it">Got Kafka? Let's Talk About It!</h2>
<p>Kafka is a powerful broker, and if you're using it, you probably have a bunch of questions about your Kafka cluster. These questions usually fall into two categories:</p>
<h3 id="heading-curious-about-kafka-metrics">Curious About Kafka Metrics?</h3>
<ul>
<li><p>How many clusters are we running?</p>
</li>
<li><p>How much disk space is each broker using?</p>
</li>
<li><p>Is the cluster healthy?</p>
</li>
<li><p>What’s the throughput in messages per second or bytes per second?</p>
</li>
<li><p>How many topics and partitions do we have?</p>
</li>
</ul>
<p>These are all <strong>Kafka cluster metrics</strong>. The good news is that this part is already well covered: you can <a target="_blank" href="https://github.com/danielqsj/kafka_exporter">export</a> Kafka metrics to Prometheus and visualize them in Grafana using <a target="_blank" href="https://grafana.com/grafana/dashboards/?search=kafka+exporter">ready-made dashboards</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766142019266/28881803-3b6c-401a-afc7-8c17c472f597.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-wondering-about-kafka-data">Wondering About Kafka Data?</h3>
<ul>
<li><p>What messages are being produced on a specific topic?</p>
</li>
<li><p>Can I stream messages in real time to see what’s actually happening?</p>
</li>
<li><p>Can I build dashboards based on the data flowing through Kafka?</p>
</li>
<li><p>My messages are in JSON, Avro, or Protobuf and are fairly complex. How can I visualize them effectively?</p>
</li>
</ul>
<p>If these questions sound familiar, then you’re no longer just interested in <em>metrics</em>. You want visibility into the <strong>data itself</strong>.</p>
<p>This is where a <a target="_blank" href="https://grafana.com/grafana/plugins/hamedkarbasi93-kafka-datasource/"><strong>Grafana data source plugin</strong></a> becomes essential, allowing Grafana to connect directly to Kafka, consume messages, and turn streaming data into dashboards.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766146382952/315f3edb-a4bc-480e-a425-4d22002baea4.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-existing-tools-for-inspecting-kafka-data">Existing Tools for Inspecting Kafka Data</h2>
<p>Kafka ecosystems already offer excellent tools for working directly with Kafka data, and many teams rely on them daily.</p>
<ul>
<li><p><a target="_blank" href="https://github.com/edenhill/kcat"><strong>kcat</strong></a> <strong>(formerly kafkacat)</strong> is a lightweight CLI tool that’s extremely useful for quick inspections, debugging producers and consumers, and validating message payloads.</p>
</li>
<li><p><a target="_blank" href="https://akhq.io/"><strong>AKHQ</strong></a><strong>,</strong> <a target="_blank" href="https://conduktor.io/"><strong>Conduktor</strong></a><strong>,</strong> <a target="_blank" href="https://docs.confluent.io/control-center/current/overview.html"><strong>Confluent Control Center</strong></a><strong>,</strong> <a target="_blank" href="https://www.redpanda.com/redpanda-console-kafka-ui"><strong>Redpanda Console</strong></a><strong>, and similar UIs</strong> provide rich, Kafka-focused interfaces for browsing topics, inspecting messages, and working with schemas.</p>
</li>
</ul>
<p>These tools are purpose-built for Kafka-specific workflows. Beyond message inspection, they offer deep visibility into Kafka itself, including cluster health, metadata, Schema Registry, Kafka Connect, and overall operational state. When teams need to understand or operate Kafka as a system, these tools are often a natural fit.</p>
<h3 id="heading-then-why-visualize-kafka-data-in-grafana">Then, Why Visualize Kafka Data in Grafana?</h3>
<p>The Kafka Data Source plugin does not aim to replace these tools. Instead, it focuses on a different and complementary use case.</p>
<p>In many organizations, <strong>Grafana is already the central place for dashboards and observability</strong>. Metrics, logs, traces, and business KPIs often live side by side in Grafana, giving teams a shared view of how systems behave.</p>
<p>By bringing Kafka data into Grafana, teams can:</p>
<ul>
<li><p>Visualize Kafka data <strong>alongside existing dashboards</strong>, rather than in a separate, Kafka-only tool</p>
</li>
<li><p>Correlate Kafka messages with <strong>application metrics, infrastructure metrics, and alerts</strong></p>
</li>
<li><p>Use Grafana’s alerting and visualization capabilities to gain deeper insights</p>
</li>
<li><p>Follow the principle of <strong>having all operational insights in one place</strong></p>
</li>
</ul>
<p>While the aforementioned tools focus exclusively on Kafka, Grafana provides a broader observability context. The Kafka Data Source plugin helps bridge that gap, making Kafka data part of the same story as the rest of your systems.</p>
<h2 id="heading-kafka-data-source-plugin">Kafka Data Source Plugin</h2>
<p>The Kafka Data Source plugin works as a <a target="_blank" href="https://www.instaclustr.com/blog/a-beginners-guide-to-kafka-consumers/">Kafka <strong>consumer</strong></a>. It reads messages from the topics and partitions you specify and makes that data available directly inside Grafana, so you can visualize streaming data instead of just monitoring cluster metrics.</p>
<h3 id="heading-get-started">Get Started</h3>
<p>To use the plugin, you first need to install it in Grafana. You can do this either via the Grafana <a target="_blank" href="https://grafana.com/docs/grafana/latest/administration/plugin-management/plugin-install/#install-a-plugin-through-the-grafana-ui">UI</a> or the CLI. Alternatively, you can install it using the <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource/releases/latest">plugin ZIP file</a> or by following the provisioning instructions available on the plugin’s <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource?tab=readme-ov-file#provisioning">GitHub page</a>.</p>
<pre><code class="lang-bash">grafana-cli plugins install hamedkarbasi93-kafka-datasource
</code></pre>
<p>In this article, we’ll focus on using the plugin to visualize <strong>nested JSON messages</strong> in Grafana. I’ll cover other features such as Avro in future articles.</p>
<h3 id="heading-what-you-need-before-you-start">What You Need Before You Start</h3>
<p>Before diving in, make sure you have:</p>
<ul>
<li><p>Grafana with the Kafka Data Source plugin installed</p>
</li>
<li><p>Access to a Kafka cluster, including its connection details (bootstrap servers and TLS/SASL credentials, if required)</p>
</li>
</ul>
<h3 id="heading-setting-up-your-data-source-in-grafana">Setting Up Your Data Source in Grafana</h3>
<p>Once the plugin is installed, you can add a new Kafka data source in Grafana and configure the following settings.</p>
<h4 id="heading-connection-settings">Connection Settings</h4>
<ul>
<li><p><strong>Bootstrap servers</strong><br />  Enter a comma-separated list of Kafka brokers.</p>
</li>
<li><p><strong>Security protocol</strong><br />  Choose one of the supported protocols:</p>
<ul>
<li><p><code>PLAINTEXT</code></p>
</li>
<li><p><code>SSL</code></p>
</li>
<li><p><code>SASL_PLAINTEXT</code></p>
</li>
<li><p><code>SASL_SSL</code></p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-authentication-and-tls">Authentication and TLS</h4>
<p>Depending on the selected security protocol, you may need to configure authentication and TLS:</p>
<ul>
<li><p><strong>SASL settings</strong></p>
<ul>
<li><p>SASL mechanism</p>
</li>
<li><p>SASL username</p>
</li>
<li><p>SASL password</p>
</li>
</ul>
</li>
<li><p><strong>TLS settings</strong></p>
<ul>
<li><p>Server certificate</p>
</li>
<li><p>Client certificate</p>
</li>
<li><p>Client key</p>
</li>
</ul>
</li>
</ul>
<p>These fields are required only when SASL or TLS is enabled.</p>
<h4 id="heading-schema-registry-avro-only">Schema Registry (Avro Only)</h4>
<p>Schema Registry settings are <strong>only required when the message format is set to Avro</strong>. If you’re working with JSON messages, you can safely skip this section.</p>
<h4 id="heading-advanced-json-settings">Advanced JSON Settings</h4>
<p>For JSON messages, the plugin provides advanced controls to help manage complex and deeply nested structures:</p>
<ul>
<li><p><strong>Maximum JSON flatten depth</strong><br />  Controls how deeply nested objects are flattened.</p>
</li>
<li><p><strong>JSON field limit</strong><br />  Sets the maximum number of fields that will be expanded.</p>
</li>
</ul>
<p>These settings keep dashboards readable, prevent overly wide tables, and help protect Grafana from heavy data loads.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766099204254/1f317451-1336-4512-af8a-f92d05498d5a.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766099277422/a7a2f302-3a5c-4ce8-9c36-5fac4009251f.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-key-query-editor-fields">Key Query Editor Fields</h3>
<p>Once the data source is configured, you’ll use the query editor to define what data you want to read from Kafka and how it should appear in Grafana.</p>
<ul>
<li><p><strong>Topic</strong> — Enter the Kafka topic name. You can click <code>Fetch</code> to retrieve the available partitions for that topic.</p>
</li>
<li><p><strong>Partition</strong> — Choose <code>All partitions</code> or a specific partition.</p>
</li>
<li><p><strong>Offset</strong> — Select the starting position for the query:</p>
<ul>
<li><p><code>Latest</code> — read the newest messages.</p>
</li>
<li><p><code>Last N messages</code> — set N to read the most recent N messages.</p>
</li>
<li><p><code>Earliest</code> — start from the earliest available offset.</p>
</li>
</ul>
</li>
<li><p><strong>Timestamp Mode</strong> — Choose the timestamp for Grafana points:</p>
<ul>
<li><p><code>Kafka Event Time</code> — uses Kafka message metadata timestamp.</p>
</li>
<li><p><code>Dashboard received time</code> — uses the time Grafana receives the message.</p>
</li>
</ul>
</li>
<li><p><strong>Message Format</strong> — Set to <code>JSON</code> for JSON message parsing.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766099914314/48a6091d-2757-432f-b4a6-8fae7df90c27.png" alt class="image--center mx-auto" /></p>
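<p>Under the hood, the offset modes boil down to simple arithmetic against the partition's earliest and latest offsets. Here is a rough conceptual sketch in Python (the function and parameter names are illustrative, not the plugin's actual code):</p>
<pre><code class="lang-python">def resolve_start_offset(mode, beginning, end, n=0):
    """Return the offset a consumer would start reading from.

    beginning and end are the partition's earliest retained offset and
    the next offset to be written, as reported by the broker.
    """
    if mode == "earliest":
        return beginning
    if mode == "latest":
        return end  # only messages produced after the query starts
    if mode == "last-n":
        # at most the n most recent messages, never before retention start
        return max(end - n, beginning)
    raise ValueError("unknown mode: " + mode)
</code></pre>
<p>For a partition with earliest offset 100 and latest offset 250, <code>Last N messages</code> with N = 50 starts at offset 200, and asking for more messages than are retained simply falls back to the earliest available offset.</p>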
<h3 id="heading-mapping-json-to-grafana">Mapping JSON to Grafana</h3>
<p>When the <strong>Message Format</strong> is set to JSON, the plugin parses the Kafka record value as a JSON object and prepares it for visualization in Grafana. To make nested structures usable in dashboards, the plugin automatically <strong>flattens nested JSON into dot-delimited fields</strong>.</p>
<h4 id="heading-example-json-message">Example JSON Message</h4>
<pre><code class="lang-json">{
  <span class="hljs-attr">"service"</span>: {
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"api"</span>,
    <span class="hljs-attr">"latency"</span>: {
      <span class="hljs-attr">"p50"</span>: <span class="hljs-number">23</span>,
      <span class="hljs-attr">"p95"</span>: <span class="hljs-number">75</span>
    }
  },
  <span class="hljs-attr">"host"</span>: <span class="hljs-string">"node-2"</span>,
  <span class="hljs-attr">"requests"</span>: <span class="hljs-number">128</span>
}
</code></pre>
<h4 id="heading-after-flattening">After Flattening</h4>
<p>The JSON above is transformed into the following fields:</p>
<ul>
<li><p><code>service.name</code> → <code>"api"</code></p>
</li>
<li><p><code>service.latency.p50</code> → <code>23</code> <em>(numeric)</em></p>
</li>
<li><p><code>service.latency.p95</code> → <code>75</code> <em>(numeric)</em></p>
</li>
<li><p><code>host</code> → <code>"node-2"</code></p>
</li>
<li><p><code>requests</code> → <code>128</code> <em>(numeric)</em></p>
</li>
</ul>
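<p>The flattening itself is easy to picture in code. Here is a minimal Python sketch of dot-delimited flattening with a depth cap analogous to the <strong>Maximum JSON flatten depth</strong> setting (the plugin is written in Go; this is only an illustration, and the names are mine):</p>
<pre><code class="lang-python">def flatten(obj, max_depth=5, prefix="", depth=0):
    """Flatten nested dicts into dot-delimited field names."""
    fields = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict) and max_depth > depth:
            # descend one level and prefix children with "parent."
            fields.update(flatten(value, max_depth, name + ".", depth + 1))
        else:
            # leaf value, or depth cap reached: emit as-is
            fields[name] = value
    return fields
</code></pre>
<p>Applied to the example message, this produces exactly the fields listed above. The real depth and field-limit settings bound the same kind of recursion so that a pathological message can't explode into thousands of columns.</p>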
<h4 id="heading-how-grafana-uses-these-fields">How Grafana Uses These Fields</h4>
<ul>
<li><p><strong>Numeric fields</strong> (such as latency percentiles or request counts) are ideal for <strong>time series visualizations</strong> like line charts.</p>
</li>
<li><p><strong>String or non-numeric fields</strong> work best in <strong>table panels</strong> or as labels and metadata.</p>
</li>
</ul>
<p>This mapping lets you turn raw Kafka messages into meaningful dashboards, and you can apply Grafana transformations such as group-by or filtering to shape the data into exactly what you need.</p>
<h2 id="heading-example-workflow">Example Workflow</h2>
<p>Let’s walk through a simple case study to see how everything comes together.</p>
<h3 id="heading-step-1-produce-sample-data">Step 1: Produce Sample Data</h3>
<p>Start by producing some sample JSON messages to a Kafka topic (for example, <code>test</code>). You can use the <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource/tree/main/example#go">example producer</a> provided in the repository to generate messages like the one below:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"alerts"</span>: [
    {
      <span class="hljs-attr">"severity"</span>: <span class="hljs-string">"warning"</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"cpu_high"</span>,
      <span class="hljs-attr">"value"</span>: <span class="hljs-number">25.867063332016414</span>
    },
    {
      <span class="hljs-attr">"severity"</span>: <span class="hljs-string">"info"</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"mem_low"</span>,
      <span class="hljs-attr">"value"</span>: <span class="hljs-number">55.72157456801047</span>
    }
  ],
  <span class="hljs-attr">"host"</span>: {
    <span class="hljs-attr">"ip"</span>: <span class="hljs-string">"127.0.0.1"</span>,
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"srv-01"</span>
  },
  <span class="hljs-attr">"metrics"</span>: {
    <span class="hljs-attr">"cpu"</span>: {
      <span class="hljs-attr">"load"</span>: <span class="hljs-number">0.25867063332016416</span>,
      <span class="hljs-attr">"temp"</span>: <span class="hljs-number">65.21794911901341</span>
    },
    <span class="hljs-attr">"mem"</span>: {
      <span class="hljs-attr">"free"</span>: <span class="hljs-number">9065</span>,
      <span class="hljs-attr">"used"</span>: <span class="hljs-number">2767</span>
    }
  },
  <span class="hljs-attr">"processes"</span>: [
    <span class="hljs-string">"nginx"</span>,
    <span class="hljs-string">"mysql"</span>,
    <span class="hljs-string">"redis"</span>
  ],
  <span class="hljs-attr">"tags"</span>: [
    <span class="hljs-string">"prod"</span>,
    <span class="hljs-string">"edge"</span>
  ],
  <span class="hljs-attr">"value1"</span>: <span class="hljs-number">0.25867063332016416</span>,
  <span class="hljs-attr">"value2"</span>: <span class="hljs-number">1.1144314913602094</span>
}
</code></pre>
<p>This message includes a mix of nested objects, arrays, numeric values, and strings, making it a fairly realistic example of what Kafka messages often look like in practice.</p>
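<p>If you'd rather not run the Go producer, a stdlib-only Python sketch that emits equivalent payloads looks like this (the <code>kcat</code> pipeline in the comment is just one way to get the lines into Kafka; any console producer works):</p>
<pre><code class="lang-python">import json
import random

def build_sample_message():
    """Build a payload shaped like the example above, with random values."""
    cpu = random.random()
    return {
        "alerts": [
            {"severity": "warning", "type": "cpu_high", "value": cpu * 100},
            {"severity": "info", "type": "mem_low", "value": random.uniform(40, 60)},
        ],
        "host": {"ip": "127.0.0.1", "name": "srv-01"},
        "metrics": {
            "cpu": {"load": cpu, "temp": random.uniform(50, 80)},
            "mem": {"free": random.randint(1000, 16000),
                    "used": random.randint(1000, 16000)},
        },
        "processes": ["nginx", "mysql", "redis"],
        "tags": ["prod", "edge"],
        "value1": cpu,
        "value2": random.uniform(0, 2),
    }

if __name__ == "__main__":
    # One JSON message per line; pipe into a console producer, e.g.:
    #   python producer.py | kcat -P -b localhost:9092 -t test
    print(json.dumps(build_sample_message()))
</code></pre>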
<h3 id="heading-step-2-explore-the-data-in-grafana">Step 2: Explore the Data in Grafana</h3>
<p>Next, open the <strong>Explore</strong> page in your Grafana instance. Select the <strong>Kafka data source</strong> and configure the query editor with the following values:</p>
<ul>
<li><p><strong>Topic</strong>: <code>test</code></p>
</li>
<li><p><strong>Partition</strong>: All partitions</p>
</li>
<li><p><strong>Offset</strong>: Latest</p>
</li>
<li><p><strong>Timestamp Mode</strong>: Kafka Event Time</p>
</li>
<li><p><strong>Message Format</strong>: JSON</p>
</li>
</ul>
<p>Once the query is running, you should see <strong>live data streaming into Grafana as a table</strong>. Nested JSON fields will be flattened automatically, making it easy to inspect both numeric and non-numeric values.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766102645426/c8d5e8bb-eedd-4685-9f4f-3791f32edfd6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-3-visualize-the-data-with-dashboards">Step 3: Visualize the Data with Dashboards</h3>
<p>To take this a step further, you can use the <a target="_blank" href="https://github.com/hoptical/grafana-kafka-datasource/tree/main/provisioning/dashboards"><strong>provisioned dashboard</strong></a> included in the repository to visualize the same Kafka data using multiple panels.</p>
<p>This dashboard demonstrates how different parts of the message can be visualized:</p>
<ul>
<li><p>numeric fields as time series</p>
</li>
<li><p>structured data in tables</p>
</li>
<li><p>multiple signals derived from the same Kafka topic</p>
</li>
</ul>
<p>It’s a simple example, but it highlights the core idea: Kafka data can be treated just like any other data source in Grafana and combined with existing dashboards and observability signals.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766103200480/42957da2-43b9-494b-93ae-a0a7591e4b5e.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kafka plays a central role in many modern data platforms, but working with its data often requires jumping between specialized tools. At the same time, Grafana has become the place where teams come together to observe, correlate, and understand how their systems behave. The Kafka Data Source plugin is designed to bridge that gap.</p>
<p>By making Kafka data available directly in Grafana, the plugin allows you to visualize streaming data alongside existing dashboards, correlate it with metrics and alerts, and gain insights without leaving the tools your organization already relies on. It doesn’t replace Kafka-native tools; instead, it complements them by bringing Kafka data into the broader observability picture.</p>
<p>In this article, we focused on visualizing JSON messages and building simple, real-time workflows. In upcoming articles, we’ll explore more advanced features such as Avro support, schema registry integration, and additional use cases that build on the same core ideas.</p>
<h2 id="heading-your-turn">Your Turn</h2>
<ul>
<li><p>How would visualizing Kafka data in Grafana change the way your team monitors and understands your systems?</p>
</li>
<li><p>What challenges have you faced when working with Kafka data, and how do you currently deal with them?</p>
</li>
<li><p>What’s the first Kafka use case you’d try visualizing in Grafana?</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Securing Kafka: Demystifying SASL, SSL, and Authentication Essentials]]></title><description><![CDATA[The first time I had to secure a Kafka cluster, I felt like I was drowning in acronyms: SASL, SSL, SCRAM, Kerberos… The more I delved into the documentation, the more confused I became.
If you’ve felt the same, don’t worry! You’re not alone. Kafka se...]]></description><link>https://hamedkarbasi.com/securing-kafka-demystifying-sasl-ssl-and-authentication-essentials</link><guid isPermaLink="true">https://hamedkarbasi.com/securing-kafka-demystifying-sasl-ssl-and-authentication-essentials</guid><category><![CDATA[kafka]]></category><category><![CDATA[Security]]></category><category><![CDATA[SSL]]></category><category><![CDATA[SSL/TLS]]></category><category><![CDATA[Apache Kafka]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Sun, 31 Aug 2025 14:36:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/7k-o-54prMI/upload/f2a552de95d842db9f47cbce55809395.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first time I had to secure a Kafka cluster, I felt like I was drowning in acronyms: SASL, SSL, SCRAM, Kerberos… The more I delved into the <a target="_blank" href="https://kafka.apache.org/documentation/#security">documentation</a>, the more confused I became.</p>
<p>If you’ve felt the same, don’t worry! You’re not alone. Kafka security <em>sounds</em> more complicated than it is. Once you separate <strong>protocols</strong> (the road your data travels on) from <strong>mechanisms</strong> (how you prove who you are), it starts making sense.</p>
<p>At each step, I'll break down the jargon and provide practical configuration examples.</p>
<h2 id="heading-why-kafka-security-even-matters">Why Kafka Security Even Matters</h2>
<p>Kafka often sits at the heart of an organization’s data pipeline. It’s where clickstream data, financial transactions, and system logs all flow.</p>
<p>If it’s left unsecured:</p>
<ul>
<li><p>Anyone could publish garbage messages.</p>
</li>
<li><p>Attackers could silently read sensitive data.</p>
</li>
<li><p>A bad actor could impersonate a legitimate client.</p>
</li>
</ul>
<p>That’s why Kafka bakes in a flexible security model. You just have to piece together the building blocks.</p>
<hr />
<h2 id="heading-protocols-vs-mechanisms-a-simple-analogy">Protocols vs. Mechanisms: A Simple Analogy</h2>
<p>Here’s the mental model that helped me:</p>
<ul>
<li><p><strong>Protocol = the road.</strong> Is it safe to drive on? Is it fenced off? Or is it a wide-open dirt path anyone can walk onto?</p>
</li>
<li><p><strong>Mechanism = the toll booth.</strong> How do you prove you belong there? Show an ID, enter a password, flash a badge?</p>
</li>
</ul>
<p>Kafka gives you both:</p>
<ul>
<li><p><strong>SSL/TLS</strong> = the secure road (encrypted tunnel).</p>
</li>
<li><p><strong>SASL</strong> = the toll booth framework (authentication).</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756649641688/ea2915d8-dcec-451f-a4af-5415582b6ea2.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-ssltls-the-secure-tunnel">SSL/TLS: The Secure Tunnel</h2>
<p>SSL/TLS makes sure no one can peek into or tamper with the messages flowing between your clients and brokers.</p>
<p>How it works in practice:</p>
<ol>
<li><p>The Kafka broker shows a certificate, proving it’s legit.</p>
</li>
<li><p>The client checks if the broker’s certificate can be trusted.</p>
</li>
<li><p>Optionally, the client shows its own certificate back (mutual TLS).</p>
</li>
</ol>
<p>Here’s what that looks like in Kafka broker config:</p>
<pre><code class="lang-ini"><span class="hljs-attr">listeners</span>=SSL://:<span class="hljs-number">9092</span>
<span class="hljs-attr">ssl.keystore.location</span>=/etc/kafka/secrets/kafka.server.keystore.jks
<span class="hljs-attr">ssl.keystore.password</span>=YOUR_PASSWORD
<span class="hljs-attr">ssl.key.password</span>=YOUR_PASSWORD
<span class="hljs-attr">ssl.truststore.location</span>=/etc/kafka/secrets/kafka.server.truststore.jks
<span class="hljs-attr">ssl.truststore.password</span>=YOUR_PASSWORD
</code></pre>
<p>From then on, all traffic over port <code>9092</code> is encrypted.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756650416142/ec083226-f8d8-4807-a9e2-18ed1a897ad9.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-what-are-the-keystore-and-truststore">What are the Keystore and Truststore?</h3>
<ul>
<li><p><strong>Keystore</strong>: Think of it as your ID wallet. It holds your private key and certificate that prove who you are (the broker in this case).</p>
</li>
<li><p><strong>Truststore</strong>: This is your list of trusted IDs. It contains certificates from other parties you trust (like clients or certificate authorities).</p>
</li>
</ul>
<p>In the example above, the broker uses its keystore to prove its identity to clients, and its truststore to verify client certificates when mutual TLS is enabled. In short, the keystore is how you prove who you are; the truststore is how you decide whom to trust.</p>
<h3 id="heading-what-are-those-jks-files">What are those JKS files?</h3>
<p><code>.jks</code> files are Java KeyStores, a file format for storing cryptographic keys and certificates. In Kafka, they are used for SSL/TLS encryption to secure communication between clients and brokers. Other ecosystems commonly use formats like PEM or PFX; Kafka has traditionally relied on JKS due to its Java foundation, although recent Kafka releases can also be configured with PEM files directly.</p>
<p>You can generate JKS files using Java's <code>keytool</code> utility. For a step-by-step guide, check out <a target="_blank" href="https://dzone.com/articles/keytool-commandutility-to-generate-a-keystorecerti">this resource</a>.</p>
<h3 id="heading-understanding-ssl-vs-tls-whats-the-difference"><strong>Understanding SSL vs. TLS: What's the Difference?</strong></h3>
<p>SSL (Secure Sockets Layer) and TLS (Transport Layer Security) are like the older and younger siblings in the world of secure communication protocols. While they often get mentioned together, they have some important differences:</p>
<ol>
<li><p><strong>Evolution</strong>: Think of SSL as the trailblazer. It paved the way, but TLS took over with more advanced versions. SSL 3.0 was the last of its kind, and then TLS 1.0 came along, evolving through versions 1.1, 1.2, and now the latest, 1.3, each bringing stronger security features.</p>
</li>
<li><p><strong>Security Enhancements</strong>: TLS is like a fortified castle compared to SSL's wooden fort. It uses stronger encryption algorithms and more secure hash functions, making it a tough nut to crack against modern cyber threats.</p>
</li>
<li><p><strong>Deprecation of SSL</strong>: SSL has had its day in the sun, but due to vulnerabilities, it's now considered outdated and insecure. Most systems have moved on to TLS to keep data safe and sound.</p>
</li>
<li><p><strong>Handshake Process</strong>: Both protocols use a handshake to establish a secure connection, but TLS does it with more finesse. Its handshakes are more efficient and secure, especially in the latest versions.</p>
</li>
<li><p><strong>Backward Compatibility</strong>: TLS is designed to be backward compatible with SSL, allowing systems to support both during transitions. However, it's wise to disable SSL support to avoid potential security risks.</p>
</li>
</ol>
<p>In a nutshell, while SSL laid the groundwork, TLS is the modern standard that offers enhanced security and performance. When securing Kafka or any other system, using the latest version of TLS is crucial to fend off vulnerabilities.</p>
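<p>This deprecation story is visible directly in everyday TLS APIs. As an illustration (using Python's standard <code>ssl</code> module, not a Kafka client), pinning a minimum protocol version is a one-liner, and certificate verification is on by default, much like a Kafka client with a truststore:</p>
<pre><code class="lang-python">import ssl

# Client-side context that trusts the system CA store by default
ctx = ssl.create_default_context()

# Refuse anything older than TLS 1.2; SSL 3.0 and TLS 1.0/1.1 are
# considered broken and should stay disabled
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Certificate and hostname verification are enabled out of the box
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname
</code></pre>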
<hr />
<h2 id="heading-sasl-the-authentication-framework">SASL: The Authentication Framework</h2>
<p>SASL (Simple Authentication and Security Layer) isn’t encryption; it’s a framework for checking identities. You can run SASL on:</p>
<ul>
<li><p><strong>PLAINTEXT</strong> (not secure, don’t use in production).</p>
</li>
<li><p><strong>SSL</strong> (secure, because SSL wraps the authentication exchange in encryption).</p>
</li>
</ul>
<p>So in real deployments, you’ll almost always see <strong>SASL_SSL</strong>.</p>
<h3 id="heading-sasl-mechanisms-in-kafka">SASL Mechanisms in Kafka</h3>
<p>Here are the main authentication options you can plug into SASL:</p>
<ul>
<li><p><strong>PLAIN</strong></p>
<ul>
<li><p>Username + password in clear text.</p>
</li>
<li><p>Fine for quick tests, unsafe in production.</p>
</li>
</ul>
</li>
<li><p><strong>SCRAM (Salted Challenge Response Authentication Mechanism)</strong></p>
<ul>
<li><p>Like PLAIN but more secure.</p>
</li>
<li><p>Passwords are stored salted + hashed.</p>
</li>
<li><p>Production-friendly.</p>
</li>
</ul>
</li>
<li><p><strong>GSSAPI (Kerberos)</strong></p>
<ul>
<li><p>Great if your company already uses Kerberos.</p>
</li>
<li><p>Heavyweight, but enterprise-grade.</p>
</li>
</ul>
</li>
<li><p><strong>OAUTHBEARER</strong></p>
<ul>
<li><p>Uses OAuth 2.0 tokens.</p>
</li>
<li><p>Perfect for integrating with modern identity providers (Okta, Keycloak, etc.).</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756650550968/8ffc6da4-39d8-48f8-8bbf-f9cab6c875ac.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-example-saslplaintext-with-plain">Example: SASL_PLAINTEXT with PLAIN</h4>
<p>Let's say you want to set up a quick test environment using SASL_PLAINTEXT with the PLAIN mechanism. Here's how you can configure both the Kafka broker and client.</p>
<p>Broker config:</p>
<pre><code class="lang-ini"><span class="hljs-attr">listeners</span>=SASL_PLAINTEXT://:<span class="hljs-number">9092</span>
<span class="hljs-attr">sasl.enabled.mechanisms</span>=PLAIN
<span class="hljs-attr">sasl.mechanism.inter.broker.protocol</span>=PLAIN
<span class="hljs-attr">security.inter.broker.protocol</span>=SASL_PLAINTEXT
</code></pre>
<p><strong>Wait! inter.broker.protocol? What is that?</strong></p>
<p>This setting defines how Kafka brokers authenticate with each other. In a multi-broker setup, brokers need to communicate securely. By setting <code>sasl.mechanism.inter.broker.protocol=PLAIN</code>, you're specifying that brokers will use the PLAIN mechanism for their internal communication. Of course, you can choose the same mechanism as clients or a different one, depending on your security requirements.</p>
<h4 id="heading-example-saslplaintext-with-scram">Example: SASL_PLAINTEXT with SCRAM</h4>
<p>This is a step up from PLAIN, adding better password security:</p>
<p>Broker config:</p>
<pre><code class="lang-ini"><span class="hljs-attr">listeners</span>=SASL_PLAINTEXT://:<span class="hljs-number">9092</span>
<span class="hljs-attr">sasl.enabled.mechanisms</span>=SCRAM-SHA-<span class="hljs-number">512</span>
<span class="hljs-attr">sasl.mechanism.inter.broker.protocol</span>=SCRAM-SHA-<span class="hljs-number">512</span>
<span class="hljs-attr">security.inter.broker.protocol</span>=SASL_PLAINTEXT
</code></pre>
<p><strong>How is SCRAM more secure than PLAIN?</strong></p>
<p>SCRAM enhances security over PLAIN in several key ways:</p>
<ol>
<li><p><strong>Password Storage</strong>: In PLAIN, passwords are stored in clear text, making them vulnerable if the storage is compromised. SCRAM stores passwords in a salted and hashed format, which means even if someone gains access to the stored credentials, they can't easily retrieve the original passwords.</p>
</li>
<li><p><strong>Challenge-Response Mechanism</strong>: SCRAM uses a challenge-response mechanism during authentication. This means that the password is never sent over the network in clear text. Instead, a challenge is issued, and the client responds with a hashed version of the password combined with the challenge, making it much harder for attackers to intercept and misuse the password.</p>
</li>
<li><p><strong>Salting</strong>: SCRAM adds a unique salt to each password before hashing it. This means that even if two users have the same password, their stored hashes will be different, protecting against rainbow table attacks.</p>
</li>
<li><p><strong>Iterative Hashing</strong>: SCRAM allows for multiple iterations of hashing, which increases the time it takes to compute the hash. This makes brute-force attacks significantly more difficult, as attackers would need to spend more time and resources to guess passwords.</p>
</li>
</ol>
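<p>The salting and iterative-hashing ideas are easy to demonstrate with Python's standard library. This is a conceptual sketch of the storage scheme, not Kafka's actual SCRAM implementation (which follows RFC 5802 and uses a PBKDF2-style construction for its <code>Hi</code> function):</p>
<pre><code class="lang-python">import hashlib
import os

def store_password(password, iterations=4096):
    """Return (salt, hash): what a SCRAM-style server stores, never the password."""
    salt = os.urandom(16)  # unique per user
    digest = hashlib.pbkdf2_hmac("sha512", password.encode(), salt, iterations)
    return salt, digest

def verify(password, salt, digest, iterations=4096):
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt, iterations) == digest

# Two users with the same password end up with different stored hashes
# (salting), which defeats precomputed rainbow tables
salt_a, hash_a = store_password("hunter2")
salt_b, hash_b = store_password("hunter2")
assert hash_a != hash_b
assert verify("hunter2", salt_a, hash_a)
assert not verify("wrong", salt_a, hash_a)
</code></pre>
<p>Raising the iteration count makes each guess proportionally more expensive for an attacker, which is exactly the brute-force defense described above.</p>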
<hr />
<h2 id="heading-put-them-all-together-saslssl">Putting It All Together: SASL_SSL</h2>
<p>This combo is one of the most common in production because it’s secure without being overcomplicated.</p>
<p>Broker config:</p>
<pre><code class="lang-ini"><span class="hljs-attr">listeners</span>=SASL_SSL://:<span class="hljs-number">9094</span>
<span class="hljs-attr">sasl.enabled.mechanisms</span>=SCRAM-SHA-<span class="hljs-number">512</span>
<span class="hljs-attr">sasl.mechanism.inter.broker.protocol</span>=SCRAM-SHA-<span class="hljs-number">512</span>
<span class="hljs-attr">security.inter.broker.protocol</span>=SASL_SSL
<span class="hljs-attr">ssl.keystore.location</span>=/etc/kafka/secrets/kafka.server.keystore.jks
<span class="hljs-attr">ssl.keystore.password</span>=YOUR_PASSWORD
<span class="hljs-attr">ssl.key.password</span>=YOUR_PASSWORD
<span class="hljs-attr">ssl.truststore.location</span>=/etc/kafka/secrets/kafka.server.truststore.jks
<span class="hljs-attr">ssl.truststore.password</span>=YOUR_PASSWORD
</code></pre>
<p>Client config:</p>
<pre><code class="lang-ini"><span class="hljs-attr">security.protocol</span>=SASL_SSL
<span class="hljs-attr">sasl.mechanism</span>=SCRAM-SHA-<span class="hljs-number">512</span>
<span class="hljs-attr">sasl.jaas.config</span>=org.apache.kafka.common.security.scram.ScramLoginModule required \
    <span class="hljs-attr">username</span>=<span class="hljs-string">"alice"</span> \
    <span class="hljs-attr">password</span>=<span class="hljs-string">"alice-secret"</span><span class="hljs-comment">;</span>
</code></pre>
<p>Result:</p>
<ul>
<li><p>All traffic is encrypted.</p>
</li>
<li><p>Only clients with valid SCRAM credentials get in.</p>
</li>
</ul>
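<p>The same client settings carry over almost one-to-one into application code. For example, with the <code>kafka-python</code> library (the dictionary keys below are that library's parameter names; the paths and credentials are placeholders):</p>
<pre><code class="lang-python"># Mirrors the ini-style client config above; pass to kafka-python as
# KafkaProducer(**conf) or KafkaConsumer("my-topic", **conf)
conf = {
    "bootstrap_servers": "kafka:9094",
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "SCRAM-SHA-512",
    "sasl_plain_username": "alice",
    "sasl_plain_password": "alice-secret",
    # CA certificate used to verify the broker (PEM on the Python side, not JKS)
    "ssl_cafile": "/etc/kafka/secrets/ca.pem",
}
</code></pre>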
<h3 id="heading-docker-compose-example">Docker Compose Example</h3>
<p>Here is a full example using Docker Compose with Bitnami's Kafka image, which exposes three listeners out of the box: PLAINTEXT, SASL_PLAINTEXT with SCRAM, and SASL_SSL with SCRAM. This setup includes SSL certificates and user credentials. For more details, check out <a target="_blank" href="https://github.com/hoptical/kafka-docker-example/tree/main">this GitHub repo</a>, where I explain how to generate the certificates and run test clients against the different mechanisms and protocols.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">kafka:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">bitnami/kafka:3.7</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kafka</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_KRAFT_MODE=true</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_PROCESS_ROLES=broker,controller</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_NODE_ID=1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=1@kafka:9093</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,SASL_PLAINTEXT://:29092,SASL_SSL://:39092,CONTROLLER://:9093</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,SASL_PLAINTEXT://kafka:29092,SASL_SSL://kafka:39092</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=SASL_PLAINTEXT:SASL_PLAINTEXT,PLAINTEXT:PLAINTEXT,SASL_SSL:SASL_SSL,CONTROLLER:PLAINTEXT</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SSL_CLIENT_AUTH=required</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SSL_KEYSTORE_LOCATION=/bitnami/kafka/config/certs/kafka.keystore.jks</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SSL_KEYSTORE_PASSWORD=bitnami123</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SSL_KEY_PASSWORD=bitnami123</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SSL_TRUSTSTORE_LOCATION=/bitnami/kafka/config/certs/kafka.truststore.jks</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SSL_TRUSTSTORE_PASSWORD=bitnami123</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SASL_ENABLED_MECHANISMS=SCRAM-SHA-512</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SASL_MECHANISM_INTER_BROKER_PROTOCOL=SCRAM-SHA-512</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_INTER_BROKER_LISTENER_NAME=SASL_PLAINTEXT</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CLIENT_USERS=testuser</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CLIENT_PASSWORDS=testpass</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_INTER_BROKER_USER=admin</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_INTER_BROKER_PASSWORD=adminpass</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_SUPER_USERS=User:admin</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE=true</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">ALLOW_PLAINTEXT_LISTENER=yes</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">KAFKA_CFG_CLUSTER_ID=VkGbvjMzQNKtC-P_RMzqgg</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9092:9092"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"29092:29092"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"39092:39092"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./certs:/bitnami/kafka/config/certs:ro</span>
</code></pre>
<h4 id="heading-what-are-those-certs">What are those certs?</h4>
<p>The <code>certs</code> directory contains the necessary SSL certificates and keystores for secure communication. You can generate these using tools like OpenSSL or Java's keytool.</p>
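<p>A minimal sketch of that generation step, assuming OpenSSL (and optionally a JDK for <code>keytool</code>) is installed. The truststore filename and password match the compose file above; a complete setup would also create a broker keystore signed by this CA:</p>

```shell
# Sketch: create a self-signed CA with OpenSSL, then (if a JDK is available)
# import it into a JKS truststore with keytool. Subject name is a placeholder.
mkdir -p certs
openssl req -x509 -newkey rsa:2048 -days 365 -nodes \
  -keyout certs/ca.key -out certs/ca.crt -subj "/CN=kafka-ca"
# keytool ships with the JDK; skip this step gracefully if it is absent
if command -v keytool >/dev/null 2>&1; then
  keytool -importcert -noprompt -alias ca -file certs/ca.crt \
    -keystore certs/kafka.truststore.jks -storepass bitnami123
fi
```

<p>Mount the resulting <code>certs</code> directory into the container as shown in the <code>volumes</code> section above.</p>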
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<h3 id="heading-common-real-world-patterns">Common Real-World Patterns</h3>
<ul>
<li><p><strong>SSL only</strong> → Secure channel, optional client certs.</p>
</li>
<li><p><strong>SASL_SSL + SCRAM</strong> → The sweet spot for many teams.</p>
</li>
<li><p><strong>SASL_SSL + Kerberos/OAUTHBEARER</strong> → Enterprises with Kerberos in place.</p>
</li>
</ul>
<h3 id="heading-best-practices-learned-the-hard-way">Best Practices (Learned the Hard Way)</h3>
<ul>
<li><p>Avoid using SASL_PLAINTEXT outside of development environments unless you are certain the network channel is secure.</p>
</li>
<li><p>Rotate credentials and certificates regularly.</p>
</li>
<li><p>Prefer <strong>SCRAM</strong> or <strong>OAUTHBEARER</strong> over PLAIN for authentication.</p>
</li>
<li><p>Use Kafka ACLs to enforce the least privilege.</p>
</li>
</ul>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Securing Kafka might seem daunting at first because of the myriad acronyms and configurations, but breaking it down into its core components, protocols and mechanisms, simplifies the process. Understand SSL/TLS as the secure channel and SASL as the authentication framework, and protecting your cluster becomes a matter of picking the right combination and following the basics: prefer SCRAM or OAUTHBEARER for authentication, rotate credentials and certificates regularly, and enforce least privilege with Kafka ACLs.</p>
<p>The trick to understanding Kafka security is simple:</p>
<ul>
<li><p><strong>SSL/TLS</strong> = the secure pipe (encryption).</p>
</li>
<li><p><strong>SASL</strong> = the framework for authentication.</p>
</li>
<li><p><strong>Mechanisms</strong> = the actual way you prove identity.</p>
</li>
</ul>
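<p>From the client's point of view, those three pieces map to a handful of settings. A hedged example of a client configuration for SASL_SSL with SCRAM-SHA-512 (host, paths, and credentials are placeholders):</p>

```properties
# client.properties: SASL_SSL with SCRAM-SHA-512 (placeholder values)
bootstrap.servers=broker.example.com:9093
# the secure pipe: TLS encryption plus SASL authentication
security.protocol=SASL_SSL
# the mechanism: how the client proves its identity
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="testuser" password="testpass";
ssl.truststore.location=/etc/kafka/certs/kafka.truststore.jks
ssl.truststore.password=bitnami123
```

<p>Swap <code>sasl.mechanism</code> and the login module to change how identity is proven, without touching the encrypted channel.</p>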
<p>Once you see it this way, the acronym soup clears up, and securing Kafka stops feeling like dark magic.</p>
]]></content:encoded></item><item><title><![CDATA[Harnessing Ceph S3 with Kubernetes: An Operator Solution]]></title><description><![CDATA[Photo by Sixteen Miles Out on Unsplash
Introduction
Ceph & Rados Gateway
Ceph is an open-source, distributed storage system designed for scalability and reliability, providing unified solutions for block, file, and object storage under a single platf...]]></description><link>https://hamedkarbasi.com/harnessing-ceph-s3-with-kubernetes-an-operator-solution-005b1103c753</link><guid isPermaLink="true">https://hamedkarbasi.com/harnessing-ceph-s3-with-kubernetes-an-operator-solution-005b1103c753</guid><category><![CDATA[ceph]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[S3]]></category><category><![CDATA[Amazon S3]]></category><category><![CDATA[kubernetes-operators]]></category><category><![CDATA[openshift]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Fri, 05 Apr 2024 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383453458/345a99de-e8fc-45cb-b5ff-74f5db833ae9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a target="_blank" href="https://unsplash.com/@sixteenmilesout?utm_source=medium&amp;utm_medium=referral">Sixteen Miles Out</a> on <a target="_blank" href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<h2 id="heading-introduction">Introduction</h2>
<h3 id="heading-ceph-amp-rados-gateway">Ceph &amp; Rados Gateway</h3>
<p><a target="_blank" href="https://ceph.io/en/">Ceph</a> is an open-source, distributed storage system designed for scalability and reliability, providing unified solutions for block, file, and object storage under a single platform. At its core lies the Reliable Autonomic Distributed Object Store (RADOS), which ensures data is stored fault-tolerant across multiple storage devices.</p>
<p><a target="_blank" href="https://docs.ceph.com/en/quincy/radosgw/">Ceph RGW</a> (Rados Gateway) is an integral part of Ceph that offers object storage accessible via RESTful APIs, compatible with <strong>Amazon S3</strong> and OpenStack Swift protocols. By leveraging Ceph's distributed architecture, RGW provides a scalable, high-performance platform for storing and retrieving unstructured data such as images, videos, and backups. It is a crucial component for applications demanding highly available and scalable object storage solutions.</p>
<h3 id="heading-ok-what-is-the-problem">OK! What is the problem?</h3>
<p>Nowadays, many private cloud providers utilize Ceph and <a target="_blank" href="https://docs.ceph.com/en/latest/cephadm/services/rgw/">RGW</a> to deliver S3-compatible storage to their users. However, in production, under the demands of many users, you (as the cloud administrator) end up with an overwhelming stream of requests about S3 user, bucket, and quota management:</p>
<ul>
<li><p>Create or delete my user on S3</p>
</li>
<li><p>Increase or decrease my user quota</p>
</li>
<li><p>Create or delete a bucket</p>
</li>
<li><p>Create sub-users and give fine-grained access to my bucket</p>
</li>
</ul>
<p>To ease this burden, we propose a Kubernetes-native solution that enables users to handle such requests independently, putting less load on administrators and more capability in users' hands.</p>
<h3 id="heading-kubernetes-operators">Kubernetes Operators</h3>
<p>Kubernetes Operators automate application management in Kubernetes environments. They are commonly developed with the <a target="_blank" href="https://sdk.operatorframework.io/">Operator SDK</a>, which simplifies creating and managing operators by encapsulating operational knowledge into code. This approach efficiently manages complex applications, significantly enhancing automation and operational efficiency. By extending Kubernetes' capabilities with custom resources, Operators offer a powerful method for automating deployment, scaling, and management tasks, streamlining the lifecycle of distributed systems.</p>
<h2 id="heading-enough-what-is-your-solution">Enough! What is your solution?</h2>
<p>We have developed the <a target="_blank" href="https://github.com/snapp-incubator/ceph-s3-operator"><strong><em>Ceph S3 Operator</em></strong></a><em>,</em> an open-source Kubernetes operator crafted to streamline the management of S3 users and buckets within a Ceph cluster environment. After installation, users can provision their S3User or S3Bucket with the Kubernetes objects, enabling them to utilize GitOps solutions like ArgoCD to manage S3 users or buckets.</p>
<h3 id="heading-features">Features</h3>
<ul>
<li><p>S3 User Management</p>
</li>
<li><p>Bucket Management</p>
</li>
<li><p>Subuser Support</p>
</li>
<li><p>Bucket policy Support</p>
</li>
<li><p>Quota Management</p>
</li>
<li><p>Webhook Integration</p>
</li>
<li><p>E2E Testing</p>
</li>
<li><p>Helm Chart and OLM Support</p>
</li>
</ul>
<h3 id="heading-ceph-s3-operator-vs-cosi">Ceph S3 Operator vs. COSI</h3>
<p>When managing users and buckets via Kubernetes Custom Resource Definitions (CRDs) on Ceph storage with an S3-compatible API, you may come across the <a target="_blank" href="https://container-object-storage-interface.github.io/">Container Object Storage Interface (COSI)</a> as an alternative solution. This raises the question: Why did we choose to develop a new operator from scratch instead of utilizing the COSI standard or its <a target="_blank" href="https://github.com/ceph/ceph-cosii">implementations</a>? Although COSI offers similar features, it has some significant limitations. COSI currently lacks support for essential features such as quota validation and bucket access policies, which are crucial for many teams to manage and control their storage resources effectively. By developing a dedicated operator, we aim to provide a more flexible and feature-rich solution that caters to a wider range of organizational needs.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Kubernetes v1.23.0+</p>
</li>
<li><p>Ceph v14.2.10+: prior Ceph versions <a target="_blank" href="https://github.com/ceph/ceph/pull/33714">don't support the sub-user bucket policy</a>. Nevertheless, other features are expected to work correctly within those earlier releases.</p>
</li>
<li><p><a target="_blank" href="https://github.com/snapp-incubator/ceph-s3-operator/blob/main/config/external-crd/openshift-clusterresourcequota.yaml">ClusterResourceQuota</a> CRD (already installed in OpenShift clusters): <code>kubectl apply -f config/external-crd</code></p>
</li>
</ul>
<h2 id="heading-installation">Installation</h2>
<h3 id="heading-using-operator-life-cycle-mhttpsolmoperatorframeworkioanager-olm">Using <a target="_blank" href="https://olm.operatorframework.io/">Operator Life Cycle M</a>anager (OLM):</h3>
<p>If you have already <a target="_blank" href="https://olm.operatorframework.io/docs/getting-started/">installed OLM</a> on your Kubernetes cluster, you can install the operator from <a target="_blank" href="https://operatorhub.io/operator/ceph-s3-operator">OperatorHub</a>:</p>
<ol>
<li>Create a <em>secret</em> containing your Ceph Cluster credentials:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">s3-operator-controller-manager-config-override</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">operators</span>

<span class="hljs-attr">stringData:</span>
  <span class="hljs-attr">config.yaml:</span> <span class="hljs-string">|
    s3UserClass: ceph-default
    clusterName: production
    validationWebhookTimeoutSeconds: 10
    rgw:
      endpoint: &lt;CEPH_HTTP_URL&gt;
      accessKey: &lt;S3_ADMIN_ACCESS_KEY&gt;
      secretKey: &lt;S3_ADMIN_SECRET_KEY&gt;
</span>
<span class="hljs-attr">type:</span> <span class="hljs-string">Opaque</span>
</code></pre>
<p>Remember to substitute the Ceph endpoint and the admin access and secret keys accordingly in the secret object above.</p>
<p>2. Create the <em>Subscription</em>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">operators.coreos.com/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Subscription</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ceph-s3-operator-subscription</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">operators</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">channel:</span> <span class="hljs-string">stable</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ceph-s3-operator</span>
  <span class="hljs-attr">source:</span> <span class="hljs-string">operatorhubio-catalog</span>
  <span class="hljs-attr">sourceNamespace:</span> <span class="hljs-string">olm</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">config</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">items:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">config.yaml</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">config.yaml</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">s3-operator-controller-manager-config-override</span>
    <span class="hljs-attr">volumeMounts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/s3-operator/config/</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">config</span>
</code></pre>
<p>3. Create the <em>Operator Group</em></p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">operators.coreos.com/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">OperatorGroup</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">global-operators</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">operators</span>
</code></pre>
<h3 id="heading-using-helm">Using Helm:</h3>
<ol>
<li>Create a <code>custom-values.yaml</code> file and specify your Ceph cluster configuration. You can also set other operator configurations, such as its resources, according to the chart's <a target="_blank" href="https://github.com/snapp-incubator/ceph-s3-operator/blob/main/charts/ceph-s3-operator/values.yaml">values.yaml</a>.</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-comment"># custom-values.yaml</span>
<span class="hljs-attr">controllerManagerConfig:</span>
  <span class="hljs-attr">configYaml:</span> <span class="hljs-string">|
    s3UserClass: ceph-default
    clusterName: production
    validationWebhookTimeoutSeconds: 10
    rgw:
      endpoint: &lt;CEPH_HTTP_URL&gt;
      accessKey: &lt;S3_ADMIN_ACCESS_KEY&gt;
      secretKey: &lt;S3_ADMIN_SECRET_KEY&gt;</span>
</code></pre>
<p>2. Install the helm chart:</p>
<pre><code class="lang-bash">helm upgrade --install ceph-s3-operator \
oci://ghcr.io/snapp-incubator/ceph-s3-operator/helm-charts/ceph-s3-operator \
--version v0.3.7 \
--values custom-values.yaml
</code></pre>
<p>Whether installed via OLM or Helm, if everything goes well, you should see the controller pod running:</p>
<pre><code class="lang-bash">kubectl get pods -n operators
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383447920/30bb8612-c2ad-476f-8229-b93de48d9327.png" alt /></p>
<h2 id="heading-usage">Usage</h2>
<h3 id="heading-s3userclaim">S3UserClaim</h3>
<ol>
<li>Because the operator validates S3 quotas against both the cluster resource quota (CRQ) and the namespace resource quota (RQ), we first need to create those objects with hard quotas defined by the parameters below:</li>
</ol>
<ul>
<li><p>s3/objects: Maximum number of objects the user can store</p>
</li>
<li><p>s3/size: Maximum total size of the data the user can store, in bytes</p>
</li>
<li><p>s3/buckets: Maximum number of buckets the user can create</p>
</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">quota.openshift.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterResourceQuota</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">myteam</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">quota:</span>
    <span class="hljs-attr">hard:</span>
      <span class="hljs-attr">s3/objects:</span> <span class="hljs-string">5k</span>
      <span class="hljs-attr">s3/size:</span> <span class="hljs-string">22k</span>
      <span class="hljs-attr">s3/buckets:</span> <span class="hljs-string">5k</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">labels:</span>
      <span class="hljs-attr">matchLabels:</span>
        <span class="hljs-attr">snappcloud.io/team:</span> <span class="hljs-string">myteam</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ResourceQuota</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">example-quota</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">s3-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hard:</span>
    <span class="hljs-attr">s3/objects:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">s3/size:</span> <span class="hljs-number">20000</span>
    <span class="hljs-attr">s3/buckets:</span> <span class="hljs-number">15</span>
</code></pre>
<p>2. Now, we need to add the <code>snappcloud.io/team</code> label specified in the previous step to the namespace where we plan to deploy the operator objects (<code>s3-test</code> here):</p>
<pre><code class="lang-bash">kubectl label namespace s3-test snappcloud.io/team=myteam
</code></pre>
<p>3. Create an S3UserClaim object. With the specified quota, this creates a user, a read-only sub-user, and two other sub-users on Ceph RGW.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">s3.snappcloud.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">S3UserClaim</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">s3userclaim-sample</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">s3-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">adminSecret:</span> <span class="hljs-string">s3-sample-admin-secret</span>
  <span class="hljs-attr">quota:</span>
    <span class="hljs-attr">maxBuckets:</span> <span class="hljs-number">5</span>
    <span class="hljs-attr">maxObjects:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">maxSize:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">readonlySecret:</span> <span class="hljs-string">s3-sample-readonly-secret</span>
  <span class="hljs-attr">s3UserClass:</span> <span class="hljs-string">ceph-default</span>
  <span class="hljs-attr">subusers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">subuser1</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">subuser2</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383449078/fdf3d17f-a5fe-4d65-876e-c78b54211335.png" alt /></p>
<p><em>s3UserClaim</em></p>
<p>The user and sub-user credentials are kept in the secrets:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383450176/b3abcbf8-9994-42be-8db9-3542b9e98d07.png" alt /></p>
<p><em>User and sub-user credentials</em></p>
<p>With the access and secret keys in the above secrets, we would have the following:</p>
<ul>
<li><p><code>s3-sample-admin-secret</code>: Credentials for the main user capable of creating a bucket under its tenant with full access.</p>
</li>
<li><p><code>s3-sample-readonly-secret</code>: Credentials for a read-only sub-user which can only read the buckets and objects in the tenant</p>
</li>
<li><p><code>s3userclaim-sample-subuserx</code>: Credentials for each extra sub-user (<code>subuser1</code> and <code>subuser2</code>). These sub-users can only list the buckets in the tenant and have no further access unless we grant it via the S3Bucket object (shown in the next section)</p>
</li>
</ul>
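<p>As with any Kubernetes Secret, the stored values are base64-encoded. With a cluster at hand you would read them with <code>kubectl</code> (the <code>accessKey</code> field name below is an assumption for illustration); the decoding step itself is plain <code>base64</code>:</p>

```shell
# Hypothetical read of the admin credentials (requires a running cluster):
#   kubectl get secret s3-sample-admin-secret -n s3-test \
#     -o jsonpath='{.data.accessKey}' | base64 -d
# The encode/decode round-trip itself, shown with a made-up key:
encoded=$(printf 'AKIAEXAMPLE' | base64)
echo "$encoded"                     # the form stored in the Secret
printf '%s' "$encoded" | base64 -d  # the plain access key
echo
```
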
<blockquote>
<p>Note: All user provisioning is handled through the S3UserClaim. An S3User object is created automatically by the S3UserClaim instance and is not meant to be created directly by users of the operator. With this claim/resource split, we've followed the same concept as the PersistentVolumeClaim (PVC) and PersistentVolume (PV).</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383451250/a20c3e44-55db-4cc8-8391-7f5845bbcbef.png" alt /></p>
<p><em>S3User</em></p>
<p>If you try to create or edit the S3UserClaim with a quota exceeding either the ClusterResourceQuota or the ResourceQuota, the change will be rejected by the operator webhook. For example, if I try to increase the s3userclaim-sample's <code>maxObjects</code> to 10K, which is greater than the corresponding values in the CRQ and RQ (5K and 1K), I will face the below error:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383452293/7d7b29d0-08d0-4926-9cc2-3bcd3c6834f2.png" alt /></p>
<p><em>The operator avoids the change due to the quota excess</em></p>
<h3 id="heading-s3bucket">S3Bucket</h3>
<p>Now, let's create an S3Bucket whose owner is the S3User we created in the previous step:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">s3.snappcloud.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">S3Bucket</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">s3bucket-sample</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">s3-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">s3DeletionPolicy:</span> <span class="hljs-string">delete</span>
  <span class="hljs-attr">s3UserRef:</span> <span class="hljs-string">s3userclaim-sample</span>
</code></pre>
<p>The <code>s3DeletionPolicy</code> defaults to <code>delete</code>, which removes the corresponding bucket from storage when the S3Bucket object is deleted. Setting it to <code>retain</code> keeps the bucket on storage even after the S3Bucket object is deleted, guarding against unintentional deletion.</p>
<p>As mentioned before, sub-users have neither write nor read access to the bucket by default. However, you can grant them access by specifying the <code>s3SubuserBinding</code> field:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">s3.snappcloud.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">S3Bucket</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">s3bucket-sample</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">s3-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">s3DeletionPolicy:</span> <span class="hljs-string">delete</span>
  <span class="hljs-attr">s3SubuserBinding:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">access:</span> <span class="hljs-string">write</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">subuser1</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">access:</span> <span class="hljs-string">read</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">subuser2</span>
  <span class="hljs-attr">s3UserRef:</span> <span class="hljs-string">s3userclaim-sample</span>
</code></pre>
<p>Now, subuser1 has write access and subuser2 has read access to the <code>s3bucket-sample</code>, matching the bindings above. You can change these at any time and re-apply the YAML file.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In conclusion, the Ceph S3 Operator presents a Kubernetes solution that simplifies the management and deployment of Ceph S3 storage in cloud-native environments. Automating complex processes and ensuring seamless integration with existing Kubernetes infrastructure enhances operational efficiency and paves the way for more resilient and scalable storage solutions. Embracing this operator signifies a step forward in optimizing storage strategies, unlocking new possibilities for developers and organizations keen on leveraging the best of both Ceph S3 and Kubernetes.</p>
<hr />
<p>Originally published at <a target="_blank" href="https://itnext.io/harnessing-ceph-s3-with-kubernetes-an-operator-solution-005b1103c753">itnext.io</a>.</p>
]]></content:encoded></item><item><title><![CDATA[ClickHouse Advanced Tutorial: Performance Comparison with MySQL]]></title><description><![CDATA[Introduction
Nothing is perfect. In terms of databases, you can't expect the best performance for every task and query from your deployed database. However, the vital step as a software developer is to know their strengths and weaknesses and how to d...]]></description><link>https://hamedkarbasi.com/clickhouse-advanced-tutorial-performance-comparison-with-mysql</link><guid isPermaLink="true">https://hamedkarbasi.com/clickhouse-advanced-tutorial-performance-comparison-with-mysql</guid><category><![CDATA[ClickHouse]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Thu, 08 Jun 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439375292/2ec0b927-e290-4311-9531-64ed6fd1923f.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://www.azquotes.com/picture-quotes/quote-it-s-more-important-to-know-your-weaknesses-than-your-strengths-ray-l-hunt-102-72-52.jpg" alt /></p>
<h2 id="heading-introduction">Introduction</h2>
<p>Nothing is perfect. In terms of databases, you can't expect the best performance for every task and query from your deployed database. As a software developer, though, the vital step is to know each database's strengths and weaknesses and how to deal with them.</p>
<p>In this post, I will compare ClickHouse as a representative of OLAP databases with MySQL as a representative of OLTP. This will help us choose better solutions for our challenges according to our conditions and goals. Before jumping into the main content, let's discuss OLTP, OLAP, MySQL, and ClickHouse.</p>
<h3 id="heading-oltp">OLTP</h3>
<p>OLTP stands for Online Transaction Processing and is used for day-to-day operations, such as processing orders and updating customer information. OLTP is best for short, fast transactions and is optimized for quick response times. It ensures data accuracy and consistency and provides an efficient way to access data.</p>
<h3 id="heading-olap">OLAP</h3>
<p>OLAP stands for Online Analytical Processing and is used for data mining and analysis. It enables organizations to analyze large amounts of data from multiple perspectives and identify trends and patterns. OLAP is best for complex queries and data mining and can provide insights that are impossible to obtain with traditional reporting tools.</p>
<h3 id="heading-mysql">MySQL</h3>
<p>MySQL is a popular open-source relational database management system used by websites and applications to store and manage data. It holds data in tables and allows users to query it, and provides features such as triggers, stored procedures, and views. MySQL is easy to use and has a wide range of features for building powerful and efficient applications.</p>
<h3 id="heading-clickhouse">ClickHouse</h3>
<p>ClickHouse is an open-source column-oriented database management system developed by Yandex. It is designed to provide high performance for analytical queries. ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. You can learn more about this database in the first part of this series:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://hamedkarbasi.com/clickhouse-basic-tutorial-an-introduction">https://hamedkarbasi.com/clickhouse-basic-tutorial-an-introduction</a></div>
<p>Now we can talk about the performance comparison.</p>
<h2 id="heading-comparison-case-study">Comparison Case Study</h2>
<p>I've followed the <a target="_blank" href="https://github.com/ClickHouse/ClickBench">ClickBench</a> repository methodology for the case study. It uses the <em>hits</em> dataset obtained from the actual traffic recordings of one of the world's largest web analytics platforms. <code>hits</code> contains about 100M rows in a single flat table. This repository studies more than 20 databases regarding dataset load time, elapsed time for 43 OLAP queries, and occupied storage. You can access their visualized results <a target="_blank" href="https://github.com/ClickHouse/ClickBench">here</a>.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/ClickHouse/ClickBench">https://github.com/ClickHouse/ClickBench</a></div>
<p>To investigate ClickHouse and MySQL performance specifically, I separated 10M rows of the table and chose some of the predefined <a target="_blank" href="https://github.com/ClickHouse/ClickBench/blob/main/mysql/queries.sql">queries</a> that make the point clear. Those queries are mainly OLAP-style, so they only showcase ClickHouse's strengths over MySQL (i.e., MySQL loses in all of them). Hence, I added other queries showing the opposite (OLTP queries). Although I've limited the benchmark to these two databases, you can generalize the concept to other row-oriented and column-oriented DBMSs.</p>
<p><strong><em>Disclaimer:</em></strong> <em>This benchmark only clarifies the main difference between column-oriented and row-oriented databases regarding their performance and use cases. It should not be considered a reference for your use cases. Hence, you should perform your benchmarks with your queries to achieve the best decision.</em></p>
<h3 id="heading-system-specification">System Specification</h3>
<p>Databases are installed on Ubuntu 22.04 LTS on a system with the below specifications:</p>
<ul>
<li><p>CPU: Intel® Core™ i7-10510U CPU @ 1.80GHz × 8</p>
</li>
<li><p>RAM: 16 GiB</p>
</li>
<li><p>Storage: 256 GiB SSD</p>
</li>
</ul>
<h3 id="heading-benchmark-flow">Benchmark Flow</h3>
<ol>
<li><p>The database is created.</p>
</li>
<li><p>The table is created with the <a target="_blank" href="https://github.com/ClickHouse/ClickBench/blob/main/mysql/create.sql">defined DDL</a>.</p>
</li>
<li><p>Data (<code>hits.tsv</code>) is loaded into the table, and the load time is measured.</p>
</li>
<li><p>Queries are run, and each query's elapsed time is measured.</p>
</li>
</ol>
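<p>Step 4 can be sketched with a small shell helper. <code>run_timed</code> is a hypothetical wrapper of my own, and <code>sleep</code> stands in for a real query here; in the actual benchmark you would pass <code>clickhouse-client</code> or <code>mysql</code> invocations:</p>

```shell
# Time an arbitrary command in milliseconds (GNU date, nanosecond precision).
run_timed() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo "elapsed: $(( (end - start) / 1000000 )) ms"
}

# Real usage would look like:
#   run_timed clickhouse-client --query "SELECT COUNT(*) FROM hits"
#   run_timed mysql test -e "SELECT COUNT(*) FROM hits"
run_timed sleep 0.2
```
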
<h3 id="heading-queries">Queries</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Query Number</td><td>Statement</td><td>Type</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td><code>SELECT COUNT(*) FROM hits;</code></td><td>OLAP</td></tr>
<tr>
<td>2</td><td><code>SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;</code></td><td>OLAP</td></tr>
<tr>
<td>3</td><td><code>SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate &gt;= '2013-07-01' AND EventDate &lt;= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL &lt;&gt; '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;</code></td><td>OLAP</td></tr>
<tr>
<td>4</td><td><code>SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;</code></td><td>OLAP</td></tr>
<tr>
<td>5</td><td><code>SELECT EventTime, WatchID FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';</code></td><td>OLTP</td></tr>
<tr>
<td>6</td><td><code>SELECT Title, URL, Referer FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';</code></td><td>OLTP</td></tr>
<tr>
<td>7</td><td><code>UPDATE hits SET Title='my title', URL='my url', Referer='my referer' WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';</code></td><td>OLTP</td></tr>
</tbody>
</table>
</div><h2 id="heading-results">Results</h2>
<p>I'll study the results under four categories:</p>
<ul>
<li><p>Dataset Load</p>
</li>
<li><p>Table Size</p>
</li>
<li><p>Read Queries Execution</p>
</li>
<li><p>Update Query Execution: I discuss the update query (number 7) separately since it needs more attention.</p>
</li>
</ul>
<h3 id="heading-dataset-load">Dataset Load</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>ClickHouse</td><td>MySQL</td><td>Ratio</td></tr>
</thead>
<tbody>
<tr>
<td><strong>65s</strong></td><td>11m35s</td><td>x10.7</td></tr>
</tbody>
</table>
</div><p>Thanks to its LSM-tree-based storage and sparse indexes, ClickHouse loads data much faster than MySQL, which relies on a B-Tree. Note, however, that ClickHouse's insert efficiency shows in bulk inserts rather than in many small individual inserts: every insert creates an immutable part, and ClickHouse avoids changing, removing, or rewriting data for just a few rows.</p>
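<p>As a rule of thumb, batch many rows into a single <code>INSERT</code> rather than issuing row-by-row inserts, since each statement produces a new immutable part. Below is a minimal sketch; the column names follow the ClickBench <code>hits</code> schema, the actual table has many more required columns, and a realistic batch would carry thousands of rows:</p>
<pre><code class="lang-sql">-- One bulk INSERT creates one new part;
-- three single-row INSERTs would create three parts
INSERT INTO hits (WatchID, CounterID, EventDate) VALUES
    (1, 62, '2013-07-01'),
    (2, 62, '2013-07-01'),
    (3, 62, '2013-07-02');
</code></pre>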
<h3 id="heading-table-size">Table Size</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>ClickHouse (GiB)</td><td>MySQL (GiB)</td><td>Ratio</td></tr>
</thead>
<tbody>
<tr>
<td><strong>1.3</strong></td><td>6.32</td><td>x4.86</td></tr>
</tbody>
</table>
</div><p>The column-oriented structure enables effective <a target="_blank" href="https://clickhouse.com/docs/en/about-us/distinctive-features#data-compression"><em>Data Compression</em></a>, something row-oriented databases cannot match. That is why ClickHouse can do teams storing large amounts of data a practical favor by reducing their storage costs.</p>
<h3 id="heading-read-queries-execution">Read Queries Execution</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Query Number</td><td>ClickHouse (s)</td><td>MySQL (s)</td><td>Ratio</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td><strong>0.005</strong></td><td>7.79</td><td>x1558</td></tr>
<tr>
<td>2</td><td><strong>0.030</strong></td><td>16.0</td><td>x533.3</td></tr>
<tr>
<td>3</td><td><strong>0.193</strong></td><td>4.35</td><td>x22.5</td></tr>
<tr>
<td>4</td><td><strong>2.600</strong></td><td>180.93</td><td>x69.58</td></tr>
<tr>
<td>5</td><td>0.01</td><td><strong>0.00</strong></td><td>x0</td></tr>
<tr>
<td>6</td><td>0.011</td><td><strong>0.00</strong></td><td>x0</td></tr>
</tbody>
</table>
</div><p>ClickHouse's <a target="_blank" href="https://clickhouse.com/docs/en/optimize/sparse-primary-indexes#:~:text=At%20the%20very%20large%20scale,technique%20is%20called%20sparse%20index.">sparse index</a> and column-oriented structure outperform MySQL on all OLAP queries (numbers 1 to 4). That's why BI and data analysts would be more than happy with ClickHouse for their daily reports.</p>
<p>However, MySQL wins the battle when it comes to OLTP queries (numbers 5 and 6). The B-Tree (used by MySQL) indeed performs better for point queries: short transactions that touch only a few rows.</p>
<h3 id="heading-update-query-execution">Update Query Execution</h3>
<p>For the update query (number 7), we have to execute a different statement in ClickHouse, as it doesn't support a plain <code>UPDATE</code>; the <code>ALTER TABLE ... UPDATE</code> mutation has to be used instead:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> hits <span class="hljs-keyword">UPDATE</span> JavaEnable=<span class="hljs-number">0</span> <span class="hljs-keyword">WHERE</span> CounterID = <span class="hljs-number">38</span> <span class="hljs-keyword">AND</span> EventDate = <span class="hljs-string">'2013-07-15'</span> <span class="hljs-keyword">AND</span> UserID = <span class="hljs-string">'1387668437822950552'</span> <span class="hljs-keyword">AND</span> WatchID = <span class="hljs-string">'8899477221003616239'</span>;
</code></pre>
<p>Additionally, ClickHouse applies the update asynchronously. To see the result immediately, you have to run an <code>OPTIMIZE</code> command:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">OPTIMIZE</span> <span class="hljs-keyword">TABLE</span> hits <span class="hljs-keyword">FINAL</span>;
</code></pre>
<p>By running the query number 7 statement shown in the <a class="post-section-overview" href="#queries">queries table</a> for MySQL and the two SQL statements above for ClickHouse, we get the results below:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Query Number</td><td>ClickHouse (s)</td><td>MySQL (s)</td><td>Ratio</td></tr>
</thead>
<tbody>
<tr>
<td>7</td><td>26</td><td><strong>0.00</strong></td><td>0</td></tr>
</tbody>
</table>
</div><p>Again, ClickHouse's aversion to mutations makes it a loser for real-time updates (and similarly deletes) compared to MySQL. Consequently, other methods, such as deduplication with ReplacingMergeTree, can be utilized to handle updates. You can find valuable resources in the links below:</p>
<ul>
<li><p><a target="_blank" href="https://clickhouse.com/docs/en/guides/developer/deduplication">Row-level Deduplication Strategies for Upserts and Frequent Updates</a></p>
</li>
<li><p><a target="_blank" href="https://clickhouse.com/blog/handling-updates-and-deletes-in-clickhouse">Handling Updates and Deletes in ClickHouse</a></p>
</li>
<li><p><a target="_blank" href="https://altinity.com/blog/2020/4/14/handling-real-time-updates-in-clickhouse">Handling Real-Time Updates in ClickHouse</a></p>
</li>
</ul>
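<p>To make the deduplication idea concrete, here is a minimal, hypothetical sketch (the table and column names are mine, not from the benchmark): instead of mutating rows in place, you insert a new version and let <code>ReplacingMergeTree</code> keep the latest one per sort key:</p>
<pre><code class="lang-sql">-- Each "update" is just an insert of a newer version of the row
CREATE TABLE hits_dedup
(
    WatchID   UInt64,
    Title     String,
    UpdatedAt DateTime
)
ENGINE = ReplacingMergeTree(UpdatedAt)
ORDER BY WatchID;

INSERT INTO hits_dedup VALUES (8899477221003616239, 'old title', now() - 60);
INSERT INTO hits_dedup VALUES (8899477221003616239, 'my title',  now());

-- FINAL collapses versions at query time; background merges do it eventually
SELECT Title FROM hits_dedup FINAL WHERE WatchID = 8899477221003616239;
</code></pre>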
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this post, I benchmarked MySQL and ClickHouse databases to study some of their strengths and weaknesses that may help us choose a suitable solution. To summarize:</p>
<ul>
<li><p>MySQL performs better on point (OLTP) queries.</p>
</li>
<li><p>ClickHouse performs better on OLAP queries.</p>
</li>
<li><p>ClickHouse is not designed for frequent updates and deletes. You have to handle them with deduplication methods.</p>
</li>
<li><p>ClickHouse reduces the storage cost thanks to its column-oriented structure.</p>
</li>
<li><p>ClickHouse loads bulk inserts far faster than MySQL.</p>
</li>
</ul>
<hr />
<p><em>Originally published at</em> <a target="_blank" href="https://dev.to/hoptical/clickhouse-advanced-tutorial-performance-comparison-with-mysql-2cj2"><em>https://dev.to</em></a> <em>on June 8, 2023.</em></p>
]]></content:encoded></item><item><title><![CDATA[ClickHouse Basic Tutorial: Keys & Indexes]]></title><description><![CDATA[Photo by Maksym Kaharlytskyi on Unsplash
In the previous parts, we saw an introduction to ClickHouse and its features. Furthermore, we learned about its different table engine families and their most usable members. In this part, I will walk through ...]]></description><link>https://hamedkarbasi.com/clickhouse-basic-tutorial-keys-and-indexes</link><guid isPermaLink="true">https://hamedkarbasi.com/clickhouse-basic-tutorial-keys-and-indexes</guid><category><![CDATA[ClickHouse]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Fri, 02 Jun 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439378684/578a9973-f61b-4c5f-851d-a2e703154b42.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Photo by</em> <a target="_blank" href="https://unsplash.com/de/@qwitka?utm_source=medium&amp;utm_medium=referral"><em>Maksym Kaharlytskyi</em></a> <a target="_blank" href="https://unsplash.com/de/@qwitka?utm_source=medium&amp;utm_medium=referral"><em>on Unsplash</em></a></p>
<p>In the previous parts, we saw an introduction to ClickHouse and its features. Furthermore, we learned about its different table engine families and their most usable members. In this part, I will walk through the special keys and indexes in ClickHouse, which can help reduce query latency and database load significantly.</p>
<p>Note that these concepts apply only to the default table engine family: MergeTree.</p>
<h2 id="heading-primary-key">Primary Key</h2>
<p>ClickHouse indexes are based on <em>Sparse Indexing</em>, an alternative to the <a target="_blank" href="https://en.wikipedia.org/wiki/B-tree">B-Tree</a> index utilized by traditional DBMSs. In a B-Tree, every row is indexed, which is suitable for locating and updating a single row (the point queries common in OLTP tasks). This comes at the cost of poor high-volume insert performance and high memory and storage consumption. In contrast, the sparse index splits data into multiple <em>parts</em>, each divided into fixed-size groups of rows called <em>granules</em>. ClickHouse keeps one index entry per granule instead of per row, and that's where the term <em>sparse index</em> comes from. Given a query filtered on the primary keys, ClickHouse looks for the matching granules and loads them into memory in parallel. That brings notable performance on the range queries common in OLAP tasks. Additionally, as data is stored column by column across multiple files, it can be compressed, resulting in much lower storage consumption.</p>
<p>The sparse index is built on <a target="_blank" href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">LSM trees</a>, allowing you to ingest high volumes of data per second. All this comes at the cost of being unsuitable for point queries, which is not what ClickHouse is designed for.</p>
<h3 id="heading-structure">Structure</h3>
<p>In the below figure, we can see how ClickHouse stores data:</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jgwpv7hn78pjm0zmiv9y.png" alt /></p>
<p><em><cite>ClickHouse Data Store Structure</cite></em></p>
<ul>
<li><p>Data is split into multiple parts (by the ClickHouse default or a user-defined partition key).</p>
</li>
<li><p>Parts are split into granules, which are a logical concept: ClickHouse doesn't physically split the data into them. Instead, it locates granules via marks. Granules' locations (start and end) are defined in mark files with the <code>mrk2</code> extension.</p>
</li>
<li><p>Index values are stored in the <code>primary.idx</code> file, which contains one row per granule.</p>
</li>
<li><p>Columns are stored as compressed blocks in <code>.bin</code> files: one file per column in the <code>Wide</code> format and a single file for all columns in the <code>Compact</code> format. Whether a part is Wide or Compact is determined by ClickHouse based on its size.</p>
</li>
</ul>
<p>Now let's see how ClickHouse finds the matching rows using primary keys:</p>
<ol>
<li><p>ClickHouse finds the matching granule marks using binary search over the <code>primary.idx</code> file.</p>
</li>
<li><p>It looks into the mark files to find the granules' locations in the <code>bin</code> files.</p>
</li>
<li><p>It loads the matching granules from the <code>bin</code> files into memory in parallel and searches for the matching rows within those granules.</p>
</li>
</ol>
<h3 id="heading-case-study">Case Study</h3>
<p>To clarify the flow mentioned above, let's create a table and insert data into it:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> default.projects
(

    <span class="hljs-string">`project_id`</span> UInt32,

    <span class="hljs-string">`name`</span> <span class="hljs-keyword">String</span>,

    <span class="hljs-string">`created_date`</span> <span class="hljs-built_in">Date</span>
)
<span class="hljs-keyword">ENGINE</span> = MergeTree
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (project_id, created_date)
</code></pre>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> projects 
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> generateRandom(<span class="hljs-string">'project_id Int32, name String, created_date Date'</span>, <span class="hljs-number">10</span>, <span class="hljs-number">10</span>, <span class="hljs-number">1</span>)
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10000000</span>;
</code></pre>
<p>First, if you don't specify primary keys separately, ClickHouse considers the sort keys (in <code>ORDER BY</code>) as the primary keys. Hence, in this table, <code>project_id</code> and <code>created_date</code> are the primary keys. Every time you insert data into this table, ClickHouse sorts it first by <code>project_id</code> and then by <code>created_date</code>.</p>
<p>If we look into the data structure stored on the hard drive, we face this:</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8u1cfv0vor7vj98dzrcj.png" alt class="image--center mx-auto" /></p>
<p><cite>Physical files stored in a part</cite></p>
<p>We have five parts, and one of them is: <code>all_1_1_0</code>. You can visit <a target="_blank" href="https://kb.altinity.com/engines/mergetree-table-engine-family/part-naming-and-mvcc/#part-names--multiversion-concurrency-control">this link</a> if you're curious about the naming convention. As you can see, columns are stored in <code>bin</code> files, and we see mark files named as primary keys along with the <code>primary.idx</code> file.</p>
<h4 id="heading-filter-on-the-first-primary-key">Filter on the first primary-key</h4>
<p>Now let's filter on <code>project_id</code>, which is the first primary key, and explain its indexes:</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rt3u2ool0pla83zjokbq.png" alt /></p>
<p><cite>Index analysis of a query on first primary key</cite></p>
<p>As you can see, the system has detected <code>project_id</code> as a primary key and ruled out 1224 granules out of 1225 using it!</p>
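<p>Where does the figure of 1225 granules come from? With MergeTree's default <code>index_granularity</code> of 8192 rows per granule, a back-of-the-envelope sketch for our 10-million-row table gives roughly the same number:</p>
<pre><code class="lang-sql">-- 10,000,000 rows / 8,192 rows per granule gives about 1221 granules;
-- each of the 5 parts may also end with a partial granule, which is
-- consistent with the 1225 granules reported by EXPLAIN
SELECT ceil(10000000 / 8192) AS approx_granules;
</code></pre>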
<h4 id="heading-filter-on-second-primary-key">Filter on second primary-key</h4>
<p>What if we filter on <code>created_date</code>: the second PK:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">EXPLAIN</span> <span class="hljs-keyword">indexes</span>=<span class="hljs-number">1</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> projects <span class="hljs-keyword">WHERE</span> created_date=today()
</code></pre>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/meauo6583t2d9vclbb1u.png" alt /></p>
<p><cite>Index analysis of a query on second primary key</cite></p>
<p>The database has detected <code>created_date</code> as a primary key, but it couldn't filter out any granules. Why? Because ClickHouse uses binary search only for the first key and a <a target="_blank" href="https://github.com/ClickHouse/ClickHouse/blob/22.3/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp#L1444">generic exclusion search</a> for the other keys, which is much less efficient. So how can we make it more efficient?</p>
<p>If we swap <code>project_id</code> and <code>created_date</code> in the sort keys when defining the table, we achieve better filtering results for the non-first keys, since <code>created_date</code> has lower cardinality (fewer unique values) than <code>project_id</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> default.projects
(

    <span class="hljs-string">`project_id`</span> UInt32,

    <span class="hljs-string">`name`</span> <span class="hljs-keyword">String</span>,

    <span class="hljs-string">`created_date`</span> <span class="hljs-built_in">Date</span>
)
<span class="hljs-keyword">ENGINE</span> = MergeTree
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (created_date, project_id)
</code></pre>
<pre><code class="lang-sql"><span class="hljs-keyword">EXPLAIN</span> <span class="hljs-keyword">indexes</span>=<span class="hljs-number">1</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> projects <span class="hljs-keyword">WHERE</span> project_id=<span class="hljs-number">700</span>
</code></pre>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tz9fgeo23rncr5cx3oah.png" alt /></p>
<p><cite>Index analysis of a query on second primary key on an improved sort keys table</cite></p>
<p>If we now filter on <code>project_id</code>, the second key, ClickHouse reads only 909 granules instead of the whole dataset.</p>
<p>So to summarize, always try to order the primary keys from <strong>low</strong> to <strong>high</strong> cardinality.</p>
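<p>When in doubt about key order, you can measure the cardinalities directly; a quick sketch:</p>
<pre><code class="lang-sql">-- Lower-cardinality columns should generally come first in ORDER BY
SELECT
    uniq(created_date) AS created_date_cardinality,
    uniq(project_id)   AS project_id_cardinality
FROM projects;
</code></pre>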
<h3 id="heading-order-key">Order Key</h3>
<p>I mentioned earlier that if you don't specify the <code>PRIMARY KEY</code> option, ClickHouse treats the sort keys as the primary keys. However, if you want to set the primary keys separately, they must be a prefix of the sort keys. As a result, the additional columns in the sort keys are used only for sorting and play no role in indexing.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> default.projects
(

    <span class="hljs-string">`project_id`</span> UInt32,

    <span class="hljs-string">`name`</span> <span class="hljs-keyword">String</span>,

    <span class="hljs-string">`created_date`</span> <span class="hljs-built_in">Date</span>
)
<span class="hljs-keyword">ENGINE</span> = MergeTree
PRIMARY <span class="hljs-keyword">KEY</span> (created_date, project_id)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (created_date, project_id, <span class="hljs-keyword">name</span>)
</code></pre>
<p>In this example, <code>created_date</code> and <code>project_id</code> columns are utilized in the sparse index and sorting, and <code>name</code> column is only used as the last item for sorting.</p>
<p>Use this option if you often use a column in the <code>ORDER BY</code> clause of your queries, since it eliminates the database's sorting effort at query time.</p>
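<p>For example, a query whose <code>ORDER BY</code> matches the table's sort order can stream rows in stored order instead of sorting at query time (a sketch against the table above):</p>
<pre><code class="lang-sql">-- No extra sort needed: rows are already stored in this order
SELECT *
FROM projects
ORDER BY created_date, project_id, name
LIMIT 100;
</code></pre>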
<h3 id="heading-partition-key">Partition Key</h3>
<p>A partition is a logical combination of parts in ClickHouse. By default, all parts belong to a single, unspecified partition. To see this, look into the <code>system.parts</code> table for the <code>projects</code> table defined in the previous section:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    <span class="hljs-keyword">name</span>,
    <span class="hljs-keyword">partition</span>
<span class="hljs-keyword">FROM</span>
    system.parts
<span class="hljs-keyword">WHERE</span>
    <span class="hljs-keyword">table</span> = <span class="hljs-string">'projects'</span>;
</code></pre>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epygdscxk5pahpfzgjnu.png" alt /></p>
<p><cite>Parts structure in an unpartitioned table</cite></p>
<p>You can see that the <code>projects</code> table has no particular partition. However, you can customize it using the <code>PARTITION BY</code> option:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> default.projects_partitioned
(

    <span class="hljs-string">`project_id`</span> UInt32,

    <span class="hljs-string">`name`</span> <span class="hljs-keyword">String</span>,

    <span class="hljs-string">`created_date`</span> <span class="hljs-built_in">Date</span>
)
<span class="hljs-keyword">ENGINE</span> = MergeTree
<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> toYYYYMM(created_date)
PRIMARY <span class="hljs-keyword">KEY</span> (created_date, project_id)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (created_date, project_id, <span class="hljs-keyword">name</span>)
</code></pre>
<p>In the above table, ClickHouse partitions data based on the month of the <code>created_date</code> column:</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qthhonsue7p6qj26mgxj.png" alt /></p>
<p><cite>Parts structure in a partitioned table</cite></p>
<h3 id="heading-index">Index</h3>
<p>ClickHouse creates a <em>min-max</em> index for the partition key and uses it as the first filtering layer when running a query. Let's see what happens when we filter data by a column present in the partition key:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">EXPLAIN</span> <span class="hljs-keyword">indexes</span>=<span class="hljs-number">1</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> projects_partitioned <span class="hljs-keyword">WHERE</span> created_date=<span class="hljs-string">'2020-02-01'</span>
</code></pre>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4z17e0jmy8v8nvbp04ow.png" alt /></p>
<p><cite>Index analysis on a partitioned table</cite></p>
<p>You can see that the database has chosen one part out of 16 using the min-max index of the partition key.</p>
<h3 id="heading-usage">Usage</h3>
<p>Partitioning in ClickHouse is meant to bring data manipulation capabilities to a table. For instance, you can delete or move the parts belonging to partitions older than a year. This is far more efficient than on an unpartitioned table, since ClickHouse has physically split the data by month on storage, so such operations can be performed cheaply.</p>
<p>Although ClickHouse creates an additional index for the partition key, partitioning should never be treated as a query performance optimization: it loses that battle to putting the column in the sort keys. So if you wish to improve query performance, put those columns in the sort keys, and use a column as the partition key only if you have concrete plans for data manipulation based on it.</p>
<p>Finally, don't confuse partitions in ClickHouse with the same term in distributed systems, where data is split across different nodes. You should use <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines/special/distributed">shards and distributed tables</a> if you want to achieve that.</p>
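<p>The data-manipulation operations mentioned above can look like this (a sketch against the <code>projects_partitioned</code> table; partition IDs follow the <code>toYYYYMM</code> format):</p>
<pre><code class="lang-sql">-- Drop all of January 2020 cheaply: whole part directories are removed
ALTER TABLE projects_partitioned DROP PARTITION 202001;

-- Or detach it first, keeping the data on disk for archiving elsewhere
ALTER TABLE projects_partitioned DETACH PARTITION 202001;
</code></pre>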
<h2 id="heading-skip-index">Skip Index</h2>
<p>You may have noticed that putting a column among the last items of the sort key doesn't help much, especially if you filter only on that column without the leading sort keys. What should you do in those cases?</p>
<p>Consider a dictionary you want to read. You can find words using the table of contents, sorted alphabetically. Those items are the sort keys of the table. You can easily find a word starting with <em>W</em>, but how can you find the pages containing words related to wars?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439377491/ee19e0df-cb3b-4215-a9dd-eb5c7c4fbae2.jpeg" alt="A book with sticky notes" /></p>
<p>You can put marks or sticky notes on those pages, reducing your effort the next time. That's how a <a target="_blank" href="https://clickhouse.com/docs/en/optimize/skipping-indexes">Skip Index</a> works. It helps the database skip granules that don't contain the desired values of certain columns by creating additional indexes.</p>
<h3 id="heading-case-study-1">Case Study</h3>
<p>Consider the <code>projects</code> table defined in the <em>Order Key</em> section, where <code>created_date</code> and <code>project_id</code> were defined as primary keys. Now, if we filter on the <code>name</code> column, we encounter this:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">EXPLAIN</span> <span class="hljs-keyword">indexes</span>=<span class="hljs-number">1</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> projects <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">name</span>=<span class="hljs-string">'hamed'</span>
</code></pre>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eiofrdndocznh2d242u5.png" alt /></p>
<p><cite>Index analysis on a query on non-indexed column</cite></p>
<p>The result was expected. Now what if we define a skip index on it?</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> projects <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">INDEX</span> name_index <span class="hljs-keyword">name</span> <span class="hljs-keyword">TYPE</span> bloom_filter GRANULARITY <span class="hljs-number">1</span>;
</code></pre>
<p>The above command creates a skip index on the <code>name</code> column. I've used the bloom filter type because the column is a string. You can find more about the other types <a target="_blank" href="https://clickhouse.com/docs/en/optimize/skipping-indexes#skip-index-types">here</a>.</p>
<p>This command builds the index only for newly inserted data. To build it for already-inserted data, run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> projects MATERIALIZE <span class="hljs-keyword">INDEX</span> name_index;
</code></pre>
<p>Let's see the query analysis this time:</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/imwu2pw2hbl2zeh7cb9x.png" alt /></p>
<p><cite>Index analysis on a query on skip-indexed column</cite></p>
<p>As you can see, the skip index greatly improved granule exclusion and, with it, performance.</p>
<p>While the skip index performed well in this example, it can perform poorly in other cases. Its effectiveness depends on how well the indexed column correlates with the sort keys, as well as on settings such as the index granularity and index type.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, understanding and utilizing ClickHouse's primary keys, order keys, partition keys, and skip index is crucial for optimizing query performance and scalability. Choosing appropriate primary keys, order keys, and partitioning strategies can enhance data distribution, improve query execution speed, and prevent overloading. Additionally, leveraging the skip index feature intelligently helps minimize disk I/O and reduce query execution time. By considering these factors in your ClickHouse schema design, you can unlock the full potential of ClickHouse for efficient and performant data solutions.</p>
<hr />
<p><em>Originally published at</em> <a target="_blank" href="https://dev.to/hoptical/clickhouse-basic-tutorial-keys-indexes-5d7a"><em>https://dev.to</em></a> <em>on June 2, 2023.</em></p>
]]></content:encoded></item><item><title><![CDATA[How to Schedule Database Backups with Cronjob and Upload to AWS S3]]></title><description><![CDATA[Introduction
The update procedure is a vital operation every team should consider. However, doing it manually can be exhausting. As a toil, you can automate it simply by creating a cronjob to take the backup and upload it to your desired object stora...]]></description><link>https://hamedkarbasi.com/how-to-schedule-database-backups-with-cronjob-and-upload-to-aws-s3</link><guid isPermaLink="true">https://hamedkarbasi.com/how-to-schedule-database-backups-with-cronjob-and-upload-to-aws-s3</guid><category><![CDATA[Backup]]></category><category><![CDATA[automation]]></category><category><![CDATA[Amazon S3]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Tue, 23 May 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439381990/78236569-584a-4e7f-9044-e189c9a3e4fb.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>The backup procedure is a vital operation every team should have in place. However, doing it manually can be exhausting. Like any toil, you can automate it simply by creating a cronjob that takes the backup and uploads it to your desired object storage.</p>
<p>This article explains how to automate your database backup with a cronjob and upload it to S3-compatible cloud storage. Postgres is used as the database, but you can generalize the approach to any other database or data type you want.</p>
<h2 id="heading-step-1-create-configmap">Step 1: Create the ConfigMap</h2>
<p>To perform the backup and upload, we need a bash script. It first logs into the cluster via the <code>oc login</code> command, then gets the name of your desired database pod, executes the dump command, zips the result, and downloads it via <code>rsync</code>. Finally, it uploads the archive to AWS S3 object storage.</p>
<p>While running, and before dumping, the script prompts the user with an <em>Are you sure?</em> confirmation, which you can bypass with the <code>-y</code> option.</p>
<p>All credentials like <em>OKD Token</em> or <em>Postgres Password</em> are passed to the application as environment variables.</p>
<p>By putting this bash script in a ConfigMap, it can be mounted as a volume in the cronjob. Remember to replace <code>PROJECT_HERE</code> with your project name and customize the variables in the script according to your project's specifications.</p>
<pre><code class="lang-yml"><span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">bash-script</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">PROJECT_HERE</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">backup.sh:</span> <span class="hljs-string">&gt;
    #!/bin/bash
</span>
    <span class="hljs-comment"># This file provides a backup script for postgres</span>

    <span class="hljs-comment"># Variables: Modify your variables according to the okd projects and database secrets</span>

    <span class="hljs-string">NAMESPACE=PROJECT_HERE</span>

    <span class="hljs-string">S3_URL=https://s3.mycompany.com</span>

    <span class="hljs-string">STATEFULSET_NAME=postgres</span>

    <span class="hljs-string">BACKUP_NAME=backup-$(date</span> <span class="hljs-string">"+%F"</span><span class="hljs-string">)</span>

    <span class="hljs-string">S3_BUCKET=databases-backup</span>

    <span class="hljs-comment"># Exit the script anywhere faced the error </span>

    <span class="hljs-string">set</span> <span class="hljs-string">-e</span>

    <span class="hljs-comment"># Define the confirm option about user prompt (yes or no)</span>

    <span class="hljs-string">confirm=""</span>

    <span class="hljs-comment"># Parse command-line options</span>

    <span class="hljs-string">while</span> <span class="hljs-string">getopts</span> <span class="hljs-string">"y"</span> <span class="hljs-string">opt;</span> <span class="hljs-string">do</span>
        <span class="hljs-string">case</span> <span class="hljs-string">$opt</span> <span class="hljs-string">in</span>
        <span class="hljs-string">y)</span>
            <span class="hljs-string">confirm="y"</span>
            <span class="hljs-string">;;</span>
        <span class="hljs-string">\?)</span>
            <span class="hljs-string">echo</span> <span class="hljs-string">"Invalid option: -$OPTARG"</span> <span class="hljs-string">&gt;&amp;2</span>
            <span class="hljs-string">exit</span> <span class="hljs-number">1</span>
            <span class="hljs-string">;;</span>
        <span class="hljs-string">esac</span>
    <span class="hljs-string">done</span>

    <span class="hljs-comment"># Login to OKD</span>

    <span class="hljs-string">oc</span> <span class="hljs-string">login</span> <span class="hljs-string">${S3_URL}</span> <span class="hljs-string">--token=${OKD_TOKEN}</span>

    <span class="hljs-string">POD_NAME=$(oc</span> <span class="hljs-string">get</span> <span class="hljs-string">pods</span> <span class="hljs-string">-n</span> <span class="hljs-string">${NAMESPACE}</span> <span class="hljs-string">|</span> <span class="hljs-string">grep</span> <span class="hljs-string">${STATEFULSET_NAME}</span> <span class="hljs-string">|</span> <span class="hljs-string">cut</span> <span class="hljs-string">-d'</span> <span class="hljs-string">'
    -f1)

    echo The backup of database in pod ${POD_NAME} will be dumped in ${BACKUP_NAME}
    file.

    DUMP_COMMAND='</span><span class="hljs-string">PGPASSWORD="'${POSTGRES_USER_PASSWORD}'"</span> <span class="hljs-string">pg_dump</span> <span class="hljs-string">-U</span>
    <span class="hljs-string">'${POSTGRES_USER}'</span> <span class="hljs-string">'${POSTGRES_DB}'</span> <span class="hljs-string">&gt;</span> <span class="hljs-string">/bitnami/postgresql/backup/'${BACKUP_NAME}</span>

    <span class="hljs-string">GZIP_COMMAND='gzip</span> <span class="hljs-string">/bitnami/postgresql/backup/'${BACKUP_NAME}</span>

    <span class="hljs-string">REMOVE_COMMAND='rm</span> <span class="hljs-string">/bitnami/postgresql/backup/'${BACKUP_NAME}.gz</span>

    <span class="hljs-comment"># Prompt the user for confirmation if the -y option was not provided</span>

    <span class="hljs-string">if</span> [[ <span class="hljs-string">$confirm</span> <span class="hljs-type">!=</span> <span class="hljs-string">"y"</span> ]]<span class="hljs-string">;</span> <span class="hljs-string">then</span>
        <span class="hljs-string">read</span> <span class="hljs-string">-r</span> <span class="hljs-string">-p</span> <span class="hljs-string">"Are you sure you want to proceed? [y/N] "</span> <span class="hljs-string">response</span>
        <span class="hljs-string">case</span> <span class="hljs-string">"$response"</span> <span class="hljs-string">in</span>
        [<span class="hljs-string">yY</span>][<span class="hljs-string">eE</span>][<span class="hljs-string">sS</span>] <span class="hljs-string">|</span> [<span class="hljs-string">yY</span>]<span class="hljs-string">)</span>
            <span class="hljs-string">confirm="y"</span>
            <span class="hljs-string">;;</span>
        <span class="hljs-string">*)</span>
            <span class="hljs-string">echo</span> <span class="hljs-string">"Aborted"</span>
            <span class="hljs-string">exit</span> <span class="hljs-number">0</span>
            <span class="hljs-string">;;</span>
        <span class="hljs-string">esac</span>
    <span class="hljs-string">fi</span>

    <span class="hljs-comment"># Dump the backup and zip it</span>

    <span class="hljs-string">oc</span> <span class="hljs-string">exec</span> <span class="hljs-string">-n</span> <span class="hljs-string">${NAMESPACE}</span> <span class="hljs-string">"${POD_NAME}"</span> <span class="hljs-string">--</span> <span class="hljs-string">sh</span> <span class="hljs-string">-c</span> <span class="hljs-string">"${DUMP_COMMAND} &amp;&amp; ${GZIP_COMMAND}"</span>

    <span class="hljs-string">echo</span> <span class="hljs-string">Transfer</span> <span class="hljs-string">it</span> <span class="hljs-string">to</span> <span class="hljs-string">current</span> <span class="hljs-string">local</span> <span class="hljs-string">folder</span>

    <span class="hljs-string">oc</span> <span class="hljs-string">rsync</span> <span class="hljs-string">-n</span> <span class="hljs-string">${NAMESPACE}</span> <span class="hljs-string">${POD_NAME}:/bitnami/postgresql/backup/</span> <span class="hljs-string">/backup-files</span>
    <span class="hljs-string">&amp;&amp;</span>
        <span class="hljs-string">oc</span> <span class="hljs-string">exec</span> <span class="hljs-string">-n</span> <span class="hljs-string">${NAMESPACE}</span> <span class="hljs-string">"${POD_NAME}"</span> <span class="hljs-string">--</span> <span class="hljs-string">sh</span> <span class="hljs-string">-c</span> <span class="hljs-string">"${REMOVE_COMMAND}"</span>

    <span class="hljs-comment"># Send backup files to AWS S3</span>

    <span class="hljs-string">aws</span> <span class="hljs-string">--endpoint-url</span> <span class="hljs-string">"${S3_URL}"</span> <span class="hljs-string">s3</span> <span class="hljs-string">sync</span> <span class="hljs-string">/backup-files</span>
    <span class="hljs-string">s3://${S3_BUCKET}</span>
</code></pre>
<h2 id="heading-step-2-create-secrets">Step 2: Create secrets</h2>
<p>Database, AWS, and OC credentials should be kept as secrets. First, we’ll create a secret containing the AWS CA Bundles. After downloading the bundle, you can make a secret file from it:</p>
<pre><code class="lang-bash">$ oc create secret -n PROJECT_HERE generic certs --from-file ca-bundle.crt
</code></pre>
<p>You should replace <code>PROJECT_HERE</code> with your project name.</p>
<p>Now let’s create another secret for the remaining credentials. Note that <code>AWS_CA_BUNDLE</code> should be set to <code>/certs/ca-bundle.crt</code>, the path where the CA bundle will be mounted inside the container:</p>
<pre><code class="lang-yml"><span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">mysecret</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">PROJECT_HERE</span>

<span class="hljs-attr">data:</span>
  <span class="hljs-attr">AWS_CA_BUNDLE:</span> 
  <span class="hljs-attr">OKD_TOKEN:</span> 
  <span class="hljs-attr">POSTGRES_USER_PASSWORD:</span> 
  <span class="hljs-attr">POSTGRES_USER:</span>
  <span class="hljs-attr">POSTGRES_DB:</span>  
  <span class="hljs-attr">AWS_SECRET_ACCESS_KEY:</span> 
  <span class="hljs-attr">AWS_ACCESS_KEY_ID:</span> 

<span class="hljs-attr">type:</span> <span class="hljs-string">Opaque</span>
</code></pre>
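One detail that often trips people up: values under <code>data</code> in a Kubernetes Secret must be base64-encoded (alternatively, the <code>stringData</code> field accepts plain values). A quick sketch of producing the encoded values — the credentials below are placeholders, not real ones:

```python
import base64

# Placeholder credentials -- substitute your real values.
secret_data = {
    "POSTGRES_USER": "postgres",
    "POSTGRES_DB": "mydb",
    "AWS_CA_BUNDLE": "/certs/ca-bundle.crt",
}

# Kubernetes expects every value under `data` to be base64-encoded.
encoded = {k: base64.b64encode(v.encode()).decode() for k, v in secret_data.items()}

for key, value in encoded.items():
    print(f"{key}: {value}")
```

You can paste the printed values into the manifest above, or skip manual encoding entirely with `oc create secret generic mysecret --from-literal=KEY=VALUE ...`, which encodes the values for you.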
<h2 id="heading-step-3-create-cronjob">Step 3: Create cronjob</h2>
<p>To create the cronjob, we need a Docker image capable of running <code>oc</code> and <code>aws</code> commands. You can find this image and its Dockerfile <a target="_blank" href="https://hub.docker.com/repository/docker/hamedkarbasi/aws-cli-oc/">here</a> if you want to customize it.</p>
<p>Now let’s create the cronjob:</p>
<pre><code class="lang-yml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CronJob</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">database-backup</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">PROJECT_HERE</span> 
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">schedule:</span> <span class="hljs-number">0</span> <span class="hljs-number">3</span> <span class="hljs-string">*</span> <span class="hljs-string">*</span> <span class="hljs-string">*</span>
  <span class="hljs-attr">jobTemplate:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">template:</span>
        <span class="hljs-attr">spec:</span>
          <span class="hljs-attr">containers:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">backup</span>
            <span class="hljs-attr">image:</span> <span class="hljs-string">hamedkarbasi/aws-cli-oc:1.0.0</span>
            <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"/backup-script/backup.sh -y"</span>]
            <span class="hljs-attr">envFrom:</span>
              <span class="hljs-bullet">-</span> <span class="hljs-attr">secretRef:</span>
                  <span class="hljs-attr">name:</span> <span class="hljs-string">mysecret</span>
            <span class="hljs-attr">volumeMounts:</span>
              <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">script</span>
                <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/backup-script/backup.sh</span>
                <span class="hljs-attr">subPath:</span> <span class="hljs-string">backup.sh</span>
              <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">certs</span>
                <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/certs/ca-bundle.crt</span>
                <span class="hljs-attr">subPath:</span> <span class="hljs-string">ca-bundle.crt</span>
              <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kube-dir</span>
                <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/.kube</span>
              <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">backup-files</span>
                <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/backup-files</span>
          <span class="hljs-attr">volumes:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">script</span>
              <span class="hljs-attr">configMap:</span>
                <span class="hljs-attr">name:</span> <span class="hljs-string">backup-script</span>
                <span class="hljs-attr">defaultMode:</span> <span class="hljs-number">0777</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">certs</span>
              <span class="hljs-attr">secret:</span> 
                <span class="hljs-attr">secretName:</span> <span class="hljs-string">certs</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kube-dir</span>
              <span class="hljs-attr">emptyDir:</span> {}
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">backup-files</span>
              <span class="hljs-attr">emptyDir:</span> {}
          <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p>Again, you should replace <code>PROJECT_HERE</code> with your project name and set the <code>schedule</code> parameter to your desired job frequency. By putting all manifests in a folder named <code>backup</code>, we can apply them to Kubernetes:</p>
<pre><code class="lang-bash">$ oc apply -f backup
</code></pre>
<p>This cronjob will run at 3:00 AM every night, dumping the database and uploading the backup to AWS S3.</p>
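The <code>schedule</code> field uses standard cron syntax: <code>0 3 * * *</code> means minute 0 of hour 3, every day. A small sketch of how such a daily schedule resolves to its next run time:

```python
from datetime import datetime, timedelta

def next_daily_run(now: datetime, hour: int = 3, minute: int = 0) -> datetime:
    """Next occurrence of a daily 'minute hour * * *' cron schedule."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot already passed -> tomorrow
    return candidate

# At 14:30, the next 03:00 slot is tomorrow morning.
print(next_daily_run(datetime(2023, 4, 25, 14, 30)))  # 2023-04-26 03:00:00
```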
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, automating database backups to AWS S3 with a Kubernetes cronjob can save you time and effort while ensuring your valuable data is stored securely in the cloud. Following the steps outlined in this guide, you can easily set up a backup schedule that meets your needs and upload your backups to AWS S3 for safekeeping. Remember to test your backups regularly to ensure they can be restored when needed, and keep your AWS credentials and permissions secure to prevent unauthorized access. With these best practices in mind, you can have peace of mind knowing that your database backups are automated and securely stored in the cloud.</p>
<hr />
<p>Originally published at <a target="_blank" href="https://itnext.io/how-to-schedule-database-backups-with-cronjob-and-upload-to-aws-s3-ac1962329e27">itnext.io</a>.</p>
]]></content:encoded></item><item><title><![CDATA[ClickHouse Basic Tutorial: Table Engines]]></title><description><![CDATA[In this part, I will cover ClickHouse table engines. Like any other database, ClickHouse uses engines to determine a table's storage, replication, and concurrency methodologies. Every engine has pros and cons, and you should choose them by your need....]]></description><link>https://hamedkarbasi.com/clickhouse-basic-tutorial-table-engines</link><guid isPermaLink="true">https://hamedkarbasi.com/clickhouse-basic-tutorial-table-engines</guid><category><![CDATA[ClickHouse]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Sat, 29 Apr 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439389305/0ecb0c70-7c36-4c93-88c9-e0415df825b7.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this part, I will cover ClickHouse table engines. Like any other database, ClickHouse uses engines to determine a table's storage, replication, and concurrency methodologies. Every engine has pros and cons, and you should choose them by your need. Moreover, engines are categorized into families sharing the main features. As a practical article, I will deep dive into the most usable ones in every family and leave the others to your interest.</p>
<p>Now, let's start with the first and most usable family:</p>
<h2 id="heading-merge-tree-family">Merge-Tree Family</h2>
<p>As mentioned, this is the most common choice when you want to create a table in ClickHouse. It's based on the <a target="_blank" href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">Log-Structured Merge-Tree</a> data structure. LSM trees are optimized for write-intensive workloads: they handle a large volume of writes by buffering them in memory and then periodically flushing them to disk in sorted order. This allows for faster writes of massive amounts of data and reduces the likelihood of disk fragmentation. They are considered an alternative to the <a target="_blank" href="https://en.wikipedia.org/wiki/B-tree">B-Tree</a> data structure, which is common in traditional relational databases like MySQL.</p>
<blockquote>
<p><em>Note: For all engines of this family, you can use</em> <code>Replicated</code> as a prefix to the engine name to create a replication of the table on every ClickHouse node.</p>
</blockquote>
<p>Now let's investigate common engines in this family.</p>
<h3 id="heading-mergetree">MergeTree</h3>
<p>Here is an example of a merge-tree DDL:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> inventory
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`comment`</span> <span class="hljs-keyword">String</span>
)
<span class="hljs-keyword">ENGINE</span> = MergeTree
PRIMARY <span class="hljs-keyword">KEY</span> (<span class="hljs-keyword">id</span>, price)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">id</span>, price, <span class="hljs-keyword">status</span>)
</code></pre>
<p>Merge-tree tables use <a target="_blank" href="https://clickhouse.com/docs/en/optimize/sparse-primary-indexes">sparse indexing</a> to optimize queries. Briefly, in sparse indexing, data is split into multiple parts. Every part is sorted by the <code>order by</code> keys (referred to as <em>sort keys</em>), where the first key has the highest priority in sorting. Every part is then broken down into groups called <em>granules</em>, and the primary-key values at each granule's boundaries are stored as <em>marks</em>. Since these marks are extracted from the sorted data, the primary key must be a prefix of the sort keys. For every query that filters on primary keys, ClickHouse performs a binary search on those marks to find the target granules as fast as possible. Finally, ClickHouse loads only the target granules into memory and scans them for the matching rows.</p>
<blockquote>
<p><em>Note: You can omit the</em> <code>PRIMARY KEY</code> in DDL, and ClickHouse will consider sort keys as primary keys.</p>
</blockquote>
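To make the mark-based lookup concrete, here is a toy model (not ClickHouse's actual implementation) of binary-searching sparse marks to find the one granule that may contain a key:

```python
import bisect

# Sorted primary-key values of one part, split into granules of 4 rows each.
rows = list(range(0, 40, 2))          # 20 sorted key values
granule_size = 4
granules = [rows[i:i + granule_size] for i in range(0, len(rows), granule_size)]

# The sparse index stores only each granule's first key (its "mark"),
# so the index stays tiny even for billions of rows.
marks = [g[0] for g in granules]

def find_granule(key):
    """Binary-search the marks to locate the granule that may hold `key`."""
    idx = bisect.bisect_right(marks, key) - 1
    return max(idx, 0)

g = find_granule(13)
print(g, granules[g])  # only this granule is loaded and scanned: 1 [8, 10, 12, 14]
```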
<h3 id="heading-replaingmergetree">ReplaingMergeTree</h3>
<h4 id="heading-ddl">DDL</h4>
<p>In this engine, rows with equal sort keys are replaced by the last inserted row. Consider the table below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> inventory
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`comment`</span> <span class="hljs-keyword">String</span>
)
<span class="hljs-keyword">ENGINE</span> = ReplacingMergeTree
PRIMARY <span class="hljs-keyword">KEY</span> (<span class="hljs-keyword">id</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">id</span>, <span class="hljs-keyword">status</span>);
</code></pre>
<p>Suppose that you insert a row in this table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> inventory <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">23</span>, <span class="hljs-string">'success'</span>, <span class="hljs-string">'1000'</span>, <span class="hljs-string">'Confirmed'</span>);
</code></pre>
<p>Now let's insert another row with the same sort keys:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> inventory <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">23</span>, <span class="hljs-string">'success'</span>, <span class="hljs-string">'2000'</span>, <span class="hljs-string">'Cancelled'</span>);
</code></pre>
<p>Now the latter row will replace the previous one. Note that if you select the rows, you may still see both of them:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> inventory <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span>=<span class="hljs-number">23</span>;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439384197/2993c1e2-193c-46f3-9e1c-9628875d1ec6.png" alt="Result of the replacing merge tree without Final modifier" /></p>
<p>That's because ClickHouse performs the replacement while merging the parts, which happens asynchronously in the background, not immediately. To see the final result right away, you can use the <code>FINAL</code> modifier:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> inventory <span class="hljs-keyword">FINAL</span> <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span>=<span class="hljs-number">23</span>;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439384975/3b660614-195d-4c94-9eb0-f616b40605c2.png" alt="Result of the replacing merge tree with Final modifier" /></p>
<blockquote>
<p><em>Note: You can specify a</em> <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree#ver"><em>column as version</em></a> <em>while defining the table to replace rows accordingly.</em></p>
</blockquote>
<h4 id="heading-usage">Usage</h4>
<p>Replacing Merge Tree is widely used for deduplication. Because ClickHouse performs poorly with frequent updates, you can instead update a column by inserting a new row with equal sort keys, and ClickHouse will remove the stale rows in the background. Updating the sort keys themselves is challenging, however, because the old rows won't be replaced in that situation. In that case, you can use the Collapsing Merge Tree, explained in the next part.</p>
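As a mental model, the background merge for this engine behaves roughly like keeping only the last inserted row per sort key (assuming no version column is specified):

```python
# Toy model of ReplacingMergeTree's merge: rows with equal sort keys
# collapse to the most recently inserted one.
rows = [
    # (id, status, price, comment) -- sort key is (id, status)
    (23, "success", "1000", "Confirmed"),
    (23, "success", "2000", "Cancelled"),
]

def merge_replacing(rows, key=lambda r: (r[0], r[1])):
    latest = {}
    for row in rows:               # iteration order == insertion order
        latest[key(row)] = row     # later rows overwrite earlier ones
    return list(latest.values())

print(merge_replacing(rows))  # [(23, 'success', '2000', 'Cancelled')]
```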
<h3 id="heading-collapsingmergetree">CollapsingMergeTree</h3>
<p>In this engine, you define a <em>sign</em> column: a row inserted with <code>sign=-1</code> cancels a previously inserted row with the same sort keys, while rows with <code>sign=1</code> represent the current state.</p>
<h4 id="heading-ddl-1">DDL</h4>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> inventory
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`comment`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`sign`</span> <span class="hljs-built_in">Int8</span>
)
<span class="hljs-keyword">ENGINE</span> = CollapsingMergeTree(<span class="hljs-keyword">sign</span>)
PRIMARY <span class="hljs-keyword">KEY</span> (<span class="hljs-keyword">id</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">id</span>, <span class="hljs-keyword">status</span>);
</code></pre>
<p>Let's insert a row in this table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> inventory <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">23</span>, <span class="hljs-string">'success'</span>, <span class="hljs-string">'1000'</span>, <span class="hljs-string">'Confirmed'</span>, <span class="hljs-number">1</span>);
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439385700/26fa721b-7fcb-479e-87fd-4f4243ba0cd6.png" alt="Data in Collapsing Merge Tree before update" /></p>
<p>Now to update the row:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> inventory <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">23</span>, <span class="hljs-string">'success'</span>, <span class="hljs-string">'1000'</span>, <span class="hljs-string">'Confirmed'</span>, <span class="hljs-number">-1</span>), (<span class="hljs-number">23</span>, <span class="hljs-string">'success'</span>, <span class="hljs-string">'2000'</span>, <span class="hljs-string">'Cancelled'</span>, <span class="hljs-number">1</span>);
</code></pre>
<p>To see the results:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> inventory;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439386428/be1978ec-d582-42e8-927c-c6c586440916.png" alt="Data in Collapsing Merge Tree after update without final modifier" /></p>
<p>To see the final results immediately:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> inventory <span class="hljs-keyword">FINAL</span>;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439387372/04627aa4-f6ef-4739-9c6d-262ebb27033b.png" alt="Data in Collapsing Merge Tree after update with final modifier" /></p>
<h4 id="heading-usage-1">Usage</h4>
<p>Collapsing Merge Trees can handle updates and deletes in a more controlled manner. For example, you can update sort keys by inserting the same row with <code>sign=-1</code> and a row with the new sort keys with <code>sign=1</code>. There are two challenges with this engine:</p>
<ol>
<li><p>Since you need to insert the old row with <code>sign=-1</code>, you must first look it up by fetching it from the database or another data store.</p>
</li>
<li><p>In case of inserting multiple rows accidentally or deliberately, with the <code>sign</code> equal to 1 or -1, you may face unwanted results. That's why you should consider all situations explained <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/collapsingmergetree#table_engine-collapsingmergetree-collapsing-algorithm">here</a>.</p>
</li>
</ol>
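A much-simplified model of the collapsing step — a `sign=-1` row cancels a previously written row with the same sort key, and unmatched `sign=1` rows survive (ClickHouse's real algorithm handles more edge cases, as linked above):

```python
from collections import defaultdict

# (id, status, price, comment, sign) -- sort key is (id, status)
rows = [
    (23, "success", "1000", "Confirmed", 1),
    (23, "success", "1000", "Confirmed", -1),  # cancels the row above
    (23, "success", "2000", "Cancelled", 1),
]

def collapse(rows):
    state = defaultdict(list)
    for row in rows:
        key, sign = (row[0], row[1]), row[4]
        if sign == -1 and state[key]:
            state[key].pop()      # a -1 row cancels the latest matching +1 row
        else:
            state[key].append(row)
    return [r for group in state.values() for r in group]

print(collapse(rows))  # [(23, 'success', '2000', 'Cancelled', 1)]
```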
<h3 id="heading-aggreragatingmergetree">AggreragatingMergeTree</h3>
<p>Using this engine, you can materialize the aggregation of a table into another one.</p>
<h4 id="heading-ddl-2">DDL</h4>
<p>Consider this <em>inventory</em> table. We want another table that holds, for every item id, the maximum price and the total number of items.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> inventory
 (
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> Int32,
    <span class="hljs-string">`num_items`</span> UInt64
) <span class="hljs-keyword">ENGINE</span> = MergeTree <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">id</span>, <span class="hljs-keyword">status</span>);
</code></pre>
<p>Now let's materialize its results into another table via AggregatingMergeTree:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">MATERIALIZED</span> <span class="hljs-keyword">VIEW</span> agg_inventory
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`max_price`</span> AggregateFunction(<span class="hljs-keyword">max</span>, Int32),
    <span class="hljs-string">`sum_items`</span> AggregateFunction(<span class="hljs-keyword">sum</span>, UInt64)
)
<span class="hljs-keyword">ENGINE</span> = AggregatingMergeTree() <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">id</span>)
<span class="hljs-keyword">AS</span> <span class="hljs-keyword">SELECT</span>
    <span class="hljs-keyword">id</span>,
    maxState(price) <span class="hljs-keyword">as</span> max_price,
    sumState(num_items) <span class="hljs-keyword">as</span> sum_items
<span class="hljs-keyword">FROM</span> inventory2
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">id</span>;
</code></pre>
<p>Now let's insert rows into it and see the results:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> inventory2 <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">3</span>, <span class="hljs-number">100</span>, <span class="hljs-number">2</span>), (<span class="hljs-number">3</span>, <span class="hljs-number">500</span>, <span class="hljs-number">4</span>);

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, maxMerge(max_price) <span class="hljs-keyword">AS</span> max_price, sumMerge(sum_items) <span class="hljs-keyword">AS</span> sum_items 
<span class="hljs-keyword">FROM</span> agg_inventory <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span>=<span class="hljs-number">3</span> <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">id</span>;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439388131/1e38c08d-478a-48ac-a68f-2eff01f5aeef.png" alt="Output of aggregating merge tree" /></p>
<h4 id="heading-usage-2">Usage</h4>
<p>This engine helps you reduce the response time of heavy, fixed analytical queries by computing them at write time. That also decreases the database load at query time.</p>
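Conceptually, the `*State` combinators store a partial aggregation state per inserted part, and the `*Merge` combinators combine those states at query time. A toy model of that split:

```python
# Toy model of AggregatingMergeTree: each insert produces a partial
# aggregate state; reads merge the states instead of rescanning raw rows.
inserts = [
    [(3, 100, 2), (3, 500, 4)],   # (id, price, num_items) per batch
    [(3, 250, 1)],
]

# "State" step: one partial aggregate per batch, keyed by id.
states = []
for batch in inserts:
    part = {}
    for item_id, price, n in batch:
        max_p, total = part.get(item_id, (float("-inf"), 0))
        part[item_id] = (max(max_p, price), total + n)
    states.append(part)

# "Merge" step: combine the partial states at read time.
merged = {}
for part in states:
    for item_id, (max_p, total) in part.items():
        prev_max, prev_total = merged.get(item_id, (float("-inf"), 0))
        merged[item_id] = (max(prev_max, max_p), prev_total + total)

print(merged)  # {3: (500, 7)}
```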
<h2 id="heading-log-family">Log Family</h2>
<p>Lightweight engines with minimum functionality. They're the most effective when you need to quickly write many small tables (up to approximately 1 million rows) and read them later. Additionally, there are no indexes in this family. However, <code>Log</code> and <code>StripeLog</code> engines can break down data into multiple blocks to support multi-threading while reading data.</p>
<p>I will only look into the <em>TinyLog</em> engine. To check the others, you can visit <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines#log">this link</a>.</p>
<h3 id="heading-tinylog">TinyLog</h3>
<p>This engine is mainly used in a write-once fashion, i.e., you write data once and then read it as often as you want. As ClickHouse reads the data in a single stream, it's better to keep the table size up to 1M rows.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> log_location
 (
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`long`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`lat`</span> Int32
) <span class="hljs-keyword">ENGINE</span> = TinyLog;
</code></pre>
<h4 id="heading-usage-3">Usage</h4>
<p>You can use this engine as an intermediate state for batch operations.</p>
<h2 id="heading-integration-family">Integration Family</h2>
<p>The engines in this family are widely used to connect with other databases and brokers with the ability to fetch or insert data.</p>
<p>I'll cover MySQL and Kafka Engines, but you can study the others <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines#integration-engines">here</a>.</p>
<h3 id="heading-mysql-engine">MySQL Engine</h3>
<p>With this engine, you can connect with a MySQL database through ClickHouse and read its data or insert rows.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> mysql_inventory
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`price`</span> Int32
)
<span class="hljs-keyword">ENGINE</span> = MySQL(<span class="hljs-string">'host:port'</span>, <span class="hljs-string">'database'</span>, <span class="hljs-string">'table'</span>, <span class="hljs-string">'user'</span>, <span class="hljs-string">'password'</span>)
</code></pre>
<h3 id="heading-kafka-engine">Kafka Engine</h3>
<p>Using this engine, you can make a connection to a Kafka Cluster and read its data with a defined consumer group. This engine is broadly used for CDC purposes.</p>
<p>To learn more about this feature, read <a target="_blank" href="https://medium.com/@hoptical/apply-cdc-from-mysql-to-clickhouse-d660873311c7">this</a> article specifically on this topic.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, we saw some of the most important engines of the ClickHouse database. It is clear that ClickHouse provides a wide range of engine options to suit various use-cases. The Merge Tree engine is the default engine and is suitable for most scenarios, but it can be replaced with other engines like AggregatingMergeTree, TinyLog, etc.</p>
<p>It's important to note that choosing the right engine for your use-case can significantly improve performance and efficiency. Therefore, it's worth taking the time to understand the strengths and limitations of each engine and select the one that best meets your needs.</p>
<hr />
<p><em>Originally published at</em> <a target="_blank" href="https://dev.to/hoptical/clickhouse-basic-tutorial-table-engines-30i1"><em>dev.to</em></a> <em>on April 29, 2023.</em></p>
]]></content:encoded></item><item><title><![CDATA[Step-by-Step Guide: Deploying Kafka Connect via Strimzi Operator on Kubernetes]]></title><description><![CDATA[Photo by Ryoji Iwata on Unsplash
Strimzi is almost the richest Kubernetes Kafka operator, which you can utilize to deploy Apache Kafka or its other components like Kafka Connect, Kafka Mirror, etc. This article will provide a step-by-step tutorial ab...]]></description><link>https://hamedkarbasi.com/step-by-step-guide-deploying-kafka-connect-via-strimzi-operator-on-kubernetes</link><guid isPermaLink="true">https://hamedkarbasi.com/step-by-step-guide-deploying-kafka-connect-via-strimzi-operator-on-kubernetes</guid><category><![CDATA[kafka connect]]></category><category><![CDATA[Strimzi]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Tutorial]]></category><category><![CDATA[openshift]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Tue, 25 Apr 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439392822/69b20859-6299-49d0-9b61-7d85433d3b1f.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a target="_blank" href="https://unsplash.com/pt-br/@ryoji__iwata?utm_source=medium&amp;utm_medium=referral">Ryoji Iwata</a> <a target="_blank" href="https://unsplash.com/pt-br/@ryoji__iwata?utm_source=medium&amp;utm_medium=referral">on Unsplash</a></p>
<p><a target="_blank" href="https://unsplash.com/?utm_source=medium&amp;utm_medium=referral">Strimzi</a> is almost the richest Kubernetes Kafka operator, which you can utilize to deploy Apache Kafka or its other components like Kafka Connect, Kafka Mirror, etc. This article will provide a step-by-step tutorial about deploying Kafka Connect on Kubernetes. I brought all issues I encountered during the deployment procedure and their best mitigation.</p>
<blockquote>
<p>Note: Consider that this operator is based on <a target="_blank" href="https://kafka.apache.org/">Apache Kafka</a>, not the <a target="_blank" href="https://docs.confluent.io/platform/current/platform.html">Confluent Platform</a>. That's why you may need to add some Confluent artifacts, like the <a target="_blank" href="https://www.confluent.io/hub/confluentinc/kafka-connect-avro-converter">Confluent Avro Converter</a>, to get the most out of it.</p>
</blockquote>
<p>This article is based on <code>Strimzi v0.29.0</code>, which supports the following versions:</p>
<ul>
<li><p>Strimzi: 0.29.0</p>
</li>
<li><p>Apache Kafka &amp; Kafka Connect: Up to 3.2</p>
</li>
<li><p>Equivalent Confluent Platform: 7.2.4</p>
</li>
</ul>
<blockquote>
<p>Note: You can convert the Confluent Platform version to the Apache Kafka version and vice versa using the table provided <a target="_blank" href="https://docs.confluent.io/platform/current/installation/versions-interoperability.html#supported-versions-and-interoperability-for-cp">here</a>.</p>
</blockquote>
<h2 id="heading-installation">Installation</h2>
<h3 id="heading-openshift-gui-and-kubernetes-cli">Openshift GUI and Kubernetes CLI</h3>
<p>If you're using OpenShift, navigate to Operators &gt; Installed Operators &gt; Strimzi &gt; Kafka Connect.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755439391616/80d17d7e-a787-40dc-906d-1ad07cf1528c.png" alt="Openshift Strimzi Operator Page" /></p>
<p>Now you will see a form containing the Kafka Connect configuration. You can get the equivalent YAML by clicking on the YAML view; any update in the form view is applied to the YAML view on the fly. Although the form view is quite straightforward, it's strongly recommended not to use it for creating the instance directly. Use it only to convert your desired configuration into a YAML file, then deploy the resource with the <code>kubectl apply</code> command. To summarize:</p>
<ol>
<li><p>Enter the configuration in the form view</p>
</li>
<li><p>Click on Yaml view</p>
</li>
<li><p>Copy its contents to a Yaml file on your local (e.g. <code>kafka-connect.yaml</code>)</p>
</li>
<li><p>Run: <code>kubectl apply -f kafka-connect.yaml</code></p>
</li>
</ol>
<p>Now the KafkaConnect resource should be deployed or updated. The deployed resources consist of a Deployment and its pods, a Service, ConfigMaps, and Secrets.</p>
<p>Let's get through the minimum configuration and make it more advanced, step by step.</p>
<h2 id="heading-minimum-configuration">Minimum Configuration</h2>
<p>To deploy a simple minimum configuration of Kafka Connect, you can use the below Yaml:</p>
<pre><code class="lang-yml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kafka.strimzi.io/v1beta2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">KafkaConnect</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-connect-cluster</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">&lt;YOUR_PROJECT_NAME&gt;</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">config.storage.replication.factor:</span> <span class="hljs-number">-1</span>
    <span class="hljs-attr">config.storage.topic:</span> <span class="hljs-string">okd4-connect-cluster-configs</span>
    <span class="hljs-attr">group.id:</span> <span class="hljs-string">okd4-connect-cluster</span>
    <span class="hljs-attr">offset.storage.replication.factor:</span> <span class="hljs-number">-1</span>
    <span class="hljs-attr">offset.storage.topic:</span> <span class="hljs-string">okd4-connect-cluster-offsets</span>
    <span class="hljs-attr">status.storage.replication.factor:</span> <span class="hljs-number">-1</span>
    <span class="hljs-attr">status.storage.topic:</span> <span class="hljs-string">okd4-connect-cluster-status</span>
  <span class="hljs-attr">bootstrapServers:</span> <span class="hljs-string">'kafka1:9092,kafka2:9092'</span>
  <span class="hljs-attr">version:</span> <span class="hljs-number">3.2</span><span class="hljs-number">.0</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
</code></pre>
<p>The Kafka Connect REST API is exposed on port 8083 of the pod. To expose it on a private or internal network, define a Route for it on OKD.</p>
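<p>As a sketch, a Route like the following could expose the REST API; the service name <code>my-connect-cluster-connect-api</code> follows Strimzi's naming for a KafkaConnect named <code>my-connect-cluster</code>, but verify it against your own deployment:</p>

```yaml
# Hypothetical OKD/OpenShift Route exposing the Connect REST API
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: kafka-connect-api
  namespace: <YOUR_PROJECT_NAME>
spec:
  to:
    kind: Service
    name: my-connect-cluster-connect-api  # assumption: Strimzi's <cluster>-connect-api service
  port:
    targetPort: 8083
```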
<h2 id="heading-rest-api-authentication">REST API Authentication</h2>
<p>With the configuration explained <a target="_blank" href="https://docs.confluent.io/platform/current/security/basic-auth.html#kconnect-rest-api">here</a>, you can add authentication to the Kafka Connect REST API. Unfortunately, that doesn't work with the Strimzi operator, as discussed <a target="_blank" href="https://github.com/strimzi/strimzi-kafka-operator/issues/3229">here</a>. So to secure Kafka Connect, you have two options:</p>
<ol>
<li><p>Use the <em>KafkaConnector</em> custom resource. The Strimzi operator lets you define a connector as a <code>KafkaConnector</code> kind in a YAML file. However, this may not be practical for use cases where updating, pausing, and stopping connectors via the REST API is necessary.</p>
</li>
<li><p>Put the insecure REST API behind an authenticated API Gateway like Apache APISIX or any other tool or self-developed application.</p>
</li>
</ol>
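<p>For reference, option 1 looks like the following sketch; the connector class and config values are assumptions for a Debezium MySQL source, so replace them with your own:</p>

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector            # hypothetical name
  labels:
    strimzi.io/cluster: my-connect-cluster  # must match the KafkaConnect name
spec:
  class: io.debezium.connector.mysql.MySqlConnector  # assumption
  tasksMax: 1
  config:
    database.hostname: mysql-host      # assumption
    database.port: 3306
```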
<h2 id="heading-jmx-prometheus-metrics">JMX Prometheus Metrics</h2>
<p>To expose JMX Prometheus metrics, which are useful for observing connector statuses in Grafana, add the below configuration:</p>
<pre><code class="lang-yml">  <span class="hljs-attr">metricsConfig:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">jmxPrometheusExporter</span>
    <span class="hljs-attr">valueFrom:</span>
      <span class="hljs-attr">configMapKeyRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">jmx-prometheus</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">configs</span>
  <span class="hljs-attr">jmxOptions:</span> {}
</code></pre>
<p>It uses a pre-defined config for Prometheus export. You can use this config:</p>
<pre><code class="lang-yml"><span class="hljs-attr">startDelaySeconds:</span> <span class="hljs-number">0</span>
<span class="hljs-attr">ssl:</span> <span class="hljs-literal">false</span>
<span class="hljs-attr">lowercaseOutputName:</span> <span class="hljs-literal">false</span>
<span class="hljs-attr">lowercaseOutputLabelNames:</span> <span class="hljs-literal">false</span>
<span class="hljs-attr">rules:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">pattern:</span> <span class="hljs-string">"kafka.connect&lt;type=connect-worker-metrics&gt;([^:]+):"</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"kafka_connect_connect_worker_metrics_$1"</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">pattern:</span> <span class="hljs-string">"kafka.connect&lt;type=connect-metrics, client-id=([^:]+)&gt;&lt;&gt;([^:]+)"</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"kafka_connect_connect_metrics_$2"</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">client:</span> <span class="hljs-string">"$1"</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">pattern:</span> <span class="hljs-string">"debezium.([^:]+)&lt;type=connector-metrics, context=([^,]+), server=([^,]+), key=([^&gt;]+)&gt;&lt;&gt;RowsScanned"</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"debezium_metrics_RowsScanned"</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">plugin:</span> <span class="hljs-string">"$1"</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"$3"</span>
    <span class="hljs-attr">context:</span> <span class="hljs-string">"$2"</span>
    <span class="hljs-attr">table:</span> <span class="hljs-string">"$4"</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">pattern:</span> <span class="hljs-string">"debezium.([^:]+)&lt;type=connector-metrics, context=([^,]+), server=([^&gt;]+)&gt;([^:]+)"</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"debezium_metrics_$4"</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">plugin:</span> <span class="hljs-string">"$1"</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"$3"</span>
    <span class="hljs-attr">context:</span> <span class="hljs-string">"$2"</span>
</code></pre>
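<p>The <code>metricsConfig</code> above reads this rules file from a ConfigMap named <code>configs</code> under the key <code>jmx-prometheus</code>, so that ConfigMap must exist first. A minimal sketch:</p>

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: configs
  namespace: kafka-connect
data:
  jmx-prometheus: |
    startDelaySeconds: 0
    ssl: false
    # ... the rest of the rules file shown above
```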
<h3 id="heading-service-for-external-prometheus">Service for External Prometheus</h3>
<p>If you intend to deploy Prometheus alongside Strimzi to collect the metrics, follow the instructions <a target="_blank" href="https://strimzi.io/docs/operators/latest/deploying.html#assembly-metrics-setup-str">here</a>. However, in the case of an external Prometheus, the story goes another way:</p>
<p>The Strimzi operator only creates Service port mappings for these ports:</p>
<ul>
<li><p>8083: Kafka Connect REST API</p>
</li>
<li><p>9999: JMX port</p>
</li>
</ul>
<p>Sadly, it doesn't create a mapping for port 9404, the Prometheus exporter HTTP port. <a target="_blank" href="https://github.com/strimzi/strimzi-kafka-operator/issues/8403">So we have to create a Service on our own</a>:</p>
<pre><code class="lang-yml"><span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kafka-connect-jmx-prometheus</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kafka-connect</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app.kubernetes.io/instance:</span> <span class="hljs-string">kafka-connect</span>
    <span class="hljs-attr">app.kubernetes.io/managed-by:</span> <span class="hljs-string">strimzi-cluster-operator</span>
    <span class="hljs-attr">app.kubernetes.io/name:</span> <span class="hljs-string">kafka-connect</span>
    <span class="hljs-attr">app.kubernetes.io/part-of:</span> <span class="hljs-string">strimzi-kafka-connect</span>
    <span class="hljs-attr">strimzi.io/cluster:</span> <span class="hljs-string">kafka-connect</span>
    <span class="hljs-attr">strimzi.io/kind:</span> <span class="hljs-string">KafkaConnect</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">tcp-prometheus</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">9404</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">9404</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">strimzi.io/cluster:</span> <span class="hljs-string">kafka-connect</span>
    <span class="hljs-attr">strimzi.io/kind:</span> <span class="hljs-string">KafkaConnect</span>
    <span class="hljs-attr">strimzi.io/name:</span> <span class="hljs-string">kafka-connect-connect</span>
<span class="hljs-attr">status:</span>
  <span class="hljs-attr">loadBalancer:</span> {}
</code></pre>
<blockquote>
<p>Note: This method only works for single-pod deployments: you have to define a route for the service, and even with a headless service, the route returns one pod's IP at a time. Hence, Prometheus can't scrape all pods' metrics. That's why using a PodMonitor with an in-cluster Prometheus is recommended. This issue is discussed here</p>
</blockquote>
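<p>If you do run the Prometheus Operator in the cluster, a PodMonitor along these lines scrapes every Connect pod directly. This is a sketch: the port name <code>tcp-prometheus</code> is assumed to match the exporter port on the Strimzi pods, so verify it against your pod spec:</p>

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-connect-metrics
  namespace: kafka-connect
spec:
  selector:
    matchLabels:
      strimzi.io/kind: KafkaConnect
  podMetricsEndpoints:
    - path: /metrics
      port: tcp-prometheus   # assumption: Strimzi's metrics port name
```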
<h2 id="heading-plugins-and-artifacts">Plugins and Artifacts</h2>
<p>To add plugins and artifacts, there are two ways:</p>
<h3 id="heading-operator-build-section">Operator Build Section</h3>
<p>To add plugins, you can use the operator's build section. It takes the plugin or artifact addresses, downloads them in the build stage (the operator creates the BuildConfig automatically), and adds them to the plugin directory of the image.</p>
<p>It supports <code>jar</code>, <code>tgz</code>, <code>zip</code>, and <code>maven</code> artifact types. However, in the case of Maven, a multi-stage Dockerfile is created, which is <a target="_blank" href="https://bugzilla.redhat.com/show_bug.cgi?id=1937243">problematic on OpenShift</a> and fails in the build stage. Hence, you should only use the other types that don't need a compile stage (i.e., jar, zip, tgz) and end up with a single-stage Dockerfile.</p>
<p>For example, to add the Debezium MySQL plugin, you can use the below configuration:</p>
<pre><code class="lang-yml"><span class="hljs-attr">spec:</span>  
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">output:</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">'kafkaconnect:1.0'</span>
      <span class="hljs-attr">type:</span> <span class="hljs-string">imagestream</span>
    <span class="hljs-attr">plugins:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">artifacts:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">tgz</span>
            <span class="hljs-attr">url:</span> <span class="hljs-string">&gt;-
              https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/2.1.4.Final/debezium-connector-mysql-2.1.4.Final-plugin.tar.gz
</span>        <span class="hljs-attr">name:</span> <span class="hljs-string">debezium-connector-mysql</span>
</code></pre>
<blockquote>
<p>Note: The Strimzi operator is only able to download public artifacts. So if you wish to download a privately secured artifact that is not accessible by Kubernetes, you have to give up this method and follow the next one.</p>
</blockquote>
<h3 id="heading-changing-image">Changing Image</h3>
<p>The operator is able to use your desired image instead of its default one, so you can add your desired artifacts and plugins by building an image manually or via CI/CD. Another reason to use this method is that Strimzi uses the Apache Kafka image, not the Confluent Platform one, so the deployments lack useful Confluent packages such as the Confluent Avro Converter. You need to add them to your image and configure the operator to use your Docker image.</p>
<p>For example, If you want to add your customized Debezium MySQL Connector plugin from Gitlab Generic Packages and Confluent Avro Converter to the base image, first use this Dockerfile:</p>
<pre><code class="lang-dockerfile"><span class="hljs-string">ARG</span> <span class="hljs-string">CONFLUENT_VERSION=7.2.4</span>

<span class="hljs-comment"># Install confluent avro converter</span>
<span class="hljs-string">FROM</span> <span class="hljs-string">confluentinc/cp-kafka-connect:${CONFLUENT_VERSION}</span> <span class="hljs-string">as</span> <span class="hljs-string">cp</span>
<span class="hljs-comment"># Reassign version</span>
<span class="hljs-string">ARG</span> <span class="hljs-string">CONFLUENT_VERSION</span>
<span class="hljs-string">RUN</span> <span class="hljs-string">confluent-hub</span> <span class="hljs-string">install</span> <span class="hljs-string">--no-prompt</span> <span class="hljs-string">confluentinc/kafka-connect-avro-converter:${CONFLUENT_VERSION}</span>

<span class="hljs-comment"># Copy previous artifacts to the main Strimzi Kafka image</span>
<span class="hljs-string">FROM</span> <span class="hljs-string">quay.io/strimzi/kafka:0.29.0-kafka-3.2.0</span>
<span class="hljs-string">ARG</span> <span class="hljs-string">GITLAB_TOKEN</span>
<span class="hljs-string">ARG</span> <span class="hljs-string">CI_API_V4_URL=https://gitlab.snapp.ir/api/v4</span>
<span class="hljs-string">ARG</span> <span class="hljs-string">CI_PROJECT_ID=3873</span>
<span class="hljs-string">ARG</span> <span class="hljs-string">DEBEZIUM_CONNECTOR_MYSQL_CUSTOMIZED_VERSION=1.0</span>
<span class="hljs-string">USER</span> <span class="hljs-string">root:root</span>

<span class="hljs-comment"># Copy Confluent packages from previous stage</span>
<span class="hljs-string">RUN</span> <span class="hljs-string">mkdir</span> <span class="hljs-string">-p</span> <span class="hljs-string">/opt/kafka/plugins/avro/</span>
<span class="hljs-string">COPY</span> <span class="hljs-string">--from=cp</span> <span class="hljs-string">/usr/share/confluent-hub-components/confluentinc-kafka-connect-avro-converter/lib</span> <span class="hljs-string">/opt/kafka/plugins/avro/</span>

<span class="hljs-comment"># Connector plugin debezium-connector-mysql</span>
<span class="hljs-string">RUN</span> <span class="hljs-string">'mkdir'</span> <span class="hljs-string">'-p'</span> <span class="hljs-string">'/opt/kafka/plugins/debezium-connector-mysql'</span> <span class="hljs-string">\</span>
    <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">curl</span> <span class="hljs-string">--header</span> <span class="hljs-string">"${GITLAB_TOKEN}"</span> <span class="hljs-string">-f</span> <span class="hljs-string">-L</span> <span class="hljs-string">\</span>
    <span class="hljs-string">--output</span> <span class="hljs-string">/opt/kafka/plugins/debezium-connector-mysql.tgz</span> <span class="hljs-string">\</span>
    <span class="hljs-string">${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/debezium-customized/${DEBEZIUM_CONNECTOR_MYSQL_CUSTOMIZED_VERSION}/debezium-connector-mysql-customized.tar.gz</span> <span class="hljs-string">\</span>
    <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">'tar'</span> <span class="hljs-string">'xvfz'</span> <span class="hljs-string">'/opt/kafka/plugins/debezium-connector-mysql.tgz'</span> <span class="hljs-string">'-C'</span> <span class="hljs-string">'/opt/kafka/plugins/debezium-connector-mysql'</span> <span class="hljs-string">\</span>
    <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">'rm'</span> <span class="hljs-string">'-vf'</span> <span class="hljs-string">'/opt/kafka/plugins/debezium-connector-mysql.tgz'</span>

<span class="hljs-string">USER</span> <span class="hljs-number">1001</span>
</code></pre>
<p>Build the image, push it to the image stream or any other Docker registry, and configure the operator by adding the below line:</p>
<pre><code class="lang-yml"><span class="hljs-attr">spec:</span>  
  <span class="hljs-attr">image:</span> <span class="hljs-string">image-registry.openshift-image-registry.svc:5000/kafka-connect/kafkaconnect-customized:1.0</span>
</code></pre>
<h2 id="heading-kafka-authentication">Kafka Authentication</h2>
<p>Depending on the authentication type, you need different configurations. As an example, here is the configuration for Kafka with SASL over plaintext and the SCRAM-SHA-512 mechanism:</p>
<pre><code class="lang-yml"><span class="hljs-attr">spec:</span>
  <span class="hljs-attr">authentication:</span>
    <span class="hljs-attr">passwordSecret:</span>
      <span class="hljs-attr">password:</span> <span class="hljs-string">kafka-password</span>
      <span class="hljs-attr">secretName:</span> <span class="hljs-string">mysecrets</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">scram-sha-512</span>
    <span class="hljs-attr">username:</span> <span class="hljs-string">myuser</span>
</code></pre>
<p>Needless to say, you must provide the password in a Secret named <em>mysecrets</em>.</p>
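<p>A minimal sketch of that Secret, using <code>stringData</code> so Kubernetes does the Base64 encoding for you (the placeholder value is yours to fill in):</p>

```yaml
kind: Secret
apiVersion: v1
metadata:
  name: mysecrets
  namespace: kafka-connect
type: Opaque
stringData:
  kafka-password: <YOUR_KAFKA_PASSWORD>
```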
<h2 id="heading-handling-file-credentials">Handling File Credentials</h2>
<p>Since connectors need credentials to access databases, you have to define them as Secrets and access them through environment variables. However, if there are too many of them, you can put all credentials in a file and reference them in the connector with the <code>file</code> config provider:</p>
<p>1- Put all credentials as the value of a key named <em>credentials</em> in a secret file.</p>
<p><em>Credentials file:</em></p>
<pre><code class="lang-text">USERNAME_DB_1=user1
PASSWORD_DB_1=pass1

USERNAME_DB_2=user2
PASSWORD_DB_2=pass2
</code></pre>
<p><em>Secret file:</em></p>
<pre><code class="lang-yml"><span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">mysecrets</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kafka-connect</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">credentials:</span> <span class="hljs-string">&lt;BASE64</span> <span class="hljs-string">YOUR</span> <span class="hljs-string">DATA&gt;</span>
</code></pre>
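<p>To produce the Base64 value for the <code>credentials</code> key, you can encode the file with <code>base64</code>. This is a sketch; the <code>/tmp/credentials</code> path and the sample values are just for illustration:</p>

```shell
# Write the credentials file, then emit its Base64 for the Secret's data field.
# (Alternatively, use stringData in the Secret to skip manual encoding.)
printf 'USERNAME_DB_1=user1\nPASSWORD_DB_1=pass1\n' > /tmp/credentials
base64 -w 0 /tmp/credentials
```
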
<p>2- Configure the operator with the secret as volume:</p>
<pre><code class="lang-yml"><span class="hljs-attr">spec:</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">config.providers:</span> <span class="hljs-string">file</span>
    <span class="hljs-attr">config.providers.file.class:</span> <span class="hljs-string">org.apache.kafka.common.config.provider.FileConfigProvider</span>  
  <span class="hljs-attr">externalConfiguration:</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">database_credentials</span>
        <span class="hljs-attr">secret:</span>
          <span class="hljs-attr">items:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">credentials</span>
              <span class="hljs-attr">path:</span> <span class="hljs-string">credentials</span>
          <span class="hljs-attr">optional:</span> <span class="hljs-literal">false</span>
          <span class="hljs-attr">secretName:</span> <span class="hljs-string">mysecrets</span>
</code></pre>
<p>3- Now in the connector, you can access PASSWORD_DB_1 with the below placeholder:</p>
<pre><code class="lang-java"><span class="hljs-string">"${file:/opt/kafka/external-configuration/database_credentials/credentials:PASSWORD_DB_1}"</span>
</code></pre>
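<p>For example, a connector created through the REST API could reference the credentials like this. This is a sketch: the connector class and field names are assumptions, and only the <code>${file:...}</code> placeholders come from the setup above:</p>

```json
{
  "name": "mysql-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.user": "${file:/opt/kafka/external-configuration/database_credentials/credentials:USERNAME_DB_1}",
    "database.password": "${file:/opt/kafka/external-configuration/database_credentials/credentials:PASSWORD_DB_1}"
  }
}
```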
<h2 id="heading-put-it-all-together">Put it all together</h2>
<p>If we put all configurations together, we'll have the below configuration for Kafka Connect:</p>
<blockquote>
<p>Service, route, and build configuration are omitted since they were discussed earlier in the article.</p>
</blockquote>
<pre><code class="lang-yml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kafka.strimzi.io/v1beta2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">KafkaConnect</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kafka-connect</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kafka-connect</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">authentication:</span>
    <span class="hljs-attr">passwordSecret:</span>
      <span class="hljs-attr">password:</span> <span class="hljs-string">kafka-password</span>
      <span class="hljs-attr">secretName:</span> <span class="hljs-string">mysecrets</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">scram-sha-512</span>
    <span class="hljs-attr">username:</span> <span class="hljs-string">myuser</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">config.providers:</span> <span class="hljs-string">file</span>
    <span class="hljs-attr">config.providers.file.class:</span> <span class="hljs-string">org.apache.kafka.common.config.provider.FileConfigProvider</span>
    <span class="hljs-attr">config.storage.replication.factor:</span> <span class="hljs-number">-1</span>
    <span class="hljs-attr">config.storage.topic:</span> <span class="hljs-string">okd4-connect-cluster-configs</span>
    <span class="hljs-attr">group.id:</span> <span class="hljs-string">okd4-connect-cluster</span>
    <span class="hljs-attr">offset.storage.replication.factor:</span> <span class="hljs-number">-1</span>
    <span class="hljs-attr">offset.storage.topic:</span> <span class="hljs-string">okd4-connect-cluster-offsets</span>
    <span class="hljs-attr">status.storage.replication.factor:</span> <span class="hljs-number">-1</span>
    <span class="hljs-attr">status.storage.topic:</span> <span class="hljs-string">okd4-connect-cluster-status</span>
  <span class="hljs-attr">bootstrapServers:</span> <span class="hljs-string">'kafka1:9092, kafka2:9092'</span>
  <span class="hljs-attr">metricsConfig:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">jmxPrometheusExporter</span>
    <span class="hljs-attr">valueFrom:</span>
      <span class="hljs-attr">configMapKeyRef:</span>
        <span class="hljs-attr">key:</span> <span class="hljs-string">jmx-prometheus</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">configs</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">1Gi</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">1Gi</span>
  <span class="hljs-attr">readinessProbe:</span>
    <span class="hljs-attr">failureThreshold:</span> <span class="hljs-number">10</span>
    <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">60</span>
    <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">20</span>
  <span class="hljs-attr">jmxOptions:</span> {}
  <span class="hljs-attr">livenessProbe:</span>
    <span class="hljs-attr">failureThreshold:</span> <span class="hljs-number">10</span>
    <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">60</span>
    <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">20</span>
  <span class="hljs-attr">image:</span> <span class="hljs-string">image-registry.openshift-image-registry.svc:5000/kafka-connect/kafkaconnect-customized:1.0</span>
  <span class="hljs-attr">version:</span> <span class="hljs-number">3.2</span><span class="hljs-number">.0</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">externalConfiguration:</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">database_credentials</span>
        <span class="hljs-attr">secret:</span>
          <span class="hljs-attr">items:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">credentials</span>
              <span class="hljs-attr">path:</span> <span class="hljs-string">credentials</span>
          <span class="hljs-attr">optional:</span> <span class="hljs-literal">false</span>
          <span class="hljs-attr">secretName:</span> <span class="hljs-string">mysecrets</span>
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, deploying Kafka Connect using the Strimzi operator can be a powerful and efficient way to manage data integration in your organization. By leveraging the flexibility and scalability of Kafka, along with the ease of use and automation provided by the Strimzi operator, you can streamline your data pipelines and improve your data-driven decision-making. In this article, I've covered the key steps involved in deploying Kafka Connect via the Strimzi operator: creating a minimal custom resource, the REST API basic-authentication issue, Kafka authentication, JMX Prometheus metrics, plugins and artifacts, and handling file credentials. By following these steps, you can easily customize your Kafka Connect deployment to meet your specific needs.</p>
<hr />
<p><em>Originally published at</em> <a target="_blank" href="https://itnext.io/step-by-step-guide-deploying-kafka-connect-via-strimzi-operator-on-kubernetes-6357c123abe9"><em>itnext.io</em></a></p>
]]></content:encoded></item><item><title><![CDATA[ClickHouse Basic Tutorial: An Introduction]]></title><description><![CDATA[This is the first part of the ClickHouse Tutorial Series. In this series, I cover some practical and vital aspects of the ClickHouse database, a robust OLAP technology many enterprise companies utilize.
In this part, I'll talk about the main features...]]></description><link>https://hamedkarbasi.com/clickhouse-basic-tutorial-an-introduction</link><guid isPermaLink="true">https://hamedkarbasi.com/clickhouse-basic-tutorial-an-introduction</guid><category><![CDATA[ClickHouse]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Thu, 13 Apr 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755437186240/c76e1347-462d-4248-ae89-8570945dc6ec.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the first part of the <strong>ClickHouse Tutorial Series</strong>. In this series, I cover some practical and vital aspects of the ClickHouse database, a robust OLAP technology many enterprise companies utilize.</p>
<p>In this part, I'll talk about the main features, weaknesses, installation, and usage of ClickHouse. I'll also refer to some helpful links for those who want to dive into broader details.</p>
<h2 id="heading-what-is-clickhouse">What is ClickHouse</h2>
<p>ClickHouse is an open-source column-oriented database developed by Yandex. It is designed to provide high performance for analytical queries. ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. ClickHouse is used by companies such as Yandex, Facebook, and Uber for data analysis, machine learning, and more.</p>
<h3 id="heading-main-features">Main Features</h3>
<p>The main features of Clickhouse Database are:</p>
<h4 id="heading-column-oriented">Column-Oriented</h4>
<p>Data in ClickHouse is stored in <a target="_blank" href="https://clickhouse.com/docs/en/about-us/distinctive-features#true-column-oriented-database-management-system">columns instead of rows</a>, bringing at least two benefits:</p>
<ol>
<li><p>Every column is stored in a separate file; hence, stronger compression is possible on each column and on the whole table.</p>
</li>
<li><p>In range queries, which are common in analytical processing, the system can access and process data more easily, since data is sorted by some columns (i.e., the columns defined as sort keys). Additionally, it can parallelize processing across multiple cores while loading massive columns.</p>
</li>
</ol>
<p><img src="https://clickhouse.com/docs/assets/images/row-oriented-3e6fd5aa48e3075202d242b4799da8fa.gif#" alt /></p>
<p><em>Row-Oriented Database (Gif by</em> <a target="_blank" href="https://clickhouse.com/docs/en/faq/general/columnar-database"><em>ClickHouse</em></a><em>)</em></p>
<hr />
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cb05sjd69q2feq3w6gfn.gif" alt /></p>
<p><em>Columnar Database (Gif by</em> <a target="_blank" href="https://clickhouse.com/docs/en/faq/general/columnar-database"><em>ClickHouse</em></a><em>)</em></p>
<blockquote>
<p>Note: This should not be mistaken for wide-column databases like Cassandra, which store data in rows but let you denormalize intensive data into a table with many columns, leading to a NoSQL structure.</p>
</blockquote>
<h4 id="heading-data-compression">Data Compression</h4>
<p>Thanks to compression algorithms (<a target="_blank" href="https://github.com/facebook/zstd">zstd</a> and <a target="_blank" href="https://github.com/lz4/lz4">LZ4</a>), data occupies much less storage, even more than 20x smaller! You can study some of the benchmarks on ClickHouse and other databases storage <a target="_blank" href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIChwYXJ0aXRpb25lZCkiOnRydWUsIkF0aGVuYSAoc2luZ2xlKSI6dHJ1ZSwiQXVyb3JhIGZvciBNeVNRTCI6dHJ1ZSwiQXVyb3JhIGZvciBQb3N0Z3JlU1FMIjp0cnVlLCJCeXRlSG91c2UiOnRydWUsIkNpdHVzIjp0cnVlLCJjbGlja2hvdXNlLWxvY2FsIChwYXJ0aXRpb25lZCkiOnRydWUsImNsaWNraG91c2UtbG9jYWwgKHNpbmdsZSkiOnRydWUsIkNsaWNrSG91c2UgKHdlYikiOnRydWUsIkNsaWNrSG91c2UiOnRydWUsIkNsaWNrSG91c2UgKHR1bmVkKSI6dHJ1ZSwiQ2xpY2tIb3VzZSAoenN0ZCkiOnRydWUsIkNsaWNrSG91c2UgQ2xvdWQiOnRydWUsIkNyYXRlREIiOnRydWUsIkRhdGFiZW5kIjp0cnVlLCJEYXRhRnVzaW9uIChzaW5nbGUpIjp0cnVlLCJBcGFjaGUgRG9yaXMiOnRydWUsIkRydWlkIjp0cnVlLCJEdWNrREIgKFBhcnF1ZXQpIjp0cnVlLCJEdWNrREIiOnRydWUsIkVsYXN0aWNzZWFyY2giOnRydWUsIkVsYXN0aWNzZWFyY2ggKHR1bmVkKSI6ZmFsc2UsIkdyZWVucGx1bSI6dHJ1ZSwiSGVhdnlBSSI6dHJ1ZSwiSHlkcmEiOnRydWUsIkluZm9icmlnaHQiOnRydWUsIktpbmV0aWNhIjp0cnVlLCJNYXJpYURCIENvbHVtblN0b3JlIjp0cnVlLCJNYXJpYURCIjpmYWxzZSwiTW9uZXREQiI6dHJ1ZSwiTW9uZ29EQiI6dHJ1ZSwiTXlTUUwgKE15SVNBTSkiOnRydWUsIk15U1FMIjp0cnVlLCJQaW5vdCI6dHJ1ZSwiUG9zdGdyZVNRTCAodHVuZWQpIjpmYWxzZSwiUG9zdGdyZVNRTCI6dHJ1ZSwiUXVlc3REQiAocGFydGl0aW9uZWQpIjp0cnVlLCJRdWVzdERCIjp0cnVlLCJSZWRzaGlmdCI6dHJ1ZSwiU2VsZWN0REIiOnRydWUsIlNpbmdsZVN0b3JlIjp0cnVlLCJTbm93Zmxha2UiOnRydWUsIlNRTGl0ZSI6dHJ1ZSwiU3RhclJvY2tzIjp0cnVlLCJUaW1lc2NhbGVEQiAoY29tcHJlc3Npb24pIjp0cnVlLCJUaW1lc2NhbGVEQiI6dHJ1ZX0sInR5cGUiOnsic3RhdGVsZXNzIjp0cnVlLCJtYW5hZ2VkIjp0cnVlLCJKYXZhIjp0cnVlLCJjb2x1bW4tb3JpZW50ZWQiOnRydWUsIkMrKyI6dHJ1ZSwiTXlTUUwgY29tcGF0aWJsZSI6dHJ1ZSwicm93LW9yaWVudGVkIjp0cnVlLCJDIjp0cnVlLCJQb3N0Z3JlU1FMIGNvbXBhdGlibGUiOnRydWUsIkNsaWNrSG91c2UgZGVyaXZhdGl2ZSI6dHJ1ZSwiZW1iZWRkZWQiOnRydWUsInNlcnZlcmxlc3MiOnRydWUsIlJ1c3QiOnRydWUsInNlYXJjaCI6dHJ1ZSwiZG9jdW1lbnQiO
nRydWUsInRpbWUtc2VyaWVzIjp0cnVlfSwibWFjaGluZSI6eyJzZXJ2ZXJsZXNzIjp0cnVlLCIxNmFjdSI6dHJ1ZSwiTCI6dHJ1ZSwiTSI6dHJ1ZSwiUyI6dHJ1ZSwiWFMiOnRydWUsImM2YS40eGxhcmdlLCA1MDBnYiBncDIiOnRydWUsImM1bi40eGxhcmdlLCAyMDBnYiBncDIiOnRydWUsImM1LjR4bGFyZ2UsIDUwMGdiIGdwMiI6dHJ1ZSwiYzZhLm1ldGFsLCA1MDBnYiBncDIiOnRydWUsIjE2IHRocmVhZHMiOnRydWUsIjIwIHRocmVhZHMiOnRydWUsIjI0IHRocmVhZHMiOnRydWUsIjI4IHRocmVhZHMiOnRydWUsIjMwIHRocmVhZHMiOnRydWUsIjQ4IHRocmVhZHMiOnRydWUsIjYwIHRocmVhZHMiOnRydWUsIm01ZC4yNHhsYXJnZSI6dHJ1ZSwiYzZhLjR4bGFyZ2UsIDE1MDBnYiBncDIiOnRydWUsInJhMy4xNnhsYXJnZSI6dHJ1ZSwicmEzLjR4bGFyZ2UiOnRydWUsInJhMy54bHBsdXMiOnRydWUsImRjMi44eGxhcmdlIjp0cnVlLCJTMiI6dHJ1ZSwiUzI0Ijp0cnVlLCIyWEwiOnRydWUsIjNYTCI6dHJ1ZSwiNFhMIjp0cnVlLCJYTCI6dHJ1ZX0sImNsdXN0ZXJfc2l6ZSI6eyIxIjp0cnVlLCIyIjp0cnVlLCI0Ijp0cnVlLCI4Ijp0cnVlLCIxNiI6dHJ1ZSwiMzIiOnRydWUsIjY0Ijp0cnVlLCIxMjgiOnRydWUsInNlcnZlcmxlc3MiOnRydWUsInVuZGVmaW5lZCI6dHJ1ZX0sIm1ldHJpYyI6InNpemUiLCJxdWVyaWVzIjpbdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZV19">here</a>.</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xauowhmmaq89hmgoxvht.png" alt /></p>
<p><em>ClickHouse columnar structure leads to storing and reading columns more efficiently (Graph by</em> <a target="_blank" href="https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql"><em>Altinity</em></a><em>)</em></p>
<h4 id="heading-scalability">Scalability</h4>
<p>ClickHouse scales well both vertically and horizontally. It can be scaled by adding <a target="_blank" href="https://altinitydb.medium.com/clickhouse-for-time-series-scalability-benchmarks-e181132a895b">extra replicas</a> and extra shards to process queries in a distributed way. ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. All nodes are equal, which allows for avoiding having single points of failure.</p>
<h3 id="heading-weaknesses">Weaknesses</h3>
<p>To mention some:</p>
<ul>
<li><p>Lack of a full-fledged UPDATE/DELETE implementation: ClickHouse is not designed for modifications and mutations, so you will see poor performance on those kinds of queries.</p>
</li>
<li><p>OLTP queries, such as point lookups, will not make you happy either, since ClickHouse is easily outperformed on them by traditional RDBMSs like MySQL.</p>
</li>
</ul>
<h3 id="heading-rivals-and-alternatives">Rivals and Alternatives</h3>
<p>To name a few:</p>
<ul>
<li><p>Apache Druid</p>
</li>
<li><p>ElasticSearch</p>
</li>
<li><p>SingleStore</p>
</li>
<li><p>Snowflake</p>
</li>
<li><p>TimescaleDB</p>
</li>
</ul>
<p>Surely, each one is suitable for different use cases and has its pros and cons, but I won't discuss their comparison here. However, you can study some valuable benchmarks <a target="_blank" href="https://benchmark.clickhouse.com/">here</a> and <a target="_blank" href="https://www.timescale.com/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data/">here</a>.</p>
<h2 id="heading-quick-start">Quick Start</h2>
<h3 id="heading-installation">Installation</h3>
<p>I only cover the Docker approach here. For other methods on different distros, please follow <a target="_blank" href="https://clickhouse.com/docs/en/install">ClickHouse's official installation guide.</a></p>
<p>The docker-compose file:</p>
<pre><code class="lang-yml"><span class="hljs-attr">version:</span> <span class="hljs-string">'2'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">clickhouse:</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">myclickhouse</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">clickhouse/clickhouse-server:latest</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8123:8123"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9000:9000"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./clickhouse-data:/var/lib/clickhouse/</span>  
    <span class="hljs-attr">restart:</span> <span class="hljs-string">unless-stopped</span>
</code></pre>
<p>And then run it by:</p>
<pre><code class="lang-bash">docker compose up -d
</code></pre>
<p>As you can see, two ports have been exposed:</p>
<ul>
<li><p><strong>8123</strong>: HTTP API Port for HTTP requests, used by JDBC, ODBC, and web interfaces.</p>
</li>
<li><p><strong>9000</strong>: Native Protocol port (ClickHouse TCP protocol). Used by ClickHouse apps and processes like <em>clickhouse-server</em>, <em>clickhouse-client</em>, and native ClickHouse tools. Used for inter-server communication for distributed queries.</p>
</li>
</ul>
<p>It's up to your client driver which one to use. For example, <a target="_blank" href="https://dbeaver.io/">DBeaver</a> uses 8123, and the <a target="_blank" href="https://clickhouse-driver.readthedocs.io/en/latest/">Python ClickHouse-Driver</a> uses 9000.</p>
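As a quick illustration of the HTTP interface on port 8123, a query can be sent as a plain URL parameter. This is a minimal sketch; the localhost address assumes the Docker setup above is running on your machine.

```python
from urllib.parse import urlencode

# ClickHouse's HTTP interface accepts the SQL statement as a `query` parameter.
base = "http://localhost:8123/"
url = base + "?" + urlencode({"query": "SELECT 1"})
print(url)
# With the server running, urllib.request.urlopen(url).read() would
# return the query result over plain HTTP.
```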
<p>To continue the tutorial, we use clickhouse-client, which is available inside the container:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it myclickhouse clickhouse-client
</code></pre>
<h3 id="heading-database-and-table-creation">Database and Table Creation</h3>
<p>Create database test:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">DATABASE</span> <span class="hljs-keyword">test</span>;
</code></pre>
<p>Create the table <code>orders</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> test.orders
(<span class="hljs-string">`OrderID`</span> Int64,
<span class="hljs-string">`CustomerID`</span> Int64,
<span class="hljs-string">`OrderDate`</span> DateTime,
<span class="hljs-string">`Comments`</span> <span class="hljs-keyword">String</span>,
<span class="hljs-string">`Cancelled`</span> <span class="hljs-built_in">Bool</span>)
<span class="hljs-keyword">ENGINE</span> = MergeTree
PRIMARY <span class="hljs-keyword">KEY</span> (OrderID, OrderDate)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (OrderID, OrderDate, CustomerID)
<span class="hljs-keyword">SETTINGS</span> index_granularity = <span class="hljs-number">8192</span>;
</code></pre>
<p>In the next parts, we'll talk about other configurations like <code>Engine</code>, <code>PRIMARY KEY</code>, <code>ORDER BY</code>, etc.</p>
<h3 id="heading-insert-data">Insert Data</h3>
<p>To insert sample data:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> test.orders 
<span class="hljs-keyword">VALUES</span> (<span class="hljs-number">334</span>, <span class="hljs-number">123</span>, <span class="hljs-string">'2021-09-15 14:30:00'</span>, <span class="hljs-string">'some comment'</span>, 
<span class="hljs-literal">false</span>);
</code></pre>
<h3 id="heading-read-data">Read Data</h3>
<p>Just like any other SQL query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> OrderID, OrderDate <span class="hljs-keyword">FROM</span> test.orders;
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In the first part of the <em>ClickHouse Tutorial Series</em>, we discussed the traits, features, and weaknesses of ClickHouse. Then we saw how to set up an instance with minimum configuration, create a database and table, insert data into it, and read from it.</p>
<h2 id="heading-useful-links">Useful Links</h2>
<ul>
<li><p><a target="_blank" href="https://dev.to/taw/getting-started-with-clickhouse-3nf9">Getting Started with ClickHouse</a> <a target="_blank" href="https://dev.to/taw/getting-started-with-clickhouse-3nf9">by Tomasz Wegrzanowski</a></p>
</li>
<li><p><a target="_blank" href="https://dev.to/olena_kutsenko/introduction-to-clickhouse-8em">Introduction to ClickHouse</a> by <a target="_blank" href="https://dev.to/olena_kutsenko">olena_kutsenko</a></p>
</li>
</ul>
<hr />
<p><em>Originally published at</em> <a target="_blank" href="https://dev.to/hoptical/clickhouse-basic-tutorial-an-introduction-52il"><em>https://dev.to</em></a> <em>on April 13, 2023.</em></p>
]]></content:encoded></item><item><title><![CDATA[Apply CDC from MySQL to ClickHouse]]></title><description><![CDATA[Photo by Martin Adams on Unsplash
Introduction
Suppose that you have a database handling OLTP queries. To tackle intensive analytical BI reports, you set up an OLAP-friendly database such as ClickHouse. How do you synchronize your follower database (...]]></description><link>https://hamedkarbasi.com/apply-cdc-from-mysql-to-clickhouse-d660873311c7</link><guid isPermaLink="true">https://hamedkarbasi.com/apply-cdc-from-mysql-to-clickhouse-d660873311c7</guid><category><![CDATA[cdc]]></category><category><![CDATA[change data capture]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[debezium]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Fri, 17 Feb 2023 11:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383513314/9ad758de-4cd2-4eab-8692-c77a45e75fe4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a target="_blank" href="https://unsplash.com/@martinadams?utm_source=medium&amp;utm_medium=referral">Martin Adams</a> on <a target="_blank" href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<h2 id="heading-introduction">Introduction</h2>
<p>Suppose that you have a database handling <a target="_blank" href="https://en.wikipedia.org/wiki/Online_transaction_processing">OLTP</a> queries. To tackle intensive analytical BI reports, you set up an <a target="_blank" href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>-friendly database such as ClickHouse. How do you synchronize your follower database (which is ClickHouse here)? What challenges should you be prepared for?</p>
<p>Synchronizing two or more databases in a data-intensive application is one of the usual routines you may have encountered before or are dealing with now. Thanks to Change Data Capture (CDC) and technologies such as Kafka, this process is not sophisticated anymore. However, depending on the databases you’re utilizing, it could be challenging if the source database works in the OLTP paradigm and the target in the OLAP. In this article, I will walk through this process from MySQL as the source to ClickHouse as the destination. Although I’ve limited this article to those technologies, it’s pretty generalizable to similar cases.</p>
<h2 id="heading-system-design-overview">System Design Overview</h2>
<p>Contrary to how it sounds, it’s quite straightforward. The database changes are captured via <a target="_blank" href="https://debezium.io/">Debezium</a> and published as events to Apache Kafka. ClickHouse consumes those changes, in partial order, via the <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka/">Kafka Engine</a>. Real-time and eventually consistent.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383508733/ba307241-6285-4b99-a3cb-f78428f293a1.jpeg" alt="CDC Architecture" class="image--center mx-auto" /></p>
<h2 id="heading-case-study">Case Study</h2>
<p>Imagine that we have an <em>orders</em> table in MySQL with the following DDL:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-string">`orders`</span> (
  <span class="hljs-string">`id`</span> <span class="hljs-built_in">int</span>(<span class="hljs-number">11</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  <span class="hljs-string">`status`</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">50</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  <span class="hljs-string">`price`</span> <span class="hljs-built_in">varchar</span>(<span class="hljs-number">50</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  PRIMARY <span class="hljs-keyword">KEY</span> (<span class="hljs-string">`id`</span>)
) <span class="hljs-keyword">ENGINE</span>=<span class="hljs-keyword">InnoDB</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">CHARSET</span>=latin1
</code></pre>
<p>Users may create, delete and update any column or the whole record. We want to capture its changes and sink them to ClickHouse to synchronize them.</p>
<p>We’re going to use Debezium <em>v2.1</em> and the <em>ReplacingMergeTree</em> engine in ClickHouse.</p>
<h2 id="heading-implementation">Implementation</h2>
<h3 id="heading-step-1-cdc-with-debezium">Step 1: CDC with Debezium</h3>
<p>Most databases have a log to which every operation is written before being applied to the data (the Write-Ahead Log, or WAL). In MySQL, this file is called the Binlog. If you read that file, then parse it and apply it to your target database, you’re following the Change Data Capture (CDC) approach.</p>
<p>CDC is one of the best ways to synchronize two or more heterogeneous databases. It’s real-time, eventually consistent, and saves you from costlier alternatives such as batch backfills with Airflow. No matter what happens on the source, you can capture it in order and stay consistent with the original (eventually, of course!).</p>
<p>Debezium is a well-known tool for reading and parsing the Binlog. It simply integrates with Kafka Connect as a connector and produces every change on a Kafka topic.</p>
<p>To do so, you’ve to enable log-bin on the MySQL database and set up Kafka Connect, Kafka, and Debezium accordingly. Since it is well-explained in other articles like <a target="_blank" href="https://rmoff.net/2018/03/24/streaming-data-from-mysql-into-kafka-with-kafka-connect-and-debezium/">this</a> or <a target="_blank" href="https://medium.com/nagoya-foundation/simple-cdc-with-debezium-kafka-a27b28d8c3b8">this</a>, I’ll only focus on the Debezium configuration customized for our purpose: Capture the changes while being functional and parsable by ClickHouse.</p>
<p>Before showing the whole configuration, we should discuss three necessary configs:</p>
<h3 id="heading-extracting-new-record-state">Extracting New Record State</h3>
<p>By default, Debezium emits every record consisting of <em>before</em> and <em>after</em> states for every operation, which is hard to parse with a ClickHouse Kafka table. Additionally, it creates tombstone records (i.e., records with a Null value) for delete operations (again, unparsable by ClickHouse). The entire behavior is demonstrated in the table below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383509750/957f1979-03a9-4339-9a78-504de7b63263.png" alt="Records states for different operations in the default configuration." /></p>
<p>We use the ExtractNewRecordState transformer in the Debezium configuration to handle the problem. With this option, Debezium only keeps the <em>after</em> state for <strong>create/update</strong> operations and disregards the before state. As a drawback, it drops the <strong>delete</strong> record containing the previous state, along with the tombstone record mentioned earlier. In other words, you won’t capture the delete operation anymore. Don’t worry! We’ll tackle that in the next section.</p>
<pre><code class="lang-json"><span class="hljs-string">"transforms"</span>: <span class="hljs-string">"unwrap"</span>,
<span class="hljs-string">"transforms.unwrap.type"</span>: <span class="hljs-string">"io.debezium.transforms.ExtractNewRecordState"</span>
</code></pre>
<p>The picture below shows how the state <em>before</em> is dropped and <em>after</em> is flattened by using the <em>ExtractNewRecord</em> configuration.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383510751/74b165a5-e794-4182-bab0-1d697ce1d5db.png" alt /></p>
<p><em>Left: Record without ExtractNewRecord config; Right: Record with ExtractNewRecord config</em></p>
<h3 id="heading-rewriting-delete-events">Rewriting Delete Events</h3>
<p>To capture delete operations, we must add the <strong>rewrite</strong> config as below:</p>
<p>"transforms.unwrap.delete.handling.mode":"rewrite"</p>
<p>Debezium adds the field <code>__deleted</code> with this config, which is true for delete operations and false for the others. Hence, a deletion will contain the previous state as well as a <code>__deleted:true</code> field.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383511939/a08fe992-10b3-4a5d-abee-1e38c69275e7.png" alt /></p>
<p><em>The __deleted field is added after using the rewrite configuration</em></p>
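To make the two transform options above concrete, here’s a small Python simulation of what the SMT does to an event envelope. This is a simplified sketch of the behavior described in this section, not Debezium’s actual code.

```python
def extract_new_record_state(event, delete_mode="rewrite"):
    """Simplified sketch of Debezium's ExtractNewRecordState behavior."""
    op = event["op"]
    if op in ("c", "u", "r"):          # create / update / snapshot read
        flat = dict(event["after"])    # keep only the flattened `after` state
        flat["__deleted"] = "false"    # field added by the rewrite config
        return flat
    if op == "d":                      # delete
        if delete_mode == "rewrite":
            flat = dict(event["before"])   # previous state survives
            flat["__deleted"] = "true"
            return flat
        return None                    # without rewrite, the delete is dropped

update = {"op": "u", "before": {"id": 1, "status": "new"},
          "after": {"id": 1, "status": "paid"}}
delete = {"op": "d", "before": {"id": 1, "status": "paid"}, "after": None}

print(extract_new_record_state(update))  # flattened `after` state
print(extract_new_record_state(delete))  # previous state plus __deleted=true
```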
<h3 id="heading-handling-non-primary-keys-update">Handling Non-Primary Keys Update</h3>
<p>With the mentioned configurations, updating a record (any column except the primary key) emits a simple record with the new state. Having another relational database with the same DDL is OK, since the updated record replaces the previous one in the destination. But in the case of ClickHouse, the story goes wrong!</p>
<p>In our example, the source uses <em>id</em> as the primary key, and ClickHouse uses <em>id</em> and <em>status</em> as order keys. Replacement and uniqueness are only guaranteed for records with the same <em>id</em> and <em>status</em>! So what happens if the source updates the <em>status</em> column? We end up with duplicate records in ClickHouse, with equal <em>ids</em> but different <em>statuses</em>!</p>
<p>Fortunately, there is a way. By default, Debezium creates a delete record and a create record for an update on the primary key. So if the source updates the <em>id</em>, it emits a <strong>delete</strong> record with the previous <em>id</em> and a <strong>create</strong> record with the new <em>id</em>. The delete record, with its <code>__deleted=true</code> field, replaces our stale record in ClickHouse, and records implying deletion can then be filtered out in the view. We can extend this behavior to other columns with the option below:</p>
<pre><code class="lang-json"><span class="hljs-string">"message.key.columns"</span>: <span class="hljs-string">"inventory.orders:id;inventory.orders:status"</span>
</code></pre>
<p>Now by putting together all the above options and the usual ones, we’ll have a fully functional Debezium configuration capable of handling any change desired by ClickHouse:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"mysql-connector"</span>,
    <span class="hljs-attr">"config"</span>: {
        <span class="hljs-attr">"connector.class"</span>: <span class="hljs-string">"io.debezium.connector.mysql.MySqlConnector"</span>,
        <span class="hljs-attr">"database.hostname"</span>: <span class="hljs-string">"mysql"</span>,
        <span class="hljs-attr">"database.include.list"</span>: <span class="hljs-string">"inventory"</span>,
        <span class="hljs-attr">"database.password"</span>: <span class="hljs-string">"mypassword"</span>,
        <span class="hljs-attr">"database.port"</span>: <span class="hljs-string">"3306"</span>,
        <span class="hljs-attr">"database.server.id"</span>: <span class="hljs-string">"2"</span>,
        <span class="hljs-attr">"database.server.name"</span>: <span class="hljs-string">"dbz.inventory.v2"</span>,
        <span class="hljs-attr">"database.user"</span>: <span class="hljs-string">"root"</span>,
        <span class="hljs-attr">"message.key.columns"</span>: <span class="hljs-string">"inventory.orders:id;inventory.orders:status"</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"mysql-connector-v2"</span>,
        <span class="hljs-attr">"schema.history.internal.kafka.bootstrap.servers"</span>: <span class="hljs-string">"broker:9092"</span>,
        <span class="hljs-attr">"schema.history.internal.kafka.topic"</span>: <span class="hljs-string">"dbz.inventory.history.v2"</span>,
        <span class="hljs-attr">"snapshot.mode"</span>: <span class="hljs-string">"schema_only"</span>,
        <span class="hljs-attr">"table.include.list"</span>: <span class="hljs-string">"inventory.orders"</span>,
        <span class="hljs-attr">"topic.prefix"</span>: <span class="hljs-string">"dbz.inventory.v2"</span>,
        <span class="hljs-attr">"transforms"</span>: <span class="hljs-string">"unwrap"</span>,
        <span class="hljs-attr">"transforms.unwrap.delete.handling.mode"</span>: <span class="hljs-string">"rewrite"</span>,
        <span class="hljs-attr">"transforms.unwrap.type"</span>: <span class="hljs-string">"io.debezium.transforms.ExtractNewRecordState"</span>
  }
}
</code></pre>
<h3 id="heading-important-how-to-choose-the-debezium-key-columns">Important: How to choose the Debezium key columns?</h3>
<p>By changing the key columns of the connector, Debezium uses those columns as the topic keys instead of the default primary key of the source table. Consequently, different operations on the same database record may end up in different partitions in Kafka. Since records lose their order across partitions, this can lead to inconsistency in ClickHouse unless you ensure that the ClickHouse order keys and the Debezium message keys are the same.</p>
<p>The rule of thumb is as below:</p>
<ol>
<li><p>Design the partition key and order key based on your desired table design.</p>
</li>
<li><p>Extract the source columns that the partition and sort keys originate from, in case they are calculated during materialization.</p>
</li>
<li><p>Union all of those columns</p>
</li>
<li><p>Define the result of step 3 as the <em>message.key.columns</em> in the Debezium connector configuration.</p>
</li>
<li><p>Check whether the ClickHouse sort key has all those columns. If not, add them.</p>
</li>
</ol>
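The rule of thumb above can be sketched in a few lines. <code>debezium_key_columns</code> is a hypothetical helper, not part of Debezium; it unions the key columns and renders the <em>message.key.columns</em> value used in our connector configuration.

```python
def debezium_key_columns(table, partition_key, order_key):
    """Union the partition and order key columns (preserving order) and
    render them in the table:column form used by message.key.columns."""
    columns = []
    for column in partition_key + order_key:
        if column not in columns:  # de-duplicate while keeping order
            columns.append(column)
    return ";".join(f"{table}:{c}" for c in columns)

# For our orders example: partition/order keys built from id and status
print(debezium_key_columns("inventory.orders", ["id"], ["id", "status"]))
# inventory.orders:id;inventory.orders:status
```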
<h2 id="heading-step-2-clickhouse-tables">Step 2: ClickHouse Tables</h2>
<p>ClickHouse can sink Kafka records into a table by utilizing the <a target="_blank" href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka/">Kafka Engine</a>. We need to define three tables: the Kafka table, the consumer materialized view, and the main table.</p>
<h3 id="heading-kafka-table">Kafka Table</h3>
<p>The Kafka table defines the record structure and the Kafka topic to be read.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> default.kafka_orders
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`__deleted`</span> Nullable(<span class="hljs-keyword">String</span>)
)
<span class="hljs-keyword">ENGINE</span> = Kafka(<span class="hljs-string">'broker:9092'</span>, <span class="hljs-string">'inventory.orders'</span>, <span class="hljs-string">'clickhouse'</span>, <span class="hljs-string">'AvroConfluent'</span>)
<span class="hljs-keyword">SETTINGS</span> format_avro_schema_registry_url = <span class="hljs-string">'http://schema-registry:8081'</span>
</code></pre>
<h3 id="heading-consumer-materializer">Consumer Materializer</h3>
<p>Every record of the Kafka table is only read once, since its consumer group bumps the offset and we can’t read it twice. So we need to define a main table and materialize every Kafka table record into it via a materialized view:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">MATERIALIZED</span> <span class="hljs-keyword">VIEW</span> default.consumer__orders <span class="hljs-keyword">TO</span> default.stream_orders
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`__deleted`</span> Nullable(<span class="hljs-keyword">String</span>)
) <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span>
    <span class="hljs-keyword">id</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">id</span>,
    <span class="hljs-keyword">status</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">status</span>,
    price <span class="hljs-keyword">AS</span> price,
    __deleted <span class="hljs-keyword">AS</span> __deleted
<span class="hljs-keyword">FROM</span> default.kafka_orders
</code></pre>
<h3 id="heading-main-table">Main Table</h3>
<p>The main table has the source structure plus the <code>__deleted</code> field. I’m using a ReplacingMergeTree, since we need to replace stale records with their deleted or updated versions.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> default.stream_orders
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`__deleted`</span><span class="hljs-keyword">String</span>
)
<span class="hljs-keyword">ENGINE</span> = ReplacingMergeTree
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">id</span>, price)
<span class="hljs-keyword">SETTINGS</span> index_granularity = <span class="hljs-number">8192</span>
</code></pre>
<h3 id="heading-view-table">View Table</h3>
<p>Finally, we need to filter out deleted records (since we don’t want to see them) and keep only the most recent record among those sharing the same sort key. The latter can be tackled with the <em>FINAL</em> modifier. But to avoid using the filter and FINAL in every query, we can define a simple view that does the job implicitly:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">VIEW</span> default.orders
(
    <span class="hljs-string">`id`</span> Int32,
    <span class="hljs-string">`status`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`price`</span> <span class="hljs-keyword">String</span>,
    <span class="hljs-string">`__deleted`</span> <span class="hljs-keyword">String</span>
) <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> default.stream_orders
<span class="hljs-keyword">FINAL</span>
<span class="hljs-keyword">WHERE</span> __deleted = <span class="hljs-string">'false'</span>
</code></pre>
<p>Note: It’s inefficient to use Final for every query, especially in production. You can use aggregations to see the last records or wait for ClickHouse to merge records in the background.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, we saw how we could synchronize the ClickHouse database with MySQL via CDC and prevent duplication using a soft-delete approach.</p>
<hr />
<p>Originally published at <a target="_blank" href="https://medium.com/@hoptical/apply-cdc-from-mysql-to-clickhouse-d660873311c7">Medium</a>.</p>
]]></content:encoded></item><item><title><![CDATA[A Quick Start to Load Test with k6]]></title><description><![CDATA[Note: To access the files of this post please visit https://github.com/exaco/k6-loadtest

Introduction
After developing an overwhelming, ready-to-sell product, every organization will face a question titled: Is it ready for production yet or not? Bes...]]></description><link>https://hamedkarbasi.com/a-quick-start-to-load-test-with-k6-7137a0b52ca1</link><guid isPermaLink="true">https://hamedkarbasi.com/a-quick-start-to-load-test-with-k6-7137a0b52ca1</guid><category><![CDATA[Load Testing]]></category><category><![CDATA[k6]]></category><category><![CDATA[Quality Assurance]]></category><category><![CDATA[stress test]]></category><dc:creator><![CDATA[Hamed Karbasi]]></dc:creator><pubDate>Sat, 09 Oct 2021 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383506121/d3d1cf18-79f4-4eaa-9fe4-b04ef94e359d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Note: To access the files of this post please visit <a target="_blank" href="https://github.com/exaco/k6-loadtest">https://github.com/exaco/k6-loadtest</a></p>
</blockquote>
<h2 id="heading-introduction">Introduction</h2>
<p>After developing a compelling, ready-to-sell product, every organization faces the question: <em>Is it ready for production yet?</em> Besides business-market fit and UI/UX acceptance, on the technical side we worry about the product under load. It works and is fully functional with one user. What about 10, 100, or 1,000 users working simultaneously? Would it crash?</p>
<p>The above questions are typically answered by the following QA/QC and performance-testing concepts:</p>
<p><strong>Smoke Test</strong>: Targets the system’s functionality under the slightest load: Is it working with only one user?</p>
<p><strong>Load Test</strong>: Targets the system under normal usage by the users. You should ask: “How many users typically use the system simultaneously, and how long do their sessions last?”</p>
<p><strong>Stress Test</strong>: What is the maximum capacity of the system? How many users, with what kind of behavior, does it take to degrade the system’s quality with respect to its SLOs: availability, request latency, throughput, etc.?</p>
<p><strong>Soak Test</strong>: Would the system sustain normal conditions for a long time (generally hours to days)? It works for 15 minutes under typical load-test conditions, but is it feasible for much longer?</p>
<p>Answering these questions gives you a better perspective on your product’s capability under different conditions, and helps the company’s business team schedule marketing and release plans according to the system’s capacity.</p>
<p>Now that we know the importance of load tests, the next question is: “What tools can we use?”</p>
<h2 id="heading-load-test-tools">Load Test Tools</h2>
<p>There are various load-testing tools; some popular ones are:</p>
<ul>
<li><p>k6.io: Created by Load Impact; written in Go; scriptable in JavaScript</p>
</li>
<li><p>Locust: Created by Jonathan Heyman; written in Python; scriptable in Python</p>
</li>
<li><p>JMeter: Created by the Apache Software Foundation; written in Java; scriptable (limited) in XML.</p>
</li>
</ul>
<p>We’ll investigate k6 here. It’s an open-source load-testing tool and SaaS for engineering teams. Some of its advantages are:</p>
<ul>
<li><p>Easy to write test scenarios, especially with a <a target="_blank" href="https://chrome.google.com/webstore/detail/k6-browser-recorder/phjdhndljphphehjpgbmpocddnnmdbda?hl=en">screen recorder</a></p>
</li>
<li><p>Capable of handling 30,000–40,000 concurrent virtual users, generating up to 300,000 requests per second (RPS), with only one instance.</p>
</li>
<li><p>Recently acquired by Grafana Labs, demonstrating its reputation.</p>
</li>
<li><p>Easily exportable to InfluxDB and visualizable in Grafana (beloved by SRE teams)</p>
</li>
</ul>
<h3 id="heading-what-tool-should-i-use">What tool should I use?</h3>
<p>If you want to test your APIs and backend services, k6 is a reasonable choice: it’s easy to use and capable. But remember, it can only test backend services, not frontend components! If you want to test the client side too, Selenium is the better choice. Keep in mind that Selenium can only perform functional (smoke) tests, not load or stress tests!</p>
<h2 id="heading-getting-started-with-k6">Getting started with k6</h2>
<p>To use k6 and acquire the results, follow these steps:</p>
<h4 id="heading-1-install-the-requirements">1. Install the requirements</h4>
<ul>
<li><p><a target="_blank" href="https://docs.influxdata.com/influxdb/v1.8/">Influxdb</a> v1.8: Timeseries database to store test results</p>
</li>
<li><p><a target="_blank" href="https://grafana.com/docs/grafana/latest/installation/">Grafana</a>: Visualizing the results</p>
</li>
</ul>
<p>To install the above tools, you can use the below docker-compose file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">influxdb:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">influxdb:1.8</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-comment"># Mount for influxdb data directory and configuration</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./influxdb_data:/var/lib/influxdb</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8086:8086"</span>

  <span class="hljs-attr">grafana:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">grafana/grafana:8.1.4</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"3000:3000"</span>
</code></pre>
<ul>
<li><p>AUT: the application under test. As an example, we use a Flask app in conjunction with Vue.js, provided by https://github.com/testdrivenio/flask-vue-crud. The installation guide is in the repository <a target="_blank" href="https://github.com/testdrivenio/flask-vue-crud/blob/master/README.md">readme</a>.</p>
</li>
<li><p><a target="_blank" href="https://k6.io/docs/getting-started/installation/">k6</a>: Run the below commands:</p>
</li>
</ul>
<pre><code class="lang-bash">$ sudo apt-get update &amp;&amp; sudo apt-get install ca-certificates gnupg2 -y
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
$ <span class="hljs-built_in">echo</span> <span class="hljs-string">"deb https://dl.k6.io/deb stable main"</span> | sudo tee /etc/apt/sources.list.d/k6.list
$ sudo apt-get update
$ sudo apt-get install k6
</code></pre>
<h4 id="heading-2-write-the-test-scenario">2. Write the test scenario</h4>
<p>k6 uses a JavaScript file as the test scenario. Don’t freak out if you’re not familiar with JS: k6 provides a friendly Chrome extension. If you’re testing a web application, install the extension <a target="_blank" href="https://chrome.google.com/webstore/detail/k6-browser-recorder/phjdhndljphphehjpgbmpocddnnmdbda?hl=en">here</a>.</p>
<p>The GIF below shows how to use the extension:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383493834/27be1d59-26c4-4cb1-87c6-d0c311ef2d04.gif" alt="Recording screen for writing test scenario" class="image--center mx-auto" /></p>
<p>When you finish recording, the extension redirects you to the page below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383495661/18bfb7f7-dcb3-4a68-80b8-6406b170c28a.png" alt="k6 test builder page" /></p>
<p>You may choose the k6 native test builder, which provides a graphical request editor and can launch tests in the cloud. But if you want to run k6 yourself, choose the script editor and copy the script into a file called <code>test.js</code>.</p>
<p>Remember that the recorded file is not yet a proper test: review all of the requests to make sure the scenario is realistic and generalizes. Furthermore, the k6 screen recorder only captures the requests sent by the browser and cannot reproduce your application’s frontend logic. For example, you may use XSRF tokens, or need to pass a value from one response into the next request. In our AUT, which uses book IDs from the GET request, we have to edit the PUT and DELETE requests to supply those IDs. So, always check and edit the test files according to your requirements.</p>
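<p>For instance, the ID-passing step can be handled with a small helper that extracts the book IDs from the GET response and builds the PUT/DELETE URLs from them. This is a plain JavaScript sketch; in a k6 script you would feed it <code>http.get(...).body</code>, and the <code>{ books: [{ id: ... }] }</code> response shape is an assumption based on the flask-vue-crud example app:</p>

```javascript
// Sketch of chaining recorded requests: extract book IDs from the GET
// response and build the PUT/DELETE URLs from them. The response shape
// ({ books: [{ id: ... }] }) is an assumption based on flask-vue-crud.
function bookUrls(baseUrl, getResponseBody) {
  const body = JSON.parse(getResponseBody);
  return (body.books || []).map((book) => `${baseUrl}/books/${book.id}`);
}

// In a k6 script you would pass http.get(...).body here and then call
// http.put(url, payload) or http.del(url) for each returned URL.
const sample = JSON.stringify({ books: [{ id: 'abc123' }, { id: 'def456' }] });
console.log(bookUrls('http://localhost:5001', sample));
```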
<h4 id="heading-3-run-the-test">3. Run the test</h4>
<p>After editing the Test file, you can run it via:</p>
<pre><code class="lang-bash">$ k6 run --vus=2 --duration=3m test.js
</code></pre>
<p><em>vus</em> is the number of simultaneous virtual users, and <em>duration</em> is the test duration.</p>
<p>Otherwise, you can define it in the test file with the <code>options</code> variable:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">let</span> options = {
    <span class="hljs-attr">stages</span>: [
        { <span class="hljs-attr">duration</span>: <span class="hljs-string">'2m'</span>, <span class="hljs-attr">target</span>: <span class="hljs-number">30</span> }, <span class="hljs-comment">// simulate ramp-up of traffic from 1 to 30 users over 2 minutes</span>
        { <span class="hljs-attr">duration</span>: <span class="hljs-string">'2m'</span>, <span class="hljs-attr">target</span>: <span class="hljs-number">30</span> }, <span class="hljs-comment">// stay at 30 users for 2 minutes</span>
        { <span class="hljs-attr">duration</span>: <span class="hljs-string">'2m'</span>, <span class="hljs-attr">target</span>: <span class="hljs-number">0</span> }, <span class="hljs-comment">// ramp-down to 0 users</span>
    ],
};
</code></pre>
<h4 id="heading-4-test-results">4. Test Results</h4>
<p>k6 will give you a summary of the results. This summary includes check statuses, the number of iterations, and the <code>http_req_duration</code> stats, which are the most important: they measure the full request round trip, and their <code>http_req_waiting</code> component is the time to first byte (TTFB).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383496977/513e13c1-3c2e-446c-a1c0-04688a9f5706.png" alt="k6 test result summary" /></p>
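<p>Beyond eyeballing the summary, k6 can enforce an SLO automatically through the <code>thresholds</code> option, failing the run when a budget is violated. A minimal sketch; the metric names and <code>'p(95)&lt;500'</code> syntax are standard k6 threshold expressions, while the 500 ms and 99% budgets are assumptions for illustration:</p>

```javascript
// Sketch: encode the SLO in the script so k6 fails the run when it is
// violated. The budgets below (500 ms p95, 99% check pass rate) are
// assumed example values, not recommendations.
const options = {
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests must finish under 500 ms
    checks: ['rate>0.99'],            // over 99% of checks must pass
  },
};

console.log(Object.keys(options.thresholds)); // prints [ 'http_req_duration', 'checks' ]
```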
<p>k6 lets you store more detailed results in InfluxDB and visualize your preferred insights in Grafana. To do that, add the <code>--out</code> option with the InfluxDB address, like below:</p>
<pre><code class="lang-bash">$ k6 run --out influxdb=http://localhost:8086/myk6db test.js
</code></pre>
<p>Now you can jump into Grafana and see the results. But before that, you need a dashboard. You can import the Grafana dashboard provided <a target="_blank" href="https://grafana.com/grafana/dashboards/15080">here</a>.</p>
<h2 id="heading-run-a-smoke-test">Run a smoke test</h2>
<p>In this smoke test, only one user uses the system for one minute. To launch it, add the <code>test_mode=smoke</code> environment variable:</p>
<pre><code class="lang-bash">$ k6 run -e test_mode=smoke --out influxdb=http://localhost:8086/myk6db test.js
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383498619/79e32801-7132-4eeb-a5df-70d5d02482f1.png" alt="Check status results for the smoke test." /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383500083/695fe695-baf0-4072-8712-e36bdda38153.png" alt="Endpoints status codes, success percentage, and duration latencies for the smoke test" /></p>
<p>It’s evident that all of the requests executed in under 10 ms and passed 100%.</p>
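<p>On the script side, the variable arrives as <code>__ENV.test_mode</code> (set with <code>k6 run -e ...</code>) and can be used to pick a load profile. The sketch below reads <code>process.env</code> instead so it also runs under plain Node; the profiles are assumed example values:</p>

```javascript
// Sketch of selecting a load profile from an environment variable. In a k6
// script you would read __ENV.test_mode; process.env is used here so the
// sketch runs under plain Node too. Profile numbers are assumptions.
const profiles = {
  smoke:  [{ duration: '1m', target: 1 }],
  stress: [{ duration: '2m', target: 200 }, { duration: '2m', target: 0 }],
};

function pickStages(mode) {
  return profiles[mode] || profiles.smoke; // default to the cheapest test
}

const mode = process.env.test_mode || 'smoke';
console.log(pickStages(mode));
```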
<h2 id="heading-run-a-stress-test">Run a stress test</h2>
<p>Now we run a stress test to see whether the system can handle 200 simultaneous users.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383501511/8806cea0-08dd-449b-bbf4-017930902cda.png" alt="Check status results for the stress test." /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755383504203/49c07dfe-130d-4d17-a2ad-55700c27c0e9.png" alt="Endpoints status codes, success percentage, and duration latencies for the stress test" /></p>
<p>The system is facing severe issues: latencies have increased from milliseconds to 9.3 s, and some endpoints were unable to finish their requests. Under some SLOs, this system would be considered a failure. Of course, it’s no surprise! We ran the Flask app in development mode; a production setup with several workers would raise its capacity considerably.</p>
<hr />
<p>Originally published at <a target="_blank" href="https://medium.com/exa-technical-blog/a-quick-start-to-load-test-with-k6-7137a0b52ca1">Medium</a>.</p>
]]></content:encoded></item></channel></rss>