```
ua: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
```
Internal Representation After Processing:
```
[1531222025.000000000, {"host"=>"127.0.0.1", "ident"=>"-", "user"=>"-", "req"=>"GET / HTTP/1.1", "status"=>200, "size"=>16218, "referer"=>"http://127.0.0.1/", "ua"=>"Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0"}]
```
What Comes First: Filtering or Parsing?

In Fluent Bit, parsing typically occurs before filtering. This is because filters often rely on the structured data produced by the parser to make decisions about what to include, modify, or exclude from the log stream.
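As a rough sketch of that ordering in Fluent Bit's YAML configuration (the log path and the use of the stock apache2 parser are assumptions for this example), the parser is attached to the input, and filters then operate on the parsed fields:

```yaml
# Minimal sketch: parse at ingestion, then filter the structured records.
pipeline:
  inputs:
    - name: tail
      path: /var/log/apache2/access.log   # assumed log location
      parser: apache2                      # built-in parser from the stock parsers file
  filters:
    - name: grep
      match: '*'
      regex:
        status: ^200$                      # keep only successful requests
  outputs:
    - name: stdout
      match: '*'
```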
After parsing, filters can be applied to refine the logs further.

For example, you can use a grep filter to keep only logs with specific log levels and a record_modifier filter to add an "env" field:
```yaml
filters:
  - name: grep
    match: '*'
    regex:
      level: ^(ERROR|WARN|INFO)$

  - name: record_modifier
    match: '*'
    records:
      - env: production
```
This refines the logs to include only the specified levels and adds a contextual "env" field, making the logs more useful for downstream analysis.

With Fluent Bit's parsing capabilities, you can transform logs into actionable insights to drive your technical and business decisions. By leveraging its built-in and customizable parsers, you can standardize diverse log formats, reduce data volume, and optimize your observability pipeline. Implementing these strategies will help you overcome common logging challenges and enable more effective monitoring and analysis.
Member post originally published on the yld blog by Afonso Ramos
\\n\\n\\n\\nRemember when searching for information online involved typing in a few keywords and sifting through pages of results? Thankfully, those days are long gone.
\\n\\n\\n\\nToday’s search engines have transformed the way we find information online. From simple keyword matching to advanced technologies like semantic search and natural language processing, search engines have come a long way. But, have you ever stopped to consider the UX design choices that make these search experiences possible?
Effective UX and search experience design involves a deep understanding of how humans interact with machines and the ability to adapt to that relationship. It's not an innate skill, but rather something that can be developed over time, much like googling is a skill that many of us have honed through years of practice. Similarly, interacting with Generative AI models requires a certain level of skill and knowledge, which is where Prompt Engineering comes in.
\\n\\n\\n\\nThis article aims to provide a comprehensive guide to Prompt Engineering, helping you understand its importance, techniques, and best practices. You’ll be equipped to create effective prompts that maximise the potential of AI models and enhance user experiences.
\\n\\n\\n\\nWell-crafted prompts are essential for maximising the potential of AI models, particularly LLMs. They enhance AI performance by providing clear instructions and context, leading to more accurate and relevant responses.
\\n\\n\\n\\nPrompt Engineering improves user experience by making interactions more intuitive and reducing ambiguity, minimising the risk of misinterpretation. It enables AI models to handle complex tasks, adapt to different use cases, and ensure consistency in outputs, which is vital for integrated systems.
Prompt Engineering is the discipline of developing and optimising prompts to use language models efficiently across a wide variety of applications and research topics. It is particularly useful for developers, researchers, and anyone looking to leverage AI models, and it encompasses the skills and techniques essential for interacting with and developing LLMs. Mastering the discipline enables you to optimise these interactions, achieving more accurate and relevant outcomes.
\\n\\n\\n\\nA prompt refers to a statement or question that is employed to trigger a response from a language model or other AI system. Prompts are generally crafted to offer context or instructions to the AI model, directing it to produce a specific type of output or carry out a particular task. A prompt can be provided to the language model by the user or by the system itself, serving as a means to define its default behaviour.
\\n\\n\\n\\nCreating effective prompts requires a deep understanding of both the AI model (various models respond uniquely to different types of prompts) and the user’s intent. Just as search engines use algorithms to understand the user’s query and return relevant results, Prompt Engineering involves designing prompts to communicate the user’s intent to the AI model.
\\n\\n\\n\\nExperimenting with different formats, testing various instructions and contexts, and refining the prompt based on the AI model’s responses are key to creating prompts that produce the desired response from the AI model while minimising the risk of misinterpretation or ambiguity.
Slight prompt variations can lead to very different results. However, consistency is crucial in integrated systems: a good prompt aims to produce repeatable outputs with minimal variation, ensuring reliable and predictable AI performance.
When it comes to day-to-day interaction with an LLM, a varied yet similar result is acceptable. But when an LLM is used as part of an integrated system, consistency is an important factor: what makes a prompt "good" is that it produces a reasonably repeatable output with minimal variation.
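As a rough illustration (the wording and JSON schema below are invented for this example), pinning the output format is one common way to make responses repeatable enough for an integrated system:

```
System: You are a support-ticket classifier. Respond with JSON only, using
        exactly this schema: {"category": "...", "priority": "low|medium|high"}.
        Do not add explanations.

User:   The checkout page crashes every time I apply a discount code.

Model:  {"category": "payments", "priority": "high"}
```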
No, not all prompts are created equal: different types of prompts serve different purposes and can significantly impact the quality and relevance of the AI model's responses. Understanding the various types of prompts and their applications is key to effective Prompt Engineering.
In an attempt to define communication with LLMs more scientifically, many practitioners have created techniques and frameworks that systematically describe how to write these prompts.
\\n\\n\\n\\nHere are some common types of AI prompts that serve unique purposes:
\\n\\n\\n\\nEach of these Prompt Engineering techniques can be adapted and combined depending on the specific requirements of the task at hand and the capabilities of the AI model being used.
\\n\\n\\n\\nEffective Prompt Engineering involves a deep understanding of these techniques and the ability to apply them creatively. By leveraging the right prompt type, you can significantly enhance the AI model’s performance and the overall user experience. If you want to read more about the several types of prompts, check out Prompting Guide’s techniques page.
\\n\\n\\n\\nCrafting the perfect prompt is like giving directions to a slightly distracted, incredibly smart friend – you need to be clear, concise, and maybe even a little clever.
\\n\\n\\n\\nStart with a basic prompt and refine it iteratively based on the responses you receive, fine-tuning the AI’s outputs to your specific requirements. Incorporate relevant keywords and specific details to guide the AI more effectively towards the desired output.
\\n\\n\\n\\nMoreover, don’t use jargon or assume knowledge. Be aware of the model’s limitations to craft prompts within its capabilities, avoiding overly complex requests that lead to poor responses. Utilise feedback to continuously improve your prompts, as insights from users or the outputs themselves can guide adjustments for better results.
\\n\\n\\n\\nMind the length of your prompts to prevent confusion and higher token consumption, which can increase costs.
Although the model may be equipped to handle specific challenges like trick questions about prime numbers, this doesn't guarantee it can manage every query type. Even after conjuring the best prompt, the model can still confidently reply with false information. In some cases, it might only succeed because similar examples were included in its training data.
\\n\\n\\n\\nIf you find yourself facing a complex problem and need assistance in crafting effective prompts, don’t hesitate to ask AI for help. Leveraging the AI’s capabilities can provide valuable insights and suggestions to refine your prompts, ensuring you get the best possible outcomes. If you feel confident in your prompting capabilities, put yourself to the test against Gandalf.
\\n\\n\\n\\nFinally, different types of prompts complement each other, and combining them can lead to more effective and nuanced interactions with AI models. Whether you’re using One-Shot prompts for quick adaptation, Zero-Shot prompts for versatility, or hybrid prompts for complex tasks, understanding how to leverage these techniques together can significantly enhance the AI’s performance and the overall user experience.
\\n\\n\\n\\nIf you need help with GenAI, data engineering, MLOps, data analysis, or data science, contact us.
Ambassador post originally published on Dev.to by Syed Asad Raza
As cloud-native applications scale, securing workloads while maintaining performance becomes critical. This is where Cilium, an open-source networking, observability, and security tool, shines. Backed by the power of eBPF (Extended Berkeley Packet Filter), Cilium provides secure, high-performance communication between microservices in Kubernetes environments.

Cilium is a cloud-native networking solution that secures and monitors service-to-service communication. It leverages eBPF to operate within the Linux kernel, enabling dynamic programmability and reducing the performance overhead associated with traditional firewalls.

Cilium uses eBPF programs attached to various points in the Linux kernel, such as network interfaces and system calls. This allows it to inspect, modify, and route network packets in real time. Kubernetes network policies are automatically translated into eBPF code, ensuring secure communication.
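For instance, a plain Kubernetes NetworkPolicy like the sketch below (names, namespace, and port are illustrative) is enforced by Cilium through eBPF rather than traditional firewall rules:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # illustrative names
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```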
```
helm repo add cilium https://helm.cilium.io/
helm repo update
```
```
helm install cilium cilium/cilium --version <latest-version> \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=<your-k8s-api-server> \
  --set k8sServicePort=<your-k8s-api-port>
```
```
kubectl get pods -n kube-system
```

Ensure that all Cilium-related pods are running.
```
helm upgrade cilium cilium/cilium --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true
```
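With Hubble enabled, one quick way to reach the UI is a port-forward to the hubble-ui service created by the chart (service name and port assume the default Helm values):

```
kubectl port-forward -n kube-system svc/hubble-ui 12000:80
# then browse to http://localhost:12000
```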
Cilium is redefining cloud-native security and observability with eBPF. Its seamless integration with Kubernetes, superior performance, and deep visibility make it a go-to solution for modern cloud-native architectures. Whether securing a microservices-based application or building a scalable Kubernetes platform, Cilium offers the best of both worlds: powerful security and unmatched performance.

Ready to enhance your Kubernetes security? Explore the official Cilium documentation and start your journey toward a more secure and observable cloud-native environment.

Thanks for reading!
\\n\\nThis week’s Kubestronaut in Orbit is Sofonias Mengistu, a DevOps Engineer at Gebeya.INC based in Addis Ababa, Ethiopia. With 14 years of IT experience—five of those dedicated to cloud-native technologies—Sofonias has led numerous projects focused on implementing, managing, and securing Kubernetes environments. He got his start in networking, but his passion for DevOps inspired a career transition, allowing him to dive deeper into the cloud-native ecosystem. Sofonias is also an active contributor to the CNCF community, frequently participating in project meetups and webinars to share knowledge and collaborate with peers.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Sofonias, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nI began my journey with cloud and cloud-native technologies in 2018, using AWS as my first platform in a lab environment, primarily for study purposes. My early focus was on deploying static applications, and soon after, I delved into containerization with Docker and version control using GitHub. In 2021, after a friend recommended Kubernetes, I took on my first project, setting up an infrastructure and deploying a voting app on EKS using eksctl, which was a valuable hands-on learning experience. Since then, I have completed the KodeKloud Engineer program, up to the DevOps Architect level, where I’ve gained expertise through real-world DevOps tasks.
\\n\\n\\n\\nI’ve had the opportunity to work with several key CNCF projects, starting with Kubernetes. Some of the others I’ve utilized include ArgoCD, Flux, Prometheus, CRI-O, CoreDNS, Istio, OpenTelemetry, PostgreSQL, and Kyverno. This is just a snapshot of my experience; I’ve explored various CNCF projects throughout my career.
\\n\\n\\n\\nCNCF certifications have been instrumental in my career growth. They’ve validated my skills, opened doors to new opportunities, and helped me stand out in the competitive job market. In a short period, I’ve been able to secure senior positions and contribute significantly to projects thanks to my certifications.
\\n\\n\\n\\nFor beginners, Tech with Nana on YouTube is an excellent resource, and you should quickly move on to Mumshad Mannambeth’s 8-minute video (he also has CNCF endorsed content on Udemy). For the Kubernetes and Cloud Native Associate (KCNA) certification, KodeKloud is the best platform to start with. Back when I started, there wasn’t a KCNA certificate, so I began with CKA on KodeKloud, but now students should first gain basic Kubernetes knowledge with KCNA. KodeKloud or Udemy courses (check out the CNCF endorsed content), especially “Kubernetes from Scratch” by Mohamed, along with KodeKloud’s PDFs, are highly recommended.
\\n\\n\\n\\nFor intermediate learners, focus on the Certified Kubernetes Administrator (CKA) certification with KodeKloud and complete all tasks. After that, practice on Killer.sh, and study books related to Kubernetes. You should also check out Tech with Nana’s 3.5-hour course on YouTube for further insights. Ensure you achieve at least 85 on Killer.sh to be well-prepared for the exam. After CKA, move on to Certified Kubernetes Application Developer (CKAD), using KodeKloud and Killer.sh for practice. Additionally, explore content on GitHub and work on DevOps projects from YouTube to gain practical experience. Feel free to reach out to me on LinkedIn for project recommendations.
\\n\\n\\n\\nFor advanced learners, Certified Kubernetes Security Specialist (CKS) is the next step, and you can use free resources like Killer.sh on YouTube, KodeKloud, and GVR videos by Venkata Ramana Gali. These are excellent materials for advanced CKS preparation. The CKS requires more in-depth knowledge, which is why it’s considered advanced.
Finally, aim for the Kubernetes and Cloud Native Security Associate (KCSA) certification, which can be completed in two weeks using the same materials you used for CKS.
\\n\\n\\n\\nI mainly study various tools in my free time, focusing on deploying lab-based applications on Azure, AWS, or GCP platforms. In addition to tech, I enjoy staying active through sports—biking, boxing, and swimming are my favorite activities that I regularly engage in.
\\n\\n\\n\\nHonestly, I don’t have any secret tricks. I mainly use LinkedIn for networking and professional development. I often connect with people who have passed the certifications I’m interested in, and I ask them various questions beyond the exam content. My goal is not just to gather information, but also to explore potential job opportunities or collaborations. Remember, there’s no shortcut to success in Kubernetes. It requires dedication, hard work, and using reliable resources.
Yes! I'm interested in the Prometheus Certified Associate (PCA), as well as the Linux Foundation Certified System Administrator (LFCS) and Certified GitOps Associate (CGOA) certs, and last but not least, a FinOps certification.
Member post originally published on the Embrace blog by Francisco Prieto Cardelle
\\n\\n\\n\\nAs an Android developer, my first instinct for solving a bug, measuring performance, or improving the overall experience of an app is to test it and profile it locally. Tools like the Android Studio Profiler provide powerful capabilities to detect and address all kinds of performance issues, such as UI thread blocking, memory leaks, or excessive CPU usage.
\\n\\n\\n\\nWhile these local tools are indispensable, they do have limitations. Certain problems don’t show up in controlled environments, with consistent network connectivity, predictable user behavior, and a limited range of testing devices. In the real world, users interact with apps in unexpected ways, with diverse hardware, and varying conditions, exposing issues that are difficult to replicate locally.
\\n\\n\\n\\nThis is where OpenTelemetry comes in.
\\n\\n\\n\\nOpenTelemetry is a framework for collecting, processing, and exporting data about application performance. While relatively new to mobile, it’s become a fast-growing standard for backend performance management.
\\n\\n\\n\\nThe benefits to using this framework for mobile are significant. OpenTelemetry enables developers to collect observability data from production environments, providing a window into the app’s real-world behavior.
\\n\\n\\n\\nLocal profiling is invaluable for identifying issues that are reproducible in a controlled environment.
\\n\\n\\n\\nThere are many common issues that can be detected and solved locally:
\\n\\n\\n\\nThese issues are quite straightforward to reproduce during testing, and local profiling tools are great for detecting and fixing them.
\\n\\n\\n\\nWhile local profiling covers a wide array of issues, not all problems are evident in a local setup. Observability in production is essential for diagnosing:
\\n\\n\\n\\nProduction-ready observability tools like OpenTelemetry are essential for uncovering and resolving these challenges.
\\n\\n\\n\\nOpenTelemetry is a powerful observability framework that helps developers collect, process, and export telemetry data like traces, metrics, and logs.
\\n\\n\\n\\nThere are many advantages to using OpenTelemetry for performance monitoring vs. proprietary tools. SDKs built on OpenTelemetry are very flexible, allowing engineers to easily extend their instrumentation to 3rd-party libraries. As an open source, widely-adopted framework, OpenTelemetry also lets organizations avoid vendor lock-in and have more control over their own data.
\\n\\n\\n\\nBy integrating OpenTelemetry into your Android app, you can track the performance of individual operations, identify bottlenecks, and gain insights into how your app performs under various real-world conditions. Let’s walk through how to do this.
To add the OpenTelemetry SDK to your app, you can include the OTel bill of materials along with some necessary dependencies, like this:
```toml
# libs.versions.toml
[versions]
opentelemetry-bom = "1.44.1"
opentelemetry-semconv = "1.28.0-alpha"

[libraries]
opentelemetry-bom = { group = "io.opentelemetry", name = "opentelemetry-bom", version.ref = "opentelemetry-bom" }
opentelemetry-api = { group = "io.opentelemetry", name = "opentelemetry-api" }
opentelemetry-context = { group = "io.opentelemetry", name = "opentelemetry-context" }
opentelemetry-exporter-otlp = { group = "io.opentelemetry", name = "opentelemetry-exporter-otlp" }
opentelemetry-exporter-logging = { group = "io.opentelemetry", name = "opentelemetry-exporter-logging" }
opentelemetry-extension-kotlin = { group = "io.opentelemetry", name = "opentelemetry-extension-kotlin" }
opentelemetry-sdk = { group = "io.opentelemetry", name = "opentelemetry-sdk" }
opentelemetry-semconv = { group = "io.opentelemetry.semconv", name = "opentelemetry-semconv", version.ref = "opentelemetry-semconv" }
opentelemetry-semconv-incubating = { group = "io.opentelemetry.semconv", name = "opentelemetry-semconv-incubating", version.ref = "opentelemetry-semconv" }
```

```kotlin
// build.gradle.kts
implementation(platform(libs.opentelemetry.bom))
implementation(libs.opentelemetry.api)
implementation(libs.opentelemetry.context)
implementation(libs.opentelemetry.exporter.otlp)
implementation(libs.opentelemetry.exporter.logging)
implementation(libs.opentelemetry.extension.kotlin)
implementation(libs.opentelemetry.sdk)
implementation(libs.opentelemetry.semconv)
implementation(libs.opentelemetry.semconv.incubating)
```
Then, we can create an OpenTelemetry instance that acts as a central configuration point, managing the tracer provider, resources, and exporters.

A tracer provider creates and manages tracers, which in turn generate spans. A resource contains metadata about the app and is attached to every span, helping to contextualize telemetry data. An exporter defines where the telemetry data will be sent, such as a backend observability platform or a local file for inspection.
```kotlin
// Resources that will be attached to telemetry to provide better context.
// This is a good place to add information about the app, device, and OS.
val resource = Resource.getDefault().toBuilder()
    .put(ServiceAttributes.SERVICE_NAME, "[app name]")
    .put(DeviceIncubatingAttributes.DEVICE_MODEL_NAME, Build.DEVICE)
    .put(OsIncubatingAttributes.OS_VERSION, Build.VERSION.RELEASE)
    .build()

// The tracer provider will create spans and export them to the configured span processors.
// For now, we will use a simple span processor that logs the spans to the console.
val sdkTracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(SimpleSpanProcessor.create(LoggingSpanExporter.create()))
    .setResource(resource)
    .build()

// The OpenTelemetry SDK is the entry point to the OpenTelemetry API. It is used to create
// spans, metrics, and other telemetry data. Create it and register it as the global instance.
val openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(sdkTracerProvider)
    .buildAndRegisterGlobal()
```
Once everything is initialized, we can get a tracer and create spans using `openTelemetry.sdkTracerProvider.get()`.

A trace represents a single operation or workflow within a distributed system. For Android apps, it could capture the entire journey of a user request or an action through the app. Within this journey, a span represents an individual unit of work, such as a network request, database query, or UI rendering task, providing detailed information about its duration and context. Here's how it looks in code:
```kotlin
val tracer = openTelemetry.sdkTracerProvider.get("testAppTracer")
val span = tracer.spanBuilder("someUserAction").startSpan()

try {
    someAction()
} catch (e: Exception) {
    span.recordException(e)
    span.setStatus(StatusCode.ERROR)
} finally {
    span.end()
}
```
\\n\\n\\n\\nNow that we understand how to set up an OpenTelemetry instance in our Android app, let’s look at some common types of issues and how this framework actually helps us track them.
\\n\\n\\n\\nNetwork performance is one of the most unpredictable factors in a production environment. While local testing occurs under stable, high-speed conditions, real-world users face diverse scenarios. They might encounter intermittent mobile connections, unreliable public Wi-Fi, or backend delays during periods of heavy traffic. These challenges can lead to long request times, failed operations, or even app abandonment.
\\n\\n\\n\\nWith OpenTelemetry, you can instrument network requests to measure their durations and identify bottlenecks. By tagging spans with metadata like endpoint URLs, request sizes, or response statuses, you can analyze trends such as:
\\n\\n\\n\\nLet’s take a look at an example.
\\n\\n\\n\\nSuppose we have an endpoint where users upload images to a server. Network performance might vary based on the image size, user location, or connectivity type. By instrumenting the network request using OpenTelemetry, we can capture relevant metadata and analyze trends, such as whether larger images or specific regions are associated with longer upload times. Here’s how we can instrument this scenario:
```kotlin
fun uploadImage(image: ByteArray, networkType: String, region: String) {
    val span = tracer.spanBuilder("imageUpload")
        .setAttribute(HttpIncubatingAttributes.HTTP_REQUEST_SIZE, image.size.toLong())
        .setAttribute(NetworkIncubatingAttributes.NETWORK_CONNECTION_TYPE, networkType)
        .setAttribute("region", region)
        .startSpan()
    try {
        doNetworkRequest()
    } catch (e: Exception) {
        span.recordException(e)
        span.setStatus(StatusCode.ERROR)
    } finally {
        span.end()
    }
}
```
\\n\\n\\n\\nAndroid’s ecosystem is vast, with apps running on a wide variety of devices, OS versions, and hardware configurations. This diversity makes it challenging to ensure a consistent user experience across all devices. Certain crashes or bugs may surface only on specific devices or under particular conditions, making them hard to find in a controlled testing environment.
With OpenTelemetry, you can capture device-specific metadata in a centralized way by adding it to the resource configuration during the OpenTelemetry setup. Important contextual information is then automatically attached to spans, logs, and metrics, keeping your telemetry data consistent.
\\n\\n\\n\\nBy analyzing this metadata, you can uncover trends like:
\\n\\n\\n\\nLet’s see how to set this up:
```kotlin
// Add some useful attributes to the Resource object.
val resource = Resource.getDefault().toBuilder()
    .put("device.model", Build.MODEL)
    .put("device.manufacturer", Build.MANUFACTURER)
    .put("os.version", Build.VERSION.SDK_INT.toString())
    .put("screen.resolution", getResolution())
    .build()

// Use the resource object to build the tracer, logs and other telemetry providers.
val sdkTracerProvider = SdkTracerProvider.builder()
    .setResource(resource)
    .build()
```
\\n\\n\\n\\nReal users often interact with apps in unexpected ways. This unpredictability can lead to performance issues, crashes, or unoptimized user experiences that aren’t caught in local testing.
\\n\\n\\n\\nFor example, users might upload files much larger than anticipated, causing memory or performance bottlenecks. Others might repeatedly perform actions in rapid succession, like submitting forms or refreshing pages, leading to race conditions or server overload. Some users might navigate through the app in untested sequences, triggering unexpected states or errors.
\\n\\n\\n\\nBy leveraging OpenTelemetry to instrument user interactions, you can capture and analyze spans that detail how users actually use your app. This data provides invaluable insights into unexpected patterns, allowing you to:
\\n\\n\\n\\nLet’s consider a scenario where users frequently navigate back and forth between two screens (e.g., a product listing and a product details page) in rapid succession. While this behavior may seem harmless, it could inadvertently cause resource leaks or worsen the rendering performance.
By tagging spans with navigation metadata like the screen name, a timestamp, and related user interactions, you can analyze patterns in navigation behavior, as sketched below.
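As a minimal sketch of what such instrumentation could look like (the helper function, span name, and attribute keys are illustrative rather than an official convention), each navigation event can be wrapped in a short span:

```kotlin
// Hypothetical helper: record a span for each screen navigation event.
// Tracer comes from io.opentelemetry.api.trace; attribute keys are illustrative.
fun trackNavigation(tracer: Tracer, fromScreen: String, toScreen: String) {
    val span = tracer.spanBuilder("screenNavigation")
        .setAttribute("screen.from", fromScreen)
        .setAttribute("screen.to", toScreen)
        .setAttribute("navigation.timestamp_ms", System.currentTimeMillis())
        .startSpan()
    try {
        // Perform the actual navigation here (NavController, FragmentManager, etc.).
    } finally {
        span.end()
    }
}
```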
\\n\\n\\n\\nThis ability to uncover and address unexpected user behavior ensures your app remains reliable and performant, even under unconventional usage scenarios.
\\n\\n\\n\\nAs we’ve discussed, instrumenting your Android app with OpenTelemetry is incredibly helpful for monitoring and understanding common performance issues.
\\n\\n\\n\\nOnce you’ve started collecting data, however, you’ll need to set up a place for it to go. One of the great things about OpenTelemetry as a framework is that there are many, many observability tools that support ingesting this type of data. You may choose to forward it to a vendor-specific backend or to any number of open source tools, like Jaeger for spans or Loki for logs.
Forwarding OpenTelemetry data from an SDK requires adding one or more exporters, which give your data a destination once it is actually generated.
\\n\\n\\n\\nThe exporter is a component that will connect the SDK you are using, which will capture the data, with an external OpenTelemetry collector that will receive the data. Exporters are designed with the OpenTelemetry data model in mind, emitting OpenTelemetry data without any loss of information. Many language-specific exporters are available.
\\n\\n\\n\\nThe OpenTelemetry collector is a vendor-agnostic way to receive, process and export telemetry data. It is not always necessary to use, as you can send data directly to your backend of choice via the exporter. However, having a collector is good practice if you’re managing multiple sources of data ingest and sending to multiple observability backends. It allows your service to offload data quickly, and the collector can take care of additional handling like retries, batching, encryption, or even sensitive data filtering.
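As a sketch of that wiring with the SDK used above (the endpoint is a placeholder for your own collector, and a batch processor replaces the logging processor from the earlier snippet), the exporter simply plugs into the tracer provider:

```kotlin
// Send spans to an OpenTelemetry Collector over OTLP/gRPC instead of logging them.
val otlpExporter = OtlpGrpcSpanExporter.builder()
    .setEndpoint("http://10.0.2.2:4317") // placeholder: your collector's OTLP endpoint
    .build()

val sdkTracerProvider = SdkTracerProvider.builder()
    // BatchSpanProcessor queues spans and exports them off the main thread.
    .addSpanProcessor(BatchSpanProcessor.builder(otlpExporter).build())
    .setResource(resource) // reuse the resource defined earlier
    .build()
```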
\\n\\n\\n\\nEmbrace has a tutorial walk-through on how to set up an OpenTelemetry exporter for Android, if using our SDK. You can also use the basic OpenTelemetry Android SDK, which is built on top of the Java SDK, and which has its own exporter resources.
\\n\\n\\n\\nIn a world where apps run on countless devices, in varied environments, and are used by diverse users, achieving optimal performance and reliability requires more than just local testing. While tools like Android Studio Profiler excel at addressing issues reproducible in controlled environments, production observability fills the gap for uncovering real-world problems that only surface under specific conditions.
\\n\\n\\n\\nOpenTelemetry provides a robust framework for collecting and analyzing telemetry data, giving developers the insights they need to understand and optimize their apps in production. By instrumenting spans and attaching meaningful metadata, you can pinpoint bottlenecks, diagnose device- or OS-specific issues, and uncover unexpected user behaviors that impact the app’s performance or user experience.
\\n\\n\\n\\nInterested in getting started for yourself? Check out the OpenTelemetry SDK for Android. Or, for more advanced monitoring, you can use Embrace’s open source Android SDK. It’s built on OpenTelemetry and uses the same data conventions, but has added functionality for tracking complex Android issues, like ANRs, in a way that is OpenTelemetry-compliant.
Cross-posted from the OpenTelemetry blog by Adam Korczynski
\\n\\n\\n\\nOpenTelemetry is happy to announce the completion of the Collector’s fuzzing audit sponsored by the CNCF and carried out by Ada Logics. The audit marks a significant step in the OpenTelemetry project, ensuring the security and reliability of the Collector for its users.
Fuzzing is a testing technique that executes an API with a large number of pseudo-random inputs and observes the API's behavior. The technique has grown in popularity due to its empirical success in finding security vulnerabilities and reliability issues. Fuzzing was initially developed with a focus on testing software implemented in memory-unsafe languages, where it has been most productive; in recent years, however, it has expanded to memory-safe languages as well.
\\n\\n\\n\\nOver several years, the CNCF has invested in fuzzing for its ecosystem. This testing has found numerous security vulnerabilities in widely used projects such as Helm (CVE-2022-36055, CVE-2022-23524, CVE-2022-23526, CVE-2022-23525), the Notary project (CVE-2023-25656), containerd (CVE-2023-25153), Crossplane (CVE-2023-28494, CVE-2023-27483) and Flux (CVE-2022-36049).
\\n\\n\\n\\nTo initiate the audit, Ada Logics auditors integrated the OpenTelemetry Collector into OSS-Fuzz. OSS-Fuzz is a service offered by Google to critical open source projects, free of charge. The service runs a project’s fuzzers with excess resources multiple times per week. If OSS-Fuzz finds a crash, it notifies the project. It then checks if the project has fixed the crash upstream and if so, marks the issue(s) as fixed. The whole workflow happens continuously on Google’s fuzzing infrastructure, supported by thousands of CPU cores. These testing resources outperform what developers or malicious threat actors can muster.
After the Ada Logics team integrated OpenTelemetry into OSS-Fuzz, the next step was to write a series of fuzz tests for the OpenTelemetry Collector. The auditors wrote 49 fuzz tests for core components of the Collector, as well as several receivers and processors in the opentelemetry-collector-contrib repository.

The fuzz tests were left to run while the audit team observed their health in production. At the completion of the fuzzing audit, all 49 fuzz tests on the OSS-Fuzz platform were healthy.
To ensure continued reliability, fuzz testing continues on the Collector now that the audit is complete, so ongoing changes to the project are tested as well. As of the date of this post, no crashes have been detected.
\\n\\n\\n\\nBut the work is not done! The Ada Logics team created the Collector’s fuzzing setup as a reference implementation that other OpenTelemetry subprojects can rely on to create their own fuzz testing, ensuring greater stability for the project as a whole.
\\n\\n\\n\\nFor more insight into the audit process, see the published summary.
Following a year of significant milestones in 2023, 2024 was a pivotal year for Cilium: organizations now leverage the project to manage their entire Kubernetes networking stack. We are pleased to share the 2024 Cilium Annual Report, which highlights the many successes the project experienced over the last 12 months.
\\n\\n\\n\\nWhat once began as a solution for pod-to-pod connectivity has expanded into a project that unifies the critical domains of networking, observability, and security under a single eBPF-powered umbrella.
\\n\\n\\n\\nAs we move into 2025, the Cilium community has never been stronger. The project continues to add contributors and end users across a variety of industries. We invite you to take a look at the 2024 Annual Report to share in the achievements over the year. For any questions or feedback, don’t hesitate to reach out to contribute@cilium.io.
Ambassador post originally published on Gerald on IT by Gerald Venzl
\\n\\n\\n\\nIf you are like me, you probably have a bunch of (older) Raspberry Pi models lying around not doing much because you replaced them with newer models. So, instead of just having them collect dust, why not create your own little Kubernetes cluster and deploy something on them, or just use it to learn Kubernetes?
\\n\\n\\n\\nNote: This hardware setup is what I have available. At no point is this the recommendation for building your own cluster. If you have newer, more powerful Raspberry Pi models, you are probably better off using them instead.
\\n\\n\\n\\nAll Raspberry Pis are running Raspberry Pi OS (with desktop)
\\n\\n\\n\\nK3s is a lightweight, CNCF-certified, and fully compliant Kubernetes distribution. It ships as a single binary, requires half the memory, supports other data stores, and more. As their website says, it’s:
Great for:

- Edge
- Homelab
- Internet of Things (IoT)
- Continuous Integration (CI)
- Development
- Single board computers (ARM)
- Air-gapped environments
- Embedded K8s
- Situations where a PhD in K8s clusterology is infeasible
Another advantage of K3s for Raspberry Pis is that it allows for data stores other than `etcd`. That's great because, as their website says, `etcd` is write-intensive and SD cards can usually not handle the IO load:

K3s performance depends on the performance of the database. To ensure optimal speed, we recommend using an SSD when possible.

If deploying K3s on a Raspberry Pi or other ARM devices, it is recommended that you use an external SSD. etcd is write intensive; SD cards and eMMC cannot handle the IO load.
In my case, I am using an external MariaDB database running on the Corsair Flash Voyager GTX 256GB USB 3.1 Premium Flash Drive. For comparison, the SanDisk 256GB Extreme microSDXC UHS-I card offers a write rate of 130MB/s and a read rate of 190MB/s, while the Voyager GTX USB 3.1 provides a read and write rate of 440MB/s. However, it comes at 2.5 times the price of a microSD card.
Kubernetes requires the `cgroups` (control groups) Linux kernel feature. Unfortunately, the memory subsystem of this feature is not enabled by default in the latest Raspberry Pi OS image. To verify whether it is, you can run `cat /proc/cgroups` and see whether there is a `1` in the `enabled` column for the `memory` row:
```
gvenzl@gvenzl-rbp-0:~ $ cat /proc/cgroups
#subsys_name    hierarchy    num_cgroups    enabled
cpuset          0            58             1
cpu             0            58             1
cpuacct         0            58             1
blkio           0            58             1
memory          0            58             0
devices         0            58             1
freezer         0            58             1
net_cls         0            58             1
perf_event      0            58             1
net_prio        0            58             1
pids            0            58             1
```
If you see a `0` like in the output above, you have to enable the memory subsystem. This is done by adding `cgroup_enable=memory` to the `/boot/firmware/cmdline.txt` file and then rebooting the system. The quickest way to do this is via these commands (note: `sudo reboot` will reboot your Raspberry Pi):

```
# cmdline.txt must stay a single line of kernel parameters; verify the file after editing
sudo sh -c 'echo " cgroup_enable=memory" >> /boot/firmware/cmdline.txt'
sudo reboot
```
Note: `cgroup_enable=cpuset` and `cgroup_memory=1` are no longer required.

Once the system is up again, double-check the entry for the `memory` subsystem, which should now show a `1`:
```
gvenzl@gvenzl-rbp-0:~ $ cat /proc/cgroups
#subsys_name    hierarchy    num_cgroups    enabled
cpuset          0            94             1
cpu             0            94             1
cpuacct         0            94             1
blkio           0            94             1
memory          0            94             1
devices         0            94             1
freezer         0            94             1
net_cls         0            94             1
perf_event      0            94             1
net_prio        0            94             1
pids            0            94             1
```
Repeat the above step on every Raspberry Pi before continuing.
Static IP addresses make cluster management and communication easier. They ensure that devices always have the same IP address, which makes it easier to identify a given node and prevents communication disruption between nodes due to changing IP addresses. The latest Raspberry Pi OS has a new NetworkManager and associated command-line utilities. For an interactive, text-based UI, use the `nmtui` (network manager text user interface) command. For scripting purposes, you can use `nmcli` (network manager command line interface) to assign static IP addresses to the Raspberry Pis. You should find an already preconfigured `Wired connection 1` on the ethernet device `eth0`. You can verify that via `nmcli con show`:
```
gvenzl@gvenzl-rbp-0:~ $ nmcli con show
NAME                UUID                                  TYPE      DEVICE
preconfigured       999a68d9-b3e1-4437-bf86-1c9a5f775159  wifi      wlan0
lo                  db18dd9c-94fe-4fac-8c66-d6ddc3406900  loopback  lo
Wired connection 1  68f30a89-ef57-3ee9-8238-9310f0829f21  ethernet  eth0
```
To change the configuration for the ethernet connection to have a static IP address, use `sudo nmcli con mod`. The `NN` reflects the digits you want to use for the Raspberry Pi. In my case, it's going to be `10`, `11`, `12`, `13`, `14`, and `15` on the given node:
```
sudo nmcli c mod "Wired connection 1" ipv4.addresses "192.168.0.NN/24" ipv4.method manual
```
If you also want to set a gateway address to reach the outside network and/or internet, and DNS entries for name resolution, you can do so with the following commands:

```
sudo nmcli con mod "Wired connection 1" ipv4.gateway 192.168.0.1
sudo nmcli con mod "Wired connection 1" ipv4.dns "192.168.0.1, 1.1.1.1, 8.8.8.8"
```
Note: In my case, I will reach the outside world via the WiFi connection. The ethernet connection is purely for cluster communication.

Repeat the above step on every Raspberry Pi before continuing.
The simplest way to install K3s is by running `curl -sfL https://get.k3s.io | sh -`. However, because I'm using an external MariaDB database as the cluster data store, things are a bit different. Instead of using `etcd`, the installation needs to connect to the MariaDB database. This can be done by supplying the `--datastore-endpoint` parameter or the `K3S_DATASTORE_ENDPOINT` environment variable during the installation. For more details, see Cluster Datastore in the K3s documentation.

K3s is capable of connecting to the MariaDB socket at `/var/run/mysqld/mysqld.sock` using the `root` user if just `mysql://` is provided as the datastore endpoint. That means the database needs to run on the same host as the control plane, and socket connectivity for `root` has to be enabled in the MariaDB configuration. Alternatively, one can create a user and database manually, which is what I will do. The user and database will both be called `kubernetes`. Here are the four SQL statements you will need for that:
```sql
CREATE DATABASE kubernetes;
CREATE USER 'kubernetes'@'<your IP address range>' IDENTIFIED BY '<your password>';
GRANT ALL PRIVILEGES ON kubernetes.* TO 'kubernetes'@'<your IP address range>';
FLUSH PRIVILEGES;
```

```
gvenzl@gvenzl-rbp-0:~ $ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 36
Server version: 10.11.6-MariaDB-0+deb12u1 Debian 12

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> CREATE DATABASE kubernetes;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> CREATE USER 'kubernetes'@'192.168.10.%' IDENTIFIED BY '*********';
Query OK, 0 rows affected (0.005 sec)

MariaDB [(none)]> GRANT ALL PRIVILEGES ON kubernetes.* TO 'kubernetes'@'192.168.10.%';
Query OK, 0 rows affected (0.002 sec)

MariaDB [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.002 sec)

MariaDB [(none)]> exit;
Bye
gvenzl@gvenzl-rbp-0:~ $
```
To start the K3s installation, a slightly different variation of the above setup script needs to be run to include the `--datastore-endpoint` parameter:

```
curl -sfL https://get.k3s.io | sh -s - --datastore-endpoint mysql://<username>:<password>@tcp(<hostname>:3306)/<database-name>
```
In my case, this is going to look like this:

```
curl -sfL https://get.k3s.io | sh -s - --datastore-endpoint "mysql://kubernetes:*******@tcp(192.168.10.10:3306)/kubernetes"
```

```
gvenzl@gvenzl-rbp-0:~ $ curl -sfL https://get.k3s.io | sh -s - --datastore-endpoint "mysql://kubernetes:*********@tcp(192.168.10.10:3306)/kubernetes"
[INFO] Finding release for channel stable
[INFO] Using v1.30.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.6+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.30.6+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Finding available k3s-selinux versions
sh: 416: [: k3s-selinux-1.6-1.el9.noarch.rpm: unexpected operator
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] Host iptables-save/iptables-restore tools not found
[INFO] Host ip6tables-save/ip6tables-restore tools not found
[INFO] systemd: Starting k3s
gvenzl@gvenzl-rbp-0:~ $
```
Once the script has finished, verify the control plane setup via `sudo kubectl get nodes`:

```
gvenzl@gvenzl-rbp-0:~ $ sudo kubectl get nodes
NAME           STATUS   ROLES                  AGE     VERSION
gvenzl-rbp-0   Ready    control-plane,master   2m48s   v1.30.6+k3s1
```
To add the additional Pis to the cluster, you must first retrieve the cluster token from the `/var/lib/rancher/k3s/server/token` file, which is needed for the agent installation. You can do that via the following command:

```
sudo cat /var/lib/rancher/k3s/server/token
```
You will get a token that looks something like this:

```
gvenzl@gvenzl-rbp-0:~ $ sudo cat /var/lib/rancher/k3s/server/token
K103bf5abb471fc2f7bcda85fa95a60c0f934a22a858c6ae943f4d7e0ee4091bc11::server:f3d376e3274a174a38b3b97d224aac6d
```
Once you have retrieved the token, connect to the other Raspberry Pis and execute the following command:

```
curl -sfL https://get.k3s.io | K3S_URL=https://<control plane node IP>:6443 K3S_TOKEN=<server token> sh -
```
For example:

```
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.10.10:6443 K3S_TOKEN=K103bf5abb471fc2f7bcda85fa95a60c0f934a22a858c6ae943f4d7e0ee4091bc11::server:f3d376e3274a174a38b3b97d224aac6d sh -
```

```
gvenzl@gvenzl-rbp-1:~ $ curl -sfL https://get.k3s.io | K3S_URL=https://192.168.10.10:6443 K3S_TOKEN=K103bf5abb471fc2f7bcda85fa95a60c0f934a22a858c6ae943f4d7e0ee4091bc11::server:f3d376e3274a174a38b3b97d224aac6d sh -
[INFO] Finding release for channel stable
[INFO] Using v1.30.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.6+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.30.6+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Finding available k3s-selinux versions
sh: 416: [: k3s-selinux-1.6-1.el9.noarch.rpm: unexpected operator
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO] systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO] Host iptables-save/iptables-restore tools not found
[INFO] Host ip6tables-save/ip6tables-restore tools not found
[INFO] systemd: Starting k3s-agent
```
Once the installation has finished on all nodes, you have your K3s cluster up and running. You can verify that by running `sudo kubectl get nodes` on the control plane one more time:

```
gvenzl@gvenzl-rbp-0:~ $ sudo kubectl get nodes
NAME           STATUS   ROLES                  AGE     VERSION
gvenzl-rbp-0   Ready    control-plane,master   16m     v1.30.6+k3s1
gvenzl-rbp-1   Ready    <none>                 5m42s   v1.30.6+k3s1
gvenzl-rbp-2   Ready    <none>                 3m14s   v1.30.6+k3s1
gvenzl-rbp-3   Ready    <none>                 2m36s   v1.30.6+k3s1
gvenzl-rbp-4   Ready    <none>                 2m7s    v1.30.6+k3s1
gvenzl-rbp-5   Ready    <none>                 42s     v1.30.6+k3s1
gvenzl@gvenzl-rbp-0:~ $
```
Congratulations, you now have a K3s cluster ready for action!
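If you want a quick, optional smoke test (not part of the original setup), you can schedule a throwaway deployment and watch it land on the worker nodes:

```
sudo kubectl create deployment nginx --image=nginx --replicas=3
sudo kubectl get pods -o wide    # pods should be spread across the nodes
sudo kubectl delete deployment nginx
```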
kubectl

If you want to access the cluster from, e.g., your local MacBook with `kubectl`, you will need to save a copy of the `/etc/rancher/k3s/k3s.yaml` file locally as `~/.kube/config` (the `k3s.yaml` file needs to be called `config`) and replace the value of the `server` field with the IP address or name of the K3s server.

`kubectl` itself can be installed via Homebrew:
```
gvenzl@gvenzl-mac ~ % brew install kubectl
==> Downloading https://ghcr.io/v2/homebrew/core/kubernetes-cli/manifests/1.31.3
Already downloaded: /Users/gvenzl/Library/Caches/Homebrew/downloads/f8fd19d10e239038f339af3c9b47978cb154932f089fbf6b7d67ea223df378de--kubernetes-cli-1.31.3.bottle_manifest.json
==> Fetching kubernetes-cli
==> Downloading https://ghcr.io/v2/homebrew/core/kubernetes-cli/blobs/sha256:fd154ae205719c58f90bdb2a51c63e428c3bf941013557908ccd322d7488fb67
Already downloaded: /Users/gvenzl/Library/Caches/Homebrew/downloads/ec1af5c100c16e5e4dc51cff36ce98eb1e257a312ce5a501fae7a07724e59bf9--kubernetes-cli--1.31.3.sonoma.bottle.tar.gz
==> Pouring kubernetes-cli--1.31.3.sonoma.bottle.tar.gz
==> Caveats
zsh completions have been installed to:
  /usr/local/share/zsh/site-functions
==> Summary
🍺  /usr/local/Cellar/kubernetes-cli/1.31.3: 237 files, 61.3MB
==> Running `brew cleanup kubernetes-cli`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
```
Next, create the `~/.kube` folder and save a copy of `k3s.yaml` as `config`:

```
gvenzl@gvenzl-mac ~ % mkdir ~/.kube
gvenzl@gvenzl-mac ~ % scp root@gvenzl-rbp-0:k3s.yaml ~/.kube/config
root@gvenzl-rbp-0's password:
k3s.yaml                                   100% 2965   221.2KB/s   00:00
```
And replace the `server` parameter with the IP address or hostname of your control plane node:

```
gvenzl@gvenzl-mac ~ % sed -i '' 's|server: .*|server: https://gvenzl-rbp-0:6443|g' ~/.kube/config
```
Once you have done that, you can control your cluster locally too:

```
gvenzl@gvenzl-mac ~ % kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
gvenzl-rbp-0   Ready    control-plane,master   39m   v1.30.6+k3s1
gvenzl-rbp-1   Ready    <none>                 28m   v1.30.6+k3s1
gvenzl-rbp-2   Ready    <none>                 25m   v1.30.6+k3s1
gvenzl-rbp-3   Ready    <none>                 25m   v1.30.6+k3s1
gvenzl-rbp-4   Ready    <none>                 24m   v1.30.6+k3s1
gvenzl-rbp-5   Ready    <none>                 23m   v1.30.6+k3s1
```
Ambassador post by Liam Randall, CNCF Ambassador and CEO, Cosmonic
We recently had the opportunity to reflect on the state of platform engineering within large companies, and the role WebAssembly has to play in the discipline's future. In this post, we'll take a look at the cost and complexity challenges faced by teams building and running applications at scale. Then, we'll consider the role of WebAssembly, and of platforms like the Cloud Native Computing Foundation (CNCF) incubating project wasmCloud, in overcoming these obstacles.
\\n\\n\\n\\nPlatform engineering as a practice really began during the “container era” of compute. We split archaic systems into manageable microservices, packaged those microservices into containers, then orchestrated those containers with Kubernetes.
\\n\\n\\n\\nThis brought infrastructure and platform teams closer together. Infrastructure teams became part of the same team that provisioned common application services for developers to build with—everything from Secrets management to authentication and authorization. Without a doubt, containers have improved engineering for the better, but they’ve also brought along some big problems.
\\n\\n\\n\\n💡 For a more in-depth analysis of the challenges facing platform engineering teams Bailey Hayes’s recent Innovation Day keynote is unmissable.
\\n\\n\\n\\nToday, it is simply too expensive to build, run, and maintain applications. This starts with the choice of language, and the “golden template” we use for all our different enterprise applications. We are tied into specific clients and frameworks that are built for a specific language. All of this means we have to build and maintain applications on an app-by-app basis.
\\n\\n\\n\\nWhen platform engineers need to update a given platform, they need their users—application developers—to update dependencies for every application. Because most applications are composed largely of open source dependencies, they inherit the cost of maintaining these to prevent vulnerabilities. Engineers churn on a hamster wheel of application maintenance which is expensive and unproductive.
\\n\\n\\n\\nContainers can’t solve the hamster wheel. Moreover, containerization runs up against some significant limits in portability. We have to build differentiated containers for diverse architectures. Even at their leanest, Kubernetes and containers are too unwieldy to run on sensors, on IoT devices, and in many other edge environments.
\\n\\n\\n\\nSo how can we make applications that are truly portable? While we’re at it, can we solve the hamster wheel and language silo problems?
\\n\\n\\n\\nThe answer lies in portable, interoperable WebAssembly components. We’re entering a new era of componentization where our distributed units of compute are WebAssembly components—orchestrated by wasmCloud.
\\n\\n\\n\\nIt’s important to mention, wasmCloud loves Kubernetes—Wasm runs really well alongside containers. The wasmcloud-operator and Helm charts make deploying Wasm on Kubernetes pretty simple.
\\n\\n\\n\\n💡 For an example of a large organization bringing Wasm to its Kubernetes estate, Adobe’s use case is a great starting point.
\\n\\n\\n\\nHaving said that, Kubernetes is designed for infrastructure orchestration, not application management. wasmCloud is designed for application orchestration at-scale, and it can handle that job either standalone or on a Kubernetes cluster.
Think of WebAssembly as a tiny VM. A Wasm application is compiled with components into a highly efficient binary format that is smaller and more lightweight than traditional container images. This reduced size minimizes memory and storage requirements, allowing more workloads to fit into the same amount of space. Containers are much larger, ranging from MBs to GBs, whereas a componentized WebAssembly application ranges from KBs to MBs.
Wasm components themselves are sandboxed, stateless, and secure by default because they are incapable of interacting with an operating system on their own. With containers, there are a host of processes, best practices, and tools required to achieve the right level of security and isolation; containers are not an intrinsically safe way to sandbox and isolate arbitrary user code.
\\n\\n\\n\\nWhile containers enhance the portability of applications, those images are still bound to specific CPU architectures. With WebAssembly components, the .wasm binary is the same for any environment, as long as that environment has a WebAssembly runtime. Wasm’s portability allows workloads to run across heterogeneous environments, such as cloud, edge, or even resource-constrained devices.
\\n\\n\\n\\nContainer cold-starts take seconds, whereas WebAssembly components start in less than a millisecond. The latency of a user’s request is typically about 200ms, which means containers must be running at all times, ready to service requests even when they’re not needed.
We'll soon see new types of architectures evolve that capitalize on the size and speed of components. Why make a network request to somewhere else when you can run that compute locally, on a user's device?
\\n\\n\\n\\nThis all sounds great. But what if you’ve spent years putting Kubernetes at the center of your universe—if containers work well enough, does adopting Wasm equal another heavy lift into the unknown? To answer that, we need to explore how components work.
We often pack the whole world into containers; with Wasm components, we get down to the application's core logic. Packaged as OCI artifacts, components radically simplify the way developers build applications. We compose applications from standard, reusable building blocks (components) that contain only the logic we need and that are compiled at runtime.

That's huge. We can reuse our components, rather than rewriting them, and they contain only the code required. By eliminating boilerplate code (the stuff we import in libraries), we also eliminate the vulnerabilities that come with it. When you consider the time spent identifying, reporting, and remediating vulnerabilities, this is transformational.

Components build on top of each other, just like Lego blocks. They run anywhere components are supported and, of course, because they're tiny, we can run componentized applications in resource-constrained environments at network edges. As we'll see later, manufacturing and IoT are emerging as a clear use case for wasmCloud.
\\n\\n\\n\\n💡 The Components Starter Guide is a fantastic resource for those at the start of their journey with components.
\\n\\n\\n\\nMost importantly, components are interoperable and polyglot. First principles dictate that we use applications written in many different languages, all of which must be supported. Components communicate over a common set of shared, standardized interfaces—WASI—bound together with Wasm Interface Types (WIT). One might be written in Go, another in Rust, and another in JavaScript, but they can all communicate with one another over these shared APIs.
\\n\\n\\n\\n💡 Check out Alex Creighton’s documentation which perfectly explains the ways in which the WIT IDL supports the WebAssembly Component Model.
\\n\\n\\n\\nTeams use their favorite libraries in their favorite languages, and once they compile the code to a Wasm component, other components can make use of the functions they expose—regardless of the language and libraries used to write those components.
\\n\\n\\n\\n💡 The Bytecode Alliance also has extensive resources on the WebAssembly Component Model. And check out the WIT cheat sheet.
\\n\\n\\n\\nAn incubating, open source CNCF project, wasmCloud is designed to bring the WebAssembly Component Model to life in real-world use cases. Components are the portable unit of code, and wasmCloud orchestrates deployments across distributed environments. Just as Kubernetes is the chosen orchestration platform for containers, wasmCloud is the platform for cloud-native application orchestration, at-scale.
\\n\\n\\n\\nWebAssembly eliminates the rinse-and-repeat, app-by-app maintenance cycle common in most organizations. wasmCloud allows teams to create one lean set of core applications, built on common standards, and portable to any location.
\\n\\n\\n\\nThe architecture of wasmCloud makes it naturally suited to hybrid clouds and multi-tenant environments, with automatic failover and load balancing built in. wasmCloud can stand alone or run on Kubernetes with the wasmcloud-operator. Moreover, wasmCloud is designed to integrate with existing cloud native tooling and standards such as OpenTelemetry and Open Policy Agent.
\\n\\n\\n\\nIn wasmCloud, Wasm applications use open standards and common interfaces for platform capabilities. These can be swapped at runtime, allowing us to switch vendors or do maintenance upgrades on-the-fly. Because these are high-level APIs, they’re vendor-agnostic, which means teams can use their own tooling.
\\n\\n\\n\\n💡 Take a look at the growing set of capability providers to see how wasmCloud integrates with common cloud native tools.
\\n\\n\\n\\nwasmCloud also allows engineers to write their own custom capabilities and interfaces—completely essential given every organization’s use cases and infrastructure are different.
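\\n\\n\\n\\nTo make this concrete, below is a rough sketch of a wadm manifest (the declarative, OAM-based format wasmCloud uses to describe applications), wiring a component to an HTTP server capability provider. The images, versions, and link fields are illustrative assumptions rather than an authoritative schema:
\\n\\n\\n\\n```yaml\\n# Illustrative sketch only: image references and field names are assumptions,\\n# not a verified wadm schema.\\napiVersion: core.oam.dev/v1beta1\\nkind: Application\\nmetadata:\\n  name: hello-world\\nspec:\\n  components:\\n    - name: hello\\n      type: component            # the application logic, compiled to a Wasm component\\n      properties:\\n        image: ghcr.io/wasmcloud/components/http-hello-world-rust:0.1.0\\n      traits:\\n        - type: spreadscaler      # scale instances across available hosts\\n          properties:\\n            instances: 1\\n    - name: httpserver\\n      type: capability            # a swappable capability provider\\n      properties:\\n        image: ghcr.io/wasmcloud/http-server:0.23.0\\n      traits:\\n        - type: link              # link the provider to the component over wasi:http\\n          properties:\\n            target: hello\\n            namespace: wasi\\n            package: http\\n            interfaces: [incoming-handler]\\n```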
\\n\\n\\n\\nWe’re seeing WebAssembly being adopted in a wide variety of use cases in banking, manufacturing, telecommunications, digital services, gaming, and more. In every case, this idea of being able to build one set of applications, capable of running everywhere is what engineers are asking for.
\\n\\n\\n\\nManufacturing and IoT use cases are proliferating because orchestrators like wasmCloud make it possible to put compute power on devices and in production lines. A great example is industrial analytics company MachineMetrics: deploying wasmCloud on factory equipment to run real-time analytics on expensive machinery and improve its performance and longevity.
\\n\\n\\n\\nAdobe has been a long-time user of wasmCloud. The company first proved the value of bringing wasmCloud to its Kubernetes infrastructure by improving the efficiency of microservices running in Kubernetes. Lean and lightweight, wasmCloud workloads can be scaled almost instantly as traffic ramps up, giving more scheduling flexibility than a coarse-grained container. Now, the use cases for Wasm are growing at Adobe.
\\n\\n\\n\\nAdoption in telecoms is growing. Orange is working with wasmCloud to improve service delivery on customer handsets. Orange is also part of a wider group which includes Vodafone, Etisalat and nbnco who have proved the potential value of replacing Kubernetes with Wasm in managing the TM Forum’s estate of open APIs. The WebAssembly Canvas project is now in Phase II which will look at the value of bringing wasmCloud together with Kubernetes.
\\n\\n\\n\\nAkamai’s recent presentation at wasmCloud Innovation Day gave us an insight into how platforms like wasmCloud can extend compute power to the edge in even more advanced use cases. And our celestial-terrestrial mesh demo shows how space agencies could put Wasm to use in super-distributed, multi-tenant use cases.
\\n\\n\\n\\nFinally, it was incredible to see engineers from global financial services company American Express, on stage at WasmCon, showing how they’re bringing Wasm, and wasmCloud, to their architecture to elevate serverless platforms with Wasm components.
\\n\\n\\n\\nAs in past epochs of computing, VMs did not disappear with the advent of containers, and containers will not be completely replaced by Wasm. Stateful, long-running compute and appliances like databases are unlikely to be replaced with Wasm. This evolution in technology will not occur overnight, and so Wasm apps will continue to run alongside the thousands of apps already running in containers and on Kubernetes.
\\n\\n\\n\\nTake a look at our documentation and click on Quickstart for a tour of wasmCloud, its features and functionality.
\\n\\n\\n\\nDon’t forget, the wasmCloud community meets every week on Wednesday at 1pm EST. You can add the meeting to your calendar, or join us on YouTube. Oh, and come join the discussion on Slack!
\\n\\n\\n\\n\\n\\nProject post originally published on the OpenTelemetry blog by Severin Neumann (Cisco), Patrice Chalin (CNCF), Tiffany Hrabusa (Grafana Labs)
\\n\\n\\n\\nAs 2024 draws to a close, we reflect on the year and share some insights and accomplishments from SIG Communications, the team responsible for managing this website, blog, and documentation.
\\n\\n\\n\\nSeveral key accomplishments stand out in our efforts to make OpenTelemetry documentation more accessible, user-friendly, and impactful for our global community.
\\n\\n\\n\\nA major accomplishment this year was achieving multilingual support with the launch of our localized documentation. Thanks to the efforts of localization teams, over 120 pages were translated from English into other languages. The available translations include:
\\n\\n\\n\\nA big thank you to everyone who contributed to this initiative. These translations make OpenTelemetry more accessible, enhancing the user experience for our global audience.
\\n\\n\\n\\nTo improve readership experience and make OpenTelemetry documentation more intuitive and accessible, we undertook important updates to our Information Architecture (IA) this year. These changes were driven by the need to better organize content, clarify the purpose of key sections, and provide a more structured and user-friendly experience for end-users and developers.
\\n\\n\\n\\nKey IA updates include:
\\n\\n\\n\\nRenaming the Instrumentation section to Language APIs & SDKs to better reflect its purpose and set clearer expectations for users.
\\n\\n\\n\\nMerging Automatic Instrumentation into the new Zero-code Instrumentation section to more clearly distinguish between instrumentation APIs & SDKs and tools like the Java agent, used to inject telemetry.
\\n\\n\\n\\nNext year, we aim to redesign how OpenTelemetry is introduced to beginners, ensuring a smoother and more accessible learning experience. If you’re passionate about making OpenTelemetry easier to understand and use, we’d love your contributions — join us in this collaborative effort.
\\n\\n\\n\\nIn December 2022, we started monthly releases of the website so that we could regularly summarize activities and highlight significant contributions. These releases allow us to track progress over time and perform long-term comparisons.
\\n\\n\\n\\nFor instance, comparing the periods December 2022 to November 2023 and December 2023 to November 2024, we observed an upward trend in contributions:
\\n\\n\\n\\nSince the repository’s inception in April 2019, the community has seen remarkable growth, with:
\\n\\n\\n\\nThank you to every contributor for helping to build and improve the OpenTelemetry website. Your efforts make a difference!
\\n\\n\\n\\nAccording to our publicly available analytics data, opentelemetry.io was viewed 12 million times across 4 million sessions this year. This marks a 16% increase over last year’s nearly 10 million views and over 3 million sessions.
\\n\\n\\n\\nThe most popular pages and sections of the documentation were:
\\n\\n\\n\\nPage/Section | Views | % 1 |
---|---|---|
What is OpenTelemetry? | 290K | 2.4% |
Collector | 1.3M | 10.5% |
Concepts | 1.2M | 9.8% |
Demo | 829K | 6.7% |
Ecosystem | 500K | 4.0% |
Did you know that:
\\n\\n\\n\\nWith 1.3K PRs, we collectively contributed an equally impressive number of reviews to ensure that content is accurate, valuable, aligned with our documentation goals, and easy to read and understand.
\\n\\n\\n\\nIn addition to PRs, contributors created nearly 500 issues and engaged in many discussions, reporting bugs, suggesting improvements, and driving collaboration. Each of these efforts reflects our community’s dedication to maintaining the quality of OpenTelemetry docs.
\\n\\n\\n\\nWe are fortunate to have many contributors who take on responsibilities, including:
\\n\\n\\n\\nThank you to everyone who contributed their time and expertise to OpenTelemetry docs this year!
\\n\\n\\n\\nA big shout-out to everyone for making 2024 a successful year! We look forward to continuing our collaboration in 2025.
\\n\\n\\n\\nWhether you’re an end user, a contributor, or simply enthusiastic about OpenTelemetry, we welcome your participation. You can get involved by raising issues, participating in discussions, or submitting PRs.
\\n\\n\\n\\nYou can also join us:
\\n\\n\\n\\nOn the CNCF Slack, in the #otel-prefixed channels.
\\n\\n\\n\\nTogether, we can make 2025 another amazing year for opentelemetry.io!
\\n\\n\\n\\nAmbassador post by Angel Ramirez, CEO of Cuemby and CNCF ambassador.
\\n\\n\\n\\nAs the technology landscape evolves, businesses must embrace innovations that enable them to adapt and thrive. Cloud-native technologies, championed by the CNCF community, have emerged as essential tools for achieving scalability, agility, and resilience. For small and medium-sized businesses (SMBs), these solutions offer a unique opportunity to streamline operations, reduce costs, and accelerate growth.
\\n\\n\\n\\nDrawing from Cuemby’s experience in implementing cloud-native practices, this article explores the transformative potential of these technologies and their practical impact on SMBs.
\\n\\n\\n\\nMigrating to cloud-native solutions helps SMBs reallocate IT budgets more effectively. Kubernetes, a CNCF graduated project, eliminates the need for costly hardware investments by orchestrating containerized applications across cloud environments. Businesses only pay for the resources they use, significantly lowering operational costs.
\\n\\n\\n\\nTake a mid-sized company with an annual IT budget of $5 million: moving to the cloud could result in savings ranging from $750,000 to $1.25 million annually (Flexera, 2023). These freed-up resources can then be strategically redirected toward growth-driving activities, such as marketing, product development, or customer acquisition.
But the savings don’t stop there. Cloud solutions operate on a pay-as-you-go model, meaning businesses only pay for the resources they actually use. This is especially advantageous for companies with fluctuating demand, such as seasonal industries. For example, organizations in retail or travel could cut costs by up to 30% during low-demand periods by scaling down their usage (Deloitte, 2022). This flexibility not only prevents overinvestment but also optimizes operational expenses.
In addition to infrastructure savings, maintenance costs are significantly reduced. On-premises systems often require 15% to 20% of the IT budget for hardware refreshes and updates (TechRepublic, 2021). With cloud migration, these tasks are handled by the cloud provider, enabling businesses to focus on their core objectives rather than routine IT upkeep.
Cloud-native platforms like Kubernetes enable businesses to scale operations in minutes rather than months. According to CNCF reports, organizations adopting Kubernetes see a 60% reduction in time-to-market for new services.
\\n\\n\\n\\nDuring seasonal peaks, industries such as retail have relied on Kubernetes to dynamically scale workloads. For example, during high-demand periods like Black Friday, businesses using cloud platforms report improved performance with up to 40% less downtime compared to those relying on on-premises infrastructure (McKinsey, 2021). This ensures that operations remain uninterrupted, enhancing customer satisfaction and protecting revenue during crucial periods.
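\\n\\n\\n\\nFor a flavor of the mechanism behind this elasticity, here is a minimal sketch of a Kubernetes HorizontalPodAutoscaler that scales a hypothetical storefront Deployment on CPU utilization; the names and thresholds are illustrative, not a recommendation:
\\n\\n\\n\\n```yaml\\n# Minimal sketch: the Deployment name and thresholds are hypothetical.\\napiVersion: autoscaling/v2\\nkind: HorizontalPodAutoscaler\\nmetadata:\\n  name: storefront\\nspec:\\n  scaleTargetRef:\\n    apiVersion: apps/v1\\n    kind: Deployment\\n    name: storefront\\n  minReplicas: 3                   # baseline capacity during quiet periods\\n  maxReplicas: 50                  # headroom for peaks such as Black Friday\\n  metrics:\\n    - type: Resource\\n      resource:\\n        name: cpu\\n        target:\\n          type: Utilization\\n          averageUtilization: 70   # add pods when average CPU exceeds 70%\\n```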
\\n\\n\\n\\nReal-time monitoring and logging are essential for minimizing downtime. Tools like Prometheus and Fluentd, widely adopted CNCF projects, offer robust solutions for automating these tasks.
\\n\\n\\n\\nSmall businesses often lack the capacity to maintain and update traditional IT infrastructure. By migrating to the cloud, they can eliminate 15% to 20% of their IT budget typically allocated to hardware maintenance and upgrades (TechRepublic, 2021). Cloud platforms reduce downtime by leveraging built-in redundancies and failover mechanisms. For example, a Gartner 2023 study highlighted that businesses using cloud infrastructure experience 35% fewer unplanned outages compared to those with traditional on-premises systems (Gartner, 2023).
This delegation of maintenance tasks empowers businesses to concentrate on growth activities such as customer engagement, product innovation, and scaling operations. For instance, a small e-commerce company that shifted to a cloud-based system reduced its IT overhead by 35%, enabling it to allocate more resources to digital marketing and inventory expansion (Flexera, 2023).
Cloud computing transforms collaboration, especially for businesses adopting remote or hybrid work models. According to a 2023 Gartner survey, organizations using cloud-based collaboration tools like Google Workspace or Microsoft Teams improved productivity by 25% and reduced project turnaround times by 18%.
Small businesses with distributed teams particularly benefit from real-time collaboration features. Employees can access the same version of files, ensuring consistency and reducing errors. In one case, a startup leveraging cloud solutions for remote work reported a 30% increase in employee satisfaction due to the flexibility of accessing work resources from any device, anywhere (Deloitte, 2022). Additionally, cloud-based systems support secure access, ensuring sensitive data remains protected during remote operations.
In highly competitive industries like technology and e-commerce, speed is essential. The cloud goes beyond rapid scalability by enabling businesses to create sandbox environments, where new ideas can be tested, refined, and deployed with minimal risk. For example, retail companies can simulate and optimize seasonal promotions, while technology firms can rapidly prototype new applications without the delays of traditional infrastructure. This agility not only shortens time-to-market cycles but also empowers businesses to experiment and innovate with confidence, staying ahead of evolving market demands.
\\n\\n\\n\\nAdopting the latest technologies is only part of the equation—using them strategically is where the cloud truly transforms businesses. A cloud-enabled digital transformation roadmap ensures every tool and system aligns with long-term business objectives. Leveraging predictive analytics and AI-driven insights in the cloud allows companies to proactively optimize supply chains, improve operational efficiency, and deliver highly personalized customer experiences. This strategic alignment ensures that technology supports growth, innovation, and competitive advantage rather than becoming a siloed resource.
\\n\\n\\n\\nCloud providers like AWS, Microsoft Azure, and Google Cloud collectively invest billions annually into cutting-edge security measures, including encryption, real-time threat monitoring, and advanced firewalls. In 2021 alone, Alphabet, Amazon, Meta, Apple and Microsoft spent a combined $2.4 billion on funding or acquiring 23 cybersecurity companies (Statista). These investments ensure small and large businesses alike benefit from enterprise-grade security without the need for their own IT infrastructure.
\\n\\n\\n\\nCloud platforms simplify compliance with regulations like GDPR, SOC 2, and HIPAA. A 2022 study by Forrester Research found that 68% of organizations using cloud services reduced the time spent on compliance-related tasks by 40%, streamlining audits and ensuring adherence to global standards (Forrester, 2022). Businesses using cloud-based tools can ensure compliance with automated data management and reporting features, mitigating risks and building trust among stakeholders.
\\n\\n\\n\\nDisruptions such as natural disasters, cyberattacks, or hardware failures can severely impact business operations. Cloud-based disaster recovery solutions offer cost-effective and efficient means to safeguard data and maintain business continuity. For example, AWS Elastic Disaster Recovery provides scalable and reliable recovery options, enabling businesses to resume operations swiftly after a disruption (Amazon Web Services).
\\n\\n\\n\\nBy utilizing cloud-based disaster recovery services, businesses can ensure their data remains accessible and protected, even in the face of unforeseen events. This resilience is invaluable for maintaining customer trust and sustaining operations during challenging times.
\\n\\n\\n\\nThe journey to cloud migration is a call to action for businesses to reimagine their operations and embrace a future where agility and innovation define success. As companies adopt cloud solutions to drive digital transformation, they are also laying the groundwork for resilience and competitiveness in a rapidly evolving market.
\\n\\n\\n\\nTo thrive in this cloud-first world, it’s essential to build strong partnerships. Collaborating with experts who understand the complexities of cloud adoption can make the transition seamless and ensure your business maximizes the value of this transformative technology. Whether you’re looking to streamline processes or scale operations, the right partner can help design a roadmap tailored to your needs.
\\n\\n\\n\\nCuemby is dedicated to helping organizations navigate this transformation. By combining technical expertise with practical insights, we empower businesses to achieve resilience, scalability, and innovation. Discover how Cuemby’s cloud-native solutions can help your business succeed. Visit https://www.cuemby.com to learn more, or schedule a consultation to explore how cloud computing can optimize your operations.
\\n\\n\\n\\nFor more inspiration and resources, explore the CNCF blog and news section to stay updated on the latest trends and advancements in the cloud-native ecosystem. Let’s build a scalable and resilient future together.
\\n\\nCNCF is excited to share that since launching the Kubestronauts program less than a year ago, over 1,000 Kubestronauts have joined the program. A special welcome to our 1,000th Kubestronaut, Remy Mollandin! Each of these 1,000+ Kubestronauts has active certifications in all of CNCF’s Kubernetes certifications: KCNA, KCSA, CKA, CKAD, and CKS.
\\n\\n\\n\\nCNCF’s Kubestronaut program is designed to drive the growth of Kubernetes and open source cloud native technologies by providing training, learning resources, networking, and professional development opportunities to help participants grow their cloud-native careers. There is no other program like it in the world.
\\n\\n\\n\\nBecoming a Kubestronaut not only adds you to an elite group, but you also get:
\\n\\n\\n\\n“We’re very pleased to celebrate this major milestone—reaching 1,000 Kubestronauts in less than a year since the program’s inception! This achievement highlights the tremendous enthusiasm within the community to learn and grow with Kubernetes. We appreciate all our Kubestronauts and look forward to learning from them to ensure we produce better cloud native education materials for the world.” –Chris Aniszczyk, CTO, CNCF
\\n\\n\\n\\nIn automated environments, and especially with regards to security, it is extremely important to be aware of what’s currently being offered, what gets deprecated, and what best practices to follow. Continuous learning with this certification program will “push” you to get better. – Maria Salcedo
\\n\\n\\n\\nThe Kubernetes certifications are hands-on exams that will help you acquire not only knowledge of Kubernetes, but also the basic skills to build and troubleshoot application environments on Kubernetes by actually doing the work.
\\n\\n\\n\\n\\n\\n\\n\\nLearn more about how to become a Kubestronaut and read about the highlighted Kubestronauts in Orbit.
\\n\\n\\n\\nKubestronauts come from around the world, with participants in 86 countries and every continent except Antarctica. Fourteen Kubestronauts are the only ones in their entire country! Although India and the United States have the most Kubestronauts of any country, some cities stand out as well. Our top 5 cities for Kubestronauts are:
\\n\\n\\n\\nOur top five countries by number of Kubestronauts are:
\\n\\n\\n\\nWe’d like to thank all the Kubestronauts for being committed members of the CNCF Open Source community. Learn more about becoming a Kubestronaut and explore stories in our Kubestronauts in Orbit series.
\\n\\nCommunity blog post by Sascha Grunert, CRI-O maintainer
\\n\\n\\n\\nThe Node Resource Interface (NRI) allows users to write plugins for Open Container Initiative (OCI) compatible runtimes like CRI-O and containerd. These plugins can make controlled changes to containers at dedicated points in their life cycle. For example, using the NRI it is possible to allocate extra node resources on container creation and release them again after the container has been removed.
\\n\\n\\n\\nA plugin is written as a daemon-like process which serves a predefined API based on ttRPC (gRPC for low-memory environments). In detail, this means that the NRI implementation in the runtime (CRI-O, containerd) communicates with each plugin over a UNIX Domain Socket (UDS) and provides it with all required event data. Events can be, for example, container or pod sandbox creation, stopping, or removal, while the corresponding data includes the name, namespace, or annotations.
\\n\\n\\n\\nOn one hand, plugins written as daemons have the benefit of persisting the current state out of the box; on the other hand, they come with a performance and management overhead. For that reason, the NRI also supports OCI hook-like binary plugins which get executed for each event. Combining the concept of small binary plugins with a universal standard like WebAssembly (Wasm) empowers the NRI to run on the edge and universally on all imaginable platforms.
\\n\\n\\n\\nThe required change for the NRI landed with Pull Request containerd/nri#121. This change adds a go-plugin mechanism to the NRI. Each plugin gets compiled to Wasm, which means that it is size-efficient, memory-safe, automatically sandboxed, and highly portable out of the box! The plugin system works in the same way as the NRI by using Protocol Buffers, which means that the NRI can reuse the existing ttRPC API while the communication happens in memory rather than over a Remote Procedure Call (RPC).
\\n\\n\\n\\nOne key benefit is that WebAssembly is designed as a portable compilation target for programming languages. Plugins compiled to Wasm can be used anywhere, which means that there is no requirement for multi-architecture binaries. Beside that, the Wasm stack machine is designed to be encoded in a size- and time-efficient binary format, which makes Wasm plugins great targets for binary execution.
\\n\\n\\n\\nUnfortunately, the native golang (go) compiler does not have full WebAssembly support yet, which means the plugins have to be compiled using the alternative TinyGo compiler. An example Wasm plugin within the NRI repository can be compiled locally using:
make $(pwd)/build/bin/wasm
\\n\\n\\n\\nOr within a container image:
\\n\\n\\n\\nmake $(pwd)/build/bin/wasm TINYGO_DOCKER=1
\\n\\n\\n\\nIn the future it may be possible to cross-compile plugins using GOOS=wasip1 GOARCH=wasm go build, but that is not implemented yet (see knqyf263/go-plugin#58).
\\n\\n\\n\\nThe resulting file should be a valid WebAssembly binary:
\\n\\n\\n\\nfile build/bin/wasm
\\n\\n\\n\\nbuild/bin/wasm: WebAssembly (wasm) binary module version 0x1 (MVP)\\n
\\n\\n\\n\\nTo try out the binary, we have to put it into the default local NRI directory. We also need to prefix the binary with a chosen index, which later determines the plugin execution order:
\\n\\n\\n\\nsudo mkdir -p /opt/nri/plugins\\nsudo cp build/bin/wasm /opt/nri/plugins/10-wasm
\\n\\n\\n\\nCRI-O v1.32 (which has not been released yet at the time of writing) or its recent main branch can be used to verify that the plugin got loaded successfully:
sudo ./bin/crio
\\n\\n\\n\\n…\\nINFO[…] Create NRI interface\\nINFO[…] runtime interface created\\nINFO[…] Registered domain \\"k8s.io\\" with NRI\\nINFO[…] runtime interface starting up...\\nINFO[…] starting plugins...\\nINFO[…] discovered plugin 10-wasm\\nINFO[…] starting pre-installed NRI plugin \\"wasm\\"...\\nINFO[…] Found WASM plugin: /opt/nri/plugins/10-wasm\\nINFO[…] WASM: Got configure request\\nINFO[…] Synchronizing NRI (plugin) with current runtime state\\nINFO[…] synchronizing plugin 10-wasm\\nINFO[…] WASM: Got synchronize request\\nINFO[…] pre-installed NRI plugin \\"10-wasm\\" synchronization success\\nINFO[…] plugin invocation order\\nINFO[…] #1: \\"10-wasm\\" (external:10-wasm[0])\\n…\\n
\\n\\n\\n\\nThe partial logs above show that the 10-wasm plugin got loaded and that the WebAssembly plugin received a configure and a synchronize request. Log lines prefixed with WASM: are emitted directly by the plugin itself:
func (p *plugin) Configure(ctx context.Context, req *api.ConfigureRequest) (*api.ConfigureResponse, error) {\\n    // Log via the host function and accept the runtime's default configuration.\\n    log(ctx, \\"Got configure request\\")\\n    return nil, nil\\n}
\\n\\n\\n\\nThe logging itself is achieved by a so-called host function. This function can be used to pass data back to the host (the NRI) and process it there (log to stderr). The plugin just has to call the host log function:
func log(ctx context.Context, msg string) {\\n    // Call back into the host (the NRI runtime), which writes the message to stderr.\\n    api.NewHostFunctions().Log(ctx, &api.LogRequest{\\n        Msg:   \\"WASM: \\" + msg,\\n        Level: api.LogRequest_LEVEL_INFO,\\n    })\\n}
\\n\\n\\n\\nAnd the NRI can fulfill the logging functionality:
\\n\\n\\n\\nfunc (wasmHostFunctions) Log(ctx context.Context, request *api.LogRequest) (*api.Empty, error) {\\nswitch request.GetLevel() {\\ncase api.LogRequest_LEVEL_INFO:\\nlog.Infof(ctx, request.GetMsg())\\ncase api.LogRequest_LEVEL_WARN:\\nlog.Warnf(ctx, request.GetMsg())\\ncase api.LogRequest_LEVEL_ERROR:\\nlog.Errorf(ctx, request.GetMsg())\\ndefault:\\nlog.Debugf(ctx, request.GetMsg())\\n}\\n\\nreturn &api.Empty{}, nil\\n}
\\n\\n\\n\\nIf the plugin is loaded into memory and CRI-O now creates an example sandbox, then the WebAssembly instance will get executed accordingly by invoking the correct entry point:
\\n\\n\\n\\nsudo crictl runp test/testdata/sandbox_config.json
\\n\\n\\n\\n…\\nINFO[…] Running pod sandbox: test.crio/podsandbox1/POD id=…\\n…\\nINFO[…] WASM: Got state change request with event: RUN_POD_SANDBOX\\nINFO[…] WASM: Got run pod sandbox request\\n…\\nINFO[…] Ran pod sandbox … with infra container: test.crio/podsandbox1/POD id=…\\n…\\n
\\n\\n\\n\\nWebAssembly NRI plugins allow distributing functionality independently of the target platform in a secure and performant way. That makes them awesome for edge scenarios or for being distributed as OCI artifacts. For the future, it is imaginable to have a (semi) automatic reload functionality for the loaded in-memory plugins, but that is something we are still elaborating.
\\n\\n\\n\\nThank you for reading this blog post! If you have any questions or comments feel free to open an issue in the NRI repository.
\\n\\n\\n\\n\\n\\nMember post by Rohit Raveendran, Facets.Cloud
\\n\\n\\n\\nWhat happens behind the scenes when a Kubernetes pod shuts down? In Kubernetes, understanding the intricacies of pod termination is crucial for maintaining the stability and efficiency of your applications. When a pod is terminated, it’s not just a simple shutdown; it involves a well-defined lifecycle that ensures minimal disruption and data loss. This process, known as graceful termination, is vital for handling in-progress requests and performing necessary clean-up tasks before a pod is finally removed.
\\n\\n\\n\\nThis guide examines each lifecycle phase during pod termination, detailing the mechanisms for graceful handling, resource optimization strategies, persistent data management, and troubleshooting techniques for common termination issues. By the end of this blog, you will have a thorough understanding of how to effectively manage pod termination in your Kubernetes environment, ensuring smooth and efficient operations.
\\n\\n\\n\\nIn Kubernetes, graceful termination means that the system gives the pods time to finish serving in-progress requests and shut down cleanly before removing them. This helps to avoid disruption and loss of data. Kubernetes supports graceful termination of pods, and it’s achieved through the following steps:
\\n\\n\\n\\nWhen a pod is asked to terminate, Kubernetes updates the object state and marks it as “Terminating”.
\\n\\n\\n\\nKubernetes also sends a SIGTERM signal to the main process in each container of the pod.
\\n\\n\\n\\nThe SIGTERM signal is an indication that the processes in the containers should stop. The processes have a grace period (default is 30 seconds) to shut down properly.
\\n\\n\\n\\nIf a process is still running after the grace period, Kubernetes sends a SIGKILL signal to force the process to terminate.
\\n\\n\\n\\nTo control the grace period, you can either set terminationGracePeriodSeconds in your pod spec or pass it explicitly when deleting: kubectl delete pod <pod> --grace-period=<seconds>. A minimal spec-level example is shown below.
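\\n\\n\\n\\nHere is a minimal sketch of a pod spec that extends the grace period to 60 seconds, reusing the hypothetical my-pod and my-image names from the examples later in this post:
\\n\\n\\n\\n```yaml\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: my-pod\\nspec:\\n  terminationGracePeriodSeconds: 60   # overrides the 30-second default\\n  containers:\\n    - name: my-container\\n      image: my-image\\n```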
\\n\\n\\n\\nThe below contains a time series graph for graceful termination:
\\n\\n\\n\\nThe “preStop” hook is executed just before a container is terminated.
\\n\\n\\n\\nWhen a pod’s termination is requested, Kubernetes will run the preStop hook (if it’s defined), send a SIGTERM signal to the container, and then wait for a grace period before sending a SIGKILL signal.
\\n\\n\\n\\nHere’s an example of how you might define a preStop hook in your pod configuration:
\\n\\n\\n\\n```yaml\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: my-pod\\nspec:\\n  containers:\\n    - name: my-container\\n      image: my-image\\n      lifecycle:\\n        preStop:\\n          exec:\\n            command: [\\"/bin/sh\\", \\"-c\\", \\"echo Hello from the preStop hook\\"]\\n```
\\n\\n\\n\\nIn this example, the preStop hook runs a command that prints a message to the container’s standard output.
\\n\\n\\n\\nThe preStop hook is a great place to put code that:
\\n\\n\\n\\nNOTE: Remember that Kubernetes will run the preStop hook, then wait for the grace period (default 30 seconds) before forcibly terminating the container. If your preStop hook takes longer than the grace period, Kubernetes will interrupt it.
\\n\\n\\n\\nNext, let’s look at two related questions: (1) how resource constraints impact pod termination decisions, and (2) strategies for optimizing resource allocation to mitigate potential issues.
\\n\\n\\n\\nKubernetes uses the CPU and memory requests and limits declared on each container to classify pods into QoS classes, which in turn determine which pods are evicted or terminated first under resource pressure. For example:
\\n\\n\\n\\n```yaml\\nresources:\\n  limits:\\n    cpu: \\"1\\"\\n    memory: 1000Mi\\n  requests:\\n    cpu: 200m\\n    memory: 500Mi\\n```
\\n\\n\\n\\nGuaranteed: All containers in the pod have CPU and memory requests and limits, and they’re equal.
\\n\\n\\n\\n```yaml\\nresources:\\n  requests:\\n    memory: \\"64Mi\\"\\n    cpu: \\"250m\\"\\n  limits:\\n    memory: \\"64Mi\\"\\n    cpu: \\"250m\\"\\n```
\\n\\n\\n\\nBurstable: The pod doesn’t meet the criteria for Guaranteed, but at least one container has a CPU or memory request or limit (for example, the limits are set higher than the requests).
\\n\\n\\n\\n```yaml\\nresources:\\n  requests:\\n    memory: \\"64Mi\\"\\n    cpu: \\"250m\\"\\n  limits:\\n    memory: \\"128Mi\\"\\n    cpu: \\"500m\\"\\n```
\\n\\n\\n\\nBestEffort: No containers in the pod have any CPU or memory requests or limits.
\\n\\n\\n\\n```yaml\\nresources: {}\\n```
\\n\\n\\n\\nAt the namespace level, a ResourceQuota caps the total resources pods can request, which helps prevent the resource pressure that leads to evictions:
\\n\\n\\n\\n```yaml\\napiVersion: v1\\nkind: ResourceQuota\\nmetadata:\\n  name: compute-resources\\nspec:\\n  hard:\\n    pods: \\"10\\"\\n    requests.cpu: \\"1\\"\\n    requests.memory: 1Gi\\n    limits.cpu: \\"2\\"\\n    limits.memory: 2Gi\\n```
\\n\\n\\n\\nNode affinity steers pods onto nodes that can actually satisfy their requirements:
\\n\\n\\n\\n```yaml\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: with-node-affinity\\nspec:\\n  affinity:\\n    nodeAffinity:\\n      requiredDuringSchedulingIgnoredDuringExecution:\\n        nodeSelectorTerms:\\n          - matchExpressions:\\n              - key: disktype\\n                operator: In\\n                values:\\n                  - ssd\\n  containers:\\n    - name: nginx-container\\n      image: nginx\\n```
\\n\\n\\n\\nTaints and tolerations work together to keep pods off unsuitable nodes. First, taint the node:
\\n\\n\\n\\nkubectl taint nodes node1 key=value:NoSchedule
\\n\\n\\n\\nThen add a matching toleration to the pods that are allowed to run there:
\\n\\n\\n\\n```yaml\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: my-pod\\nspec:\\n  containers:\\n    - name: my-container\\n      image: my-image\\n  tolerations:\\n    - key: \\"key\\"\\n      operator: \\"Equal\\"\\n      value: \\"value\\"\\n      effect: \\"NoSchedule\\"\\n```
\\n\\n\\n\\nIn Kubernetes, a StatefulSet ensures that each Pod gets a unique and sticky identity, which is important in maintaining the state for applications like databases. When it comes to Pod termination, Kubernetes handles it a bit differently in StatefulSets compared to other Pod controllers like Deployments or ReplicaSets.
\\n\\n\\n\\nHere is how Kubernetes handles pod termination in StatefulSets:
\\n\\n\\n\\nIdentification of common issues related to pod termination:
\\n\\n\\n\\nPractical troubleshooting tips and solutions:
\\n\\n\\n\\nA solid understanding of pod termination in Kubernetes keeps applications running smoothly as workloads grow. By mastering Kubernetes tools, configuring resources well, and using best practices, you can create a resilient environment. Solutions like Facets enhance these efforts, automating terminations, managing resources, and meeting scaling needs with ease.
\\n\\n\\n\\nUltimately, a thoughtful approach to Kubernetes pod termination not only improves the stability and scalability of your infrastructure but also empowers teams to deliver higher-quality services faster, with minimal disruption to end-users.
\\n\\n\\n\\n\\n\\nMember post originally published on the InfraCloud blog by Aman Juneja, Principal Solutions Engineer at InfraCloud Technologies
\\n\\n\\n\\nIn recent years, we’ve witnessed two recurring trends: the release of increasingly powerful GPUs and the introduction of Large Language Models (LLMs) with billions or trillions of parameters and expansive context windows. Many businesses are leveraging these LLMs by fine-tuning them or building out apps with domain-specific knowledge using RAG and deploying them on dedicated GPU servers. Now, when it comes to deploying these models on GPUs, one thing to notice is the model size: the space required to load the model into GPU memory (for storing the parameters and context tokens) is often far larger than the memory available on a single GPU.
\\n\\n\\n\\nThere are methods to reduce model sizes by using optimization techniques like quantization, pruning, distillation, compression, etc. But as the comparison below between the memory of the latest GPUs and the space requirements of 70B models (FP16 quantized) shows, it’s almost impossible to handle multiple requests at a time, and on some GPUs the model will not even fit in memory.
\\n\\n\\n\\nGPU | FP16 (TFLOPS) with sparsity | GPU Memory (GB) |
---|---|---|
B200 | 4500 | 192 |
B100 | 3500 | 192 |
H200 | 1979 | 141 |
H100 | 1979 | 80 |
L4 | 242 | 24 |
L40S | 733 | 48 |
L40 | 362 | 48 |
A100 | 624 | 80 |
This is all with FP16 quantization already applied, which incurs some loss of precision (usually acceptable in many generic use cases).
\\n\\n\\n\\nModels | Memory for Parameters in GB (FP16) |
---|---|
llama3-8B | 16 |
llama3-70B | 140 |
llama-2-13B | 26 |
llama2-70B | 140 |
mistral-7B | 14 |
This brings us to the context of this blog post: how do enterprises run LLMs with billions or trillions of parameters on these modern datacenter GPUs? Are there ways to split these models into smaller pieces and run only what is required at the moment, or can we distribute parts of a model across different GPUs? I will try to answer these questions in this post, covering the current set of methods for inference parallelization, and will also highlight some of the tools/libraries that support these methods.
\\n\\n\\n\\nInference parallelism aims to distribute the computational workload of AI models, particularly deep learning models, across multiple processing units such as GPUs. This distribution allows for faster processing, reduced latency, and the ability to handle models that exceed the memory capacity of a single device.
\\n\\n\\n\\nFour primary methods have been developed to achieve inference parallelism, each with its strengths and applications:
\\n\\n\\n\\nIn data parallelism, we deploy multiple copies of models on different GPUs or GPU clusters. Each copy of the model independently processes the user request. In a simple analogy, this is like having multiple replicas of 1 microservice.
\\n\\n\\n\\nNow, a common question one might have is how it solves the problem of model size fitting into GPU memory, which we discussed at the start, and the short answer is that it doesn’t. This method is only recommended for smaller models that can fit into the GPU memory. In those cases, we can use multiple copies of the model deployed on different GPU instances and distribute the requests to different instances hence providing enough GPU resources for each request and also increasing the availability of the service. This will also increase the overall request throughput for the system as you have more instances to handle the traffic now.
\\n\\n\\n\\nIn tensor parallelism, we split each layer of the model across different GPUs. A single user request will be shared across multiple GPUs and the result of each request’s GPU computations will be recombined over a GPU-to-GPU network.
\\n\\n\\n\\nTo understand it better, as the name suggests, we split the tensors into chunks along a particular dimension such that each device only holds 1/N chunk of the tensor. Computation is performed using this partial chunk to get partial output. These partial outputs are collected from all devices and then combined.
\\n\\n\\n\\nAs you might have noticed already, the bottleneck to the performance of tensor parallelism is the speed of the network between GPU-to-GPU. As each request will be computed across different GPUs and then combined, we need a high-performance network to ensure low latency numbers.
\\n\\n\\n\\nIn pipeline parallelism, we distribute a group of model layers across different GPUs. Layer-based partitioning is the fundamental approach in pipeline parallelism. The model’s layers are grouped into continuous blocks, forming stages. This partitioning is typically done vertically through the network’s architecture. Computational balance is a key consideration. Ideally, each stage should have an approximately equal computational load to prevent bottlenecks. This often involves grouping layers of varying complexities to achieve balance. Memory usage optimization is another critical factor. Stages are designed to fit within the memory constraints of individual devices while maximizing utilization. Communication overhead minimization is also important. The partitioning aims to reduce the amount of data transferred between stages, as inter-device communication can be a significant performance bottleneck.
\\n\\n\\n\\nSo, for example, if you are deploying the LLaMA3-8B model, which has 32 layers, on a 4-GPU instance, you can split the model and place 8 layers on each GPU. The processing of requests happens in a sequential manner, where the computation starts on one GPU and continues to the next GPU via point-to-point communication.
\\n\\n\\n\\nAgain, as multiple GPU instances are involved, the networking can become a huge bottleneck if we do not have high-speed network communication between the GPUs. This parallelism can increase GPU throughput, since every request needs fewer resources from each GPU and those resources should be easily available, but it will end up increasing the overall latency because requests are processed sequentially, and a delay in any GPU computation or network component will cause an overall surge in latency.
\\n\\n\\n\\nExpert parallelism, often implemented as a Mixture of Experts (MoE), is a technique that allows for the efficient use of large models during inference. It doesn’t solve the problem of fitting the models into GPU memory, but it provides an option to have a broad-capability model serve requests based on the request context. In this technique, the model is divided into multiple expert sub-networks. Each expert is typically a neural network trained to handle specific types of inputs or subtasks within the broader problem domain. A gating network determines which expert to use for each input, and only a subset of experts is activated for any given input.
\\n\\n\\n\\nDifferent experts can be distributed across different GPUs. The router/gating network and the active experts can operate in parallel, while inactive experts don’t consume computational resources. This greatly reduces the number of parameters that each request must interact with, as some experts are skipped. But like tensor and pipeline parallelism, the overall request latency relies heavily on the GPU-to-GPU communication network: a request must be reconstituted back on its original GPU after expert processing, generating high networking traffic over the GPU-to-GPU interconnect fabric.
\\n\\n\\n\\nThis approach can lead to better utilization of hardware compared to Tensor parallelism as you don’t have to split the operations into smaller chunks.
\\n\\n\\n\\nFollowing is a summary and comparison of the methods we discussed. You can use it as a reference when planning to choose one for your use case.
\\n\\n\\n\\nAspect | Data Parallelism | Tensor Parallelism | Pipeline Parallelism | Expert Parallelism |
---|---|---|---|---|
Basic Concept | Splits input data across multiple devices | Splits individual tensors/layers across devices | Splits model into sequential stages across devices | Splits model into multiple expert sub-networks |
How it Works | The same model is replicated on each device, processing different data chunks | Single layer/operation distributed across multiple devices | Different parts of the model pipeline on different devices | The router selects specific experts for each input |
Parallelization Unit | Batch of inputs | Individual tensors/layers | Model stages | Experts (sub-networks) |
Scalability | Scales well with batch size | Scales well for very large models | Scales well for deep models | Scales well for wide models |
Memory Efficiency | Low (full model on each device) | High (only part of each layer on each device) | High (only part of the model on each device) | Medium to High (experts distributed across devices) |
Communication Overhead | Low | Medium to High | Low (only between adjacent stages) | Medium (router communication and expert selection) |
Load Balancing | Generally balanced if data is evenly distributed | Balanced within operations | Can be challenging, and requires careful stage design | Can be challenging, and requires effective routing |
Latency | Low for large batches | Can increase for small batches | Higher due to pipeline depth | Can be low if routing is efficient |
Throughput | High for large batches | Can be high for large models | High, especially for deep models | Can be very high for diverse inputs |
Typical Use Cases | Large batch inference, embarrassingly parallel tasks | Very large models that don’t fit on a single device | Deep models with sequential dependencies | Models with diverse sub-tasks or specializations |
Challenges | Limited by batch size, high memory usage | Complex implementation, potential communication bottlenecks | Pipeline bubble, difficulty in optimal stage partitioning | Load balancing, routing overhead, training instability |
Adaptability to Input Size | Highly adaptable | Less adaptable, fixed tensor partitioning | Less adaptable, fixed pipeline | Highly adaptable, different experts for different inputs |
Suitable Model Types | Most model types | Transformer-based models, very large neural networks | Deep sequential models | Multi-task models, language models with diverse knowledge |
Supported Inference Backends | TensorRT-LLM, vLLM, TGI | TensorRT-LLM, vLLM, TGI | TensorRT-LLM, vLLM, TGI | TensorRT-LLM, vLLM, TGI |
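\\n\\n\\n\\nTo give a sense of how these choices surface in practice, here is a rough sketch of a Kubernetes Deployment that serves a model with vLLM using tensor parallelism across 4 GPUs per replica (and data parallelism via two replicas). The image tag, model name, and resource values are illustrative assumptions, and the same flags can be passed to vLLM outside Kubernetes as well:
\\n\\n\\n\\n```yaml\\n# Illustrative only: image tag, model, and sizes are assumptions.\\napiVersion: apps/v1\\nkind: Deployment\\nmetadata:\\n  name: llm-inference\\nspec:\\n  replicas: 2                          # data parallelism: two identical model replicas\\n  selector:\\n    matchLabels:\\n      app: llm-inference\\n  template:\\n    metadata:\\n      labels:\\n        app: llm-inference\\n    spec:\\n      containers:\\n        - name: vllm\\n          image: vllm/vllm-openai:latest\\n          args:\\n            - --model=meta-llama/Meta-Llama-3-8B-Instruct\\n            - --tensor-parallel-size=4   # shard each layer across 4 GPUs on the node\\n          resources:\\n            limits:\\n              nvidia.com/gpu: 4          # one GPU per tensor-parallel rank\\n```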
By now you might be thinking: if each of the above parallelism methods on its own reduces the overall consumption or utilization of a GPU, can we not combine or replicate them to increase overall GPU throughput? Combining inference parallelism methods can lead to more efficient and scalable systems, especially for large and complex models.
\\n\\n\\n\\nIn the table below, you can see 4 possible options but in actual scenarios based on the number of GPUs you have, this combination can grow to a very large number.
\\n\\n\\n\\nData Parallelism (DP) + Pipeline Parallelism (PP) | Tensor Parallelism (TP) + Pipeline Parallelism (PP) | Expert Parallelism (EP) + Data Parallelism (DP) | Tensor Parallelism (TP) + Expert Parallelism (EP) |
---|---|---|---|
Split the model into stages (pipeline parallelism) | Divide the model into stages (pipeline parallelism) | Distribute experts across devices (expert parallelism) | Split large expert models across devices (tensor parallelism) |
Replicate each stage across multiple devices (data parallelism) | Split large tensors within each stage across devices (tensor parallelism) | Process multiple inputs in parallel (data parallelism) | Distribute experts across devices (expert parallelism) |
So, for example, let’s say you have 64 GPUs available and you are planning to deploy the llama3-8b or Mistral 8*7B model on them. The following are some of the possible combinations of parallelism methods. These are just examples to illustrate the combination strategies; for your actual use case, you need to consider and benchmark other options as well.
\\n\\n\\n\\nLLaMA 3-8B (64 GPUs)
\\n\\n\\n\\nStrategy | GPU Allocation | Pros | Cons |
---|---|---|---|
PP8TP8 | 64 = 8 (pipeline) × 8 (tensor) | – Balanced distribution – Reduced Communication | – Pipeline bubbles – Increased latency |
PP4TP4DP4 | 64 = 4 (pipeline) × 4 (tensor) × 4 (data) | – Higher throughput – Flexible for batch sizes | – Complex integration – Requires large batches |
DP8TP8 | 64 = 8 (data) × 8 (tensor) | – Higher throughput – No pipeline bubbles | – Large batch size needed – High memory per GPU |
Mistral 8*7B (64 GPUs)
\\n\\n\\n\\nStrategy | GPU Allocation | Pros | Cons |
---|---|---|---|
EP8TP8 | 64 = 8 (experts) × 8 (tensor per expert) | – Balanced memory distribution – Efficient memory use | – Complex load balancing – Potential compute underutilization |
EP8TP4DP2 | 64 = 8 (experts) × 4 (tensor) × 2 (data) | – Higher throughput – Balanced utilization | – Needs careful load balancing – Large batch sizes required |
EP8PP2TP4 | 64 = 8 (experts) × 2 (pipeline) × 4 (tensor) | – Supports deeper experts – Flexible scaling | – Increased latency – Complex synchronization |
Now we have covered 4 methods of Inference parallelism, and then we have multiple combinations of these methods, so a common question you might be having is how do you choose or identify which method to use. So, the choice of inference parallelism methods broadly depends on the following factors:
\\n\\n\\n\\nModel architecture plays a crucial role in determining the most effective inference parallelism strategy. You need to identify your model architecture and then choose the parallelism method or combination of them that fits well. Different model structures fit to different parallelization techniques:
\\n\\n\\n\\nYou need to be familiar with your business requirements or use case, i.e., what matters more to the business: the latency of user requests or the utilization of the GPU. It also helps to identify the mode of your application: is it a real-time application where response time is the main driver of the user experience, or an offline system where response time is not the primary concern?
\\n\\n\\n\\nThe reason we need to be aware of this is the tradeoff between latency and GPU throughput. If you want to reduce latency for user requests, you need to allocate more GPU resources to each request, and your choice of parallelism method should reflect that. You can also batch optimally so that requests are not competing to acquire GPU resources, which would increase the overall latency.
\\n\\n\\n\\nHowever, if latency is not the consideration, then your goal should be to achieve maximum throughput on the GPU by choosing the right parallelism method and appropriate batch sizes that utilize GPU resources to its maximum throughput.
\\n\\n\\n\\nUse Case | Preferred Parallelism | Latency vs Throughput Consideration |
---|---|---|
Real-time chatbots | Data parallelism or Tensor parallelism | Low latency priority; moderate throughput |
Batch text processing | Pipeline parallelism | High throughput priority; latency less critical |
Content recommendation | Expert parallelism | High throughput for diverse inputs; moderate latency |
Sentiment analysis | Data parallelism | High throughput; latency less critical for bulk processing |
Voice assistants | Tensor parallelism or Pipeline parallelism | Very low latency priority; moderate throughput |
Hardware configuration often dictates the feasible parallelism strategies, and the choice should be optimized for the specific inference workload and model architecture. Following are some hardware component choices that impact the overall parallelism choices.
\\n\\n\\n\\nHardware Component | Impact on Parallelism Choice | Examples |
---|---|---|
GPU Memory Capacity | Determines feasibility of data parallelism and influences the degree of model sharding | NVIDIA A100 (80GB) allows larger model chunks than A100 (40GB) |
Number of GPUs | Affects the degree of parallelism possible across all strategies | 8 GPUs enable more parallelism than 4 GPUs |
GPU Interconnect Bandwidth | Influences efficiency of tensor and pipeline parallelism | NVLink offers higher bandwidth than PCIe, benefiting tensor parallelism |
CPU Capabilities | Impacts data preprocessing and postprocessing in parallelism strategies | High-core count CPUs can better handle data parallelism overhead |
System Memory | Affects the ability to hold large datasets for data parallelism | 1TB system RAM allows larger batch sizes than 256GB |
Storage Speed | Influences data loading speeds in data parallelism | NVMe SSDs provide faster data loading than SATA SSDs |
Network Bandwidth | Critical for distributed inference across multiple nodes | Networks based on Infiniband, RoCE are faster than conventional networks for GPU fabrics |
Specialized Hardware | Enables specific optimizations | Google TPUs are optimized for tensor operations |
AI inference parallelism is a game-changer for running big AI models efficiently. We’ve looked at different ways to split up the work, like data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism. Each method has its own pros and cons, and choosing the right one depends on your specific needs and setup. It’s exciting to see tools like TensorRT-LLM, vLLM, and Hugging Face’s Text Generation Inference making these advanced techniques easier for more people to use. As AI models keep getting bigger and more complex, knowing how to use these parallelism techniques will be super important. They’re not just about handling bigger models – they’re about running AI smarter and more efficiently. By using these methods well, we can do amazing things with AI, making it faster, cheaper, and more powerful for all kinds of uses.
\\n\\n\\n\\nThe future of AI isn’t just about bigger models; it’s about finding clever ways to use them in the real world. With these parallelism techniques, we’re opening doors to AI applications that were once thought impossible. If you’re looking for experts who can help you scale or build your AI infrastructure, reach out to our AI & GPU Cloud experts.
\\n\\n\\n\\nIf you found this post valuable and informative, subscribe to our weekly newsletter for more posts like this. I’d love to hear your thoughts on this post, so do start a conversation on LinkedIn.
\\n\\n\\n\\nMember post originally published on KubeBlocks by Yuxing Liu
\\n\\n\\n\\nAs a popular short-form video application, Kuaishou relies heavily on Redis to deliver low-latency responses to its users. Operating on private cloud infrastructure, Kuaishou faces a significant challenge: automating the management of large-scale Redis clusters with minimal human intervention. A promising solution emerged: running Redis on Kubernetes using an Operator.
\\n\\n\\n\\nWhile containerizing stateless services like applications and Nginx is now standard, running stateful services like databases and Redis on Kubernetes remains debated. Based on Kuaishou’s experience transforming Redis from physical machines to a cloud-native solution, this blog explores solutions and key considerations for managing stateful services on Kubernetes with the KubeBlocks Operator.
\\n\\n\\n\\nAs technology evolves, Kuaishou’s infrastructure is transitioning toward a cloud-native technology stack. The infrastructure team delivers containers and Kubernetes to application and PaaS systems. While stateless services at Kuaishou have almost fully adopted Kubernetes, the path toward cloud-native stateful services presents several challenges.
\\n\\n\\n\\nTaking Redis as an example, it is one of the most widely used stateful services at Kuaishou, characterized by its massive scale. Even small cost savings at this scale can deliver substantial financial benefits to the company. In its long-term planning, Kuaishou recognizes the significant potential of running Redis on Kubernetes, particularly in terms of cost optimization through improved resource utilization. This article shares insights from Kuaishou’s experience migrating Redis to Kubernetes, covering solutions, challenges encountered, and the corresponding strategies to address them.
\\n\\n\\n\\nTo meet the need for flexible shard management and support for hotspot migration and isolation, Kuaishou adopts a horizontally sharded, master-slave high-availability Redis architecture consisting of three components: Server, Sentinel, and Proxy.
\\n\\n\\n\\nFirst, Redis Pod Management Requires a Layered Approach
\\n\\n\\n\\nRedis Pod management needs to be handled in two layers: the first layer manages multiple shards, while the second layer manages multiple replicas within a single shard. It must support the dynamic scaling of the number of shards and the number of replicas per shard to adapt to varying workloads and usage scenarios.
\\n\\n\\n\\nThis means that, in Operator’s implementation, a workload (such as a StatefulSet) is used to manage multiple replicas within each shard. On top of this, an additional layer (some CRD object) should be constructed to enable management of multiple shards within the entire Redis cluster.
\\n\\n\\n\\nSecond, Ensuring Data Consistency and Reliability During Failures and Day-2 Operations
\\n\\n\\n\\nDuring shard or replica lifecycle changes, data consistency and reliability must be ensured. For example, shard scaling requires data rebalancing, while instance scaling within a shard may require data backup and restoration.
\\n\\n\\n\\nThus, the Operator must support lifecycle hooks at both the shard and replica levels, enabling custom data management operations at different lifecycle stages.
\\n\\n\\n\\nThird, Topology Awareness for Service Discovery and Canary Releases
\\n\\n\\n\\nThe topology among multiple Redis Pods within a shard may dynamically change due to events like high-availability failovers, upgrades, or scaling operations. Service discovery and features like canary releases rely on the real-time topology.
\\n\\n\\n\\nTo achieve this, the Operator must support dynamic topology awareness by introducing role detection and role labeling capabilities. This enables service discovery and canary releases based on the dynamic topology.
\\n\\n\\n\\nThese requirements go beyond the capabilities of any existing open-source Redis Operator and would typically require developing a highly complex Kubernetes Operator to fulfill them. However, building a stable Operator with well-designed APIs from scratch is daunting for most platform teams, as it demands expertise in both Kubernetes and databases, along with extensive real-world testing.
\\n\\n\\n\\nAfter evaluating several solutions, KubeBlocks caught our attention as an open-source Kubernetes database Operator. What makes KubeBlocks unique is its extensibility, offering an Addon mechanism that allows you to use its API to describe the Day-1 and Day-2 characteristics and behaviors of a database, enabling its full lifecycle management on Kubernetes. As stated on its website, KubeBlocks’ vision is to “Run any database on Kubernetes.” This flexibility enables us to customize the KubeBlocks Redis Addon to fit our in-house Redis cluster deployment architecture.
\\n\\n\\n\\nKubeBlocks’ API design also aligns well with our requirements for managing Redis clusters:
\\n\\n\\n\\n1. InstanceSet: A More Powerful Workload Than StatefulSet
\\n\\n\\n\\nInstanceSet is a workload used within KubeBlocks to replace StatefulSet, designed specifically for managing database Pods. Like StatefulSet, InstanceSet supports managing multiple Pods (referred to as Instances). The key difference is that InstanceSet can track the Role of each database Pod (e.g., primary, secondary). For different databases (as KubeBlocks supports multiple types), KubeBlocks allows customization of Pod roles, role detection methods, and the upgrade order based on roles during canary upgrades. The InstanceSet controller dynamically detects role changes during runtime and updates the role information as labels in the Pod metadata, enabling role-based Service selector.
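\\n\\n\\n\\nFor example, once the controller keeps a role label up to date on each Pod, a plain Kubernetes Service can route write traffic to whichever Pod currently holds the primary role. The label keys below are assumptions for illustration; the exact keys used by KubeBlocks may differ:
\\n\\n\\n\\napiVersion: v1\\nkind: Service\\nmetadata:\\n  name: redis-demo-primary\\nspec:\\n  selector:\\n    app.kubernetes.io/instance: redis-demo # assumed instance label\\n    kubeblocks.io/role: primary # assumed role label maintained by the controller\\n  ports:\\n    - name: redis\\n      port: 6379\\n      targetPort: 6379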
\\n\\n\\n\\nStatefulSet assigns each instance a globally ordered, incrementing identifier. This mechanism provides stable network and storage identities, with the topology within the cluster relying on these identifiers. However, as the topology dynamically changes during runtime, the fixed identifiers provided by StatefulSet may fall short of meeting requirements. For example, StatefulSet identifiers cannot have gaps, and deleting an intermediate identifier is not allowed.
\\n\\n\\n\\nKuaishou’s platform team has contributed several PRs to the KubeBlocks community, including enhancements such as allowing Pods within the same InstanceSet to have different configurations, decommissioning Pods with specific ordinals (without first decommissioning Pods with higher ordinals), and controlling upgrade concurrency. These improvements make InstanceSet more adaptable to Kuaishou’s requirements for managing large-scale Redis clusters in production environments.
\\n\\n\\n\\n2. Layered CRD & Controller Design: Component, Cluster Objects
\\n\\n\\n\\nKubeBlocks leverages a multi-layered CRD structure—Component, Cluster—to manage the complex topology of database clusters. This design aligns seamlessly with Kuaishou’s Redis cluster deployment architecture:
\\n\\n\\n\\n⛱️ Shard: A specialized Component that defines the sharding behavior of horizontally scalable databases. Each Shard shares the same configuration. In Kuaishou’s Redis Cluster, for example, each Shard (Component) consists of a primary Pod and a replica Pod. Scaling out adds a new Shard (Component), while scaling in removes one, enabling shard-level scaling and lifecycle management.
\\n\\n\\n\\nThis hierarchical design simplifies scaling, enhances lifecycle management, and provides the flexibility needed to support complex Redis deployment architecture in production.
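\\n\\n\\n\\nAs a rough illustration of this topology (the field names below only approximate the KubeBlocks Cluster API and may differ between versions), a cluster object can declare the proxy, sentinel, and server components, with the server component carrying the shard definition:
\\n\\n\\n\\napiVersion: apps.kubeblocks.io/v1alpha1 # version and field names are approximate\\nkind: Cluster\\nmetadata:\\n  name: redis-demo\\nspec:\\n  componentSpecs:\\n    - name: redis-proxy\\n      componentDef: redis-proxy\\n      replicas: 2\\n    - name: redis-sentinel\\n      componentDef: redis-sentinel\\n      replicas: 3\\n    - name: redis-server # in the real API the sharded server is typically declared via a sharding-specific section\\n      componentDef: redis-server\\n      replicas: 2 # one primary plus one replica per shard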
\\n\\n\\n\\nThrough close collaboration with the KubeBlocks community, we implemented the orchestration of a Redis cluster in the following ways:
\\n\\n\\n\\nThere are three Components in a Redis Cluster: redis-server, redis-sentinel, and redis-proxy. Within each Component, Pods are managed using InstanceSet instead of StatefulSet.
At Kuaishou, multiple applications operate in a multi-tenant manner within a single ultra-large-scale Redis cluster. For example, a single cluster may contain over 10,000 Pods, exceeding the capacity of a single Kubernetes cluster. As a result, we had to deploy a Redis cluster across multiple Kubernetes clusters. An important aspect is that we need to hide the complexity of managing multiple clusters from Redis application users.
\\n\\n\\n\\nFortunately, Kuaishou’s Kubernetes infrastructure team provides a mature Kubernetes federation service, offering unified scheduling and a unified view:
\\n\\n\\n\\nSo, the question becomes: how can the KubeBlocks-based Redis cluster management solution be integrated into Kuaishou’s internal federation cluster architecture? Below is the overall architecture:
\\n\\n\\n\\nThe Federation Kubernetes Cluster serves as the central control plane for managing multiple member clusters. It is responsible for cross-cluster orchestration, resource distribution, and lifecycle management of the Redis cluster. Its responsibilities include:
\\n\\n\\n\\nMember K8s Clusters are the individual Kubernetes clusters where Redis Pods (instances) are deployed and managed. Each member cluster is responsible for running a subset of the overall Redis cluster. Its responsibilities include:
\\n\\n\\n\\nSo, we divided the KubeBlocks Operator into two parts and deployed them in different Kubernetes clusters:
\\n\\n\\n\\nOnce again, the layered CRD and Controller design of KubeBlocks is the key to enabling this deployment. If KubeBlocks had a monolithic CRD and Controller managing everything, splitting and deploying it separately in the Federation Kubernetes Cluster and Member Kubernetes Clusters would not have been possible.
\\n\\n\\n\\nThere may be multiple Member Kubernetes Clusters, requiring the InstanceSet in the Federation Kubernetes Cluster to be partitioned into multiple InstanceSets, with one InstanceSet assigned to each Member Cluster. Additionally, the Instances (Pods) managed by the original InstanceSet need to be distributed across the new InstanceSets in the Member Clusters.
\\n\\n\\n\\nTo handle this, Kuaishou developed the Fed-InstanceSet Controller to manage interactions between the Federation Cluster and its Member Clusters. Its key responsibilities include:
\\n\\n\\n\\nTo manage instance partitioning and ensure global uniqueness and proper ordering of Redis Instances in member Clusters, Kuaishou contributed a PR to the KubeBlocks community, adding an Ordinals field to InstanceSet. This allows precise index assignment to instances.
\\n\\n\\n\\nThe Fed-InstanceSet Controller uses this field to assign unique index ranges to each Member Cluster, ensuring instance uniqueness and correct ordering across clusters.
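\\n\\n\\n\\nA hedged sketch of how this plays out (the InstanceSet schema is simplified here; consult the KubeBlocks API for the exact shape): the federation layer carves the global index space into non-overlapping ranges, one per member cluster, so instance names stay globally unique and correctly ordered:
\\n\\n\\n\\n# Member cluster A is assigned ordinals 0-2, member cluster B gets 3-5 (illustrative only)\\napiVersion: workloads.kubeblocks.io/v1alpha1 # approximate API group and version\\nkind: InstanceSet\\nmetadata:\\n  name: redis-shard-0\\nspec:\\n  replicas: 3\\n  ordinals: # assumed field shape introduced by the contributed PR\\n    ranges:\\n      - start: 0\\n        end: 2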
\\n\\n\\n\\nIn our view, running stateful services on Kubernetes comes with notable benefits:
\\n\\n\\n\\nAlthough running stateful services on Kubernetes offers significant benefits, the potential risks must be carefully evaluated, especially for services like databases and Redis, which are business-critical and demand a high level of stability. The challenges include:
\\n\\n\\n\\nThe following sections explore these risks in more detail.
\\n\\n\\n\\nContainerizing Redis within a cloud-native architecture introduces an additional abstraction layer compared to traditional host-based deployments. However, industry benchmarks and Kuaishou’s internal testing show that performance differences are generally within 10%, which is often negligible in most use cases. While this variance is typically acceptable, organizations are advised to conduct their own performance testing to ensure the solution meets the specific needs of their workloads.
\\n\\n\\n\\nMigrating stateful services to Kubernetes has greatly improved operational efficiency through automation. However, this also made the execution processes more opaque, with even small configuration changes potentially impacting many instances. To mitigate the stability risks from unexpected scenarios — such as pod evictions, human error, or Operator bugs— Kuaishou utilizes the Admission Webhook mechanism within the Kubernetes API server to intercept and validate change requests. This approach allows Kuaishou to directly reject any unauthorized operations. Given the multi-cluster Kubernetes setup across multiple availability zones (AZs), it’s critical to ensure change control across clusters. To achieve this, Kuaishou developed an internal risk mitigation system called kube-shield.
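\\n\\n\\n\\nAs an illustration of that interception point, a guard such as kube-shield can be registered through a standard ValidatingWebhookConfiguration; the resource names, namespace, and path below are hypothetical, and only the overall shape is stock Kubernetes:
\\n\\n\\n\\napiVersion: admissionregistration.k8s.io/v1\\nkind: ValidatingWebhookConfiguration\\nmetadata:\\n  name: kube-shield-redis-guard # hypothetical name\\nwebhooks:\\n  - name: redis.guard.kube-shield.example.com\\n    admissionReviewVersions: [v1]\\n    sideEffects: None\\n    failurePolicy: Fail # reject changes if the guard cannot be reached\\n    rules:\\n      - apiGroups: [apps, apps.kubeblocks.io] # workloads plus the Operator's custom resources (illustrative)\\n        apiVersions: [v1, v1alpha1]\\n        operations: [UPDATE, DELETE]\\n        resources: [statefulsets, clusters, components]\\n    clientConfig:\\n      service:\\n        name: kube-shield # hypothetical in-cluster validation service\\n        namespace: kube-shield-system\\n        path: /validate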
\\n\\n\\n\\nAdditionally, it’s worth mentioning that Kuaishou has further enhanced availability and stability by improving support for fine-grained scheduling distribution and introducing load balancing features based on resource utilization.
\\n\\n\\n\\nMigrating from a host-based system to a Kubernetes-based environment, while ensuring ongoing maintenance, requires deep expertise in both Redis and K8s technologies. Relying solely on the Redis team or the K8s team for independent support would be challenging. Proper division of responsibilities not only enhances productivity but also allows each team to fully leverage their expertise in their respective domains.
\\n\\n\\n\\nFor example, in Kuaishou’s cloud-native Redis solution:
\\n\\n\\n\\nCloud-native transformation for stateful services is a complex journey requiring careful evaluation of its pros and cons, and one filled with challenges. However, for Kuaishou, its value is self-evident. Starting with Redis, Kuaishou has worked closely with the KubeBlocks community to implement a cost-effective, cloud-native solution.
\\n\\n\\n\\nLooking forward, Kuaishou aims to build upon this experience to drive the cloud-native transformation of more stateful services, such as databases and middleware, thus reaping dual benefits in technology and cost efficiency.
\\n\\n\\n\\nAt KubeCon Hong Kong in August, Kuaishou and the KubeBlocks team delivered a joint presentation. If you’re interested, you can revisit the talk for further insights.
\\n\\n\\n\\nAbout the Author: Yuxing Liu is a senior software engineer at Kuaishou. Yuxing has worked in the cloud-native teams of Alibaba Cloud and Kuaishou, focusing on the cloud-native field and gaining experience in open source, commercialization, and scaling of cloud-native technologies. He is a maintainer of both the CNCF/Dragonfly and CNCF/Sealer projects. Currently, he focuses on driving the cloud-native transformation of stateful workloads at Kuaishou.
\\n\\n\\n\\nAbout Kuaishou: Kuaishou is a leading content community and social platform in China and globally, committed to becoming the most customer-obsessed company in the world. Kuaishou uses its technological backbone, powered by cutting-edge AI technology, to continuously drive innovation and product enhancements that enrich its service offerings and application scenarios, creating exceptional customer value. Through short videos and live streams on Kuaishou’s platform, users can share their lives, discover goods and services they need and showcase their talent. By partnering closely with content creators and businesses, Kuaishou provides technologies, products, and services that cater to diverse user needs across a broad spectrum of entertainment, online marketing services, e-commerce, local services, gaming, and much more.
\\n\\nGet to know David
\\n\\n\\n\\nThis week’s Kubestronaut in Orbit, David Mukuzi, is a DevOps Engineer in Nairobi, Kenya. David is driven by a deep-rooted enthusiasm for continuous learning and exploration of emerging technologies. He enjoys working with collaborative teams to build reliable, high-performing, and accessible solutions. David is focused on understanding and addressing customer challenges, combining innovation with a practical approach to problem solving.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like David, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nWhen did you get started with Kubernetes and/or cloud-native? What was your first project?
\\n\\n\\n\\nI started working with Kubernetes in 2018 – the company I worked for was running Kubernetes workloads both on bare metal and in the cloud.
\\n\\n\\n\\nWhat are the primary CNCF projects you work on or use today? What projects have you enjoyed the most in your career?
\\n\\n\\n\\nI get to interact with a majority of the Graduated CNCF projects daily; I’ve enjoyed Kubernetes and CoreDNS the most.
\\n\\n\\n\\nHow have the certs or CNCF helped you in your career?
\\n\\n\\n\\nThe practical, hands-on aspects of the certifications helped me put the knowledge into practice, and contributing to different projects has provided additional practice and learning.
\\n\\n\\n\\nWhat are some other books/sites/courses you recommend for people who want to work with k8s?
\\n\\n\\n\\nNetworking and Kubernetes by James Strong and Vallery Lancey was helpful.
\\n\\n\\n\\nWhat do you do in your free time?
\\n\\n\\n\\nI enjoy cooking and working out.
\\n\\n\\n\\nWhat would you tell someone who is just starting their K8s certification journey? Any tips or tricks?
\\n\\n\\n\\nPractice breaking things and get hands-on experience.
\\n\\n\\n\\nToday the cloud native ecosystem is way more than Kubernetes. Do you plan to get other cloud native certifications from the CNCF?
\\n\\n\\n\\nNow that I have all the Kubernetes certs I’m planning to take the Prometheus Certified Associate (PCA) and the Istio Certified Associate (ICA).
\\n\\nBlog post originally published on the Middleware blog by Sri Krishna
\\n\\n\\n\\nIn the high-stakes environment of Black Friday, e-commerce platforms encounter intense traffic surges that can heavily strain system performance. For example, during Black Friday 2023, online sales soared to $9.8 billion, a 7.5% increase from the previous year, highlighting the substantial pressure placed on digital infrastructures.
\\n\\n\\n\\nDespite these gains, some retailers experienced website outages, underscoring the critical need for reliable platform engineering practices that prioritize valuable feedback from internal customers.
\\n\\n\\n\\nA key strategy to mitigate such risks is integrating observability into platform engineering. Observability offers real-time insights into system behavior, allowing teams to proactively identify and address issues before they affect users. By adopting observability, platform engineering teams can improve system resilience, sustain uninterrupted user experiences during peak events, and uphold operational stability.
\\n\\n\\n\\nThis article examines how observability elevates platform engineering by tackling complex challenges, refining workflows, and fortifying system reliability.
\\n\\n\\n\\nPlatform engineering is about creating a stable, scalable foundation that meets the needs of development and operations teams. Rather than just managing infrastructure, it involves building shared tools, environments, and workflows to improve collaboration and minimize operational friction for development teams. By providing a standardized platform, platform engineering enables faster, consistent application deployment and allows engineers to focus on development without being weighed down by infrastructure complexities.
\\n\\n\\n\\nRoles within platform engineering, such as release engineers, tooling engineers, and infrastructure architects, work together to ensure smooth deployments, maintain tool efficiency, and design scalable infrastructure, all critical for a cohesive platform engineering strategy.
\\n\\n\\n\\nModern infrastructure is increasingly complex and continuously evolving, posing significant cognitive load and challenges for engineers. This complexity stems from the need for various tools and frameworks, such as Kubernetes for container orchestration, Helm for application deployment, Terraform for infrastructure as code, and specialized monitoring systems. These tools, while powerful, must work in harmony, which requires careful planning and configuration.
\\n\\n\\n\\nPlatform engineering addresses these complexities by establishing a cohesive, scalable foundation, yet it must navigate several critical factors:
\\n\\n\\n\\nFor example, scaling a service during peak demand, such as an e-commerce sale, requires not only a reliable infrastructure but also automation and monitoring to dynamically adjust resources and prevent bottlenecks in real time.
\\n\\n\\n\\nIn platform engineering, effective infrastructure management is key to sustaining a reliable and scalable environment that supports both development and operations. Through efficient deployment, monitoring, and management of infrastructure, platform engineers establish a solid foundation that adapts to changing demands and improves application performance. Additionally, these practices enable developer self-service by providing integrated tools and workflows that empower developers to manage their applications autonomously.
\\n\\n\\n\\nThis involves:
\\n\\n\\n\\nTogether, these infrastructure management practices support platform engineering’s core goal: building a resilient environment that enables teams to deliver applications efficiently and reliably.
\\n\\n\\n\\nWhile platform engineering, DevOps, and Site Reliability Engineering (SRE) all contribute to improving software delivery, each focuses on distinct aspects of the process:
\\n\\n\\n\\nTraditional monitoring focuses on tracking known metrics, setting alert thresholds, and responding to specific issues as they arise. This makes it largely reactive and useful for catching immediate problems like high CPU usage or memory consumption. However, monitoring’s limitations become evident when dealing with the intricate, interdependent systems found in modern infrastructure, where isolated metrics rarely reveal the full picture.
\\n\\n\\n\\nObservability, by contrast, is dynamic and proactive, giving platform engineering teams and software developers a holistic view of system interactions. Instead of flagging individual metrics, observability enables engineers to query and explore data across services, providing insights into relationships and dependencies that monitoring alone might miss. This expanded visibility allows teams to troubleshoot complex issues more effectively, ensuring that all system components work together smoothly and stably.
\\n\\n\\n\\nIn a microservices architecture, where applications are built from many interdependent services, a slowdown or failure in one component can cascade across the system.
\\n\\n\\n\\nFor example, monitoring might highlight general latency in user-facing features, but observability tools can trace the source of the slowdown to a specific service. By examining traces, metrics, and logs, platform engineers can pinpoint precisely where the latency originates, whether it’s a slow database query or an overloaded API.
\\n\\n\\n\\nConsider these use cases:
\\n\\n\\n\\nThrough observability, platform engineering teams can maintain not only a responsive but also a resilient platform. They gain the depth needed to identify, address, and prevent issues, increasing overall system reliability while supporting the smooth operation of critical applications.
\\n\\n\\n\\nObservability relies on three foundational components, often called the “pillars” of observability, which together offer a comprehensive view of system health and performance:
\\n\\n\\n\\nThe platform engineering team plays a crucial role in implementing these three pillars, allowing teams to gain an in-depth view of system operations and to understand both individual components and their interactions within the broader infrastructure.
\\n\\n\\n\\nOne of the most significant advantages of observability is the ability to proactively detect potential issues before they impact users. Unlike traditional monitoring, which often alerts teams after an issue has occurred, observability enables engineers to identify patterns and anomalies early.
\\n\\n\\n\\nBy tracking unusual behaviors or shifts in metrics, logs, or traces, teams can respond to signs of potential failures in real time, addressing issues before they escalate.
\\n\\n\\n\\nThis proactive approach improves system resilience, optimizes workflows, and ultimately helps maintain a smooth user experience by reducing downtime and preventing disruptions. Initiating the platform engineering journey by engaging with engineering teams to identify bottlenecks and developer frustrations is crucial for continuous improvement.
\\n\\n\\n\\nWith systems becoming increasingly distributed, internal platform teams play a crucial role in maintaining a clear overview. Observability provides the necessary visibility to understand how different components interact and where issues may arise.
\\n\\n\\n\\nReducing Mean Time to Detect (MTTD) and Mean Time to Resolution (MTTR) is critical for minimizing downtime and improving user experience. A platform team plays a crucial role in these efforts by making operations easy and improving collaboration among different tech teams. Observability lowers MTTD by continuously monitoring for anomalies, enabling engineers to catch issues as they emerge. Once a problem is detected, observability tools provide detailed, actionable insights that accelerate MTTR. With relevant data readily available, teams can efficiently assess issue severity, identify impacted areas, and implement solutions.
\\n\\n\\n\\nFor more on the benefits of reduced MTTD and MTTR, see MTTR vs MTTD and How to Reduce MTTR.
\\n\\n\\n\\nWhen issues arise, the ability to quickly diagnose and resolve them is crucial. Observability facilitates faster triage by correlating data from metrics, logs, and traces, giving engineers a comprehensive view of what happened, when, and why.
\\n\\n\\n\\nWith these insights, engineers can delve into specific events to identify the root cause, whether it’s a failing API, resource bottleneck, or misconfigured service. This efficient diagnostic approach leads to quicker resolutions and contributes to a more stable and resilient system.
\\n\\n\\n\\nThe platform engineering team binds various tools, services, and APIs into a cohesive internal developer platform, creating well-organized processes that strengthen developer autonomy and efficiency.
\\n\\n\\n\\nEffective observability in platform engineering revolves around three main components: logging, metrics, and tracing. Internal developer platforms (IDPs) play a crucial role in facilitating these components by organizing workflows and providing tools that tame the complexities of software development. Together, these elements provide a holistic view of system performance and health, enabling engineers to monitor, diagnose, and improve infrastructure more effectively.
\\n\\n\\n\\nVarious tools in the industry make implementing observability practical and efficient, often managed by internal platform teams. Some of the most popular tools include:
\\n\\n\\n\\nBy combining these tools, teams can monitor their systems more effectively, gaining the visibility needed to maintain performance and reliability.
\\n\\n\\n\\nImplementing observability effectively involves more than just choosing the right tools. Here are some key practices to ensure a successful observability strategy:
\\n\\n\\n\\nWith these practices, the platform engineering team can build an observability framework that not only monitors systems effectively but also provides a stable and reliable foundation for applications.
\\n\\n\\n\\nObservability provides developers with real-time visibility into system performance, enabling them to diagnose and resolve issues independently. This autonomy lessens reliance on central support and boosts productivity, aligning with platform engineering’s goal of minimizing operational bottlenecks. By tracing issues quickly, developers can make direct improvements, refine workflows, and reduce dependency on operations teams.
\\n\\n\\n\\nServing internal customers, primarily app developers, is crucial in improving self-service capabilities. With access to metrics, logs, and traces, developers can:
\\n\\n\\n\\nFor many organizations, observability is a powerful enabler of developer self-sufficiency. Consider the experience of Trademarkia, a visual search engine for trademarks, which encountered significant hurdles with an outdated tech stack. Transitioning from .NET Core to a microservices-based architecture, the company needed a reliable observability solution to keep pace with its newly distributed infrastructure.
\\n\\n\\n\\nBy implementing Middleware’s observability platform, Trademarkia gained the real-time log monitoring and insight needed to optimize issue detection and resolution. With this observability framework in place, developers could diagnose and resolve issues independently, often within minutes rather than hours. This self-service capability not only accelerated debugging times but also reduced dependency on central support, enabling the team to focus on scaling and improving the platform.
\\n\\n\\n\\nTrademarkia’s move to observability also had a measurable impact: a 20% reduction in time to resolution, improved productivity, and proactive issue detection. This observability-driven approach to platform management allowed Trademarkia to offer users a smoother, more responsive experience, ultimately reinforcing the stability of the platform and freeing engineers to focus on strategic development. The company’s success highlights the importance of initiating the platform engineering journey by engaging with engineering teams to identify bottlenecks and developer frustrations.
\\n\\n\\n\\nRead more about Trademarkia’s observability journey here.
\\n\\n\\n\\nChoosing the right observability strategy is a decisive factor for strengthening platform performance and ensuring alignment with organizational needs. Here are five key strategies with a focus on platform engineers:
\\n\\n\\n\\nWhile observability is crucial, certain practices can hinder its effectiveness. By steering clear of these common pitfalls, platform engineering teams can maintain clarity, reduce operational load, and reinforce platform resilience.
\\n\\n\\n\\nStarting with observability may seem challenging, but by focusing on key services and gradually expanding, teams can see substantial improvements.
\\n\\n\\n\\nAs platform engineering evolves, software engineering organizations will find observability pivotal in maintaining resilient and reliable systems. Emerging trends like AI-driven observability offer promise for even greater insights and operational gains.
\\n\\nMember post originally published on the Devtron blog by Prakarsh
\\n\\n\\n\\nIn the ever-evolving landscape of container orchestration, Kubernetes stands out as a powerful tool for managing and deploying applications at scale. One of Kubernetes’ key features is its extensibility, allowing users to automate complex tasks through custom controllers called Operators. While Kubernetes Operators offer tremendous flexibility and functionality, there are scenarios where their use may be unnecessary or even detrimental to deployment workflows. In this article, we’ll explore the concept of when Kubernetes Operators might be overkill and alternative approaches to streamline deployments effectively.
\\n\\n\\n\\nOut of the box, Kubernetes comes with several different features that make it quite versatile for deploying and scaling applications. However, Kubernetes can fall short in a few areas where it may only provide some basic functionality. In actual use cases, a more advanced implementation of a feature might be required. Kubernetes provides a very powerful functionality called Kubernetes Operators, which helps to extend the functionality of Kubernetes.
\\n\\n\\n\\nKubernetes Operators offer a very powerful solution to enable the automation of complex clusters and systems based on a set of rules and principles. Kubernetes was designed with automation and extensibility in mind, and Kubernetes operators play a huge role in fulfilling this design principle.
\\n\\n\\n\\nKubernetes Operators are custom controllers that extend the Kubernetes API to manage complex applications and services. They encapsulate operational knowledge and automate tasks such as deployment, scaling, and lifecycle management. Kubernetes Operators excel at managing stateful applications, databases, and middleware, providing a declarative way to define and manage application-specific logic.
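\\n\\n\\n\\nTo ground the idea, an Operator typically ships a CustomResourceDefinition so users can declare the desired state of, say, a managed cache in a single object, and its controller then reconciles the cluster toward that state. The RedisCache kind below is purely illustrative:
\\n\\n\\n\\napiVersion: apiextensions.k8s.io/v1\\nkind: CustomResourceDefinition\\nmetadata:\\n  name: rediscaches.example.com # hypothetical CRD installed by a hypothetical operator\\nspec:\\n  group: example.com\\n  scope: Namespaced\\n  names:\\n    kind: RedisCache\\n    plural: rediscaches\\n    singular: rediscache\\n  versions:\\n    - name: v1\\n      served: true\\n      storage: true\\n      schema:\\n        openAPIV3Schema:\\n          type: object\\n          properties:\\n            spec:\\n              type: object\\n              properties:\\n                replicas: # desired replica count the controller reconciles\\n                  type: integer\\n                version: # Redis version to run\\n                  type: string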
\\n\\n\\n\\nSome of the common use cases where Kubernetes Operators are used include:
\\n\\n\\n\\nWhile Kubernetes Operators can help extend the functionality of Kubernetes, there are some potential pitfalls. The power of Kubernetes Operators comes with inherent complexity. Developing and maintaining Operators requires significant effort and expertise. Kubernetes Operators need to be carefully designed, tested, and updated to ensure they function correctly and adapt to changing requirements. Moreover, managing a proliferation of Operators within a Kubernetes environment can lead to operational overhead and complexity, potentially outweighing the benefits they provide.
\\n\\n\\n\\nWhile Kubernetes Operators offer a solution for managing complex applications, there are situations where their use may not be justified; in some cases, adding an Operator might even make the problem worse. Let’s take a look at some of the situations where using a Kubernetes Operator might not be the best solution:
\\n\\n\\n\\n1. Simplicity Requirements: For straightforward deployments or applications with minimal operational complexity, leveraging built-in Kubernetes resources or simpler deployment tools may be more appropriate than developing custom Operators.
\\n\\n\\n\\n2. Resource Constraints: Organizations with limited resources or expertise may find it challenging to develop and maintain custom Kubernetes Operators effectively. In such cases, using off-the-shelf solutions or managed services may prove to be more cost-effective and efficient.
\\n\\n\\n\\n3. Overhead vs. Benefit: Assessing the trade-offs between the benefits of using Kubernetes Operators and the associated overhead is crucial. If the complexity introduced by the Operators outweighs the benefits they provide, alternative deployment approaches should be considered.
\\n\\n\\n\\nWhile Kubernetes Operators offer advanced automation capabilities, there are alternative approaches to streamline deployments without the complexity of custom Operators:
\\n\\n\\n\\n1. Helm Charts: Helm is a package manager for Kubernetes that simplifies the deployment and management of applications. Helm charts provide a templated approach to defining application configurations, making it easy to deploy applications consistently across environments.
\\n\\n\\n\\n2. ArgoCD: ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. It automates the deployment of applications from Git repositories, ensuring that the desired state of applications is always maintained in the cluster.
\\n\\n\\n\\n3. Devtron: Devtron is a Kubernetes-native CI/CD platform that simplifies the deployment and management of applications on Kubernetes. It provides end-to-end automation for building, deploying, and monitoring applications, reducing the need for custom scripting or complex Kubernetes Operator configurations. Additionally, it has native integrations with ArgoCD for GitOps based deployments and Helm as well for Kubernetes deployments.
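\\n\\n\\n\\nTo show how lightweight options 1 and 2 above can be, here are two minimal sketches. First, a Helm chart keeps environment-specific settings in values.yaml and substitutes them into templates, so the same chart deploys consistently everywhere (names and values are illustrative):
\\n\\n\\n\\n# values.yaml (illustrative)\\nreplicaCount: 2\\nimage:\\n  repository: nginx\\n  tag: stable\\n\\n# templates/deployment.yaml (excerpt)\\napiVersion: apps/v1\\nkind: Deployment\\nmetadata:\\n  name: {{ .Release.Name }}-web\\nspec:\\n  replicas: {{ .Values.replicaCount }}\\n  selector:\\n    matchLabels:\\n      app: {{ .Release.Name }}-web\\n  template:\\n    metadata:\\n      labels:\\n        app: {{ .Release.Name }}-web\\n    spec:\\n      containers:\\n        - name: web\\n          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
\\n\\n\\n\\nSecond, a typical GitOps setup with ArgoCD boils down to a single Application resource pointing at a Git path, which ArgoCD then keeps in sync with the cluster (the repository URL and paths are placeholders):
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Application\\nmetadata:\\n  name: web-app # placeholder application name\\n  namespace: argocd\\nspec:\\n  project: default\\n  source:\\n    repoURL: https://github.com/example/web-app-manifests # placeholder repository\\n    targetRevision: main\\n    path: deploy/production\\n  destination:\\n    server: https://kubernetes.default.svc\\n    namespace: web-app\\n  syncPolicy:\\n    automated:\\n      prune: true # delete resources removed from Git\\n      selfHeal: true # revert manual drift in the cluster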
\\n\\n\\n\\n| Features | Operators | Helm | ArgoCD | Devtron |
| --- | --- | --- | --- | --- |
| Ideal for Stateful Application Lifecycle Management | ✅ | ❌ | ❌ | ❌ |
| Stateless Applications | ✅ | ✅ | ✅ | ✅ |
| Full Lifecycle Management | ✅ | ❌ | ❌ | ✅ |
| Supports GitOps | ❌ | ❌ | ✅ | ✅ |
| Built-in Rollback Support | ❌ | ✅ | ✅ | ✅ |
| Customization | ✅ | ❌ | ✅ | ✅ |
| Complex Application Management | ✅ | ❌ | ❌ | ✅ |
| Declarative Approach | ✅ | ✅ | ✅ | ✅ |
| UI Provided | ❌ | ❌ | ✅ | ✅ |
| Security and Policy Management | ❌ | ❌ | ✅ | ✅ |
| Low Learning Curve | ❌ | ✅ | ✅ | ✅ |
| Both CI/CD Pipelines Management | ❌ | ❌ | ❌ | ✅ |
| Suitable for Large Teams | ❌ | ❌ | ✅ | ✅ |
While Kubernetes Operators offer powerful automation capabilities, they are not always the best fit for every deployment scenario. Organizations must carefully evaluate their deployment requirements, operational capabilities, and resource constraints to determine whether using a Kubernetes Operator is justified. By considering alternative approaches such as Helm, ArgoCD, and Devtron, organizations can streamline their Kubernetes deployments effectively without falling into the trap of Operator overkill.
\\n\\n\\n\\nIn conclusion, while Kubernetes Operators have their place in complex deployment scenarios, it’s essential to reconsider their use when they introduce unnecessary complexity or overhead. By taking a pragmatic approach and exploring alternative deployment strategies, organizations can streamline their Kubernetes deployments effectively and achieve their automation goals without unnecessary complexity.
\\n\\n\\n\\nIf you have any queries, don’t hesitate to connect with us. Connect with our growing Discord Community for support, discussions and shared knowledge.
\\n\\nThe Open Source Technology Improvement Fund, Inc (OSTIF) is thrilled to mark another successful year of helping CNCF projects with security audits. Since this partnership began in 2021, a total of 13 projects have graduated following an OSTIF security audit. The CNCF continues to demonstrate a strong commitment to the maturity and growth of projects, investing multiple millions of dollars over the last three years in these engagements.
\\n\\n\\n\\n“CNCF is a prime example of a foundation fostering good security practices and providing value to projects. The foundation sponsors security audits, which has resulted in significant security improvements. OSTIF is grateful for the opportunity to collaborate on these audits.” – Amir Montazery, Managing Director, OSTIF
\\n\\n\\n\\nHave a look at the full report!
\\n\\n\\n\\nCheck out OSTIF’s blog post here.
\\n\\nMember post originally published on Cerbos’s blog by Omu Inetimi
When building a secure application, there are plenty of factors to consider: who is allowed into the application, how users get in, what measures are in place to keep out bad actors, and so on. But one particularly important factor stands out within the walls of any application: authorization.
\\n\\n\\n\\nIn this article, we’ll see what authorization is all about, and explore several key authorization design patterns, how they work, and possible scenarios where they may be implemented.
\\n\\n\\n\\nBy definition, authorization is the control of someone’s access to a resource. It’s the process of checking and deciding whether someone or something is permitted to carry out a certain task or see certain information. It controls what a user can or cannot do within a system.
\\n\\n\\n\\nThink of where you are right now on this website. You can access this particular webpage and read this article. An editor wrote and posted this article thanks to their access rights, but you can only view it; you can’t post. You are free only to read because you don’t have the rights of an editor. That’s authorization at work.
\\n\\n\\n\\nConsider these three components of an authorization mechanism:
\\n\\n\\n\\nAuthorization is continuously working behind the scenes in every secure system, constantly ensuring you can only do what you have access to do—nothing more, nothing less.
\\n\\n\\n\\nThere are a number of authorization paradigms, each with its own strong points and limitations. We’ll go over a few of the most common ones below.
\\n\\n\\n\\nIn RBAC, permissions are assigned based on a user’s role in an organization. Each “role” (e.g. managers, employees, etc.) has specific access rights given to them. The user then inherits the permissions of that role. This is particularly useful in large organizations as it logically models broad business groups.
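\\n\\n\\n\\nTo make this concrete, here is a small role-based policy sketch in the YAML format used by a policy engine such as Cerbos; the resource and role names mirror the editor/reader example above and are purely illustrative:
\\n\\n\\n\\napiVersion: api.cerbos.dev/v1\\nresourcePolicy:\\n  version: default\\n  resource: article\\n  rules:\\n    - actions: [view]\\n      effect: EFFECT_ALLOW\\n      roles: [reader, editor] # readers and editors can view articles\\n    - actions: [create, edit, publish]\\n      effect: EFFECT_ALLOW\\n      roles: [editor] # only editors can write and publish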
\\n\\n\\n\\nABAC is more flexible. It’s considered a more fine-grained approach to authorization, which just means it can handle more complexity by considering more factors. Factors could include user attributes (such as department, role, and clearance level), resource attributes (such as document classification, owner, and creation date), and environmental attributes (such as time of day, location, network, device, and IP). It can combine all these to decide access privileges.
\\n\\n\\n\\nIn this system, the resource owner controls access to their own resources; they get to grant access to, or revoke it from, whomever they please. This is often used in file systems such as those on your personal computer. DAC allows you to set permissions on your files or folders, ultimately letting you decide who can view or edit them.
\\n\\n\\n\\nIn MAC, access is based on a system of classifications and labels governed by a central authority. The system defines access levels (such as top secret, secret, and confidential) that users can only reach if they have the right clearance level. It is a very strict system and is mainly used in environments where data security is critical.
\\n\\n\\n\\nThis is a more recent approach that’s usually used where access decisions are dependent on the interaction between users and resources within the system. It’s especially useful for social networks and team projects (think Facebook and Google Docs).
\\n\\n\\n\\nIn ReBAC, your permissions might change based on your relationship to other users or to the data itself. For example, on Facebook, you might be able to see posts from your friends, or friends of friends, but not from people you’re not connected to.
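\\n\\n\\n\\nRelationship checks can be layered onto the same kind of policy by deriving a role from the connection between the principal and the resource. A hedged sketch, again in Cerbos-style YAML with illustrative attribute names:
\\n\\n\\n\\napiVersion: api.cerbos.dev/v1\\nderivedRoles:\\n  name: post_relationships\\n  definitions:\\n    - name: friend_of_author # derived from the relationship, not assigned statically\\n      parentRoles: [user]\\n      condition:\\n        match:\\n          expr: request.principal.id in request.resource.attr.authorFriendIds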
\\n\\n\\n\\nTo better illustrate these concepts, here are some examples of real-world scenarios where these paradigms would be implemented.
\\n\\n\\n\\nCorporate IT systems usually implement a combination of Role-Based Access Control (RBAC) and Attribute-Based Access Control (ABAC). Your job title (e.g. manager, staff) determines your level of access within the system—that’s RBAC. ABAC then also makes the system more secure by taking into account other factors such as the time, your location, your device, etc to decide whether or not you should be given access to certain things. This dual approach allows organizations to maintain security and flexibility.
\\n\\n\\n\\nHave you ever tried to reply to a social media post on Facebook or Twitter and noticed that the author limited who can comment? That is authorization in the form of ReBAC + DAC. ReBAC works to give access based on connections (if you’re friends or not), and determines what you can see on the person’s profile. DAC comes into play when the user controls their privacy settings, e.g. setting visibility or comment access to “friends only”.
\\n\\n\\n\\nIn high-security environments where classified information is present, Mandatory Access Control (MAC) is often the model of choice. This system provides rigid, centralized control where access is granted only when a user’s clearance meets or exceeds the classification level of the information.
\\n\\n\\n\\nWhen it comes to setting up and managing authorization systems, there are some best practices to follow, but also some challenges to watch out for.
\\n\\n\\n\\nAuthorization is an incredibly important aspect of every secure system. It is used in almost all digital systems you see today, from your personal computer to your social media platform of choice. It works behind the scenes constantly, checking if you have the required access rights to do what you need to do.
\\n\\n\\n\\nIn the end, good authorization is all about finding the right balance: between being secure and being easy to use, between giving people the access they need and protecting sensitive information. When it’s done right, authorization is a powerful tool that helps keep our digital lives running smoothly and safely. When it’s done wrong, it can leave us vulnerable to security breaches or make systems frustrating to use.
\\n\\n\\n\\nIf you want to exchange authorization design tips and ideas with other developers, or just learn more about authorization in general, you should join our Community Slack today!
\\n\\nCommunity post originally published on Dev.to by Sunny Bhambhani
\\n\\n\\n\\nIntroduction
k9s is a terminal-based UI to manage any Kubernetes (k8s) cluster. Using this single utility, we can manage, traverse, and watch all our Kubernetes objects.
More information around k9s can be found here: https://k9scli.io/
\\n\\n\\n\\nWe will dive a bit into k9s and see how it can help us in our day-to-day life, how we can get started, etc.
\\n\\n\\n\\nFeatures
Before we go into an example of how it can help and what it can do for us, let's look at some of its features. It has a ton of them, but we will focus on the ones that help with our day-to-day activities:
A dashboard view of what is happening in your cluster using pulses.
A tree kind of structure using xrays to identify the co-relation between objects.
\\n\\n\\n\\nInstallation
\\n\\n\\n\\nDownload the latest version of the k9s binary. You can use the URL/version below as-is if required, or get the latest released version and update the URL below accordingly.
\\n\\n\\n\\nFor latest version, refer: https://github.com/derailed/k9s/releases
\\n\\n\\n\\n$ wget https://github.com/derailed/k9s/releases/download/v0.32.7/k9s_linux_amd64.deb
\\n\\n\\n\\nInstall it using apt package manager.
\\n\\n\\n\\n$ sudo apt install ./k9s_linux_amd64.deb
\\n\\n\\n\\nOnce done, the k9s binary will be installed at /usr/bin/k9s.
\\n\\n\\n\\nHOWTO
\\n\\n\\n\\nLaunch k9s
\\n\\n\\n\\n$ k9s\\n
\\n\\n\\n\\nOnce k9s is launched you will be presented with a beautiful layout with lots of options.
\\n\\n\\n\\nNavigate between Kubernetes objects
Navigating is pretty easy; it is more or less like vi or vim.
Press : to bring the cursor to a text area where you can type the object you are interested in, for instance deployments in the current example.
\\n\\n\\n\\nPods management
\\n\\n\\n\\nPress : and type in pods; it will show you all the pods in the selected namespace. If you want to see the pods from all namespaces, press 0.
To delete a pod, select it and press ctrl+d and you are done.
To get a shell into a running pod, press s and you are done. If you want to exit, press ctrl+d or type in exit.
To describe a pod, press d.
To see the YAML of a pod, press y.
To see the logs of a pod, press l.
To port-forward, press shift+f, and if you want to see whether any existing port-forwards are present or not, press f.
Say there are jobs that are Completed and we want to clean them up. Just press z, you will be asked “if you are sure”, type in “Yes Please!” and the job is done.
XRAY
Xray gives you great detail in terms of the co-relation between k8s objects; it basically provides a tree-like structure.
Pulses
Pulses give you a great dashboard to see what exactly is happening in your cluster: what objects are there, the health of your objects, and so on.
NOTE: Make sure metrics-server is installed and running in the cluster otherwise you won’t see proper results. In my case it was not installed, therefore it was just stating blank results.
\\n\\n\\n\\nAfter installing metrics-server and launching pulses, it shows an awesome dashboard.
References:
\\n\\n\\n\\nFeel free to add your thoughts. Happy learning 🙂
\\n\\nAmbassador post by Prithvi Raj, CNCF Ambassador and Community Manager at Mirantis
\\n\\n\\n\\nAs Kubernetes continues to grow as the de facto orchestration platform for containerized applications and is massively adopted by large, medium, and small enterprises alike, the need for a lightweight, flexible, and easy-to-manage Kubernetes distribution has become more evident in the community.
\\n\\n\\n\\nOne such distribution, k0s, has emerged as a powerful yet minimal solution, catering to developers and enterprises seeking to deploy Kubernetes clusters with ease and efficiency. In this blog, we’ll take an in-depth look at k0s, its features, benefits, and how it compares to other Kubernetes distributions.
\\n\\n\\n\\nThe idea and aim was to reduce the complexity of setting up and managing Kubernetes clusters, which is often a hassle, by providing an easy-to-use, lightweight, simpler installation process while maintaining full compatibility with Kubernetes APIs.
\\n\\n\\n\\nk0s is an open-source, single-binary Kubernetes distribution designed to be lightweight, simple to deploy, and highly flexible. Unlike other Kubernetes distributions, k0s aims to reduce the complexity of setting up and managing Kubernetes clusters by providing an easy-to-use installation process while maintaining full compatibility with Kubernetes APIs.
\\n\\n\\n\\nP.S.: k0s is a “Certified Kubernetes” distribution that has achieved the Software Conformance badge, which ensures that every vendor’s version of Kubernetes supports the required APIs, just as open source community versions do.
\\n\\n\\n\\nWhat does it mean?
\\n\\n\\n\\nMaintained by folks from Mirantis and Replicated, k0s is designed to run on a wide range of environments, from bare-metal servers to cloud-based platforms, and even on edge devices. The key differentiator of k0s is its minimalistic approach: it consolidates multiple Kubernetes components into a single binary, making the installation and operational overhead much lower than traditional Kubernetes setups. When k0s is running, the “real” binaries of the apiserver, etcd, kubelet, containerd, runc, and so on are all present. k0s even ships its own statically linked version of iptables, for example, which is one reason why the k0s binary is substantially bigger compared to the k3s binary.
\\n\\n\\n\\nWhile there are numerous Kubernetes distributions available, including well-known ones like k3s, OpenShift, Rancher, EKS, GKE, and AKS, k0s sets itself apart in several ways:
\\n\\n\\n\\nTo power the management of k0s clusters there is k0smotron. It enables you to run Kubernetes control planes within a management cluster, and with the integration of Cluster API it streamlines various cluster operations, providing support for tasks such as provisioning, scaling, and upgrading clusters.
\\n\\n\\n\\nInstalling k0s is a straightforward process, especially compared to other Kubernetes distributions. The installation process typically involves the following steps:
\\n\\n\\n\\nThe k0s community discussions take place in the Kubernetes Slack workspace. Join the #k0s-users and #k0s-dev channels to ask your questions, share your user stories, and discuss your contributions with the maintainers.
\\n\\n\\n\\nJoin the k0s community office hours on the last Tuesday of every month at 3 PM EET / 1 PM GMT.
\\n\\n\\n\\nTo get an invite please fill out the invitation form.
\\n\\n\\n\\nHere are the meeting notes for the community office hours.
\\n\\n\\n\\nk0s is a lightweight, efficient, and highly flexible Kubernetes distribution that stands out for its simplicity and ease of use. Its single binary architecture, compatibility with the Kubernetes ecosystem, and focus on minimalism make it a compelling choice for developers, edge deployments, and small to medium-sized enterprises looking to harness the power of Kubernetes without the complexity and overhead of traditional setups. Whether you are just starting out with Kubernetes or managing large-scale clusters, k0s offers a scalable and manageable solution that can fit a variety of use cases.
\\n\\n\\n\\nFor those interested in a more streamlined Kubernetes experience, k0s represents an exciting alternative to other Kubernetes distributions, offering both power and flexibility in a compact package.
\\n\\nProject post originally published on the Linkerd blog by William Morgan
\\n\\n\\n\\nToday we’re happy to announce the release of Linkerd 2.17, a new version of Linkerd that introduces several major new features to the project: egress traffic visibility and control; rate limiting; and federated services, a powerful new multicluster primitive that combines services running in multiple clusters into a single logical service. This release also updates Linkerd to support OpenTelemetry for distributed tracing.
\\n\\n\\n\\nLinkerd 2.17 is our first major release since our announcement of Linkerd’s sustainability in October. Not unrelatedly, it is one of the first Linkerd releases in years to introduce multiple significant features at once. Despite this, we worked hard to stay true to Linkerd’s core design principle of simplicity. For example, these new features are designed to avoid configuration when possible; and when not possible, to make it minimal, consistent, and principled. After all, Linkerd’s simplicity—our rejection of the status quo that says, “the service mesh is complex and must be complex”—is key to its popularity, and it’s our duty to live up to that reputation in this and every release.
\\n\\n\\n\\nRead on for more!
\\n\\n\\n\\nLinkerd 2.17 introduces visibility and control for egress traffic leaving the Kubernetes cluster from meshed pods. Kubernetes itself provides no mechanisms for understanding egress traffic, and only rudimentary ones for restricting it, limited to IP ranges and ports. With the 2.17 release, Linkerd now gives you full L7 (i.e. application-layer) visibility and control of all egress traffic: you can view the source, destination, and traffic levels of all traffic leaving your cluster, including the hostnames, and, with configuration, the full HTTP paths or gRPC methods. You also can deploy egress security policies that allow or disallow that traffic with that same level of granularity, allowing you to allowlist or blocklist egress by DNS domain rather than IP range and port.
\\n\\n\\n\\nLinkerd’s egress functionality does not require changes from the application and only minimal configuration to get started. For more advanced usage, egress configuration is built on Gateway API resources, allowing you to configure egress visibility and policies with the same extensible and Kubernetes-native configuration primitives used for almost every other aspect of Linkerd, including dynamic traffic routing, zero trust authorization policies, and more.
\\n\\n\\n\\nFor example, enabling basic egress metrics across the entire cluster is as simple as adding this configuration:
\\n\\n\\n\\napiVersion: policy.linkerd.io/v1alpha1\\nkind: EgressNetwork\\nmetadata:\\n namespace: linkerd-egress\\n name: all-egress-traffic\\nspec:\\n trafficPolicy: Allow\\n
\\n\\n\\n\\nSee egress docs for more.
\\n\\n\\n\\nRate limiting is a reliability mechanism that protects services from being overloaded. In contrast to Linkerd’s circuit breaking feature, which is client-side behavior designed to protect clients from failing services, rate limiting is server-side behavior: it is enforced by the service receiving the traffic and designed to protect it from misbehaving clients.
\\n\\n\\n\\nJust as with egress, Linkerd’s rate limiting feature is designed to require minimal configuration, while still being flexible and configurable to a wide variety of scenarios. For example, a basic rate limit of 100 requests per second for a Server named “web-http” can be enabled with this configuration:
\\n\\n\\n\\napiVersion: policy.linkerd.io/v1alpha1\\nkind: HTTPLocalRateLimitPolicy\\nmetadata:\\n namespace: emojivoto\\n name: web-rlpolicy\\nspec:\\n targetRef:\\n group: policy.linkerd.io\\n kind: Server\\n name: web-http\\n total:\\n requestsPerSecond: 100\\n
\\n\\n\\n\\nLinkerd’s rate limiting feature also provides per-client rate limit policies that allow you to ensure rate limits are distributed “fairly” across multiple clients. Combined with retries, timeouts, circuit breaking, latency-aware load balancing, and dynamic traffic routing, rate limiting extends Linkerd’s already wide arsenal of in-cluster distributed system reliability features.
\\n\\n\\n\\nSee rate limiting docs for more.
\\n\\n\\n\\nIn Linkerd 2.17 we’ve shipped an exciting new multicluster feature: federated services. A federated service is a logical union of the replicas of the same service across multiple clusters. Meshed clients talking to a federated service will automatically load balance across all endpoints in all clusters, taking full advantage of Linkerd’s best-in-class latency-aware load balancing.
\\n\\n\\n\\nWith federated services, not only is application code decoupled from cluster deployment decisions—service Foo talking to service Bar needs only to call “Bar”, not to specify which cluster(s) it is on—but failure handling is transparent and automatic as well. Linkerd will transparently handle a wide variety of situations, including:
\\n\\n\\n\\nIn all these cases, Linkerd will automatically load balance across all service endpoints on all clusters, using its default latency-aware (latency EWMA) balancing to send individual requests to the best endpoint.
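\\n\\n\\n\\nIn practice, a Service joins a federated service by opting in on each cluster; per the Linkerd 2.17 multicluster documentation this is done with a label on the Service, though the exact label key below should be treated as an assumption and confirmed against the docs:
\\n\\n\\n\\napiVersion: v1\\nkind: Service\\nmetadata:\\n  name: bar\\n  namespace: app\\n  labels:\\n    mirror.linkerd.io/federated: member # assumed opt-in label; verify against the Linkerd docs\\nspec:\\n  selector:\\n    app: bar\\n  ports:\\n    - port: 8080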
\\n\\n\\n\\nFederated services were designed to capture a recent trend we see in multicluster Kubernetes adoption: planned large-scale multicluster Kubernetes. Linkerd’s original multicluster functionality, released in the good ol’ days of Linkerd 2.8, was designed for the ad-hoc, pair-to-pair connectivity that was common at the time. However, modern Kubernetes platforms are often much more intentional in their multicluster usage, sometimes ranging into the hundreds or thousands of clusters. Federated services join features such as flat network / pod-to-pod multicluster (introduced in Linkerd 2.14) in the toolbox for this new class of Kubernetes adoption.
\\n\\n\\n\\nSee federated services docs for more.
\\n\\n\\n\\nWe’re delighted to report that the CNCF is hosting Linkerd Day at KubeCon London next April! Many of the Linkerd maintainers will be in attendance, and we’re expecting a great lineup of Linkerd talks as well as plenty of Linkerd users. Come see us in London!
\\n\\n\\n\\nThe edge-24.11.8 release is the corresponding edge release for Linkerd 2.17. See the Linkerd releases page for more.
\\n\\n\\n\\nBuoyant, the creators of Linkerd, has additionally released Buoyant Enterprise for Linkerd 2.17.0 and published a Linkerd 2.17 changelog with additional guidance and content.
\\n\\n\\n\\nLinkerd is a graduated project of the Cloud Native Computing Foundation. Linkerd is committed to open governance. If you have feature requests, questions, or comments, we’d love to have you join our rapidly-growing community! Linkerd is hosted on GitHub, and we have a thriving community on Slack, Twitter, and the mailing lists. Come and join the fun!
\\n\\nCommunity post by Annalisa Gennaro
\\n\\n\\n\\nAt the beginning of this year, I fell apart. I found myself in pieces, struggling to say a single word without bursting into tears. I had severe sleep issues, suffered from intense anxiety and experienced a form of depression. I had to stop, to take a break, and recover. During this time, I reflected on what truly matters in life, and how we should protect ourselves and our loved ones from these risks. I wished to write a farewell post as a former CNCF Ambassador to thank everyone. The book I refer to is “Slow Productivity” by Cal Newport. Not everything I will hint at refers to my very personal experience.
\\n\\n\\n\\n———————————–
\\n\\n\\n\\nThe modern era is characterized by competitive business environments; however, knowledge workers find it difficult to balance work and life. Many of the new facets of remote work arrangements, together with the persistent rollout of redundant productivity (and busyness) tools, have fostered a culture that wears down people rather than improving processes. As a result, tracking systems like these impose excessive constraints on knowledge workers and expose them to burnout and health problems, especially mental health problems. In his book “Slow Productivity”, Cal Newport, a writer and academic, points out that the rush of much of this work destroys both the welfare of the worker and the end result of the work, leading to waste in the long term.
\\n\\n\\n\\nOne of the critical challenges that knowledge workers encounter is the absence of appropriate productivity metrics. Unlike conventional types of labor, where output is quantifiable, knowledge work involves more complexity and cannot be captured by basic measures such as the number of hours spent working or the tasks completed. According to Newport, forcing knowledge work into the industrial prism is irrational: “Concrete productivity metrics of the type that shaped the industrial sector will never properly fit in the more amorphous knowledge work setting. (Nor should we want it to fit, as this quantitative approach to labour ushers in its own stark inhumanities). In the absence of this clarity, however, pseudo-productivity can seem like the only viable default option.” (in “Slow Productivity”).
\\n\\n\\n\\nAs a result, productivity ends up being measured with close-supervision proxies such as the amount of time spent on different screens, the emails sent, and the designated tasks completed within a set time. However, such methods frequently reward quantity over quality, pushing the worker into an endless cycle of visible tasks that signal activity but do not necessarily reflect the worth of the work done. Newport states: “The more activity you see, the more you can assume that I’m contributing to the organization’s bottom line. […] we also gravitate away from deeper efforts toward shallower, more concrete tasks that can be more easily checked off a to-do list. Long work sessions that don’t immediately produce obvious contrails of effort become a source of anxiety […]” (in “Slow Productivity”). This mismatch provokes work-related stress and anxiety, as employees have to earn their relevance by proving their worth against such shallow metrics.
\\n\\n\\n\\nNewport stresses the necessity of boundaries, not just towards coworkers but, critically, towards the C-level executives higher in the hierarchy. He argues that being always on call is bad both for the self and for any real work that could take place: high-end innovation needs periods of disconnection and reflection.
\\n\\n\\n\\nAnd this protection of deep work has to be brought about with distinct, conscious effort on the part of employees:
\\n\\n\\n\\nAssertive Communication: Clear expectations about working hours and availability can only be created through assertive, effective communication. Newport sagaciously argues that knowledge workers should not wait to be directed but should take charge of establishing and maintaining their own boundaries: you have to be proactive in protecting your deep work spaces, because no one else will do it for you.
\\n\\n\\n\\nDefending Your Schedule: Similarly, Newport highlights the fatigue that comes from scattered attention and recommends setting aside tasks or requests that do not contribute to long-term goals. It is not necessary to respond to every issue immediately; resisting the temptation to do just anything helps retain focus and mental energy for more worthwhile endeavors.
\\n\\n\\n\\nTime Blocking: Newport supports the practice of time blocking, in which people assign certain hours to serious work and other hours to rest, personal engagements, and clients. This is not only a strategy for increasing productivity but also a means of enhancing focus and protecting health; time-blocking is a way of safeguarding performance and achieving balance: “[…] I recommended better organizing your [work] using time-blocking so that tasks could be better separated from deeper efforts.”
\\n\\n\\n\\nRemote work builds flexibility into the equation, but there are still hurdles to actually improving work-life balance. Many organisations have adopted monitoring practices, such as tracking time spent online or online task performance, to determine how productive employees are. Newport sees these practices as reinforcing the culture of overwork, especially in remote working setups: “Slow productivity supports legacy-building accomplishments but allows them to unfold at a more human speed”.
\\n\\n\\n\\nBoundaryless work has negative consequences for remote workers, who are likely to work more hours, check emails after hours, and feel a constant sense of obligation to be at work. Newport warns of this inclination and stresses the need for work cutoff times as well as breaks: “The way we’re working no longer works. What’s needed is more intentional thinking about what we mean by “productivity” in the knowledge sector – seeking ideas that start from the premise that these efforts must be sustainable and engaging for the actual humans doing the work”.
\\n\\n\\n\\nBurnout is inevitable unless knowledge workers actively maintain a balance, and adopt resilience-building approaches that can vary widely.
\\n\\n\\n\\nFixed Routines and Time Discipline: Even in remote work, establishing fixed routines that separate work from personal time is recommended. This helps to create mental and physical space for relaxation. A solid routine is the key to creating a healthy distance between work and free time, and to protecting moments of true rest.
\\n\\n\\n\\nMinimizing Notifications: Another way to respect the clock outside working hours is to restrict the number of notifications that demand remote action. This protects leisure time, since work-related notifications are rarely warranted after office hours. Reducing interruptions preserves focus and helps avert burnout.
\\n\\n\\n\\nScheduled Breaks: Breaks are among the most important parts of the working day and by no means a waste of time. They are not dispensable perks; they are a basic necessity for high-level cognitive performance.
\\n\\n\\n\\nMindfulness and Stress Management: Knowledge workers in particular can benefit from mindfulness or meditative practices to manage stress and sharpen focus. Mental health deserves the same care as physical health.
\\n\\n\\n\\nLetting Go of Non-Essential and Monotonous Tasks: Where possible, dropping non-essential tasks or automating mundane work frees up working hours and concentration for value-added tasks. Newport argues that attention to what is most important is one of the key abilities any knowledge worker can learn.
\\n\\n\\n\\nDo not confuse productivity with busyness: you shouldn’t keep busy just to demonstrate busyness during your workday. We should be measured on results, which go beyond the hourly tracking of our tasks.
\\n\\n\\n\\n“It seems like the benefits of technology have created the ability to stack more into our days and onto our schedules than we have the capacity to handle while maintaining a level of quality which makes the things worth doing… I think that’s where the burnout really hurts – when you want to care about something but you’re removed from the capacity to do the thing or do it properly and give it your passion and full attention and creativity because you’re expected to do so many other things.” (Steve, a strategic planner interviewed by Cal Newport).
\\n\\n\\n\\nKnowledge workers can safeguard their time and concentrate on important, worthwhile activities by establishing reasonable limits, communicating clearly, and using techniques such as time-blocking and mindfulness. In this way it is possible not only to improve one’s work but also to achieve a healthier, more sustainable work-life balance. This is what I wish for all of us, no matter which industry we have been working in, which company hired us, or which personal and professional goals we set for ourselves.
\\n\\n\\n\\nTake care.
\\n\\nMember post originally published on Chronosphere’s blog by Carolyn King, Head of Community & Developer at Chronosphere
\\n\\n\\n\\nThis week Fluent Bit maintainers are excited to announce the launch of Fluent Bit v3.2. This release delivers major performance improvements, increased efficiency, new signal support, and new capabilities for OpenTelemetry, YAML and eBPF. With v3.2, the Fluent Bit project continues to innovate and deliver the new capabilities and ecosystem integrations required to meet the complex needs of observability and security teams.
\\n\\n\\n\\nBuilt on the best practices and learnings from Fluentd, Fluent Bit was created to be a light-weight version able to collect and forward logs from Internet of Things (IoT) devices and containers, where deploying Fluentd would be impractical due to limited system resources.
\\n\\n\\n\\nSince its inception, Fluent Bit has expanded its capabilities to include collecting logs, metrics, and traces, providing in-stream processing and multi-routing capabilities, and much more.
\\n\\n\\n\\nAfter achieving CNCF graduated project status in 2019, Fluent Bit hit 1 billion downloads in 2022 and adoption has since skyrocketed from 1 to over 15 billion downloads today. This growth has been largely fueled by the adoption of Kubernetes and Fluent Bit’s compatibility with cloud native environments.
\\n\\n\\n\\nKey drivers of Fluent Bit’s global adoption include:
\\n\\n\\n\\nWhile Fluent Bit throughput and resource usage are already best in class, v3.2 introduces internal enhancements to optimize processing of log files as well as new defaults that ensure the best performance out of the box.
\\n\\n\\n\\nWith v3.2, Fluent Bit now comes standard with Single Instruction, Multiple Data (SIMD) support for log processing and parsing without any additional work from users, providing immediate performance benefits with this release.
\\n\\n\\n\\nIn addition, 3.2 builds on Fluent Bit’s core with continued default multi-threading for inputs, outputs, and processing of observability signal types, including logs, metrics, and traces.
\\n\\n\\n\\nFluent Bit 3.2 includes two new signal types: blob and eBPF. While Fluent Bit has traditionally been used to move petabytes of machine and observability data, users increasingly need to move binary data such as photos, videos, and files. A common use case: AI companies require the ability to move large volumes of pictures and videos in order to train their AI models.
\\n\\n\\n\\nWith blob signal support Fluent Bit can be used for moving massive files to storage destinations such as Azure Blob. This is particularly relevant for applications that need to collect and process photos and video files. In addition to AI, this includes autonomous vehicles, transportation, healthcare, retail, industrial automation, and more. Fluent Bit offers a turnkey way for companies to collect this data and the ability to send it to multiple backends.
\\n\\n\\n\\nExtended Berkeley Packet Filter (eBPF) is bringing new capabilities in security and advanced observability to Fluent Bit. In v3.2, Fluent Bit adds the ability to run eBPF programs to ingest data into the pipeline.
\\n\\n\\n\\nThis includes out-of-the-box eBPF capabilities for users to plug in their own eBPF programs. New eBPF capabilities also provide integrations that enable developers to plug in to other CNCF eBPF projects, such as Falco and Tracee, for security use cases.
\\n\\n\\n\\nSince its inception, Fluent Bit was designed for extensibility and to be interoperable with the best technologies in the space.
\\n\\n\\n\\nEduardo Silva, original creator of Fluent Bit, shared “From the beginning, Fluent Bit was built to integrate with best in class technologies and open source standards, enabling users to build the tech stack that is best for them.”
\\n\\n\\n\\nWith the rise of the OpenTelemetry protocol as a standard for observability, Fluent Bit continues its integration and standardization with OTel. This includes increased compatibility across logs, metrics, and traces.
\\n\\n\\n\\nThe interoperability of Fluent Bit and OTel means that teams can choose the best technology for their needs. For example, let’s say a developer needs a lighter and more performant agent than the OTel collector but they need to be able to convert the data to an OTel format.
\\n\\n\\n\\nThe user can now use the Fluent Bit agent to collect the data, and then leverage the OTel Envelope processor to convert the logs to the correct format for an OTel backend.
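\\n\\n\\n\\nAs a rough sketch of this pattern, the YAML below attaches an OTel Envelope processor to a tail input and forwards the result to an OTLP/HTTP endpoint. The log path and backend host are placeholders, and the option names should be checked against the Fluent Bit documentation for your version:
pipeline:\\n  inputs:\\n    - name: tail\\n      path: /var/log/app.log\\n      processors:\\n        logs:\\n          # Wrap raw log records in OTel resource/scope structures\\n          - name: opentelemetry_envelope\\n  outputs:\\n    # Send the enveloped records to an OTel-compatible backend\\n    - name: opentelemetry\\n      match: "*"\\n      host: otel-backend.example.com\\n      port: 4318\\n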
\\n\\n\\n\\nThe interoperability of Fluent Bit and OTel is particularly important for users who are (1) migrating to Open Source standards and (2) implementing the OpenTelemetry protocol. Fluent Bit v3.2 introduces the following OTel capabilities:
\\n\\n\\n\\nNew additions to YAML also make it easier for developers to adopt and leverage Fluent Bit. As many developers know, YAML has been the standard for Kubernetes configuration and many other applications for years.
\\n\\n\\n\\nPrevious Fluent Bit versions included some YAML compatibility, but Fluent Bit v3.2 now includes full support for YAML in every part of the Fluent Bit pipeline, including parsers, configuration, processors, and settings.
\\n\\n\\n\\nThis allows a single unified configuration language across both Fluent Bit and Kubernetes resources, which means that developers don’t need to duplicate effort inside of Fluent Bit.
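\\n\\n\\n\\nAs an illustration, here is a minimal sketch of a single YAML file that combines service settings, an inline parser definition, and a pipeline. The layout follows the Fluent Bit YAML schema, but the parser name and paths are hypothetical and the exact keys should be verified against the v3.2 documentation:
# fluent-bit.yaml: service settings, parsers, and the pipeline in one file\\nservice:\\n  flush: 1\\n  log_level: info\\nparsers:\\n  - name: app_json\\n    format: json\\npipeline:\\n  inputs:\\n    - name: tail\\n      path: /var/log/app/*.log\\n      parser: app_json\\n  outputs:\\n    - name: stdout\\n      match: "*"\\n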
\\n\\n\\n\\nFluent Bit v3.2 pushes the boundaries of performance, versatility, and interoperability in cloud-native observability. With significant enhancements like SIMD support, eBPF integrations, and full OpenTelemetry compatibility, this release ensures Fluent Bit continues to meet the evolving needs of today’s observability and security teams. Whether you’re moving large datasets or optimizing Kubernetes workflows, Fluent Bit v3.2 is equipped to help you achieve more with less.
\\n\\n\\n\\nChronosphere is a leading observability company, empowering customers to reduce data complexity and volume, optimize costs, and remediate issues faster. Visit: chronosphere.io.
\\n\\nGet to know Eyal
\\n\\n\\n\\nThis week’s Kubestronaut in Orbit, Eyal Zekaria is a Senior Cloud Architect in Berlin, Germany. Eyal has a DevOps and SRE background and has experience operating Kubernetes clusters at scale at different cloud providers. In his current role he helps other companies in their cloud journey. Outside of Kubernetes, Eyal is also interested in automation and open source software.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Eyal, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nWhen did you get started with Kubernetes and/or cloud-native? What was your first project?
\\n\\n\\n\\nThe company I worked for in 2017 was using Mesos/Marathon to run their microservices — after evaluating Kubernetes, we started moving services over one by one.
\\n\\n\\n\\nWhat are the primary CNCF projects you work on or use today? What projects have you enjoyed the most in your career?
\\n\\n\\n\\nBesides the components that are built into Kubernetes, I think that cert-manager, Open Policy Agent (OPA), and Prometheus are the ones I like the most.
\\n\\n\\n\\nHow have the certs or CNCF helped you in your career?
\\n\\n\\n\\nGetting the certifications helped me bridge gaps in my knowledge. My experience has been predominantly using managed Kubernetes services where many aspects are taken out of your control for the sake of convenience – which is great, but it can also be detrimental when trying to understand how things work internally.
\\n\\n\\n\\nWhat are some other books/sites/courses you recommend for people who want to work with k8s?
\\n\\n\\n\\n“Kubernetes the Hard Way” by Kelsey Hightower is awesome, as well as anything by Kelsey, but I find the preparation for all Kubernetes-related CNCF certifications very useful as well.
\\n\\n\\n\\nWhat do you do in your free time?
\\n\\n\\n\\nPlay tennis, or any other racket sports.
\\n\\n\\n\\nWhat would you tell someone who is just starting their K8s certification journey? Any tips or tricks?
\\n\\n\\n\\nCreate a sandbox cluster and start getting comfortable interacting with it. Break it, fix it, enable and disable different features.
\\n\\n\\n\\nThen start experimenting with running workloads in it, including all the involved aspects.
\\n\\nMember post by Sameer Danave, Senior Director of Marketing, MSys Technologies
\\n\\n\\n\\n“I’m excited about our new project but overwhelmed by all the technological changes,” one of our solution architects shared in an MSys Slack channel before dropping a dozen links to the latest AI advancements. He’s a seasoned tech veteran who might need a hug. And who could blame him? Even in an industry used to rapid evolution, where new technologies emerge almost every quarter, AI is accelerating change at unprecedented rates. AI-powered cloud scalability, next-gen tools and platforms, and edge computing advancements stretch everyone’s capacity to keep up.
\\n\\n\\n\\nTo clarify this whirlwind, I spoke with technology professionals at MSys Technologies and worldwide to uncover the latest cloud computing trends—AI-focused and beyond. This blog is here to keep you up to date on the essentials. So dive in and explore what’s next in cloud computing!
\\n\\n\\n\\nAI has transformed multiple industries, and cloud computing is no exception. In 2025, AI won’t just be another service running in the cloud—it will be the intelligent force optimizing every aspect of cloud operations. From real-time resource allocation and automated scaling to intelligent systems countering threats, AI will play a pivotal role in reshaping the cloud landscape. Businesses that embrace this paradigm shift stand to reap extraordinary rewards: unparalleled efficiency, dramatic cost reductions, and performance levels once thought unattainable.
\\n\\n\\n\\nThe future of AI lies in the seamless integration of edge and cloud computing. In 2025, AI workloads will dynamically shift between the edge and the cloud, leveraging each environment’s unique strengths. The cloud will handle training for complex AI models, while the edge will manage real-time inferencing, ensuring rapid responses. Next-generation edge platforms will support end-to-end automation, delivering comprehensive solutions across multi-cloud and edge environments.
\\n\\n\\n\\nEnterprises worldwide are already adopting hybrid and multi-cloud strategies—for good reason. By blending public cloud services from various providers, companies in 2025 will continue to boost flexibility while avoiding vendor lock-in. Hybrid cloud solutions elevate data storage management, enabling organizations to maximize existing infrastructure while seamlessly integrating public and private clouds. This approach results in scalable, secure, and redundant systems that enhance storage, improve disaster recovery, bolster data security, and keep businesses agile in a rapidly evolving landscape.
\\n\\n\\n\\nServerless computing is transforming how software services are built and deployed, reducing the need for infrastructure management. It enables developers to easily deploy code without concerns about underlying infrastructure. This has several benefits, including faster time-to-market, scalability, and lower costs for new service deployments. Given these advantages, serverless computing will see widespread adoption among enterprises globally in the coming years.
\\n\\n\\n\\nQuantum computing is beginning to find real-world applications. In 2025, it will step out of the lab and into mainstream business—not through costly hardware investments but via cloud services. Industry giants like IBM, Google, Microsoft, and Amazon are democratizing access to this technology, making quantum capabilities accessible to organizations of all sizes. The potential impacts are profound: from breakthrough drug discoveries to unbreakable encryption, quantum cloud services will unlock innovations previously deemed impossible.
\\n\\n\\n\\nEdge computing is transforming how data is being processed and utilized. Unlike traditional cloud methods, edge computing brings computational power directly to the data source. However, traditional DevOps practices, highly effective in cloud-centric environments, need adaptation for the edge. The “one-size-fits-all” approach can’t tackle the unique challenges of edge computing—such as scale, connectivity, security, and diverse devices. Enter DevEdgeOps: a specialized approach that combines the agility and automation of DevOps with the specific requirements of edge environments. This approach bridges the gap, enabling organizations to manage the complexities of edge computing with the same efficiency and speed that DevOps brings to the cloud.
\\n\\n\\n\\nThe cloud computing landscape in 2025 is defined by innovation—AI-powered optimization, seamless edge-to-cloud integration, hybrid strategies, serverless scalability, and quantum breakthroughs. These trends aren’t just reshaping the cloud but transforming how businesses operate and innovate. At MSys Technologies, we specialize in delivering cutting-edge cloud services and solutions tailored to your needs. From strategy to execution, we help businesses stay ahead in this dynamic landscape. The opportunities are immense, and the future of cloud computing is here. Let MSys Technologies guide you.
\\n\\nMember post originally published on the Middleware blog by Sanjay Suthar
\\n\\n\\n\\nAs your AWS environment expands—whether in terms of resources, the number of services, or even the scale of your team—managing these elements becomes increasingly challenging. With multiple instances, databases, and services running concurrently, maintaining an organized overview can quickly become overwhelming.
\\n\\n\\n\\nThis growth often makes it difficult to know if every component is functioning optimally and delivering the expected performance. In an ideal scenario, you’d have a system that not only monitors each part of your infrastructure but also provides real-time insights, ensuring everything runs as smoothly as intended.
\\n\\n\\n\\nThis is where AWS CloudWatch comes into play. AWS CloudWatch serves as your essential tool for monitoring, analyzing, and acting on key metrics, not only within your AWS environment but also for external applications and services running on-premises or in other cloud platforms. It offers valuable insights into how each AWS service is performing, helping you manage and fine-tune your resources effectively.
\\n\\n\\n\\nThe purpose of this guide is to provide clear, actionable guidance on using CloudWatch metrics to maintain a well-monitored AWS environment. From configuring alarms to utilizing advanced features like CloudWatch’s custom dashboards, we’ll explore everything CloudWatch offers to keep your infrastructure running at its best.
\\n\\n\\n\\nAWS CloudWatch goes beyond basic monitoring—it’s a comprehensive service that helps you gain deep insights into your AWS infrastructure.
\\n\\n\\n\\nOne of CloudWatch’s core strengths is its ability to gather data from a wide range of sources, including AWS services and external applications. Whether you’re monitoring the performance of your EC2 instances, tracking database connections in RDS, following Lambda invocations, or collecting metrics from on-premises servers and third-party services via OpenTelemetry, CloudWatch consolidates all these metrics, giving you a centralized view of your entire infrastructure.
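\\n\\n\\n\\nAs a quick, hedged example of this consolidation, the AWS CLI can list what CloudWatch has already collected for a given namespace; the instance ID below is a placeholder:
# List the CPUUtilization metrics CloudWatch holds for one EC2 instance\\naws cloudwatch list-metrics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=i-0123456789abcdef0\\n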
\\n\\n\\n\\nBeyond simply collecting metrics, CloudWatch offers an alerting feature. You can set up alarms that notify you when certain thresholds are reached, allowing you to respond to potential issues before they escalate.
\\n\\n\\n\\nThese features give you more control and understanding of your AWS environment, offering the insights needed to troubleshoot problems as they occur and maintain a well-functioning infrastructure.
\\n\\n\\n\\nCloudWatch metrics are essential because they provide live data on how both your AWS and non-AWS resources are performing. They help you monitor various aspects of your infrastructure, giving you insights into resource usage and potential problem areas.
\\n\\n\\n\\nFor example, metrics like CPU Utilization in EC2 instances, provided by CloudWatch, indicate when a server is experiencing high demand. This can signal the need to either scale up your resources or redistribute traffic to avoid any slowdowns. Similarly, Freeable Memory for RDS, which is tracked by CloudWatch, helps you determine if your database instance requires resizing to handle your workloads more effectively.
\\n\\n\\n\\nCloudWatch is also valuable for managing costs. By examining usage patterns, you might notice that certain instances are consistently running at a fraction of their capacity. For example, if you see that an EC2 instance’s CPU usage never exceeds 10%, it might be an indicator that you’re over-provisioned, and you could switch to a smaller instance type to save on costs.
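\\n\\n\\n\\nOne way to sanity-check utilization from the command line is to pull a few days of hourly averages with the AWS CLI; the instance ID and time window below are placeholders:
# Hourly average CPUUtilization for one instance over three days\\naws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=i-0123456789abcdef0 --start-time 2024-11-01T00:00:00Z --end-time 2024-11-04T00:00:00Z --period 3600 --statistics Average\\n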
\\n\\n\\n\\nIn comparison to other cloud platforms, AWS CloudWatch stands out with its seamless integration with the entire AWS ecosystem. For instance, Google Cloud uses Cloud Monitoring, and Azure has Azure Monitor—both effective tools, but they lack the level of integration CloudWatch offers with AWS-specific services. This tight integration means CloudWatch not only monitors but can also trigger actions (like Auto Scaling) based on the metrics it tracks, making it a more cohesive option for managing AWS resources.
\\n\\n\\n\\nAWS CloudWatch functions by collecting metrics, logs, and events from various sources, including AWS services like EC2, RDS, and Lambda, as well as from on-premises servers, external cloud services, and applications instrumented with OpenTelemetry. These metrics are gathered and made accessible, allowing you to monitor performance and set alarms based on specific thresholds across your entire infrastructure.
\\n\\n\\n\\nThink of CloudWatch as a monitoring hub for your AWS infrastructure. It captures data from different services and translates it into actionable insights, which you can then visualize through dashboards or use to trigger automated actions.
\\n\\n\\n\\nFor instance, you can customize CloudWatch Dashboards to display metrics that are crucial for your organization’s operations, such as CPU utilization for EC2 instances or request counts for a load balancer. Alarms can be set to notify you when these metrics reach predefined thresholds. These alarms can also initiate automated responses, like scaling an Auto Scaling group or executing a Lambda function, ensuring that your infrastructure responds promptly to changing conditions without requiring manual intervention.
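\\n\\n\\n\\nAs a rough illustration of the dashboard side, a single-widget dashboard for EC2 CPU can be created from the CLI; the dashboard name, instance ID, and region are placeholders:
# Create a minimal dashboard with one EC2 CPU utilization widget\\naws cloudwatch put-dashboard --dashboard-name ec2-overview --dashboard-body '{"widgets":[{"type":"metric","x":0,"y":0,"width":12,"height":6,"properties":{"metrics":[["AWS/EC2","CPUUtilization","InstanceId","i-0123456789abcdef0"]],"stat":"Average","period":300,"region":"us-east-1","title":"EC2 CPU utilization"}}]}'\\n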
\\n\\n\\n\\nWhen monitoring your environment, whether it’s on AWS or beyond, several key metrics provide valuable insights into the performance and health of your services. While the metrics mentioned here are essential, always remember that the importance of metrics can vary based on your specific AWS setup and workloads.
\\n\\n\\n\\nCloudWatch allows you to create custom dashboards that visualize these metrics, offering a clear view of your services’ performance and enabling you to quickly identify and address any issues. This comprehensive monitoring ensures that your infrastructure is functioning as expected, reducing the chances of unexpected downtimes or performance degradations.
\\n\\n\\n\\nMonitoring an EC2 instance using CloudWatch helps you keep track of how well your server is handling workloads and enables you to quickly identify and resolve potential issues such as high CPU usage or network traffic. Here’s a comprehensive, step-by-step guide on how to set this up:
\\n\\n\\n\\nOnce your EC2 instance is running, you can proceed to monitor it using CloudWatch.
\\n\\n\\n\\nCloudWatch offers a variety of metrics for EC2 instances, but let’s focus on the most important ones:
\\n\\n\\n\\nOnce you have a good understanding of these metrics, it’s important to set up alarms to notify you when they go beyond acceptable limits:
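\\n\\n\\n\\nFor example, a basic CPU alarm can be created with the AWS CLI; the instance ID and SNS topic ARN below are placeholders, and the threshold should be tuned to your workload:
# Alarm when average CPU stays above 80% for two consecutive 5-minute periods\\naws cloudwatch put-metric-alarm --alarm-name high-cpu-demo --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=i-0123456789abcdef0 --statistic Average --period 300 --evaluation-periods 2 --threshold 80 --comparison-operator GreaterThanThreshold --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts\\n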
\\n\\n\\n\\nTo ensure your alarms are functioning correctly, you can simulate load on your EC2 instance:
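\\n\\n\\n\\nOne simple approach is the stress utility, assuming it is installed from your distribution’s package repositories (for example, via EPEL on Amazon Linux):
# Drive two CPU workers for five minutes to push CPUUtilization past the alarm threshold\\nstress --cpu 2 --timeout 300\\n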
\\n\\n\\n\\nMonitoring your EC2 instance isn’t just about observing metrics; it’s about understanding how your infrastructure behaves. This process provides valuable insights into the performance and stability of your instance. For example, if your instance shows consistently high CPU usage, it could indicate that it’s struggling to handle the workload. Similarly, a sudden increase in disk activity might signal that your application is generating more data than expected, which could lead to running out of storage space.
\\n\\n\\n\\nBy consistently monitoring these metrics, you’ll be well-equipped to take timely action, whether that involves scaling up resources to meet demand, fine-tuning your application to improve efficiency, or investigating any unusual spikes in traffic or resource usage. This proactive approach ensures that your AWS environment remains reliable and responsive to changing conditions.
\\n\\n\\n\\nWhile AWS CloudWatch offers solid built-in monitoring features, several third-party tools can enhance your experience by providing advanced analytics, more intuitive dashboards, or additional integrations. Here are the top 5 tools to consider for extending your AWS CloudWatch capabilities, with a focus on their unique features:
\\n\\n\\n\\nMiddleware is a full-stack cloud observability platform that integrates with AWS CloudWatch, offering pre-built dashboards. It simplifies cloud management by making metrics more accessible and actionable for your team. Middleware is known for its straightforward setup and smooth integration with CloudWatch, allowing you to gain deeper insights into your AWS infrastructure’s performance.
\\n\\n\\n\\nKey features:
\\n\\n\\n\\nDatadog is a versatile monitoring tool that offers enhanced visibility into your AWS infrastructure by extending CloudWatch’s monitoring capabilities. It lets you combine CloudWatch metrics with logs, traces, and events from other services, providing a comprehensive view of your AWS ecosystem.
\\n\\n\\n\\nKey features:
\\n\\n\\n\\nDatadog excels at correlating data across different services, making it more than just a UI enhancement over CloudWatch’s native features.
\\n\\n\\n\\nNew Relic offers an observability platform that brings together AWS CloudWatch metrics with application performance data, providing a clearer picture of how your services interact. It’s particularly useful for monitoring complex, distributed environments.
\\n\\n\\n\\nKey features:
\\n\\n\\n\\nNew Relic provides a comprehensive view, helping you understand how AWS resources and your applications work together, making it easier to troubleshoot and optimize.
\\n\\n\\n\\nPrometheus, paired with Grafana, offers a powerful open-source solution for monitoring AWS CloudWatch metrics. It’s particularly effective in cloud environments that change frequently, such as Kubernetes clusters, and allows complete customization of how you visualize and alert on your data.
\\n\\n\\n\\nKey features:
\\n\\n\\n\\nThis combination is ideal for those who prefer open-source tools and want full control over their monitoring setup.
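\\n\\n\\n\\nAs a sketch of how this bridge is often wired up, the prometheus/cloudwatch_exporter project takes a small YAML file describing which CloudWatch metrics to scrape into Prometheus; the region and dimensions below are placeholders, and the exact keys should be checked against the exporter’s README:
# cloudwatch_exporter config: pull EC2 CPU figures into Prometheus\\nregion: us-east-1\\nmetrics:\\n  - aws_namespace: AWS/EC2\\n    aws_metric_name: CPUUtilization\\n    aws_dimensions: [InstanceId]\\n    aws_statistics: [Average]\\n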
\\n\\n\\n\\nZabbix is another open-source monitoring tool that integrates well with AWS CloudWatch, providing advanced alerting and custom dashboards. It automatically discovers AWS services and resources, making it easier to monitor your environment without manual configuration.
\\n\\n\\n\\nKey features:
\\n\\n\\n\\nZabbix offers more advanced alerting capabilities compared to CloudWatch’s built-in alerts, making it suitable for organizations looking for a more tailored monitoring solution.
\\n\\n\\n\\nThese third-party tools add significant value to your AWS CloudWatch setup, providing more detailed insights, predictive analytics, and enhanced visualizations. Whether you’re looking for a user-friendly option or prefer an open-source solution with Prometheus and Grafana, these tools can help you gain a deeper understanding of your AWS environment.
\\n\\n\\n\\nTo truly leverage AWS CloudWatch and ensure effective monitoring, consider the following advanced practices:
\\n\\n\\n\\nBy incorporating these best practices, you’ll gain deeper insights into your AWS environment, automate routine responses to issues, and maintain tighter control over your infrastructure’s health. This approach ensures that your monitoring strategy is both effective and tailored to the needs of your applications and business.
\\n\\n\\n\\nAWS CloudWatch metrics are an essential tool for managing and optimizing both your AWS and non-AWS resources. They provide real-time visibility into the performance of your infrastructure and allow you to take action before problems arise.
\\n\\n\\n\\nLooking ahead, integrating CloudWatch with machine learning services like SageMaker can help predict anomalies and optimize resource usage automatically. Additionally, leveraging AWS Lambda for automated responses to CloudWatch alarms can help you streamline operations even further.
\\n\\nMember post by Jamie Lynch, Senior Software Engineer at Embrace
\\n\\n\\n\\nOpenTelemetry has historically been adopted mainly on backend systems, where it’s a great solution for gaining insight into what’s happening in production by gathering telemetry via an open standard. This avoids the dreaded costs of vendor lock-in because as long as a provider supports the OTel data format, you can easily switch and take control over your own data.
\\n\\n\\n\\nUp until now, OTel’s adoption on mobile has not been quite as widespread. However, there are early signs that this is rapidly changing and that engineers are adopting the standard for similar reasons as for backend observability. Mobile provides some unique challenges for gathering telemetry compared to backend development, and in this post we will highlight what those challenges are as well as a few solutions for fixing them.
\\n\\n\\n\\nBefore we cover some of the ways OTel is different on mobile, it’s worth comparing mobile development to backend development, as mobile devices have unique constraints that affect how telemetry is collected.
\\n\\n\\n\\nBackend servers have lots of CPU and memory, whereas most mobile devices have lower spec hardware that is less performant.
\\n\\n\\n\\nFurthermore, if your backend app runs into performance issues, you can usually just provision more servers with beefier hardware. When (not if) your mobile app runs into performance issues, shipping your users better devices is usually not an option! So on mobile, you’re stuck supporting a cohort of potentially thousands of underpowered Android/iOS device models.
\\n\\n\\n\\nYou’ve probably experienced the frustration of running out of charge on a mobile device. Power consumption is much more important on mobile than the backend where everything is plugged into the mains – that’s one reason why CPUs tend to be lower frequency on mobile as they require less power.
\\n\\n\\n\\nThe OS itself is also much more aggressive about prolonging battery life on mobile. Approaches that might be viable on the backend, such as polling for data every second, are almost always not an option on mobile. Mobile operating systems will eagerly kill processes that use excessive resources, and it’s usually impossible to have a process running continuously in the background. In contrast, this is fairly simple to do on the backend.
\\n\\n\\n\\nBackend servers have consistent network connections with great bandwidth and low latency. Mobile devices do not enjoy these luxuries! Their network connection is usually high latency, may have prolonged periods with no connection (e.g., long-haul flights with airplane mode enabled), and bandwidth can be low.
\\n\\n\\n\\nAn application server that responds to HTTP requests will run continuously. That’s not the case on mobile. A user may switch between dozens of apps in a short period, and to save limited battery and compute resources, the OS may terminate these processes at any time without warning. While this may happen on the backend in extreme scenarios such as memory pressure, on mobile this is a daily fact of life.
\\n\\n\\n\\nBackend applications typically have short transactions – a HTTP request comes in, some operation occurs, and a response is returned to the client. On mobile, a user might open an app for a few minutes, but they might also use it for hours, performing hundreds or even thousands of interactions with the application during a single session. The data, context, and duration of traces captured can therefore be vastly different between backend and mobile applications.
\\n\\n\\n\\nThe majority of mobile apps run in a single process which means OTel collectors and exporters run in the same process. This is quite different to the backend where these components typically run in separate processes. If the process terminates on mobile due to a crash or an OS kill, without additional work to persist telemetry, data will be lost.
\\n\\n\\n\\nNetwork connectivity is situational on mobile devices, making it necessary to plan for the worst case where telemetry cannot be delivered to your backend of choice. Even if you’re lucky enough to have a connection 95% of the time, that still means you could be missing 1 in 20 requests if you take the “fire and forget” approach. For mobile, where your app may be running on thousands of different devices, this can leave a substantial hole in your observability. It’s therefore essential to persist data before sending it.
\\n\\n\\n\\nPersisting data may sound straightforward – it’s just writing a bunch of data to disk, right? Unfortunately, on mobile, this simple act introduces a whole bunch of complexity. First, the average mobile device does not have much free disk space, so the amount of telemetry that can be persisted needs to be limited in some way. This requires picking a strategy to delete telemetry data. Common strategies involve prioritizing the newest and most important types rather than stale data.
\\n\\n\\n\\nSecond, it’s necessary to deal with I/O errors, potential schema changes depending on how the data is persisted, and data being sent from a different process (and maybe even a different day) from when it was captured. If engineers forget to deal with this complexity, then subtle bugs can creep into your data pipeline that impact your observability.
\\n\\n\\n\\nIf a process terminates on mobile, it’s necessary to immediately save any captured telemetry. This is partly due to poor network connections as discussed previously, but also because blocking the UI thread with HTTP requests on process termination can lead to blocked threads and ANRs (Application Not Responding errors).
\\n\\n\\n\\nFor a crash or OS kill, it’s generally not possible to predict when a process is going to terminate, and once it has happened, the amount that can be done is fairly limited. For example, once a native C signal is raised on mobile, it’s possible to install a signal handler that reacts to the crash, but the implementation must be async-signal-safe. These implementation constraints make it impossible to send HTTP requests, and very hard to do anything other than storing telemetry for later processing.
\\n\\n\\n\\nIn order to capture most telemetry, an option is to periodically persist telemetry so that an up-to-date “snapshot” of the captured data can be read the next time an application launches. We’ll elaborate on this a bit later.
\\n\\n\\n\\nThere’s no silver bullet for conserving resources such as battery and memory on mobile. The first step as an engineer working on an app is to be judicious in what telemetry is captured. For example, polling the OS every minute for memory data might be acceptable on the backend, but on mobile devices it would be preferable to rely on OS callbacks instead for significant events.
\\n\\n\\n\\nProfiling the impact of telemetry code in hot paths, such as application startup, also becomes more crucial on mobile. This is something that at Embrace we do as SDK vendors on our own code, but it’s also something that should be considered for your own application, as every mobile app may behave differently out in the wild.
\\n\\n\\n\\nLong-running spans are a challenge in OpenTelemetry for mobile because user sessions can run for much longer times than a typical backend HTTP request. This means they can accumulate many events, making the span payloads quite large. Spans also can’t be sent until they are completed, so there’s the potential for data loss if the process terminates halfway through a span.
\\n\\n\\n\\nEmbrace solves this problem on mobile with “span snapshots.” In this approach a JSON representation of all non-completed spans is stored on disk periodically, and if the process does terminate unexpectedly, then on next launch the application is able to send these to an OTel-compliant backend.
\\n\\n\\n\\nOTel’s semantic conventions are agreed-upon conventions on how telemetry should be captured by an application. These are great because using conventions means that OTel implementations can assume knowledge about what telemetry data contains.
\\n\\n\\n\\nFor example, rather than showing an OTel span for a network request, a backend solution could process a span containing HTTP call information and add opinionated logic on top of standard OTel that reveals superior insights into network performance. As OTel has historically had traction on the backend, more semantic conventions are agreed-upon for backend concepts such as HTTP requests and cloud events, than for mobile events such as user sessions.
\\n\\n\\n\\nThe other key difference between backend and mobile is that the backend is usually running 24/7, 365 days a year. In contrast, a mobile messaging app might have very short user sessions of a few seconds, or a movie-streaming app might have user sessions that potentially run for hours. Problems can develop over time and across endless combinations of user, device, and app conditions. The uncontrolled nature of the mobile environment is certainly more complicated than the standard OTel paradigm.
\\n\\n\\n\\nIn Embrace’s view there are three key points where the OTel community can improve support for mobile.
\\n\\n\\n\\nCurrently, the OTel implementation for mobile doesn’t fully account for the differences between the backend and mobile, and it’s too easy for data to get lost due to unexpected process termination or long-running spans. We expect this to change as more observability vendors adopt OTel and agree on common solutions.
\\n\\n\\n\\nSemantic conventions are agreed-upon conventions for how telemetry should be captured using basic OTel data types such as spans and events. For example, an Android phone entering battery saver mode and then exiting it when a user plugs in a charger could be modeled as a span as it has a start and end time.
\\n\\n\\n\\nIf an OTel-compliant backend implementation supports a semantic convention for battery saver mode, then it could perform extra processing on the telemetry data that might surface hidden trends that correlate with the presence of this span. This data is important from a mobile engineer perspective as low power indicates the OS will be more willing to constrain background jobs and reduce the amount of system resources available – therefore affecting the performance of an app.
\\n\\n\\n\\nThere is already a rich ecosystem of OTel instrumentation for backend libraries and technologies. The same doesn’t exist (yet) on mobile, but we believe that as more and more mobile engineers implement OTel and further SDK vendors become OTel-compliant, that will change.
\\n\\n\\n\\nHopefully, we can move towards a future where telemetry and instrumentation really only need to be written once for commonly shared libraries in the mobile ecosystem. If you’d like to explore OpenTelemetry for mobile today, check out Embrace’s open source, OTel-compliant SDKs and join our Slack community to learn more about how to modernize your mobile observability.
\\n\\nMember post originally published on Linbit’s blog by Matt Kereczman
\\n\\n\\n\\nEdge computing is a distributed computing paradigm that brings data processing and computation closer to the data source or “edge” of the network. This reduces latency and removes Internet connectivity as a point of failure for users of edge services.
\\n\\n\\n\\nSince more hardware is involved in an edge computing environment than there is in a traditional central data center topology, there is a need to keep that hardware relatively inexpensive and replaceable. Generally, that will mean the hardware running at the edge will have less system resources than what you might find in a central data center.
\\n\\n\\n\\nLINBIT SDS, which consists of LINSTOR® and DRBD® from LINBIT®, has a very small footprint on system resources. This leaves more resources available for other edge services and applications and makes LINBIT SDS an ideal candidate for solving persistent storage needs at the edge.
\\n\\n\\n\\nThe core function of LINBIT SDS is to provide resilient and feature-rich block storage to the many platforms it integrates with. The resilience comes from DRBD, the block storage replication driver managed by LINSTOR in LINBIT SDS, which allows services to tolerate host-level failures. This is an important feature at the edge, since host-level failures may be more frequent when using less expensive hardware that might not be as fault tolerant as hardware that you would run in a proper data center.
\\n\\n\\n\\nTo prove and highlight some of the claims I’ve made about LINBIT SDS above, I used my trusty Libre Computer AML-S905X-CC (Le Potato) ARM-based single board computer (SBC) cluster to run LINBIT SDS and K3s. If you’re not familiar with “Le Potato” SBCs, they are simply 2GB Raspberry Pi 4 model B clones. I would characterize my “Potato cluster” as severely underpowered compared to the enterprise grade hardware used by some of LINBIT’s users, and would even go as far as saying this is the floor in terms of hardware capability that I would try something like this on. To read about LINBIT SDS on a much more capable ARM-based system read my blog, Benchmarking on Ampere Altra Max Platform with LINBIT SDS. That said, if LINBIT SDS can run on my budget Raspberry Pi clone cluster, it can run anywhere.
\\n\\n\\n\\nHere is a real photograph* of my Le Potato cluster and cooling system in my home lab:
\\n\\n\\n\\nThe cluster I’m using does not have a ton of resources available. Using the kubectl top node command, we can see what each of these nodes has available with LINBIT SDS already deployed.
root@potato-0:~# kubectl top node\\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\\npotato-0 1111m 27% 1495Mi 77%\\npotato-1 1277m 31% 1666Mi 86%\\npotato-2 1096m 27% 1634Mi 84%\\n
\\n\\n\\n\\nA single CPU core in these quad-core Libre Computer SBCs is equal to 1000m, or 1000 millicpus. This output shows us that, with LINBIT SDS and Kubernetes running on them, the nodes still have 69-73% of their CPU resources available. Memory pressure on these 2Gi SBCs is extremely limiting, but we still have a little room to play around.
\\n\\n\\n\\nA typical LINBIT SDS deployment in Kubernetes consists of the following containers:
\\n\\n\\n\\n📝 NOTE: You can verify which containers the latest LINBIT SDS deployment in Kubernetes uses by viewing the image lists that LINBIT maintains at charts.linstor.io.
\\n\\n\\n\\nUsing kubectl top pods -n linbit-sds --sum, we can see how much memory and CPU the LINBIT SDS containers are using.
root@potato-0:~# kubectl top -n linbit-sds pods --sum\\nNAME CPU(cores) MEMORY(bytes)\\nha-controller-bk2rh 4m 20Mi\\nha-controller-bxhs7 5m 19Mi\\nha-controller-knvg8 3m 21Mi\\nlinstor-controller-5b84bfc497-wrbdn 25m 168Mi\\nlinstor-csi-controller-8c9fdd6c-q7rj5 45m 124Mi\\nlinstor-csi-node-c46bb 3m 31Mi\\nlinstor-csi-node-knmv2 4m 29Mi\\nlinstor-csi-node-x9bhr 3m 33Mi\\nlinstor-operator-controller-manager-6dd5bfbfc8-cfp7t 10m 57Mi\\npotato-0 9m 87Mi\\npotato-1 10m 68Mi\\npotato-2 10m 57Mi\\n ________ ________\\n 127m 719Mi\\n
\\n\\n\\n\\nThat’s less than a quarter of a single CPU core and under 1Gi of the 6Gi available in my tiny cluster.
\\n\\n\\n\\nIf I create a LINBIT SDS provisioned persistent volume claim (PVC) replicated twice for a demo MinIO pod, we can check the utilization again while we’re actually running services. Using the following PVC and pod manifests, LINBIT SDS will provision a LINSTOR volume, replicate it between two nodes (as defined in my storageClass) using DRBD, and the MinIO pod will be scheduled with its data persisted on the LINBIT SDS managed storage.
\\n\\n\\n\\nroot@potato-0:~# kubectl apply -f - <<EOF\\napiVersion: v1\\nkind: Namespace\\nmetadata:\\n  name: minio\\n  labels:\\n    name: minio\\nEOF\\nnamespace/minio created\\n\\nroot@potato-0:~# kubectl apply -f - <<EOF\\nkind: PersistentVolumeClaim\\napiVersion: v1\\nmetadata:\\n  name: demo-pvc-0\\n  namespace: minio\\nspec:\\n  storageClassName: linstor-csi-lvm-thin-r2\\n  accessModes:\\n    - ReadWriteOnce\\n  resources:\\n    requests:\\n      storage: 4G\\nEOF\\npersistentvolumeclaim/demo-pvc-0 created\\n\\nroot@potato-0:~# kubectl apply -f - <<EOF\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  labels:\\n    app: minio\\n  name: minio\\n  namespace: minio\\nspec:\\n  containers:\\n    - name: minio\\n      image: quay.io/minio/minio:latest\\n      command:\\n        - /bin/bash\\n        - -c\\n      args:\\n        - minio server /data --console-address :9090\\n      volumeMounts:\\n        - mountPath: /data\\n          name: demo-pvc-0\\n  volumes:\\n    - name: demo-pvc-0\\n      persistentVolumeClaim:\\n        claimName: demo-pvc-0\\nEOF\\npod/minio created\\n
\\n\\n\\n\\nUsing the kubectl port-forward pod/minio 9000 9090 -n minio --address 0.0.0.0 command, I forwarded traffic on port 9000 from my potato-0 node to the MinIO pod. I then started an upload of a Debian image (583Mi) to a new MinIO bucket (bucket-0) using the MinIO console accessible at https://potato-0:9000. During the upload I captured the output of kubectl top again to compare against my previous results.
root@potato-2:~# kubectl top pods -n linbit-sds --sum\\nNAME CPU(cores) MEMORY(bytes)\\nha-controller-bk2rh 3m 22Mi\\nha-controller-bxhs7 5m 26Mi\\nha-controller-knvg8 6m 16Mi\\nlinstor-controller-5b84bfc497-wrbdn 72m 174Mi\\nlinstor-csi-controller-8c9fdd6c-q7rj5 30m 127Mi\\nlinstor-csi-node-c46bb 8m 33Mi\\nlinstor-csi-node-knmv2 3m 25Mi\\nlinstor-csi-node-x9bhr 2m 31Mi\\nlinstor-operator-controller-manager-6dd5bfbfc8-cfp7t 174m 58Mi\\npotato-0 4m 118Mi\\npotato-1 8m 131Mi\\npotato-2 7m 81Mi\\n ________ ________\\n 317m 846Mi\\nroot@potato-2:~# kubectl top node\\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\\npotato-0 2581m 64% 1623Mi 84%\\npotato-1 1556m 38% 1713Mi 88%\\npotato-2 1147m 28% 1636Mi 84%\\n
\\n\\n\\n\\nThat’s a pretty minimal difference. The LINSTOR satellite pods (potato-0, potato-1, and potato-2) are using a little more memory than in the first sample. The extra memory is most likely used to hold DRBD’s bitmap, because there are physical replicas on potato-0 and potato-1, and a diskless “tiebreaker” assignment on potato-2, which does not store a bitmap.
\\n\\n\\n\\nroot@potato-2:~# kubectl exec -n linbit-sds deployments/linstor-controller -- linstor resource list\\n+----------------------------------------------------------------------------------------------------------------+\\n| ResourceName | Node | Port | Usage | Conns | State | CreatedOn |\\n|================================================================================================================|\\n| pvc-e5c40aa3-e9b2-40dc-b096-e96a61a27d47 | potato-0 | 7000 | InUse | Ok | UpToDate | 2023-09-08 17:37:32 |\\n| pvc-e5c40aa3-e9b2-40dc-b096-e96a61a27d47 | potato-1 | 7000 | Unused | Ok | UpToDate | 2023-09-08 17:37:18 |\\n| pvc-e5c40aa3-e9b2-40dc-b096-e96a61a27d47 | potato-2 | 7000 | Unused | Ok | TieBreaker | 2023-09-08 17:37:40 |\\n+----------------------------------------------------------------------------------------------------------------+\\n
\\n\\n\\n\\nNow that the storage has data, I can simulate a failure in the cluster and see whether the data persists. I can tell that the MinIO pod is running on potato-0 from the linstor resource list command, which shows the PVC as InUse on potato-0. To do this, I used the command echo c > /proc/sysrq-trigger on potato-0. This immediately crashes the kernel, and unless you’ve configured your system otherwise, it will not reboot on its own.\\n\\n\\n\\nWhile waiting for Kubernetes to catch and react to the failure, I checked DRBD’s state on the remaining nodes and could see that potato-1, the remaining “diskful” peer, reported UpToDate data, so it would be able to take over services:
root@potato-2:~# kubectl exec -it -n linbit-sds potato-1 -- drbdadm status\\npvc-e5c40aa3-e9b2-40dc-b096-e96a61a27d47 role:Secondary\\n disk:UpToDate\\n potato-0 connection:Connecting\\n potato-2 role:Secondary\\n peer-disk:Diskless\\n\\nroot@potato-2:~# kubectl exec -it -n linbit-sds potato-2 -- drbdadm status\\npvc-e5c40aa3-e9b2-40dc-b096-e96a61a27d47 role:Secondary\\n disk:Diskless\\n potato-0 connection:Connecting\\n potato-1 role:Secondary\\n peer-disk:UpToDate\\n
\\n\\n\\n\\nAfter roughly five minutes, Kubernetes picked up on the failure and began terminating potato-0’s pods. I didn’t use a deployment, or any other workload resources for managing this pod, so it will not be rescheduled on its own. To delete a pod from a dead node I needed to use the force, that is: kubectl delete pod -n minio minio --force. With the pod deleted, I could recreate it by using the same command used earlier:
root@potato-2:~# kubectl apply -f - <<EOF\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n labels:\\n app: minio\\n name: minio\\n namespace: minio\\nspec:\\n containers:\\n - name: minio\\n image: quay.io/minio/minio:latest\\n command:\\n - /bin/bash\\n - -c\\n args:\\n - minio server /data --console-address :9090\\n volumeMounts:\\n - mountPath: /data\\n name: demo-pvc-0\\n volumes:\\n - name: demo-pvc-0\\n persistentVolumeClaim:\\n claimName: demo-pvc-0\\nEOF\\npod/minio created\\n
\\n\\n\\n\\nAfter the pod was rescheduled on potato-1, and the port-forward to the MinIO pod restarted from potato-1, I could once again access the console and see the contents of my bucket were intact. This is because the DRBD resource LINBIT SDS created for the MinIO pod’s persistent storage replicates writes synchronously between the cluster peers. This means that by using DRBD, you have a block-for-block copy of your block devices on more than one node in the cluster at all times.
\\n\\n\\n\\nIn this scenario, K3s happened to reschedule the MinIO pod on another node with a physical replica of the DRBD device, but this isn’t necessarily always the case. If K3s had rescheduled the MinIO pod on a node without a physical replica of the DRBD device, the LINSTOR CSI driver would have created what we call a “diskless” resource on that node. A “diskless” resource uses DRBD’s replication network to attach the “diskless” peer to a node in the cluster that does contain a physical replica of the volume, allowing reads and writes to occur over the network. You can think of this like NVMe-oF or iSCSI targets and initiators, except that it uses DRBD’s internal protocols. Since this may be undesirable for workloads that are sensitive to latency, such as databases, you can configure LINBIT SDS to enforce volume locality in Kubernetes.
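\\n\\n\\n\\nAs a hedged sketch of what enforcing locality can look like, the StorageClass below delays volume binding until a pod is scheduled and disallows diskless attachments. The parameter names are assumptions taken from the LINSTOR CSI driver documentation and may differ between LINBIT SDS releases, so treat this as a starting point rather than a drop-in configuration:
apiVersion: storage.k8s.io/v1\\nkind: StorageClass\\nmetadata:\\n  name: linstor-csi-lvm-thin-r2-local\\nprovisioner: linstor.csi.linbit.com\\nparameters:\\n  linstor.csi.linbit.com/storagePool: lvm-thin\\n  linstor.csi.linbit.com/placementCount: "2"\\n  # Disallow diskless attachments so pods only land where a replica exists (assumed parameter)\\n  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"\\n# Delay binding until a pod is scheduled so placement can follow the workload\\nvolumeBindingMode: WaitForFirstConsumer\\n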
\\n\\n\\n\\nLINBIT SDS is open source, with LINBIT offering support and services on a subscription basis. This means that the total cost of ownership (TCO) in terms of acquisition can be as low as the price of your hardware. My Potato cluster can be had for less than $100 USD from Amazon at the time I’m writing this blog. Realistically, you’re not going to run anything meaningful on a couple of Raspberry Pi clones, but I think I’ve made my point that you don’t need to spend tens of thousands of dollars for hardware to run a LINBIT SDS cluster.
\\n\\n\\n\\nThe other side of TCO is the operating cost. This is where TCO involving open source software gets a bit more abstract. The price of hiring a Linux system administrator familiar with distributed storage can vary widely depending on the region you operate in, and you’ll want to have enough other work to keep an admin busy to make their salary a good investment. If that makes open source sound expensive, you’re not wrong, but LINBIT stands by its software and its users, offering subscriptions at a fraction of the cost of hiring your own distributed storage expert.
\\n\\n\\n\\nUltimately the actual TCO will come down to the expertise your organization has on staff and how many spare cycles they can put towards maintaining an open source solution like LINBIT SDS. I feel like this is where I can insert one of my favorite quotes regarding open source software, “think free as in free speech, not free beer.”
\\n\\n\\n\\nI’ve proven, at least to myself but hopefully to you the reader, that you could run LINBIT SDS and Kubernetes on a cluster that fits in a shoe box, with a price tag that’s probably lower than the shoes that came in said shoe box. The efficiency of LINSTOR when coupled with the resilient block storage from DRBD makes running edge services possible using replaceable hardware. The self healing nature of Kubernetes and LINBIT SDS makes replacing a node as easy as running a single command to add it to the Kubernetes cluster, making the combination an excellent platform for running persistent containers at the edge.
\\n\\n\\n\\nAfter using this “Potato cluster” for a few days to write this blog, I am happy with it, but I’m also eager to tinker with other ARM-based systems that are a bit more powerful. In the past I’ve used DRBD and Pacemaker for HA clustering on small form factor Micro ATX boards with Intel processors to great success, but the low power and size requirements of newer ARM-based systems are attractive for edge environments. If you have experience with a specific hardware platform that could fit this bill, consider joining the LINBIT Slack community and dropping me a message.
\\n\\nMember post originally published on The New Stack by Kate Obiidykhata, Percona
\\n\\n\\n\\nOver the past few decades, database management has shifted from traditional relational databases on monolithic hardware to cloud native, distributed environments. With the rise of microservices and containerization, modern databases need to fit seamlessly into more complex, dynamic systems, requiring advanced solutions to balance scale, performance and flexibility.
\\n\\n\\n\\nFor large organizations navigating these complex environments, managing databases at scale presents myriad challenges. Companies with extensive data operations often face issues like ensuring high availability, disaster recovery and scaling resources efficiently. To tackle these, many adopt a hybrid approach, combining on-premises infrastructure with cloud resources to meet their diverse needs.
\\n\\n\\n\\nA natural result of this hybrid model is the push toward standardization. By consolidating various components, including databases, onto a unified infrastructure platform, organizations aim to reduce operational overhead and improve consistency across different environments, streamlining their overall operations.
\\n\\n\\n\\nAs Kubernetes has become the default infrastructure layer for many enterprises, running databases on Kubernetes is becoming more prevalent. Initially, there was skepticism about Kubernetes’ suitability for database workloads. However, this has changed as Kubernetes has matured and the community has developed tools and best practices for managing stateful applications.
\\n\\n\\n\\nFor platform engineers, Kubernetes offers a robust framework to build internal database management platforms. This approach allows for custom solutions tailored to specific organizational needs, such as automated provisioning and integration with existing CI/CD pipelines.
\\n\\n\\n\\nDespite the benefits, managing databases on Kubernetes introduces complexities. These include maintaining stateful applications, ensuring data consistency and integrating with existing infrastructure.
\\n\\n\\n\\nFortunately, the Kubernetes ecosystem has responded with tools like operators, which simplify the management of stateful applications by automating common tasks such as backups, scaling and updates.
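\\n\\n\\n\\nTo make the operator pattern concrete, here is a minimal sketch of the kind of declarative custom resource a database operator reconciles. The API group, kind, and field names below are illustrative placeholders rather than any specific operator’s schema; real operators (Percona’s, CloudNativePG, and others) define their own CRDs along these lines.

apiVersion: databases.example.com/v1    # hypothetical API group, for illustration only
kind: DatabaseCluster
metadata:
  name: orders-db
spec:
  engine: postgresql            # which database engine the operator should run
  replicas: 3                   # the operator keeps this many instances running
  backup:
    schedule: "0 2 * * *"       # declarative backup schedule, executed by the operator
  updateStrategy: RollingUpdate # updates are rolled out one instance at a time

\\n\\n\\n\\nA platform team only edits this one object (for example, bumping replicas from 3 to 5); the operator’s controllers carry out the imperative steps of scaling, backing up, and updating.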
\\n\\n\\n\\nKey approaches to database management on Kubernetes include:
\\n\\n\\n\\nThe shift towards Kubernetes and the evolution of open source tools have redefined how enterprises manage databases. Open source Percona Everest addresses many of these challenges by automating database provisioning and management across any Kubernetes infrastructure, whether deployed in the cloud or on premises.
\\n\\n\\n\\nFor enterprises seeking a flexible, scalable and cost-effective database solution, Percona Everest presents a compelling alternative to traditional database management strategies.
\\n\\n\\n\\n\\n\\nAmbassador post originally published on Medium by Dotan Horovits
\\n\\n\\n\\nWant to catch up on KubeCon’s highlights and takeaways? Take it from the experts who know the cloud-native space inside out — the CNCF Ambassadors!
\\n\\n\\n\\nI’d like to thank the Cloud Native Computing Foundation (CNCF), the organization behind KubeCon, for inviting me to host its official KubeCon+CloudNativeCon North America 2024 recap session. The CNCF hosts Kubernetes, Argo, Backstage, OpenTelemetry and over 200 other cloud-native projects we know and love. You can find the full recording of the session on the KubeCon + CloudNativeCon Salt Lake City 2024 playlist on the CNCF’s official YouTube channel.
\\n\\n\\n\\nIn this recap session I sat down with fellow CNCF Ambassadors and cloud-native experts Viktor Farcic and Max Körbächer to unpack the major project announcements and key themes from Salt Lake City: the standout talks, co-located events, and those memorable hallway conversations.
\\n\\n\\n\\nIn this post, I’ll share a recap of that highlight-packed hour-long discussion. Let’s go!
\\n\\n\\n\\nViktor Farcic is a lead rapscallion at Upbound and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox. Or as he presented himself: “the Crossplane guy”.
\\n\\n\\n\\nMax Körbächer is Co-Founder at Liquid Reply. He is Co-Chair of the CNCF Environmental Sustainability Technical Advisory Group and served for three years on the Kubernetes release team. He runs the Munich Kubernetes Meetup as well as the Munich and Ukraine Kubernetes Community Days.
\\n\\n\\n\\nWhat an amazing KubeCon we had at Salt Lake City! It was the largest KubeCon in North America after the Covid pandemic, hosting over 9,000 attendees.
\\n\\n\\n\\nIt was also amazing to see how many hands went up when Chris Aniszczyk, the CTO of the CNCF, asked at the opening keynote who was attending for the first time: about half of the crowd. Chris confirmed we have about 50% new folks. Stay tuned for the CNCF’s transparency report, which we always publish after these events.
\\n\\n\\n\\nThis year’s KubeCon was also an opportunity to celebrate a decade of Kubernetes, the project that started the CNCF and the cloud-native ecosystem. It’s astonishing to see how it has grown: at the time of this KubeCon we have 208 projects under the CNCF, run by 255k contributors from across 193 countries. The CNCF landscape is getting crowded (so much so that I started guiding people on how to navigate the CNCF landscape :-))
\\n\\n\\n\\nLet’s look at some KubeCon news and highlights from these projects.
\\n\\n\\n\\nA new joiner to the CNCF is Flatcar. It’s actually the first time the CNCF has adopted an operating system distribution. As Chris Aniszczyk rightly put it: “A secure community-owned cloud native operating system was one of the missing layers of the CNCF technology stack”.
\\n\\n\\n\\nFlatcar provides a lightweight Linux OS, derived from CoreOS, that is specifically tailored for hosting container workloads. Max says the question of who would take care of the project had been hanging around for a while. He points out its value for platform engineers who are looking for a recommended golden image for their organization.
\\n\\n\\n\\nViktor noted that, while it’s an important addition, it’s a low-level component that end users aren’t exposed to and don’t really care about, especially when running on managed Kubernetes like EKS or GKE.
\\n\\n\\n\\nA big kudos to the Kinvolk team that originally developed it, and to Microsoft for evolving it these past years since the acquisition and now contributing it to the CNCF Sandbox. For more details, check out the announcement.
\\n\\n\\n\\nAlongside new projects joining the CNCF Sandbox, we see projects maturing from Sandbox to Incubation, one of which is wasmCloud, the popular WebAssembly platform.
\\n\\n\\n\\nI’ve noted in the past that WebAssembly is the next frontier in cloud-native evolution, and wasmCloud is a cornerstone of this movement within the CNCF stack. The project has over 100 regular contributors representing 73 unique companies.
\\n\\n\\n\\nwasmCloud is deployed by major organizations such as Adobe, Orange, MachineMetrics, TM Forum member CSPs, and Akamai Technologies. It’s interesting to see the wide variety of use cases for wasmCloud out there, from industrial IoT and automotive to digital services and banking. Max notes in particular the talk by Siemens at the wasmCon co-located event, in which they shared their use case for embedded development.
\\n\\n\\n\\nViktor thinks that WASM isn’t here to replace containers, but rather provides its value alongside containers, with use cases such as edge computing. Max notes that if you run WASM in Kubernetes, it can leverage practically the entire Kubernetes ecosystem. For more on the wasmCloud maturation, check out the announcement blog.
\\n\\n\\n\\nThe highest maturity level in the CNCF is Graduation. This KubeCon we saw two important projects reaching graduation: Dapr and cert-manager.
\\n\\n\\n\\nDapr, the Distributed Application Runtime project, has today 3,700 individual contributors from more than 400 organizations, which is a good testament to the project’s maturity and sustainability. It is used by tens of thousands of organizations, including Grafana Labs, FICO, HDFC Bank, SharperImage.com and ZEISS.
\\n\\n\\n\\nThere was a big graduation party with Dapr folks at Salt Lake City to celebrate this occasion. Congrats to the Microsoft Azure team, which founded the project and remains a lead maintainer, to Diagrid, which also helps lead it, and to all the other maintainers and everyone involved. See the announcement post.
\\n\\n\\n\\nAnother critical component in our Kubernetes stack reaching graduation is the cert-manager project. It is no wonder this project has graduated, as it’s pretty much the de-facto standard for issuance and renewal of TLS and mTLS certificates these days, and we tend to take it for granted. In fact, 86 percent of new production clusters are created with cert-manager deployed as standard practice!
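\\n\\n\\n\\nFor readers who haven’t used it, this is roughly what “taking it for granted” looks like: a minimal Certificate resource, sketched below and assuming a ClusterIssuer named letsencrypt-prod already exists in the cluster, is all it takes for cert-manager to issue a TLS key pair into a Secret and keep renewing it before expiry.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: web-tls
  namespace: default
spec:
  secretName: web-tls            # cert-manager stores the issued certificate and key here
  dnsNames:
    - app.example.internal       # illustrative hostname
  duration: 2160h                # 90-day certificate
  renewBefore: 360h              # renew 15 days before expiry
  issuerRef:
    name: letsencrypt-prod       # assumed, pre-existing ClusterIssuer
    kind: ClusterIssuer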
\\n\\n\\n\\nAs we looked back at the times, a decade ago, when we could spend half a day setting up certificates or lose clusters to expired certificates, Viktor, Max, and I agreed it’s bliss to have it baked in. See the announcement post.
\\n\\n\\n\\nThe biggest news on the observability front was the major releases of two graduated projects: Jaeger v2 and Prometheus v3. Interestingly, a common thread runs through both of these releases, and that’s OpenTelemetry.
\\n\\n\\n\\nJaeger has been rearchitected to take advantage of the OpenTelemetry Collector framework, while Prometheus aims to become the de-facto backend for OpenTelemetry metrics. It’s wonderful to see the collaboration across these CNCF projects, driving toward unified observability.
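\\n\\n\\n\\nAs a rough illustration of that convergence, here is a minimal OpenTelemetry Collector-style configuration that accepts OTLP data and forwards metrics to Prometheus via remote write. The endpoint is a placeholder, the prometheusremotewrite exporter ships in the collector-contrib distribution, and Prometheus must have its remote write receiver enabled for this to work.

receivers:
  otlp:
    protocols:
      grpc: {}                  # applications send OTLP over gRPC
processors:
  batch: {}                     # batch telemetry before exporting
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus.example.internal:9090/api/v1/write   # placeholder endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]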
\\n\\n\\n\\nOn the OpenTelemetry side, we’ve shared during KubeCon the great work of the CI/CD Observability Special Interest Group (OTel CI/CD).
\\n\\n\\n\\nFor more information see the CNCF blog.
\\n\\n\\n\\nIn addition, OpenMetrics has been archived and merged into Prometheus, and the community is now embarking on OpenMetrics 2.0. A new working group has been founded under Prometheus, and its first focus will be requirements gathering and scoping for 2.0. This is your opportunity to influence it. You can read more in this post.
\\n\\n\\n\\nAlongside new projects, at the CNCF we look at how to help users design and architect their systems for the various use cases based on the cloud-native stack. The End User TAB (technical advisory board) is a forum to collect valuable end-user feedback and learnings.
\\n\\n\\n\\nUnder the TAB, the Reference Architecture working group was formed to collect references and provide practical guidance and examples for building cloud native applications, and has launched its first Reference Architectures — Scaling through Platform Engineering at Allianz Direct and Scaling Adobe’s Service Delivery Foundation with a Cell-based Architecture.
\\n\\n\\n\\nAs a former systems architect, I’ve often discussed these real-life architecture patterns at KubeCon and other community gatherings. I’m excited about this new initiative, as it facilitates this sort of discussion and enables the community to share design blueprints from large-scale production deployments in an organized fashion.
\\n\\n\\n\\nWe now have a new dedicated CNCF sub-domain for that: architecture.cncf.io. Bookmark it, as I expect to see more reference architectures added there over time. In fact, you can also submit your own reference architecture on that page, to share it with the world.
\\n\\n\\n\\nWhat? No AI? Don’t worry. In our hour-long fireside chat we covered artificial intelligence and machine learning, which were a major theme at this KubeCon, as well as platform engineering, environmental sustainability initiatives, and many more topics.
\\n\\n\\n\\nCheck out the full episode on YouTube or on OpenObservability Talks on all major podcast apps.
\\n\\nCertifications & Training post originally published on Medium by Giorgi Keratishvili
\\n\\n\\n\\nMost probably, your LinkedIn feed is full of posts from people speaking about the Kubestronaut program or even becoming one of them, and you still wonder what “Kubestronaut” even means. Then, my dear friend, buckle up. We will dive deep into what this program does, how it may help you with your career, how to get involved in community activities, and what kind of opportunities Kubestronaut can give you. We will go a bit beyond…
\\n\\n\\n\\nBefore we start, let me introduce myself. My name is Giorgi Keratishvili and I am from Georgia. I have been in the IT field for more than a decade, and during this period I have been exposed to most areas of development and operations, from bare metal infrastructure to higher levels of automation, agile transformation, container orchestration, platform engineering, SRE, incident management, and cloud native maturity. Besides my profession, I am very actively involved in the community: I am an AWS Community leader in Tbilisi, an Ambassador for the CD Foundation, a CNCF Kubestronaut, a CNCF chapter lead, an IEEE Senior Member, and an Ambassador for the Institute of DevOps. I conduct meetups to grow DevOps/SRE/Platform Engineering awareness in the Georgian community. I have been working with Kubernetes for the last 5 years and have been passionate about cloud native ever since…
\\n\\n\\n\\nBefore we start speaking about the program, let’s debunk the myths and legends about certifications, because during my career I was often asked, “Would getting XXX certification make me a 10x engineer?” People split into two camps: those who believe in certifications and those who think they are a waste of time, money, and energy. The answer to that question depends heavily on multiple factors; the first thing we need to understand is what purpose a given cert is intended for and what we can expect from it.
\\n\\n\\n\\nWe need to look at certifications from different perspectives, depending on where we are on the career ladder. Let’s split this into three parts.
\\n\\n\\n\\nWhen juniors or interns need advice on how to learn a new skill or get some experience, they often feel lost in a maze of information. First, we need to understand which domain they need to learn, because in software engineering there are not as many well-known certifications as there are in system engineering or DevOps. Maybe doing a project would be much better than passing some certification, because during job interviews the interviewer will often discuss a candidate’s project if it is on GitHub or publicly available. But DevOps-related topics such as cloud, Kubernetes, and Linux are a different story.
\\n\\n\\n\\nFor those, I tend to recommend certifications, because it is really hard to do a project by yourself without foundational knowledge. You cannot expect a person without any background in system administration to understand concepts like high availability (HA) and fault tolerance (FT); this kind of knowledge takes time to absorb. Heck, even some experienced engineers cannot explain it during interviews…
\\n\\n\\n\\n\\n\\n\\n\\n\\nI’m a big believer in learning, training, and certifications because they provide a clear path, goal, and target. Certifications offer a structured roadmap, helping you gain deep knowledge in specific areas. While certifications may not make you a complete master in a subject, they provide a strong foundation and clear direction for continuous learning and growth.
\\n\\n\\n\\nRamesh Kumar — Kubestronaut Program Initiator
\\n
By taking a certification path we have an exact target, goal, and direction rather than just shooting at the sky. Certifications also come with good training materials, so candidates can follow a structured path to prepare. Sometimes I compare taking a certification to using a compass: it can point you in the right direction if you know where you want to go and what to expect there, or it can get you lost if you don’t know the exact destination, or in our case, if you don’t have a goal.
\\n\\n\\n\\nFor recruiters, certifications make it a bit easier to differentiate candidates, but an important factor is which type of certification it is. Exams are structured in many different ways, performance-based or multiple choice, and not all are created equal in terms of difficulty and credibility.
\\n\\n\\n\\nFor mid-level or senior people, I would say certifications may not have much impact. If you are switching careers they could make a difference, but most of the time real-world company experience is much more important than a gazillion certifications. That said, some companies base promotions on certifications, and in that case I can relate, but again, that tends to apply at lower levels of seniority.
\\n\\n\\n\\nNow let’s look from the perspective of consultants, as they are a bit different from full-time employees: most of the time they are called in to solve specific problems where a company may lack experience. When drafting requirements, companies quite often use certifications as selection criteria for tenders, which makes them more valuable to business representatives at the hiring company. Also, when a contractor company or individual has accumulated a certain set of certifications, they can apply to become an official partner of that vendor or a certified specialist, similar to an AWS Services Partner or a Red Hat Certified Architect.
\\n\\n\\n\\nNow let’s speak about the program, its requirements, and what you can expect after achieving Kubestronaut status. I will not cover exam preparation in this blog, but you can check my exam guides for the Kubestronaut journey at the bottom.
\\n\\n\\n\\nAs you may already know, in order to get Kubestronaut status a candidate needs to hold 5 active Kubernetes certifications: the Kubernetes and Cloud Native Associate (KCNA), Kubernetes and Cloud Native Security Associate (KCSA), Certified Kubernetes Application Developer (CKAD), Certified Kubernetes Administrator (CKA), and Certified Kubernetes Security Specialist (CKS) exams. Yes, all of them need to be active/valid. Here are more details on the requirements and FAQ:
\\n\\n\\n\\nTo become a Kubestronaut, you need to pass all Kubernetes-related certifications: CKA, CKAD, CKS, KCNA, and KCSA. All certifications must be current/active (i.e., in good standing/valid, not expired).
\\n\\n\\n\\nAfter completing your 5th Kubernetes certification, expect an email from CNCF staff with next steps. You’ll be asked to fill out a Google form with required information, enabling CNCF to dispatch your Blue Jacket and feature your profile on the Kubestronaut map.
\\n\\n\\n\\nKubestronauts will receive the title of “Kubestronaut”, as well as these additional benefits:
\\n\\n\\n\\nAnswer. No. To become a Kubestronaut and get the jacket, you need to have all 5 Kubernetes-related certifications — and all must be active certifications. You will have to pass the missing exams to become a Kubestronaut.
\\n\\n\\n\\nYou will still be a Kubestronaut until the end of the year. Here is how CNCF is managing the expiration:
\\n\\n\\n\\nAs a former Kubestronaut who requalifies, you will once again enjoy the same benefits as all other Kubestronauts, except the jacket, since a Kubestronaut can only receive one single Kubestronaut jacket during their lifetime.
\\n\\n\\n\\nYes, you can use the voucher for any CNCF exam that is available even if it was not released when you received the voucher. But keep in mind that those vouchers are for an individual exam only and not applicable to bundles (training course + certification or multi-exam.)
\\n\\n\\n\\nAs a Kubestronaut, to get your vouchers to get 20% off a CNCF event (KubeCon or KubeDays) each year, simply contact training@cncf.io and we will be able to give you all the needed elements.
\\n\\n\\n\\nAt this point, we understand what the requirements are and what journey we need to take. In the end, I would say it is very satisfying to be recognized for preparing for these exams and passing them. Personally, when I passed all of these exams there was no such thing as Kubestronaut; I did it purely out of interest, and for a couple of them I was even selected as a beta tester to gain a better understanding of the Kubernetes ecosystem. But how can the Kubestronaut program help us?
\\n\\n\\n\\nThe first thing is that the program provides five 50% discount vouchers for certifications. A small tip: the Linux Foundation has the Tux rewards program, where you earn 1 point for every $1 spent on any course, certification exam, bundle, or bootcamp you buy from their platform (500 points earns you a 50% off coupon). If some of your certs are expiring, this is a good way to get a discount and keep the status. Also, as part of the Kubestronaut program, they provide discounts for CNCF and LF events such as KubeCon or Open Source Summit. Besides that, you get a cool jacket, which is very nice to be honest, but the bigger question is: what next?
\\n\\n\\n\\nI would say it purely depends on whether the person wants to contribute or grow further. The thing is, Kubestronaut is a status for Kubernetes “Subject Matter Experts,” and as the CNCF is always trying to expand its community influence, Kubestronauts can join many initiatives, such as becoming an SME for the creation of future CNCF certifications. Personally, I have authored exam questions for the Certified GitOps Associate (CGOA), the Kyverno Certified Associate (KCA), and one more, yet to be announced, from the Kubernetes ecosystem.
\\n\\n\\n\\nI joined the Infrastructure Lifecycle Working Group as a contributor, had the opportunity to speak at KubeCon China 2024 about the Kubestronaut program, and have been active with my local community, CNCF Chapter Tbilisi, organizing meetups and collaborating with the AWS User Group Tbilisi.
\\n\\n\\n\\nI am personally a big believer that we get what we give. I always love to connect with like-minded people, and the whole goal of this program is exactly that: to connect like-minded people to have a better influence on growing the cloud native and open source communities. Also, a small hint: there is discussion of going beyond the Kubestronaut program, which we may hear about soon.
\\n\\n\\n\\nIn the end, I can say I am very happy to be part of the Kubestronaut community, as there are many interesting people in it, and in some way it has opened doors for me to network with some very interesting fellows. Overall, I am thankful to the CNCF for providing such great opportunities.
\\n\\n\\n\\nOn this note, I want to highlight one very important thing I mentioned above: certifications are a good way to have an achievable goal or target, but having a certification does not make a person an expert or SME. I would say that a person who holds a certificate has basic knowledge and a foundation for working with that technology, but expertise comes with experience.
\\n\\n\\n\\nAs a person who conducts promotional assessments and technical interviews at my work, I can always see who has theoretical knowledge, who has practical experience, and who used exam dumps…
\\n\\n\\n\\nHere is the list for exam prep guide needed for Kubestronaut:
\\n\\n\\n\\nAmbassador post by Leo Pahlke, CNCF Ambassador and CNCF TAG Environmental Sustainability Chair
\\n\\n\\n\\nOpen source is a fascinating space, where you are surrounded by emerging technologies and where you can directly engage with and have an impact on their future. “Engineering on the shoulders of giants.” All of this in a collaborative, open environment where you learn and contribute at the same time.
\\n\\n\\n\\nI’ve been thinking more about the space lately, which led to writing this article. In times when we think about sustainability not just in terms of the environment and climate, but also in terms of social and economic sustainability, how we like to work together, how we can drive further innovation and build new technologies, open source offers an interesting contrast. I hope and believe that we can learn from it and adapt practices not only between open source communities, but also within companies.
\\n\\n\\n\\nThis article explores concepts of how open source software is done, based on my experience. What are the driving forces? Why are innovative projects often open source projects? How are companies adopting open source in their business? What might be the next steps for the field? Are there any lessons we can learn from open source? There are plenty of questions that point to different matters. Hopefully, there are bits and pieces that help you understand the space better and give you fuel to build on your company’s open source strategy. Okay, let’s talk about my latest open source snapshot.
\\n\\n\\n\\nOpen source began as a movement decades ago and has continued to evolve. It involves collaborating on software projects and other matters, such as data or standards. In this article, we will focus on open source software. Open source approaches vary depending on cultural contexts, perspectives, preferences, and your goals and interests. Open source software is driven by its contributors and maintainers, not by companies: engineers who freely spend time pushing a project forward and sharing their work with others. Engineers don’t set the entire picture on their own; adopters (companies, institutions) and open source foundations are also part of it, but it all stands on engineers building software and sharing their work.
\\n\\n\\n\\nOne of the core reasons open source is possible lies in the nature of software—it can be shared with little to no cost, as software is a form of information. This may sound strange and irrelevant, but this opens up a lot. Software teams can spend years developing a product, but once it’s finished, it can be copied and shared across networks almost instantly.
\\n\\n\\n\\nEngineers generally like to tinker, experiment, and learn new technologies, and, like everyone else, to show and share our work with others. Unlike some other engineering fields, software engineering makes tinkering and experimenting easy. No costly manufacturing, supply chains, or expensive equipment are required, meaning you don’t need company or university backing. This leads to some interesting effects, one of which is the potential to democratize software innovation. With some basic computing power and curiosity about software, you are set up to journey through tech. In this regard, the barriers to shaping the digital world are low.
\\n\\n\\n\\nTogether, these two factors explain why open source emerged and became a widespread movement. Beyond those, there are other practical reasons why openness and collaboration can benefit your company goals, as explored a bit later.
\\n\\n\\n\\nWonder and curiosity spark exploration and experimentation, which can lead to the creation of new projects. If we share our work, this eventually leads to an ocean of projects, with a few that stand out and gain further traction. This creates a self-reinforcing cycle, where successful projects attract more attention and contributions. Since these projects are open, engineers can learn, experiment, and build on top of the project as a base. Over time, this creates a network of interconnected projects, where each one influences the others. Influential projects can gradually build an ecosystem that is defined by APIs, culture, and mindsets.
\\n\\n\\n\\nIf you have experience collaborating in open source communities and contrast it with a company you have worked for, the processes are vastly different. Both generate value. The open source space shows that, when conditions are a bit different, the entire motion of how software is developed can take a different shape. Open source is chaotic, and maintainers put simple guidelines in place to make it work. Which ideas can we take from open source and bring to the engineering teams at our companies? Think about it.
\\n\\n\\n\\nAnd of course, reality is slightly more complex. In the context of open source projects, social structures emerge, encompassing communities that collaborate, organize meetings, establish contribution platforms, establish communication channels, and so forth. It is not only about innovation and building new technologies, but also about community, learning, and other things. The Linux kernel people have a different approach to building open source than the React or Hyperledger communities.
\\n\\n\\n\\nOver time, multiple reasons have emerged for why folks end up contributing to open source, with tinkering and experimentation at the core. Purpose, change, fellowship in communities, reputation, experience, freedom to build, and passion are some more.
\\n\\n\\n\\nFrom my experience, people usually get into open source because of a mix of all of these. It’s not always a rational choice, but rather a path that presents itself, and you take a look.
\\n\\n\\n\\nWith new open source projects coming up, the IT landscape shifts over time. The IT landscape has been influenced by open source for decades. It’s nothing new. Companies like Google have been founded with open source in their DNA, embracing free and open services and releasing new open source projects. Others, like Red Hat or SUSE, are building their business model on top of it, expanding on open source software to offer enterprise versions and support services to companies. Others, such as Microsoft, have further advanced their company strategy by partially abandoning proprietary software in favour of a hybrid approach that encompasses open source. Everyone mentioned uses open source software, but they have their ideas, plans, and priorities.
\\n\\n\\n\\nCompanies’ engagement in open source is not triggered by wonder and the freedom to create, as described for engineers in the previous section, but rather simply by creating business value. Can open source be used to generate more business value? Yes, it can!
\\n\\n\\n\\nIf your company is concerned with software, it becomes clear that software engineering is a complex endeavour. To manage complexity, we encapsulate logic and divide and conquer. That’s important for dealing with complexity, but it also opens up the possibility of sharing your projects with others effectively. Components can come from open source or be developed internally; ultimately, it doesn’t really matter: you use interfaces. The result matters. Results generate business value. Open source helps you to achieve your goals faster, to focus on your business value. Therefore, open source is about efficiency. It’s also about reducing the engineering complexity you need to deal with. Since the project is encapsulated, all the business logic is (or should be) safe and secure with you, and you do not lose your competitive edge.
\\n\\n\\n\\nHowever, there is a crucial prerequisite for open source to be effective: you must separate your project’s core business logic from the more general components that can be shared and reused. Open source thrives when it focuses on extracting general ideas, approaches, and algorithms that are applicable across various contexts. From my experience, open source is about creating generic solutions that serve as foundational building blocks. In a way, it shares similarities with fundamental research: just as research seeks to uncover universal truths about the world, open source seeks to define and share fundamental technologies that can be built upon and adapted in many settings.
\\n\\n\\n\\nThere are more reasons for companies to engage with open source, as summarised in the table below.
\\n\\n\\n\\n| Category | Reason | Description |
| --- | --- | --- |
| Operational & Engineering | Increase software engineering efficiency | Reduce overhead — “Do not reinvent the wheel” |
| Operational & Engineering | Reduce software engineering complexity | Focus on your business value and expertise |
| Operational & Engineering | Agility and innovation | Ability to change your systems, react to new requirements, and innovate products and services |
| Strategic & Business Impact | Digital sovereignty | Independence from proprietary software and foreign vendors |
| Strategic & Business Impact | Collaborate and influence standards | Shaping the direction of industry standards |
| Strategic & Business Impact | Customer and partner relationship | Stronger ties with external stakeholders |
| Strategic & Business Impact | Sales and monetisation | Opportunities for business models |
| Strategic & Business Impact | Public image and prestige | Enhancing brand value by being an active open source citizen |
| Strategic & Business Impact | Attract engineering talent | Using emerging software to appeal to talented engineers |
| Strategic & Business Impact | Developer ecosystem | Leveraging open source communities to engage with developers |
| Transparency | Build trust through transparency | Building trust by making the source public |
| Transparency | Data privacy and security | Independent auditing |
| Transparency | Ethical reasons | Values of openness, fairness, and inclusion |
| Transparency | Legal and Compliance | Legislation to publicise internal algorithms, et al. |
At the same time, there are several risks to consider when adopting open source as a strategic goal. The nature and impact of these risks will vary on a case-by-case basis. One of the primary risks is the commitment to taking ownership of your software stack. Contributing to and maintaining open source projects requires investments in engineering excellence—not just in writing code, but in ensuring long-term sustainability, quality, and security. Additionally, software engineering is in some ways a creative process that can be difficult to manage, and this may require a shift in mindset. It’s not just about the tools; it’s about fostering a culture where engineers take ownership of the entire project, not just their individual contributions. This shift places added pressure on talent acquisition and retention, as companies must attract engineers with the skills and mindset to thrive in such an environment. Finally, the total cost of ownership with open-source solutions can be less transparent than with proprietary alternatives. While open source can reduce licensing fees, the costs associated with maintaining, integrating, and securing these tools can accumulate. However, if a company aims to be a leader in the digital space, these risks are likely ones that will need to be addressed anyway.
\\n\\n\\n\\nOpen source plays a role for every company that does software engineering; however, the engagement is vastly different. There is quite a discrepancy between using open source because software engineering is nowadays not possible otherwise and being an open source citizen. To invest in open source requires driving factors, which we explored in the previous section. These could be transparency-related, for example, or expanding the monetization model. This driver needs to be picked up and integrated into the business strategy. If that’s the case, more active engagement and leadership can be achieved.
\\n\\n\\n\\nInvesting in open source and becoming an active open source citizen may still be a goal for the future, but not something to realize soon. Organizations are still in the midst of their digital transformation, working to reach a level of maturity that would enable them to play a more active role in the open source community. Open source itself exists on a spectrum, ranging from exclusive reliance on proprietary software to fully embracing the open source spirit. A balanced middle ground may be a practical approach during the digital transformation. Engagement with open source can be gradual, allowing companies to build capabilities and expertise over time while transitioning towards more active involvement.
\\n\\n\\n\\nStrategies for companies to adopt open source vary. To figure out how to manage your open source engagement within the company, you can take a look at the resources by the TODO Group.
\\n\\n\\n\\nAlexis de Tocqueville suggested that change arises when there is pressure from both the top and the bottom. This can be interpreted in several ways: there needs to be a clear vision, guidance, and direction from the top, while commitment, energy, and passion from the bottom drive progress. A vision alone won’t create change, and simply putting in the work doesn’t guarantee it either. This dynamic can be understood in both large and small contexts. For example, on one hand, maintainers at the “top” create and release software; on the other, companies and developers at the “bottom” provide the energy and motivation to adopt, modify, and advance it and their own products or services. Change happens. Guiding and empowering this process is key to facilitating meaningful change and encouraging ongoing contribution.
\\n\\n\\n\\nOpen source foundations play a vital role in building communities, setting guidelines, and fostering collaboration. They sit in the middle as a “neutral body”, since open source thrives on collaboration rather than confrontation. This is where open source foundations come in—they help negotiate between the interests of maintainers and adopters and empower contribution. While some open source projects thrive without the backing of a formal foundation, a neutral body is often essential for nurturing a strong community and ensuring the long-term success of both the project and its technology stack. Without such an entity, it can be much more difficult to build and sustain a community, which in turn hampers the future development of the project.
\\n\\n\\n\\nOne example of how this is done can be observed by looking at the structure of the CNCF, where you have different structures bringing expertise and representing the interests of maintainers and end users. Together, we decide on governance and collaboration guidelines.
\\n\\n\\n\\nMy journey through open source projects and communities so far has shown me an interesting contrast to regular software engineering practices. Open source complements companies’ software engineering capabilities every day. Playing a more active role in open source comes with some risks due to the chaotic nature of the space. Still, I believe that to be a leader in technology, it’s important to take these risks intentionally and benefit from shaping technology not just for your customers but also for everyone else. As explored, there are plenty of motivations for it.
\\n\\n\\n\\nA question remains: when will more traditional companies, which rely on software now and in the future to stay competitive, reach the digital maturity to contribute to open source like digital natives do? Some companies have already reached this stage. In this light, it remains to be seen how the space evolves. It is difficult to create a collaborative and innovative environment in a company; open source can inspire us to imagine a more collaborative and democratic software engineering process going forward.
\\n\\nMember post by John Matthews and Savitha Raghunathan, Red Hat
\\n\\n\\n\\nMigrating legacy software to modern platforms has long been a challenging endeavor for businesses. Companies often need to move decades-old systems to newer technologies without causing disruptions. Konveyor is an open source project aimed at supporting enterprises with the modernization of their applications, especially in cloud-native environments. It provides a suite of tools and services that enable organizations to migrate and adapt their workloads to Kubernetes and cloud-native platforms. Now, with Konveyor AI (Kai), the focus expands to using AI-driven solutions to make the modernization journey even smoother and more efficient, and that is what we will explore in this blog post.
\\n\\n\\n\\nKonveyor AI, or “Kai,” is aimed at assisting with these migrations. By applying Generative AI methods, Kai helps organizations seamlessly transition legacy codebases to modern platforms through tailored code suggestions.
\\n\\n\\n\\nKai uses the Retrieval Augmented Generation (RAG) approach, combining two sources of intelligence: static code analysis and past migration examples within an organization. This enables Kai to offer highly relevant code suggestions based on how similar migration challenges have been tackled before, avoiding the need for extensive AI retraining.
\\n\\n\\n\\nBy combining these data sources, Kai can generate context-aware suggestions that help developers modernize their applications while staying consistent with how their organization typically solves problems.
\\n\\n\\n\\nKai’s workflow integrates seamlessly into the development process:
\\n\\n\\n\\nThis process allows developers to efficiently address migration issues without having to manually comb through old projects or reinvent solutions from scratch.
\\n\\n\\n\\nTo better understand how Kai works, we put together a Kai demo, where we showcase its capabilities in facilitating the modernization of application source code to a new target. We focus on how Kai can handle various levels of migration complexity, ranging from simple import swaps to more involved changes, such as modifying scope to meet CDI bean requirements. Additionally, we look into migration scenarios that involve EJB Remote and Message Driven Bean (MDB) changes. We also concentrate on migrating a partially migrated Java EE Coolstore application to Quarkus, a task that involves not only technical translation but also considerations for deployment to Kubernetes.
\\n\\n\\n\\nKai is designed to address several key challenges that are common in large-scale modernization projects leveraging generative AI.
\\n\\n\\n\\nLLMs can only process a limited amount of data at a time, referred to as their context size. Since most legacy codebases are too large to be processed in a single request, Kai uses the results from static code analysis to narrow down the problem to specific areas of the code that need attention. This helps to avoid overwhelming the LLM and ensures that it remains focused on solving manageable, well-defined issues.
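\\n\\n\\n\\nAs a purely illustrative sketch (the field names below are hypothetical, not Kai’s actual schema), the idea is that a single, well-scoped analysis finding, rather than the whole repository, is what gets handed to the LLM:

# Hypothetical, simplified shape of one static-analysis finding
incident:
  ruleId: replace-javax-with-jakarta      # illustrative rule name
  file: src/main/java/com/example/store/OrderService.java   # illustrative path
  line: 12
  message: "The javax.inject package has been replaced by jakarta.inject."

\\n\\n\\n\\nKai can then assemble a prompt from a finding like this, the single affected file, and a few previously solved examples of the same issue, keeping the request comfortably within the model’s context window.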
\\n\\n\\n\\nOne of the challenges in application modernization is dealing with changes that affect multiple files across a repository. For example, changing a method’s signature may require updates in other files that call that method. While Kai currently focuses on file-specific changes, future iterations will tackle repository-wide changes, drawing inspiration from Microsoft’s CodePlan research which will allow Kai to propagate changes automatically across related files.
\\n\\n\\n\\nMany organizations use custom frameworks and technologies that are not widely known or supported by existing AI models. Kai addresses this by incorporating examples of how an organization has handled similar migration problems in the past. This few-shot prompting technique provides the LLM with additional context to help it generate relevant code suggestions, even when dealing with proprietary or unfamiliar frameworks.
\\n\\n\\n\\nKai is designed to be model agnostic. As LLMs continue to evolve, Kai remains adaptable, allowing organizations to switch between different models as needed, whether using public, private, or local AI services.
\\n\\n\\n\\nKai includes an agent that iteratively refines code suggestions by checking the validity of the initial output and providing feedback to the LLM. This iterative process helps improve the quality of the final code solution, ensuring that it addresses the problem effectively.
\\n\\n\\n\\nMaintainers of Kai are actively working on development, with a heavy focus on addressing cascading changes across large codebases and improving the IDE experience. We regularly produce early evaluation builds that adopters can test and use.
\\n\\n\\n\\nFor those interested in contributing to Kai or learning more about its development, there are several ways you can get involved:
\\n\\n\\n\\nKonveyor AI aims to provide practical, efficient support for application modernization. By leveraging static code analysis and prior solved examples, it offers code suggestions that can help organizations migrate their legacy codebases without having to rebuild solutions from the ground up. With a flexible and adaptable design, Kai is poised to assist companies in meeting their modernization goals, regardless of the technology stack or the size of the codebase.
\\n\\nBy Nate Waddington, Head of Mentorship and Documentation, CNCF
\\n\\n\\n\\nOpen source projects rely on strong communities. Mentorship programs like LFX Mentorship and Google Summer of Code offer maintainers a chance to bring new contributors into their projects, fostering long-term growth and sustainability.
\\n\\n\\n\\nBut mentorship isn’t just about onboarding new talent—it’s also a way to expand your network, demonstrate leadership, and build a stronger foundation for your project’s future.
\\n\\n\\n\\nThis post highlights the benefits of mentorship for maintainers, explains what mentees can bring to your project, and shares key details about the upcoming CNCF mentorship opportunities.
\\n\\n\\n\\nMentorship programs aren’t just for aspiring contributors—they’re a strategic tool for project maintainers. By welcoming mentees, you can:
\\n\\n\\n\\nMentoring is more than an act of generosity—it can be a career-defining opportunity for maintainers:
\\n\\n\\n\\nCNCF mentees are prepared to make meaningful contributions, such as:
\\n\\n\\n\\nMentorship creates a ripple effect, benefiting projects long after the term ends:
\\n\\n\\n\\nGet ready to make an impact; the call for 2025 CNCF LFX Mentorship Term 1 project proposals opens soon:
\\n\\n\\n\\nBe part of the team shaping the future of cloud native. The Mentorship Working Group is a collaborative space for maintainers to share insights, develop strategies, and support mentorship efforts. Meetings are held on the third Thursday of each month, and new members are always welcome.
\\n\\n\\n\\nCNCF’s mentorship programs are a win-win for maintainers and contributors; only 73 of our 205+ projects have participated, so there is room to grow–and in particular, I’d like to invite sandbox projects to participate. By mentoring, you’ll not only grow your project but also gain recognition, leadership experience, and a stronger community. Have any questions? Here are helpful links, and if you still want to chat, reach out to natew@cncf.io:
\\n\\n\\n\\nMember post originally published on Fastly’s blog by Hannah Aubry
\\n\\n\\n\\nAbout five years ago, Fastly had a problem with scale. No, not our network. Fastly’s network continues to scale effortlessly, including recently breezing past a 353 Tbps* (terabits per second) capacity threshold we’ve been tracking internally. No, our problem was scaling how our dev teams worked together and the shared resources they used. That’s a common problem for any company with a vital and growing engineering function like Fastly, but for us, it came with a unique twist — because Fastly is one of the few companies on which the entire internet relies, and because our whole thing is instant digital experiences, our solution to internal scale had to not only be reliable and resilient but also very, very fast.
\\n\\n\\n\\nEnter Fastly’s Cloud and Container Services team. In 2020, the Platform Engineering team—now Foundation Engineering—was exploring ways to make Fastly’s engineering teams more effective and efficient. Around that time, a new engineering paradigm was gaining steam. Platform engineering is the practice of “designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.” One of the key tools used in a platform-engineering-focused organization is an Internal Development Platform (IDP). IDPs greatly benefit individual engineers and the organizations they work for because they centralize control for cloud resources, security policies, user management, and more. In other words, they keep engineers focused on productivity and make it easy for organizations to allocate resources, onboard new hires, and more.
\\n\\n\\n\\nToday, we call the IDP that our Foundation Engineering team built Elevation. To understand how Fastly’s Elevation platform works, I chatted with Danny Kulchinksy, one of the original members of Fastly’s Cloud and Container Services team.
\\n\\n\\n\\nA platform like Elevation aims to provide a standardized interface and user experience for all of Fastly’s developers. Specifically, its current role in Fastly’s architecture is to provide common and centrally-owned infrastructure for the development teams building applications that control crucial aspects of our network like AutoPilot, which automagically load balances traffic between our Points of Presence (POPs) to improve performance, or Neptune, which runs Fastly’s TLS features. Previously, Fastly used custom Chef cookbooks per application to run these kinds of applications, which led to a lot of maintenance for each engineering team: not only writing cookbooks but also figuring out how to deploy the application, patching the servers, fixing downtime as it happens—which doesn’t always happen—the list goes on.
\\n\\n\\n\\nAt its core, Elevation is Kubernetes (and many other tools from the Kubernetes ecosystem). Rather than individually managing their infrastructure, teams produce container images with standardized deployment patterns, enabling them to simply define where and how they want to deploy their application. From there, our Foundation Engineering team utilized controllers to perform all the necessary initialization, secrets management, and auto-scaling processes. What’s more, Elevation uses custom controllers to ensure that our workloads are always in policy over the long term too.
\\n\\n\\n\\n“So what we’ve done is built a controller that sits on each of the Elevation clusters. Once it detects that a new namespace is created, it automatically talks to Vault—an open source secrets storage system—and creates the secret namespace, the relevant policies, roles, and all the necessary machinery for the users to get started. If we need to change the policy over time, that gets rolled out automatically by the controller, too,” said Danny.
\\n\\n\\n\\nElevation’s success is largely due to the Cloud & Container Services team’s thoughtful planning, execution, and internal advocacy. The success and positive reviews from migrated teams haven’t hurt either, as Elevation has grown to serve 200+ services and 40+ teams and projects across Fastly.
\\n\\n\\n\\n“First and foremost, we knew Elevation needed to be very reliable and resilient but also simple. Because it is an adjustment for the engineering teams, and if it’s too hard or the benefits aren’t clear, they won’t adopt it. And it took quite a while to get the confidence of the various engineering teams because, at the beginning, nobody wanted to use Kubernetes. It was very new, there were a lot of jokes around it, and it took quite a bit of effort on our part to prove and demonstrate that this is a reliable and worthwhile platform to use. But since we started, not a single team that has made the switch so far has regretted it. They’ve all felt that they were better off than they were before.”
\\n\\n\\n\\nWhen asked how the team wants to grow the platform next, Danny said their main focus is always ensuring that our development teams continue to have a good experience using Elevation, even as its user base and complexity grow. He has to say “no” or “next year” to more proposed features than he did in Elevation’s early days, but the team’s aim remains the same: ensuring our users have the freedom to operate independently while ensuring they don’t break someone else’s service by mistake. Under the hood, Elevation leans on Prometheus and Thanos for monitoring, with Fluentd for automated metrics and log collection. But perhaps the most versatile tool from the Kubernetes ecosystem for Fastly is Kyverno, the policy engine. Its ability to mutate, validate, and generate resources upon creation or when they’re updated makes it especially powerful for Fastly. For example, if a developer tries to do something with Fastly’s infrastructure that is an insecure practice or out of policy—running an application as root, for example—Kyverno processes the deployment manifest against our validation policies and blocks the app from running.
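\\n\\n\\n\\nAs a simplified sketch of that kind of guardrail (not Fastly’s actual policy, and deliberately checking only the pod-level securityContext), a Kyverno ClusterPolicy that rejects Pods which don’t declare runAsNonRoot could look like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root
spec:
  validationFailureAction: Enforce   # block the resource instead of only auditing it
  rules:
    - name: pods-must-not-run-as-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Running as root is not allowed; set runAsNonRoot to true."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: "true"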
\\n\\n\\n\\nAbstracting infrastructure is a great expedient for the software development lifecycle, but what about during emergency scenarios, like an incident, when dev teams may need extended permissions to fix an issue? The Platform Engineering team thought of that. The Fastly development teams using Elevation have access to a unique automation called Break Glass—built using Kyverno—which extends their permissions on our production clusters.
\\n\\n\\n\\n“So essentially, service owners have a specific set of permissions for what they can do in a production environment. Generally, we don’t allow certain actions because they’re considered risky from a security perspective or we have some compliance requirements where we cannot just apply changes directly, we need to go through an approval process. But if there’s an emergency, if there’s an incident and the engineer has to go in and do something to fix it, they can Break Glass. Once they do so, two things happen. One is they get elevated permissions that are time-scoped, starting at two hours, but they can extend if needed. The second is that comprehensive audit tracing initiates at the same time. We know who did the Break Glass and what they did, so we can go back post-event and understand what happened. This feature has helped us reduce lag time in responding to incidents since it’s completely self-serve while ensuring that we are always compliant,” said Danny.
\\n\\n\\n\\nLearn how Fastly can help you scale your internal development platform to success. Sign up for a free account or join the conversation in our forum.
\\n\\n\\n\\n* 353 Tbps of connected global capacity as of June 30, 2024
\\n\\nMember post by Gabriele Bartolini, VP Chief Architect of Kubernetes at EDB
\\n\\n\\n\\nThis article delves into the concept of cloud neutrality—a term I prefer over agnosticism—in PostgreSQL deployments. It highlights the transformative impact of Kubernetes as a cloud-neutral platform, enabling organizations to deploy and manage PostgreSQL across diverse environments, including on-premises, hybrid, and multi-cloud setups. Key organizational considerations, such as vendor lock-in, cost predictability, and development velocity, are examined alongside various deployment models, from traditional on-premises Linux to modern Kubernetes-based solutions. The discussion emphasizes the importance of involving database administrators (DBAs) in transitioning to cloud-native architectures and presents CloudNativePG as a robust solution for building a scalable, flexible, and cloud-neutral PostgreSQL ecosystem.
\\n\\n\\n\\nThe demand for databases continues to grow at a rapid pace. Gartner reports that the global database market grew by 13.4% in 2023, with nearly 80% of databases being relational systems like PostgreSQL. This surge is further propelled by the expanding presence of hyperscalers—Amazon, Google, and Microsoft—offering robust Database-as-a-Service (DBaaS) solutions. PostgreSQL has solidified its position as the most popular database, with increasing adoption not only for its technical strengths but also as a way to reduce vendor lock-in.
\\n\\n\\n\\nWhile public cloud services continue to dominate the market, many organizations that rely on PostgreSQL are increasingly adopting Kubernetes as a central part of their infrastructure strategy. This shift is motivated by the pursuit of cloud neutrality, enabling businesses to avoid vendor lock-in at the provider level and gain greater flexibility while consciously taking on more operational responsibilities. By embracing Kubernetes, organizations also unlock the potential to deliver multi-cloud and hybrid-cloud solutions, providing more versatile services to customers worldwide. Let’s explore how and why this transition is redefining the future of PostgreSQL deployments through the CloudNativePG stack.
\\n\\n\\n\\nWhen I started using PostgreSQL in production in the early 2000s, I ran it on a bare metal Linux operating system with a hardware RAID controller and multiple locally attached disks. To optimize costs and resource utilization, I ran several PostgreSQL instances on the same machine, each assigned to different TCP ports. This was a common approach to consolidating database infrastructure at the time.
\\n\\n\\n\\nAs technology advanced, the rise of virtualization and configuration management tools significantly improved on-premises database management. These innovations have fueled PostgreSQL’s global adoption, making on-prem deployments a solid option for organizations that require complete control and ownership of their infrastructure.
\\n\\n\\n\\nToday, on-premises deployments are often favored by organizations with a strong understanding of their workloads, who must manage their systems’ security, performance, business continuity, and compliance. Industry-specific regulations, government mandates, and data sovereignty requirements make this approach indispensable in certain sectors.
\\n\\n\\n\\nThe late 2000s saw a significant shift with the rise of infrastructure-as-a-service (IaaS), driven by the rapid growth of cloud computing. By the early 2010s, Postgres databases were transitioning from traditional on-prem data centers to cloud-based solutions, allowing users to rent virtual servers and storage on demand while paying for actual resource usage.
With tools like Terraform and automation platforms like Ansible, the provisioning, deployment, and configuration of PostgreSQL databases have become far more efficient, particularly for Day 1 operations. However, despite the advantages in speed and flexibility, Day 2 operations—such as availability, maintenance, scaling, and optimization—still require substantial involvement from database administrators (DBAs) and often remain confined to the database domain, disconnected from the broader infrastructure and applications.
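\\n\\n\\n\\nA minimal sketch of that Day 1 automation with Ansible might look like the playbook below; the inventory group, package name, and service name are assumptions that vary by distribution, and some distributions also require an explicit initialization step that is omitted here.

- name: Provision a PostgreSQL server (Day 1)
  hosts: databases                  # assumed inventory group
  become: true
  tasks:
    - name: Install the PostgreSQL server package
      ansible.builtin.package:
        name: postgresql-server     # package name varies by distribution
        state: present

    - name: Enable and start the PostgreSQL service
      ansible.builtin.service:
        name: postgresql            # service name varies by distribution
        state: started
        enabled: true

\\n\\n\\n\\nReplication, failover, backups, and upgrades are exactly what this playbook does not cover, which is the Day 2 gap described above.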
While IaaS solutions reduce lead times for provisioning compute and storage, they also bring challenges such as vendor lock-in and unpredictable costs, despite the promise of pay-as-you-go savings. Hybrid and multi-cloud strategies aim to reduce vendor dependency, but their implementation is often complex and resource-intensive.
Interestingly, IaaS also played a pivotal role in accelerating the rise of platform-as-a-service (PaaS) and software-as-a-service (SaaS). A prime example is Heroku, a pioneering PaaS that gained popularity among Ruby developers and significantly contributed to the growing adoption of PostgreSQL around 2010.
Building on the IaaS model, Database-as-a-Service (DBaaS) for PostgreSQL has become increasingly popular since 2013, particularly for organizations seeking to outsource most Day 2 operations. With DBaaS, service providers manage the underlying infrastructure, operating system, and PostgreSQL server, allowing customers to focus on application development and business needs.
\\n\\n\\n\\nToday, DBaaS is synonymous with cloud databases. Major cloud providers offer DBaaS solutions like Amazon RDS for PostgreSQL, Google Cloud SQL for PostgreSQL, and Azure Database for PostgreSQL, alongside specialized solutions such as EDB Postgres AI Cloud Service from my company. These services simplify database management by providing high availability, disaster recovery, and observability, with geographical redundancy across multiple cloud regions.
\\n\\n\\n\\nWhile DBaaS significantly reduces infrastructure complexity and operational overhead, it limits direct control over the PostgreSQL backend, restricting access to just a PostgreSQL connection, a CLI, or a web-based dashboard. This trade-off can accelerate development velocity, but it also introduces concerns like vendor lock-in, unpredictable costs, and data portability. The recent European Union Data Act highlights the importance of data portability (i.e. switching between cloud providers, implementing multi/hybrid solutions, and migrating data without downtime), urging organizations to consider these factors carefully when adopting DBaaS. Unless the DBaaS is designed with a multi-cloud approach and specifically optimized for PostgreSQL (as with EDB’s solution), migrating data away from the cloud service provider can be challenging.
\\n\\n\\n\\nManaging PostgreSQL across on-premises, IaaS, and hybrid/multi-cloud environments exposes significant complexity and operational trade-offs. Each scenario brings its own challenges, making it difficult to maintain consistency and flexibility.
\\n\\n\\n\\nA cloud-neutral approach offers a solution by enabling highly portable infrastructure that facilitates seamless transitions between cloud and on-premise environments. This model supports Day 1 tasks (provisioning, setup) and Day 2 tasks (scaling, monitoring, backups), without sacrificing performance or manageability. By using open-source technologies and standardized APIs, organizations can avoid vendor lock-in, reduce operational overhead, and retain the flexibility to choose the best infrastructure.
\\n\\n\\n\\nThe cloud-neutral PostgreSQL solution we propose is based on the CloudNativePG open-source stack, composed of Kubernetes, PostgreSQL, and CloudNativePG. This stack already empowers organizations worldwide to build highly portable infrastructures, deployable on-premises (including bare metal) or in hybrid configurations. By leveraging Kubernetes-as-a-service (KaaS) solutions from hyperscalers like Amazon EKS, Azure AKS, and Google GKE, container platforms like Red Hat OpenShift, or even the standard Kubernetes distribution, businesses can embrace cloud neutrality.
\\n\\n\\n\\nKubernetes is an open-source platform that provides a standard abstraction layer for managing infrastructure and applications within Linux containers. Its modular design, extensibility, fault tolerance, self-healing, and compliance with Infrastructure as Code (IaC) principles make it ideal for organizations looking for highly portable, cloud-neutral infrastructures.
\\n\\n\\n\\nKubernetes is the most popular Cloud Native Computing Foundation (CNCF) project and can be deployed anywhere—from bare metal servers to virtual machines. Organizations can choose between self-managed Kubernetes or managed KaaS solutions depending on their expertise. Additionally, enterprise platforms based on Kubernetes, like Red Hat OpenShift, SUSE Rancher, or VMware Tanzu, extend cloud neutrality to enterprise environments, facilitating seamless movement between on-premise, private, and public cloud infrastructures.
\\n\\n\\n\\nKubernetes unlocks a truly cloud-neutral infrastructure, allowing organizations to deploy workloads in private, public, or hybrid clouds with minimal changes to code or configuration. Thanks to GitOps and the integration of IaC, Kubernetes is a key enabler for cloud-neutral PostgreSQL, providing the flexibility needed for future-proof database deployments.
\\n\\n\\n\\nKubernetes facilitates cloud neutrality for infrastructure and containerized workloads, but managing PostgreSQL databases in this environment presents unique challenges. While Kubernetes treats PostgreSQL as just another application, relying on standard resources like `Deployments` and `StatefulSets` is insufficient due to the database’s inherent complexity. Though we often strive to move away from treating databases as “pets” and adopt the “cattle” model, it’s important to remember that PostgreSQL’s mascot is an elephant—symbolizing strength and requiring careful management (instead of “cattle,” perhaps “herd” is the more fitting analogy).
\\n\\n\\n\\nThis is where a PostgreSQL Operator comes into play. The operator pattern extends Kubernetes’ capabilities through custom resources, controllers, and declarative configurations, allowing the orchestration of complex applications like PostgreSQL in a cloud-native way.
\\n\\n\\n\\nA well-designed PostgreSQL operator provides the essential custom resources and controllers that enable Kubernetes to manage the database efficiently. This ensures PostgreSQL meets core cloud-native requirements such as high availability, self-healing, scalability, and security, while also leveraging its powerful business continuity features. These include native replication mechanisms like Hot Standby, synchronous replication, cascading replication, along with support for hot backups, continuous backup, and Point-in-Time Recovery (PITR). Together, these capabilities form a solid foundation for data integrity, resilience, and operational excellence in a cloud-native world.
\\n\\n\\n\\nThere are several PostgreSQL operators available, but CloudNativePG is the one on which I will be focusing, for a few key reasons:
\\n\\n\\n\\nBy defining a CloudNativePG `Cluster` resource, users can ensure their PostgreSQL database, with a primary node and any number of read-only replicas, operates in business continuity scenarios with minimal Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), with 99.99% uptime realistically attainable over a year. The same resource can be deployed, unchanged, across any cloud environment or multiple Kubernetes clusters for hybrid and multi-cloud scenarios, handling most Day 2 operations automatically. Users retain full control over their data with no vendor lock-in, ensuring complete data portability as required by the European Union’s Data Act (PostgreSQL’s native streaming replication—logical or physical—supports this flexibility).
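\\n\\n\\n\\nTo give a sense of what that looks like in practice, here is a minimal sketch of such a `Cluster` resource; the name, instance count, and storage size are illustrative choices rather than recommendations:

```yaml
# Minimal CloudNativePG Cluster sketch: one primary plus two replicas.
# Names and sizes are illustrative assumptions, not recommendations.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster-example
spec:
  instances: 3          # one primary and two read-only replicas, managed by the operator
  storage:
    size: 10Gi          # persistent volume requested for each instance
  postgresql:
    parameters:
      max_connections: "200"
```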
\\n\\n\\n\\nThis is how the CloudNativePG stack, built on Kubernetes and PostgreSQL, delivers true cloud neutrality.
\\n\\n\\n\\nAdditionally, Kubernetes’ scheduling capabilities allow users to allocate specific machines for PostgreSQL workloads. By applying labels and taints to nodes and defining affinity rules and tolerations on CloudNativePG `Cluster` resources, PostgreSQL can run in complete isolation from other applications at the physical layer, while seamlessly integrating at the logical layer. This approach mirrors a microservice database architecture, where the application and its backend coexist within the same Kubernetes namespace.
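\\n\\n\\n\\nAs a hedged sketch of how this isolation can be expressed declaratively, assuming the nodes have already been labeled and tainted for PostgreSQL (the `workload=postgres` key and value are a hypothetical naming choice):

```yaml
# Sketch: pin a CloudNativePG Cluster to dedicated, tainted nodes.
# The "workload: postgres" label/taint is a hypothetical naming choice.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-isolated
spec:
  instances: 3
  storage:
    size: 10Gi
  affinity:
    nodeSelector:
      workload: postgres          # schedule only on nodes labeled for PostgreSQL
    tolerations:
      - key: workload
        operator: Equal
        value: postgres
        effect: NoSchedule        # tolerate the taint that keeps other workloads away
    enablePodAntiAffinity: true   # spread instances across nodes
    topologyKey: kubernetes.io/hostname
```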
\\n\\n\\n\\nMoreover, the CloudNativePG stack is an excellent choice for isolating applications and backend databases at the logical layer while providing a cloud-neutral database-as-a-service (DBaaS) solution. Relying on standard resources, such as LoadBalancer services, makes it possible to safely expose PostgreSQL outside Kubernetes. This setup effectively serves both internal customers within your organization and external clients, as demonstrated by IBM’s Cloud Pak, EDB’s Postgres AI Cloud Service, and Tembo.
\\n\\n\\n\\nOne of the most common misconceptions about Kubernetes is that it can only run on virtual machines. In reality, Kubernetes operates just as effectively—if not more efficiently—on bare metal infrastructure without the need for a hypervisor. The diagram below offers a simplified view of the key layers in bare metal and virtual machine Kubernetes nodes.
\\n\\n\\n\\nBy running directly on the host, containers can fully exploit the underlying hardware, eliminating the overhead of virtualization and even doubling the efficiency and performance compared to VMs. This is especially important for stateful workloads, such as databases, which can be deployed on dedicated bare metal Kubernetes nodes and benefit from locally attached storage. This setup creates a unique opportunity for database professionals to implement highly performant, robust, shared-nothing architectures for PostgreSQL clusters, all managed seamlessly through declarative configuration. (After all, the very existence of CloudNativePG stems from the first fail-fast experiment my team conducted in late 2019, when we launched our Postgres initiative. The goal was to measure the performance of Kubernetes on bare metal—and, as you might have guessed, the results spoke for themselves.)
\\n\\n\\n\\nThanks to the CloudNativePG stack, PostgreSQL DBAs can now run PostgreSQL on-premises in Kubernetes, on bare metal, and replicate it across private or public Kubernetes clusters using symmetric architectures. This offers a highly portable and standardized approach to hybrid cloud environments—a truly cloud-neutral PostgreSQL solution that can help reverse or slow the migration of databases to public clouds, enabling a shift back to hybrid or fully on-premises deployments.
\\n\\n\\n\\nAs I have previously written, PostgreSQL DBAs are at a critical career crossroads when approaching Kubernetes: should they adopt, avoid, or deny it?
\\n\\n\\n\\nHaving dedicated the past five years of my Cloud Native initiative to making it easier to run PostgreSQL on Kubernetes, I’m now shifting to the next phase—helping PostgreSQL experts transition smoothly into the Kubernetes ecosystem. While spinning up a PostgreSQL cluster in Kubernetes has now become remarkably straightforward, there is an expansive and largely uncharted realm of opportunities that only seasoned PostgreSQL experts can fully explore and master. This transition will undoubtedly require effort and adaptation from DBAs, who must develop sufficient Kubernetes knowledge (a T-shaped profile) to collaborate effectively with Kubernetes experts regarding the PostgreSQL database from day 0 (planning). However, the rewards are clear, as confirmed by PostgreSQL DBAs who have successfully navigated this journey. In many ways, this transition mirrors the shift DBAs experienced over the past two decades when moving from bare metal to virtual machines.
\\n\\n\\n\\nBased on my experience, PostgreSQL DBAs deeply appreciate open-source software and are well-versed in running Linux commands. Kubernetes, alongside Linux and PostgreSQL, is one of the most fascinating and transformative open-source projects and communities in history. While Kubernetes is vast and complex, it is also modular, meaning DBAs don’t need to master every aspect of it. Instead, they can focus on the essentials for managing CloudNativePG PostgreSQL clusters: understanding pods and containers, networking (service resources), storage (storage classes, persistent volume claims, and persistent volumes), and having enough familiarity to collaborate effectively with infrastructure administrators on areas like IaC, GitOps, TLS certificates, monitoring (using Prometheus for metrics and alerts), and logging (as logs aren’t stored on disk). Investing a month or two of study into this skill set can unlock a decade of new opportunities.
\\n\\n\\n\\nAchieving true cloud neutrality at the infrastructure level requires your organization to develop or acquire Kubernetes expertise through internal teams or external partnerships. This expertise varies depending on the platform choice—whether using upstream Kubernetes (which demands the most skills), Kubernetes-as-a-Service (KaaS), or enterprise-grade container platforms like Red Hat OpenShift and SUSE Rancher.
\\n\\n\\n\\nThe diagram below outlines the decision-making process for adopting the CloudNativePG stack based on my experience and analysis of customer journeys and conversations over the past five years. While simplified, it accurately reflects the broader trends in the industry, even if some details may differ in specific contexts.
\\n\\n\\n\\nFor organizations aiming to extend cloud neutrality to PostgreSQL and fully leverage the CloudNativePG stack, it’s essential for PostgreSQL DBAs to be engaged from the start. If your team lacks Kubernetes knowledge or your DBAs are reluctant to embrace this transition, it may be prudent to stick with more familiar options like running PostgreSQL on bare metal, virtual machines, or Infrastructure-as-a-Service (IaaS). Although not explicitly represented in the diagram for simplicity, in these cases, DBAs may also rely on Database-as-a-Service (DBaaS) solutions, especially when other stakeholders like developers drive the decision.
\\n\\n\\n\\nFor organizations without dedicated DBAs or Kubernetes expertise, DBaaS remains a popular and pragmatic option.
\\n\\n\\n\\nHowever, a growing trend is emerging among organizations that have strategically adopted Kubernetes but don’t have dedicated DBAs. These companies are using the CloudNativePG stack to provide Database-as-a-Service to their internal customers—primarily developers and engineers from other departments. This approach offers a balanced solution between traditional infrastructure management and a fully managed cloud DBaaS. While I can’t disclose specific company names, this use case is increasingly common among larger enterprises in sectors like manufacturing, banking, finance, payments, automotive, and IT services. The trend is especially pronounced in Europe, where on-premise infrastructure is often preferred. The CloudNativePG stack’s ability to provide declarative isolation of workloads, both logically and physically, enables PostgreSQL consolidation on bare metal Kubernetes nodes, creating a win-win scenario for both infrastructure administrators and DBAs.
\\n\\n\\n\\nAs a final note, it’s important to emphasize that EDB, the original creator and founder of CloudNativePG, is well-positioned to assist enterprises globally in migrating their Postgres databases to Kubernetes. EDB is at the forefront of PostgreSQL on Kubernetes technology, serving as a Silver Member of the CNCF and the only Kubernetes Certified Service Provider actively involved in PostgreSQL development. EDB offers a long-term supported version of CloudNativePG called EDB Postgres for Kubernetes (PG4K), which also provides access to EDB Postgres Advanced Server (EPAS), simplifying Oracle migrations, as well as EDB Postgres Distributed for Kubernetes (PGD4K) for active-active workloads.
\\n\\n\\n\\nTo conclude, the rise of Kubernetes as a cloud-neutral platform is revolutionizing how PostgreSQL is deployed and managed across diverse environments. While blueprints and best practices can offer valuable guidance, the right deployment choice ultimately hinges on your organization’s specific needs and the expertise of its teams.
\\n\\n\\n\\nThe table below summarizes key organizational considerations—such as vendor lock-in, cost predictability, and development velocity—across the major PostgreSQL deployment models available today, from traditional on-premises Linux setups to modern Kubernetes-based solutions.
\\n\\n\\n\\n| | On-Premises PostgreSQL | PostgreSQL in the Cloud (IaaS) | PostgreSQL in the Cloud (DBaaS) | Cloud Neutral PostgreSQL (KaaS) | Cloud Neutral PostgreSQL (Self-Managed) |
|---|---|---|---|---|---|
| Deployment model | Purchase / Consumption-based | Consumption-based | Consumption-based | Consumption-based | Purchase / Consumption-based |
| Cost predictability | High | Low/Medium | Low | Medium | High |
| Time to Market for DB Applications | Slow | Medium | Fast | Fast | Fast |
| Vendor Lock-In Risk | Low/None | High, typically | High | Low | Low/None |
| DBaaS Use | No | Yes, internal & external | N/A | Yes, external only | Yes, internal & external |
After addressing these organizational factors, we can dive deeper into the infrastructural and operating system layers to explore the key technical differences among the deployment options discussed in this article:
\\n\\n\\n\\n| | On-Premises PostgreSQL | PostgreSQL in the Cloud (IaaS) | PostgreSQL in the Cloud (DBaaS) | Cloud Neutral PostgreSQL (KaaS) | Cloud Neutral PostgreSQL (Self-Managed) |
|---|---|---|---|---|---|
| Hardware costs | High | None | None | None | High |
| Installation method | Packages on OS | Packages on OS | N/A | Immutable containers | Immutable containers |
| Bare Metal Support with Local Storage | Yes | No, typically | No | No, typically | Yes |
| Control Over System Configuration | High | High | None | None | High |
| Private Cloud Capability | Yes | No | No | No | Yes |
| Public Cloud Availability | No | Yes | Yes | Yes | Yes, potentially via IaaS |
| Hybrid Cloud Support | Yes | Yes | No | Yes | Yes |
| Multi-Cloud Support | No | Yes, but hard | No | Yes | Yes, potentially |
At this point, all major organizational and architectural considerations have been addressed, and your decision on which cloud model to adopt is likely clear. However, it’s equally important to evaluate aspects related to the Postgres database itself and the data it manages. These factors, summarized in the table below, can have a significant impact on your final deployment choice:
\\n\\n\\n\\n| | On-Premises PostgreSQL | PostgreSQL in the Cloud (IaaS) | PostgreSQL in the Cloud (DBaaS) | Cloud Neutral PostgreSQL (KaaS) | Cloud Neutral PostgreSQL (Self-Managed) |
|---|---|---|---|---|---|
| Business Continuity & Compliance | Yes, full responsibility | Yes, OS and up | No, database content only | Yes, database | Yes, full responsibility |
| Day 2 Operations (Maintenance) | Manual | Manual | Automated | Automated | Automated |
| Data portability (EU Data Act) | Yes | Yes | No | Yes | Yes |
| Control Over PostgreSQL Configuration | High | High | Limited | High | High |
| Database performance | High | Medium | Medium | Medium | High |
| Postgres Extensions Support | Full control | Full control | Controlled by the provider | Full control | Full control |
| Postgres PKI Support (mTLS) | Yes, complex | Yes, complex | Yes | Yes | Yes |
With CloudNativePG, organizations can embrace a robust, high-performance, and standardized approach to running PostgreSQL clusters across bare metal, hybrid, or multi-cloud environments, all driven by declarative configuration. This empowers DBAs to implement cloud-neutral, shared-nothing architectures that avoid vendor lock-in while retaining full control over data and performance. As businesses increasingly move away from cloud-dependent models, CloudNativePG offers a scalable, future-proof solution for managing PostgreSQL in diverse and complex environments, including a return to on-premises deployments.
\\n\\nMember post originally published on the ngrok blog by Joel Hans
\\n\\n\\n\\nDevelopers love a groove.
\\n\\n\\n\\nNo, I don’t mean a touch of jazz to class up your workday, but the specific patterns you rely on for building great applications. Think of the grooves in the vinyl you might play for said jazz. Grooves that work, like development environments, version control, declarative configurations, code review with your peers, CI/CD pipelines for your test suites…
\\n\\n\\n\\nBy settling into that groove, you get shifted-left security without breaking your brain, more transparency into what everyone is building together, and higher quality. That’s the undeniable advantage of working with these proven traditions. The developer experience for building applications stays relatively strong through production. In the cases where it gets operationally messy, you still have plenty of exciting software delivery and DevSecOps platforms to pick from that will help you get over the finish line.
\\n\\n\\n\\nTaking APIs to production in the same groove is an entirely different matter.
\\n\\n\\n\\n\\n\\n\\n\\n\\nTip: Too eager to peek at the real infrastructure behind an APIOps workflow with ngrok? Skip down to the Building with ngrok’s production-grade API gateway and unified ingress section.
Most of the entrenched API gateway providers—both those deployed on-premises and in the cloud—leave you hanging right before you deploy to production. When you try to enable must-have policies like JWT-based authentication, you have to trudge your way through the gateway’s awkward web console, where you get the joy of working with languages like CSharpScript. Adding rate limits isn’t much better, with a sequence of `curl` requests to the admin API.
Even if they do offer a more developer-friendly configuration process, it’s usually an afterthought they’ve tacked on to keep up with the competition. Using it requires a painful rip-and-replace process to get yourself back to feature parity.
\\n\\n\\n\\nThese outcomes might be fine if you have a dedicated DevOps or platform engineering team who handles all operational worries on your behalf. Doesn’t that sound nice? The reality is that most developers can’t afford—in time, cognitive load, and risk to quality—to jump grooves at the last possible moment. Abandoning a Git-driven development lifecycle when you need good developer experience can’t be a viable long-term solution.
\\n\\n\\n\\nHow much would you benefit from defining API gateway configuration in code, version-controlled traffic policies, and repeatable deployments in any environment?
\\n\\n\\n\\nIf you haven’t heard of APIOps, here’s the TL;DR: APIOps aims to integrate the beneficial processes (more grooves) of DevOps and GitOps into API development.
\\n\\n\\n\\nAPIOps borrows the automation and integration fundamentals from DevOps to improve your collaboration across the entire API lifecycle, speeding up your release cycle while minimizing bugs. It also folds in what makes GitOps great for automating how you provision infrastructure, like relying on Infrastructure as Code (IaC) and formal change mechanisms, for more manageable and higher-quality production deployments.
\\n\\n\\n\\nBecause your Git repository is the canonical source of truth, dictating not just what the upstream service is but how it’s made available in production, you can start to answer that last question:
\\n\\n\\n\\ncurl
commands you used against the gateway provider’s API.GitOps and APIOps also gives you access to some other benefits:
\\n\\n\\n\\ngit revert
.Your line of questioning then becomes: Does your API gateway even let you do APIOps? If yes, the conversation quickly devolves into: How do you actually get the process right? Even with the right groove right in front of you, it’s easy to lead your team astray.
\\n\\n\\n\\nThe younger the company, the more likely you are to have picked up an API gateway provider that allows for some degree of APIOps, and the more likely you are to let everyone create, configure, and deploy APIs with abandon. You need to drive toward that MVP or pivot before you burn through your cash, so it doesn’t matter that every API is unique.
\\n\\n\\n\\nDifferent providers. Different authentication policies. Different repositories. Different branch protections. Different disaster plans. Different ideas of what “quality,” “security,” and “availability” mean.
\\n\\n\\n\\nYou can move fast, but it sure is chaotic.
\\n\\n\\n\\nLarger and older companies are more likely to have dedicated teams who “own” the APIOps process. They maintain the quality of the production environment by establishing guardrails that dictate how developers should design and deploy APIs.
\\n\\n\\n\\nYou’re grateful for their high standards and not having to worry so much about the operational end of the API lifecycle, but you also feel the pain of pushing all your hard work through them.
\\n\\n\\n\\nYour APIs might be perfectly stable, but the congestion slows you down.
\\n\\n\\n\\nYou might think APIOps, as a methodology, is somehow flawed. If it can’t help developers get back into their groove, what’s the point? The failure mode here isn’t APIOps, but rather how most API gateways handle ingress and the global delivery of your APIs. In other words: They don’t.
\\n\\n\\n\\nDeployed API gateways operate from a single point of failure and require sophisticated Ops handiwork, and cloud API gateways lock you into specific cloud providers and don’t offer nearly as much deep customization. In either case, the networking and operational task of connecting multiple API gateways to your infrastructure and a global delivery network is far from your core competency. You’re still on the hook for high availability and failover processes. You’re still waiting on the NetOps team to give you the green light on an API you wrapped up two weeks ago.
\\n\\n\\n\\nAPIOps makes the early phases of the lifecycle developer-friendly, but you still need developer-friendly ingress at the very end of the line.
\\n\\n\\n\\nWe built ngrok’s API gateway to be both flexible and developer-defined, which means it’s ready for APIOps workflows that work in the same groove you already know and love. Adopting APIOps gets quite straightforward once you have unified ingress from ngrok.
\\n\\n\\n\\nFor example, you start by developing a new API in a single Git repository for your backend service, a `Dockerfile` to containerize said service, and a few IaC files in the form of Kubernetes manifests in YAML.
Git-based foundation? ✅
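\\n\\n\\n\\nThose manifests don’t need to be anything exotic. A minimal sketch of what the repository might contain, using a hypothetical `my-api` image and names:

```yaml
# Sketch: the kind of IaC kept in the same Git repository as the service.
# Image, names, and ports are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: ghcr.io/example/my-api:0.1.0   # built from the Dockerfile in this repo
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - port: 8080
      targetPort: 8080
```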
\\n\\n\\n\\nNext, you provision a Kubernetes cluster—it could be a `minikube` cluster on your local workstation for a proof of concept, or its production-grade counterpart in GKE. Use Helm to install the ngrok Kubernetes Operator to take care of secure and flexible ingress, followed by a GitOps tool like Argo CD to tackle all the work around continuous deployment.
Automated and repeatable GitOps deployments using Git as your source of truth? ✅
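\\n\\n\\n\\nA minimal sketch of the Argo CD `Application` that could watch that repository; the repository URL, path, and namespaces are hypothetical placeholders:

```yaml
# Sketch: an Argo CD Application pointing at the Git repo that holds the manifests.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-api.git   # hypothetical repository
    targetRevision: main
    path: deploy                  # directory containing the Kubernetes manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: my-api
  syncPolicy:
    automated:
      prune: true                 # remove resources that were deleted from Git
      selfHeal: true              # revert out-of-band changes to match Git
```

\\n\\n\\n\\nThe `automated` sync policy here is also what later lets every merged commit roll out without manual intervention.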
\\n\\n\\n\\nNext, you add API gateway Traffic Policies to your existing IaC by layering new CRDs and referencing them throughout your HTTPRoutes, which the Kubernetes Operator uses to direct incoming traffic to your containerized backend API service. Traffic Policies execute on both inbound requests and outbound traffic, allowing you to quickly layer in JWT authentication, rate limiting, and plenty of other custom rules.
\\n\\n\\n\\nAPI gateway policies defined by YAML and version-controlled with Git? ✅
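\\n\\n\\n\\nPutting those pieces together, the route and policy attachment could look roughly like the sketch below, reusing the hypothetical `my-api` service from earlier. The gateway name and, in particular, the group and kind used in the `ExtensionRef` filter are assumptions for illustration; consult the operator’s documentation for the exact resource names.

```yaml
# Sketch: route inbound traffic to the containerized backend and attach a
# version-controlled traffic policy. Names and the policy group/kind are assumed.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-api-route
spec:
  parentRefs:
    - name: ngrok-gateway              # Gateway managed by the ngrok Kubernetes Operator (assumed name)
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      filters:
        - type: ExtensionRef           # attach the policy CRD that holds JWT auth, rate limits, etc.
          extensionRef:
            group: ngrok.k8s.ngrok.com # assumed group/kind for the traffic policy resource
            kind: NgrokTrafficPolicy
            name: jwt-and-rate-limit
      backendRefs:
        - name: my-api
          port: 8080
```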
\\n\\n\\n\\nFinally, you can enable Argo CD’s auto syncing feature to automatically deploy every merged commit, whether that’s a bugfix on your backend service or a new authentication method to protect your public API from abuse.
\\n\\n\\n\\nFull-on APIOps? ✅
\\n\\n\\n\\nngrok helps you avoid the problematic cases mentioned above through unified ingress. No more worrying about networking infrastructure or TLS certificate generation. No more pondering over a global delivery network, high availability, or geo-aware failover. Developers work efficiently with a convenient API gateway interface, and operators—when they are indeed in the picture—can green-light new deployments faster than before.
\\n\\n\\n\\nNo chaos. No congestion. Just one smooth groove.
\\n\\n\\n\\nCheck out our how-to guide for a full walkthrough of the technical setup to bring APIOps to your next API project with ngrok.
\\n\\n\\n\\nIn the end, your APIOps workflow on ngrok comes with some extra benefits, too.
\\n\\n\\n\\nYour new API development groove now includes ngrok—if you haven’t yet, sign up now to start defining your API gateway through Git and IaC, not obscure admin API calls.
\\n\\n\\n\\nOnce you have an account, check out not just the how-to guide, but all our other resources on how ngrok’s developer-defined API gateway works alongside Kubernetes APIOps-ready deployments.
\\n\\n\\n\\nOnce you’ve tried our how-to guide or bravely implemented the core APIOps concepts in your own infrastructure, we’d love to hear from you in our new ngrok Community Repo, the best place for all discussions around ngrok, including bug reports and product feedback.
\\n\\n\\n\\nJoel Hans is a Senior Developer Educator at ngrok. He has plenty of strong thoughts about documentation, developer education, developer marketing, and more.
\\n\\nMember post by Jatinder Singh Purba, Principal, Infosys; Krishnakumar V, Principal, Infosys; Prabhat Kumar, Senior Industry Principal, Infosys; and Shreshta Shyamsundar, Distinguished Technologist, Infosys
\\n\\n\\n\\nIn the last quarter of 2024, the cloud-native ecosystem continues to be strong as end users reap the benefits of over 10 years of modernization initiatives. This article covers the trends we have observed that will steer investment and end-user interest in cloud-native ecosystems in 2025. Most of these trends are driven by foundations such as the Cloud Native Computing Foundation (CNCF), The FinOps Foundation, the Open Source Security Foundation (OpenSSF), and the LF AI & Data Foundation, under The Linux Foundation.
\\n\\n\\n\\nCloud-native architecture today is the default choice for greenfield projects. The broad adoption of cloud-native architecture by enterprise customers has been accompanied by an increasing focus on costs over the past few years [1]. Cloud-native architecture is more complex and layered than a typical monolith. Automation with infrastructure-as-code (IaC) allows teams to create and destroy infrastructure at scale, creating a highly dynamic environment in which to meter and manage costs. As more organizations modernize to cloud and cloud-native architecture, measuring and controlling costs will become crucial.
\\n\\n\\n\\nWhile FinOps does have a separate foundation, CNCF projects like OpenCost are driving this area forward [2, 3]. OpenCost is a tool that provides visibility into Kubernetes spend and resource allocation. It helps accurately measure and apportion cost. Further, customers are also looking for the right tools and best practices to reduce overall spend without sacrificing performance. The cloud-native ecosystem is fast evolving to meet this need.
\\n\\n\\n\\nThis trend is closely associated with the ability to observe and measure resource consumption and other parameters of IT estates. As such, associated projects such as OpenTelemetry, Jaeger, Prometheus, and OpenSearch (part of the Open Search Foundation) also help with this goal.
\\n\\n\\n\\nOrganizations can apply this trend in practice by piloting FinOps projects such as OpenCost and trying to obtain a clear picture of where their IT dollars are being spent.
\\n\\n\\n\\nIt is no secret that additional tools introduced by cloud-native approaches increase developer friction. While concepts such as containers have revolutionized operations, they have also led to additional concepts that developers must master in addition to the application codebase and tooling. Consequently, the year 2023 saw the rapid rise of internal developer portals (IDPs) and platform engineering to address this.
\\n\\n\\n\\nBackstage is an open framework to develop IDPs [4]. It clocked the highest number of end-user contributions in the 2023 CNCF annual report and the 4th highest velocity across the landscape [5]. This is a testament to the interest that end users have in IDPs. Backstage, in particular, is fast becoming the de-facto standard for building IDPs and accelerating cloud native productivity.
\\n\\n\\n\\nAn IDP is necessary to ensure developer productivity and is a sizeable part of the larger topic of platform engineering. In addition to developer portals, clients are looking at platform engineering capabilities to build developer and operations-friendly abstractions on top of the Kubernetes core.
\\n\\n\\n\\nA recent Backstage implementation by Infosys for a leading US insurance company has shown promising results [6]. The solution reduced onboarding time for new developers by about 40% and increased code deployment frequency by 35%. It has greatly optimized the lead time from generating requirements to moving to production, leading to a commensurate increase in customer satisfaction.
\\n\\n\\n\\nEnd users can benefit by performing an analysis of internal developer portals in the industry and piloting these within their organizations. Implementing Backstage is a promising first step as it is a CNCF project with a great deal of momentum and an increasing number of adopters.
\\n\\n\\n\\nCloud Native Powers AI
\\n\\n\\n\\nSince 2016, OpenAI, a pioneer in the industry, has been running its training and inference workloads on Kubernetes [7]. It has pushed the limits of platform technology by running clusters with up to 2500 nodes. All the advantages of cloud-native technologies and platforms such as scalability and dynamic nature transfer directly to artificial intelligence (AI) workloads. This is especially true for large language models (LLMs), a fast-moving area of AI technology that is transforming every industry it touches. The trend within the cloud-native landscape to cater to AI training and services is spread across the LF AI & Data and the CNCF foundations [8]. CNCF has also developed and published a cloud-native AI landscape along with a white paper earlier this year [9, 10].
\\n\\n\\n\\nLF AI & Data and CNCF house open-source projects that are critical building blocks for the AI revolution. These include projects such as OPEA [11], Milvus [12], KServe [13], and Kubeflow [14].
\\n\\n\\n\\nApart from projects, there are also foundational improvements and changes being made to Kubernetes such as the elastic indexed job to better handle the demands of AI workloads [15]. Considerable thought leadership in this space is being driven by the members of the cloud-native AI working group under the Technical Advisory Group for Runtime (TAG-Runtime) of CNCF [16].
\\n\\n\\n\\nCompanies experimenting with AI typically start with a proprietary SaaS or cloud offering. They can further expand their reach into cloud-native AI by setting up and running open-source projects such as KServe to experiment with open-source LLMs. The ability to curate and self-host LLMs is the first step to addressing privacy, security, and regulatory concerns with proprietary offerings and developing in-house capabilities in this area.
\\n\\n\\n\\nObservability is critical to the success of cloud-native programs as their architectures are complex and dynamic. In addition, with the rise of hybrid and multi-cloud environments, it becomes critical to have a comprehensive observability solution. Cloud-native observability must go beyond legacy metrics such as CPU, memory, storage, and network throughput. Further, the volume of metrics and the metadata associated with cloud-native observability are orders of magnitude higher as compared to legacy environments. All of this creates technical and operational challenges in enabling effective observability for cloud-native architectures.
\\n\\n\\n\\nOver the last decade, this area of cloud-native technology has been driven by large, closed-source commercial vendors. Though open-source projects such as Prometheus (a monitoring system with a time series database [17]) and Fluentd (a unified logging layer [18]) are well adopted, end users typically turn to commercial vendors such as Dynatrace, AppDynamics, and Splunk, which offer more fully featured suites. However, this has led to cost and portability concerns.
\\n\\n\\n\\nTwo key CNCF projects are driving change in this area through initiatives such as ‘observability query language specification’ [19]. These projects are OpenTelemetry – a set of tools, APIs and SDKs to create telemetry pipelines for metrics, logs, and traces [20], and the Technical Advisory Group for Observability (TAG-Observability) [21, 22]. This move towards open-source standards is driving healthy changes in the area of observability and offers immense opportunity for both end-users and service providers to get involved.
\\n\\n\\n\\nPiloting CNCF projects such as OpenTelemetry and Fluentd can improve observability pipelines, minimize vendor lock-in, and reduce cost.
\\n\\n\\n\\nModern architectures require new and innovative security methods. Concepts such as zero trust and secure supply chain are receiving attention at both end-user organizations and nation-state levels [23]. As the number of applications adopting microservices architecture increases, along with the growing sophistication of organized bad actors (e.g., LockBit 2.0, Conti), security is at the top of the priority list for CNCF [24].
\\n\\n\\n\\nThe recently graduated project, Falco (a runtime tool to detect security threats), is a great step in this direction [25]. In addition, cloud-native architectures rely on both CNCF as well as associated landscapes and foundations such as OpenSSF [26] that drive innovation in secure supply chains. TAG-Security is a key part of CNCF [27]. Beyond ensuring security for CNCF projects, it also publishes white papers that offer direction to the industry on the topic of security [28]. The area of policy as code, which is a critical part of security, is being driven by projects such as Open Policy Agent (OPA) and Kyverno [29, 30]. Both provide functionalities to define security policies as code.
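\\n\\n\\n\\nTo make the policy-as-code idea concrete, here is a small sketch of a Kyverno `ClusterPolicy`; the required `team` label is an arbitrary example:

```yaml
# Sketch: a Kyverno ClusterPolicy expressing a security guardrail as version-controlled code.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # reject non-compliant resources instead of only auditing
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Every Pod must carry a 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*"             # any non-empty value satisfies the rule
```

\\n\\n\\n\\nKeeping policies like this in Git gives security teams the same review and audit trail as any other code change.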
\\n\\n\\n\\nWhile there are cutting-edge open-source projects covering certain parts of the security landscape, this trend is dominated by product suites offered by established security vendors. In addition, each cloud vendor offers a suite of security products native to their implementations such as AWS Inspector and Microsoft Defender for Cloud. This trend of partnerships between open-source and established vendors to offer comprehensive security solutions is likely to continue into the foreseeable future.
\\n\\n\\n\\nEnd users can work with both vendors and open-source projects to implement security guardrails across all layers of their infrastructure and applications. Organizations can uplift the capability of security teams by educating them on the new challenges created by cloud-native and microservices architecture.
\\n\\n\\n\\nGreenIT, GreenOps, and sustainability are popular topics today. There are some CNCF projects that serve this need by measuring the carbon consumption of applications on the Kubernetes platform. Some examples of these projects include Kepler [31], an operator that collects and exports metrics using the extended Berkeley Packet Filter (eBPF) to estimate energy consumption by pods, and OpenCost [32], a tool that provides visibility into Kubernetes spend and resource allocation. Observability into carbon consumption is a key first step to reducing carbon emissions and increasing sustainability.
\\n\\n\\n\\nSustainable IT operations are being driven increasingly by legislation such as the EU sustainability reporting rules and regulations in 2024 [33]. Corporations are currently harvesting low-hanging fruit through programs such as green datacenters, reducing end-user devices, etc. However, in the coming years, they will need to drill down to identify the most sustainable choices for application development.
\\n\\n\\n\\nWhile tooling for this area is still in its infancy, multiple open-source projects and standards are being developed. End users leading the charge are working to integrate green principles into their tools and applications to reduce their carbon footprint.
\\n\\n\\n\\nAs a first step, organizations can pilot Kepler or OpenCost for their container workloads.
\\n\\n\\n\\nFor other layers of their installed IT landscape, companies may need to identify and shortlist tools from their cloud provider or third-party vendors to measure carbon cost and sustainability impact.
\\n\\n\\n\\nThis list would not be complete without an overarching trend that has been driving cloud-native over the last decade: the adoption of Kubernetes as the cloud-native orchestrator-of-choice for modern technology platforms.
\\n\\n\\n\\nGoogle first open-sourced Kubernetes as an open implementation of its internal orchestrator Borg in 2014 [34]. It was the first project of CNCF and the first project to graduate. It has become the de-facto platform for modernization projects. Kubernetes continues with a cadence of three releases per year but a majority of the features being worked on at this stage of its maturity are around reliability, scaling, and security. This is beneficial as it enhances the production readiness of Kubernetes, which is likely to remain the platform-of-choice for both modernization and greenfield developments for the foreseeable future.
\\n\\n\\n\\nEnd users are preparing for the future of cloud-native implementations by standardizing best practices and creating stable blueprints for the implementation of effective platforms around Kubernetes. This trend is also illustrated by the rise and acceptance of platform engineering, which now focuses on the tooling around Kubernetes required to create an effective platform [35].
\\n\\n\\n\\nPlatform engineering is the practice of designing, building, and operating reusable software platforms that provide a foundation for multiple applications to operate. It enables faster delivery, improved quality, and increased scalability. In addition to Kubernetes, the core engine for most platforms today, platform engineering focuses on building blocks such as observability, policy as code, internal developer portals, security, continuous integration/continuous delivery (CI/CD), and storage. The end result is a set of business capabilities such as reliability, performance, scalability, and availability for any application hosted on the platform, plus a declarative, self-service approach for users to access these capabilities.
\\n\\n\\n\\nEnd-user organizations need to build strong platform engineering capabilities. This will have a major impact on how Kubernetes is used to create effective container platforms. A good place to start is to map critical use cases within their organizations that can be standardized and automated through a platform engineering approach.
\\n\\n\\n\\n[1] https://www.cloudkeeper.com/insights/blog/2024-state-finops-report-key-trends-cloud-finops
\\n\\n\\n\\n[2] https://www.finops.org/introduction/what-is-finops/
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n[5] https://www.cncf.io/reports/cncf-annual-report-2023/
\\n\\n\\n\\n[6] https://www.cncf.io/case-studies/infosysinsurancecustomer/
\\n\\n\\n\\n[7] https://kubernetes.io/case-studies/openai/
\\n\\n\\n\\n[8] https://lfaidata.foundation/
\\n\\n\\n\\n[9] https://landscape.cncf.io/?group=cnai
\\n\\n\\n\\n[10] https://www.cncf.io/reports/cloud-native-artificial-intelligence-whitepaper/
\\n\\n\\n\\n[11] https://opea.dev/
\\n\\n\\n\\n[12] https://milvus.io/
\\n\\n\\n\\n[13] https://kserve.github.io/website/latest/
\\n\\n\\n\\n[14] https://www.kubeflow.org/
\\n\\n\\n\\n\\n\\n\\n\\n[16] https://tag-runtime.cncf.io/wgs/cnaiwg/
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n[19] https://github.com/cncf/tag-observability/blob/main/working-groups/query-standardization.md
\\n\\n\\n\\n[20] https://opentelemetry.io/
\\n\\n\\n\\n[21] https://github.com/cncf/tag-observability
\\n\\n\\n\\n[22] https://github.com/cncf/tag-observability/blob/whitepaper-v1.0.0/whitepaper.md
\\n\\n\\n\\n[23] https://www.whitehouse.gov/wp-content/uploads/2022/01/M-22-09.pdf
\\n\\n\\n\\n\\n\\n\\n\\n[25] https://falco.org/
\\n\\n\\n\\n[26] https://openssf.org/community/sigstore/
\\n\\n\\n\\n[27] https://github.com/cncf/tag-security
\\n\\n\\n\\n[28] https://tag-security.cncf.io/publications/
\\n\\n\\n\\n[29] https://www.openpolicyagent.org/
\\n\\n\\n\\n[30] https://kyverno.io/
\\n\\n\\n\\n[31] https://sustainable-computing.io/
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n[34] https://www.sdxcentral.com/articles/news/how-kubernetes-1-29-improves-open-source-cloud-native-production-readiness/2023/12/
\\n\\n\\n\\n[35] https://www.gartner.com/en/articles/what-is-platform-engineering
\\n\\nCommunity post by Adam Korczynski, ADA Logics
\\n\\n\\n\\nThe Keycloak project has completed its fuzzing audit. The audit was carried out by Ada Logics, a UK-based security firm with deep expertise in fuzz testing, and was funded by the CNCF. The audit is part of the CNCF’s investment in security through security audits, and Keycloak joins a significant list of CNCF projects that have undergone fuzzing audits. See this blog post to read more about the fuzzing work for CNCF projects.
\\n\\n\\n\\nThe audit resulted in Keycloak integrating into the OSS-Fuzz project – an open source fuzzing program developed and offered by Google for critical open source projects. Accepted projects can include their fuzz tests in their OSS-Fuzz build, and OSS-Fuzz will run them with high amounts of compute to scale the projects’ chances of finding potential bugs and vulnerabilities before malicious threat actors find them.
\\n\\n\\n\\nThe audit also saw the auditing team write an extensive fuzzing suite for the Keycloak project that targets both complex processing routines and code paths that interact with third-party services, enabled by mocking those services in the fuzz tests. In total, the auditing team wrote 24 new harnesses and added them all to Keycloak’s OSS-Fuzz integration so that they run continuously on the OSS-Fuzz infrastructure. The auditing team then assessed the feedback from OSS-Fuzz, adjusted the fuzz tests based on that feedback, and added seed corpora to selected fuzz tests.
\\n\\n\\n\\nThe fuzz tests found a low-severity crash during the audit, which the auditing team fixed with an upstream patch. With Keycloak’s integration into OSS-Fuzz, its fuzz tests continue to exercise the Keycloak code base now that the audit is complete.
\\n\\n\\n\\nThe full report from the audit is available here.
\\n\\nThis week’s Kubestronaut in Orbit, Dmitri Telinov, a Senior DevOps Engineer in Chișinău, Moldova, is a curious and avid learner and considered himself a complete beginner in Kubernetes only 3 years ago. When he’s not working, he likes to follow tutorials and courses to increase his DevOps knowledge.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Dmitri, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nI first learned about Kubernetes in 2017 and started to use it extensively at work in 2019.
\\n\\n\\n\\nI learned a lot during preparation for the certification exams. This preparation also included practical experience that helped me advance in my projects at work.
\\n\\n\\n\\nI mostly used these learning platforms: KodeKloud, O’Reilly, and Packt Publishing, where I found some good courses.
\\n\\n\\n\\nI enjoy freelancing, and learning and sharing knowledge on my daily blog, https://www.qwerty.md/
\\n\\n\\n\\nStart with simple and hands-on examples. My biggest mistake was that I started with Kubernetes the Hard Way and it took some time to understand things.
\\n\\n\\n\\nI plan to finish more Linux certifications – for example, I want to get the Linux Foundation Certified System Administrator (LFCS) – and to start learning Node.js and prepare for a certification in that as well.
\\n\\nMember post originally published in the Cerbos blog by James Walker
\\n\\n\\n\\nIf you want to make your authorization more scalable, easier to maintain, and simpler to integrate with your components – externalized authorization is the way to go. However, these benefits are difficult to realize if you don’t consciously plan for them within your authorization implementation.
\\n\\n\\n\\nIn reality, externalized authorization can add new technical challenges that aren’t always apparent at the start of a project. In this article, we explore some of the problems with externalized authorization. We’ll also go through several useful strategies to avoid these pitfalls, so you can implement authorization in the right way.
\\n\\n\\n\\nBefore getting into the concept of externalized authorization, let’s first start with the basics – bear with me.
\\n\\n\\n\\nThe backbone of any secure application is authorization. It determines what actions a user can perform within an application. Authorization ensures users only access what they are allowed to. As applications scale, authorization often becomes more complex, especially when dealing with microservices or distributed systems. Spoiler alert – stay tuned for the upcoming release of our ebook on transitioning from monolith to microservices.
\\n\\n\\n\\nThis is where externalized authorization comes in. But how exactly does it work, and what should you watch out for?
\\n\\n\\n\\nExternalized authorization refers to separating your service’s authorization routines from your main application code.
\\n\\n\\n\\nIn monolithic applications, authorization is handled by functions and classes that live inside your single codebase. Externalized authorization refers to repositioning these components as a standalone service that your main code interfaces with. The interface normally consists of network calls to an API that the authorization component provides.
\\n\\n\\n\\nExternalized authorization is often used together with external identity providers. You can delegate user account storage and role management to an authentication platform that’s purpose-built for the task. Your application can then pass this context on to the authorization layer, which checks whether the roles assigned to a user authorize an action, taking into account all the added context of the application.
\\n\\n\\n\\nThis model keeps authorization logic separate from your application, making it more testable and easier to iterate upon in isolation. It also centralizes the implementation of authorization policies, ensuring all your services apply the same restrictions. Any new services you develop can reuse the externalized authorization component without duplicating its logic. This isn’t possible when authorization is tightly coupled to specific codebases. If you want to learn why companies are turning to externalized (decoupled) authorization more and more, check out this blog.
\\n\\n\\n\\nAuthorization logic that’s written directly into application components is inflexible, and externalized authorization has clear advantages over it. Externalizing authorization into its own service can increase overall complexity, though. You have to develop and maintain two services while ensuring they remain compatible with each other. Here are five specific kinds of technical complexity you’ll face.
\\n\\n\\n\\nPlugging the externalized authorization layer back into your main application can be harder than you think. Whereas authorization has historically been a synchronous process without side effects, externalizing it introduces the potential for system failures when the authorization service can’t be reached or an unknown response is received.
\\n\\n\\n\\nThe externalized authorization component must be carefully integrated to ensure the implementation is reliable. The application will need to retry calls to the authorization layer that fail due to a flaky network, for example. When no response can be obtained, perhaps because the authorization component has failed, the app should deny the user’s request to prevent authorization being inadvertently granted.
\\n\\n\\n\\nEach app you develop will need to integrate the authorization layer before it can be consumed. These integrations should be backed by tests that verify the app correctly handles different authorization outcomes, such as grant, deny, and failure.
\\n\\n\\n\\nExternalizing authorization from application code doesn’t mean you can forget about user accounts and permissions. These still need to be managed by your authentication layer, either directly within your account management component or with an external identity provider.
\\n\\n\\n\\nThe service you use should be flexible enough to store all the user data you require. All but the simplest systems will require a granular permissions model with support for roles and groups. You’ll need a mechanism for setting up and maintaining user attributes and role assignments, either via scriptable APIs for automated provisioning or an accessible web UI for human use.
\\n\\n\\n\\nExternalizing authorization without planning how you’ll manage user accounts can cause problems as you reintegrate your components into your application. Apps need a dependable mechanism for establishing the user’s identity, retrieving relevant attributes such as their team and project, and checking which permissions have been assigned. All this info has to be centralized across your services to preserve consistency.
\\n\\n\\n\\nServices that filter data and display results over multiple pages need to be adjusted so users are only shown the items they can interact with. For example, an API request for the first ten invoices in a system could expose a different set of items depending on the user: department leaders might only see invoices approved by their department, while accounting staff have unfiltered access.
\\n\\n\\n\\nTo achieve this, you’ll need to run your authorization policies against each item included in resource lists fetched by your application. The policies should verify that the items can be used in the current authorization context defined by the characteristics of the resource and the requesting user.
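\\n\\n\\n\\nFor illustration, here is a hedged sketch of such a policy in Cerbos-style YAML for the invoice example above; the role names and attributes (such as `approvingDepartment`) are assumptions:

```yaml
# Sketch: a resource policy for the invoice example above.
# Role names and attribute names are illustrative assumptions.
apiVersion: api.cerbos.dev/v1
resourcePolicy:
  version: "default"
  resource: "invoice"
  rules:
    - actions: ["view"]
      effect: EFFECT_ALLOW
      roles: ["accounting"]            # accounting staff see all invoices
    - actions: ["view"]
      effect: EFFECT_ALLOW
      roles: ["department_leader"]     # leaders only see invoices approved by their department
      condition:
        match:
          expr: request.resource.attr.approvingDepartment == request.principal.attr.department
```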
\\n\\n\\n\\nUnfortunately, performing authorization checks for lists of hundreds or thousands of records is often an inefficient process. It also has knock-on impacts on your pagination routines. When an item gets discarded because authorization is denied, a replacement must be loaded from the database to fill up the correct page size.
\\n\\n\\n\\nTo properly address this complexity, plan how you’ll integrate your data queries with your externalized authorization policies. Select an authorization provider that includes features such as query plans to preemptively retrieve the list of resources that a given user can interact with. Use the plan to filter database queries at the point they’re made, instead of individually running authorization policies against each item in your result sets.
\\n\\n\\n\\nSecurity weaknesses and privacy concerns can be caused by externalized authorization. Any new service increases your threat perimeter and creates an additional target for attackers. Splitting authorization out of your application converts it into a standalone component that might be easier for bad actors to manipulate.
\\n\\n\\n\\nTraditional authorization models are invisible from outside your application. Authorization checks occur within the code, providing no opportunities for attackers to investigate their logic. Externalized authorization can be more visible if your service isn’t properly protected. Network activity logs reveal the requests being made and the results obtained in response.
\\n\\n\\n\\nInsecure authorization APIs can leak data too. It’s vital to ensure your authorization service only responds to requests from known application services via a trusted service-to-service call. Otherwise, a rogue user or attacker could exfiltrate sensitive details by making direct calls to the authorization API.
\\n\\n\\n\\nAuthorization is a critical application component. It’s involved in almost every user interaction, demanding exceptional scalability and reliability. Poorly optimized authorization is a bottleneck that compromises your whole system’s performance.
\\n\\n\\n\\nSplitting authorization into its own service can increase latency as your apps have to wait for authorization checks to complete. Too many pending calls will increase congestion and lead to resource contention. If your authorization layer can’t scale with user activity, people will be left waiting at times of heavy usage.
\\n\\n\\n\\nFailure resilience is equally important. If authorization goes down, users won’t be able to log in or access functions that require permission checks. Authorization services should be deployed as multiple replicated instances to produce a fault tolerant architecture that can withstand individual instance crashes.
\\n\\n\\n\\nExternalized authorization doesn’t have to be burdensome. You can mitigate complexity by sticking to proven strategies that promote an effective implementation. Building upon standard microservices patterns is a good starting point, but the following techniques offer specific best practices for splitting authorization from application code.
\\n\\n\\n\\nWhilst authentication is a well-understood problem, and standards such as OAuth2 and OpenID Connect have made those solutions plug and play, authorization is still in the early phases of standardization.
\\n\\n\\n\\nThere are common best practices and approaches for RBAC, ABAC, and PBAC that make use of the PDP/PIP/PEP model, and there is now an effort underway to define a standard for how all the components involved in the authorization ceremony interact.
\\n\\n\\n\\nThe OpenID AuthZEN Working Group – of which Cerbos is a key member – is defining the specification to ensure that adding fine-grained authorization is just as simple and interoperable as authentication.
\\n\\n\\n\\nMuch like building your own IdP, starting an authorization platform from scratch is a daunting task. You’re responsible for checking your authorization logic and maintaining security standards. Selecting a dedicated platform such as Cerbos gives you all the benefits of externalized authorization without the complexity.
\\n\\n\\n\\nThese systems sit outside your stack and are integrated using their public APIs. You can register user accounts, handle logins, and set up authorization policies using RBAC, ABAC and PBAC. They remove the complexity of inventing your own mechanisms for storing, evaluating, and querying authorization logic.
\\n\\n\\n\\nSome systems demand their own authorization layers either because of their sensitivity or due to legacy compatibility requirements. Developing your own authorization solution can be the only option in these circumstances, but you don’t have to do it on your own.
\\n\\n\\n\\nMinimizing features and keeping code paths lean is a good way to lessen security dangers and remove complexity. After distilling your solution to its essential requirements, you can more readily compare it to reference architectures or invite an external review. Seeking an audit and penetration test from a specialized authorization security team can provide confidence that your system’s protected, allowing you to get back to building your business functionality.
\\n\\n\\n\\nStart developing your own solution by clearly listing the vital features it requires. Next, plan out how you’ll deploy your authorization service, protect it from unauthorized access, and scale it to achieve sustained performance. You can then start working on the technical implementation. Try referring to open source authorization platforms if you need more guidance as many of the challenges you’ll face will have already been encountered by others.
\\n\\n\\n\\nExternalizing authorization from the code of individual applications is a best practice that enforces consistent authorization logic across services, simplifies testing, and is more scalable when implemented correctly.
\\n\\n\\n\\nNonetheless, too many software teams struggle to effectively utilize externalized authorization because of the extra technical complexity it creates. Poorly planned implementations can be unreliable and difficult to maintain.
\\n\\n\\n\\nProactively developing strategies to identify and address this complexity will let you build and scale an externalized authz system for your next project. Involve project managers, developers, and service operators to canvass opinions on potential drawbacks of the approach. Once you’ve identified any problems, you can add relevant mitigations to your development plan.
\\n\\n\\n\\nThe measures you choose can be specific technical changes, such as implementing rate limiting to improve security, or more general steps that support your solution’s success. Extensive test suites, adoption of standardized protocols, and the use of expert guidance when needed all strip away the complexity of externalized authz.
\\n\\n\\n\\nWant more details on getting started with authorization? Set up a call with a Cerbos Engineer and ask us anything.
\\n\\n2025 is right around the corner, and we’re thrilled to announce the CNCF 2025 lineup of events! Next year, we are expanding our reach and will host our first-ever KubeCon + CloudNativeCon in Japan. Mark your calendars, gather your teams, and get ready for an unforgettable year of making cloud native ubiquitous!
\\n\\n\\n\\nCNCF remains dedicated to supporting and expanding emerging communities worldwide. In 2022 and 2023, CNCF hosted KubeDay events to bring cloud native to emerging communities; however, in 2025, alongside the now 5 global KubeCon + CloudNativeCon events, we’ll be shifting focus from KubeDays to Kubernetes Community Days (KCD) by investing additional resources and funding into this program and supporting its organizers. This commitment allows us to invest in community-led events on a larger scale, with an impressive 30 KCDs planned for 2025! Learn more about the KCD program and how you can be part of it in your local community.
\\n\\n\\n\\nInterested in sponsoring a CNCF event? The 2025 sponsorship prospectus is available. Email sponsor@cncf.io to secure your sponsorship today.
\\n\\n\\n\\nAre you a planner?! Save the dates for KubeCon + CloudNativeCon Europe and North America 2026.
\\n\\n\\n\\nAre you interested in knowing how we select dates & locations? Check out our latest blog for a behind-the-scenes look at our selection process to bring our community together!
\\n\\nOpenTelemetry (also known as OTel) is an open-source observability framework with tools, libraries, APIs, and SDKs for collecting, processing, and exporting rich telemetry data such as traces, metrics, and logs to backend systems. It’s designed to help developers monitor and evaluate the performance and health of cloud-native applications, which can be distributed across complex infrastructures.
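\\n\\n\\n\\nTo give a flavour of that collect-process-export flow, here is a minimal, illustrative OpenTelemetry Collector configuration; the backend endpoint is a placeholder, and real deployments typically add more receivers, processors, and exporters:
# Illustrative OpenTelemetry Collector config: receive OTLP telemetry,
# batch it, and export it to an observability backend.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]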
\\n\\n\\n\\nOTel was accepted to CNCF on May 7, 2019 and moved to the Incubating maturity level on August 26, 2021. Since moving to Incubation, OpenTelemetry has become one of CNCF’s most active projects, with high adoption rates across industries; it is the second most active CNCF project after Kubernetes.
\\n\\n\\n\\n“Modern cloud native systems can be complex to manage if an organization lacks the necessary telemetry data and visibility into their varied layers,” says Chris Aniszczyk, CTO, CNCF. “OpenTelemetry has come a long way to mature open source telemetry technology and specifications to benefit all. Our new OpenTelemetry certification supports our goal of educating and promoting best practices for cloud native observability.”
\\n\\n\\n\\nBenefits of OpenTelemetry
\\n\\n\\n\\nOTel brings a unified, community-supported framework for capturing observability data with these key features and benefits:
\\n\\n\\n\\nWhy OpenTelemetry Matters to Platform Engineers
\\n\\n\\n\\nOpenTelemetry simplifies observability and enables better visibility into the complex, distributed systems of cloud-native and microservices-based applications. Here’s why OTel is so valuable to platform engineers:
\\n\\n\\n\\nBy providing a standardized, comprehensive observability solution, OpenTelemetry enables platform engineers to improve reliability, efficiency, and agility, making it an essential tool in building and maintaining modern, cloud-native platforms.
\\n\\n\\n\\n“OpenTelemetry is quickly becoming a ‘must-have’ component of cloud native, and this certification is a great way for developers to demonstrate their mastery. We’re excited to launch this program with CNCF, and anticipate it to be just the first step in learning programs that we build together,” says Austin Parker, OpenTelemetry Governance Committee and Director of Open Source at honeycomb.io
\\n\\n\\n\\nValue of Certification in OTel
\\n\\n\\n\\nOpenTelemetry certification provides valuable benefits, especially for people in cloud-native and observability roles. A certification will validate essential skills in setting up and using OpenTelemetry to monitor distributed systems, covering trace, metric, and log collection, which aids in troubleshooting and optimizing performance. Certification also gives professionals a competitive edge in acquiring or progressing in DevOps, SRE, and Cloud Engineering roles by demonstrating expertise in a critical, widely adopted observability tool. Overall, it’s a solid career investment for advancing in cloud-native practices and enhancing organizational impact.
\\n\\n\\n\\nAnnouncing the OpenTelemetry Certified Associate (OTCA)
\\n\\n\\n\\nCloud Native Computing Foundation and Linux Foundation Education are excited to announce the launch of the OpenTelemetry Certified Associate (OTCA) certification. The certification will help Application Engineers, DevOps Engineers, Site Reliability Engineers, Platform Engineers, and any other IT professional interested in building their skills in OpenTelemetry – the industry standard for traces, metrics, and logs – increase their ability to leverage telemetry data across distributed systems to solve problems and improve team collaboration.
\\n\\n\\n\\nThe primary domains and competencies covered by this certification are:
\\n\\n\\n\\nThe OTCA certification was built in collaboration with Honeycomb, with the participation of people from Chronosphere, Elastic, Lynxmind, F1rst Digital Services, Sicredi, DBS Bank, Accenture, Datadog, and Lightstep.
\\n\\nPlanning a large conference like KubeCon + CloudNativeCon Europe or North America is a complex endeavor that begins years in advance. The venue and date selection process is an exercise in compromise and there are a lot of key factors that are taken into consideration. Here’s an insider’s look at the behind-the-scenes process of how we carefully select each location and date.
\\n\\n\\n\\nDetermining when to host a KubeCon + CloudNativeCon in a new region is primarily dependent on the number of contributors and members in that region, as well as the interest of local members and community. Often, we start with a smaller event, like a KubeDay, to test out interest in both attendance and sponsorship before investing in hosting a KubeCon + CloudNativeCon in that location. India is a great example of this; we hosted a Kubernetes Forum in 2019 & 2020 and a KubeDay in 2023 before rolling out KubeCon + CloudNativeCon, once we saw the strong interest from the local community, members and sponsors.
\\n\\n\\n\\nAs for locations within a region like North America, we have to take availability, accessibility, and space requirements into consideration, which you can read about below. In addition, we strive to move the event around within a region to make the event more accessible to a wider group of community members. Moving events to new locations plays a key role in advancing our diversity and inclusion efforts by allowing broader participation from attendees who may not otherwise be able to travel. This approach not only strengthens our global community but also celebrates and supports local tech ecosystems, fostering innovation and growth in emerging communities.
KubeCon + CloudNativeCon requires a lot of space to accommodate keynotes, breakout sessions, the Solutions Showcase, networking areas, the project pavilion, food & beverage services, meal function seating, specialized activities, and more. As an example, for KubeCon + CloudNativeCon North America in Salt Lake City we utilized 737,000 square feet of space for 9,100+ attendees, which is the equivalent of 55 Olympic-size swimming pools or 3.5 miles of attendees standing in line! These space requirements narrow down the number of venues able to host an event the size of KubeCon + CloudNativeCon: in North America, for example, fewer than 25 cities have the facilities to host it at its current size and scale.
\\n\\n\\n\\nOnce we’ve narrowed down to the venues that can fit KubeCon + CloudNativeCon, we then have to look at availability. Events are happening at an all-time high around the globe and venues are booking out years in advance, especially at this size. As an example, in 2023 when we were evaluating future KubeCon locations, the San Diego convention center was one of the first venues we talked to, and they were already booked out through 2028 for any date that worked for us.
\\n\\n\\n\\nWe start the RFP (request for proposal) process 2-4 years in advance (based on size/scope) to ensure there are options available. When selecting event dates, we do our best to balance venue availability & capacity, timing between CNCF events, observances, holidays & other industry events.
\\n\\n\\n\\nAccessibility is a key factor in selecting an event location as it directly impacts attendee convenience and turnout. When a venue is easily reachable by train, flights, and public transportation, it opens the event to a broader audience.
\\n\\n\\n\\nAttendees coming from out of town need a place to rest their head after a full day of networking and content. It’s a high priority to select a city with hotels that are within walking distance, or are easy to navigate via city transportation, and have rates that fit within company and individual budgets. In Salt Lake City, we had hotel room blocks at 23 separate hotels, to ensure we met the community’s housing needs; a big bonus in Salt Lake City was just how close most of these hotels were to the convention center and to each other.
\\n\\n\\n\\nWe hope this gives a glimpse into the extensive process of planning a KubeCon + CloudNativeCon event and how we bring our global communities together. We look forward to seeing you at a future CNCF event, wherever in the world that could be!
\\n\\nBackstage is an open-source framework for building developer portals, created by Spotify, designed to streamline the process of building software and digital products. Backstage restores order to microservices and unifies infrastructure tooling, services, and documentation to create a better and more efficient developer experience and environment so they can ship high-quality code faster and more autonomously
\\n\\n\\n\\nBackstage was accepted to CNCF on September 8, 2020 and moved to the Incubating maturity level on March 15, 2022. Backstage has more than 3,000 adopters and 2,000 contributors worldwide, including companies like CVS Health, Siemens, LinkedIn, REI, Vodafone, and Lego; as of October 2024, over 270 of these adopters are public.
\\n\\n\\n\\n“Backstage is the leading open source internal developer portal (IDP) that simplifies development, increases developer efficiency, fosters deeper collaboration, and enhances service visibility,” said Chris Aniszczyk, CTO, CNCF. “By integrating seamlessly with various tools through a vibrant plug-in ecosystem, Backstage empowers teams to maintain high standards and drive innovation.”
\\n\\n\\n\\nBackstage as a Cornerstone of Platform Development
\\n\\n\\n\\nBackstage.io provides a unified and customizable interface that centralizes the management of development tools, services, and documentation within an organization. As a developer portal, it brings together everything developers need in one place, simplifying complex workflows, reducing cognitive load, and accelerating the delivery of high-quality software. Backstage enables platform teams to build a cohesive internal ecosystem, making it a cornerstone for organizations seeking to improve productivity, standardization, and governance across projects.
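\\n\\n\\n\\nAt the heart of that unified interface is the software catalog, which is typically populated from descriptor files stored alongside the code. A minimal, illustrative catalog-info.yaml (the component name, owner, and annotation values are placeholders) might look like this:
# Illustrative Backstage catalog descriptor: how a service can describe
# itself so the software catalog surfaces ownership and documentation.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-service            # placeholder component name
  description: Handles payment processing
  annotations:
    backstage.io/techdocs-ref: dir:.
spec:
  type: service
  lifecycle: production
  owner: team-payments              # placeholder owning team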
\\n\\n\\n\\nKey features that drive platform development include:
\\n\\n\\n\\nWhy Developers Prefer Working with Backstage
\\n\\n\\n\\nDevelopers like Backstage because it streamlines software development by providing a central platform to manage projects, documentation, and tools for building software components in a standardized way, making it an essential tool in DevOps and platform engineering. Designed with developers in mind, Backstage aligns with the growing trend toward customizable, developer-centric environments that integrate seamlessly with various tools and services.
\\n\\n\\n\\nBackstage’s Developer Self-Service functionality improves access to tools, automates routine tasks, and reduces developer reliance on platform teams, accelerating development cycles. Its plugin-based modularity makes it a hub for all the tools developers need to efficiently build, test, and deploy applications, streamlining workflows and improving overall productivity.
\\n\\n\\n\\n“We continue to see more and more companies adopt the Backstage open source framework as their internal developer portal of choice. The framework’s customization, extensibility, and scalability make Backstage the preferred IDP for organizations of all shapes and sizes, from finance, government, health care, and manufacturing, to scale-ups, startups, and other digital natives. As the ecosystem of Backstage plugins, features, and service providers continues to grow, the demand for Backstage experts also grows. With the CNCF’s Certified Backstage Associate program, companies now have a way to identify and tap into that expertise, while helping to contribute back to the growing Backstage community,” says Pia Nilsson, Head of Platform Developer Experience, Spotify.
\\n\\n\\n\\nBackstage Offers Benefits for Multiple Roles in Your Org
\\n\\n\\n\\nBackstage offers value beyond developers, benefiting various roles within an organization. For engineering managers, it helps maintain standards and best practices across teams, making it easier to manage the entire tech ecosystem, from migrations to test certifications. Platform engineers benefit from Backstage’s extensibility and scalability, with the ability to seamlessly integrate new tools and services through plugins, or enhance existing ones. Ultimately, for everyone, Backstage delivers a consistent, unified experience that brings together all infrastructure tooling, resources, standards, ownership, contributors, and administrators in a single, cohesive platform.
\\n\\n\\n\\nAnnouncing the Certified Backstage Associate (CBA) certification
\\n\\n\\n\\nWith all these benefits, knowledge of Backstage is becoming essential to platform development. A certification will validate your expertise in platform engineering, service catalog management, and DevOps integration, making you more competitive for roles like platform engineer, DevOps specialist, or site reliability engineer (SRE).
\\n\\n\\n\\nCloud Native Computing Foundation and Linux Foundation Education are excited to announce plans to soon launch the Certified Backstage Associate (CBA) certification. Developers, engineers and others interested in the certification can sign up to be notified of the beta and general availability launch here.
\\n\\n\\n\\nThe primary domains and competencies covered by this certification are:
\\n\\n\\n\\nThe CBA certification was built in collaboration with Frontside with the participation of people from Spotify, Rootly, Roadie, Red Hat, Adaptavist, Anuclei, Agile Lab, Adyen, IBM, Cognizant, SAS, Roboautal Pte Ltd, John Deere and Airbus.
\\n\\nIt’s fitting that on the last day of KubeCon it was time to celebrate the community and the 10th anniversary of Kubernetes. A packed ballroom at the Salt Palace Convention Center was treated to a lot of exciting news and future plans, not to mention very fun rounds of Family Feud.
\\n\\n\\n\\nHere’s a brief recap of the day.
\\n\\n\\n\\nChris Aniszczyk kicked off the event and ushered in a lively discussion among co-chair Joseph Sandoval of Adobe, Lachlan Evenson of Microsoft, and Kelsey Hightower about the past decade of cloud native and Kubernetes and whether we’re actually “there” yet.
\\n\\n\\n\\nAnd of course there was lots of exciting community news.
\\n\\n\\n\\nAfrica is home to the fastest growing population of developers, which makes it even more exciting that CNCF is partnering with Andela to train 20k to 30k developers over the next two to three years. Starting in 2025, the partnership will be offering Kubernetes and Cloud Native Associate and Certified Kubernetes Application Developer certifications.
\\n\\n\\n\\nAlso, there are three new cloud native certifications: Backstage, OpenTelemetry, and Kyverno.
\\n\\n\\n\\nMark your calendars!
\\n\\n\\n\\nThe next KubeCon + CloudNativeCon schedule is live and includes some firsts, namely a KubeCon Japan!
\\n\\n\\n\\nKubeCon + CloudNativeCon India 2024 | December 11-12 | New Delhi
\\n\\n\\n\\n2025
\\n\\n\\n\\nKubeCon + CloudNativeCon Europe 2025 | April 1-4, 2025 | London, United Kingdom
\\n\\n\\n\\nKubeCon + CloudNativeCon China 2025 | June 10-11, 2025 | Hong Kong
\\n\\n\\n\\nKubeCon + CloudNativeCon Japan 2025 | June 16-17, 2025 | Tokyo, Japan
\\n\\n\\n\\nKubeCon + CloudNativeCon India 2025 | August 6-7, 2025 | Hyderabad, India
\\n\\n\\n\\nKubeCon + CloudNativeCon North America 2025 | November 10-13, 2025 | Atlanta, GA
\\n\\n\\n\\n2026
\\n\\n\\n\\nKubeCon + CloudNativeCon Europe | March 23-26 | Amsterdam
\\n\\n\\n\\nKubeCon + CloudNativeCon North America | October 26-29 | Los Angeles
\\n\\n\\n\\nKubernetes 10th anniversary survey
\\n\\n\\n\\nThe community knows what an impact Kubernetes has had on cloud native and on the broader technology ecosystem, but our survey also uncovers how it has changed lives and careers.
\\n\\n\\n\\nHonoring the past to forge ahead
\\n\\n\\n\\nGail Frederick, Heroku CTO at Salesforce, spoke about the ideal practices for application development in the cloud, emphasizing minimizing fragility, maintaining growth, and reducing friction. These ideas were distilled from the Twelve-Factor App methodology, which is now an open source project. Twelve-Factor plans to work on telemetry, creating standards around secrets and identity, and deploying multiple groups of applications.
\\n\\n\\n\\nCloud native technologies to watch
\\n\\n\\n\\nLin Sun of Solo.io and Karena Angell of Red Hat, both from the CNCF Technical Oversight Committee, called out four key cloud native technologies attendees should be watching in the coming months. The first – cloud native and AI – was probably not a surprise, but the second – cost and going green – brought about a lively discussion about the price and footprint of AI, though of course AI can also be used to help optimize cloud spend and carbon footprints. The other tech trends to watch are multi-cluster and simplification.
\\n\\n\\n\\nAnnouncements
\\n\\n\\n\\nKubeCon + CloudNativeCon Japan
\\n\\n\\n\\nCNCF expands certifications to platform engineering
\\n\\n\\n\\nAutomate Kubernetes security and operations with Kyverno certified associate
\\n\\n\\n\\nInternal developer platforms at scale with certified backstage associate
\\n\\n\\n\\nGain insights into cloud native apps with the OpenTelemetry Certified Associate
\\n\\n\\n\\nCNCF and Andela announce partnership to train 20,000+ African tech talent
\\n\\nKyverno is an open-source policy engine designed for Kubernetes that allows teams to validate, mutate, and generate configurations, enabling the automation of security policies as code, beyond just audit and enforcement.
\\n\\n\\n\\nKyverno was created by Nirmata and contributed to the CNCF in November 2020, and graduated to the CNCF Incubator in July 2022. Since then, it has experienced nearly 10X growth in downloads and gained over 2,000 GitHub stars, becoming a popular tool for platform engineering teams using Kubernetes.
\\n\\n\\n\\n“Kyverno simplifies Kubernetes policy management and enhances security in cloud-native environments, making it a valuable tool for platform engineering teams,” said Chris Aniszczyk, CTO, CNCF. “Kyverno’s Kubernetes-native design and ease of use, along with its integration into CI/CD pipelines, have contributed to its widespread adoption in cloud-native projects.”
\\n\\n\\n\\nKyverno is designed to be used by Kubernetes administrators, operators, and DevOps teams who are responsible for managing and maintaining Kubernetes clusters. It can be especially valuable in situations where policy management, resource validation, and dynamic policy enforcement are required.
\\n\\n\\n\\nKyverno policies can:
\\n\\n\\n\\nWhy Kyverno matters to security
\\n\\n\\n\\nKyverno secures software supply chains by automating security, compliance, and best practices validation. It can verify container images and metadata, allowing teams to create an allowed list of approved base images for constructing containers. Additionally, Kyverno tailors security configurations with fine-grained pod security controls, offering flexibility to exempt specific controls within a pod security profile.
\\n\\n\\n\\nKyverno streamlines the DevSecOps workflow and security management in cloud-native environments by validating resources as part of the CI/CD pipeline, producing policy reports that show the results of policy decisions, and enforcing policies as a Kubernetes admission controller, CLI-based scanner, or at runtime.
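\\n\\n\\n\\nAs a small, illustrative example of that admission-control flow (the label name and message are placeholders), a Kyverno ClusterPolicy that rejects Pods missing a required label looks roughly like this:
# Illustrative Kyverno ClusterPolicy: block Pods at admission time unless
# they carry a "team" label.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label          # placeholder policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label 'team' is required on all Pods."
        pattern:
          metadata:
            labels:
              team: "?*"            # any non-empty value is accepted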
\\n\\n\\n\\nValue of a Kyverno Certification
\\n\\n\\n\\nEarning a Kyverno certification can enhance your knowledge of Kubernetes policy management and demonstrate your ability to handle security, compliance, and operational aspects of cloud-native projects in your current role or help progress your career. The education required for the certification will help you learn how to create, apply, and manage Kyverno policies, while also building professional credibility and standing out from the competition. Additionally, certification prepares you for roles such as Kubernetes security specialist, DevSecOps engineer, or Kubernetes administrator.
\\n\\n\\n\\n“We are excited to launch the Kyverno Certified Associate (KCA) exam in partnership with the CNCF and Linux Foundation Education. Kubernetes runs mission-critical workloads across all major verticals, and Kyverno has become an indispensable tool with its ability to automate security and operations with policy as code,” says Jim Bugwadia, Nirmata Co-founder and CEO. “With this certification, Kubernetes administrators will be able to assess their expertise in Kyverno and prove their ability to address key use cases for their organizations.”
\\n\\n\\n\\nAnnouncing the Kyverno Certified Associate (KCA) certification
\\n\\n\\n\\nCNCF, with Linux Foundation Education, currently offers one Kyverno-specific course, Mastering Kubernetes Security with Kyverno (LFS255), and we’re excited to announce the launch of the Kyverno Certified Associate (KCA). The KCA is designed to help you establish yourself as an expert in managing and securing Kubernetes environments. Passing the KCA demonstrates your deep understanding of Kyverno and will highlight your proficiency in cloud-native management, policy automation, and security. By gaining Kyverno expertise, you’ll be better positioned to meet the growing demand for cloud security professionals and take your career to the next level.
\\n\\n\\n\\nThe primary domains and competencies covered in this certification are:
\\n\\n\\n\\nThe KCA certification was built in collaboration with Nirmata, the creator of Kyverno, with the participation of people from KubeCost, PE Digital GmbH, Ohio Supercomputer Center, Snapp!, Quantela and VMware.
\\n\\nOn the second day of KubeCon, nearly 9,200 attendees had the opportunity to focus on the theme of the day – security – while attending sessions, visiting the Sponsor Showcase, and networking. The mood was upbeat, and the enthusiasm for learning and sharing was real.
\\n\\n\\n\\nHere’s a snapshot of the day.
\\n\\n\\n\\nAn update from the End User TAB
\\n\\n\\n\\nTaylor Dolezal, head of ecosystem, kicked off the morning by asking the crowd, “what was your first mountain in this ecosystem?” His first mountain was Kubernetes! He announced the return of CNCF’s Tech Radar service (here’s a look at Tech Radar in 2021), and then introduced the End User Technical Advisory Board. The End User TAB, “the voice of the end user,” shared its 2024 achievements, including the publication of multiple reference architectures, integration with LFX Insights, a successful feedback pilot program, and increased end user participation.
\\n\\n\\n\\nMeet the Envoy AI Gateway
\\n\\n\\n\\nAlexa Griffith, senior software engineer with Bloomberg, debuted the Envoy AI Gateway, a new GenAI gateway developed as a collaborative open source effort between engineers at Bloomberg and Tetrate. The Envoy AI Gateway aims to solve three major pain points common to LLMs: different LLM providers require different access patterns, they use different ways to manage credentials, and the service-specific models have different needs.
\\n\\n\\n\\nAwards!
\\n\\n\\n\\nEvery year the community votes for the top end users based on their contributions and what they’ve achieved. Taylor Dolezal presented this year’s top three winners.
\\n\\n\\n\\n3rd place: Reddit
Supporting millions of daily active users and processing billions of page views monthly, Reddit has demonstrated exceptional implementation of CNCF technologies across their hybrid cloud infrastructure, while actively contributing to core projects and fostering diversity through mentorship programs and scholarships in the cloud native community.
2nd place: Capital One
\\n\\n\\n\\nAs the first major U.S. bank to fully transition to the cloud, Capital One has leveraged CNCF projects to revolutionize their financial services infrastructure, contributing the widely-adopted Cloud Custodian to the ecosystem, while achieving remarkable metrics including a two-orders-of-magnitude increase in deployment frequency and a 4x cost reduction in AWS expenses compared to non-Kubernetes alternatives.
\\n\\n\\n\\n1st place: Adobe
\\n\\n\\n\\nAdobe has transformed their massive cloud infrastructure supporting Creative Cloud, Document Cloud, and Experience Cloud through extensive CNCF project adoption, making over 5,160 contributions across 46 different projects, while demonstrating particular technical leadership in Kubernetes implementations and developer experience tooling that powers creative tools used by millions globally.
\\n\\n\\n\\nChris Aniszczyk presented the Community Awards.
\\n\\n\\n\\nThe Top committer/maintainer is Joe Stringer.
\\n\\n\\n\\nThe Top Documentarians are Qiming Teng and Haifeng (Michael) Yao.
\\n\\n\\n\\nThe Taggie is Nancy Chauhan.
\\n\\n\\n\\nThe Chop Wood Carry Water awards – created to represent all the work that happens behind the scenes in a project – went to Stefan Schimanski, Ali Ok, James Spurin, Priyanka Saggu, Sandeep Kanabar, and William Rizzo.
\\n\\n\\n\\nThis year there were also two new awards, including the “Lift and Shift” awards, relating to work done for Kubernetes. The winners are: Tim Hockin, Aaron Crickenberger, Ben Elder, Arnaud Meukam, Davanum Srinivas, Mahamed Ali, Ricky Sadowski, Michelle Shepardson, Koray Oksay, Patryk Przekwas, Marko Mudrinic, Justin Santa Barbara, Cole Wagner, Caleb Woodbine, Hippie Hacker, and Linus Arver.
\\n\\n\\n\\nAnd the first-ever Lifetime Achievement Award goes to Tim Hockin!
\\n\\n\\n\\nStop being a software ostrich!
\\n\\n\\n\\nKubernetes is stable and boring, according to Nikhita Raghunath, principal software engineer at Broadcom. And while that is great, because it means Kubernetes is ubiquitous, it also means attackers are not going to leave things alone. In fact they are only going to get a lot sneakier. So “if you think cloud native is done disrupting things, buckle up, because things are about to get wild,” Raghunath said.
\\n\\n\\n\\nFrom actually *using* SBOMs to AI bills of materials and quantum computing, security has to be built into every layer so we can truly disrupt cloud native, Raghunath explained.
\\n\\n\\n\\nOpen source security is not a spectator sport
\\n\\n\\n\\nDespite what conventional wisdom might tell you, anyone can contribute to security, even if you aren’t an expert or don’t have a PhD. That is the conclusion of Justin Cappos of NYU and Santiago Torres-Arias of Purdue University, who’ve studied cybersecurity extensively and believe the more people who get involved, the better. So for those wanting to learn more about security, they have a number of suggestions including take classes, get hands-on experience, join a security project, or find a group which specializes in security. Their group recommendations include CNCF’s TAG Security Group and Linux Foundation’s OpenSSF.
\\n\\n\\n\\nRead about KubeCon Day 0 and Day 1.
\\n\\nGitOps provides a pathway to stable, dependable, and predictable cloud native infrastructure and workflows. Over the past few years GitOps and Argo have grown hand in hand as ArgoCD has become a reliable solution for consolidating and extending GitOps inside Kubernetes environments.
\\n\\n\\n\\nHowever, Argo is much more than just a GitOps project – so we are excited to share the new documentary, “Inside Argo: Automating the Future.”
\\n\\n\\n\\nThe film, produced by CNCF and Speakeasy Productions, gives viewers an inside look at the journey of a groundbreaking open source project that revolutionized Kubernetes workflows and unveils how Argo grew from a single workflow engine to a powerful collection of tools to simplify and automate Kubernetes deployments – Argo Workflows, CD, Rollouts, and Events.
\\n\\n\\n\\n“Creating Argo was a journey that paralleled the grit and inspiration of the cloud native community,” said Pratik Wadher, senior vice president of product development at Intuit and Argo Project creator. “This premiere is a celebration of the people and stories that created the vibrant, diverse Argo community that we have today.”
\\n\\n\\n\\n“After 8 years of intensive product development, we’re excited to share the full story of the Argo Project – a documentary celebrating the innovation, collaboration, and impact of open-source in transforming Kubernetes application deployment,” said Hong Wang, Argo project creator and CEO at Akuity.
\\n\\n\\n\\nArgo’s suite of tools can each be used independently, but there is great benefit in using them together to create and operate complex applications at scale. Argo Workflows enables the creation of complex parallel workflows as Kubernetes resources for use cases like CI/CD pipelines and machine learning workflows. Argo Events provides declarative management of event-based dependencies for Kubernetes resources. Argo CD and Argo Rollouts help engineers adopt and use Kubernetes and make GitOps best practices more approachable.
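\\n\\n\\n\\nTo illustrate the GitOps side of that toolkit, here is a minimal Argo CD Application sketch (repository URL, path, and namespaces are placeholders) that keeps a namespace in sync with manifests stored in Git:
# Illustrative Argo CD Application: continuously reconcile a namespace
# against manifests stored in a Git repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deployments.git   # placeholder repo
    targetRevision: main
    path: apps/guestbook                                      # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: guestbook
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert out-of-band changes in the cluster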
\\n\\n\\n\\n“The amazing thing about open source is you can be part of something so impactful and so important,” said Dan Garfield, Argo project maintainer, co-founder & chief open source officer at Codefresh. “Argo has exceeded all my wildest expectations to power the world, and it’s only accelerating. There’s still time to be part of this story, come contribute, share your successes, ask questions, point out bugs, and help us build the future.”
\\n\\n\\n\\n“We have seen first-hand the momentum of the Argo project over the years and its growth to become the forefront of GitOps and empowering teams to manage infrastructure and application deployment with agility and confidence,“ said Siamak Sadeghianfar, senior manager, product management at Red Hat. “We’re excited to share the project’s story, and we look forward to continuing to support and contribute to the project as a critical component of Red Hat OpenShift.”
\\n\\n\\n\\nSince joining the CNCF Incubator in 2020, Argo has become one of CNCF’s most active projects. As of its graduation in 2022, Argo had seen a 250 percent jump in its use in production workloads, counting more than 350 end user organizations such as Adobe, Capital One, CERN, Intuit, and Ticketmaster. In our most recent look at project velocity this summer, Argo ranked in the top 3 CNCF projects and top 5 Linux Foundation projects.
\\n\\n\\n\\n“Argo’s graduation from the CNCF marked a major milestone in cloud native technology and solidified the project as a critical tool in driving the GitOps movement forward,” said Chris Aniszczyk, CTO of CNCF. “Argo has proven itself to be a powerful tool for managing complex Kubernetes workflows and we are proud to share their success story with the cloud native community in this compelling documentary. Its robust community, high standards for security, and extensive industry adoption reflect its maturity and value to organizations worldwide.”
\\n\\n\\n\\nThe film features contributors, maintainers, and community members from Akuity, CNCF, Codefresh by Octopus Deploy, Intuit, and Red Hat, who share the story behind Argo’s origins and its rise to one of the most transformative projects in modern cloud native development.
\\n\\n\\n\\nThe film will debut on Thursday, November 14, at KubeCon + CloudNativeCon North America in Salt Lake City.
\\n\\n\\n\\nMore than 9,000 people convened at the Salt Palace Convention Center in Salt Lake City for the first day of KubeCon + CloudNativeCon North America. The mood was energetic and lively and the audience was primed to dive into the themes of the day: artificial intelligence and platform engineering.
\\n\\n\\n\\nHere’s a look at some of the highlights.
\\n\\n\\n\\nWelcome, and banish trolls!
\\n\\n\\n\\nThe conference began with a video welcome from Priyanka Sharma, executive director of CNCF, who reminded the audience that they are the people who built cloud native.
\\n\\n\\n\\nChris Aniszczyk, CTO of CNCF, continued with this theme, pointing out that nine-year-old CNCF now has over 200 projects with over 255,000 contributions across 193 countries. That amazing level of success has also drawn attention to the open source community, and not all of it the right sort of attention.
\\n\\n\\n\\nIn some cases, so-called patent trolls, whose sole interest is buying patents and threatening adopters of technology with lawsuits, have started coming after the open source community with these frivolous claims. As Sharma and Aniszczyk stressed, the time has to come to stop this.
\\n\\n\\n\\nJim Zemlin, executive director of the Linux Foundation, took up that thread, vowing to crush patent trolls, promising “no negotiation and no settlements.” (There were definitely cheers of support from the audience at this point.) Zemlin pointed to the Linux Foundation/CNCF partnership with Unified Patents as already showing a 90% success rate in defeating patent trolls.
\\n\\n\\n\\nThe relationship with Unified Patents has already invalidated dozens and dozens of patents, explained Joanna Lee, VP of strategic programs and legal at CNCF. But, echoing what Zemlin said, she said a broader push is needed because patent trolls are responsible for more than 80% of high tech litigation today. Lee said that was one of the driving forces behind the decision to launch the Cloud Native Heroes Challenge, where cloud native developers and technologists can earn swag and win prizes by helping protect our ecosystem from patent trolls. (And again, there was definitely audience applause at this announcement.)
\\n\\n\\n\\nMulti-cluster batch job dispatching with Kueue at CERN
\\n\\n\\n\\nMarcin Wielgus, staff software engineer at Google, and Ricardo Rocha, lead platform infrastructure at CERN, took a deep dive into how Kueue tackles one of the trickier issues around admission to Kubernetes workloads. Working around the principles of “quota borrowing and fair sharing,” Kueue helped CERN and its data physicists improve particle flow by having a central place to submit jobs.
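\\n\\n\\n\\nAs a rough, illustrative sketch of the quota-borrowing idea (queue names, the cohort, and quota values are placeholders rather than CERN’s actual configuration, and a matching ResourceFlavor object is assumed), a Kueue ClusterQueue that shares capacity within a cohort looks something like this:
# Illustrative Kueue ClusterQueue: queues in the same cohort can borrow
# unused quota from each other, enabling fair sharing of batch capacity.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: physics-analysis          # placeholder queue name
spec:
  cohort: batch-experiments       # queues in this cohort share unused quota
  namespaceSelector: {}           # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor    # assumes a ResourceFlavor with this name
          resources:
            - name: cpu
              nominalQuota: 1000
            - name: memory
              nominalQuota: 4Ti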
\\n\\n\\n\\nTake a peek under the hood of cloud native AI at scale
\\n\\n\\n\\nOne of the biggest challenges running AI workloads is that they tend to live in black boxes, so it can be difficult to know what’s really going wrong, and that’s what CoreWeave’s Peter Salanki, chief technology officer, and Chen Goldberg, senior vice president of engineering, set out to demonstrate. Through a series of steps and processes, they walked through how CoreWeave was able to cut “interruptions” time in half and save money by making sure it was possible to look under the hood at what was actually going on.
\\n\\n\\n\\nTheir motto: Failures are inevitable but management is key.
\\n\\n\\n\\nPaving the way for AI through platform engineering
\\n\\n\\n\\nIf there’s a secret sauce to AI success, Kasper Borg Nissen, staff engineer at Lunar, would argue that it’s applying platform engineering principles to AI platforms. Classic platform engineering concepts – self-service, explicit APIs, paved paths, platform as a product, etc. – can be tweaked to work for an AI journey. That worked at Lunar, where over 60% of text communication with customers is now handled by AI without human intervention. That translates into a 93% reduction in support resolution time. Or, to look at it another way, AI has taken on the equivalent of 13% of full-time staff work. And Lunar did it using existing cloud native investments.
\\n\\n\\n\\nNvidia case study – many facets of building and delivering AI in the cloud native ecosystem
\\n\\n\\n\\nChris Lamb, vice president, computing software platforms at NVIDIA, walked attendees through NVIDIA’s long history in the cloud native/open source ecosystem. Lamb invited the audience to explore digital human architecture using Blueprint to create a digital human assistant. Try a demo with James.
\\n\\n\\n\\nThe engineering future of generative AI platforms on Kubernetes
\\n\\n\\n\\nAparna Sinha, senior vice president and head of AI product at Capital One, thinks generative AI has the potential to be as mainstream as the internet or mobile technologies. One huge advantage it has? It comes with a more natural human interface immediately giving it a broader range of users thanks to audio, images and video. GenAI has hundreds of use cases today just in banking, Sinha said, from automating coding to processes across the back office. “AI is the next layer,” she said. “Let’s build the puzzle together.”
\\n\\n\\n\\nFor all of the technology and networking and learning going on at KubeCon, there is also plenty of fun to be had! Attendees can try the local specialty dirty soda (it’s delicious and available in a wide variety of flavors, both regular and diet), try their hands at pickleball and curling as well as arcade and console games, pet some cute dogs, visit the relaxation station, build with popular building blocks, and try the coffee bars – and all of this is available within the convention center.
\\n\\n\\n\\nBetter pod availability – a survey of the many ways to manage workload disruptions
\\n\\n\\n\\nZach Loafman, staff SRE at Google, made the case for a new taxonomy for disruptions – basically breaking it into bad and good disruptions, then walked attendees through a case study in pod disruption. His takeaway: managing pod disruptions on Kubernetes requires a series of tradeoffs and it’s important to think about the cost of disruption.
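\\n\\n\\n\\nOne of the simplest knobs in that toolbox is a PodDisruptionBudget, which bounds how many replicas voluntary (“good”) disruptions such as node drains may remove at once. A minimal, illustrative example, where the app label and threshold are placeholders:
# Illustrative PodDisruptionBudget: keep at least two replicas of the
# workload running during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb          # placeholder name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-frontend           # placeholder workload label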
\\n\\n\\n\\nCan your Kubernetes network handle the heat? Building resilience with AI chaos
\\n\\n\\n\\nSurya Seetharaman, principal software engineer at Red Hat, and Lior Lieberman, site reliability engineer at Google, outlined the challenges involved in Kubernetes networking and suggested that even being proactive is probably not sufficient to get ahead of the issues. Their advice is to embrace chaos testing and use CNCF projects including Litmus, Krkn, Chaos Mesh and kube-burner. But, because human-created chaos isn’t enough, use AI to enhance the chaos experiments.
\\n\\n\\n\\nScalable authentication across organizations with Keycloak 26
\\n\\n\\n\\nAnnouncing the release of KubeVirt v1.4
\\n\\n\\n\\nAnnouncing the Cloud Native Heroes Challenge
\\n\\n\\n\\nAnnouncing the inaugural contest for the Cloud Native Heroes Challenge
\\n\\n\\n\\nLitmusChaos gains adoption in lower environments
\\n\\n\\n\\nCloud Native Computing Foundation Announces cert-manager Graduation
\\n\\n\\n\\nCloud Native Computing Foundation Announces Dapr Graduation
\\n\\n\\n\\nJaeger v2 released: OpenTelemetry in the core!
\\n\\n\\n\\nA look at the Cilium CNCF project journey report
\\n\\nProject post by Alexander Schwartz, Keycloak Maintainer
\\n\\n\\n\\nKeycloak brings scalable and customizable authentication to your environment! The team is thrilled to announce the release of Keycloak 26 which again improves its authentication features for its growing community. It also simplifies an admin’s activities to customize, run, and upgrade Keycloak deployments. Keycloak continues to be a cornerstone of self-hosted security stacks and integrates well with other components by supporting open standards.
\\n\\n\\n\\nCelebrate this release with us at KubeCon North America! Join Keycloak team members to discuss all things Keycloak at the Project pavilion on November 13-15, 2024 (open during the afternoons). For those interested in highly available architectures, come see our talk on Running a Highly Available Identity and Access Management with Keycloak by Ryan Emerson and Kamesh Akella on Friday 4:55pm MST.
\\n\\n\\n\\nKeycloak has four feature releases a year. Read on to learn what’s new in release 26, published in October this year.
\\n\\n\\n\\nWith Keycloak being a security product, it is essential to keep it up-to-date and simple to operate. This release minimizes downtimes, reduces memory footprint, and improves on tooling for troubleshooting.
\\n\\n\\n\\nIn this release we added the following features:
\\n\\n\\n\\nThis release is also the first of a series of minor releases which will ship all potentially breaking changes as opt-in. This allows administrators a seamless fast upgrade to stay on par with security fixes, with the option to enable new features or migrate configurations at a later time.
\\n\\n\\n\\nWhen you host an application, you often want to scale it beyond the users in your organization. This becomes even more important when you want to offer your application as a software-as-a-service to the employees of other companies or organizations.
\\n\\n\\n\\nThe good news is that all those other employees already have credentials they use every day, issued by their organizations. So how do you use these credentials to authenticate those users when they want to use your application? For many years, Keycloak offered Identity Brokerage, so you could leverage a SAML or OpenID Connect service to authenticate those users.
\\n\\n\\n\\nWith Keycloak 26, we simplified this setup and ensured that it scales even better than before. By introducing organizations as their own entity, you can now associate email domains with identity providers and the users of that organization. Admins can create, disable, and remove organizations and invite users. Users can log in with their email address, authenticate with the Identity Provider of their company, and are then forwarded to the application they want to access. Applications know which organizations a user belongs to, and can adjust which data they provide access to.
\\n\\n\\n\\nThis simplifies business-to-business (B2B) and business-to-business-to-customer (B2B2C) setups with Keycloak, and enables Customer Identity and Access Management (CIAM) and multi-tenancy. Onboard new organizations in minutes, independent of the number of users in that organization.
\\n\\n\\n\\nKeycloak offers password-based and password-less authentications. In addition to the classic username and password, it offers second factor authentication using time-based tokens. For password-less authentication, it offers Kerberos, X.509 certificates (for example smart cards), and WebAuthn.
\\n\\n\\n\\nIn Keycloak 26, we again updated the Passkeys features of Keycloak. Passkeys are a modern standard for passwordless authentication built on WebAuthn technologies. Keycloak continues to implement features of this evolving standard and includes them as a preview feature. A recent addition is the conditional flow, which improves the user experience as the browser prompts the user for the right credential to authenticate with.
\\n\\n\\n\\nWhile OpenID Connect is a standard, it continues to evolve to cover more requirements of different industries, ranging from ecommerce to banking.
\\n\\n\\n\\nThis release of Keycloak includes improvements in the following areas:
\\n\\n\\n\\nA big thank-you to everyone who contributed to this, especially in the Keycloak OAuth Special Interest Group! Join their channel #keycloak-oauth-sig on the CNCF slack to hear the latest news and contribute to their efforts.
\\n\\n\\n\\nWork for the next feature release, which is scheduled for January 2025, is already underway. Some enhancements like simplified node discovery for cloud and non-cloud environments are already available in our nightly release. Try them out in a development environment and provide us with feedback on our mailing list! The work on our 2025 road map is under way, and we hope to publish it soon.
\\n\\nProject post from the KubeVirt Community
\\n\\n\\n\\nThe KubeVirt Community is proud to announce the release of v1.4. This release aligns with Kubernetes v1.31 and is the sixth KubeVirt release to follow the Kubernetes release cadence.
\\n\\n\\n\\nWhat’s 1/3 of one thousand? Because that’s how many people have contributed in some way to this release, with 90 of those 333 people contributing commits to our repos.
\\n\\n\\n\\nYou can read the full release notes in our user-guide, but we have included some highlights in this blog.
\\n\\n\\n\\nFor those of you at KubeCon this week, be sure to check out our maintainer talk where our project maintainers will be going into these and other recent enhancements in KubeVirt.
\\n\\n\\n\\nThis release marks the graduation of a number of features to GA; their feature gates are now deprecated and the features are enabled by default:
\\n\\n\\n\\nThis version of KubeVirt includes upgraded virtualization technology based on libvirt 10.5.0 and QEMU 9.0.0. Other KubeVirt-specific features of this release include the following:
\\n\\n\\n\\nIn the interest of security, we have restricted virt-handler’s ability to patch nodes and removed unneeded cluster privileges. You can also now live-update tolerations on a running VM.
\\n\\n\\n\\nOur KubeVirt command line tool, virtctl, also received some love and improved functionality for VM creation, image upload, and source inference.
\\n\\n\\n\\nThe network binding plugins have matured to Beta, and we have a new domain attachment type, managedTap, plus the ability to reserve memory overhead for binding plugins. Network binding plugins enable vendors to provide their own VM-to-network plumbing alongside KubeVirt. We also added support for the igb network interface model.
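\\n\\n\\n\\nAs a rough illustration (the VM name, sizing, and the use of a masquerade binding are placeholders, and disks/volumes are omitted for brevity), selecting the new interface model on a VirtualMachine looks something like this:
# Illustrative KubeVirt VirtualMachine fragment: request the igb network
# interface model on the default pod network.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-example                # placeholder name
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              masquerade: {}      # placeholder binding method
              model: igb          # new interface model in this release
        resources:
          requests:
            memory: 1Gi
      networks:
        - name: default
          pod: {}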
If you’ve ever wanted to migrate your virtual machine volume from one storage type to another then you’ll be interested in our volume migration feature.
\\n\\n\\n\\nOur SIG scale and performance team has added performance benchmarks for resource utilization of the virt-controller and virt-api components. Furthermore, the test suite was enhanced by integrating KWOK with SIG-scale tests to simulate nodes and VMIs, allowing us to test KubeVirt performance while using minimal resources in the test infrastructure. A comprehensive list of performance and scale benchmarks for the release is available here.
\\n\\n\\n\\nA lot of work from a huge number of people goes into these releases. Some contributions are small, such as raising a bug or attending our community meeting, and others are massive, like working on a feature or reviewing PRs. Whatever your part: we thank you.
\\n\\n\\n\\nAnd if you’re interested in contributing to the project and being a part of the next release, please check out our contributing guide and our community membership guidelines.
\\n\\nWe’re thrilled to share the details of the inaugural contest in our Cloud Native Heroes Challenge program, a series of crowdsourced “prior art” contests in which cloud native developers can earn swag and prizes by helping us defeat patent trolls. See our earlier announcement about the program.
\\n\\n\\n\\nSwag and Prizes
\\n\\n\\n\\nAll entrants who submit an entry that conforms to the contest rules will receive a free “Cloud Native Hero” t-shirt that can be picked up at any future KubeCon+CloudNativeCon. The entrant who submits the winning entry will also receive a $3,000 cash prize.
\\n\\n\\n\\nThe Inaugural Contest: The Challenged Patent
\\n\\n\\n\\nIn our inaugural contest, we are seeking information that can be used to invalidate Claim 1 from US Patent US-11695823-B1, which has been asserted by Edge Networking Systems LLC, a patent troll, against adopters of Kubernetes.
\\n\\n\\n\\nSee the relevant language of the patent below:
\\n\\n\\n\\n1. A system comprising: a programmable network device adapted to host a plurality of network device applications; a programmable cloud device adapted to host a plurality of cloud applications, wherein the plurality of network device applications and the plurality of cloud applications are in secure communication with each other to form distributed applications; and wherein the plurality of network device applications and plurality of cloud applications device form unified capabilities enabling a plurality of upper layer application programming interfaces (APIs) to program the plurality of network device applications and plurality of cloud applications independent of network device hardware and cloud device hardware.
If you are aware of any publicly available materials (other than materials already listed in the “known references” tab of the contest information page) demonstrating that know-how regarding the invention described above already existed prior to June 13, 2013 (the priority date of the patent), please submit that evidence as “prior art” in this contest.
\\n\\n\\n\\n“Prior art” is a legal term that refers to technical know-how that predated the patent application. Prior art can be used to invalidate or weaken a troll’s patent by demonstrating that the patented invention already existed and wasn’t “new” when the application for a patent was filed.
\\n\\n\\n\\nExamples of materials that can be provided as prior art include:
\\n\\n\\n\\nHowever, please note that materials already listed in the “known references” tab of the contest information page do not qualify and cannot be submitted in this contest.
\\n\\n\\n\\nInstructions for Participating
\\n\\n\\n\\nPlease see our Participation Instructions and the contest program page for more information and step-by-step instructions for entering the contest.
\\n\\n\\n\\nAdditional information about this inaugural contest can be found at the Contest Listing for this contest on Unified Patents’ contest portal.
\\n\\n\\n\\nQuestions? Need Help?
\\n\\n\\n\\nIf you have questions or would like to request a 1:1 help session where a member of our contest team walks you through the process of preparing and submitting your contribution of prior art, please message us at the CNCF slack channel #heroes-challenge.
\\n\\n\\n\\nAbout Our Co-Host for the Cloud Native Heroes Challenge
\\n\\n\\n\\nCNCF is co-hosting this program with Unified Patents, the Linux Foundation’s partner in patent troll deterrence since 2019. Unified Patents is the only organization that uses offensive community-driven strategies to deter patent trolls.
\\n\\n\\n\\nOther relevant posts
\\n\\n\\n\\nAnnouncing the Cloud Native Heroes Challenge
\\n\\nProject post from the LitmusChaos Community
As enterprises continue to scale their systems, resilience and stability remain crucial. Testing these under real-world failure scenarios without impacting production environments is essential.
Over recent months, LitmusChaos has gained remarkable traction as an open-source chaos engineering tool, particularly in lower environments like staging, development, and pre-production. By allowing teams to validate resilience at an earlier stage, LitmusChaos helps organizations catch issues before they reach production, building confidence and reducing the risk of costly downtimes. With recent adoption by prominent companies such as Infor, Wingie Enuygun Company, and Emirates NBD, LitmusChaos is increasingly recognized as a critical tool for resilience in dynamic environments.
The recent wave of LitmusChaos adopters highlights how enterprises are building resilience early in the lifecycle, running controlled chaos experiments across lower environments:
\\n\\n\\n\\nFor Emirates NBD, LitmusChaos began as a proof of concept in a playground cluster environment and has since become an integral part of their resilience testing. Initially deploying Litmus in a non-production cluster, their Site Reliability Engineering (SRE) team focused on lower environments to simulate failure scenarios. By running automated chaos tests, Emirates NBD ensures infrastructure robustness and reliability well before production. The transition from manual to automated testing in lower environments allows the SRE team to proactively manage risks, making LitmusChaos a cornerstone of their resilience strategy.
\\n\\n\\n\\nAt Infor, LitmusChaos serves as the foundation for resilience practices within their product resilience team. Primarily focused on development and pre-production environments, Infor adopted LitmusChaos to simulate failures on Kubernetes-based workloads. Through chaos engineering workshops, Infor’s teams validate resilience hypotheses well before production. By simulating real-world failures early on, Infor has created a resilience-focused culture that improves product quality and paves the way for future integrations of chaos testing within CI/CD pipelines.
\\n\\n\\n\\nWingie Enuygun Company, a leader in travel technology, utilizes LitmusChaos to enhance resilience during QA cycles in pre-production. By integrating chaos experiments into their QA process, they can detect bottlenecks and potential issues before they reach production, allowing them to proactively address vulnerabilities. This approach enables Wingie Enuygun to maintain system resilience across multiple platforms, giving them confidence in production stability and ensuring an optimized travel experience for their customers.
Through these recent adoptions, it’s clear that LitmusChaos is becoming essential for enterprises aiming to foster resilience across their infrastructure. By empowering teams to test resilience in controlled, lower environments, LitmusChaos not only helps organizations address vulnerabilities early but also fosters a proactive culture that safeguards production stability.
With the latest LitmusChaos 3.12.x release now available, it’s easier than ever to start chaos engineering on your own workloads. Follow the getting started guide to explore LitmusChaos and begin implementing game-changing resilience practices.
\\n\\n\\n\\nReady to adopt LitmusChaos? Please comment on this issue with details about how you are using it. Engage with our developers on the #litmus channel on CNCF Slack or join the discussion on GitHub. Explore the power of chaos engineering and make your systems resilient today.
\\n\\nToday at KubeCon+CloudNativeCon North America 2024, CNCF announced the Cloud Native Heroes Challenge, a patent troll bounty program in which cloud native developers and technologists can earn swag and win prizes by helping protect our ecosystem from patent trolls.
\\n\\n\\n\\nPatent trolls are increasingly targeting cloud native open source due to its success and broad ubiquity. Kubernetes is the most often targeted project, but other cloud native open source projects have also caught the attention of trolls. Patent trolls are companies that don’t sell any products and services; their only business activity is buying patents and threatening adopters of technology with patent lawsuits (learn more about trolls).
\\n\\n\\n\\nMembers of our community can help us disarm patent trolls by providing evidence that the invention described in the troll’s patent wasn’t actually “new” at the time the patent application was filed. If the invention wasn’t “new” on the application date, then the resulting patent is not valid. Evidence of invalidity can neutralize or weaken a troll’s ability to weaponize its patents against our community.
\\n\\n\\n\\nEvidence of such pre-existing technology – referred to by patent lawyers as “prior art” – could be in the form of open source documentation (including release notes), published standards or specifications, product manuals, articles, blogs, books, or any publicly available information.
\\n\\n\\n\\nMembers of our technical community – as subject matter experts in cloud native tech – are uniquely qualified to help us find prior art to defeat trolls.
\\n\\n\\n\\nThe Cloud Native Heroes Challenge will be a series of crowdsourced prior art contests, starting with a single Inaugural Contest that launched today. To learn more, visit www.cncf.io/heroes and read about the inaugural contest here.
\\n\\n\\n\\nCNCF is co-hosting this program with Unified Patents, the Linux Foundation’s partner in patent troll deterrence since 2019. Unified Patents is the only organization that uses offensive community-driven strategies to deter patent trolls.
\\n\\n\\n\\nThousands of KubeCon + CloudNativeCon North America attendees braved cold rain – and even snow – to attend 16 co-located events in the Salt Palace Convention Center in Salt Lake City. With talks aimed at every level from beginner to expert covering a broad swath of cloud native technologies including AI, observability, eBPF, WebAssembly and Istio, the always popular co-located events were filled with enthusiastic participants.
\\n\\n\\n\\nHere’s a slice of how the day unfolded. Want to follow KubeCon + CloudNativeCon North America 2024 in real time? We’re on Instagram, X (for KubeCon), and X (for CNCF).
\\n\\n\\n\\nPlatform Engineering Day
\\n\\n\\n\\nThe advent of internal developer platforms promises to revolutionize every step of the development process, and the excitement around IDPs was the focus of this year’s Platform Engineering Day co-located presentations. In Portals and platforms, two Ps in a pod? How great interfaces make for good operability, Abby Bangser, principal engineer, Syntasso, and Jorge Lainfiesta, reliability advocate, Rootly, talked about the universal challenges of creating development templates that can grow, scale and adapt to future needs. They also discussed the equally tough hurdle of getting developers to actually use those templates once they were created.
\\n\\n\\n\\nTo ensure long-term success, build templates with “operability” in mind, but also keep other “ilities” on the list, including maintainability, usability and extensibility. Maintainability is achieved by learning from the past, usability enables adoption in the present, and extensibility ensures developers can build things in the future they can’t even imagine today.
\\n\\n\\n\\nWondering what your team should be demanding of your platform and tooling? Compare efforts with the CNCF Platform Engineering Model.
\\n\\n\\n\\nCloud Native + Kubernetes AI Day
\\n\\n\\n\\nArtificial intelligence can challenge everything we’ve come to believe about cloud native, and Autumn Moulder, director of infrastructure & security at Cohere, tackled that head on in her presentation, From Supercomputing to serving: a case study delivering cloud native foundation models.
\\n\\n\\n\\nMoulder offered her company’s story – taking an ML infrastructure from one cloud to multi-cloud and many clusters in less than six months – as an object lesson in how to take cloud native learnings and apply them in the AI world.
\\n\\n\\n\\nFive challenges made the company stronger, she explained. Cohere had to overcome GPU capacity constraints, high GPU failure rates, the reality of tightly coupled software and hardware, complex multi-team deployments, and managing allocation and utilization. Her team had to get creative and was particularly successful using Kueue.
\\n\\n\\n\\nCloud Native Startup Fest
\\n\\n\\n\\nTo kick off an afternoon targeted at entrepreneurs, Kelsey Hightower, retired distinguished engineer, and Megan Reynolds, VC at Vertex Ventures, hosted a fireside chat, Startup resilience in a post-ZIRP (zero interest rate policy) world. To get things rolling, Hightower suggested founders ask some hard questions, beginning with something super basic:
\\n\\n\\n\\nDoes the product work? To answer this, have a customer try it out without any input at all; if they can’t get it up and running, it’s clear the product doesn’t work. Also, it’s important to be careful with the term customer, Hightower said. The word needs to be reserved for those who’ve paid for the product, he emphasized. Also, he said he’s a big believer in early stage companies taking field trips to potential customers and really immersing themselves in the business. Ask those potential customers what you should build, and how much they’d pay for it. And, of course, look to open source because it’s got a proven track record as a great place to start new businesses.
\\n\\n\\n\\nIstio Day
\\n\\n\\n\\nFor those wondering what it would be like to finally embrace service mesh, Confluent shared their experience during Confluent’s service mesh journey – building security and reliability one sidecar at a time, with Adam Sayah, Solo.io and Cody Ray, Confluent.
\\n\\n\\n\\nThe driving force behind Confluent’s decision to move to service mesh was a combination of factors, starting with FedRAMP compliance, as well as an internal push for zero trust, Ray explained. The company had a lot of moving parts to deal with including tens of thousands of Kubernetes clusters, hundreds of services and almost 5,000 connections that needed continuous security monitoring.
\\n\\n\\n\\nWith such vast scale – and a need for increased productivity and security – Ray felt the time was right to make the case for service mesh, and his ROI calculations showed an expected savings of $3.5 million+ per year, not to mention centralized observability, improved compliance and faster delivery.
\\n\\n\\n\\nWatch the highlight reel from Day 0 of KubeCon + CloudNativeCon North America 2024!
\\n\\n\\n\\nGet the first look at some of the fun, networking, and learning that went on today!
\\n\\n\\n\\n\\n\\nFalco has become a vital tool for security practitioners seeking to safeguard containerized and cloud-native environments. Leveraging the power of eBPF (Extended Berkeley Packet Filter), Falco monitors system calls and audit events, allowing it to detect malicious behavior in hosts and containers, no matter how large the infrastructure. Beyond its core detection capabilities, Falco offers critical integrations with industry-standard frameworks, such as MITRE ATT&CK and PCI DSS, to enforce regulatory compliance and security standards.
\\n\\n\\n\\nAs the CNCF graduated project evolves to meet industry needs, one of the most pressing challenges practitioners face is ensuring that their instance of Falco stays ahead of evolving threats and regulatory requirements. With the complexity of managing and updating Falco rules, handling false positives, and keeping up with emerging threats, how can security teams efficiently manage their threat intelligence with Falco? In this blog post, we will explore various tools and approaches that help manage Falco’s rules and extend its capabilities to meet modern cybersecurity challenges.
\\n\\n\\n\\nFalcosidekick is a lightweight solution that extends Falco’s capabilities by forwarding Falco event data to various third-party services. As organizations adopt more complex environments, they often need to centralize threat detection and responses across multiple systems. This is where Falcosidekick excels: it acts as a bridge between Falco and the extensive third-party ecosystem, including logging services, chat platforms, alerting tools, and observability systems.
\\n\\n\\n\\nBy connecting Falco to these services, Falcosidekick enables:
\\n\\n\\n\\nFor example, if Falco detects an unexpected process running in a container, the enriched alert can be sent to a chat platform like Slack, giving security teams the necessary context to assess the situation. This fan-out model ensures that you don’t miss critical alerts, and the ability to forward events to various platforms helps optimize your threat response strategy.
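\\n\\n\\n\\nAs a rough illustration of the Slack scenario above (a sketch, not taken from the original post; the webhook URL is a placeholder and the key names should be double-checked against your Falcosidekick version), a minimal Falcosidekick config.yaml for forwarding alerts might look like this:
\\n\\n\\n\\n# Falcosidekick config.yaml (sketch): forward Falco events to Slack
slack:
  webhookurl: https://hooks.slack.com/services/XXXX/YYYY/ZZZZ  # placeholder webhook URL
  minimumpriority: warning  # only forward events at or above this priority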
\\n\\n\\n\\nAn alternative approach would be to use Falco Talon within a zero-trust model by automating real-time responses to unexpected system behavior based on predefined Falco rules. Instead of relying solely on known attack signatures, this approach focuses on detecting deviations from expected behavior within the system.
\\n\\n\\n\\nIn this approach, you can define Falco rules for the normal behavior of operating systems, containers, and processes. If anything outside of the expected behavior occurs, a Falco Talon action can be triggered to mitigate the threat immediately, such as:
\\n\\n\\n\\nThis approach is ideal for enforcing regulatory compliance in environments requiring strict control, such as PCI DSS or SOC 2. The benefit of this approach is that it requires less frequent updates to Falco rules, as it doesn’t rely heavily on threat feeds or lists of known behaviors. However, implementing this method requires a deep understanding of your system’s normal behavior to avoid unintended disruptions caused by over-aggressive rules.
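\\n\\n\\n\\nTo make the behavioral-baseline idea concrete, here is a simplified, illustrative Falco rule (not part of the default ruleset; the allow-listed process names are placeholders for your own workload) that flags any process spawned in a container outside the expected set, which a Falco Talon action could then react to:
\\n\\n\\n\\n- rule: Unexpected Process In Web Container
  desc: Detect processes other than the expected binaries inside application containers
  condition: spawned_process and container and not proc.name in (nginx, node, sh)
  output: Unexpected process started in container (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING
  tags: [container, behavioral_baseline]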
\\n\\n\\n\\nFalcoctl is the command-line interface that simplifies the lifecycle management of Falco’s rules and plugins. Managing and updating Falco’s rules effectively is critical to staying ahead of evolving threats, and falcoctl allows practitioners to:
\\n\\n\\n\\nFalcoctl supports OCI-compliant registries for managing rules and plugins, allowing users to pull artifacts from custom or official sources. This is particularly useful for maintaining custom rules or curated threat feeds in larger organizations. For example, you can maintain a list of known malicious binaries within your Falco instance. If Falco Talon incorrectly blocks a legitimate process due to a false positive, falcoctl enables you to quickly revert to a previous version of the rule without disrupting your operations.
\\n\\n\\n\\nThis flexibility makes falcoctl an essential tool for managing both threat intelligence and compliance requirements across dynamic, cloud-native environments.
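\\n\\n\\n\\nAs a quick sketch of what rule lifecycle management with falcoctl can look like (the artifact name and version tag below are examples; adjust them to your own registry or the official index):
\\n\\n\\n\\n# search the configured indexes for available rule artifacts
falcoctl artifact search falco-rules

# install a specific version of a ruleset (version tag is an example)
falcoctl artifact install falco-rules:3

# continuously follow an artifact so rule updates are pulled automatically
falcoctl artifact follow falco-rules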
\\n\\n\\n\\nManaging Falco’s rule lifecycle is essential for adapting to constantly changing security threats, compliance standards, and infrastructure needs. Yet, many practitioners face challenges in efficiently handling custom rules, updating threat feeds, and reducing noise in their production environments. To address these pain points, managed Falco Feeds can provide a solution.
\\n\\n\\n\\nFalco feeds allow users to maintain up-to-date rules for threat detection, compliance enforcement, and more. Instead of manually updating lists of IP addresses, complex macros, or regulatory compliance rules, managed Falco Feeds simplify this process by delivering continuous, curated rule updates. Managed feeds ensure that your Falco instance stays aligned with:
\\n\\n\\n\\nAs Falco matures post-CNCF graduation, we are eager to understand how the community manages its Falco instances, especially as threat landscapes and compliance requirements evolve. We’ve launched the 2024 Falco Survey to gain insights into how Falco is being used in production environments, what tools (like Falcosidekick, Falco Talon, or falcoctl) are being adopted, and how custom rules are managed.
\\n\\n\\n\\nNow is the time to get involved—contribute to the Falco community, share your insights in the 2024 Falco survey, and help shape the future of this critical CNCF project.
\\n\\n\\n\\nSome of the questions we aim to answer:
\\n\\n\\n\\nFalco is a powerful tool for securing cloud-native environments, but managing its rules and threat intelligence is an ongoing challenge for many organizations. Whether you’re integrating with third-party services via Falcosidekick, automating real-time response with Falco Talon, or streamlining lifecycle management with falcoctl, there are solutions available to meet the evolving needs of security practitioners.
\\n\\n\\n\\nAs we continue to build on the capabilities of Falco, fully managed Falco Feeds by Sysdig offers a way to stay ahead of emerging threats and complex compliance requirements without the friction of manual updates or configuration changes. Falco Feeds equips users to harness the power and flexibility of open source Falco and expert-written detection rules fueled by the Sysdig Threat Research Team for real-time threat detection at enterprise scale.
\\n\\nThe CNCF Technical Oversight Committee (TOC) has voted to accept wasmCloud as a CNCF incubating project.
\\n\\n\\n\\nwasmCloud, an open source project from the Cloud Native Computing Foundation (CNCF), enables teams to build and run polyglot applications made up of reusable WebAssembly (Wasm) components. This allows applications to operate resiliently and efficiently across diverse environments—in the cloud, on Kubernetes, in data centers, or at the edge.
\\n\\n\\n\\nBy using Wasm as the application artifact, wasmCloud decouples applications from underlying infrastructure, freeing developers to focus on feature development. It provides the tools to run Wasm components securely and efficiently and—instead of forcing thousands of developers to maintain the same libraries and capabilities in their applications—create a single set of reusable core applications.
\\n\\n\\n\\n“wasmCloud is a platform for platform engineers. It orchestrates componentized applications in a way that complements Kubernetes,” said Liam Randall, wasmCloud co-founder and Cosmonic CEO. “While Kubernetes abstracts and manages infrastructure, wasmCloud functions as a distributed application control plane, managing applications at scale. It integrates with Kubernetes, allowing organizations to extend their Kubernetes deployments to remote edges which maximizes the value of existing investments. We are humbled by the end user adoption and excited to see wasmCloud make this big move to the Incubator.”
\\n\\n\\n\\nThe project was created by Liam Randall and Kevin Hoffman during their time at a top 10 US bank. The project is currently led by Cosmonic CTO and Bytecode Alliance Technical Steering Committee member Bailey Hayes. wasmCloud was designed to solve the friction that application teams face in every enterprise when writing software and it has grown in popularity since being accepted into the CNCF Sandbox; it is now being deployed and maintained by engineers working in a host of organizations including Adobe, Orange, MachineMetrics, TM Forum member CSPs, and Akamai.
\\n\\n\\n\\nSince joining the CNCF Sandbox, wasmCloud has matured and grown in popularity:
\\n\\n\\n\\nAdoption is growing amongst engineering teams working in a variety of sectors, attracted by the possibility of simplifying the way they build, run, and maintain applications at scale.
\\n\\n\\n\\n“wasmCloud is the most ambitious project around. It is attempting to revolutionize how software is developed, architected, and run, all while staying at the forefront of wider WebAssembly and WASI standards. I’m so proud of the team that has done so much over the past 4 years to get it to this point. The smartest and kindest group of people you’d ever want to work with.” – Colin Murphy, wasmCloud maintainer and Senior Software Engineer, Adobe
\\n\\n\\n\\n“CNCF incubation is a confirmation of the strength of the wasmCloud project and community. It has taken a lot of work to gain the right reputation within the wider cloud native landscape, and it’s paying off in project maturity, integrations and success stories. I’m incredibly grateful to our industry partners who pioneered real-world Wasm use cases. Incubation is a real indicator that wasmCloud is ready to integrate into any cloud native stack. I truly believe that a successful community makes a successful project, and I’m so proud of where we are today.” – Brooks Townsend, wasmCloud maintainer and Senior Engineer at Cosmonic
\\n\\n\\n\\n“As a longtime maintainer and contributor to CNCF projects, I am thrilled that wasmCloud has made it to incubating status. This is the culmination of 5 years of work from an ever-growing community that represents so many different parts of the technology landscape. The contributions that wasmCloud has made to the wider ecosystem and the adoption we see across software platforms, banking, IoT, and more is something that makes me extremely proud.”– Taylor Thomas, wasmCloud maintainer and engineering director, Cosmonic
\\n\\n\\n\\n“wasmCloud helps us build complex systems with a new perspective; it gives us a way to distribute workloads, compute, and feature requests in a way that just wasn’t possible before. A team of any size can start to see benefits very early in the development cycle.” – Luke Jones, Lattica co-founder and developer
\\n\\n\\n\\n“I am so excited to see wasmCloud enter incubating and sit alongside other major incubating and graduated projects like Kubernetes and Knative in the scheduling and orchestration section of the CNCF landscape. Components offer a fundamentally finer-grained abstraction than containers, like Kubernetes for WebAssembly, so wasmCloud provides a Wasm-native orchestrator to best take advantage of the unique properties that WebAssembly components can provide. Wasm-native works with cloud-native and runs seamlessly on Kubernetes or any other container execution engine like AWS Fargate, Microsoft AKS, or Google Cloud Run.” – Bailey Hayes, TSC director for Bytecode Alliance foundation, W3C WASI SG chair, CTO at Cosmonic
\\n\\n\\n\\n“wasmCloud’s acceptance into the CNCF incubator is a major milestone, marking the beginning of a new phase—one of collaboration, innovation, and the spread of WebAssembly across industries. It’s also a testament to the dedication of the entire community in pioneering the cloud-native Wasm space, and I’m proud to contribute towards expanding its ecosystem.”– Aditya Sal, wasmCloud contributor
\\n\\n\\n\\nMain Components:
\\n\\n\\n\\nNotable Milestones:
\\n\\n\\n\\nwasmCloud has several features and functionalities on the roadmap for Q4 and beyond, including multi-tenancy, standards alignment, and deeper language support. Previous roadmaps can be viewed in the wasmCloud roadmap section in wasmCloud documentation.
\\n\\n\\n\\nAs a CNCF-hosted project, wasmCloud is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. wasmCloud joins incubating technologies Artifact Hub, Backstage, Buildpacks, cert-manager, Chaos Mesh, CloudEvents, Container Network Interface (CNI), Contour, Cortex, CubeFS, Dragonfly, Emissary-Ingress, Falco, Flatcar, gRPC, in-toto, Keptn, Keycloak, Knative, Kubeflow, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenCost, OpenFeature, OpenKruise, OpenMetrics, OpenTelemetry, Operator Framework, Thanos, and Volcano. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.
\\n\\nProject post by the Jaeger maintainers
\\n\\n\\n\\nJaeger, the popular open-source distributed tracing platform, has had a successful 9-year history as one of the first graduated projects in the Cloud Native Computing Foundation (CNCF). After over 60 releases, Jaeger is celebrating a major milestone with the upcoming release of Jaeger v2. This new version introduces a new architecture for Jaeger components that uses the OpenTelemetry Collector framework as its base and extends it to implement Jaeger’s unique features. It brings significant improvements and changes, making Jaeger more flexible, extensible, and better aligned with the OpenTelemetry project.
\\n\\n\\n\\nIn this blog post, we’ll dive into the details of Jaeger v2, exploring its design, features, and benefits, and sharing what users can expect from this exciting new release and what is next for the project.
\\n\\n\\n\\nOpenTelemetry is the de-facto standard for application instrumentation providing the foundation for observability. Jaeger is now based on the cornerstone of this project, the OpenTelemetry Collector. Jaeger is a complete tracing platform that includes storage and the UI. OpenTelemetry Collector is usually an intermediate component in the collection pipelines that are used to receive, process, transform, and export different telemetry types. The two systems have some overlaps, for example, `jaeger-agent` and `jaeger-collector` play a role similar to what can be done with OpenTelemetry Collector, but only for traces.
\\n\\n\\n\\nHistorically, both Jaeger and OpenTelemetry Collector reused each other’s code. Collector supports receivers for legacy Jaeger formats implemented by importing Jaeger packages. And Jaeger reuses Collector’s receivers and data model converters. Because of this synergy, it’s been our goal for a while to bring the two projects closer.
\\n\\n\\n\\nOpenTelemetry Collector has a very flexible and extensible design, which makes it easy to extend with additional components needed for Jaeger use cases.
\\n\\n\\n\\nBy aligning the Jaeger v2 architecture with OpenTelemetry Collector, we deliver several exciting features and benefits for users, including:
\\n\\n\\n\\nThe result is a leaner code base and, more importantly, the ability to future-proof Jaeger as OpenTelemetry evolves, ensuring Jaeger remains a first-choice tracing system for open source users. Compatibility between OpenTelemetry and Jaeger is supported from day one thanks to the tight integration between the projects. This will continue the collaboration between both projects and help more users adopt open source technologies more quickly.
\\n\\n\\n\\nOverall, Jaeger v2 architecture is very similar to a standard OpenTelemetry Collector that has pipelines for receiving and processing telemetry (a pipeline encapsulates receivers, processors, and exporters), and extensions that perform functions not directly related to processing of telemetry. Jaeger v2 makes a few design decisions in how to use the Collector framework.
\\n\\n\\n\\nSince Collector is primarily designed for data ingestion, querying for traces and presenting them in the UI is an example of functionality that is not in its remit. Jaeger v2 implements it as a Collector extension.
\\n\\n\\n\\nJaeger v1 provided multiple binaries for different purposes (agent, collector, ingester, query). Those binaries were hardwired to perform different functions and exposed different configuration options passed via command line. We realized that all that complexity was unnecessary in the v2 architecture as we achieve the same simply by enabling different components in the configuration file. We also did some benchmarking of executable size and noticed that if we bundle all possible Jaeger v2 components in a single binary, including ~3Mb (compressed) of UI assets, we end up with a container image of ~40Mb in size, versus ~30Mb in v1. As a result, Jaeger v2 ships just a single binary `jaeger`, and it will be configurable for different deployment roles via YAML configuration file, the same as the OpenTelemetry Collector.
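\\n\\n\\n\\nTo give a feel for this model, below is a rough, illustrative sketch of a Jaeger v2 configuration in the OpenTelemetry Collector format, wiring an OTLP receiver through a batch processor into an in-memory storage backend and enabling the query UI; the exact extension, exporter, and backend option names are assumptions here and should be checked against the Jaeger v2 documentation:
\\n\\n\\n\\nservice:
  extensions: [jaeger_storage, jaeger_query]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
extensions:
  jaeger_storage:
    backends:
      primary_store:
        memory:
          max_traces: 100000
  jaeger_query:
    storage:
      traces: primary_store
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  jaeger_storage_exporter:
    trace_storage: primary_store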
\\n\\n\\n\\nBoth Jaeger and OpenTelemetry Collector process telemetry data, but they differ in how they handle it.
\\n\\n\\n\\nJaeger v2 implements a storage extension to abstract the storage layer. This allows both the query component (read path) and a generic exporter (write path) to interact with various storage backends without dedicated implementations for each. This approach provides flexibility and maintains compatibility with Jaeger v1’s multi-storage capabilities.
\\n\\n\\n\\n▶️ Try Jaeger v2 today and check out the Getting Started documentation for more options:
\\n\\n\\n\\ndocker run --rm --name jaeger \\\\
\\n\\n\\n\\n-p 16686:16686 \\\\
\\n\\n\\n\\n-p 4317:4317 \\\\
\\n\\n\\n\\n-p 4318:4318 \\\\
\\n\\n\\n\\njaegertracing/jaeger:2.0.0
\\n\\n\\n\\nVersion 2.0.0 already supports all the core features of Jaeger, but development will continue in 2025 to add remaining feature parity, improve performance, and enhance the overall user experience.
\\n\\n\\n\\nThe roadmap for Jaeger v2 includes the following milestones:
\\n\\n\\n\\nThe Jaeger v2 roadmap was designed to minimize the amount of changes we needed to make to the project by avoiding a big-bang approach in favor of incremental improvements to the existing code base. Yet it was still a significant amount of new development, which is often difficult to sustain for a volunteer-driven project. We were able to attract new contributors and drive the Jaeger v2 roadmap by participating in the mentorship programs run by Linux Foundation, CNCF, and Google, such as LFX Mentorship and Google Summer of Code. This has been a rewarding and mutually beneficial engagement for both the project and the participating interns. Stay tuned for our next mentorships to build out the roadmap items.
\\n\\n\\n\\nJaeger v2 represents a significant step forward for the Jaeger project, bringing improved flexibility, extensibility, and alignment with the OpenTelemetry project. With its native OTLP ingestion, simplified deployment model, and access to OpenTelemetry Collector features, Jaeger v2 promises to provide a more efficient and scalable distributed tracing solution.
\\n\\n\\n\\nAs the development of Jaeger v2 continues, we can expect to see a more robust and feature-rich system emerge. Stay tuned for updates and get ready to experience the next generation of distributed tracing with Jaeger v2!
\\n\\nWe’re excited to share the Cilium project journey report!
\\n\\n\\n\\nCilium is an open source platform designed for cloud-native networking, security, and observability, leveraging eBPF technology. It provides secure, high-performance network connectivity and deep visibility for Kubernetes and other cloud environments. Cilium allows for complex and detailed network policies, load balancing, service mesh integration, and comprehensive network observability. It is widely used in production environments by companies like Adobe, Capital One, and Google, offering robust support for large-scale deployments.
\\n\\n\\n\\nSince joining as an Incubating project in October 2021, Cilium has grown tremendously, graduating in under a year in 2023.
\\n\\n\\n\\nSome of the highlights of the report include:
\\n\\n\\n\\nHave a look at the full Cilium Project Journey Report to learn more about these accomplishments in more detail.
\\n\\nMember post originally published on the Redpill Linpro blog by Amelie Löwe
\\n\\n\\n\\nIn this blog post, we’ll explore how to get involved in CNCF (Cloud Native Computing Foundation) open source projects, what knowledge you need to contribute, and how to become part of this community.
\\n\\n\\n\\nHere are a few perks of contributing to open source, aside from making a positive global impact and ensuring software remains freely accessible for anyone:
\\n\\n\\n\\nOf course, there are several important aspects of open source that Non-Code Contributors can help with:
\\n\\n\\n\\nNot quite; there are still some important prerequisites to keep in mind before contributing to open source projects:
\\n\\n\\n\\nHere are the 5 steps of getting started with contributing to CNCF technologies:
Start Small
\\n\\n\\n\\nFor beginners, starting small is crucial. Look for easier issues or small features to gain familiarity with the project’s codebase. As you gain experience, then you can tackle the more challenging tasks. Remember, every contribution counts!
\\n\\n\\n\\nGlossary
\\n\\n\\n\\nIf you stumble upon words or acronyms that seem outlandish to you, here is a great glossary that covers frequently used CNCF-specific terms.
\\n\\n\\n\\nExplore and join a Mentoring WG (Mentoring Working Group)
\\n\\n\\n\\nThrough the Mentoring WG, individuals can explore various mentorship opportunities. Mentors, who are typically experienced contributors in the CNCF community, offer guidance and support to mentees looking to enhance their skills.
This is a great opportunity if you are new to contributing to Open source and cloud-native technologies.
Mentoring WG also manages the following mentorship initiatives:
\\n\\n\\n\\nEmbarking on the CNCF open source journey opens doors to collaborative innovation. Whether a developer or non-code contributor, the benefits extend beyond the technical realm, fostering skill growth and global learning. Collaboration is key, so seek assistance and contribute at your own pace. Each contribution adds to the collective progress of open source. Embrace the journey, as well as the learning experiences, and become an integral part of this dynamic ecosystem.
\\n\\n\\n\\nHappy contributing!
\\n\\n\\n\\nWritten by Amelie Löwe
\\n\\nMember post originally published on the Devtron blog by Siddhant Khisty
\\n\\n\\n\\nWhile working with Kubernetes, the cluster has many tiny internal components that all work together to deploy and manage your business applications. Kubernetes itself is a distributed system, consisting of tons of different components that work together to provide the experience that we are all familiar with. To learn about all those different Kubernetes components, please check out our blog on Kubernetes Architecture and the different Kubernetes Workloads.
\\n\\n\\n\\nIn a traditional monolithic application, your main concerns would be ensuring that your application code is written securely and that the environment where it is deployed is secure and has the necessary firewalls and security permissions set up. When talking about security in Kubernetes, the story gets a lot more distributed. Since many different components work together to create and operate a cluster, you need to ensure that each individual component is secure. You have to set up proper mTLS between all the different Kubernetes components.
\\n\\n\\n\\nWhen you deploy your business applications, you will once again have to ensure that they meet your security standards. You also need to ensure that only authorized users have access to the Kubernetes clusters and that they have the correct level of access to the cluster. Luckily, Kubernetes has a way to authenticate users to the cluster and restrict their level of access as required.
\\n\\n\\n\\nIn this blog, you will learn how to authenticate users to your Kubernetes clusters and how to restrict their access to the cluster using Role-Based Access Control (RBAC).
\\n\\n\\n\\nWhenever a user has access to any systems, in this case, Kubernetes, you do not want every single user to have super-admin levels of access to the cluster. This can pose a security risk. Imagine handing over super-admin access to the production cluster to someone who does not know all the ins and outs of a Kubernetes cluster. There is a possibility that they may unknowingly perform some action that takes the entire cluster offline.
\\n\\n\\n\\nTo avoid the above scenario, you want only one or two super-admin users that have unrestricted access to the Kubernetes cluster. Every other user like the developers, should have limited access to the cluster for only the resources they require. For example, if the development team is working on developing a new microservice called backend-beta, they should have access to the specific namespace and resources that are related to the backend-beta application to avoid accidentally messing up any other workloads running in the cluster.
\\n\\n\\n\\nAn additional benefit of using RBAC is to limit the impact of accidental credential leaks. In the unlikely event that a developer’s credentials are accidentally leaked, the attacker would have very limited access to the information within the cluster. Another step that can be taken to minimize the impact is to implement time-based access. After a certain amount of time, the access would expire and the token which was given to the developer would be rendered useless.
\\n\\n\\n\\nImplementing Role Based Access Control (RBAC) in your Kubernetes systems has multiple benefits such as:
\\n\\n\\n\\nWhenever you want to allow a user to get access to the Kubernetes cluster, you do not want to give them super admin permissions. Doing so may lead to unintended deletions of resources or a change of configurations. You want to ensure that the user has access to only a few specific resources in the cluster. If you have a developer user, they should only be allowed to access the applications that they are developing in their particular namespace.
\\n\\n\\n\\nOn the other hand, if you had a node admin or a storage admin, they would require access to every single node or every single storage resource across all namespaces in the cluster. What if you have deployed an application to the Kubernetes cluster that requires some permissions to view a particular cluster resource? The application too would need to be assigned the correct level of permissions.
\\n\\n\\n\\nKubernetes by default has several objects that help satisfy all of the above use cases. These objects include Roles, RoleBindings, ClusterRoles, ClusterRoleBindings, and ServiceAccounts.
\\n\\n\\n\\nWe will be looking at all the above Kubernetes objects in detail later in this article. First, let us look at how you can add a user to the cluster.
\\n\\n\\n\\nBefore you can go ahead and assign permissions in the cluster, you will first need to authenticate (authN) the user to the Kubernetes cluster. To add the user to the cluster, you must create a Certificate Signing Request (CSR). The cluster’s certificate authority signs the request, and the resulting client certificate is what the user presents to authenticate with the cluster.
\\n\\n\\n\\nLet’s say that we want to onboard a new developer named John to the cluster. He has provided us with his public key. Let us create a Certificate Signing Request (CSR) to add John to the cluster.
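\\n\\n\\n\\nAs a quick sketch (assuming OpenSSL is available; the group name is a placeholder), John would generate a private key and a certificate signing request whose CN becomes his Kubernetes username, and then base64-encode the request for the manifest:
\\n\\n\\n\\n# generate a private key and a CSR with CN=John (the CN becomes the Kubernetes username)
openssl genrsa -out john.key 2048
openssl req -new -key john.key -out john.csr -subj /CN=John/O=developers

# base64-encode the CSR as a single line for the spec.request field
base64 -w 0 john.csr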
\\n\\n\\n\\nTo create a Certificate Signing Request, you will need to paste the base64-encoded request (generated from John’s key, as sketched above) into the manifest file below under spec.request.
apiVersion: certificates.k8s.io/v1\\nkind: CertificateSigningRequest\\nmetadata:\\n name: john\\nspec:\\n request: <request>\\n signerName: kubernetes.io/kube-apiserver-client\\n expirationSeconds: 86400 # one day\\n usages:\\n - client auth\\n
\\n\\n\\n\\nOnce you’ve created the above manifest file, go ahead and apply it using
\\n\\n\\n\\nkubectl apply -f csr.yaml
\\n\\n\\n\\nUpon applying the manifest, a CSR object will be created. You can check all the CSRs made to the cluster by running
\\n\\n\\n\\nkubectl get csr
\\n\\n\\n\\nYou should be able to see that John’s CSR has been created, and is pending approval. Since we already know that John is our developer user, we want to accept this request. To do so, you can run the below command
\\n\\n\\n\\nkubectl certificate approve john
\\n\\n\\n\\nNow, the user John is authenticated to the cluster and has access to the cluster. The authorization permissions (authZ) still need to be set up.
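\\n\\n\\n\\nAs an illustrative follow-up (the cluster and file names here are placeholders), the signed certificate can be pulled out of the approved CSR object and wired into a kubeconfig entry so John can actually connect as this identity:
\\n\\n\\n\\n# extract the signed client certificate from the approved CSR
kubectl get csr john -o jsonpath={.status.certificate} | base64 -d > john.crt

# register the credentials and a context in the kubeconfig
kubectl config set-credentials john --client-certificate=john.crt --client-key=john.key --embed-certs=true
kubectl config set-context john@my-cluster --cluster=my-cluster --user=john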
\\n\\n\\n\\nIf you want to add users to the cluster with ease, without creating a CSR, you can use SSO to authenticate your users via their GitHub, Google, or Microsoft accounts. Please check out this blog to learn more about SSO.
\\n\\n\\n\\nNow that you have added a user to the cluster, it’s time to set up the appropriate level of permissions for the user. As we discussed before, you can set permissions at two levels: the namespace level (using Roles and RoleBindings) and the cluster level (using ClusterRoles and ClusterRoleBindings).
\\n\\n\\n\\nWe will take a look at how you can create both types of permissions.
\\n\\n\\n\\nAfter a user has been added to the cluster, we can now go ahead and assign certain permissions to them. To do this, Kubernetes provides two objects called a Role and a RoleBinding. As discussed earlier, the Role defines the set of permissions, and the RoleBinding assigns the role to a particular user.
Let’s say that you have a new developer called John who has joined your organization to work on a sensitive project called Project Gamma. In your existing setup, you have a separate namespace called project-gamma
for all the resources of the project. You want to allow John to create deployments and pods only in the project-gamma
namespace.
To enforce these rules, you have to create the proper role and role-binding resources. Let’s take a look at how you can create the two objects and assign the correct permissions to John.
\\n\\n\\n\\nStep 1: Create a role called gamma-developer-role
and assign it the proper set of permissions.
This can be done imperatively using kubectl
or declaratively by using a YAML file. We will be looking at creating the file declaratively.
The YAML file below can be used to create the role with the required permissions.
\\n\\n\\n\\napiVersion: rbac.authorization.k8s.io/v1\\nkind: Role\\nmetadata:\\n namespace: project-gamma\\n name: gamma-developer-role\\nrules:\\n- apiGroups: [\\"\\"]\\n resources: [\\"pods\\"]\\n verbs: [\\"create\\", \\"get\\", \\"list\\", \\"delete\\"]\\n- apiGroups: [\\"apps\\"]\\n resources: [\\"deployments\\"]\\n verbs: [\\"create\\", \\"get\\", \\"list\\"]\\n
\\n\\n\\n\\nLet’s understand some of the important configuration fields that are described in the above manifest
\\n\\n\\n\\nStep 2: Create a RoleBinding called gamma-developer-binding
and configure it to use the correct role and assign the correct user
Once the role has been created, it also needs to be assigned to a user. Similar to the role, the RoleBinding can also be created imperatively with the kubectl create rolebinding
command. We will look at how to make it declaratively using a YAML file.
The below YAML file will create the RoleBinding and assign the permissions to John.
\\n\\n\\n\\napiVersion: rbac.authorization.k8s.io/v1\\nkind: RoleBinding\\nmetadata:\\n name: gamma-developer-binding\\n namespace: project-gamma\\nsubjects:\\n- kind: User\\n name: John\\n apiGroup: rbac.authorization.k8s.io\\nroleRef:\\n kind: Role\\n name: gamma-developer-role\\n apiGroup: rbac.authorization.k8s.io\\n
\\n\\n\\n\\nIn the above manifest, some of the important configuration fields are as follows:
\\n\\n\\n\\nNote:- The Role and RoleBinding are Namespace-scoped resources. They will only work in the namespace where they have been created.
\\n\\n\\n\\nStep 3: Validate the permissions
\\n\\n\\n\\nAfter creating the role and assigning it to the correct user using RoleBinding, it’s time to make sure that the permission works correctly.
\\n\\n\\n\\nYou can use the kubectl auth can-i
command to ensure that the permissions work correctly.
The command will show you if the action that you are trying to run will be successful or not. For example, John should be able to get the pods in the project-gamma
namespace. But he should not be able to get the pods in the default namespace. We can verify this by running the below two commands.
kubectl auth can-i get pods -n project-gamma --as=John\\n\\nkubectl auth can-i get pods --as=John\\n
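\\n\\n\\n\\nFor reference, kubectl auth can-i answers with a plain yes or no; with the Role and RoleBinding above in place, the two checks should return, in order:
\\n\\n\\n\\n# expected output of the two checks, in order
yes
no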
\\n\\n\\n\\nAs you can see from the output of these commands, John can get the pods in the project-gamma namespace, but cannot get the pods in the default namespace. These are exactly the permissions that we wanted to create.
\\n\\n\\n\\nSimilar to Roles and RoleBindings, ClusterRoles and ClusterRoleBindings can grant certain permissions to a user. The key difference is that a ClusterRole and ClusterRoleBinding are applied across the entire cluster, i.e., they work in every namespace in the cluster. Roles and RoleBindings, on the other hand, are limited to the particular namespace they exist in.
\\n\\n\\n\\nLet’s look at how to create and assign a ClusterRole and ClusterRoleBinding with an example. Imagine that you have an application called DeploymentManager which is responsible for managing all the Deployments in the cluster. For this application to perform its functions, it needs to have admin permissions for all the Deployments in the cluster.
\\n\\n\\n\\nLet’s look at the step-by-step process for how we can assign the right level of permissions for this application
\\n\\n\\n\\nStep 1: Create a Service Account
\\n\\n\\n\\nWhen we want to assign permissions to a user, we can directly assign it to them. However, when permission has to be applied to an application, it cannot be directly applied to the pod or deployment of the application. Kubernetes has a resource called a Service Account which is a non-human account with a distinct identity. When an application needs certain permissions within the cluster, it is assigned to them via a service account.
\\n\\n\\n\\nLet us create a ServiceAccount called deploy-manager-sa
. You can create the service account imperatively using the below command.
kubectl create serviceaccount deploy-manager-sa
\\n\\n\\n\\nYou also want to ensure that the application is using this new ServiceAccount. Edit the application’s manifest and update the serviceAccountName
field so that it uses the new service account that has been created.
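\\n\\n\\n\\nFor example (an illustrative snippet with a hypothetical image name), the pod template of the DeploymentManager application would reference the service account like this:
\\n\\n\\n\\napiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deployment-manager
  template:
    metadata:
      labels:
        app: deployment-manager
    spec:
      serviceAccountName: deploy-manager-sa  # use the newly created service account
      containers:
      - name: deployment-manager
        image: example/deployment-manager:latest  # hypothetical image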
Step 2: Creating the ClusterRole
\\n\\n\\n\\nAs the application is responsible for managing every single deployment in the cluster, it will require permission to perform all actions on the deployments.
\\n\\n\\n\\nSimilar to what we discussed with Roles, ClusterRoles can be created imperatively as well as declaratively. We saw how to create Roles in a declarative manner. Let us create the ClusterRole named deploy-manager-cr
with the imperative method. You can use the below command to create the ClusterRole with the correct permissions.
kubectl create clusterrole deploy-manager-cr --verb=\'*\' --resource=deployments\\n
\\n\\n\\n\\nThe above command will allow the Service Account to perform any actions on only the deployment objects. It will not be able to perform any action on any other resource type.
\\n\\n\\n\\nIf you wish to achieve the same result declaratively, you can use the following YAML manifest:
\\n\\n\\n\\napiVersion: rbac.authorization.k8s.io/v1\\nkind: ClusterRole\\nmetadata:\\n name: deploy-manager-cr\\nrules:\\n- apiGroups: [\\"apps\\"]\\n resources: [\\"deployments\\"]\\n verbs: [\\"*\\"]\\n
\\n\\n\\n\\nStep 3: Create the ClusterRoleBinding
\\n\\n\\n\\nFinally, we want to create a ClusterRoleBinding to assign the ClusterRole to the Service Account that we created. Let us call this ClusterRoleBinding deploy-manager-crb.
We will create the ClusterRoleBinding imperatively with kubectl
commands. You can use the below command to create the ClusterRoleBinding such that it will use the permissions defined in the deploy-manager-cr
and assign it to the deploy-manager-sa
service account
kubectl create clusterrolebinding deploy-manager-crb --clusterrole=deploy-manager-cr --serviceaccount=default:deploy-manager-sa
\\n\\n\\n\\nThe above ClusterRoleBinding can also be created declaratively. You can use the following YAML manifest to create the same binding using the declarative method.
\\n\\n\\n\\napiVersion: rbac.authorization.k8s.io/v1\\nkind: ClusterRoleBinding\\nmetadata:\\n name: deploy-manager-crb\\nsubjects:\\n- kind: ServiceAccount\\n name: deploy-manager-sa\\n namespace: default\\nroleRef:\\n kind: ClusterRole\\n name: deploy-manager-cr\\n apiGroup: rbac.authorization.k8s.io\\n
\\n\\n\\n\\nStep 4: Validate the permissions
\\n\\n\\n\\nSimilar to how we validated the permissions for the Roles and ClusterRoles, we also want to validate the permissions for the ClusterRole and ClusterRoleBinding using the kubectl auth can-i
commands.
Let’s see if the ServiceAccount can get the deployments, as well as if it can get individual pods or not. You can use the below commands to try it out
\\n\\n\\n\\nkubectl auth can-i get deployments --as=system:serviceaccount:default:deploy-manager-sa\\n\\nkubectl auth can-i get pods --as=system:serviceaccount:default:deploy-manager-sa\\n
\\n\\n\\n\\nAs you can see, the service account can perform actions on deployments, but cannot perform actions on any other resource such as pods.
\\n\\n\\n\\nWhile we were creating the roles and role bindings in the above examples, you may have noticed that when we were assigning the permissions to a user, we directly assigned the permission to the user. However, when we wanted to give some permissions to an application, we had to do it through a service account.
\\n\\n\\n\\nIn this blog, you’ve seen the entire process of adding a user to the Kubernetes cluster and adding fine-grained permissions for them using Roles, RoleBindings, ClusterRoles, and ClusterRoleBindings. Kubernetes provides a robust and flexible way to define the permissions for different users and applications. However, certain limitations and challenges do arise when creating the permissions with just the Kubernetes resources.
\\n\\n\\n\\nSome of the challenges with implementing RBAC in Kubernetes are as follows:
\\n\\n\\n\\nDevtron is a robust Kubernetes dashboard that makes it much easier to add users to your Kubernetes cluster by using an SSO service. It also helps you fine-grain the level of access you wish to assign to all your users. Devtron’s permissions groups can help you assign the same permissions to different users, such as all Developer users or Operations users. It also provides the option to set a time-based access to the cluster which ensures that the user has access only for a limited period.
\\n\\n\\n\\nAdditional reading on how Devtron helps with Kubernetes RBAC:
\\n\\n\\n\\nKubernetes Security has multiple aspects to it. One of the most important aspects of Kubernetes Security is properly setting up role based access control. Kubernetes RBAC is useful to limit the actions that a user can perform in the Kubernetes cluster. A user is first added to the cluster by creating a CSR request which authenticates(authN) the user with the Kubernetes cluster.
\\n\\n\\n\\nOnce the authentication is done, the user needs to be assigned the proper authorization(authZ) permissions to ensure that they have access only to the resources they require. To set up the authorization permissions in the cluster, Kubernetes provides users with Roles, RoleBindings, ClusterRoles, and ClusterRoleBindings which are useful for setting authorization permissions either at the namespace level or the cluster level.
\\n\\n\\n\\nDevtron is a Kubernetes Dashboard that takes Kubernetes Security one step further by making it easier to create and assign permissions to different users. It also provides permission groups to assign the same set of permissions to a group of users such as the developers. It also helps provide time-based access to particular resources in the cluster and revokes access once the set period has expired.
\\n\\n\\n\\nIf you have any queries, don’t hesitate to connect with us. Join the lively discussions and shared knowledge in our actively growing Discord Community.
\\n\\nCommunity post originally published on Medium by Giorgi Keratishvili
\\n\\n\\n\\nIf you have worked on Kubernetes production systems at any time during the last 10 years and needed to check your pods or application uptime, resource consumption, HTTP error rates, and needed to observe them for a certain period of time, most probably you have been using the Prometheus and Grafana stack. If you want to extend your knowledge of observability and monitoring, then this exam is exactly for you because it does not focus only on Prometheus but also on general concepts such as SLA, SLO, SLI, how to structure alerting, and best practices for observability.
\\n\\n\\n\\nBesides that, Prometheus was one of the first open source projects to join the CNCF after Kubernetes and has since been one of the most preferred tools for monitoring and observability in containerized environments. It also integrates with other open source projects, such as Grafana for visualization and OpenTelemetry for observability, which greatly impact the whole industry.
\\n\\n\\n\\nAs mentioned above, most of the time we see Prometheus in Kubernetes or containerized environments, but it is not limited to containerized scenarios. Overall, if we have any production system, we must have some kind of monitoring tool. Without observability we are flying blind, and when things go wrong (and they will, it’s just a matter of time), you will appreciate being able to see more, and to see it in time.
\\n\\n\\n\\nAs for who would benefit: SysAdmins, Devs, Ops, SREs, Managers, Platform Engineers, or anyone doing anything in production should consider it.
\\n\\n\\n\\nSo, are we fired up like a torch, eager to spot any degradation in your systems and wanting to pass the exam? Then, we have a long path ahead until we reach this point. First, we need to understand what kind of exam it is compared to CKAD, CKA, and CKS. This is the first exam where the CNCF has adopted multiple-choice questions, and compared to other multiple-choice exams, this one, I would say, is not easy-peasy. However, it is still qualified as pre-professional, on par with the KCNA and KCSA.
\\n\\n\\n\\nThis exam is conducted online, proctored similarly to other Kubernetes certifications, and is facilitated by PSI. As someone who has taken more than 15 exams with PSI, I can say that every time it’s a new journey. I HIGHLY ADVISE joining the exam 30 minutes before taking the test because there are pre-checks of ID, and the room in which you are taking it needs to be checked for exam criteria. Please check these two links for the exam rules and PSI portal guide
\\n\\n\\n\\nYou’ll have 90 minutes to answer 60 questions, which is generally considered sufficient time. Be prepared for some questions that can be quite tricky. I marked a couple of them for review and would advise doing the same, because sometimes you can find a hint or partial answer in a later question; that way, you can refer back to the marked ones. Regarding pricing, the exam costs $250, but you can often find it at a discount during Black Friday promotions or near dates for CNCF events like KubeCon, Open Source Summit, etc.
\\n\\n\\n\\nAt this point, we understand what we have signed up for and are ready to dedicate time to training, but where should we start? Before taking this exam, I had a good experience with Kubernetes and its ecosystem and had experience with Prometheus but only for things that I needed. I did not delve deeper, yet I still learned a lot from this exam.
\\n\\n\\n\\nLet’s break down the Domains & Competencies:
\\n\\n\\n\\n**Observability Concepts 18%**\\n\\n\\n\\n
Metrics
Understand logs and events
Tracing and Spans
Push vs Pull
Service Discovery
Basics of SLOs, SLAs, and SLIs
**Prometheus Fundamentals 20%**
System Architecture
Configuration and Scraping
Understanding Prometheus Limitations
Data Model and Labels
Exposition Format
**PromQL 28%**
Selecting Data
Rates and Derivatives
Aggregating over time
Aggregating over dimensions
Binary operators
Histograms
Timestamp Metrics
**Instrumentation and Exporters 16%**
Client Libraries
Instrumentation
Exporters
Structuring and naming metrics
**Alerting & Dashboarding 18%**
Dashboarding basics
Configuring Alerting rules
Understand and Use Alertmanager
Alerting basics (when, what, and why)
At first glance, this list might seem too simple and easy; however, we need to learn the fundamentals of observability first in order to understand higher-level concepts.
\\n\\n\\n\\nObservability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In the context of system engineering and IT operations, observability is crucial for diagnosing issues and ensuring that all parts of the system are functioning as expected.
\\n\\n\\n\\nYou can explore and learn about PCA Certification and related topics freely through the following GitHub repositories which I have used
\\n\\n\\n\\nFor structured and comprehensive PCA exam preparation, consider investing in the paid course from KodeKloud, where they provide a playground. I used it in my preparation and it helped a lot.
\\n\\n\\n\\nThe exam is not easy compared to other certs; I would rank it in this order: KCNA/CGOA/CKAD/PCA/KCSA/CKA/CKS. After taking the exam, you will receive your grade within 24 hours, and passing it feels pretty satisfying. Overall, I hope this was informative and useful 🚀
\\n\\n\\n\\nProject post originally published on the Kyverno blog
\\n\\n\\n\\nKyverno 1.13 released with Sigstore bundle verification, exceptions for ValidatingAdmissionPolicies, new assertion trees, generate enhancements, enhanced ValidatingAdmissionPolicy and PolicyException support, and tons more!
\\n\\n\\n\\nWednesday, October 30, 2024
\\n\\n\\n\\nKyverno 1.13 contains over 700 changes from 39 contributors! In this blog, we will highlight some of the major changes and enhancements for the release.
\\n\\n\\n\\nKyverno 1.13 introduces support for verifying container image signatures that use the Sigstore bundle format. This enables seamless support for GitHub Artifact Attestations to be verified using the verification type SigstoreBundle.
The following example verifies images containing SLSA Provenance created and signed using GitHub Artifact Attestation.
\\n\\n\\n\\nHere is an example policy:
\\n\\n\\n\\napiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: sigstore-image-verification
spec:
validationFailureAction: Enforce
webhookTimeoutSeconds: 30
rules:
- match:
any:
- resources:
kinds:
- Pod
name: sigstore-image-verification
verifyImages:
- imageReferences:
- \\"*\\"
type: SigstoreBundle
attestations:
- type: https://slsa.dev/provenance/v1
attestors:
- entries:
- keyless:
issuer: https://token.actions.githubusercontent.com
subject: https://github.com/nirmata/github-signing-demo/.github/workflows/build-attested-image.yaml@refs/heads/main
rekor:
url: https://rekor.sigstore.dev
additionalExtensions:
githubWorkflowTrigger: push
githubWorkflowName: build-attested-image
githubWorkflowRepository: nirmata/github-signing-demo
conditions:
- all:
- key: \\"{{ buildDefinition.buildType }}\\"
operator: Equals
value: \\"https://actions.github.io/buildtypes/workflow/v1\\"
- key: \\"{{ buildDefinition.externalParameters.workflow.repository }}\\"
operator: Equals
value: \\"https://github.com/nirmata/github-signing-demo\\"
\\n\\n\\n\\nThe demo repository is available at: https://github.com/nirmata/github-signing-demo.
\\n\\n\\n\\nKyverno 1.13 introduces the ability to leverage PolicyException declarations while auto-generating Kubernetes ValidatingAdmissionPolicies directly from Kyverno policies that use the validate.cel subrule. The resources specified within the PolicyException are then used to populate the matchConstraints.excludeResourceRules field of the generated ValidatingAdmissionPolicy, effectively creating exclusions for those resources. This functionality is illustrated below with an example of a Kyverno ClusterPolicy and a PolicyException, along with the resulting ValidatingAdmissionPolicy.
Kyverno policy:
\\n\\n\\n\\napiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-host-path
spec:
  background: false
  rules:
    - name: host-path
      match:
        any:
        - resources:
            kinds:
              - Deployment
              - StatefulSet
            operations:
              - CREATE
              - UPDATE
            namespaceSelector:
              matchExpressions:
                - key: type
                  operator: In
                  values:
                    - connector
      validate:
        failureAction: Audit
        cel:
          expressions:
            - expression: \\"!has(object.spec.template.spec.volumes) || object.spec.template.spec.volumes.all(volume, !has(volume.hostPath))\\"
              message: \\"HostPath volumes are forbidden. The field spec.template.spec.volumes[*].hostPath must be unset.\\"
\\n\\n\\n\\nPolicyException:
\\n\\n\\n\\napiVersion: kyverno.io/v2\\nkind: PolicyException\\nmetadata:\\n name: policy-exception\\nspec:\\n exceptions:\\n - policyName: disallow-host-path\\n ruleNames:\\n - host-path\\n match:\\n any:\\n - resources:\\n kinds:\\n - Deployment\\n names:\\n - important-tool\\n operations:\\n - CREATE\\n - UPDATE\\n
\\n\\n\\n\\nThe generated ValidatingAdmissionPolicy and its binding are as follows:
\\n\\n\\n\\napiVersion: admissionregistration.k8s.io/v1\\nkind: ValidatingAdmissionPolicy\\nmetadata:\\n labels:\\n app.kubernetes.io/managed-by: kyverno\\n name: disallow-host-path\\n ownerReferences:\\n - apiVersion: kyverno.io/v1\\n kind: ClusterPolicy\\n name: disallow-host-path\\nspec:\\n failurePolicy: Fail\\n matchConstraints:\\n resourceRules:\\n - apiGroups:\\n - apps\\n apiVersions:\\n - v1\\n operations:\\n - CREATE\\n - UPDATE\\n resources:\\n - deployments\\n - statefulsets\\n namespaceSelector:\\n matchExpressions:\\n - key: type\\n operator: In\\n values:\\n - connector\\n excludeResourceRules:\\n - apiGroups:\\n - apps\\n apiVersions:\\n - v1\\n operations:\\n - CREATE\\n - UPDATE\\n resourceNames:\\n - important-tool\\n resources:\\n - deployments\\n validations:\\n - expression: \'!has(object.spec.template.spec.volumes) || object.spec.template.spec.volumes.all(volume,\\n !has(volume.hostPath))\'\\n message: HostPath volumes are forbidden. The field spec.template.spec.volumes[*].hostPath\\n must be unset.\\n---\\napiVersion: admissionregistration.k8s.io/v1\\nkind: ValidatingAdmissionPolicyBinding\\nmetadata:\\n labels:\\n app.kubernetes.io/managed-by: kyverno\\n name: disallow-host-path-binding\\n ownerReferences:\\n - apiVersion: kyverno.io/v1\\n kind: ClusterPolicy\\n name: disallow-host-path\\nspec:\\n policyName: disallow-host-path\\n validationActions: [Audit, Warn]\\n
\\n\\n\\n\\nIn addition, Kyverno policies targeting resources within a specific namespace will now generate a ValidatingAdmissionPolicy that utilizes the matchConstraints.namespaceSelector field to scope its enforcement to that namespace.
Policy snippet:
\\n\\n\\n\\nmatch:\\n any:\\n - resources:\\n kinds:\\n - Deployment\\n operations:\\n - CREATE\\n - UPDATE\\n namespaces:\\n - production\\n - staging\\n
\\n\\n\\n\\nThe generated ValidatingAdmissionPolicy:
\\n\\n\\n\\nmatchConstraints:
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - production
          - staging
  resourceRules:
    - apiGroups:
        - apps
      apiVersions:
        - v1
      operations:
        - CREATE
        - UPDATE
      resources:
        - deployments
\\n\\n\\n\\nKyverno-JSON allows Kyverno policies to be used anywhere, even for non-Kubernetes workloads. It introduces the powerful concept of assertion trees.
\\n\\n\\n\\nThe Kyverno CLI previously added support for assertion trees; now, in release 1.13, assertion trees can also be used in validation rules as a sub-type.
\\n\\n\\n\\nHere is an example of a policy that uses an assertion tree to deny pods from using the default service account:
\\n\\n\\n\\napiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-default-sa
spec:
  validationFailureAction: Enforce
  rules:
    - match:
        any:
        - resources:
            kinds:
              - Pod
      name: disallow-default-sa
      validate:
        message: default ServiceAccount should not be used
        assert:
          object:
            spec:
              (serviceAccountName == 'default'): false
\\n\\n\\n\\nThe foreach declaration allows the generation of multiple target resources from sub-elements in resource declarations. Each foreach entry must contain a list attribute, written as a JMESPath expression without braces, that defines the sub-elements it processes.
Here is an example of creating NetworkPolicies for a list of namespaces; the namespaces are stored in a ConfigMap, which can easily be configured dynamically.
\\n\\n\\n\\napiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: foreach-generate-data
spec:
  rules:
    - match:
        any:
        - resources:
            kinds:
              - ConfigMap
      name: k-kafka-address
      generate:
        generateExisting: false
        synchronize: true
        orphanDownstreamOnPolicyDelete: false
        foreach:
          - list: request.object.data.namespaces | split(@, ',')
            apiVersion: networking.k8s.io/v1
            kind: NetworkPolicy
            name: my-networkpolicy-{{element}}-{{ elementIndex }}
            namespace: '{{ element }}'
            data:
              metadata:
                labels:
                  request.namespace: '{{ request.object.metadata.name }}'
                  element: '{{ element }}'
                  elementIndex: '{{ elementIndex }}'
              spec:
                podSelector: {}
                policyTypes:
                  - Ingress
                  - Egress
\\n\\n\\n\\nThe triggering ConfigMap is defined as follows; its data contains a namespaces field that lists multiple namespaces.
\\n\\n\\n\\nkind: ConfigMap
apiVersion: v1
metadata:
  name: default-deny
  namespace: default
data:
  namespaces: foreach-ns-1,foreach-ns-2
\\n\\n\\n\\nSimilarly, below is an example of a clone-source type of foreach declaration that clones the source Secret into the list of matching existing namespaces stored in the same ConfigMap as above.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: foreach-clone
spec:
  rules:
    - match:
        any:
        - resources:
            kinds:
              - ConfigMap
            namespaces:
              - default
      name: k-kafka-address
      generate:
        generateExisting: false
        synchronize: true
        foreach:
          - list: request.object.data.namespaces | split(@, \',\')
            apiVersion: v1
            kind: Secret
            name: cloned-secret-{{ elementIndex }}-{{ element }}
            namespace: \'{{ element }}\'
            clone:
              namespace: default
              name: source-secret
\\n\\n\\n\\nIn addition, each foreach declaration supports the Context and Preconditions declarations. For more information, please see the Kyverno documentation.
This release also allows updates to the generate rule pattern. In addition to deletion, if the triggering resource is altered in a way such that it no longer matches the definition in the rule, that too will cause the removal of the downstream resource.
\\n\\n\\n\\nIn the case where the API server returns an error, apiCall.default can be used to provide a fallback value for the API call context entry.
The following example shows how to add a default value to context entries:
\\n\\n\\n\\n context:\\n - name: currentnamespace\\n apiCall:\\n urlPath: “/api/v1/namespaces/{{ request.namespace }}”\\n jmesPath: metadata.name\\n default: default\\n
\\n\\n\\n\\nKyverno Service API calls now also support custom headers. This can be useful for authentication or adding other HTTP request headers. Here is an example of adding a token in the HTTP Authorization header:
\\n\\n\\n\\n context:\\n - name: result\\n apiCall:\\n method: POST\\n data:\\n - key: foo\\n value: bar\\n - key: namespace\\n value: \\"{{ `{{ request.namespace }}` }}\\"\\n service:\\n url: http://my-service.svc.cluster.local/validation\\n headers:\\n - key: \\"UserAgent\\"\\n value: \\"Kyverno Policy XYZ\\"\\n - key: \\"Authorization\\"\\n value: \\"Bearer {{ MY_SECRET }}\\"\\n
\\n\\n\\n\\nIn addition to validate and verifyImages rules, Kyverno 1.13 supports reporting for generate and mutate rules, including mutate existing policies, to record policy results. The container flag --enableReporting can be used to enable or disable reports for specific rule types. It accepts the comma-separated values validate, mutate, mutateExisting, generate, and imageVerify. See details here.
A result entry will be recorded in the policy report for each rule decision:
\\n\\n\\n\\napiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
  labels:
    app.kubernetes.io/managed-by: kyverno
  namespace: default
results:
  - message: mutated Pod/good-pod in namespace default
    policy: add-labels
    result: pass
    rule: add-labels
    scored: true
    source: kyverno
scope:
  apiVersion: v1
  kind: Pod
  name: good-pod
  namespace: default
...
\\n\\n\\n\\nNote that the proper permissions need to be granted to the reports controller; a warning message will be returned upon policy admission if the required RBAC permissions are not configured.
\\n\\n\\n\\nA new field, reportProperties, is introduced to add custom data to policy reports. For example, the validate rule below adds two additional entries, operation and objName, to the policy reports:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner
spec:
  background: false
  rules:
    - match:
        any:
        - resources:
            kinds:
              - Namespace
      name: check-owner
      context:
        - name: objName
          variable:
            jmesPath: request.object.metadata.name
      reportProperties:
        operation: '{{ request.operation }}'
        objName: '{{ objName }}'
      validate:
        validationFailureAction: Audit
        message: The `owner` label is required for all Namespaces.
        pattern:
          metadata:
            labels:
              owner: ?*
\\n\\n\\n\\nYou can find the two custom entries added to results.properties:
apiVersion: wgpolicyk8s.io/v1alpha2\\nkind: ClusterPolicyReport\\nmetadata:\\n ownerReferences:\\n - apiVersion: v1\\n kind: Namespace\\n name: bar\\nresults:\\n- message: validation rule 'check-owner' passed.\\n policy: require-owner\\n result: pass\\n rule: check-owner\\n scored: true\\n source: kyverno\\n properties:\\n objName: bar\\n operation: CREATE\\nscope:\\n apiVersion: v1\\n kind: Namespace\\n name: bar\\n
\\n\\n\\n\\nKyverno’s GlobalContextEntry provides a powerful mechanism to fetch external data and use it within policies. When leveraging the apiCall feature to retrieve data from an API, transient network issues can sometimes hinder successful retrieval.
\\n\\n\\n\\nTo address this, Kyverno now offers built-in retry logic for API calls within GlobalContextEntry. You can now optionally specify a retryLimit for your API calls:
\\n\\n\\n\\napiVersion: kyverno.io/v2alpha1\\nkind: GlobalContextEntry\\nmetadata:\\n name: gctxentry-apicall-correct\\nspec:\\n apiCall:\\n urlPath: \\"/apis/apps/v1/namespaces/test-globalcontext-apicall-correct/deployments\\"\\n refreshInterval: 1h\\n retryLimit: 3\\n
\\n\\n\\n\\nThe retryLimit field determines the number of times Kyverno will attempt to make the API call if it initially fails. This field is optional and defaults to 3, ensuring a reasonable level of resilience against temporary network hiccups.
By incorporating this retry mechanism, Kyverno further strengthens its ability to reliably fetch external data, ensuring your policies can function smoothly even in the face of occasional connectivity issues. This enhancement improves the overall robustness and dependability of your Kubernetes policy enforcement framework.
\\n\\n\\n\\nKyverno CLI now allows you to dynamically inject global context entries using a Values file. This feature facilitates flexible policy testing and execution by simulating different scenarios without modifying GlobalContextEntry resources in your cluster.
\\n\\n\\n\\nYou can now define global values and rule-specific values within the Values file, providing greater control over policy evaluation during testing.
\\n\\n\\n\\napiVersion: cli.kyverno.io/v1alpha1\\nkind: Value\\nmetadata:\\n name: values\\nglobalValues:\\n request.operation: CREATE\\npolicies:\\n - name: gctx\\n rules:\\n - name: main-deployment-exists\\n values:\\n deploymentCount: 1\\n
\\n\\n\\n\\nIn this example, request.operation is set as a global value, and deploymentCount is set for a specific rule in the gctx policy. When using the Kyverno CLI, you can reference this Values file to inject these global context entries into your policy evaluation.
The Kyverno project strives to be secure and production-ready, while providing ease of use. This release contains important changes to further enhance the security of the project.
\\n\\n\\n\\nPrior versions of Kyverno included wildcard view permissions. These have been removed in 1.13 and replaced with a role binding to the system view role.
\\n\\n\\n\\nThis change does not impact policy behaviors during admission controls, but it may impact users with mutate and generate policies for custom resources, and may impact reporting of policy results for validation rules on custom resources. A Helm option was added to upgrade Kyverno without breaking existing policies; see the upgrade guidance here.
\\n\\n\\n\\nIn prior versions, policy exceptions were allowed in all namespaces. This creates a potential security issue, as any user with permission to create a policy exception can bypass policies, even in other namespaces. See CVE-2024-48921 for more details.
\\n\\n\\n\\nThis release changes the defaults to disable policy exceptions and only allows exceptions to be created in a specified namespace. To maintain backward compatibility, follow the upgrade guidance.
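\\n\\n\\n\\nAs a rough illustration, the Kyverno Helm chart exposes feature flags for policy exceptions; the values below are a minimal sketch (the exact keys and the kyverno-exceptions namespace name are assumptions, so check your chart version before applying):
features:
  policyExceptions:
    # Assumed keys: enable exceptions and restrict them to one dedicated namespace
    enabled: true
    namespace: kyverno-exceptions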
\\n\\n\\n\\nA warning message can now be returned along with admission responses via the policy setting spec.emitWarning; this can be used to report policy violations as well as mutations upon admission events.
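\\n\\n\\n\\nFor illustration, here is a minimal sketch of a policy enabling this setting (the rule itself is a placeholder; only the spec.emitWarning field is the point here):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label        # hypothetical policy name
spec:
  emitWarning: true               # return a warning with the admission response
  rules:
    - name: check-team-label
      match:
        any:
        - resources:
            kinds:
              - Pod
      validate:
        failureAction: Audit
        message: "The label `team` is required."
        pattern:
          metadata:
            labels:
              team: "?*"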
Kyverno performs nested variable substitution by default, which may not be desirable in certain situations. Take the following ConfigMap as an example: it defines .hcl string content using the same {{ }} notation that Kyverno uses for its variable syntax. In this case, Kyverno needs to be instructed not to attempt to resolve variables in the HCL; this can be achieved with the {{- ... }} notation for shallow (one-time-only) substitution of variables.
apiVersion: v1\\ndata:\\n config: |-\\n from_string\\n {{ some hcl template }} \\nkind: ConfigMap\\nmetadata:\\n annotations:\\n labels:\\n argocd.development.cpl.<removed>.co.at/app: corp-tech-ap-team-ping-ep\\n name: vault-injector-config-http-echo\\n namespace: corp-tech-ap-team-ping-ep\\n
\\n\\n\\n\\nTo only substitute the rule data with the HCL, and not perform nested substitutions, the following policy uses the declaration {{- hcl }} for shallow substitution.
apiVersion: kyverno.io/v1\\nkind: ClusterPolicy\\nmetadata:\\n name: vault-auth-backend\\nspec:\\n validationFailureAction: Audit\\n background: true\\n mutateExistingOnPolicyUpdate: true\\n rules:\\n - name: vault-injector-config-blue-to-green-auth-backend\\n context:\\n - name: hcl\\n variable:\\n jmesPath: replace_all( '{{ request.object.data.config }}', 'from_string','to_string')\\n match:\\n any:\\n - resources:\\n kinds:\\n - ConfigMap\\n names:\\n - test-*\\n namespaces:\\n - corp-tech-ap-team-ping-ep\\n mutate:\\n patchStrategicMerge:\\n data:\\n config: '{{- hcl }}'\\n targets:\\n - apiVersion: v1\\n kind: ConfigMap\\n name: '{{ request.object.metadata.name }}'\\n namespace: '{{ request.object.metadata.namespace }}'\\n
\\n\\n\\n\\nKyverno-managed webhook configurations are auto-cleaned up upon uninstallation. This behavior could be broken if Kyverno loses RBAC permissions to do so given the random resources deletion order. This release introduces a finalizer-based cleanup solution to ensure webhooks are removed successfully.
\\n\\n\\n\\nThis feature is in beta stage and will be used as the default cleanup strategy in the future.
\\n\\n\\n\\nKyverno 1.13 introduces new changes in the policy CRDs:
\\n\\n\\n\\nNote that the deprecated fields will be removed in a future release, so migration to the new settings is recommended.
\\n\\n\\n\\nKyverno 1.13 promises to be a great release, with many new features, enhancements, and fixes. To get started with Kyverno try the quick start guides or head to the installation section of the docs.
\\n\\n\\n\\nTo get the most value out of Kyverno, check out the available enterprise solutions.
\\n\\n\\n\\n\\n\\nProject post by Lin Sun, Solo.io, for the Istio Steering and Technical Oversight Committees
\\n\\n\\n\\nWe are proud to announce that Istio’s ambient data plane mode has reached General Availability, with the ztunnel, waypoints and APIs being marked as Stable by the Istio TOC. This marks the final stage in Istio’s feature phase progression, signaling that ambient mode is fully ready for broad production usage.
\\n\\n\\n\\nAmbient mesh — and its reference implementation with Istio’s ambient mode — was announced in September 2022. Since then, our community has put in 26 months of hard work and collaboration, with contributions from Solo.io, Google, Microsoft, Intel, Aviatrix, Huawei, IBM, Red Hat, and many others. Stable status in 1.24 indicates the features of ambient mode are now fully ready for broad production workloads. This is a huge milestone for Istio, bringing Istio to production readiness without sidecars, and offering users a choice.
\\n\\n\\n\\nFrom the launch of Istio in 2017, we have observed a clear and growing demand for mesh capabilities for applications — but heard that many users found the resource overhead and operational complexity of sidecars hard to overcome. Challenges that Istio users shared with us include how sidecars can break applications after they are added, the large CPU and memory requirement for a proxy with every workload, and the inconvenience of needing to restart application pods with every new Istio release.
\\n\\n\\n\\nAs a community, we designed ambient mesh from the ground up to tackle these problems, alleviating the previous barriers of complexity faced by users looking to implement service mesh. The new concept was named ‘ambient mesh’ as it was designed to be transparent to your application, with no proxy infrastructure collocated with user workloads, no subtle changes to configuration required to onboard, and no application restarts required. In ambient mode it is trivial to add or remove applications from the mesh. All you need to do is label a namespace, and all applications in that namespace are instantly added to the mesh. This immediately secures all traffic within that namespace with industry-standard mutual TLS encryption — no other configuration or restarts required! Refer to the Introducing Ambient Mesh blog for more information on why we built Istio’s ambient mode.
\\n\\n\\n\\nThe core innovation behind ambient mesh is that it slices Layer 4 (L4) and Layer 7 (L7) processing into two distinct layers. Istio’s ambient mode is powered by lightweight, shared L4 node proxies and optional L7 proxies, removing the need for traditional sidecar proxies from the data plane. This layered approach allows you to adopt Istio incrementally, enabling a smooth transition from no mesh, to a secure overlay (L4), to optional full L7 processing — on a per-namespace basis, as needed, across your fleet.
\\n\\n\\n\\nBy utilizing ambient mesh, users bypass some of the previously restrictive elements of the sidecar model. Server-send-first protocols now work, most reserved ports are now available, and the ability for containers to bypass the sidecar — either maliciously or not — is eliminated.
\\n\\n\\n\\nThe lightweight shared L4 node proxy is called the ztunnel (zero-trust tunnel). ztunnel drastically reduces the overhead of running a mesh by removing the need to potentially over-provision memory and CPU within a cluster to handle expected loads. In some use cases, the savings can exceed 90% or more, while still providing zero-trust security using mutual TLS with cryptographic identity, simple L4 authorization policies, and telemetry.
\\n\\n\\n\\nThe L7 proxies are called waypoints. Waypoints process L7 functions such as traffic routing, rich authorization policy enforcement, and enterprise-grade resilience. Waypoints run outside of your application deployments and can scale independently based on your needs, which could be for the entire namespace or for multiple services within a namespace. Compared with sidecars, you don’t need one waypoint per application pod, and you can scale your waypoint effectively based on its scope, thus saving significant amounts of CPU and memory in most cases.
\\n\\n\\n\\nThe separation between the L4 secure overlay layer and L7 processing layer allows incremental adoption of the ambient mode data plane, in contrast to the earlier binary “all-in” injection of sidecars. Users can start with the secure L4 overlay, which offers a majority of features that people deploy Istio for (mTLS, authorization policy, and telemetry). Complex L7 handling such as retries, traffic splitting, load balancing, and observability collection can then be enabled on a case-by-case basis.
\\n\\n\\n\\nThe ztunnel image on Docker Hub has reached over 1 million downloads, with ~63,000 pulls in the last week alone.
\\n\\n\\n\\nWe asked a few of our users for their thoughts on ambient mode’s GA:
\\n\\n\\n\\n\\n\\n\\n\\n\\nIstio’s implementation of a service mesh with their ambient mesh design has been a great addition to our Kubernetes clusters to simplify the team responsibilities and overall network architecture of the mesh. In conjunction with the Gateway API project it has given me a great way to enable developers to get their networking needs met at the same time as only delegating as much control as needed. While it’s a rapidly evolving project it has been solid and dependable in production and will be our default option for implementing networking controls in a Kubernetes deployment going forth. — Daniel Loader, Lead Platform Engineer at Quotech
\\n
\\n\\n\\n\\n\\nIt is incredibly easy to install ambient mesh with the Helm chart wrapper. Migrating is as simple as setting up a waypoint gateway, updating labels on a namespace, and restarting. I’m looking forward to ditching sidecars and recuperating resources. Moreover, easier upgrades. No more restarting deployments! — Raymond Wong, Senior Architect at Forbes
\\n
\\n\\n\\n\\n\\nIstio’s ambient mode has served our production system since it became Beta. We are pleased by its stability and simplicity and are looking forward to additional benefits and features coming together with the GA status. Thanks to the Istio team for the great efforts! — Saarko Eilers, Infrastructure Operations Manager at EISST International Ltd
\\n
\\n\\n\\n\\n\\nBy Switching from AWS App Mesh to Istio in ambient mode, we were able to slash about 45% of the running containers just by removing sidecars and SPIRE agent DaemonSets. We gained many benefits, such as reducing compute costs or observability costs related to sidecars, eliminating many of the race conditions related to sidecars startup and shutdown, plus all the out-of-the-box benefits just by migrating, like mTLS, zonal awareness and workload load balancing. — Ahmad Al-Masry, DevSecOps Engineering Manager at Harri
\\n
\\n\\n\\n\\n\\nWe chose Istio because we’re excited about ambient mesh. Different from other options, with Istio, the transition from sidecar to sidecar-less is not a leap of faith. We can build up our service mesh infrastructure with Istio knowing the path to sidecar-less is a two way door. — Troy Dai, Senior Staff Software Engineer at Coinbase
\\n
\\n\\n\\n\\n\\nExtremely proud to see the fast and steady growth of ambient mode to GA, and all the amazing collaboration that took place over the past months to make this happen! We are looking forward to finding out how the new architecture is going to revolutionize the telcos world. — Faseela K, Cloud Native Developer at Ericsson
\\n
\\n\\n\\n\\n\\nWe are excited to see the Istio dataplane evolve with the GA release of ambient mode and are actively evaluating it for our next-generation infrastructure platform. Istio’s community is dynamic and welcoming, and ambient mesh is a testament to the community embracing new ideas and pragmatically working to improve developer experience operating Istio at scale. — Tyler Schade, Distinguished Engineer at GEICO Tech
\\n
\\n\\n\\n\\n\\nWith Istio’s ambient mode reaching GA, we finally have a service mesh solution that isn’t tied to the pod lifecycle, addressing a major limitation of sidecar-based models. Ambient mesh provides a more lightweight, scalable architecture that simplifies operations and reduces our infrastructure costs by eliminating the resource overhead of sidecars. — Bartosz Sobieraj, Platform Engineer at Spond
\\n
\\n\\n\\n\\n\\nOur team chose Istio for its service mesh features and strong alignment with the Gateway API to create a robust Kubernetes-based hosting solution. As we integrated applications into the mesh, we faced resource challenges with sidecar proxies, prompting us to transition to ambient mode in Beta for improved scalability and security. We started with L4 security and observability through ztunnel, gaining automatic encryption of in-cluster traffic and transparent traffic flow monitoring. By selectively enabling L7 features and decoupling the proxy from applications, we achieved seamless scaling and reduced resource utilization and latency. This approach allowed developers to focus on application development, resulting in a more resilient, secure, and scalable platform powered by ambient mode. — Jose Marques, Senior DevOps at Blip.pt
\\n
\\n\\n\\n\\n\\nWe are using Istio to ensure strict mTLS L4 traffic in our mesh and we are excited for ambient mode. Compared to sidecar mode it’s a massive save on resources and at the same time it makes configuring things even more simple and transparent. — Andrea Dolfi, DevOps Engineer
\\n
The general availability of ambient mode means a set of features is now considered stable, including using istioctl to operate waypoints and to troubleshoot ztunnel and waypoints. Refer to the feature status page for the full list and more information.
\\n\\n\\n\\nWe are not standing still! There are a number of features that we continue to work on for future releases, including some that are currently in Alpha/Beta.
\\n\\n\\n\\nIn our upcoming releases, we expect to move quickly on the following extensions to ambient mode:
\\n\\n\\n\\nSidecars are not going away, and remain first-class citizens in Istio. You can continue to use sidecars, and they will remain fully supported. While we believe most use cases will be best served with a mesh in ambient mode, the Istio project remains committed to ongoing sidecar mode support.
\\n\\n\\n\\nWith the 1.24 release of Istio and the GA release of ambient mode, it is now easier than ever to try out Istio on your own workloads.
\\n\\n\\n\\nYou can engage with the developers in the #ambient channel on the Istio Slack, or use the discussion forum on GitHub for any questions you may have.
\\n\\nMember post by Rajdeep Saha, Principal Solutions Architect, AWS and Praseeda Sathaye, Principal SA, Containers & OSS, AWS
\\n\\n\\n\\n
Karpenter is an open-source project that provides node lifecycle management to optimize the efficiency and cost of running workloads on Kubernetes clusters. AWS created and open sourced Karpenter in 2021 to help automate how customers select, provision, and scale infrastructure in their clusters and provide more flexibility for Kubernetes users to take full advantage of unique infrastructure offerings across different cloud providers. In 2023, the project graduated to beta, and AWS contributed the vendor-neutral core of the project to the Cloud Native Computing Foundation (CNCF) through the Kubernetes Autoscaling Special Interest Group (SIG). In 2024, AWS released Karpenter version 1.0.0, which marks the final milestone in the project’s maturity. With this release, all Karpenter APIs will remain available in future 1.0 minor versions and will not be modified in ways that result in breaking changes.
Karpenter is available as open-source software (OSS) under an Apache 2.0 license. It separates the generic logic for Kubernetes application-awareness and workload binpacking from the creation and running of API requests to launch or terminate compute resources for a given cloud provider. By developing cloud provider-specific integrations that interact with their respective compute APIs, Karpenter enables individual cloud providers such as AWS, Azure, GCP, and others to leverage its capabilities within their respective environments. In 2023, Microsoft released the Karpenter Provider for running Karpenter on Azure Kubernetes Service (AKS).
Today, Karpenter has gained widespread popularity within the Kubernetes community, with a diverse range of organizations and enterprises using its capabilities to help improve application availability, lower operational overhead, and increase cost-efficiency.
\\n\\n\\n\\n
Karpenter’s job in a Kubernetes cluster is to make application and Kubernetes-aware compute capacity decisions. It is built as a Kubernetes Operator, runs in the Kubernetes cluster, and manages cluster compute infrastructure. There are two kinds of decisions Karpenter makes: to provision new compute and to deprovision that compute when it’s no longer needed. Karpenter works by watching for pods that the Kubernetes scheduler has marked as unschedulable, evaluating scheduling constraints (resource requests, node-selectors, affinities, tolerations, and topology spread constraints) requested by the pods, provisioning nodes that meet the requirements of the pods, and deprovisioning the nodes when the nodes are no longer needed. Karpenter’s workload consolidation feature proactively identifies and reschedules underused workloads onto a more cost-efficient set of instances, either by reusing existing instances within the cluster or by launching new, optimized instances, thereby maximizing resource usage and minimizing operational costs.
Karpenter scaling is controlled using Kubernetes native YAMLs, specifically through the use of NodePool and NodeClass custom Kubernetes resources.
\\n\\n\\n\\nNodePools set constraints on the nodes that Karpenter provisions in the Kubernetes cluster. Each NodePool defines requirements such as instance types, availability zones, architectures (for example AMD64 or ARM64), capacity types (spot or on-demand), and other node settings that apply to all the nodes launched in the NodePool. It also allows setting limits on total resources such as CPU, memory, and GPUs that the NodePool can consume. The following is an example of NodePool configuration.
\\n\\n\\n\\napiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [\\"c\\",\\"m\\",\\"r\\",\\"t\\"]
        - key: \\"karpenter.k8s.aws/instance-family\\"
          operator: In
          values: [\\"m5\\",\\"m5d\\",\\"c5\\",\\"c5d\\",\\"c4\\",\\"r4\\"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: [\\"nano\\",\\"micro\\",\\"small\\",\\"medium\\"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [\\"us-west-2a\\",\\"us-west-2b\\"]
        - key: kubernetes.io/arch
          operator: In
          values: [\\"amd64\\",\\"arm64\\"]
  limits:
    cpu: 100
\\n\\n\\n\\nRefer to the Karpenter documentation for the complete list of fields for NodePool requirements.
\\n\\n\\n\\nThe following is an example of a NodePool with taints, user-defined labels, and annotations that are added to all the nodes provisioned by Karpenter.
\\n\\n\\n\\napiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    metadata:
      annotations:
        application/name: \\"app-a\\"
      labels:
        team: team-a
    spec:
      taints:
        - key: example.com/special-taint
          value: \\"true\\"
          effect: NoSchedule
\\n\\n\\n\\napiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      nodeSelector:
        team: team-a
\\n\\n\\n\\nKarpenter supports all standard Kubernetes scheduling constraints, such as node selectors, node affinity, taints/tolerations, and topology spread constraints. This allows applications to use these constraints when scheduling pods on the nodes provisioned by Karpenter.
\\n\\n\\n\\nNodeClasses in Karpenter allow you to configure cloud provider-specific settings for your nodes in Kubernetes cluster. Each NodePool references a NodeClass that determines the specific configuration of nodes that Karpenter provisions. For example, you can specify settings such as the Amazon Machine Image (AMI) family, subnet and security group selectors, AWS Identity and Access Management (IAM) role/instance profile, node labels, and various Kubelet configurations in AWS EC2NodeClass and similarly for Azure AKSNodeClass.
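\\n\\n\\n\\nAs a rough sketch of what an AWS NodeClass can look like (the discovery tag value and IAM role name below are placeholders, not values from this post):
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # Select the EKS-optimized AMI family and version
  amiSelectorTerms:
    - alias: al2023@latest
  # Hypothetical IAM role assumed by the provisioned instances
  role: KarpenterNodeRole-my-cluster
  # Discover subnets and security groups by tag (placeholder tag value)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
A NodePool then points at this class through the nodeClassRef field in its template.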
\\n\\n\\n\\n
Karpenter is more than an efficient cluster autoscaler for Kubernetes. Karpenter optimizes compute costs, helps upgrade and patch data plane worker nodes, and delivers powerful and explorative use cases by pairing with other CNCF tools.
In the previous section, we saw how Karpenter provisions appropriate worker virtual machines (VMs) based on pod resource requests. As the workloads in a Kubernetes cluster change and scale, it can be necessary to launch new instances to make sure they have the compute resources they need. Over time, those instances can become under-used as some workloads scale down or are removed from the cluster. Workload consolidation for Karpenter automatically looks for opportunities to reschedule these workloads onto a set of more cost-efficient instances, whether they are already in the cluster or need to be launched.
In the preceding diagram, the first node is highly used but the others aren’t used as well as they could be. With Karpenter, you can enable the consolidation feature in the NodePool YAML:
\\n\\n\\n\\napiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
\\n\\n\\n\\nKarpenter is always evaluating and working to reduce the cluster cost. Karpenter consolidates workloads onto the fewest, lowest-cost instances, while still respecting the pod’s resource and scheduling constraints. With the preceding scenario, Karpenter moves the pods from the last two nodes into the second node, and terminates the resultant empty nodes:
\\n\\n\\n\\nKarpenter prioritizes nodes to consolidate based on the least number of pods scheduled. For users with workloads that experience sudden spikes in demand or interruptible jobs, frequent pod creation and deletion (pod churn) might be a concern. Karpenter offers a consolidateAfter setting to control how quickly it attempts to consolidate nodes to maintain optimal capacity and minimize node churn. By specifying a value in hours, minutes, or seconds, users can determine the delay before Karpenter initiates consolidation actions in response to pod additions or removals.
\\n\\n\\n\\n apiVersion: karpenter.sh/v1\\nkind: NodePool\\nmetadata:\\n name: default\\nspec:\\n disruption:\\n consolidationPolicy: WhenEmptyOrUnderutilized\\n consolidateAfter: 1h
\\n\\n\\n\\nWith consolidation, Karpenter also rightsizes the worker nodes. For example, in the following case, if Karpenter consolidates the pod from the third node to the second (m5.xlarge), then there is still underusage. Karpenter instead provisions a smaller node (m5.large) and consolidates the pods, resulting in lower cost.
\\n\\n\\n\\nTo learn more about consolidation, refer to the official documentation.
\\n\\n\\n\\napiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
\\n\\n\\n\\namiSelectorTerms is a required field of the NodeClass, and a new term, alias, has been introduced with version 1.0; it consists of an AMI family and a version (family@version). If an alias exists in the NodeClass, then Karpenter selects the AMI supplied by the cloud provider for that family. With this new feature, users can also pin to a specific version of an AMI. For AWS, the following Amazon Elastic Kubernetes Service (Amazon EKS) optimized AMI families can be configured: al2, al2023, bottlerocket, windows2019, and windows2022.
All the nodes provisioned by this NodeClass will have the latest Bottlerocket AMI. Because this alias uses the @latest version, when the cloud provider releases a new optimized AMI for the Kubernetes version the cluster is running, Karpenter updates the worker node AMIs automatically, respecting the Kubernetes scheduling constraints. Worker nodes are upgraded in a rolling deployment fashion. If the cluster is upgraded to a newer version, then Karpenter automatically upgrades the worker nodes with the latest AMI for that new version, without manual intervention. This takes away management overhead and lets you always run with the latest and most secure AMI. This works well in pre-production environments, where it is convenient to be auto-upgraded to the latest version for testing, but more control over AMI versions is recommended in production environments.
Alternatively, you can also pin your worker nodes to a specific version of the AMI as follows:
\\n\\n\\n\\napiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: bottlerocket@v1.20.3
\\n\\n\\n\\nIn this case, if the cloud provider releases a new Bottlerocket AMI, Karpenter doesn't drift the worker nodes.
\\n\\n\\n\\nKarpenter also supports custom AMIs. You can use the existing tags, name, or ID field in amiSelectorTerms to select an AMI. In the following case, the AMI with ID ami-123 is selected to provision the nodes. The amiFamily Bottlerocket injects pre-generated user data into the provisioned node.
\\n\\n\\n\\napiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  ...
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - id: ami-123
\\n\\n\\n\\nTo upgrade the worker nodes, change the amiSelectorTerms to select a different AMI; the nodes then drift and upgrade to the assigned AMI.
Karpenter Working with Other CNCF projects
\\n\\n\\n\\nKarpenter can be used with other CNCF projects to deliver powerful solutions for common use cases. One prominent example of this is using Kubernetes Event Driven Autoscaling (KEDA) with Karpenter to implement event driven workloads. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed. One popular implementation is to scale up worker nodes to accommodate pods that process messages coming into a queue:
\\n\\n\\n\\nOften users want to scale down the number of worker nodes during off hours. KEDA and Karpenter can support this use case:
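\\n\\n\\n\\nAs an illustrative sketch, a KEDA ScaledObject like the one below can scale a queue-processing Deployment on queue depth, and Karpenter then launches or removes nodes as the resulting pods become schedulable or are removed (the Deployment name, queue URL, and account ID are placeholders):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker            # hypothetical Deployment processing the queue
  minReplicaCount: 0              # allow scale-to-zero during off hours
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-west-2.amazonaws.com/000000000000/jobs   # placeholder queue
        queueLength: "5"          # target messages per replica
        awsRegion: us-west-2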
\\n\\n\\n\\nKarpenter can be combined with other CNCF projects such as Prometheus, Argo Workflows, and Grafana to achieve diverse use cases. Check out this previous talk from KubeCon EU 2024 on how Argo Workflows can be combined with Karpenter to migrate from Cluster Autoscaler.
\\n\\n\\n\\n
To get started with Karpenter, you can follow the official Getting Started with Karpenter guide, which provides a step-by-step procedure for creating an EKS cluster using eksctl and adding Karpenter to it. Alternatively, if you prefer using Terraform, then you can use the Amazon EKS Blueprints for Terraform, which includes a Karpenter module, thus streamlining the process of setting up Karpenter alongside your EKS cluster. Furthermore, there are guides available for setting up Karpenter with other Kubernetes distributions such as kOps on AWS. And if you want to migrate from Kubernetes Cluster Autoscaler to Karpenter for automatic node provisioning on an existing EKS cluster, then you can refer to this guide for the detailed steps.
Karpenter has evolved far beyond being a tool for autoscaling, showcasing its versatility and deeper integration within the cloud-native ecosystem. Karpenter not only scales worker nodes but also drives cost efficiencies, seamlessly managing diverse workloads such as generative AI, facilitating data plane upgrades with precision, and more.
Looking ahead, the possibilities for Karpenter are endless, especially as organizations explore groundbreaking use cases. We are only beginning to scratch the surface of what Karpenter can achieve when combined with other CNCF projects. The potential for Karpenter to contribute to next-generation infrastructure is immense, and we can’t wait to observe the inventive and powerful use cases users come up with, making their cloud operations more efficient, scalable, and intelligent.
To shape the future of Karpenter, let us know what features we should work on by upvoting and commenting here.
\\n\\n\\n\\nIf you are attending KubeCon NA 2024, then you can meet with us at the AWS Booth F1, or attend our Karpenter workshop Tutorial: Kubernetes Smart Scaling: Getting Started with Karpenter to learn more.
\\n\\n\\n\\nRajdeep Saha, Principal Solutions Architect, AWS
\\n\\n\\n\\nRaj is the Principal Specialist SA for Containers, and Serverless at AWS. Rajdeep has architected high profile AWS applications serving millions of users. He is a published instructor on Kubernetes, Serverless, DevOps, and System Design, has published blogs, and presented well-received talks at major events, such as AWS Re:Invent, Kubecon, AWS Summits.
\\n\\n\\n\\nPraseeda Sathaye, Principal SA, Containers & OSS
\\n\\n\\n\\nPraseeda Sathaye is a Principal Specialist for App Modernization and Containers at Amazon Web Services, based in the Bay Area in California. She has been focused on helping customers accelerate their cloud-native adoption journey by modernizing their platform infrastructure and internal architecture using microservices strategies, containerization, platform engineering, GitOps, Kubernetes and service mesh. Praseeda is an ardent advocate for leveraging Generative AI (GenAI) on Amazon Elastic Kubernetes Service (EKS) to unlock the full potential of cutting-edge technologies, enabling the development of AI-powered applications.
\\n\\n\\n\\nProject post originally on the Litmus blog by Sayan Mondal, Community Manager and Maintainer
\\n\\n\\n\\nOver the past few years, LitmusChaos has evolved tremendously, becoming a leading open-source tool for Chaos Engineering within the Cloud Native ecosystem. It’s been inspiring to watch our community grow, from engineers just getting started with chaos testing to large teams running complex fault scenarios across cloud and on-prem environments.
\\n\\n\\n\\nOn behalf of the maintainers, we’re deeply grateful for the energy and passion each of you brings to the project. Your contributions, ideas, and feedback have fueled LitmusChaos’ growth and helped tackle real-world resilience challenges in a collaborative, open-source environment. Whether it’s through insightful discussions in our Slack channels, pull requests on GitHub, or your experiences shared at events, it’s clear we’re building something special—together!
\\n\\n\\n\\nOur maintainers and contributors will be on-site, eager to meet with attendees, discuss how to get started with LitmusChaos, answer questions, and share opportunities to contribute to the project. We invite you to stop by, say hello, and learn more about how chaos engineering is transforming modern cloud-native applications.
\\n\\n\\n\\nProject Pavilion Kiosk #16A
\\n\\n\\n\\nLocation: Level 1 | Halls A-C + 1-5 | Project Pavilion (Hall 1), Kiosk #16A
Shift: Half-shift AM schedule
\\n\\n\\n\\nProject Pavilion Hours:
\\n\\n\\n\\nOur team will be at the Project Pavilion ready to talk chaos engineering, share resources, and hand out some awesome LitmusChaos stickers, candies and more. No matter your experience level—whether you’re an expert or just beginning to explore chaos engineering—this is the perfect opportunity to get personalized guidance and learn best practices for using LitmusChaos.
\\n\\n\\n\\nThere are plenty of ways to connect with us beyond our kiosk! Here’s where you can find us around the venue:
\\n\\n\\n\\nIf you’d like to set up a specific time to meet with us, please feel free to email matthew.schillerstrom@harness.io or sayan.mondal@harness.io—we’d love to connect at KubeCon NA 2024!
\\n\\n\\n\\nOur booth offers a hands-on opportunity to dive into chaos engineering, ask questions, and explore the latest developments in LitmusChaos. Here are some highlights:
\\n\\n\\n\\nWhether you’re facing blockers, looking for experiment guidance, or curious about the project roadmap, our team is here to help you make the most of chaos engineering with LitmusChaos.
\\n\\n\\n\\nCan’t make it to KubeCon NA? You can still stay connected with the LitmusChaos community:
\\n\\n\\n\\nProject post by the Falco Team and Nigel Douglas
\\n\\n\\n\\nFalco achieved CNCF Graduation status on February 29, 2024. Following the celebration of this significant milestone at KubeCon EU in Paris earlier this year, the project has seen several major highlight-worthy updates.
\\n\\n\\n\\nThe Falco community’s recent development efforts have focused on enhancing performance and stability to deliver the best possible user experience for Falco adopters. Notable improvements include the integration of a new endpoint for exposing metrics in Prometheus format, automatic selection of the optimal driver for your system, a new collector that enriches captures with Kubernetes metadata, and many other exciting features.
\\n\\n\\n\\nHere are the key highlights:
\\n\\n\\n\\nFalco Talon is officially part of the falcosecurity project in GitHub. Searching for Falco Talon in Helm, you can see it now officially starts off at version 0.1.1. So what is Falco Talon? It’s a dedicated response engine for Falco. Response actions are linked to Falco rules, so when a detection rule is triggered, any of the listed “actionners” can be triggered in response to that unwanted behavior.
\\n\\n\\n\\nhelm search repo falcosecurity/falco-talon\\nNAME                       CHART VERSION  APP VERSION  DESCRIPTION\\nfalcosecurity/falco-talon  0.1.1          0.1.1        React to the events from Falco
\\n\\n\\n\\nRethinking how organizations should respond to threats in cloud-native architectures, Talon provides an industry-first, API-driven approach to threat mitigation directly via existing API primitives like NetworkPolicy, label enforcement, and graceful termination. This is a no-code implementation for threat isolation. Assign Falco rules to Falco actions in YAML, and let the automated API responses work their magic. This is the fastest, most efficient, and most reliable approach in today’s rapidly evolving Kubernetes and cloud environments.
\\n\\n\\n\\nThe new Regular Expression (RegEx) operator allows you to match patterns in string fields using the Google RE2 library. While powerful, it’s important to note that the regex operator is significantly slower — potentially up to ten times slower — than other string comparison operators, which are recommended for simpler cases. For example, to detect certain patterns in file descriptors, you could use the following: fd.name regex [a-z]*/proc/[0-9]+/cmdline. It is important that users are mindful of performance impacts when opting for regex over simpler string operations.
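\\n\\n\\n\\nAs a hedged illustration (a custom rule written for this post, not one shipped with Falco), the condition above could be embedded in a rule like this:
- rule: Proc Cmdline Accessed Via Regex Match
  desc: Illustrative only; prefer plain string operators where possible for performance.
  condition: evt.type in (open, openat) and fd.name regex [a-z]*/proc/[0-9]+/cmdline
  output: proc cmdline file opened (proc=%proc.name file=%fd.name)
  priority: NOTICE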
\\n\\n\\n\\nOur community has once again played a vital role in expanding Falco’s capabilities by integrating new alerting targets, including Dynatrace, Sumo Logic, OpenTelemetry Traces, and Quickwit. These integrations provide more flexibility in how users receive and analyze security alerts, allowing Falco to seamlessly connect with existing ecosystem tools.
\\n\\n\\n\\nAdditionally, Falco can now collect Kubernetes audit logs from Google Kubernetes Engine (GKE) and logs from journald, providing deeper visibility into cluster activities and system-level events. New data sources, such as Kafka and Keycloak, have also been added, broadening the scope of environments from which Falco can collect events. These enhancements further empower teams to detect threats and monitor compliance across a wider range of architectures.
\\n\\n\\n\\nThis applies to the following transform operators: toupper, tolower, b64, and basename. Previously, it was not possible to write conditions that compared a field to another field, such as detecting when “a process deletes its own executable.” This limitation existed because field values couldn’t be used on the right-hand side of conditions. However, in this version, we’ve introduced the val() operator, which solves this issue:
evt.type = unlink and proc.exepath = val(fs.path.name)
\\n\\n\\n\\nThis rule will only trigger if the process’s executable path (proc.exepath) matches the unlink target (fs.path.name), effectively detecting when a process attempts to delete its own executable. Additionally, you can apply simple transformation operators to both sides of the comparison: toupper() and tolower() convert the case of strings, while b64() decodes base64-encoded strings.
Stay tuned as more transformers will be introduced to support additional use cases! For more details, check the documentation on transform operators.
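\\n\\n\\n\\nFor context, here is a minimal, illustrative rule built around that condition (the rule name and output are ours, not taken from the Falco ruleset):
- rule: Process Deleted Its Own Executable
  desc: Illustrative rule; detects a process unlinking its own executable path.
  condition: evt.type = unlink and proc.exepath = val(fs.path.name)
  output: process deleted its own executable (proc=%proc.name exe=%proc.exepath target=%fs.path.name)
  priority: WARNING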
\\n\\n\\n\\nDeploying across diverse Kubernetes environments just got easier! When using the official Falco Helm chart and setting driver.kind=auto, the driver loader now intelligently handles the heavy lifting for you.
Here’s how it works: the driver loader will automatically generate a new Falco configuration file and select the correct engine driver based on the specific node Falco is deployed on. This means whether you’re using eBPF, kmod, or the modern eBPF driver, Falco will configure itself dynamically depending on the environment.
\\n\\n\\n\\nIn many Kubernetes clusters, nodes can differ in terms of kernel versions, capabilities, and driver compatibility. With this new auto-selection feature, you can seamlessly deploy different Falco drivers across various nodes within the same cluster.
\\n\\n\\n\\nThis setup gives you flexibility and ensures that each node in your Kubernetes cluster is running Falco in the most optimized way possible, without manual configuration. Simply set driver.kind=auto in the Helm chart and let Falco do the rest.
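\\n\\n\\n\\nA minimal sketch of such an install, assuming the standard falcosecurity chart repository:
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set driver.kind=auto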
\\n\\n\\n\\nIf you have been following Falco development, you probably know we are constantly improving support for metrics that tell you how the Falco engine is doing. We have now introduced Prometheus support, so you can better integrate Falco with your existing performance monitoring infrastructure; this also paves the way for the community to create an official Grafana dashboard that can be integrated into users’ charts. The new endpoint exposes a wealth of useful metrics in the Prometheus format, which allows for out-of-the-box Grafana dashboard creation.
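\\n\\n\\n\\nAs a rough sketch of what enabling this looks like in falco.yaml (key names per recent Falco releases; verify them against your version's default configuration):
metrics:
  enabled: true
  interval: 30m
webserver:
  enabled: true
  prometheus_metrics_enabled: true   # expose Prometheus metrics on the embedded web server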
\\n\\n\\n\\nThe Falco Plugin Rust SDK is a newly developed toolkit designed to allow developers to build Falco plugins using the Rust programming language. Rust, known for its focus on safety, performance, and efficiency, is a natural fit for building high-quality plugins that interact with the Falco ecosystem.
\\n\\n\\n\\nThis SDK provides a robust framework for creating Falco plugins, offering flexibility for different plugin types. Developers can create both dynamically and statically linked plugins:
\\n\\n\\n\\nDynamically linked plugins: built with crate_type = [\\"dylib\\"], these plugins leverage macros like plugin! to enable various plugin capabilities.
\\n\\n\\n\\nStatically linked plugins: built with crate_type = [\\"staticlib\\"], these plugins use the static_plugin! macro without needing to manage individual capabilities.
\\n\\n\\n\\nThe SDK simplifies plugin development and extends Falco’s functionality, while opening doors for Rust developers to contribute and innovate within the Falco community.
\\n\\n\\n\\nTo close where we began, Falco has seen a boom in growth and usage since achieving CNCF graduation earlier this year. In total, Falco has now been downloaded 130M times, but that’s not all.
\\n\\n\\n\\nWhen users deploy Falco with its default configuration, Falco checks for community rule updates four times per day. We can derive how many Falco instances run daily using rules download stats from GitHub — in the last month, we have experienced over 120M downloads for the latest ruleset version. In short, Falco currently monitors around 1M active nodes daily using the latest ruleset and subscribed to the community rules feed. This is in addition to many millions more running previous versions.
\\n\\n\\n\\nThank you for your continued support of the Falco project. We hope to see you at KubeCon + CloudNativeCon North America in Salt Lake City!
\\n\\nMember post by Stanislava Racheva, DevOps & Cloud engineer at ITGix
\\n\\n\\n\\nIn modern Kubernetes environments, managing container images and ensuring that applications are always running the latest, most secure versions can be daunting. Argo CD Image Updater simplifies this process by automatically checking for new container image versions and updating your applications accordingly. Integrating seamlessly with Argo CD enables fully automated updates to Kubernetes workloads.
\\n\\n\\n\\nThe beauty of Argo CD Image Updater lies in its simplicity and flexibility. By annotating your Argo CD application resources with a list of images and defining version constraints, the Image Updater takes over the heavy lifting. It regularly polls for new image versions from your container registry, checks if they meet the specified constraints, and updates your applications automatically.
\\n\\n\\n\\nArgo CD Image Updater also offers a range of advanced features, such as support for Helm and Kustomize-based applications, various update strategies (like semver, latest, name, and digest), and seamless integration with private container registries. Additionally, it allows parallel updates and supports filtering tags with custom matchers, making it highly customizable and suitable for both small and large-scale Kubernetes environments.
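\\n\\n\\n\\nFor example, a minimal annotation sketch on an Argo CD Application might look like the following (the application name, image, and alias are placeholders):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-app                  # hypothetical application
  namespace: argocd
  annotations:
    # Register the image (with an alias) for the updater to track
    argocd-image-updater.argoproj.io/image-list: app=ghcr.io/example/demo-app
    # Constrain updates to semantic versions
    argocd-image-updater.argoproj.io/app.update-strategy: semver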
\\n\\n\\n\\nWhether you’re running a simple workload or managing complex deployments across multiple environments, Argo CD Image Updater provides a streamlined way to automate image updates, reduce operational overhead, and ensure that your applications are always running with the latest and most secure versions.
\\n\\n\\n\\nIn this example implementation, we are using the official argocd-image-updater helm chart, available at: https://github.com/argoproj/argo-helm/tree/main/charts/argocd-image-updater
\\n\\n\\n\\nIt is deployed as an argocd application in the same cluster and namespace as Argo CD:
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Application\\nmetadata:\\n name: argocd-image-updater\\n namespace: argocd\\nspec:\\n destination:\\n namespace: argocd\\n server: https://kubernetes.default.svc\\n project: \'applications\'\\n source:\\n helm:\\n valueFiles:\\n - ../argocd-image-updater/values.yaml\\n path: helm/argocd-image-updater\\n repoURL: https://gitlab.org.com/demo.git\\n targetRevision: HEAD\\n syncPolicy:\\n automated:\\n prune: true\\n selfHeal: true\\n allowEmpty: false\\n syncOptions:\\n revisionHistoryLimit: 3
\\n\\n\\n\\nLet’s review the values file, where we’ll explore some of the essential configuration options required. These options are critical to ensuring proper functionality and deployment of the service.
\\n\\n\\n\\nRegistries
\\n\\n\\n\\nHere we will configure the container registries that we are using. Argo CD Image Updater supports the majority of container registries (public and private) that implement the Docker Registry v2 API, and it has been tested against registries such as Docker Hub, the Docker Registry v2 reference implementation (on-premises), Red Hat Quay, JFrog Artifactory, GitHub Container Registry, GitHub Packages Registry, GitLab Container Registry, and Google Container Registry.
\\n\\n\\n\\nIn the following examples, we will configure two of the most widely used container registries – Amazon Elastic Container Registry (ECR) and GitHub Container Registry (GHCR). In our case, we are working with private registries to ensure secure storage and access control for container images.
\\n\\n\\n\\nAmazon Elastic Container Registry (ECR) configuration:
\\n\\n\\n\\nregistries:\\n - name: ECR\\n api_url: https://000000000000.dkr.ecr.eu-west-1.amazonaws.com\\n prefix: 000000000000.dkr.ecr.eu-west-1.amazonaws.com\\n ping: yes\\n insecure: false\\n credentials: ext:/scripts/login.sh\\n credsexpire: 10h
\\n\\n\\n\\nFor Amazon Elastic Container Registry, authentication is possible through a script that executes an API call to retrieve the necessary credentials. In the values file, we can include this script in the authScripts section:
\\n\\n\\n\\nauthScripts:\\n # -- Whether to mount the defined scripts that can be used to authenticate with a registry, the scripts will be mounted at `/scripts`\\n enabled: true\\n # -- Map of key-value pairs where the key consists of the name of the script and the value the contents\\n scripts:\\n login.sh: |\\n #!/bin/sh\\n aws ecr --region \\"eu-west-1\\" get-authorization-token --output text --query \'authorizationData[].authorizationToken\' | base64 -d
\\n\\n\\n\\nThe script is executed by the pod and is responsible for obtaining the ECR authorization token. We use a role attached to our EKS node group, which includes the AWS-managed policy AmazonEC2ContainerRegistryReadOnly. This policy permits the GetAuthorizationToken API call:
\\n\\n\\n\\n{\\n \\"Version\\": \\"2012-10-17\\",\\n \\"Statement\\": [\\n {\\n \\"Effect\\": \\"Allow\\",\\n \\"Action\\": [\\n \\"ecr:GetAuthorizationToken\\",\\n \\"ecr:BatchCheckLayerAvailability\\",\\n \\"ecr:GetDownloadUrlForLayer\\",\\n \\"ecr:GetRepositoryPolicy\\",\\n \\"ecr:DescribeRepositories\\",\\n \\"ecr:ListImages\\",\\n \\"ecr:DescribeImages\\",\\n \\"ecr:BatchGetImage\\",\\n \\"ecr:GetLifecyclePolicy\\",\\n \\"ecr:GetLifecyclePolicyPreview\\",\\n \\"ecr:ListTagsForResource\\",\\n \\"ecr:DescribeImageScanFindings\\"\\n ],\\n \\"Resource\\": \\"*\\"\\n }
\\n\\n\\n\\nGitHub Container Registry configuration:
\\n\\n\\n\\nregistries:\\n - name: GitHub Container Registry\\n api_url: https://ghcr.io\\n prefix: ghcr.io\\n ping: yes\\n credentials: secret:argocd/ghcr-secret#token
\\n\\n\\n\\nFor registry authentication, in the credentials section, we are using a Kubernetes secret. The #token part refers to the specific key (usually containing a personal access token or authentication token) inside the secret. The token must have at least read:packages permissions. Here is a manifest of the Kubernetes secret, which has to be applied in the argocd namespace:
\\n\\n\\n\\napiVersion: v1\\nkind: Secret\\nmetadata:\\n name: ghcr-secret\\n namespace: argocd\\nstringData:\\n token: user_name:access_token
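\\n\\n\\n\\nAlternatively, as a quick sketch, the same secret can be created directly with kubectl – the username and token values below are illustrative placeholders:
\\n\\n\\n\\nkubectl create secret generic ghcr-secret --namespace argocd --from-literal=token=user_name:access_token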
\\n\\n\\n\\nEnabling the service account and RBAC creation:
\\n\\n\\n\\nrbac:\\n # -- Enable RBAC creation\\n enabled: true\\n\\nserviceAccount:\\n # -- Specifies whether a service account should be created\\n create: true\\n # -- Annotations to add to the service account\\n annotations: {}\\n # -- Labels to add to the service account\\n labels: {}\\n # -- The name of the service account to use.\\n # If not set and create is true, a name is generated using the fullname template\\n name: \\"\\"
\\n\\n\\n\\nThe ServiceAccount provides the necessary identity for Argo CD Image Updater to authenticate with the Kubernetes API and perform updates on deployment manifests or Helm charts (e.g., changing container image tags).
\\n\\n\\n\\nrbac ensures that Argo CD Image Updater is granted only the permissions it needs, helping to secure your cluster by restricting its access and reducing the attack surface.
\\n\\n\\n\\nWithout enabling both, Argo CD Image Updater would either lack the permissions to modify Kubernetes resources (failing to update your applications) or could have overly broad permissions, which would be a security risk.
\\n\\n\\n\\nIn the default installation scenario, i.e. Argo CD Image Updater installed to the argocd namespace, no further configuration has to be done for Argo CD Image Updater to access the Kubernetes API. If your Argo CD installation is in a different namespace than argocd, you would have to adapt the RoleBinding to bind to the ServiceAccount in the correct namespace.
\\n\\n\\n\\nLog Level:
\\n\\n\\n\\n# -- Argo CD Image Updater log level\\n logLevel: \\"debug\\"
\\n\\n\\n\\nChanging the log level from “info” to “debug” in the Argo CD Image Updater values file can be beneficial in certain scenarios where you need deeper insights into the system’s behavior.
\\n\\n\\n\\nArgo CD Image Updater binary:
\\n\\n\\n\\nThe argocd-image-updater binary, and specifically its test subcommand, provides a variety of test options, including testing registry access, multi-arch images, semver constraints, update strategies, and credentials before configuring annotations on your Argo CD applications. It is available in the argocd-image-updater pod, or you can install it locally. Here are the argocd-image-updater test command options:
\\n\\n\\n\\nFlags:\\n --allow-tags string only consider tags in registry that satisfy the match function\\n --credentials string the credentials definition for the test (overrides registry config)\\n --disable-kubernetes whether to disable the Kubernetes client\\n --disable-kubernetes-events Disable kubernetes events\\n -h, --help help for test\\n --ignore-tags stringArray ignore tags in registry that match given glob pattern\\n --kubeconfig string path to your Kubernetes client configuration\\n --loglevel string log level to use (one of trace, debug, info, warn, error) (default \\"debug\\")\\n --platforms strings limit images to given platforms (default [linux/amd64])\\n --rate-limit int specificy registry rate limit (overrides registry.conf) (default 20)\\n --registries-conf-path string path to registries configuration\\n --semver-constraint string only consider tags matching semantic version constraint\\n --update-strategy string update strategy to use, one of: semver, latest) (default \\"semver\\")
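\\n\\n\\n\\nFor example, a quick check of which tag the updater would pick for our sample image could look like the command below; the image name and constraint are illustrative values taken from the examples later in this article:
\\n\\n\\n\\nargocd-image-updater test 000000000000.dkr.ecr.eu-west-1.amazonaws.com/sampleapp --update-strategy semver --semver-constraint v1.2.x --loglevel debug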
\\n\\n\\n\\nArgo CD Image Updater supports two write-back methods for propagating new image versions to Argo CD: argocd, which updates the application parameters directly through the Argo CD API, and git, which commits the changes back to the Git repository.
\\n\\n\\n\\nThe write-back method and its configuration are set per application, with further configuration options available depending on the method used.
\\n\\n\\n\\nIn this article, the examples are applied using the argocd write-back method, which is the default and does not need further configuration. For production environments, it is recommended to use the git write-back method to persist the changes made by Argo CD Image Updater in your Git repository.
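\\n\\n\\n\\nAs a sketch, switching an application to the git write-back method only requires changing the write-back annotation and, optionally, pointing it at a target branch (the branch name below is illustrative):
\\n\\n\\n\\nargocd-image-updater.argoproj.io/write-back-method: git\\nargocd-image-updater.argoproj.io/git-branch: main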
\\n\\n\\n\\nAn update strategy specifies how Argo CD Image Updater identifies new image versions for updates. It supports various strategies for tracking and updating configured images. Each image can have its own update strategy, with the default being the semver strategy.
\\n\\n\\n\\nThe currently supported update strategies are:
\\n\\n\\n\\n• semver: Updates based on semantic versioning.
\\n\\n\\n\\n• latest: Updates to the most recently built image in the registry.
\\n\\n\\n\\n• digest: Updates to the latest version of a tag using its SHA digest.
\\n\\n\\n\\n• name: Sorts tags alphabetically and updates to the highest tag in the sort order.
\\n\\n\\n\\nIn the examples below we show how to annotate our Argo CD applications in order to enable Argo CD Image Updater, covering all update strategies. We are using an umbrella Helm chart to deploy our sample application. For Helm applications with multiple images in the manifest, or when parameters other than image.name and image.tag are used to define images, you need to configure an <image_alias> in the image specification. This alias helps identify the image and enables Argo CD Image Updater to track it:
\\n\\n\\n\\nargocd-image-updater.argoproj.io/image-list: \\"<image_alias>=<some/image>\\"
\\n\\n\\n\\nsemver update strategy:
\\n\\n\\n\\nThis is the default update strategy. With the semver strategy, Argo CD Image Updater works with images tagged in semantic versioning format. Tags should include semver-compatible identifiers in the structure X.Y.Z, where X, Y, and Z are whole numbers. An optional prefix of “v” (for example, vX.Y.Z) can be used, and both formats are considered equivalent. In this first example each annotation is explained in detail, because several of these annotations are reused for the other update strategies as well.
\\n\\n\\n\\nExample annotations:
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Application\\nmetadata:\\n name: sampleapp\\n namespace: argocd\\n annotations:\\n argocd-image-updater.argoproj.io/image-list: \\"sampleapp=0.dkr.ecr.eu-west-1.amazonaws.com/sampleapp:v1.2.x\\"\\n argocd-image-updater.argoproj.io/sampleapp.helm.image-name: \\"sampleapp.deployment.image.repository\\"\\n argocd-image-updater.argoproj.io/sampleapp.helm.image-tag: \\"sampleapp.deployment.image.tag\\"\\n argocd-image-updater.argoproj.io/sampleapp.update-strategy: \\"semver\\"\\n argocd-image-updater.argoproj.io/pull-policy: Always\\n argocd-image-updater.argoproj.io/write-back-method: argocd
\\n\\n\\n\\nimage-list – as explained earlier, this annotation enables Argo CD Image Updater for the application; as the value we use sampleapp as the alias and specify the image and its tag.
\\n\\n\\n\\nimage-name – specifies the image name via its Helm values path, where the image repository is defined.
\\n\\n\\n\\nimage-tag – defines the image tag via its Helm values path.
\\n\\n\\n\\nupdate-strategy – declares the desired update strategy.
\\n\\n\\n\\npull-policy – specifies the pull policy; in this case we always pull the latest version.
\\n\\n\\n\\nwrite-back-method – specifies the Argo CD Image Updater write-back method.
\\n\\n\\n\\nIn this scenario, we are using a semantic versioning constraint with the tag v1.2.x. This means that Argo CD Image Updater will look for any image tag that matches the v1.2.x pattern. The x in semantic versioning acts as a wildcard, so the updater will accept any patch-level version within the v1.2 series (e.g., v1.2.1, v1.2.5, v1.2.9, etc.).
\\n\\n\\n\\nHere is the part of the Helm values file for the sampleapp that is connected to the annotations:
\\n\\n\\n\\nsampleapp:\\n appId: sampleapp\\n deployment:\\n enabled: true\\n image:\\n repository: \\"000000000000.dkr.ecr.eu-west-1.amazonaws.com/sampleapp\\"\\n tag: \\"v1.2\\"\\n digest: true\\n pullPolicy: \\"Always\\"
\\n\\n\\n\\nlatest update strategy:
\\n\\n\\n\\nArgo CD Image Updater can update the image with the most recent build date, even if the tag is arbitrary (like a Git commit SHA or random string). It focuses on the build date, not when the image was tagged or pushed to the registry. If multiple tags share the same build date, the updater sorts the tags in descending lexical order and selects the last one.
\\n\\n\\n\\nExample annotations:
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Application\\nmetadata:\\n name: sampleapp\\n namespace: argocd\\n annotations:\\n argocd-image-updater.argoproj.io/image-list: \\"sampleapp=0.dkr.ecr.eu-west-1.amazonaws.com/sampleapp\\"\\n argocd-image-updater.argoproj.io/sampleapp.helm.image-name: \\"sampleapp.deployment.image.repository\\"\\n argocd-image-updater.argoproj.io/sampleapp.update-strategy: \\"latest\\"\\n argocd-image-updater.argoproj.io/pull-policy: Always\\n argocd-image-updater.argoproj.io/write-back-method: argocd
\\n\\n\\n\\nIn this scenario, we don’t have to specify image-tag. If we want to allow only particular tags for the update, we can use the argocd-image-updater.argoproj.io/myimage.allow-tags annotation, for example with the latest and master tags:
\\n\\n\\n\\nargocd-image-updater.argoproj.io/myimage.allow-tags: latest, master
\\n\\n\\n\\nor we can ignore them with the ignore-tags annotation:
\\n\\n\\n\\nargocd-image-updater.argoproj.io/myimage.ignore-tags: latest, master
\\n\\n\\n\\nHere is the part of the Helm values file for the sampleapp that is connected to the annotations:
\\n\\n\\n\\nsampleapp:\\n appId: sampleapp\\n deployment:\\n enabled: true\\n image:\\n repository: \\"000000000000.dkr.ecr.eu-west-1.amazonaws.com/sampleapp\\"\\n tag: \\"latest\\" #in this case tag will be ignored\\n digest: true\\n pullPolicy: \\"Always\\"
\\n\\n\\n\\ndigest update strategy:
\\n\\n\\n\\nThis update strategy monitors a specified tag in the registry for any changes and updates the image when a difference from the previous state is detected, using the image SHA digest. The tag must be defined as a version constraint in the image list. It’s ideal for tracking mutable tags like latest or environment-specific tags (e.g., dev, stage, prod) generated by a CI system.
\\n\\n\\n\\nExample annotations:
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Application\\nmetadata:\\n name: sampleapp\\n namespace: argocd\\n annotations:\\n argocd-image-updater.argoproj.io/image-list: \\"sampleapp=0.dkr.ecr.eu-west-1.amazonaws.com/sampleapp:latest\\"\\n argocd-image-updater.argoproj.io/sampleapp.helm.image-name: \\"sampleapp.deployment.image.repository\\"\\n argocd-image-updater.argoproj.io/sampleapp.helm.image-tag: \\"sampleapp.deployment.image.tag\\"\\n argocd-image-updater.argoproj.io/sampleapp.update-strategy: \\"digest\\"\\n argocd-image-updater.argoproj.io/pull-policy: Always\\n argocd-image-updater.argoproj.io/write-back-method: argocd
\\n\\n\\n\\nHere is the part of the Helm values file for the sampleapp that is connected to the annotations – the important thing here is to specify the image tag in the format tag: “tag_name@sha256:<digest>”:
\\n\\n\\n\\nsampleapp:\\n appId: sampleapp\\n deployment:\\n enabled: true\\n image:\\n repository: \\"000000000000.dkr.ecr.eu-west-1.amazonaws.com/sampleapp\\"\\n tag: \\"latest@sha256:ef8049179764ee395542a9895dbc3e326b6526116672aea568cfb0a33c0912af\\"\\n digest: true\\n pullPolicy: \\"Always\\"
\\n\\n\\n\\nname update strategy:
\\n\\n\\n\\nThis update strategy sorts image tags lexically in descending order and selects the last tag for updating. It’s useful for tracking images using calver versioning (e.g., YYYY-MM-DD) or similar tags. By default, all tags in the repository are considered, but you can configure it to limit which tags are eligible for updates.
\\n\\n\\n\\nExample annotations:
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Application\\nmetadata:\\n name: sampleapp\\n namespace: argocd\\n annotations:\\n argocd-image-updater.argoproj.io/image-list: \\"sampleapp=0.dkr.ecr.eu-west-1.amazonaws.com/sampleapp:latest\\"\\n argocd-image-updater.argoproj.io/sampleapp.helm.image-name: \\"sampleapp.deployment.image.repository\\"\\n argocd-image-updater.argoproj.io/sampleapp.update-strategy: \\"name\\"\\n argocd-image-updater.argoproj.io/sampleapp.allow-tags: regexp:^[0-9]{4}-[0-9]{2}-[0-9]{2}-stable$\\n argocd-image-updater.argoproj.io/pull-policy: \\"Always\\"\\n argocd-image-updater.argoproj.io/write-back-method: \\"argocd\\"
\\n\\n\\n\\nIn this case, if we have tags such as: 2024-09-30-stable, 2024-09-30-beta, 2024-10-01-beta, 2024-10-01-stable, master, latest – Argo CD Image Updater will consider only the “-stable” ending tags, sort them lexically and choose the 2024-10-01-stable tag for the update.
\\n\\n\\n\\nHere is the part of the Helm values file for the sampleapp that is connected to the annotations:
\\n\\n\\n\\nsampleapp:\\n appId: sampleapp\\n deployment:\\n enabled: true\\n image:\\n repository: \\"000000000000.dkr.ecr.eu-west-1.amazonaws.com/sampleapp\\"\\n tag: \\"2024-09-30-stable\\" #will be ignored in this case\\n digest: true\\n pullPolicy: \\"Always\\"
\\n\\n\\n\\nAfter we’ve made the needed configurations and selected the most suitable update strategy, we can check the Argo CD application’s parameters through the UI:
\\n\\n\\n\\nAs we can see, after the new image version was pushed to ECR, the original value of the image tag was changed by Argo CD Image Updater, and the new image was deployed!
\\n\\n\\n\\nIn conclusion, Argo CD Image Updater is a powerful tool that enhances the continuous delivery process in Kubernetes environments. By automating the process of updating container images, it not only streamlines deployments but also reduces the risk of human error associated with manual updates.
\\n\\n\\n\\nMoreover, its flexibility allows developers to tailor the update policies to suit their specific workflows, ensuring that only the necessary updates are applied. This ultimately leads to improved application reliability and performance.
\\n\\n\\n\\nReference: https://argocd-image-updater.readthedocs.io/en/stable/
\\n\\n\\n\\n\\n\\nGet to know Rishabh
\\n\\n\\n\\nThis week’s Kubestronaut in Orbit, Rishabh Sharma, our first Kubestronaut from Finland, is a senior software development engineer at Capgemini Finland Oy, where he manages cloud native tech solutions. He is currently responsible for implementing and managing Kubernetes solutions for 4G and 5G technologies. His other key areas of interest are Java, Spring Boot, containerd, Linux, Istio, Linkerd, Falco, Sysdig, CoreDNS, Helm, OPA (Open Policy Agent), Cilium, Envoy, etc.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Rishabh, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nWhen did you get started with Kubernetes and/or cloud-native? What was your first project?
\\n\\n\\n\\nIn early 2017, I was part of a DevOps bootcamp in Chennai, India. During the 7-day bootcamp I explored GCE VMs, Docker, and Kubernetes at a high level. It was the trigger point for me to explore Kubernetes.
\\n\\n\\n\\nMy first major project in Kubernetes was a Telco Network planning application which was deployed over a Kubernetes cluster as StatefulSet/Deployment pods.
\\n\\n\\n\\nI saw the real Kubernetes magic when we never had a single second of downtime for our network planning application deployed as Kubernetes workloads. We used a multi-zone setup – two k8s clusters in two different zones.
\\n\\n\\n\\nWhat are the primary CNCF projects you work on or use today? What projects have you enjoyed the most in your career?
\\n\\n\\n\\nI use the following CNCF projects today:
\\n\\n\\n\\n1. Kubernetes
\\n\\n\\n\\n2. Cilium
\\n\\n\\n\\n3. Containerd
\\n\\n\\n\\n4. CoreDNS
\\n\\n\\n\\n5. Fluentd
\\n\\n\\n\\n6. Prometheus
\\n\\n\\n\\n7. Etcd
\\n\\n\\n\\n8. Helm
\\n\\n\\n\\n9. Kyverno
\\n\\n\\n\\n10. Istio
\\n\\n\\n\\n11. Falco
\\n\\n\\n\\nI love the Kubernetes project so much because I am a telco guy and I have seen how Kubernetes actually supports new-generation telco technologies – 4G, 5G, and more.
\\n\\n\\n\\nKubernetes’ flexible architecture and robust cloud-native management capabilities allow telcos to rapidly develop new features while maintaining performance and reliability.
\\n\\n\\n\\nThese projects are also favorites for these reasons:
\\n\\n\\n\\nHow have the certs or CNCF helped you in your career?
\\n\\n\\n\\nI have completed 6 CNCF certifications as follows:
\\n\\n\\n\\nWhat are some other books/sites/courses you recommend for people who want to work with k8s?
\\n\\n\\n\\nTo be honest, the only “book” you need to learn Kubernetes is the source itself: https://kubernetes.io/docs/concepts
\\n\\n\\n\\nYou should read the docs thoroughly and practice on free Kubernetes playgrounds like Killercoda.
\\n\\n\\n\\nApart from official docs, if you would like to learn via courses, I would suggest the following courses:
\\n\\n\\n\\nWhat do you do in your free time?
\\n\\n\\n\\nI enjoy these things in my free time:
\\n\\n\\n\\nWhat would you tell someone who is just starting their K8s certification journey? Any tips or tricks?
\\n\\n\\n\\nJust go through the Kubernetes docs and then practice, practice, practice over free platforms like killer shell.
\\n\\n\\n\\nGet your hands dirty with Kubernetes concepts.
\\n\\n\\n\\nToday the Cloud native ecosystem is way more than Kubernetes. Do you plan to get other cloud native certifications from the CNCF?
\\n\\n\\n\\nI am planning to complete all cloud native certifications from the CNCF.
\\n\\n\\n\\nI would like to complete all 10 CNCF certifications and then there will be new certifications launched soon like Kyverno, Backstage, OpenTelemetry.
\\n\\n\\n\\nSo the learning journey will continue.
\\n\\nWe are delighted to announce our new DEI Community Hub at KubeCon + CloudNativeCon North America, sponsored by Google Cloud: a physical space to connect, learn, and celebrate diversity, equity, inclusion, and accessibility! The DEI Hub is a great place to join community groups, participate in allyship and advocacy workshops, or simply relax in a safe space during open lounge hours.
\\n\\n\\n\\nIt is our hope that the Community Hub will not just be a place for education, networking and downtime, but also a place where people can find and make new friends.
\\n\\n\\n\\nWe believe strong communities foster a feeling of belonging by providing opportunities for interaction, collaboration, and shared experiences. Gatherings are planned for students, BIPOC (Black, Indigenous and People of Color), LGBTQIA+, and the broader community of attendees.
\\n\\n\\n\\nSearch the schedule to find a gathering you’d like to attend.
\\n\\n\\n\\nFrom the way we conduct meetings to how we name our projects, there are countless opportunities in our day-to-day interactions to create better, more psychologically safe environments. Join Mike Bufano, Program Manager, Open Source DEI and Outreach at Google, in his session, “Be Part of the Solution: Cultivating Inclusion in Open Source”, where he will share data from the field, discuss inclusion best practices, and leave attendees with actionable steps to cultivate inclusion in open source spaces.
\\n\\n\\n\\nAttendees who identify as women, non-binary individuals, and allies are invited to start the day on a powerful note over a continental breakfast at a gathering known as EmpowerUs. Network with peers, make new connections, and discuss workplace issues and more before heading into the keynotes.
\\n\\n\\n\\nTo honor and celebrate Trans Awareness Week, come join this fireside chat with Jacey Thornton from local Salt Lake City nonprofit Project Rainbow. Topics will include the current climate for the queer community in Salt Lake City and ways in which we can better show up for our LGBTQIA+ peers more broadly. This session will be immediately followed by the LGBTQIA+ Community Gathering.
\\n\\n\\n\\nHelp make the cloud native community more accessible by joining the Deaf + Hard of Hearing Working Group for a lively discussion about advocacy and allyship. Then stay around for a Sign Language Crash Course where you can learn how sign language works and practice some basic cloud native signs.
\\n\\n\\n\\nUse this space to meet friends, old and new, before heading to the keynotes, attending a session, going to lunch or enjoying Salt Lake City. Or, our favorite suggestion: Find (or make) a friend then head to KubeCrawl + CloudNativeFest, the official launch party of KubeCon + CloudNativeCon in the Solutions Showcase.
\\n\\n\\n\\nThis is a great opportunity for attendees to meet with experienced open source veterans from different CNCF projects for wide-ranging conversations across technical, community, career, and certification topics. Mentors will pair with 2-8 mentees in a pod-like environment to facilitate conversation and connection.
\\n\\nCo-chairs: Megan Reynolds, Kelsey Hightower
\\n\\n\\n\\nNovember 12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nAt the Cloud Native StartupFest expect to get inspired by hearing from successful cloud native entrepreneurs, learn about some of the most exciting cloud native and OSS startups in the space, get a glimpse into the current state of fundraising, and receive guidance on how to take your idea from community adoption to success. The Cloud Native StartupFest first happened during KubeCon + CloudNativeCon North America 2023.
\\n\\n\\n\\nWho will get the most out of attending this event?
\\n\\n\\n\\nIt’s for startup builders or anyone considering starting a company in the OSS or cloud native space.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nMarket conditions have become more challenging. Today it’s harder for any company to close customers and raise money, so this year we are focusing more on how to navigate this new world as a startup builder. We have an all-star line-up of successful entrepreneurs and executives who will share their guidance.
\\n\\n\\n\\nWhat will the day look like?
\\n\\n\\n\\nWe have an afternoon session of talks and panels guided by our exceptional MC, Kelsey Hightower. We’ll make time for Q&A from the audience after every session to keep the day as interactive as possible.
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nCome prepared with any go-to-market challenges you’re facing right now to get feedback in the Q&A segments. It’s likely the speakers have experienced something similar and will be able to share advice.
\\n\\n\\n\\nFind your community!
\\n\\n\\n\\nWe’re over the moon about the caliber of speakers who have given their time to join us; there will be a lot to learn from all of them, and we expect some spicy discussions!
\\n\\n\\n\\nSubmitted by Megan Reynolds, who is looking forward to learning how new startups are adapting to the Generative AI wave and the impact that’s having on core enterprise infrastructure.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nSIG post by Dotan Horovits and Adriel Perkins, Project Leads, SIG CI/CD Observability, OpenTelemetry
\\n\\n\\n\\nWe’ve been talking about the need for a common “language” for reporting and observing CI/CD pipelines for years, and finally, we see the first “words” of this language entering the “dictionary” of observability – the OpenTelemetry open specification. With the recent release of OpenTelemetry’s Semantic Conventions, v1.27.0, you can find designated attributes for reporting CI/CD pipelines.
\\n\\n\\n\\nThis is the result of the hard work of the CI/CD Observability Special Interest Group (SIG) within OpenTelemetry. Having accomplished the core milestone of the first phase, we thought it’d be a good time to share it with the world.
\\n\\n\\n\\nCI/CD observability is essential for ensuring that software is released to production efficiently and reliably. Well-functioning CI/CD pipelines directly impact business outcomes by shortening the Lead Time for Changes DORA metric and enabling fast identification and resolution of broken or flaky processes. By integrating observability into CI/CD workflows, teams can monitor the health and performance of their pipelines in real time, gaining insights into bottlenecks and areas that require improvement.
\\n\\n\\n\\nLeveraging the same well-established tools used for monitoring production environments, organizations can extend their observability capabilities to include the release cycle, fostering a holistic approach to software delivery. Whether open source or proprietary tools, there’s no need to reinvent the wheel when choosing the observability toolchain for CI/CD pipelines.
\\n\\n\\n\\nHowever, the diverse landscape of CI/CD tools creates challenges in achieving consistent end-to-end observability. With each tool having its own means, format and semantic conventions for reporting the pipeline execution status, fragmentation within the toolchain can hinder seamless monitoring. Migrating between tools becomes painful, as it requires reimplementing existing dashboards, reports and alerts.
\\n\\n\\n\\nThings become even more challenging when you need to monitor multiple tools involved in the release pipeline in a uniform manner. This is where open standards and specifications become critical. They create a common, uniform language, one which is tool- and vendor-agnostic, enabling cohesive observability across different tools and allowing teams to maintain a clear and comprehensive view of their CI/CD pipeline performance.
\\n\\n\\n\\nThe need for standardization is relevant for creating the semantic conventions mentioned above – the language for reporting what goes on in the pipeline. Standardization is also needed for the means by which this reporting is propagated through the system, such as when spawning processes during pipeline execution. This led us to promote standardization for using environment variables for context and baggage propagation between processes, another important milestone that was recently approved and merged.
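\\n\\n\\n\\nTo make the idea concrete, here is a minimal, illustrative sketch of what environment-variable context propagation could look like in a shell-based pipeline step. The variable name mirrors the W3C traceparent header and the trace ID is the canonical W3C example value; the exact variable names and format are defined by the approved OTEP, so treat this purely as an illustration:
\\n\\n\\n\\n# Illustrative only: the parent pipeline step exports the trace context...\\nexport TRACEPARENT=\\"00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01\\"\\n\\n# ...and any child process it spawns inherits it and can attach its telemetry\\n# to the same trace (build.sh is a hypothetical build script).\\n./build.sh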
\\n\\n\\n\\n\\n\\nThis realization drove us to look for the right way to approach creating a specification. OpenTelemetry has emerged as the standard for telemetry generation and collection. The OpenTelemetry specification is tasked with exactly this problem: creating a common, uniform, and vendor-agnostic specification for telemetry. And being housed under the Cloud Native Computing Foundation (CNCF) ensures it remains open and vendor-neutral. As long-standing advocates of OpenTelemetry, it only made sense to extend OpenTelemetry to cover this important DevOps use case.
\\n\\n\\n\\nWe started with an OpenTelemetry enhancement proposal (OTEP #223) a couple of years ago, proposing our idea to extend OpenTelemetry to cover the CI/CD observability use case. In parallel, we started a Slack channel on the CNCF Slack to gather fellow enthusiasts behind the idea and start brainstorming what that should look like. The Slack channel grew, and we quickly discovered that the problem is common across many organizations.
\\n\\n\\n\\nWith feedback from the Technical Oversight Committee and others within the CNCF, we took the path of asking for a mandate to start a dedicated Working Group for the topic under OpenTelemetry’s Semantic Conventions SIG (SIG SemConv for short). With their blessing, we launched the formal CI/CD Observability SIG to formalize our previous Slack group discussions and goals.
\\n\\n\\n\\n\\n\\nSince November of 2023, the SIG has been actively working to develop the standard for semantics around CI/CD observability in collaboration with experts from multiple companies and Open-Source projects. At its inception, we decided to focus on a few key areas for 2024:
\\n\\n\\n\\nAt first, our SIG met during the larger Semantic Conventions Working Group meetings every Monday. This provided a good opportunity for us to get our bearings as we researched and discussed how we would accomplish the goals on our roadmap. This also enabled us to get to know many members of the larger OpenTelemetry community, solicit feedback on our designs, and get direction on how to proceed. The OpenTelemetry Semantic Convention Working Group has been extraordinarily supportive of the CI/CD initiative.
\\n\\n\\n\\nUpon completion and release of its initial milestone (see below), our SIG was granted its own dedicated meeting slot on the OpenTelemetry calendar, every Thursday at 0600 PT. The group gets together here to discuss current and future work prior to bringing to the larger Semantic Conventions meetings on Monday. We greatly look forward to the continued support and participation of the community as we continue to drive forward this critical area of standardization.
\\n\\n\\n\\n\\n\\nOver the course of months of iteration and feedback, the first set of Semantic Conventions was merged in for the v1.27.0 release. This change brought forth the first set of foundational semantics for CI/CD under the CICD
, artifacts
, VCS
, test
, and deployment
namespaces. This was a significant milestone for the CI/CD Observability SIG and industry as a whole. This creates the foundation for which all of our group’s other goals can begin to take form, and reach implementation.
\\n\\n\\n\\nBut what does that actually mean? What value does it provide? Let’s consider real-world examples for two of the namespaces.
\\n\\n\\n\\nVersion Control System (VCS) attributes cover multiple areas common in a VCS, like refs and changes (pull/merge requests). The vcs.repository.ref.revision attribute is a key piece of metadata. As Version Control Systems like GitHub and GitLab emit events, they can now carry this semantically compliant attribute. That means when integrating code, releasing it, and deploying it to environments, systems can include this attribute and trace the code revision across boundaries more easily. In the event a deployment fails, you can quickly look at the revision of code and track it back to the buggy release. This attribute is also a key piece of metadata for DORA metrics, as you calculate Change Lead Time and Failed Deployment Recovery Time.
\\n\\n\\n\\nThe artifact attribute namespace had multiple attributes in its first implementation. One key set of attributes within this namespace covers attestations that closely align with the SLSA model. This is really the first time a direct connection has been made between observability and software supply chain security. Consider the following supply chain threat model defined by SLSA:
\\n\\n\\n\\nThese new attributes for artifacts and attestations help observe the sequence of events modeled in the above diagram in real time. Really, the conventions that exist today and those that will be added in the future enable interoperability between core software delivery capabilities like security and platform engineering via observability semantics.
\\n\\n\\n\\nThe first major milestone we shared above was the merge of the OTEP for extending the semantic conventions with the new attributes, which is now part of the latest OpenTelemetry Semantic Conventions release.
\\n\\n\\n\\nThe other important milestone was OTEP #258 for Environment Variable Context Propagation that was just approved and merged. This OTEP sets the ground for writing the specification.
\\n\\n\\n\\n\\n\\nSince we’ve made progress on our initial milestones, we’ve updated the CI/CD Observability SIG milestones for the remainder of 2024. Our goal is to finish out as many of the defined milestones as possible by the end of the year. Notably, we’re focused on:
\\n\\n\\n\\nAll that has been mentioned thus far is just the beginning! We have lots of work defined on our CICD Project Board, and we have work in progress! We’ll continue to iterate on the milestones that we’ve set out for the remainder of 2024. Here are a couple of things to look out for.
\\n\\n\\n\\nAnd much more!
\\n\\n\\n\\nWhoa, that’s a lot to do! Most certainly this SIG will continue beyond 2024 and through 2025. Standards are hard, but essential. And we have some amazing folks who are part of the SIG and contributing to these standards! Who, you may ask?
\\n\\n\\n\\nFirstly we’d like to acknowledge key members of OpenTelemetry leadership committees who have heavily enabled the work we’ve done thus far, and will continue to do.
\\n\\n\\n\\nFrom the OpenTelemetry Technical Committee we have two core sponsors, Carlos Alberto from Lightstep and Josh Suereth from Google. Both Carlos and Josh have been so supportive of the CICD work, really guiding us through the process and details we need to be successful.
\\n\\n\\n\\nFrom the OpenTelemetry Governance Committee we’ve had Trask Stalnaker from Microsoft act as an exceptional ally, and Daniel Blanco from Skyscanner who now acts as our current Liaison. Both Trask and Daniel have been instrumental in supporting the SIG and enabling us to have our own meeting in the OpenTelemetry community.
\\n\\n\\n\\nIn addition to those folks, we’ve had significant feedback, support, and contributions from the following key folks:
\\n\\n\\n\\nThat was a lot of names to name! We greatly appreciate everyone who has supported this initiative and helped bring it to fruition! It takes significant thought and time to build industry-wide standards. Hard problems are hard, but these folks have risen to the challenge to make the world of observability and CI/CD systems a better, more interoperable place!
\\n\\n\\n\\nWant to learn more? Want to get involved in shaping CI/CD Observability?
\\n\\n\\n\\nWe invite developers and practitioners to participate in the discussions, contribute ideas, and help shape the future of CI/CD observability and the OpenTelemetry semantic conventions. Discussion takes place in the CNCF slack workspace under the #cicd-o11y channel, and you can chime in on GitHub and join the CICD SIG weekly calls every Thursday at 0600 PT.
\\n\\nMember post originally published on Middleware’s blog
\\n\\n\\n\\nIn the world of cloud-native applications, Kubernetes stands as the go-to platform for container orchestration (the automated process of managing, scaling, and maintaining containerized applications across multiple hosts). As applications grow in scale and complexity, effective logging becomes crucial for monitoring, troubleshooting, and maintaining smooth operations.
\\n\\n\\n\\nThis guide explores the intricacies of Kubernetes logging, its significance, and the common commands one may encounter in their monitoring activities. We’ll also dive into the various logging sources within a Kubernetes environment, accompanied by code examples to illustrate key concepts.
\\n\\n\\n\\nKubernetes log monitoring involves collecting and analyzing log data from various sources within a Kubernetes cluster. Logs provide valuable insights into the state of your applications, nodes, and the cluster itself. They help identify issues, understand application behavior, and maintain overall system health.
\\n\\n\\n\\ni. How does logging work in Kubernetes?
\\n\\n\\n\\nIn Kubernetes, logs are typically generated by applications running in containers, the nodes on which these containers run, and the cluster components. Kubernetes does not handle log storage and analysis directly but allows integration with various logging solutions.
\\n\\n\\n\\nWhen a container writes logs, they are captured by the container runtime and can be accessed using commands like `kubectl logs`. These logs can then be shipped to external logging systems for further analysis.
\\n\\n\\n\\nii. Logging drivers and options
\\n\\n\\n\\nA logging driver is a component in containerized environments that manages how and where logs are stored and processed. It defines the mechanism by which logs from containers are collected, formatted, and sent to a specific logging backend or storage system. By configuring logging drivers, you can control the flow of log data, making it easier to monitor, troubleshoot, and analyze the performance and behavior of your applications.
\\n\\n\\n\\nKubernetes supports various logging drivers and options to suit different needs. Docker, which is often used in Kubernetes clusters, offers multiple logging drivers such as json-file, syslog, journald, fluentd, and gelf. Each driver has its own features and configuration options.
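\\n\\n\\n\\nAs a rough sketch (assuming Docker as the container runtime and the default config location), the json-file driver can be configured with log-rotation options in /etc/docker/daemon.json; note that this overwrites any existing daemon configuration, so merge by hand if one already exists:
\\n\\n\\n\\n```sh\\n# Illustrative only: set the json-file logging driver with rotation options.\\ncat <<\'EOF\' | sudo tee /etc/docker/daemon.json\\n{\\n \\"log-driver\\": \\"json-file\\",\\n \\"log-opts\\": {\\n \\"max-size\\": \\"10m\\",\\n \\"max-file\\": \\"3\\"\\n }\\n}\\nEOF\\n\\n# Restart Docker so the new logging configuration takes effect.\\nsudo systemctl restart docker\\n```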
\\n\\n\\n\\nYou can also integrate your logs with Middleware; simply click here to learn more. You can also use this guide to set up monitoring with Node.js.
\\n\\n\\n\\ni. Application logs
\\n\\n\\n\\nApplication logs are generated by the applications running inside containers. These logs are essential for understanding application-specific events and behaviors.
\\n\\n\\n\\nTo access application logs, you can use the `kubectl logs` command:
\\n\\n\\n\\n```sh\\n\\nkubectl logs <pod-name>\\n\\n```
\\n\\n\\n\\nii. Node Logs
\\n\\n\\n\\nNode logs are generated by the Kubernetes nodes and include logs from the kubelet, container runtime, and other node-level components. These logs help monitor the health and performance of individual nodes.
\\n\\n\\n\\nTo access node logs, you can use SSH to connect to the node and view logs stored in the `/var/log` directory:
\\n\\n\\n\\n```sh\\n\\nssh user@node-ip\\n\\nsudo tail -f /var/log/kubelet.log\\n\\n```
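\\n\\n\\n\\nOn nodes where the kubelet runs as a systemd service, it often logs to the journal rather than a plain file; in that case (a sketch, assuming systemd-based nodes) you can follow the kubelet logs with journalctl:
\\n\\n\\n\\n```sh\\n\\nssh user@node-ip\\n\\nsudo journalctl -u kubelet -f\\n\\n```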
\\n\\n\\n\\niii. Cluster Logs
\\n\\n\\n\\nCluster logs encompass logs from the entire Kubernetes cluster, including logs from the control plane components like the API server, scheduler, and controller manager. These logs are crucial for understanding the cluster’s overall health and performance.
\\n\\n\\n\\nTo access cluster logs, you can use the `kubectl` command with the appropriate component:
\\n\\n\\n\\n```sh\\n\\nkubectl logs <pod-name> -n kube-system\\n\\n```
\\n\\n\\n\\nHere, kube-system is a namespace in Kubernetes used to host the core infrastructure components and system services that are essential for running and managing the cluster. It is a predefined namespace that comes with every Kubernetes installation.
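\\n\\n\\n\\nFor example, you can first list the control plane pods in kube-system and then read the API server logs; the exact pod name suffix varies per cluster, so treat the name below as a placeholder:
\\n\\n\\n\\n```sh\\n\\nkubectl get pods -n kube-system\\n\\nkubectl logs kube-apiserver-<node-name> -n kube-system\\n\\n```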
\\n\\n\\n\\nTo understand the intricacies of logging in Kubernetes, we’ll use a demo microservice. This simple project will help us explore how to set up logging, monitor logs, and debug issues within a production-like Kubernetes environment.
\\n\\n\\n\\nThe demo microservice is a Node.js application with basic functionalities like user authentication, transactions, and payments. It includes several endpoints to simulate different scenarios and log relevant events.
\\n\\n\\n\\nHere’s the basic structure of the project:
\\n\\n\\n\\n1. app.js: The main application file.
\\n\\n\\n\\n2. Dockerfile: Used to containerize the application.
\\n\\n\\n\\n3. deployment.yaml: Kubernetes deployment configuration.
\\n\\n\\n\\n4. service.yaml: Kubernetes service configuration.
\\n\\n\\n\\n5. simulate_requests.sh: Script to simulate various user interactions.
\\n\\n\\n\\n1. Create the Application File (`app.js`)
\\n\\n\\n\\nFirst, create the project directory and initialize an empty Node project. Then install Express.
\\n\\n\\n\\n```\\nmkdir kubernetes-logging-demo\\ncd kubernetes-logging-demo\\nnpm init -y\\nnpm install express\\n```
\\n\\n\\n\\nThe application handles user authentication, transactions, and payments, and includes logging for different events. Here’s the complete code for the `app.js` file.
\\n\\n\\n\\n```javascript\\nconst express = require(\'express\');\\nconst app = express();\\nconst port = 3000;\\n\\napp.use(express.json());\\n\\n// Simulated database\\nconst users = {\\n \\"john_doe\\": { password: \\"12345\\", balance: 100 },\\n \\"jane_doe\\": { password: \\"67890\\", balance: 200 }\\n};\\n\\n// Helper function for logging\\nconst log = (level, message) => {\\n const timestamp = new Date().toLocaleString();\\n console.log(`${timestamp} - ${level} - ${message}`);\\n}; \\n\\n// Middleware to log requests\\napp.use((req, res, next) => {\\n log(\'info\', `Request received: ${req.method} ${req.originalUrl}`);\\n next();\\n});\\n\\n// Endpoint for user authentication\\napp.post(\'/login\', (req, res) => {\\n const { username, password } = req.body;\\n if (users[username] && users[username].password === password) {\\n log(\'info\', `User login successful: ${username}`);\\n res.status(200).send(\'Login successful\');\\n } else {\\n log(\'warn\', `User login failed: ${username}`);\\n res.status(401).send(\'Login failed\');\\n }\\n});\\n\\n// Endpoint for logging user transactions\\napp.post(\'/transaction\', (req, res) => {\\n const { username, amount } = req.body;\\n if (users[username]) {\\n users[username].balance += amount;\\n log(\'info\', `Transaction successful: ${username} new balance: ${users[username].balance}`);\\n res.status(200).send(\'Transaction successful\');\\n } else {\\n log(\'warn\', `Transaction failed: User not found - ${username}`);\\n res.status(404).send(\'User not found\');\\n }\\n});\\n\\n// Endpoint for simulating payment processing\\napp.post(\'/payment\', (req, res) => {\\n const { username, amount } = req.body;\\n if (users[username]) {\\n if (users[username].balance >= amount) {\\n users[username].balance -= amount;\\n log(\'info\', `Payment successful: ${username} amount: ${amount}`);\\n res.status(200).send(\'Payment successful\');\\n } else {\\n log(\'warn\', `Payment failed: Insufficient funds - ${username}`);\\n res.status(400).send(\'Insufficient funds\');\\n }\\n } else {\\n log(\'warn\', `Payment failed: User not found - ${username}`);\\n res.status(404).send(\'User not found\');\\n }\\n});\\n\\n// Endpoint for simulating security incident\\napp.post(\'/admin\', (req, res) => {\\n const { username } = req.body;\\n if (username === \'admin\') {\\n log(\'error\', `Unauthorized access attempt by user: ${username}`);\\n res.status(403).send(\'Unauthorized access\');\\n } else {\\n res.status(200).send(\'Welcome\');\\n }\\n});\\n\\napp.listen(port, () => {\\n log(\'info\', `Dummy microservice listening at http://localhost:${port}`);\\n});\\n```
\\n\\n\\n\\n2. Create the Dockerfile
\\n\\n\\n\\nThis file is used to build a Docker image of the application.
\\n\\n\\n\\n```Dockerfile\\nFROM node:18\\n\\nWORKDIR /usr/src/app\\n\\nCOPY package*.json ./\\nRUN npm install\\n\\nCOPY . .\\n\\nEXPOSE 3000\\nCMD [\\"node\\", \\"app.js\\"]\\n```
\\n\\n\\n\\n3. Create the Kubernetes Deployment Configuration (`deployment.yaml`)
\\n\\n\\n\\nThis file defines how the application will be deployed on Kubernetes.
\\n\\n\\n\\n```yaml\\napiVersion: apps/v1\\nkind: Deployment\\nmetadata:\\n name: logging-demo\\nspec:\\n replicas: 1\\n selector:\\n matchLabels:\\n app: logging-demo\\n template:\\n metadata:\\n labels:\\n app: logging-demo\\n spec:\\n containers:\\n - name: logging-demo\\n image: kubernetes-logging-demo:latest\\n imagePullPolicy: Never\\n ports:\\n - containerPort: 3000\\n```
\\n\\n\\n\\n4. Create the Kubernetes Service Configuration (`service.yaml`)
\\n\\n\\n\\nThis file defines the service that will expose the application within the Kubernetes cluster.
\\n\\n\\n\\n```yaml\\napiVersion: v1\\nkind: Service\\nmetadata:\\n name: logging-demo-service\\nspec:\\n selector:\\n app: logging-demo\\n ports:\\n - protocol: TCP\\n port: 80\\n targetPort: 3000\\n nodePort: 30001\\n type: NodePort\\n```
\\n\\n\\n\\n5. Simulate User Interactions (`simulate_requests.sh`)
\\n\\n\\n\\nThis script sends various requests to the application to simulate different scenarios and trigger logs. Note that this is an infinite loop until terminated.
\\n\\n\\n\\n```bash\\n#!/bin/bash\\n\\n# Define the base URL for the application\\nBASE_URL=\\"http://localhost:30001\\"\\n\\n# Define an array of curl commands to simulate different scenarios\\ncurl_commands=(\\n # Successful login\\n \\"curl -i -X POST $BASE_URL/login -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"john_doe\\\\\\",\\\\\\"password\\\\\\":\\\\\\"12345\\\\\\"}\'\\"\\n # Failed login due to wrong password\\n \\"curl -i -X POST $BASE_URL/login -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"john_doe\\\\\\",\\\\\\"password\\\\\\":\\\\\\"wrong_password\\\\\\"}\'\\"\\n # Successful transaction\\n \\"curl -i -X POST $BASE_URL/transaction -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"john_doe\\\\\\",\\\\\\"amount\\\\\\":50}\'\\"\\n # Failed transaction because the user is not found\\n \\"curl -i -X POST $BASE_URL/transaction -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"unknown_user\\\\\\",\\\\\\"amount\\\\\\":50}\'\\"\\n # Successful payment\\n \\"curl -i -X POST $BASE_URL/payment -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"jane_doe\\\\\\",\\\\\\"amount\\\\\\":50}\'\\"\\n # Failed payment due to insufficient funds\\n \\"curl -i -X POST $BASE_URL/payment -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"john_doe\\\\\\",\\\\\\"amount\\\\\\":200}\'\\"\\n # Unauthorized access attempt\\n \\"curl -i -X POST $BASE_URL/admin -H \'Content-Type: application/json\' -d \'{\\\\\\"username\\\\\\":\\\\\\"admin\\\\\\"}\'\\"\\n)\\n\\n# Infinite loop to continuously execute curl requests\\nwhile true; do\\n cmd=${curl_commands[$RANDOM % ${#curl_commands[@]}]}\\n echo \\" Executing: $cmd\\"\\n eval $cmd\\n # Optional: add a short sleep (1 second between requests) to mimic real user behavior\\n sleep 1\\ndone\\n```
\\n\\n\\n\\n1. Build the docker image
\\n\\n\\n\\n```sh\\n\\n docker build -t kubernetes-logging-demo:latest .\\n\\n ```
\\n\\n\\n\\n2. Deploy to Kubernetes
\\n\\n\\n\\n```sh\\n\\n kubectl apply -f deployment.yaml\\n\\n kubectl apply -f service.yaml\\n\\n ```
\\n\\n\\n\\n3. Simulate User Interactions (this will make our app generate the logs)
\\n\\n\\n\\nUse a dedicated bash terminal to keep this simulation running as required.
\\n\\n\\n\\n```sh\\n\\n ./simulate_requests.sh\\n\\n ```
\\n\\n\\n\\nWith this, your local setup is ready and you can test out different `kubectl` commands in real time. This setup helps us understand how to collect, monitor, and analyze logs, which is crucial for maintaining the health and performance of our applications.
\\n\\n\\n\\n1. Tracking user activity
\\n\\n\\n\\nMonitoring user activity is essential for understanding user behavior, identifying trends, and detecting unusual patterns that might indicate security issues.
\\n\\n\\n\\n2. Debugging payment issues
\\n\\n\\n\\nPayment-related logs are critical for resolving transaction failures, identifying bugs in the payment processing logic, and ensuring that financial operations are secure and reliable.
\\n\\n\\n\\nThe `kubectl logs` command is a powerful tool for accessing the logs of containers running in your Kubernetes cluster.
\\n\\n\\n\\n```sh\\n\\nkubectl logs [OPTIONS] POD_NAME [-c CONTAINER_NAME]\\n\\n```
\\n\\n\\n\\n– `POD_NAME`: The name of the pod whose logs you want to view.
\\n\\n\\n\\n– `-c CONTAINER_NAME`: (Optional) Specifies the container within the pod. Useful if the pod has multiple containers.
\\n\\n\\n\\nHere’s a simple example to get logs from a pod:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod\\n\\n```
\\n\\n\\n\\nIf the pod contains multiple containers, specify the container name:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod -c my-container\\n\\n```
\\n\\n\\n\\nIf you have properly setup the demo project, you should be able to access your pod name by using this command:
\\n\\n\\n\\n```sh\\n\\nkubectl get pods\\n\\n```
\\n\\n\\n\\nYour terminal might output something like this:
\\n\\n\\n\\n```sh\\n\\nNAME READY STATUS RESTARTS AGE\\n\\nlogging-demo-6cf76dcb4c-mz7bv 1/1 Running 0 69m\\n\\n```
\\n\\n\\n\\nNow, to access the logs of this pod, you just have to run this command:
\\n\\n\\n\\n```sh\\n\\nkubectl logs logging-demo-6cf76dcb4c-mz7bv\\n\\n```
\\n\\n\\n\\nThe corresponding output will look something like this:
\\n\\n\\n\\n```sh\\n\\n6/29/2024, 11:24:52 AM - info - Dummy microservice listening at http://localhost:3000\\n\\n```
\\n\\n\\n\\nLet’s use our `simulate_requests` script to send requests to our pod and try to monitor the logs generated. To do this, simply keep running the script in a dedicated terminal.
\\n\\n\\n\\nYou can filter and view logs for specific pods and containers to narrow down your troubleshooting efforts.
\\n\\n\\n\\nViewing Logs for a Specific Pod
\\n\\n\\n\\nTo view logs for a specific pod:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod\\n\\n```
\\n\\n\\n\\nViewing Logs for a Specific Container in a Pod
\\n\\n\\n\\nAs explained above, if the pod has multiple containers, specify the container name:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod -c my-container\\n\\n```
\\n\\n\\n\\nViewing Logs with a Label Selector
\\n\\n\\n\\nYou can also use label selectors to filter logs from pods that match specific labels:
\\n\\n\\n\\n```sh\\n\\nkubectl logs -l app=my-app\\n\\n```\\n\\nFor our example project, this would look something like this:\\n\\n```sh\\n\\nkubectl logs -l app=logging-demo\\n\\n```
\\n\\n\\n\\nNote that the app name label is configured in the `deployment.yaml` file.
\\n\\n\\n\\nStartup logs are crucial for identifying issues that occur during the initialization phase of your containers, while runtime logs help monitor the ongoing operations.
\\n\\n\\n\\nViewing Startup Logs
\\n\\n\\n\\nTo analyze logs from the startup phase of a pod, you can specify the time range from the pod’s creation using the `--since` flag with the `kubectl logs` command. This flag allows you to retrieve logs starting from a specified duration in the past, which is particularly useful for investigating recent startups.
\\n\\n\\n\\n```sh\\n\\nkubectl logs --since=5m my-pod\\n\\n```
\\n\\n\\n\\nViewing Runtime Logs
\\n\\n\\n\\nFor continuous monitoring of runtime logs, you can use this command:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod --follow\\n\\n```
\\n\\n\\n\\nThe `--follow` option streams the logs in real-time, allowing you to monitor the container’s activities as they happen.
\\n\\n\\n\\nThe `kubectl logs --tail` command is particularly useful for real-time log monitoring and debugging. It helps you view the most recent log entries without having to go through the entire log history.
\\n\\n\\n\\nThe `--tail` option with `kubectl logs` fetches the last few lines of logs from a pod or container.
\\n\\n\\n\\nBasic syntax:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod --tail=50\\n\\n```
\\n\\n\\n\\nThis command retrieves the last 50 lines of logs from `my-pod`.
\\n\\n\\n\\nDifferences from Other Log Options
\\n\\n\\n\\n--since: Retrieves logs from a specific time period.
\\n\\n\\n\\n--follow: Streams logs in real-time.
\\n\\n\\n\\n--tail: Fetches a specified number of recent log lines.
\\n\\n\\n\\nHow to Tail Logs for Real-Time Monitoring?
\\n\\n\\n\\nTo tail logs for real-time monitoring of a pod, use this command:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod --follow --tail=50\\n\\n```
\\n\\n\\n\\nThis streams the last 50 lines of logs and continues to stream new log entries in real-time.
\\n\\n\\n\\nHow to Tail Logs for a Specific Component in the Project
\\n\\n\\n\\nFor tailing logs of a specific component, specify the container name:
\\n\\n\\n\\n```sh\\n\\nkubectl logs my-pod -c my-container --tail=50\\n\\n```
\\n\\n\\n\\nThis command fetches the last 50 lines of logs from the `my-container` container within `my-pod`.
\\n\\n\\n\\nFor our demo project, to get the container details within a pod, you can use the below command.
\\n\\n\\n\\n```sh\\n\\nkubectl describe pod logging-demo-6cf76dcb4c-mz7bv -n default\\n\\n```
\\n\\n\\n\\nCombining the `--tail` option with other flags in `kubectl logs` can enhance your logging capabilities and provide more detailed insights.
\\n\\n\\n\\nExamples: `-f` (follow), `--since`, `--timestamps`
\\n\\n\\n\\nFollow (`-f`): Combines real-time streaming with tailing the logs.
\\n\\n\\n\\n```sh\\n\\n kubectl logs my-pod --tail=50 -f\\n\\n ```
\\n\\n\\n\\nThis command fetches the last 50 lines and streams new logs in real-time.
\\n\\n\\n\\nSince (`--since`): Retrieves logs from a specific time period.
\\n\\n\\n\\n```sh\\n\\n kubectl logs my-pod --tail=50 --since=1h\\n\\n ```
\\n\\n\\n\\nThis command fetches the last 50 lines of logs from the past hour.
\\n\\n\\n\\nTimestamps (`--timestamps`): Adds timestamps to each log entry.
\\n\\n\\n\\n```sh\\n\\n kubectl logs my-pod --tail=50 --timestamps\\n\\n ```
\\n\\n\\n\\nThis command includes timestamps for the last 50 log lines, useful for chronological analysis if the component logs do not have timestamps of their own.
\\n\\n\\n\\n1. Debugging a Deployment Issue:
\\n\\n\\n\\n```sh\\n\\n kubectl logs my-pod --tail=100 -f --since=10m\\n\\n ```
\\n\\n\\n\\nThis command is useful to debug recent issues by streaming the last 100 lines of logs from the past 10 minutes.
\\n\\n\\n\\n2. Performance Monitoring:
\\n\\n\\n\\n```sh\\n\\n kubectl logs my-pod --tail=200 --timestamps -f\\n\\n ```
\\n\\n\\n\\nUse this command to monitor performance metrics in real-time with precise timestamps.
\\n\\n\\n\\nIf you use it with our demo project, here’s the output that you may see. Note that the logs generated are due to the simulated requests we are sending.
\\n\\n\\n\\n```sh\\n\\n kubectl logs logging-demo-6cf76dcb4c-mz7bv --tail=200 --timestamps -f\\n\\n 2024-06-29T12:45:42.618459492Z 6/29/2024, 12:45:42 PM - info - Request received: POST /payment\\n\\n 2024-06-29T12:45:42.618497775Z 6/29/2024, 12:45:42 PM - warn - Payment failed: Insufficient funds - jane_doe\\n\\n 2024-06-29T12:45:43.695529512Z 6/29/2024, 12:45:43 PM - info - Request received: POST /transaction\\n\\n 2024-06-29T12:45:43.695560010Z 6/29/2024, 12:45:43 PM - info - Transaction successful: john_doe new balance: 200\\n\\n 2024-06-29T12:45:44.766836119Z 6/29/2024, 12:45:44 PM - info - Request received: POST /login\\n\\n 2024-06-29T12:45:44.766874021Z 6/29/2024, 12:45:44 PM - warn - User login failed: john_doe\\n\\n .\\n\\n .\\n\\n .\\n\\n ```
\\n\\n\\n\\nFiltering logs using labels and selectors helps you focus on specific parts of your application, especially in large clusters.
\\n\\n\\n\\nFiltering Logs Using Labels and Selectors
\\n\\n\\n\\nTo filter logs by labels:
\\n\\n\\n\\n```sh\\n\\nkubectl logs -l app=my-app --tail=50\\n\\n```
\\n\\n\\n\\nThis command fetches the last 50 lines of logs from all pods labeled `app=my-app`.
\\n\\n\\n\\n1. Filtering Logs for the `frontend` Component:
\\n\\n\\n\\n```sh\\n\\n kubectl logs -l component=frontend --tail=100 -f\\n\\n ```
\\n\\n\\n\\nThis command fetches and streams the last 100 lines of logs from all frontend components.
\\n\\n\\n\\n2. Filtering Logs for Pods Running on a Specific Node:
\\n\\n\\n\\n```sh\\n\\n kubectl logs -l node=worker-node1 --tail=50\\n\\n ```
\\n\\n\\n\\nUse this command to fetch logs from pods running on `worker-node1`.
\\n\\n\\n\\nYou can fetch the list of nodes in your environment using the `kubectl get nodes` command.
\\n\\n\\n\\nComponent names are generally configured in your deployment or pod configs, similar to app labels, for better identification.
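\\n\\n\\n\\nAs a quick sketch using the demo project, you can inspect the labels that are already set on the pods and add a component label on the fly (the component value below is purely illustrative):
\\n\\n\\n\\n```sh\\n\\nkubectl get pods -l app=logging-demo --show-labels\\n\\nkubectl label pods -l app=logging-demo component=frontend\\n\\n```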
\\n\\n\\n\\nAutomating log tailing with scripts and integrating with CI/CD pipelines can enhance continuous monitoring and streamline troubleshooting.
\\n\\n\\n\\nYou can create scripts to automate the process of tailing logs. Here’s an example script:
\\n\\n\\n\\n```sh\\n\\n#!/bin/bash\\n\\n# Script to tail logs from all pods with label app=my-app\\n\\nkubectl logs -l app=logging-demo --tail=100 -f\\n\\n```
\\n\\n\\n\\nSave this script as `tail_logs.sh` and run it to automate log tailing.
\\n\\n\\n\\nHere’s how you can integrate log tailing in your CI/CD pipelines to monitor deployments and application health continuously.
\\n\\n\\n\\n```yaml\\n\\n# Example Jenkinsfile\\n\\npipeline {\\n\\n agent any\\n\\n stages {\\n\\n stage(\'Deploy\') {\\n\\n steps {\\n\\n sh \'kubectl apply -f deployment.yaml\'\\n\\n }\\n\\n }\\n\\n stage(\'Monitor Logs\') {\\n\\n steps {\\n\\n sh \'./tail_logs.sh\'\\n\\n }\\n\\n }\\n\\n }\\n\\n}\\n\\n```
\\n\\n\\n\\nThis Jenkins pipeline deploys your application and then tails the logs.
\\n\\n\\n\\nOnce we have run the `simulate_requests.sh` script for some time, we will have generated a lot of logs in our pod. We have simulated many different positive and negative requests. How do we actually analyze these logs in a production environment? Let’s explore some operations with our project.
\\n\\n\\n\\nUsing tail logs to monitor key performance metrics helps you understand your application’s behavior and performance under load. Here’s how you can effectively monitor and compute these metrics in real time. You can run these commands in a bash terminal.
\\n\\n\\n\\nUsing Tail Logs to Monitor Key Performance Metrics
\\n\\n\\n\\nYou can create custom scripts that capture the logic for computing specific metrics from your logs.
\\n\\n\\n\\n```sh\\n\\nkubectl logs -l app=logging-demo --tail=100 | awk \'\\n\\nfunction parseTime(ts, date, time, ampm, h, m, s, time_ampm) {\\n\\n split(ts, datetime, \\", \\")\\n\\n date = datetime[1]\\n\\n split(datetime[2], time_ampm, \\" \\")\\n\\n time = time_ampm[1]\\n\\n ampm = time_ampm[2]\\n\\n split(date, date_parts, \\"/\\")\\n\\n split(time, time_parts, \\":\\")\\n\\n h = time_parts[1]\\n\\n m = time_parts[2]\\n\\n s = time_parts[3]\\n\\n if (ampm == \\"PM\\" && h < 12) h += 12\\n\\n if (ampm == \\"AM\\" && h == 12) h = 0\\n\\n return (h * 3600) + (m * 60) + s\\n\\n}\\n\\n/-/ {\\n\\n curr_time = parseTime($1 \\" \\" $2 \\" \\" $3)\\n\\n if (prev_time > 0) {\\n\\n response_time = curr_time - prev_time\\n\\n sum += response_time\\n\\n count += 1\\n\\n if (response_time > max || count == 1) max = response_time\\n\\n if (response_time < min || count == 1) min = response_time\\n\\n }\\n\\n prev_time = curr_time\\n\\n}\\n\\nEND {\\n\\n if (count > 0) {\\n\\n printf \\"Average Response Time: %.2f seconds\\\\nMax Response Time: %d seconds\\\\nMin Response Time: %d seconds\\\\n\\", sum/count, max, min\\n\\n } else {\\n\\n print \\"No response times calculated.\\"\\n\\n }\\n\\n}\'\\n\\n```\\n\\nOutput:\\n\\n```sh\\n\\nAverage Response Time: 0.54 seconds\\n\\nMax Response Time: 2 seconds\\n\\nMin Response Time: 0 seconds\\n\\n```
\\n\\n\\n\\nThis command fetches the last 100 log lines from all pods labeled `app=logging-demo` and uses `awk` to compute key metrics: the average, maximum, and minimum time elapsed between consecutive log entries, which serves as a rough proxy for response times under load.
\\n\\n\\n\\nTail logs are essential for identifying and analyzing suspicious activities, enabling a quick response to security incidents. Here’s how you can filter logs specific to such activity.
\\n\\n\\n\\nTail Logs for Identifying and Analyzing Suspicious Activities
\\n\\n\\n\\n```sh\\n\\nkubectl logs -l app=logging-demo --tail=200 | grep -i \'error\\\\|failed\\\\|unauthorized\' | awk \'{count += 1} END {print \\"Total Security Incidents Detected:\\", count}\'\\n\\n```\\n\\nOutput:\\n\\n```sh\\n\\nTotal Security Incidents Detected: 63\\n\\n```
\\n\\n\\n\\nThis command retrieves the last 200 log lines and filters for keywords such as “error,” “failed,” and “unauthorized,” which can help in identifying potential security breach attempts. It uses `awk` to count the number of incidents detected. The `awk` command is a versatile tool for text processing in Unix-like systems. It excels in pattern matching, field manipulation, and generating reports using custom logic as per your application needs.
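\\n\\n\\n\\nIf a single total is not enough, the same pipeline can be extended to break the count down by keyword. This is an illustrative sketch using the same demo labels; adjust the patterns to whatever your application actually logs:

```sh
kubectl logs -l app=logging-demo --tail=200 \
  | grep -iE 'error|failed|unauthorized' \
  | awk '{
      line = tolower($0)
      if (line ~ /unauthorized/)  count["unauthorized"]++
      else if (line ~ /failed/)   count["failed"]++
      else                        count["error"]++
    }
    END {
      for (k in count) printf "%-14s %d\n", k, count[k]
    }'
```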
\\n\\n\\n\\nBy running such commands and scripts, you can directly compute and track essential metrics, enabling real-time monitoring and swift response to any issues.
\\n\\n\\n\\nBy mastering these logging techniques, you can significantly enhance your ability to monitor, troubleshoot, and secure your Kubernetes applications. Start implementing these strategies today to maintain robust and reliable cloud-native applications.
\\n\\n\\n\\nWant to use Middleware for log monitoring? Check out our detailed documentation here.
\\n\\nMember post originally published on Tetrate’s blog by Cristofer TenEyck and Jimmy Song
\\n\\n\\n\\nIn the evolving landscape of cloud-native applications, securing service meshes across multiple clusters is crucial for ensuring both security and compliance. Istio, a leading open-source service mesh, provides tools for securing communication between microservices. However, implementing a robust and scalable Public Key Infrastructure (PKI) to manage certificates within this environment remains a significant challenge.
\\n\\n\\n\\nIn this blog, we will delve into the implementation of a PKI solution using the EJBCA open-source PKI for an Istio service mesh spanning multiple clusters. We will focus on the process of setting up EJBCA, configuring the cert-manager EJBCA external issuer, and ensuring automatic certificate renewal for your Istio workloads. This guide will help you build a trusted and scalable PKI, enabling secure, compliant, and resilient service meshes.
\\n\\n\\n\\nWhy multi-clusters? Multi-cluster deployments are becoming increasingly popular as organizations expand their Kubernetes infrastructure. Multi-cluster Istio setups provide enhanced availability, fault tolerance, and isolation of workloads across clusters.
\\n\\n\\n\\nPKI is a cornerstone of modern digital security. It involves managing keys and certificates to ensure secure communication between entities, be they users, applications, or services. In the context of a service mesh like Istio, an effective PKI is essential for securing communications between microservices, especially in multi-cluster environments.
\\n\\n\\n\\nEJBCA offers an open-source solution for managing PKI at scale. Compared to other options like OpenSSL or Istio’s built-in PKI, EJBCA provides a full-featured, enterprise-grade PKI that is well suited to deployments ranging from simple to complex and multi-purpose. EJBCA’s capabilities go beyond just issuing mTLS certificates, offering compliance features, secure scalability, crypto agility, and integration with a wide range of applications.
\\n\\n\\n\\nThis guide covers setting up a PKI for a multi-cluster Istio environment using EJBCA. Here is what is included:
\\n\\n\\n\\nThis section outlines the steps to set up Istio on Kubernetes clusters using EJBCA as an external Certificate Authority (CA). The setup involves configuring two MicroK8s clusters with MetalLB for load balancing, integrating EJBCA for certificate management, and installing Istio components using Helm. The complete guide can be found here.
\\n\\n\\n\\nThe key steps include:
\\n\\n\\n\\nThe flow diagram in the original post represents the mTLS certificate issuance and renewal process in Istio. It illustrates the flow from the Istiod control plane pushing the Envoy config to the final certificate issuance by EJBCA.
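\\n\\n\\n\\nThe EJBCA issuer configuration itself depends on your PKI deployment, but the Istio and cert-manager side of such a setup can be sketched with standard tooling. The commands below use the upstream Istio Helm charts and simply list whatever cert-manager issuer resources are installed; treat them as a starting point rather than the full procedure from the guide:

```sh
# Install Istio's base CRDs and the istiod control plane from the upstream charts
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
helm install istio-base istio/base -n istio-system --create-namespace
helm install istiod istio/istiod -n istio-system

# Confirm cert-manager is running and see which (cluster) issuers are available
kubectl get pods -n cert-manager
kubectl get clusterissuers
kubectl get issuers --all-namespaces
```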
\\n\\n\\n\\nBuilding a secure PKI for your Istio service mesh involves more than just setting up any PKI and starting to issue certificates. It requires adherence to best practices and compliance with regulations to stay secure and future-proof. Here are some key points to consider:
\\n\\n\\n\\nConclusion
\\n\\n\\n\\nImplementing a PKI for an Istio service mesh in a multi-cluster environment can seem daunting, but with the right tools and practices, it can be achieved efficiently and effectively. EJBCA, combined with cert-manager, offers a solution for managing certificates at scale, ensuring that your Istio service mesh PKI is both secure and compliant.
\\n\\n\\n\\nBy following the steps outlined in this guide, you will be able to set up a trusted PKI, achieve seamless and robust certificate management, and collaborate effectively with your InfoSec team to maintain the security of your service mesh.
\\n\\n\\n\\nFor further resources and more detailed information on the topics covered in this blog, be sure to check out the links and references provided below.
\\n\\n\\n\\nCommunity post by Or Weis
\\n\\n\\n\\nDiscover how leveraging a policy-as-code platform helps foster an engineering culture focused on efficient authorization and access control.
\\n\\n\\n\\nPlatform engineering is rooted in a fundamental principle: cultivating a culture within development teams. This culture is not merely about saving time or streamlining development processes. Rather, it’s about creating superior products through the application of standards and well-defined solutions, providing developers with clear guidelines where they might otherwise struggle to find direction.
\\n\\n\\n\\nIn crucial areas like security and observability, such a culture is the difference between scalable success and inefficiency.
\\n\\n\\n\\nIn this article, we will explore how we successfully implemented such a culture, specifically in the context of permissions and authorization for our product users. We believe that our approach can serve as a valuable model for creating more effective platform engineering teams.
\\n\\n\\n\\nOne of the challenges developers often face, given the amount of data our applications process, is the complexity of managing the vast web of policies that govern permissions over this data. Managing the authorization of users and systems has become an increasingly challenging task.
\\n\\n\\n\\nIn recent years, Policy-as-Code has emerged as a compelling solution. Domain-specific policy languages like Rego and Cedar bring the rigor of software engineering to the world of policy management, allowing teams to express rules clearly and enforce them consistently.
\\n\\n\\n\\nWe’ve witnessed firsthand the transformative impact of integrating a Policy as Code culture within organizations, especially when it comes to navigating the complexities inherent in modern software development and data management.
\\n\\n\\n\\nPolicy as code essentially involves translating organizational rules and procedures into code. This approach draws on a fundamental insight: complex logic is most effectively managed and understood when expressed as declarative code. The beauty of this approach is that it leverages best practices from software development, such as version control, CI/CD, testing, and code reviews, empowering us with the tools to handle policies with the same level of scrutiny and flexibility as our software.
\\n\\n\\n\\nBy encoding policies in a declarative language, we ensure that all aspects of compliance and security are handled in a uniform, traceable way. This approach not only makes policy management more efficient but also makes auditing and troubleshooting more straightforward.
\\n\\n\\n\\nPolicy as code is also one of the biggest enablers of the fine-grained authorization motion, where we can give our users the exact permissions they need without compromising on our applications’ code limitations.
\\n\\n\\n\\nPolicy-as-Code greatly simplifies policy enforcement, but it’s not a silver bullet. Several challenges remain, particularly concerning its implementation. How do we ensure that Policy-as-Code is applied consistently across an organization? How can we make it accessible to those without a development background? And perhaps most importantly, how do we choose Software Development Life Cycle (SDLC) models that align with Policy-as-Code?
\\n\\n\\n\\nConsider the scenario of a shared database or message queue that houses sensitive data. This data, once it leaves its original source, can be transformed by various services or their consumers, often as part of Extract, Transform, Load (ETL) flows. Ensuring that our policies protect this data across different teams and services requires a unified approach that is challenging to scale across diverse teams or developers.
\\n\\n\\n\\nThis difficulty stems not from technical limitations but from cultural ones. As different developers contribute to the software, a phenomenon known as ‘code drift’ occurs, where deviations from the initial policy intentions compromise the policy’s effectiveness.
\\n\\n\\n\\nAs your team grows, maintaining alignment on policy intent becomes increasingly difficult, especially as data schemas evolve and services change.
\\n\\n\\n\\nAddressing these issues requires a cultural shift. Instead of trying to limit our teams’ operational freedom, we should focus on empowering them. This starts with understanding that implementing Policy-as-Code is fundamentally a cultural challenge. By embracing a culture that values shared responsibility and collective adherence to policy guidelines, we can create an environment where policies are respected and enforced more naturally.
\\n\\n\\n\\nAn effective strategy is to establish a baseline policy that allows individual services to enforce their own access controls while adhering to overarching organizational guidelines. This approach not only meets the unique needs of different teams but also ensures that fundamental security and compliance standards are uniformly maintained. The integrity of these baseline guidelines can be ensured through various mechanisms, such as integrating policy checks into CI tests and conducting thorough code reviews focused on policy compliance.
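\\n\\n\\n\\nAs a concrete illustration of the “policy checks in CI” idea, a pipeline step can validate and unit-test Rego policies before they are merged. This sketch assumes the policies and their `_test.rego` files live in a `policies/` directory and that the OPA CLI is available on the runner:

```sh
#!/bin/bash
set -euo pipefail

# Fail the build if any policy fails to parse or compile
opa check policies/

# Run the Rego unit tests with verbose output
opa test policies/ -v
```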
\\n\\n\\n\\nAdditionally, providing teams with modular enforcement tools can facilitate more effective policy implementation. These modules can help gather statistics and identify behavioral patterns from audit logs, offering insights into how policies are being applied in practice. Linking these enforcement points to Data Loss Prevention (DLP) systems can further enhance policy effectiveness by shifting data tracking to a pattern-based approach, which is more adaptable to the dynamic nature of data flows and schema changes.
\\n\\n\\n\\nEmbracing Policy-as-Code within an organization requires a comprehensive approach that goes beyond mere technical implementation. It necessitates fostering a culture that values collaboration, empowerment and shared responsibility for policy compliance.
\\n\\n\\n\\nBy addressing the cultural aspects of Policy-as-Code, leveraging modular tools for enforcement, and ensuring flexibility to adapt to changing needs, we can pave the way for more secure, compliant, and efficient operations across our teams and services.
\\n\\n\\n\\nAs developers and leaders, it’s our role to champion this cultural shift, ensuring that our teams can confidently navigate the challenges of modern software development and data management.
\\n\\nCo-chairs: Christian Hernandez, Dan Garfield, Tim Collins
\\n\\n\\n\\nNovember 12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nThe Argo Project consists of four related, but separate, toolsets. So it’s not just about GitOps, but a wide variety of use cases spanning platform engineering, AI, CI/CD, and general cloud-native management. We strive to load the event with as many end user stories as possible in order to facilitate community engagement and the sharing of ideas. Whether you want to learn what others are doing or just do some general networking, ArgoCon is uniquely positioned to offer a wide variety of talks for folks from various industries and backgrounds.
\\n\\n\\n\\nArgoCon has been around since 2021! ArgoCon started out as a virtual event, and the inaugural edition took place that year. Following its success, we held a standalone event hosted by the CNCF: ArgoCon 2022 happened in person at the Computer History Museum in Mountain View, California. Starting in 2023, ArgoCon has been a co-located event with KubeCon for both North America and Europe. The fact that it happens twice a year, each time with multiple tracks, speaks to the popularity of the Argo Project.
\\n\\n\\n\\nAnyone interested in infrastructure management or AI workflows, as well as DevOps engineers and release managers. Really, there’s a variety of talks, since the Argo Project toolsets span multiple use cases. We have to say that if you’re into operationalizing Kubernetes, you’ll definitely want to check out ArgoCon.
\\n\\n\\n\\nThis year we have a very special surprise for everyone: we will be playing the trailer for the Argo Project Documentary, which will be premiering (in its entirety) at KubeCon. Our attendees in Salt Lake will get a special sneak peek before the full premiere later during KubeCon.
\\n\\n\\n\\nThe event will take place in Salt Lake City, co-located with KubeCon. The best way to attend is to add the “all access pass” when registering for KubeCon. It’s one day with multiple tracks, so you’ll have plenty to choose from during the event. Afterwards there will be a networking event where you can meet others, connect, and talk about the day.
\\n\\n\\n\\nDefinitely go to the project site and familiarize yourself with each tool and what it focuses on. Go through any tutorials that you have time for.
\\n\\n\\n\\nWe’re excited to keep the momentum going for ArgoCon! Putting on this event has been a culmination of a lot of hard work from the community. We wouldn’t be able to do an event like this without the help of countless people. It’s really an event for the community and we’re excited for all who attend.
\\n\\n\\n\\nSubmitted by Christian Hernandez who, along with the other co-chairs, is really looking forward to the premiere of the Argo Documentary.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nOriginally published on the Redpill Linpro blog by Daniel Buøy-Vehn
\\n\\n\\n\\nThe command ansible-runner is part of the Ansible automation platform. If you have Ansible installed, then you probably have ansible-runner installed as well.
But what do you use it for? Well, if you run AWX or the Ansible Automation Platform package somewhere in your environment, ansible-runner is part of the magic in the background running your code. It is also a Python library that can connect your code directly to Ansible and provides an abstraction interface to it.
For those who do not want to go into Python programming just to play with Ansible, ansible-runner also has some other useful purposes: you can use it to encapsulate a single Ansible run, including all required variables and settings, into a single environment.
\\n\\n\\n\\nInstead of a playbook, ansible-runner requires a project folder which contains the required data for the Ansible run.
We create a quick setup in the /tmp/ansible-runner directory just to give an example. Something like this is already enough:
$ tree\\n.\\n├── inventory\\n│ └── hosts\\n└── project\\n └── playbook.yml\\n
\\n\\n\\n\\n# playbook.yml\\n---\\n- name: Example playbook\\n hosts: all\\n tasks:\\n - name: Debug output\\n ansible.builtin.debug:\\n msg: \\"The code runs.\\"\\n
\\n\\n\\n\\n# inventory/hosts\\nlocalhost ansible_connection=local\\n
\\n\\n\\n\\nWith these files in place, you can do this:
\\n\\n\\n\\n$ ansible-runner run /tmp/ansible-runner --playbook playbook.yml\\nPLAY [Example playbook] ********************************************************\\n\\nTASK [Gathering Facts] *********************************************************\\ntirsdag 27 februar 2024 14:01:03 +0100 (0:00:00.010) 0:00:00.010 *******\\ntirsdag 27 februar 2024 14:01:03 +0100 (0:00:00.010) 0:00:00.010 *******\\nok: [localhost]\\n\\nTASK [Debug output] ************************************************************\\ntirsdag 27 februar 2024 14:01:04 +0100 (0:00:01.068) 0:00:01.079 *******\\ntirsdag 27 februar 2024 14:01:04 +0100 (0:00:01.068) 0:00:01.078 *******\\nok: [localhost] => {\\n \\"msg\\": \\"The code runs.\\"\\n}\\n\\nPLAY RECAP *********************************************************************\\nlocalhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0\\nPlaybook run took 0 days, 0 hours, 0 minutes, 1 seconds\\ntirsdag 27 februar 2024 14:01:04 +0100 (0:00:00.042) 0:00:01.122 *******\\n===============================================================================\\nGathering Facts --------------------------------------------------------- 1.07s\\nDebug output ------------------------------------------------------------ 0.04s\\ntirsdag 27 februar 2024 14:01:04 +0100 (0:00:00.043) 0:00:01.122 *******\\n===============================================================================\\ngather_facts ------------------------------------------------------------ 1.07s\\nansible.builtin.debug --------------------------------------------------- 0.04s\\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\ntotal ------------------------------------------------------------------- 1.11s\\n
\\n\\n\\n\\nBy default, ansible-runner assumes playbooks are placed in the subdirectory ./project (this can be changed, of course). The hosts in inventory/hosts are loaded automatically and the playbook is executed against them.
All other settings required for an Ansible run, like secrets, environment variables or SSH keys for accessing hosts, can also be provided within the project directory structure.
\\n\\n\\n\\nRoles, just like playbooks, go into the ./project directory.
The outcome of the Ansible run creates a new directory in the project directory (not ./project
, but the one above it) called ./artifacts
. This directory contains all results, data and events that occured during the Ansible run in a parse-able and human-readable form.
With the proper configuration and settings, you can create encapsulated code environments, that can be deployed to a container or a remote system and the result can be parsed on for further use in e.g. a CICD pipeline.
\\n\\n\\n\\nUnfortunately, there are just too many parameters for a single blog entry to cover the full depth of what ansible-runner can do.
Take a look at the Ansible Runner demo repository to get an easy start and some more guidance. Have fun playing with it!
\\n\\n\\n\\nAuthor:
\\n\\n\\n\\nSenior Systems Consultant at Redpill Linpro
\\n\\n\\n\\nDaniel works with automation in the realm of Ansible, AWX, Tower, Terraform and Puppet. He works mainly with our customers in Norway, assisting them with their integration and automation projects.
\\n\\n\\n\\n\\n\\nCross-posted on the OpenCost blog by Ajay Tripathy
\\n\\n\\n\\nThe OpenCost project proudly announces that we’ve reached CNCF Incubating status! This milestone in our journey underscores the significant dedication the project has received from the community that contributes to OpenCost. We’d like to thank the developers, Kubernetes practitioners, and FinOps teams from organizations across the globe that continue to make this project meaningful.
\\n\\n\\n\\nCloud-native adoption continues accelerating, and the need for clear, manageable insights into Kubernetes costs keeps pace. OpenCost addresses this need as an open-source tool designed to make Kubernetes cost management more accessible and standardized. Initially launched through a collaborative effort led by Kubecost and supported by experts from organizations like Amazon, Adobe, Google, Microsoft, and SUSE, OpenCost was welcomed into the CNCF Sandbox to improve cost management for Kubernetes. Our promotion to CNCF Incubation reflects strong community support and the important challenges it seeks to address.
\\n\\n\\n\\nOpenCost fills a critical gap by providing real-time visibility into Kubernetes costs across multi-cloud environments. With its vendor-neutral framework, OpenCost enables teams to allocate costs by Kubernetes service, deployment, container, and more. By standardizing cost allocation, OpenCost helps reduce cost overruns and gives teams a trusted model for budget planning, regardless of whether they use AWS, Google Cloud, Microsoft Azure, or on-premises infrastructure.
\\n\\n\\n\\nProgressing from Sandbox to Incubation within the CNCF represents a vote of confidence from the cloud-native and open-source communities. Projects that reach this status are celebrated for their innovation and reliability, and they signal the potential for widespread adoption. For OpenCost, this transition validates our project’s solutions to Kubernetes cost challenges and recognizes the tool as becoming foundational for Kubernetes cost management at scale.
\\n\\n\\n\\nOpenCost’s growth has been remarkable. Here are some notable highlights:
\\n\\n\\n\\nAs an Incubating project, OpenCost’s future is bright and filled with opportunities to expand its capabilities. We look forward to developing new integrations, refining real-time cost monitoring, and offering deeper support for multi-cloud and hybrid-cloud environments. Our progress depends on the community of users, and we invite anyone interested in Kubernetes cost management to join us in building a sustainable, transparent future for Kubernetes operations.
\\n\\n\\n\\nOpenCost is powered by the passion and expertise of our community. We encourage you to explore OpenCost on GitHub and chat with us on CNCF Slack. Together, we’re creating an essential tool for Kubernetes teams worldwide, helping them manage cloud costs more effectively and transparently. Are you going to KubeCon NA 2024? Stop by to say ‘hi’ at the OpenCost kiosk in the CNCF Project Pavilion.
\\n\\nThis year, the CNCF refreshed the KCD (Kubernetes Community Days) program for 2025, offering more support to our organizers and their communities, including, but not limited to, financial assistance, structural improvements, and organizational resources. You can read more about the motivations and specifics behind these updates here.
\\n\\n\\n\\nFor those new to Kubernetes Community Day, KCDs are community-led events where open source enthusiasts and cloud native technologists come together for education, collaboration, and networking. Originally launched during the pandemic, the program has seen remarkable growth, expanding from 12 events in 2021 to 35 this year! These one- to two-day gatherings are locally-driven and supported by the CNCF to foster the growth of cloud native communities around the world.
\\n\\n\\n\\nAlong with the above changes, there will now be a formal review process each year. With 61 outstanding submissions from around the world and only 30 spots available, selecting the hosts for KCDs in 2025 was no easy task.
\\n\\n\\n\\nA committee of six CNCF staff members reviewed all submissions. We narrowed down applications based on completeness, CNCF ambassador or maintainer involvement, community engagement, regional representation, event timing, past event quality, and organizer team strength—including diversity, collaboration, and Code of Conduct adherence. From vibrant tech hubs to emerging cloud native communities around the globe, the 2025 KCDs will take place in cities across North America, Latin America, Europe, Asia Pacific, and Africa. Here’s a look at where and when each event will be happening. Please keep in mind some dates are subject to change:
\\n\\n\\n\\nIn March – Beijing, China
March 16 – Guadalajara, Mexico
March 22 – Rio de Janeiro, Brazil
April 23 – Budapest, Hungary
April 26 – Chennai, India
April 28 – Auckland, New Zealand
In May – New York, USA
May 6 – Helsinki, Finland
May 8 – San Francisco, USA
May 15 – Austin, USA
May 22 – Seoul, South Korea
May 23 – Istanbul, Turkey
May 31 – Heredia, Costa Rica
June 5 – Bratislava, Slovakia
June 6 – Bangalore, India
June 14 – Antigua Guatemala, Guatemala
June 19 – Nigeria, Africa
July 3 – Utrecht, The Netherlands
July 5 – Taipei, Taiwan
July 19 – Lima, Perú
August 29 – Bogota, Colombia
September 9 – Washington DC, USA
September 18 – Sofia, Bulgaria
September 20 – San Salvador, El Salvador
September 26 – Porto, Portugal
October 3 – Warsaw, Poland
October 8 – Colombo, Sri Lanka
October 21 – Edinburgh, UK
3rd fiscal quarter – Hangzhou, China
December 5 – Geneva, Switzerland
Thank you to all who submitted and contributed to this thriving program—we’re looking forward to an exciting 2025 with you!
\\n\\nCommunity post originally published on Medium by Giorgi Keratishvili
\\n\\n\\n\\nSo you want to pass the CGOA exam but are not sure where to start? Don’t worry, I will help you figure out what to pay attention to and share my experience. I was a beta tester for the CGOA and contributed to creating the exam, so I can speak from both sides about what to look for and what is generally expected from candidates who will pass it. But first, let’s start with the format of the exam and its difficulty level.
\\n\\n\\n\\nCertified GitOps Associate is an entry-level certification. Compared to CKAD/CKA/CKS, it is a multiple-choice, 90-minute, online proctored exam, and it is much easier than those; I would say it is on par with KCNA/KCSA. The main theme of this exam is to emphasize the Open GitOps standard and give candidates an understanding of the concepts, repository structuring, and general patterns to look for, all kept in a vendor-neutral format. That neutrality was the hard part, because discussions about which tool is best, Argo CD or Flux CD, could cause heated debates…
\\n\\n\\n\\nLet’s discuss the Certification Domains & Competencies:
\\n\\n\\n\\n**GitOps Terminology 20%**\\n\\n\\n\\n
Continuous
Declarative Description
Desired State
State Drift
State Reconciliation
GitOps Managed Software System
State Store
Feedback Loop
Rollback
**GitOps Principles 30%**
Declarative
Versioned and Immutable
Pulled Automatically
Continuously Reconciled
**Related Practices 16%**
Configuration as Code (CaC)
Infrastructure as Code (IaC)
DevOps and DevSecOps
CI and CD
**GitOps Patterns 20%**
Deployment and Release Patterns
Progressive Delivery Patterns
Pull vs. Event-driven
Architecture Patterns (in-cluster and external reconciler, state store management, etc.)
**Tooling 14%**
Manifest Format and Packaging
State Store Systems (Git and alternatives)
Reconciliation Engines (ArgoCD, Flux, and alternatives)
Interoperability with Notifications, Observability, and Continuous Integration Tools
As we can see, a big percentage is given to general principles, patterns and terminology, as it is crucial to understand which problems GitOps helps us solve, where it sits in our continuous deployment, and how it differs from traditional application deployment and CI/CD.
\\n\\n\\n\\nFor an experienced person who has been using GitOps for a while, the exam can feel intuitive, and the majority of questions should relate to day-to-day tasks they have already performed. But what if we are new to all this jazz? In that case we need to fill our gaps and dive deep into the cloud native ecosystem and GitOps; the best place to start exploring is the Open GitOps documentation.
\\n\\n\\n\\nThese four pillars are fundamentals, similar to the “10 commandments.” The whole philosophy is built around these principles, which are shared across all GitOps tools. Sounds like a cult, doesn’t it? Jokes aside, this approach and consistency helped shape how we deliver our deployment and management components in the simplest and easiest way at large scale, which fits well in the Kubernetes ecosystem.
\\n\\n\\n\\nOkay, at this point we are excited and want to rush into the exam, but where should we start preparing, especially when we do not have experience?
\\n\\n\\n\\nDocumentation is our best friend, so let’s start with the tool that started this whole GitOps and progressive delivery movement: Flux CD. It has a great introduction to the general concepts and patterns for structuring GitOps delivery, and as someone who has worked with Flux v1 and v2, I can say this documentation has progressed a lot over time and was one of my main reference points. The second great set of documentation is, of course, Argo CD, and the third I would recommend is Jenkins X. Besides documentation, the Linux Foundation also provides free training materials, which are a good addition. If you want something paid, a nice place to check is the KodeKloud catalog with their Argo CD course. To practice before taking the exam, I would recommend checking the Codefresh badge tests and materials for Argo CD. GitHub is also a good place to look for repos such as this one.
\\n\\n\\n\\nIf you prefer videos, I would recommend one of my favorites, the CNCF YouTube channel. They always upload conference videos from KubeCon and GitOpsCon, so I highly recommend checking them out, but do not get stuck in a YouTube rabbit hole. Remember, practice is everything, not watching someone else. So where could we practice? In this case I would highly recommend the Killercoda playground and scenarios; as far as materials go, that should be enough. One recommendation I can give is to pick a target day for the exam and schedule it before you start this whole journey. Having a timeline and the urge to study seriously helped me, in my experience.
\\n\\n\\n\\nSo it is exam day: how will it look? All Linux Foundation and CNCF certifications are online and proctored. Before the exam, check all the prerequisites and ensure the PSI secure browser is installed. Thirty minutes before the start, you will see the join button. I would recommend starting your exam early because there are always some issues (I say this as a person who has passed more than 15 PSI-proctored exams). In the room, you should not have any posters or a whiteboard, the desk should be clean, and nothing extra should be on it: if you are taking the exam from a laptop, then only the laptop; if using a PC, then a keyboard and mouse, but no pen, notebook or anything else. Proctors are very serious about this. You can have a glass of water, as the exam is long (90 minutes), but the most important thing is to have a good chair. After 24 hours, you will know your score; the passing score is >75%, and you will receive a Credly certificate.
\\n\\n\\n\\nAt this point, congrats, you have passed and obtained the certification. I hope this post was informative, and I would encourage you to learn more and take on bigger challenges, as there are many certifications from the Linux Foundation, CNCF, and the CD Foundation. Remember to stay curious and share your knowledge with others 🚀
\\n\\n\\n\\n\\n\\nMember post originally published on Elastisys’s blog by Cristian Klein
\\nI hear too many stories of platform teams being under-resourced. This usually manifests itself as an overworked platform team with unrealistic on-call rotations. Critical activities, such as disaster recovery drills, security patches and keeping the tech stack up-to-date get postponed over and over again.
\\n\\n\\n\\nIn fact, the CNCF Platform White Paper identified the following challenge with IT platforms:
\\n\\n\\n\\n\\n\\n\\n\\n\\nPlatform teams must seek support of enterprise leadership and show impact on value streams[…]
\\n\\n\\n\\nMany enterprise leaders perceive IT infrastructure as an expense quite disconnected from their primary value streams and may try to constrain costs and resources allocated to IT platforms, leading to a poor implementation, unrealized promises and frustration.
\\n
In this post, we will help CTOs, VPs of Engineering, and Software Architects explain the business value of a Kubernetes-based platform. I will go “full geek mode” and structure the whole blog post as an architectural decision record.
\\n\\n\\n\\nWe already wrote two blogs (blog 1 and blog 2) on the NIS2 directive. In short, if you run software and you are located in Europe, you’ll have to comply with the NIS2 directive.
\\n\\n\\n\\nNIS2 Article 21(2) lists 10 so-called minimum requirements, which are minimum security measures that essential and important organizations must implement. One of these requirements is “security in network and information systems acquisition, development and maintenance […]”.
\\n\\n\\n\\nThis line needs to be translated into more specific measures, such as Sweden’s MSBFS. Furthermore, these need to be translated into a specific policy of your organization. Let’s pretend that both of these were done. You’re likely left with implementing something along the lines of:
\\n\\n\\n\\n\\n\\n\\n\\n\\nAll communication across the Internet MUST be encrypted with HTTPS and at least TLSv1.2. Certificates MUST be rotated at least yearly. HSTS MUST be employed.
\\n
In case you wonder why I capitalized “MUST”, it’s not because I was shouting at you, but because: “The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.” (Did I remember to warn you about this post going “full geek mode”? 🤔)
\\n\\n\\n\\nFor the non-geeks in the audience, the requirement above essentially means encrypt data in transit over untrusted public networks, such as the Internet. This ensures that if someone checks their patient journal using the free WiFi of the coffee shop, only they can see this very confidential information and nobody else, not even the owner of the coffee shop who set up the free WiFi.
\\n\\n\\n\\nSo how should you implement this network security measure?
\\n\\n\\n\\nYour decision drivers are going to be unique to you, but the TL;DR of it is in the title of this section. You likely want to minimize the cost of implementing, rolling out and maintaining this security measure.
\\n\\n\\n\\nYou might also be worried that this is not the only security measure you will need to implement. The CISO is still drafting the “Use of cryptography” policy, which is also a NIS2 minimum requirement. Hence, you would like to find a solution which can help you gain “security agility” and cost-effectively roll out other security measures in the future.
\\n\\n\\n\\nNow you have a few choices:
\\n\\n\\n\\nIf you have access to a knowledgeable platform team, either in-house or via a supplier, then option 4 “Platform” is in my opinion the best. It allows developers to develop. It also puts clear responsibility on implementing this network security measure on the platform team. They can implement this quite quickly with a combination of Kubernetes, NGINX Ingress Controller and cert-manager.
\\n\\n\\n\\nHaving a platform allows you to factor out commonalities between your applications. When the CISO is done drafting the “use of cryptography” policy, you have a single place to check if the encryption algorithms employed by your tech stack conform to your policy. In case there is a gap, you have a single team which is involved in bringing your tech stack in compliance with your new policies.
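\\n\\n\\n\\nWhichever way the measure is implemented, it pays to verify it from the outside. Here is a small sketch, with a placeholder hostname, that checks HSTS is present and that an outdated TLSv1.1 handshake is refused (assuming your openssl build still supports attempting TLSv1.1):

```sh
#!/bin/bash
HOST=app.example.com   # placeholder; use your own public endpoint

# The response headers should include Strict-Transport-Security (HSTS)
curl -sI "https://${HOST}" | grep -i strict-transport-security

# A TLSv1.1 handshake should be rejected when only TLSv1.2+ is allowed
if openssl s_client -connect "${HOST}:443" -tls1_1 </dev/null >/dev/null 2>&1; then
  echo "TLSv1.1 accepted (non-compliant)"
else
  echo "TLSv1.1 rejected (as required)"
fi
```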
\\n\\n\\n\\nSadly, a platform won’t just magically fulfill all your policies. However, with proper safeguards in place, you can save application developers’ precious time and reduce human error when rolling out and enforcing new security measures.
\\n\\n\\n\\nIn the context of the upcoming NIS2 Directive, a platform allows you to have more security at a lower cost.
\\n\\n\\n\\nIn summary, with the right platform in place, you’ll reduce the burden to comply with regulations, such as the upcoming NIS2 or its Swedish implementation (Cybersecurity law – Cybersäkerhetslagen), ultimately reducing costs and minimizing risks.
\\n\\n\\n\\nIf you’re considering taking this step, Elastisys offers a Kubernetes platform designed to handle these complexities for you. It ensures compliance with regulations like NIS2 while letting your team focus on what they do best – innovating. Feel free to get in touch to see how we can help your organization optimize its platform strategy.
\\n\\nCommunity post by Pavan Navarathna Devaraj and Shwetha Subramanian
\\n\\n\\n\\nAI is an exciting, rapidly evolving world that has the potential to enhance every major enterprise application. It can enhance cloud-native applications through dynamic scaling, predictive maintenance, resource optimization, and personalized user experiences. However, many challenges still prevent mass adoption, particularly regarding infrastructure, operations, and data management. Fortunately, cloud-native infrastructure combined with open-source software, models, tools, and databases, enables experimental and production-ready AI models to be efficiently trained, tested, and deployed.
\\n\\n\\n\\nTraining machine learning models involves iterative runs on vast datasets. These models often generate high-dimensional data that’s stored in vector databases. The process of training, testing, and deploying AI models is resource-intensive, and requires significant compute power and GPU cycles. As these iterations accumulate, vector databases grow to hold the results of these expensive operations, making them invaluable to advancing AI workloads.
\\n\\n\\n\\nVector databases store high-dimensional vectors that represent unstructured data like text, images, and audio. These vectors enable similarity searches that are used in Retrieval Augmented Generation (RAG) to pull in relevant context from the massive datasets that are vectorized and stored in vector databases. The additional context helps improve the quality of responses generated by the Large Language Models (LLMs). Backing up these databases is essential for maintaining data integrity and preventing costly data loss that could disrupt AI applications.
\\n\\n\\n\\nBy keeping AI applications and vector databases together on cloud-native infrastructure, organizations can optimize operations and manage their infrastructure more easily. However, because of the sheer volume of high-dimensional embeddings stored in vector databases, data protection is critical to preserve iterative training results before terminating ephemeral compute resources. Losing this data could set back critical AI workloads and make robust backup and disaster recovery (DR) strategies indispensable.
\\n\\n\\n\\nAt KubeCon + CloudNativeCon North America 2024 in Salt Lake City, UT, our talk “Building Resilience: Effective Backup and Disaster Recovery for Vector Databases on Kubernetes”, will demonstrate an efficient and secure backup and restore strategy for a popular vector database using Kanister, an open-source CNCF Sandbox project. Kanister is a workflow management tool that simplifies data management on Kubernetes by offering the ability to perform atomic data operations on applications via custom resources called Blueprints and ActionSets.
\\n\\n\\n\\nHere’s how it works: ActionSets instruct the Kanister controller to execute an action, like a backup, while Blueprints define the steps needed to perform these actions on specific databases. During our talk, we will:
\\n\\n\\n\\nThis practical demonstration will provide you with a clear roadmap for using Kanister to protect your AI/ML data and ensure your resilience and efficiency.
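\\n\\n\\n\\nIf you want to poke at these custom resources before the session, a minimal sketch (assuming Kanister is installed in a namespace called `kanister`) looks like this:

```sh
# Kanister registers its Blueprint and ActionSet CRDs under the cr.kanister.io API group
kubectl get crds | grep kanister.io

# List the Blueprints and ActionSets the controller currently knows about
kubectl get blueprints.cr.kanister.io -n kanister
kubectl get actionsets.cr.kanister.io -n kanister
```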
\\n\\n\\n\\nIn a world where AI is transforming every industry, protecting the infrastructure that powers these models is essential. If you’re working with AI, handling vector databases, or managing applications on Kubernetes, this session is for you! By attending, you’ll learn:
\\n\\n\\n\\nJoin us for our talk, “Building Resilience: Effective Backup and Disaster Recovery for Vector Databases on Kubernetes”, at Salt Palace in Grand Ballroom GI, Salt Lake City, UT.
\\n\\n\\n\\nWe’ll show you how to future-proof your AI applications and secure your cloud-native infrastructure. Don’t miss this opportunity to learn how to safeguard your AI workloads and ensure business continuity in a cloud-native world!
\\n\\nProject post by the Vitess Maintainers
\\n\\n\\n\\nWe’re delighted to announce the release of Vitess 21 along with version 2.14.0 of the Vitess Kubernetes Operator.
\\n\\n\\n\\nVersion 21 focuses on enhancing query compatibility, improving cluster management, and expanding VReplication capabilities, with experimental support for atomic distributed transactions and recursive CTEs. Key features include reference table materialization, multi-metric throttler support, and enhanced online DDL functionality. Backup and restore processes benefit from a new mysqlshell engine, while vexplain now offers detailed execution traces and schema analysis. The Vitess Kubernetes Operator introduces horizontal auto-scaling for VTGate pods and Kubernetes 1.31 support, improving overall scalability and deployment flexibility.
\\n\\n\\n\\nLet’s take a deeper look at some key highlights of this release.
\\n\\n\\n\\n\\n\\n\\n\\nWe’re reintroducing atomic distributed transactions with a revamped, more resilient design. This feature now offers deeper integration with core Vitess components and workflows, such as OnlineDDL and VReplication (including operations like MoveTables and Reshard). We have also greatly simplified the configuration required to use atomic distributed transactions. This feature is currently in an experimental state, and we encourage you to explore it and share your feedback to help us improve it further.
\\n\\n\\n\\nVitess 21 introduces experimental support for recursive CTEs, allowing more complex hierarchical queries and graph traversals. This feature enhances query flexibility, particularly for managing parent-child relationships like organizational structures or tree-like data. As this functionality is still experimental, we encourage you to explore it and provide feedback to help us improve it further.
\\n\\n\\n\\nWe have added a new metric in VTOrc that shows the count of errant GTIDs in all the tablets for better visibility and alerting. This will help operators to track and manage errant GTIDs across the cluster.
\\n\\n\\n\\nVitess provides Reference Tables as a mechanism to replicate commonly used lookup tables from an unsharded keyspace into all shards in a sharded keyspace. Such tables might be used to hold lists of countries, states, zip codes, etc, which are commonly used in joins with other tables in the sharded keyspace. Using reference tables allows Vitess to execute joins in parallel on each shard thus avoiding cross-shard joins. Previously, we recommended creating Materialize workflows for reference tables, but did not provide an easy way to do so. In v21 we have added explicit support to the Materialize command to replicate a set of reference tables into a sharded keyspace.
\\n\\n\\n\\nPreviously, many configuration options for VReplication workflows were controlled by VTTablet flags. This meant that any change required restarting all VTTablets. We now allow these to be overridden while creating a workflow or updated dynamically once the workflow is in progress.
\\n\\n\\n\\nThe tablet throttler has been redesigned with new multi-metric support. With this, the throttler now handles more than just replication lag or custom queries, but instead can work with multiple metrics at the same time, and check for different metrics for different clients or for different workflows. This gives users better control over the throttler allowing them to fine-tune its behavior based on their specific production requirements.
\\n\\n\\n\\nSeveral new metrics have been introduced in v21, with plans to expand the list of available metrics in later versions.
\\n\\n\\n\\nThe multi-metric throttler in v21 is backward compatible with the v20 throttler. It is possible to have a v20 primary tablet collecting throttler data from a v21 replica tablet, and vice versa. This backward compatibility will be removed in v22, where all tablet throttlers will be expected to communicate multi-metric data.
\\n\\n\\n\\nOther key throttler changes:
\\n\\n\\n\\nSeveral bug fixes and improvements, including:
\\n\\n\\n\\nIntroducing an experimental mysqlshell engine. With this engine it is possible to run logical backups and restores. The mysqlshell engine can be used to create full backups, incremental backups and point in time recoveries. It is also available to use with the Vitess Kubernetes Operator.
\\n\\n\\n\\nThe mysqlshell engine work was contributed by the Slack engineering team.
\\n\\n\\n\\nThe new vexplain trace command provides deeper insights into query execution paths by capturing detailed execution traces. This helps developers and DBAs analyze performance bottlenecks, review query plans, and gain visibility into how Vitess processes queries across distributed nodes. The trace output is delivered as a JSON object, making it easy to integrate with external analysis tools.
\\n\\n\\n\\nThe new vexplain keys feature helps you analyze how your queries interact with your schema, showing which columns are used in filters, groupings, and joins across tables. This tool is especially useful for identifying candidate columns for indexing, sharding, or optimization, whether you’re using Vitess or a standalone MySQL setup. By providing a clear view of column usage, vexplain keys makes it easier to fine-tune your database for better performance, regardless of your backend infrastructure.
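\\n\\n\\n\\nBoth variants are issued like ordinary statements through any MySQL-protocol client connected to vtgate. The connection details and queries below are placeholders, purely to illustrate the shape of the commands:

```sh
# Trace how a query is executed across shards (placeholder vtgate host/port and schema)
mysql -h 127.0.0.1 -P 15306 -u user --table \
  -e "vexplain trace select c.name, count(o.id) from customers c join orders o on o.customer_id = c.id group by c.name"

# Summarize which columns the query uses for filtering, grouping, and joins
mysql -h 127.0.0.1 -P 15306 -u user --table \
  -e "vexplain keys select c.name from customers c where c.region = 'emea'"
```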
\\n\\n\\n\\nVitess v21.0.0 comes with a companion release of the vitess-operator v2.14.0. In v2.14 we have added the ability to horizontally scale the VTGate deployment using an HPA. We have upgraded the supported version of Kubernetes to the latest version (v1.31). We have added a feature that allows users to select Docker images on a per-keyspace basis instead of a single setting for the entire cluster.
\\n\\n\\n\\nNew VTAdmin pages have been added for creating, monitoring and managing VReplication Workflows. We have also added a dashboard to view and conclude distributed transactions.
\\n\\n\\n\\nAs an open-source project, Vitess thrives on the contributions, insights, and feedback from the community. Your experiences and input are invaluable in shaping the future of Vitess. We encourage you to share your stories and ask questions, on GitHub or in our Slack community.
\\n\\n\\n\\nFor a seamless transition to Vitess 21, we highly recommend reviewing the detailed release notes. Additionally, you can explore our documentation for guides, best practices, and tips to make the most of Vitess 21. Whether you’re upgrading from a previous version or running Vitess for the first time, our resources are designed to support you every step of the way.
\\n\\n\\n\\nThank you for your support and contributions to the Vitess project!
\\n\\n\\n\\nThe Vitess Maintainer Team
\\n\\nThe CNCF Technical Oversight Committee (TOC) has voted to accept Flatcar as a CNCF incubating project.
\\n\\n\\n\\nFlatcar is a zero-touch, minimal operating system (OS) for containerized workloads, addressing the challenges of managing and securing a production fleet at scale. It is meant to be deployed the same way cloud native applications are deployed: by applying a declarative configuration, creating an immutable instance from a well-defined image.
\\n\\n\\n\\n“A secure community-owned cloud native operating system was one of the missing layers of the CNCF technology stack,” said Chris Aniszczyk, CTO of CNCF. “As validated by a thorough due diligence process, Flatcar has more than proven itself in this role, and we are thrilled to adopt it as an Incubating project and will support growing its community.”
\\n\\n\\n\\nFlatcar was originally created by the team at Kinvolk, a Berlin-based cloud native technology company that is now a part of Microsoft, as a derivative of CoreOS Container Linux. Flatcar is a popular base operating system for Kubernetes, and is closely integrated with Cluster API for streamlined deployments.
\\n\\n\\n\\nMain Features:
\\n\\n\\n\\nFlatcar has experienced significant success with end user adoption including by Adobe (SaaS provider, with more than 20,000 nodes running Flatcar), Stackit (managed Kubernetes service), and Wipro (managed PostgreSQL service).
“Flatcar Container Linux offers the security, robustness, and efficiency required by various critical workloads, including those utilized within the defence industry of Ukraine. Its acceptance into the CNCF as an incubation-level project, under vendor-neutral and community-driven governance, ensures that many users, including the Ukrainian defence sector, can continue to benefit from its reliability and performance in modern cloud-native environments.” – Ihor Dvoretskyi, Directorate of the Digital Transformation in the Defenсe Area at the Ministry of Defence of Ukraine, and Senior Developer Advocate at Cloud Native Computing Foundation
\\n\\n\\n\\n“Equinix is excited to see Flatcar’s acceptance by CNCF, and proud to be major supporters ourselves through the contribution of build, test, and distribution of cloud infrastructure as part of the Equinix Open Source Partner program,” said Eduardo Cocozza, Vice President of Developer and Product Led Growth Marketing at Equinix.
\\n\\n\\n\\n“Adobe leverages Flatcar as the host operating system for self-managed Kubernetes deployments across our multi-cloud environment, including Microsoft Azure,” said Joseph Sandoval, Principal Product Manager at Adobe and End User Advisory Board Member at CNCF. “We have proven it out at very large scale, and been really impressed both with how Flatcar simplifies our operations and how the project has matured and evolved to stay at the forefront of Linux OS development with capabilities such as Cluster API and system extensions. Adoption by the CNCF is the next logical step, and we are happy to endorse and support that move as a CNCF End User member.”
\\n\\n\\n\\nNotable Milestones:
\\n\\n\\n\\nFlatcar has hit several milestones in the last several months, which have contributed to the project’s move to the incubator.
\\n\\n\\n\\nThe Flatcar roadmap is focused on expanding the range of system extensions to encompass a wider variety of use cases; evolving the Flatcar CAPI implementation, leveraging system extensions to enable independent updates of control plane and operating system; and support for greater security controls including secure boot, disk encryption, and integrity measurement architecture (IMA). The latest roadmap is available on GitHub: https://github.com/orgs/flatcar/projects/7, and discussed in the project’s public release planning meetings.
As a CNCF-hosted project, Flatcar is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. Flatcar joins incubating technologies Backstage, Buildpacks, cert-manager, Chaos Mesh, Cloud Custodian, Container Network Interface (CNI), Contour, Cortex, Crossplane, CubeFS, Dapr, Dragonfly, Emissary-Ingress, gRPC, in-toto, Karmada, Keptn, Keycloak, Knative, KubeEdge, Kubeflow, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenFeature, OpenKruise, OpenMetrics, OpenTelemetry, Operator Framework, Strimzi, Thanos, and Volcano. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.
Co-chairs: Tina Tsou and Mars Toktonaliev
\\n\\n\\n\\nNovember 12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nKubernetes on Edge Day demonstrates that edge computing is here, and it’s powered by Kubernetes. We’re showcasing real-world use cases, best practices, and cutting-edge technologies that are driving the adoption of Kubernetes at the edge. This is a community-driven event: we’ve curated a program featuring industry leaders, developers, and end users who are passionate about shaping the future of edge computing. We really hope you’ll have fun, connect, and learn with fellow practitioners. This event first took place during KubeCon EU in 2021.
\\n\\n\\n\\nAnyone involved in deploying, managing, or developing applications for edge environments will get a lot out of attending our event. This includes DevOps engineers, SREs, developers, architects, and technical decision-makers.
\\n\\n\\n\\nWe hope to provide a progression in the conversation around Kubernetes at the edge, moving from introductory topics to more advanced applications and real-world implementations covering diverse areas such as federated machine learning, operating kiosks or revolutionizing specific industries like cargo shipment.
\\n\\n\\n\\nKubernetes on Edge will be a half-day event focusing specifically on the use of Kubernetes for managing and orchestrating applications and workloads at the edge. We’ll start at 1:25pm and will be done before the end of the day. Here’s the full schedule.
\\n\\n\\n\\nNo preparation is needed, and please remember that all presentations will be recorded and later published on CNCF’s YouTube channel.
\\n\\n\\n\\nSubmitted by Mars Toktonaliev, who is looking forward to seeing old friends, making new friends, and of course all the canyons!
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nMember post originally posted on the Logz.io blog by Asaf Yigal
\\n\\n\\n\\nGenAI promises evolutionary changes in how we use observability tools, but meeting expectations means heeding the lessons of our AIOps mistakes.
\\n\\n\\n\\nThe emergence of generative AI in observability tools was inevitable, but there’s already been an extreme degree of hype in the market. Monitoring, DevOps and ITOps have never been immune to trends, and with GenAI capabilities, the propaganda hype machine is running out of control.
\\n\\n\\n\\nOrganizations looking to ride the wave of GenAI undoubtedly recall the massive hype around AIOps tools in the not-so-distant past. The core purpose of AIOps was to address the complexity, volume and velocity of operational telemetry, enabling proactive incident response and reducing manual intervention.
\\n\\n\\n\\nMany believed that AIOps was the future that could solve problems within systems, but adoption lagged because AIOps didn’t meet the needs of critical IT use cases. What were organizations trying to get out of AIOps? What were the right tools? Those questions were never answered.
\\n\\n\\n\\nTo succeed, AIOps needed organizations to change their processes, and many organizations were reluctant to do that. Failure to realize benefits from those solutions wasn’t due to the technology — it was because organizations weren’t making the changes required to get those benefits.
\\n\\n\\n\\nOrganizations are looking for productivity gains in their IT environments. Many ask: “How can we complete tasks faster? How can we increase our time-to-value? What can we do to remediate issues faster so we can focus on the core priorities of our business?”
\\n\\n\\n\\nGenAI and AI-powered observability tools can help in all of these areas. Surfacing insights about system behavior — and providing direct knowledge on how to remediate issues that arise in telemetry data (logs, metrics and traces) — is what observability should provide.
\\n\\n\\n\\nTraditionally, these insights haven’t been available to anyone except technical experts and analysts who understand complex query language or have an intimate understanding of the telemetry data flowing through a system. But what if AI-powered observability can take things even a step further? What if you could interact using natural language with your system?
\\n\\n\\n\\nThere’s potential for these tools to open up deeper insights to a much broader user base. This could significantly increase awareness of system behavior, democratize observability to nontechnical users and provide greater understanding of points of failure or difficulty in environments.
\\n\\n\\n\\nIn an era of IT staffing knowledge gaps and hiring difficulties, AI-powered observability could fill some of those needs. What would it mean for your team to have the equivalent of a junior developer working directly within your technology platform?
\\n\\n\\n\\nThe strongest applications of observability today involve strategic capabilities delivered through GenAI integration. These range from automatic collection of relevant contextual insights and anomaly detection, to the ability to pinpoint critical data to optimize data and costs.
\\n\\n\\n\\nAI-powered capabilities can transform the day-to-day interactions of engineering and DevOps teams by reinventing core monitoring and troubleshooting practices, spanning from querying to root cause analysis.
\\n\\n\\n\\nThese types of AI-powered systems — with full dashboarding, data visualizations and answers to pressing questions in seconds — can help meet the promise AIOps was intended to provide.
\\n\\n\\n\\nThe core idea of AIOps is to pull in as much telemetry data as possible to identify anomalies. However, this is different from what observability solutions provide. Observability provides services on selective telemetry data and displays real-time metrics, such as CPU usage or other areas of interest.
\\n\\n\\n\\nWhile incorporating AI for anomaly detection within these metrics might seem like an AIOps feature, it actually is an enhancement to an observability solution. In contrast, AIOps starts with AI and might not offer a single dashboard.
\\n\\n\\n\\nThe lessons from AIOps must be applied to the next generation of observability tools for them to help organizations meet varied and intricate use cases around cloud-native, ephemeral architectures.
\\n\\n\\n\\nThanks to GenAI, there is potential for evolutionary changes in the way we interact with our observability tools, as well as revolutionary changes in how we organize our operations teams.
\\n\\n\\n\\nWe’re already seeing benefits of bringing GenAI into observability tools, such as:
\\n\\n\\n\\nIt is one thing to talk about implementing these capabilities and another to take advantage of them. The question remains about what benefit organizations realistically can get from these shifts. Use cases have to be met, and productivity gains must be realized. It can be challenging for organizations to understand and accept the necessary changes; if the barriers are too great, the benefits won’t materialize.
\\n\\n\\n\\nThe next-generation approach to system monitoring and management, which leverages GenAI and machine learning to automatically detect, diagnose and resolve issues without human intervention, isn’t far off. This evolution will allow technical teams to focus on strategic tasks while ensuring optimal system performance and reliability.
\\n\\n\\n\\nTeams are best served by remembering the successes and failures of past rapid technology shifts. Be prepared to shift mindsets across an organization to meet your goals.
\\n\\nMember post originally published on the EJBCA by Keyfactor and Chainloop blogs by Ben Dewberry, Product Manager, Signing and Key Management, Keyfactor and Miguel Martinez Trivino, Co-founder, Chainloop
\\n\\n\\n\\nA software supply chain is the series of steps performed when writing, testing, packaging, and distributing software. During this process, information about what and how the software is built is generated at each step. This is called metadata.
\\n\\n\\n\\nA Software Bill of Materials (SBOM) is a canonical example of supply chain metadata, but more examples are scattered across your Software Delivery lifecycle, from QA tests/reports, CVE scans, VEX, legal/security/architecture reviews, etc.
\\n\\n\\n\\nCompanies are starting to rely on this metadata to make critical decisions, such as whether to deploy an app to a bank system or whether it is vulnerable, compliant, etc. These decisions could be motivated by wanting to improve your security posture and reach a specific SLSA level or purely pushed by existing regulations such as EO 14028 in the US or upcoming ones like the European Cyber Resilience Act.
\\n\\n\\n\\nHowever, metadata that cannot be trusted is useless and even harmful.
\\n\\n\\n\\nThat is why Chainloop and Keyfactor have partnered by integrating Chainloop’s modular evidence-based metadata storing platform with Keyfactor’s enterprise PKI solutions, EJBCA and SignServer. These integrations will allow organizations to collect, verify, trust, and protect the metadata generated by their software supply chain.
\\n\\n\\n\\nToday, we are introducing two integrations: Remote signing with Keyfactor’s SignServer, and Local signing with Keyfactor’s EJBCA using ephemeral certificates.
\\n\\n\\n\\nAfter implementing this integration, you will have a solid foundation on which to begin making this metadata actionable for implementing automated policies controlling what is allowed to be run in your environments.
\\n\\n\\n\\nChainloop is an open-source evidence store for Software Supply Chain attestations, Software Bill of Materials (SBOMs), VEX, SARIF, CSAF files, QA reports, and more. Metadata sent to Chainloop will get attested, signed, evaluated, routed, and stored.
\\n\\n\\n\\nChainloop offers an opinionated but pluggable end-to-end solution. In this blog post, we will give you an overview of configuring a Chainloop instance with SignServer and EJBCA. The result is an end-to-end solution that creates in-toto attestations signed with SignServer and EJBCA and stores them in an OCI registry.
\\n\\n\\n\\nBefore you begin, you need SignServer, EJBCA, and Chainloop running.
\\n\\n\\n\\nYou should now have a SignServer, EJBCA, and a Chainloop instance running, but Chainloop is not configured to use them yet! By default, Chainloop will sign attestations using Sigstore’s Cosign key pairs.
\\n\\n\\n\\nLet’s change that! Keyfactor and Chainloop have partnered to offer two integrations to sign attestations and artifacts, allowing you to bring your own PKI: Remote signing with SignServer and Local Signing with an EJBCA ephemeral certificate.
\\n\\n\\n\\nSignServer is a versatile and high-performing open-source code-signing software that enables secure cryptographic signing operations. It digitally signs different types of workloads including code, artifacts, attestations, and containers to ensure software integrity and authenticity. For more information, see signserver.org.
\\n\\n\\n\\nThis integration allows users to send the attestation payload to a SignServer worker before sending it to Chainloop for storage. Think of this as a KMS-like approach, where the client environment can access the PKI infrastructure and send the data for remote signing.
\\n\\n\\n\\nSome of the benefits of this approach:
\\n\\n\\n\\nThe guide: Use Keyfactor SignServer for attestation signing explains this integration in more detail, but the gist is that you should be able to run the attestation push command and have a remote SignServer worker sign the attestation, like this:
\\n\\n\\n\\n> chainloop attestation push --key signserver://mysignserver/my-signing-worker
\\n\\n\\n\\nThen, the integration will send the payload to sign to SignServer, retrieve the signature, and craft and store the attestation DSSE envelope.
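\\n\\n\\n\\nFor context, a DSSE envelope is a small wrapper around the signed payload. The sketch below shows its general shape in YAML-style notation for readability (on the wire it is JSON); the values are truncated placeholders, not real output from the integration:
\\n\\n\\n\\n# Rough shape of a DSSE envelope (illustrative placeholders only)\\npayloadType: application/vnd.in-toto+json\\npayload: eyJfdHlwZSI6...   # base64-encoded in-toto statement, truncated\\nsignatures:\\n - keyid: my-signing-worker   # placeholder key identifier\\n   sig: MEUCIQ...   # signature bytes returned by the SignServer worker, truncated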
\\n\\n\\n\\nTo verify the payload, just instruct Chainloop to do it using the public key and CA chain. The CA chain is provided by EJBCA, which also issued the signing certificate to the SignServer worker.
\\n\\n\\n\\n> chainloop workflow run describe --digest sha256:a1b2c3 \\\\\\n --verify true \\\\\\n --cert my-worker-key.pem \\\\\\n --chain ManagementCA.pem
\\n\\n\\n\\nEJBCA is a flexible, scalable, open-source PKI (Public Key Infrastructure) system that manages certificates for a wide range of use cases, such as Kubernetes workloads and digital signing. For more information, see ejbca.org.
\\n\\n\\n\\nWith this integration, Chainloop can be configured to generate short-lived signing certs by using EJBCA as the certificate authority, enabling a user experience similar to Sigstore Fulcio’s “keyless” approach.
\\n\\n\\n\\nThere are some key differences compared to the SignServer approach.
\\n\\n\\n\\nTo enable this feature, you’ll need to add your EJBCA settings to your Chainloop Helm Chart configuration, as explained in the guide: Configure Chainloop to use EJBCA as CA. Once you have done so, the attestation process will not require providing any signing material; the resulting attestation will be automatically signed.
\\n\\n\\n\\n> chainloop attestation push
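\\n\\n\\n\\nUnder the hood, this relies on the EJBCA settings you added to the Helm chart. For orientation only, here is a rough sketch of the kind of values involved; the key names below are illustrative assumptions rather than the chart’s actual schema, so follow the guide above for the exact settings:
\\n\\n\\n\\n# Hypothetical Helm values sketch - key names are illustrative, not the authoritative Chainloop chart schema\\ncontrolplane:\\n keylessSigning:\\n  enabled: true\\n  backend: ejbca   # use EJBCA to issue short-lived signing certificates\\n  ejbca:\\n   serverURL: https://ejbca.internal.example.com/ejbca\\n   certificateProfileName: ShortLivedSigning\\n   endEntityProfileName: ChainloopSigners\\n   certificateAuthorityName: ManagementCA\\n   clientCertSecretName: ejbca-client-tls   # client certificate Chainloop uses to authenticate to EJBCA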
\\n\\n\\n\\nBuilding a trusted, robust security framework for your software supply chain requires solutions that enforce security policies across your entire infrastructure, including PKI, signing, and the evidence store, and with the integrations presented today you get:
\\n\\n\\n\\nCombining Chainloop with EJBCA and SignServer for PKI gives you the foundation to enable automated compliance and security in your organization. This gives you a head start and future-proofs you for upcoming regulations or security practices.
\\n\\n\\n\\nIn the next blog post, we’ll show you how to take security and compliance to the next level by running Rego Policies against metadata signed using EJBCA and SignServer ephemeral certificates in a segmented architecture with all keys HSM-protected and centrally stored. Stay tuned!
\\n\\n\\n\\nSoftware Attestation: A software attestation is an authenticated statement (metadata) about a software artifact or collection of software artifacts. For more information, see slsa.dev. The primary intended use case is to feed into automated policy engines, such as in-toto and Binary Authorization.
\\n\\n\\n\\n
One of the most popular attestation implementations is in-toto, an open-source framework for modeling software supply chain metadata, steps, and interrelations.
Open Policy Agent and Rego: Rego is a language used to write rules for the Open Policy Agent (OPA), a tool that helps control what actions are allowed or denied in a system. Rego policies define conditions that decide if something is permitted, such as who can access certain data or perform specific actions. For more information, see openpolicyagent.org.
\\n\\n\\n\\nVulnerability Exploitability eXchange (VEX): A VEX document (Vulnerability Exploitability eXchange) is a document that communicates the status of vulnerabilities in software products. Produced by vendors or maintainers, it clarifies whether specific vulnerabilities affect their products and provides guidance on mitigation or resolution.
\\n\\n\\n\\nBen Dewberry
\\n\\n\\n\\nProduct Manager, Signing and Key Management, Keyfactor
\\n\\n\\n\\nMiguel Martinez Trivino, Co-founder, Chainloop
\\n\\nMember post originally published on Cerbos’s blog by Twain Taylor
\\n\\n\\n\\nTraditional security models, which rely on perimeter-based defenses, have proven to be quite inadequate in the face of sophisticated attacks and the growing adoption of cloud computing and remote work. This shift has given rise to an altogether new approach to security: zero trust authorization.
\\n\\n\\n\\nThe zero trust authorization (ZTA) philosophy represents a seismic shift in cybersecurity, challenging the age-old practice of inherent trust within network boundaries. Instead, it is founded on 3 core principles:
\\n\\n\\n\\nTraditional models have largely relied on one-time authentication at the perimeter and granted broad network access to authenticated users. In contrast, zero trust authorization regularly verifies the identity and permissions of users and devices, and grants access to resources based on granular, policy-based controls. To that end, it relies on several key components:
\\n\\n\\n\\nThese models factor in key vectors such as user identity, device health, and application requirements to make informed access decisions.
\\n\\n\\n\\nThe traditional perimeter-based security model has long been the standard for protecting corporate networks. However, the rise of cloud computing, remote work, and the growing sophistication of cyber threats have exposed the inadequacies of relying solely on these defenses.
\\n\\n\\n\\nThe previous model’s inherent trust in users and devices within the network made it particularly vulnerable to insider threats and compromised accounts. And without granular control and visibility, organizations may find themselves at a heightened risk of data breaches and intellectual property theft, as malicious actors can operate undetected within the trusted network.
\\n\\n\\n\\nThis challenge was further compounded by the blurring of network boundaries, which made securing remote access and cloud-based resources increasingly difficult. While traditional VPNs and firewalls remain necessary, they may not be sufficient to defend against sophisticated threats that can circumvent perimeter defenses and exploit weaknesses in remote access systems.
\\n\\n\\n\\nZTA implementation is one of those journeys that requires careful planning, execution, and continuous improvement. Broadly speaking, here’s what’s needed to reap the benefits of this perimeter-less security model:
\\n\\n\\n\\nFirst of all, identify what needs protecting, and then conduct a comprehensive assessment of all the sensitive data, critical assets, and workflows. You’ll have to map out the entire network architecture, triangulate vulnerabilities, and as you get into the thick of it, prioritize the resources that require the highest level of protection.
\\n\\n\\n\\nWith the assessment complete, the next step is to architect a Zero Trust Network Access (ZTNA) framework that enforces the granular access control policies we discussed earlier. Here, we will segment the network into smaller, siloed zones, and define access policies that grant users and devices only the permissions necessary to perform their intended functions. Policy-based access control models come in handy here, and we can rely on models such as RBAC, ABAC, or ReBAC to ensure that the principle of least privilege is applied consistently across the organization. To streamline the implementation process, developers can make use of Cerbos’s Policy Decision Point (PDP) engine to easily integrate fine-grained access control into their applications.
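\\n\\n\\n\\nTo make this concrete, here is a minimal sketch of a Cerbos resource policy that encodes least privilege for a single resource type; the resource, actions, roles, and attributes are invented for the example:
\\n\\n\\n\\n# Minimal Cerbos resource policy sketch - resource, actions, roles and attributes are illustrative examples\\napiVersion: api.cerbos.dev/v1\\nresourcePolicy:\\n version: default\\n resource: payment_record\\n rules:\\n  - actions:\\n     - view\\n    effect: EFFECT_ALLOW\\n    roles:\\n     - finance_analyst\\n     - finance_admin\\n  - actions:\\n     - update\\n     - delete\\n    effect: EFFECT_ALLOW\\n    roles:\\n     - finance_admin\\n    condition:\\n     match:\\n      expr: request.resource.attr.region == request.principal.attr.region   # restrict changes to the same region as the principal
\\n\\n\\n\\nAt runtime, the application asks the Cerbos PDP whether a given principal may perform an action on a resource, and the PDP evaluates policies like this one to return an allow or deny decision, keeping authorization logic out of application code.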
\\n\\n\\n\\nA successful ZTNA implementation will demand multiple iterations of refinement through continuous monitoring. In practice, organizations must regularly review and update their access policies to ensure that they remain aligned with evolving business requirements and security best practices. This includes everything from monitoring user and device activity to analyzing security logs and conducting regular audits to identify potential vulnerabilities or policy violations.
\\n\\n\\n\\nIt’s worth noting that these steps provide only a general blueprint and most organizations will need a more tailored approach to address their specific risks and requirements.
\\n\\n\\n\\nZero trust authorization brings to the table several key advantages that help organizations strengthen their security posture and adapt to modern business requirements:
\\n\\n\\n\\nEven with all its many benefits, implementing ZTA comes with several hurdles:
\\n\\n\\n\\nAs we look to the future, all signs indicate the need for ZTA will only continue to grow, and organizations must adapt to evolving threats and embrace a finer, more context-aware approach to access control.
\\n\\n\\n\\nFor developers, implementing ZTA can be daunting, especially given the need to balance security with scalability and agility. But Cerbos can go a long way toward simplifying the process. It offers a powerful solution that streamlines the integration of roles, permissions, and access control mechanisms, making it easier than ever to implement a true zero trust architecture.
\\n\\nBy Jorge Castro, Developer Relations at CNCF
The Project Pavilion is our dedicated space on the show floor for CNCF Projects. Since there are over 200 projects, the Pavilion is a rapidly changing landscape: projects shift in and out in the mornings and afternoons. You will want to drop by often!
\\n\\n\\n\\nIts purpose is to act as a direct means for attendees to interact with CNCF Projects. Additionally, it acts as our rallying point for CNCF Maintainers and Ambassadors. We want to create a mix of creators, consumers, and advocates, and we all wear our “upstream hats.” There’s no commercial selling in the Project Pavilion; it’s an open invitation for everyone to actively participate in open source so that they can understand how open source works. If you’ve been to a farmer’s market before, it’s like that, but for software.
\\n\\n\\n\\nFor the projects, it’s their chance to show off the things they work on and share, as well as give introductory material to attendees. For end users, it offers a “live R&D lab”, where they can check in on the projects that they depend on and support, but also dip their toes into the CNCF Sandbox and see what all the “cool kids” are into.
\\n\\n\\n\\nAttendees get the best benefits because KubeCon + CloudNativeCon can be so overwhelming, especially for new folks. If you ask someone who has been to many of them they will commonly say that the “hallway track” is where it is at. We wanted to revitalize the Pavilion to be the place where new attendees can start their journey, by directly interacting with the projects and maintainers to get a grasp of the big picture. A CNCF Ambassador is usually close by to help out because it’s always better when you have a guide with you.
\\n\\n\\n\\nIf you are technically minded and want direct access to CNCF Contributors, then the Pavilion is the place to find them. If you want to get right into the weeds then this is where you want to be. We’ve got plenty of space and power for you to break off and do focused hacking, or just hang out and get to know each other.
\\n\\n\\n\\nThe Project Lightning talks are a great way to get ready for KCCNC. Since they are the day before and the show floor is closed, they are a good primer. Each project does a 7-minute presentation. You can find the full schedule here, and as always, we record each one.
\\n\\n\\n\\nAnd we have two fabulous new hosts, Lori Lorusso and Katherine Druckman, who will be doing the introduction and covering it via the “Hitchhiker’s Guide to the CNCF Landscape.” We strongly recommend this session to new attendees so that you can have a good idea of what you are heading into before the show floor opens up.
\\n\\n\\n\\nAfter you’ve gotten your initial orientation at the Lightning Talks you can come check out the new things at the Pavilion. This year we added an entire stage with screens and plenty of seating. This stage is available exclusively for CNCF Maintainers to run additional demos and talks throughout the day, right next to the Pavilion. There will likely be a demo every 20 minutes throughout Wednesday, Thursday, and Friday.
\\n\\n\\n\\nWe purposely run this one as a bit of an unconference. These talks aren’t on the schedule and provide attendees a chance to experience what happens when open source maintainers speak passionately about their projects “live.” This is our first time trying this, and we strongly recommend stopping by for at least some of the talks.
\\n\\n\\n\\nAnd this year we’re happy to partner with Unified Patents, which will be running office hours and sessions throughout the week educating people about the importance of invalidating bad software patents. They will even show you how to search for prior art and we’ll be announcing a contest with prizes pretty soon!
\\n\\n\\n\\nDon’t forget to take a tour!
\\n\\n\\n\\nThis year we also added more tours by CNCF Ambassadors. Tours will be running daily and there will be a “safari tour” of the CNCF Landscape. They run Wednesday, Thursday, and Friday and you can find them on this section of the schedule. This is also the first time we’ve run this and the Ambassadors are already excited to show you what’s out there.
\\n\\n\\n\\nThis gives attendees an option to have someone with cloud native experience walk them through the myriad of options available, give them guidance, and help them connect with the projects and people that are important to know. This one is also a strong recommendation for new attendees!
\\n\\n\\n\\nIf it’s your first KubeCon this is a great place to start. We see cloud native as a vast ecosystem of projects and people, all looking for opportunities to learn something new. Come with a beginner’s mind and leave plenty of room for stickers in your bag!
\\n\\n\\n\\nThe Project Pavilion is ideally positioned in the middle of the show floor. We have the best couches in the conference center and we’re close to caffeine and the bathrooms. The hugely popular sticker dispensers will be returning, as will our job board, and everything is centrally located.
\\n\\n\\n\\nOur hope is that you see the Pavilion as a lens right into the cloud native ecosystem, not just to observe but to participate and meet the people who are waiting to work with you! A CNCF Ambassador, Maintainer, or Staffer is almost always an arm’s length away to help guide you. We’ll see you there!
\\n\\nWe have many exciting new events happening in this Salt Lake City KubeCon, as well as a number of unique Experiences, and we don’t want you to miss anything. Here’s everything you need to know.
\\n\\n\\n\\nAlso, if you haven’t registered yet, it’s not too late.
\\n\\n\\n\\nLet’s start with what’s new this year:
\\n\\n\\n\\nWe’re hosting a laptop drive during this year’s KubeCon + CloudNativeCon North America 2024 that will benefit two non-profits in the tech space: Black Girls Code and Kids on Computers.
\\n\\n\\n\\nIn order to donate your laptop, make sure it meets the device requirements, wipe the data, fill out the form, and then drop it off at the Coat and Bag Check area of the Salt Lake City Convention Center during KubeCon + CloudNativeCon at the following times:
Tuesday November 12 – 7:30am – 7:00pm
Wednesday November 13 – 7:30am – 8:00pm
\\n\\n\\n\\nThursday November 14 – 8:00am – 6:00pm
\\n\\n\\n\\nFriday November 15 – 8:00am – 11:00am
\\n\\n\\n\\nThe DEI Community Hub is an exciting new space to connect, learn, and celebrate diversity, equity, inclusion, and accessibility! Join community groups, participate in allyship and advocacy workshops, or simply relax in a safe space during open lounge hours.
\\n\\n\\n\\nAttend a BoF session with the Public Sector User Group and take a deep dive into software supply chains and how to integrate Sigstore and in-toto to meet global government needs. Get all the details.
\\n\\n\\n\\nDon’t miss the world premiere of Inside Argo: Automating the Future, a documentary set in 2017 that follows the project teams as they develop and launch Argo and the suite of tools around it. If you need a visual embodiment of the spirit of innovation and collaboration in the world of open source, Inside Argo will give you just that.
\\n\\n\\n\\nThe film debuts on Thursday November 14 at 6:15 at the Salt Lake City Convention Center, Salt Palace, Level 2, room 254.
\\n\\n\\n\\nAdd these Experiences to your schedule:
\\n\\n\\n\\nOn Wednesday November 13 from 6pm to 8pm, plan to attend the launch party in the Solutions Showcase where you can enjoy local seasonal treats and drinks while exploring sponsor booths, being entertained, and even taking a “ski lift” photo.
\\n\\n\\n\\nUnderstand the differences between malware and vulnerabilities – sponsored by Sonatype – and then pop into the OpenTelemetry Observatory powered by Splunk where you can relax, power up, and talk all things OTel. These opportunities will be available in the Solutions Showcase Wednesday through Friday.
\\n\\n\\n\\nCoder is offering attendees the opportunity to get a new professional headshot, but you have to sign up in advance here and space is limited!
\\n\\n\\n\\nDon’t miss the many, many options to keep your body and mind healthy and happy during KubeCon + CloudNativeCon North America 2024. From petting a therapy dog or meeting with a wellbeing coach to getting a chair massage, taking a self-led run, or recharging with pure oxygen, make sure to schedule some time to decompress. Find all the available options.
\\n\\n\\n\\nWant to see what cloud native literally looks like? There’s no better place than the Project Pavilion where you can see CNCF’s 200 projects in action and interact with maintainers, CNCF staffers and more. Guided tours run daily.
\\n\\n\\n\\nWhether you’re brand new to CTF or well-versed, join us in the Salt Palace on Level 2 in Room 255A for three increasingly difficult and treacherous capture the flag scenarios. Beginners can take one of two introductory workshops (here or here) on Wednesday November 13, then jump in fully on Thursday the 14th beginning at 11am. Explore all the details.
\\n\\n\\n\\nOn Wednesday November 13 from 6pm to 8pm, meet and greet other end users at the Hyatt Regency’s Broadcast Lounge on Level 4. Expect food, beverages, networking and more!
\\n\\n\\n\\nAlso on November 13, meet at the Project Pavilion in the Solutions Showcase at 6:30pm to watch two people solve a tricky technical problem in just a very limited amount of time and with no advance knowledge or preparation. Be prepared to be on the edge of your seats!
\\n\\n\\n\\nGrab your lunch and come meet other new and experienced K8s contributors and find out how to expand your participation in SIGs and WGs. The event will take place in the Project Pavilion on November 14 from 12:30pm – 2:30pm.
\\n\\nCommunity post originally published on Medium by Giorgi Keratishvili
If you have been working in IT over the last decade, you have most probably heard words such as containers, Docker, cloud native, and maybe even Kubernetes, but wondered what all those buzzwords mean and where you could start your learning journey. Then, my dear friend, buckle up, because we are about to dive into the world of containers and Kubernetes with the help of an entry-level certification from CNCF. It is a very good entry point and aims to be beginner friendly, as it was the first multiple-choice format exam introduced by the Linux Foundation. I had the opportunity to be one of the first beta testers when this exam was first showcased, and in this blog I will share materials and the overall style of the exam so you can pass it and gain foundational knowledge.
\\n\\n\\n\\nThe goal of this certification is to provide individuals with basic knowledge of implementing container best practices in a vendor-neutral way and to give them a foundation to explore the rest of the cloud native ecosystem. As this exam is meant to be the first step into this whole world, it is not intended to be super hard.
\\n\\n\\n\\nCompared to other certifications I would say it’s the easiest, but it still falls under the pre-professional level of difficulty and is a great way to test your knowledge before tackling the CKAD or CKA. For me, the order of difficulty felt like this: KCNA/CGOA/CKAD/PCA/KCSA/CKA/CKS. One thing to keep in mind is that I had already passed a couple of Kubernetes exams, and when I was preparing for this one there were no tutorials or blogs to refer to at the time, only some suspiciously scam-like dumps, so don’t fall for them. Below I will mention all the new courses and materials that should help in preparation.
\\n\\n\\n\\nAs for who would benefit: SysAdmins, Devs, Ops, SREs, Managers, Platform engineers, or anyone doing anything in production should consider it, as knowing the basics is always a good thing, and so should anybody who wants to become a Kubestronaut 😉 (more about that in the next blog)…
\\n\\n\\n\\nThe KCNA is a pre-professional certification designed for candidates interested in advancing to the professional level through a demonstrated understanding of Kubernetes foundational knowledge and skills. This certification is ideal for students learning about, or candidates interested in working with, cloud native technologies.
\\n\\n\\n\\nA certified KCNA will confirm conceptual knowledge of the entire cloud native ecosystem, particularly focusing on Kubernetes. The KCNA exam is intended to prepare candidates to work with cloud native technologies and pursue further CNCF credentials, including KCSA, CKA, CKAD, and CKS.
\\n\\n\\n\\nKCNA will demonstrate a candidate’s basic knowledge of Kubernetes and cloud-native technologies, including how to deploy an application using basic kubectl commands, the architecture of Kubernetes (containers, pods, nodes, clusters), understanding the cloud-native landscape and projects (storage, networking, GitOps, service mesh), and understanding the principles of cloud-native security.
\\n
So, are we ready to deploy vanilla clusters, build highly available environments, and rock the world of fast software delivery? Well, we have a long path ahead until we reach that point. First, we need to understand what kind of exam this is compared to the CKAD, CKA, CKS, or others. This is the first exam for which the CNCF adopted multiple-choice questions, and compared to other multiple-choice exams it is, I would say, on the easier side. It qualifies as pre-professional, on par with the KCSA/PCA/CGOA.
\\n\\n\\n\\nThis exam is conducted online and proctored, similarly to other Kubernetes certifications, and is facilitated by PSI. As someone who has taken more than 15 exams with PSI, I can say that every time it’s a new journey. I HIGHLY ADVISE joining the exam 30 minutes before the start time, because there are pre-checks of your ID and the room in which you are taking it needs to be checked against the exam criteria. Please check these two links for the exam rules and the PSI portal guide.
\\n\\n\\n\\nYou’ll have 90 minutes to answer 60 questions, which is generally considered sufficient time; the passing score is >75%. Be prepared for some questions that can be quite tricky. I marked a couple of them for review and would advise doing the same, because sometimes you can find a hint or partial answer in a later question; that way, you can refer back to the marked ones. Regarding pricing, the exam costs $250, but you can often find it at a discount, such as during Black Friday promotions or near dates for CNCF events like KubeCon, Open Source Summit, etc.
\\n\\n\\n\\nAt this point, we understand what we have signed up for and are ready to dedicate time to training, but where should we start? Before taking this exam, I had good experience with Kubernetes and its ecosystem, as well as with the CKA exam, yet I still learned a lot from preparing for this one.
\\n\\n\\n\\nAt first glance, this list might seem too simple and easy, but we need to learn the fundamentals of Kubernetes first in order to understand higher-level concepts such as architecture, core components, the 4Cs, and many more.
\\n\\n\\n\\n**Kubernetes Fundamentals 46%**
Kubernetes Resources
Kubernetes Architecture
Kubernetes API
Containers
Scheduling
**Container Orchestration 22%**
Container Orchestration Fundamentals
Runtime
Security
Networking
Service Mesh
Storage
**Cloud Native Architecture 16%**
Autoscaling
Serverless
Community and Governance
Roles and Personas
Open Standards
**Cloud Native Observability 8%**
Telemetry & Observability
Prometheus
Cost Management
**Cloud Native Application Delivery 8%**
Application Delivery Fundamentals
GitOps
CI/CD
These are the 5 key pillars (the key domains of the exam); use the weightings above to help direct your studies.
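\\n\\n\\n\\nTo give a flavour of the Kubernetes Fundamentals domain, you should be comfortable reading a minimal manifest like the one below (a generic illustration, not an exam question):
\\n\\n\\n\\n# Minimal Pod manifest - the kind of basic resource the exam expects you to read and reason about\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n name: hello-web\\n labels:\\n  app: hello-web\\nspec:\\n containers:\\n  - name: web\\n    image: nginx:1.27   # illustrative image and tag\\n    ports:\\n     - containerPort: 80
\\n\\n\\n\\nApplying it with kubectl apply -f pod.yaml and then inspecting it with kubectl get pods and kubectl describe pod hello-web is exactly the kind of basic kubectl workflow the curriculum refers to.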
\\n\\n\\n\\nYou can explore and learn about the KCNA certification and related topics for free through the following GitHub repositories, which I have used myself. Of course, the Kubernetes documentation is our best friend, and there are also great courses from the Linux Foundation.
\\n\\n\\n\\nFor structured and comprehensive KCNA exam preparation, consider investing in paid courses from KodeKloud, ExamPro, or O’Reilly, and play a lot with the killercoda Kubernetes playground from Kim Wüstkamp; I have used his content a lot for CKS/CKAD/CKA/LFCS exam preparation, it is very useful, and I recommend it. I would also highly advise not clicking on every course that pops up in a Google search: since this is a new exam, there are plenty of scams.
\\n\\n\\n\\nThe exam is the easiest among these certifications, and I would rank them in this order: KCNA/CGOA/CKAD/PCA/KCSA/CKA/CKS. You will receive your grade within 24 hours of taking the exam, and passing it feels pretty satisfying overall. I hope this was informative and useful 🚀
\\n\\n\\n\\nA special thanks to one of our CNCF Ambassadors, Ramesh Kumar for inspiring us to create the Kubestronaut program. We recently interviewed Ramesh to ask about how the Kubestronaut program came to be. If you’d like to be a Kubestronaut like Ramesh, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nIn 2017, CNCF released the Certified Kubernetes Administrator (CKA) exam. I was among the first few people to study day and night to pass it. It was an exhilarating journey filled with passion and excitement. The CKA exam lit a fire in me—it was unlike any other exam, being a hands-on, scenario-based challenge. That unique aspect made it thrilling. The following year, in 2018, came the Certified Kubernetes Application Developer (CKAD) exam, which was just as exciting.
\\n\\n\\n\\nA fun fact: the late and excellent Dan Kohn sponsored my CKA and CKAD exams. In 2019, during KubeCon San Diego, I hosted a roundtable QA session on certifications. Dan Kohn observed my enthusiasm and excitement and suggested I start a study group to help others, especially students, kickstart their Kubernetes careers and certification goals. That suggestion led to me becoming an author for several Kubernetes certification exams, starting many workgroups in the Sacramento area and a few online groups. I also launched a meetup for Kubernetes certifications and workshops like CKA/CKAD to help spread the word about CNCF and its incredible resources.
\\n\\n\\n\\nAround 2021, after taking and earning all of the Kubernetes certifications as they came out, I wanted to create something special for those who achieved CNCF and Kubernetes certifications—a program that would recognize their accomplishments and inspire and motivate others to set ambitious goals. I shared this idea with Chris Aniszczyk, and he found it intriguing and worth implementing. The idea originated in November 2022 during KubeCon Detroit. It kept evolving for a while, and by February 2024, we had the perfect name: “Kubestronaut.” My journey deeply inspired the concept: I often felt like an astronaut navigating Kubernetes’s vast, uncharted territory. I wanted others to experience that same sense of exploration and discovery but with guidance and a supportive community to help them succeed.
\\n\\n\\n\\nThe Kubernetes community is one of the most dynamic and enthusiastic groups in tech. People are always eager to learn, contribute, push boundaries, and help others grow, encourage, and support one another. I’ve found many amazing mentors in this community and had the opportunity to be both a mentee and a mentor. This level of collaboration and growth is only possible with CNCF communities.
\\n\\n\\n\\nBeing a Kubestronaut is very special. It’s a sense of achievement, fulfillment, and inspiration. The Kubestronaut initiative is unique in its focus on Kubernetes/CNCF education, certifications, and fostering a sense of community among learners. Hands-on certifications like CKA, CKAD, and CKS, along with foundational exams like Kubernetes and Cloud Native Associate (KCNA) and Kubernetes and Cloud Security Associate (KCSA), provide learners with opportunities to deepen their knowledge across different levels and domains. These certifications help participants develop hands-on skills and theoretical understanding, supporting their personal and professional growth. By incorporating diverse certifications, the Kubestronaut initiative empowers individuals to build expertise in Kubernetes and cloud-native technologies.
\\n\\n\\n\\nThe Kubestronaut is a unique initiative focused on Kubernetes education, certification, and fostering a sense of community among learners. It is designed to be inclusive and community-driven, emphasizing growth through learning, collaboration, and mentorship. One of the program’s highlights is the opportunity to connect with experts in the field, exchange ideas, and be part of a global network of Kubernetes professionals. We aim to make Kubernetes more accessible and rewarding for anyone, whether just starting their cloud-native journey or advancing their existing skills.
\\n\\n\\n\\nCNCF certifications and programs have a profound impact. For example, one of our community members, a chef with 18 years of experience, transitioned into the tech industry within 18 months by earning Kubernetes certifications. His journey is a powerful reminder of how these certifications can change lives. Programs like Kubestronaut and Kubernetes can inspire growth and open new opportunities, regardless of where you start. I can’t wait to see how many more lives will be impacted through the Kubestronaut initiative. This initiative is about empowering everyone to reach their full potential.
\\n\\n\\n\\nAt Apple, we always think big, whether it’s about scale or innovation. Currently, I work extensively with Kubernetes, which is central to my projects, particularly in managing cloud infrastructure at scale. I also engage with Helm for package management, Prometheus for monitoring, Grafana for dashboarding, Flux for continuous delivery (CD), and Emissary Ingress (formerly known as Ambassador) for service proxies. These CNCF projects are fundamental to building and operating cloud-native architectures in my work. Managing applications requires a lot of standards and security enforcement, and I always take these aspects very seriously, ensuring they are handled with deep precision.
\\n\\n\\n\\nIn terms of what I’ve enjoyed the most, Kubernetes has always been my passion. It’s a continuously evolving platform, and working with it has been incredibly rewarding. Prometheus and Grafana have been exciting to work with, mainly because of their robust monitoring capabilities and how they empower teams to gain real-time insights into their systems.
\\n\\n\\n\\nI’m a big believer in learning, training, and certifications because they provide a clear path, goal, and target. Certifications offer a structured roadmap, helping you gain deep knowledge in specific areas. While certifications may not make you a complete master in a subject, they provide a strong foundation and clear direction for continuous learning and growth.
\\n\\n\\n\\nThe CNCF certifications— KCNA, CKA, CKAD, KCSA, and CKS – all have been pivotal in my career. These certifications validated my skills and pushed me to develop a deeper understanding of Kubernetes and cloud-native technologies. They helped me transition into roles where I could build, manage, and operate complex Kubernetes environments at scale.
\\n\\n\\n\\nThe certifications also opened doors for me to engage with the broader cloud-native community, participate in discussions, and collaborate on meaningful projects. More importantly, they gave me the confidence to share my knowledge with others, leading to my involvement in mentoring and creating educational programs like Kubestronaut. These certifications have played a key role in elevating my career, expanding my network, and connecting me with opportunities that align with my passion for Kubernetes and cloud-native technologies.
\\n\\n\\n\\nMore about Ramesh
\\n\\n\\n\\nRamesh Kumar is a passionate advocate for cloud-native technologies with a deep background in Kubernetes and DevOps. Ramesh works at Apple as a Systems Architect, building and managing the next generation of cloud infrastructure and has had the opportunity to run and operate hundreds of applications from the ground up, ensuring they meet CNCF standards for cloud deployments. Over the years, he’s worked on various Kubernetes-related projects in education and hands-on enterprise-grade implementations. He’s also a teacher and enjoys mentoring others in navigating the rapidly evolving Kubernetes landscape. Ramesh is dedicated to fostering the communities that grow through shared learning and practical experiences. And he has a pretty unique email at Apple: Kubernetes@apple.com!
\\n\\nThis week’s Kubestronaut in Orbit, Maria Salcedo, is a full stack DevOps backend engineer in Germany with experience in cloud native Kubernetes deployments. Maria is passionate about GitOps, cloud native development of CI/CD pipelines, IaC automation, and open source software.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Maria, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nI first started using containers in 2017 as a backend engineer, deploying through a Jenkins server. Back then, I wasn’t part of the infrastructure team, so I didn’t have access to the K8s server. However, my curiosity led me to create a tiny playground to see what an end-to-end solution would look like. Gradually, as I joined new projects, I went deeper into K8s and infrastructure as code. I approached this cautiously – as K8s evolved side-by-side with the CNCF ecosystem, I saw the potential to eventually safely onboard it in highly productive environments.
\\n\\n\\n\\nFrom the CNCF Landscape, the primary ones that I’ve enjoyed the most are (in no particular order): FluxCD, Linkerd, Istio, Prometheus, Grafana, Jaeger, OpenTelemetry, Quarkus, Gradle, PostgreSQL, Kyverno, Ansible, Terraform, Podman, and of course, Kubernetes.
\\n\\n\\n\\nHighly automated environments sometimes lead us to miss feature details that might be important to consider. Especially on security, it is extremely important to be aware of what’s currently being offered, what gets deprecated, and what the best practices to follow actually are. For that, continuous learning must be a top priority in our careers as software developers, and one tool to “push” you to get better is certification. That’s why I find the certs very useful.
\\n\\n\\n\\nIt might sound obvious, but without CNCF we would have no CI/CD pipelines, GitOps, IaC, Monitoring, Storage Management, or Service Mesh solutions. Kubernetes without them is like farming with a fork instead of a tractor. I cannot imagine deploying new features and keeping track of multiple handmade productive clusters without the help provided by CNCF supported projects.
\\n\\n\\n\\nMany YouTube channels have very good videos explaining the very basics. I 100% recommend watching YouTube videos to validate your knowledge.
\\n\\n\\n\\nOnce you assess your skill level, you can see how far you want to go. For instance, get your own little playground started. Most, if not all, cloud providers offer free credits, which can be used for learning purposes.
\\n\\n\\n\\nSome other knowledge sources can be found at The Linux Foundation online courses. I recommend Kubernetes for Developers (LFD259) in particular. You will get your money’s worth.
\\n\\n\\n\\nKodeKloud also offers some courses. Those courses are focused on the K8s certificates. I haven’t done them myself, however the community speaks well about their impact, and how clear their explanations are.
\\n\\n\\n\\nAside from K8s, there is so much more you can learn, such as deployments, pipeline automation, monitoring, etc.
\\n\\n\\n\\nGitOps, Infra as Code, and automation are as important as the very basics of K8s, for building a proper production-ready setup. For that, I recommend checking the CNCF Landscape page to find tools for it. At a project’s repository, tutorials and examples are usually offered to learn how to implement it in a more interactive way.
\\n\\n\\n\\nI enjoy nature while hiking. It allows my brain to relax and reflect on what happened during the week.
\\n\\n\\n\\nFor beginners, it pretty much depends where you are starting, so assessing your knowledge plays a big role.
\\n\\n\\n\\nAsk yourself questions such as: How much experience do you have using Docker and containers overall? Have you ever logged into a Kubernetes cluster before? If so, was it self managed or installed “from scratch”? Are you dealing with GitOps and Infrastructure as Code?
\\n\\n\\n\\nMany people underestimate their beginner knowledge, because you probably already know many things about Kubernetes just by using some of its components on a day-to-day basis.
\\n\\n\\n\\nOf course! I am interested in both the Prometheus (PCA) and Cilium (CCA) certifications, as well as the Linux Foundation’s Sysadmin (LFCS) and GitOps (CGOA) certs.
\\n\\n\\n\\n\\n\\nCo-chairs: Amber Graner, Rajas Kakodkar, Ricardo Rocha, Yuan Tang
\\n\\n\\n\\nNovember 12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nCloud Native & Kubernetes AI Day brings together a diverse range of technical enthusiasts, open source contributors, practitioners, researchers and end users. All united in a common goal: Enhancing Kubernetes as the ultimate infrastructure management tool for research, training, and production. Cloud Native & Kubernetes AI Day is welcoming the AI/ML and High Performance Computing (HPC) communities. Since 2022 there have been multiple dedicated events (Kubeflow Summit, Batch / HPC and Cloud Native AI days) but given the overlap in requirements, projects and end user interests it became clear we all fit better together.
\\n\\n\\n\\nThe Cloud Native & Kubernetes AI Day is aimed at seasoned practitioners as well as those new to the batch computing and MLOps worlds. Anyone looking for solutions and best practices to provide cost effective and efficient infrastructure to scale out batch computing, training and inference workloads, and make the best use of scarce and expensive hardware accelerators, will find inspiration here.
\\n\\n\\n\\nThis event will also help practitioners of MLOps interact with maintainers of Cloud Native AI projects and foster collaboration between the two worlds.
\\n\\n\\n\\nThis time we join the HPC and AI/ML communities in a single event. No need to jump from one event to the other if you want to listen to that particular item on batch computing without missing that awesome session on optimizing inference for chatbot applications.
\\n\\n\\n\\nWe will have a full day with 11 full sessions and 3 lightning talks and enough time for questions during the sessions and discussion in the break-outs. We will be hearing from researchers, project maintainers and many end users reporting on the successes and challenges of running AI/ML and HPC workloads on top of cloud native infrastructure. While the sessions will be engaging, there will be ample time during coffee breaks and lunch for hallway tracks and networking sessions, helping attendees engage with speakers, maintainers of projects and end users.
\\n\\n\\n\\nNo formal prep is required, but consider going through the schedule in advance so you can prepare to ask or raise particular topics of interest to you or your organization. This is a unique opportunity to meet and learn from some of the industry’s best practitioners and a good chance to also raise your particular requirements and help set the path for our community.
\\n\\n\\n\\nSubmitted by the co-chairs, who are eager to hear about progress in batch and inference in Kubernetes as well as real-world use cases and success stories. Don’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\n\\n\\n\\n\\nMember post by Ranjan Parthasarathy, CPO/CTO of Apica
\\n\\n\\n\\nTelemetry data is to your system what sensors are to an automobile. Put simply, it is vital. However, handling telemetry data is cumbersome, particularly with the current data explosion.
\\n\\n\\n\\nEnterprises require telemetry data for proactive management of complex systems and informed strategic planning with:
\\n\\n\\n\\nThus, telemetry is a goldmine of critical data. The caveat is turning that ingested data into useful insights. While it is challenging, with the right strategies, best practices and solutions like fleet management, you can tame the power of telemetry.
\\n\\n\\n\\nLet’s break down the process of turning your telemetry data into actionable insights.
\\n\\n\\n\\nRaw data comes in all shapes and forms. Turning that raw data into context-rich insights is imperative to unlock critical business and technical decisions. Thus, several key areas must be addressed to leverage the unprecedented amounts of ingested data effectively.
\\n\\n\\n\\nHere are a few practices for turning ingested data into meaningful insights:
\\n\\n\\n\\nData Quality and Integrity:
\\n\\n\\n\\nData Analysis and Visualization:
\\n\\n\\n\\nBusiness Context and Goals:
\\n\\n\\n\\nTechnology and Infrastructure:
\\n\\n\\n\\nOrganizational Culture and Processes:
\\n\\n\\n\\nSecurity and Privacy:
\\n\\n\\n\\nYou need a structured strategy to extract relevant insights from a dynamic data pool. This ensures informed decision-making and strategic planning.
\\n\\n\\n\\nThe following steps outline a comprehensive method for turning ingested data into valuable insights, guiding you through the process of filtering, aggregation, visualization, and analysis:
\\n\\n\\n\\nWith the right tools and strategies, you can maximize the value of your data, reduce downtime, and improve overall operational efficiency. Proactively managing telemetry turns data chaos into clarity and smarter decision-making, and it unleashes your systems’ full potential. A well-designed data pipeline is key to transforming raw information into game-changing insights. For more insights, connect directly with Ranjan Parthasarathy on Twitter or LinkedIn, or explore more about Apica’s observability solutions on the website.
\\n\\nCommunity post by Dave Smith-Uchida, Technical Leader, Veeam (Linkedin, GitHub)
\\n\\n\\n\\nData on Kubernetes is growing with databases, object stores, and other stateful applications moving to the platform. The Data Protection Working Group (DPWG) focuses on data availability and preservation for Kubernetes – including backup, restore, remote replication, and the facilitation and orchestration of these processes. At the Data Protection Working Group Deep Dive at KubeCon + CloudNativeCon Salt Lake City (Nov. 13, 2:30 PM), Xing Yang, Cloud Native Storage Tech Lead at Broadcom/VMware, and I will cover topics including:
\\n\\n\\n\\nKubernetes has evolved from its original mission as an orchestrator for stateless containers that uses external services for data storage into a platform that supports data storage and state within a Kubernetes cluster. State can be stored in Persistent Volumes (PVs) but also in Kubernetes resources as native Kubernetes applications take advantage of the Kubernetes API server to store their working information. The evolution of Kubernetes into a stateful platform has created a need to protect the data stored in Kubernetes against loss, corruption, and other threats such as ransomware attacks. The Data Protection Working Group has published a white paper that outlines when you need data protection in Kubernetes. We’ll cover the high points during the session at KubeCon, but in the meantime, we invite you to read the paper here: https://github.com/kubernetes/community/blob/master/wg-data-protection/data-protection-workflows-white-paper.md
\\n\\n\\n\\nThe Data Protection Working Group has been working on adding CBT to Kubernetes and the Container Storage Interface (CSI). CBT improves the performance of backup and replication of large volumes by tracking the blocks that have been changed between two snapshots. When a backup is performed, the backup system creates a volume snapshot and retrieves the list of blocks that have changed since the previous backup’s snapshot was taken and only copies those blocks. Many storage systems, both traditional and cloud-based, implement CBT, but the APIs are proprietary. Since Kubernetes has already created standard APIs for allocating, attaching, snapshotting, and cloning volumes, adding CBT is the next step in standard APIs for storage systems. Veeam Kasten has been a leader in using proprietary CBT systems for Kubernetes data protection, and we’re proud to have been a participant in creating the Kubernetes CBT API, which is currently in beta with Kubernetes 1.32.
\\n\\n\\n\\nChanged Block Tracking KEP: https://github.com/kubernetes/enhancements/issues/3314
\\n\\n\\n\\nVolume Group Snapshots are another Kubernetes enhancement that supports data protection. When an application uses multiple volumes, taking a consistent snapshot of all of the volumes is important. Taking snapshots one by one while the application is running may result in inconsistencies between the volumes, and may create a backup that will not be usable. One way to get consistency is to quiesce the application and snapshot each volume it uses, but this can take considerable time and the application will be unavailable while in the quiesced state. Volume Group Snapshots snapshot all of the volumes in the group together without needing to quiesce the application to achieve consistency. As with CBT, this is a feature offered by many storage systems, but only with proprietary APIs.
\\n\\n\\n\\nVolume Group Snapshots KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
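\\n\\n\\n\\nAs a rough sketch of how this surfaces to users, a group snapshot is requested by selecting all of the application’s volumes with a label selector; the exact API version depends on your cluster and CSI driver, and the names below are illustrative:
\\n\\n\\n\\n# Illustrative VolumeGroupSnapshot - exact apiVersion depends on your cluster, names are examples\\napiVersion: groupsnapshot.storage.k8s.io/v1beta1\\nkind: VolumeGroupSnapshot\\nmetadata:\\n name: my-app-group-snapshot\\n namespace: my-app\\nspec:\\n volumeGroupSnapshotClassName: csi-group-snapclass\\n source:\\n  selector:\\n   matchLabels:\\n    app: my-app   # every PVC carrying this label is snapshotted together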
\\n\\n\\n\\nA new project for the group is creating a white paper on Best Practices to Prepare Kubernetes Applications for Data Protection. When working with data protection solutions like backup and restore, applications need to be structured so that the backup and restore process can make consistent backups and be restored to a working state. This is an ongoing project, and we invite everyone to join us and share their needs, experiences, and ideas for how to best prepare their applications for data protection.
\\n\\n\\n\\nThe Data Protection Working Group consists of participants who use Kubernetes and create applications, storage, and data protection solutions for the platform. We’re open to anyone who is interested in protecting their data on Kubernetes. Come join our session to exchange ideas, find out how to contribute and let us know what your needs are!
\\n\\n\\n\\nClick here to learn more about the Kubernetes Data Protection Working Group.
\\n\\n\\n\\nStop by Veeam’s booth (#K7) for in-person demonstrations of our Veeam Kasten data protection solution and talk to our subject matter experts.
\\n\\n\\n\\nKanister is an open source framework for data protection and management on Kubernetes. It is a CNCF Sandbox project and can be found at: https://www.kanister.io/.
\\n\\n\\n\\nCome see Xing and me at the Data Protection Working Group Deep Dive at KubeCon + CloudNativeCon Salt Lake City (Nov. 13, 2:30 PM).
\\n\\nA Delhi guide by Kunal Kushwaha, Field CTO at Civo
\\n\\n\\n\\nThe capital city of India, Delhi, has roots that trace back thousands of years. Known as Indraprastha in ancient texts dating as far back as 400 BCE, it has been the center of various cultures and empires. There’s always something happening in Delhi, with a wide array of activities to experience. In this guide, I’ll share how you can make the most of your time when you visit during KubeCon India in December.
\\n\\n\\n\\nIn December, Delhi will be chilly with temperatures ranging from 22°C during the day to as low as 5°C at night. Pack a jacket and warm clothes. Rain is unlikely during this time, but it’s best to check the weather forecast closer to your travel date.
\\n\\n\\n\\nDelhi offers a diverse range of experiences, from some of the best food in the world to a rich cultural and historical landscape. Enjoy your visit to the city and make the most out of KubeCon India 2024! If you have any questions, feel free to message me on the CNCF Slack.
\\n\\n\\n\\nSee you there!
\\n\\nCommunity post by Dan Garfield
\\n\\n\\n\\nFor the very first time, KubeCon + CloudNativeCon North America is traveling to where I live! Hi, my name is Dan Garfield, I’m an Argo Maintainer for Codefresh and Octopus Deploy residing in Salt Lake City. I wanted to share some quick tips for how to get the most out of your trip to Salt Lake City, Utah.
\\n\\n\\n\\nSalt Lake City has both a light rail and bus system for getting around town, as well as a regional train system called FrontRunner that runs north (Ogden) to south (Provo). You can take a short train ride from the SLC Airport to downtown, or just grab an Uber, Lyft, or taxi. If you plan to stay near the convention center, I don't recommend renting a car because you'll have to pay for parking, and SLC is small enough that it's not really worth it.
\\n\\n\\n\\nNovember is a wildcard in Utah: it could be 70-80 degrees or it could be snowing. It's not uncommon for snow to start falling in SLC in late October, but these are usually quick storms that melt right away. Bring warm clothing options with closed shoes in case it's cold and snowy the week of KubeCon.
\\n\\n\\n\\nSalt Lake City is known for its proximity to amazing outdoor experiences. We'll talk about the best things close to town, but if you want to stay in town I recommend hiking Ensign Peak; it's a bit steep, but at less than a mile it's very doable and you'll be rewarded with a view of the entire Salt Lake Valley. For something more robust I recommend the Bonneville Shoreline Trail: the entire trail is hundreds of miles long, but you can start up off 18th Ave and hike 9 miles to the Hogle Zoo. There are plenty of places to rent mountain bikes if you'd rather ride, and you can ride the same trail. If you're an expert (this is a black diamond), you can do the Bobsled trail off the same trail as above.
\\n\\n\\n\\nIf you’re staying on foot, just across the street from the convention center is the City Creek Mall which runs through the heart of downtown and is a nice place to walk. Next to that is the world-famous Temple Square where you can catch the Tabernacle Choir at Temple Square, one of the largest and oldest choirs in the world. Also, there are several free museums – I recommend checking out the Family Search Library where you can get tons of information about your ancestors for free.
\\n\\n\\n\\nThere are plenty of bars and restaurants downtown and I’m hesitant to recommend any specific one because 10,000 KubeCon attendees would undoubtedly completely overrun any particular place. Soda shops like Thirst, Swig, Crema, Sips, Sodalicious, and others are Utah phenomena where you can get fancy soda. A dirty soda from these shops usually means Coca-Cola with flavorings added. They’re not my thing but they’re quite popular.
\\n\\n\\n\\nDuring the week of KubeCon there will be lots of sporting events; the Delta Center is literally across the street from the convention center. The Jazz play the Suns and the Mavericks, and Utah's new and as-yet-unnamed hockey team will be playing the Hurricanes and the Knights.
\\n\\n\\n\\nThere are also opportunities for service, my team is planning to spend some time after the conference at the Cannery on Welfare Square which sends food and aid worldwide.
\\n\\n\\n\\nLots of people have asked if they're going to be able to ski. While it is possible, Park City is currently planning its earliest opening for November 22. If you don't mind hiking backcountry like Matt Asay, you will be able to find snow, but if you don't normally ski backcountry I would heavily discourage doing so without a local guide, as it can be quite dangerous.
\\n\\n\\n\\nBeyond skiing, there are tons of day trips the weekend after KubeCon that are easily done with a rental car or in some cases with a tour bus. In no particular order:
\\n\\n\\n\\nWhile early November in Utah may fall between the peak summer and winter activities, there are still plenty of great experiences to enjoy. Whether you’re exploring Salt Lake City or taking a scenic drive south to Zion, Escalante, or Moab, this guide will help you discover a variety of exciting options to make the most of your time.
\\n\\n\\n\\n\\n\\nCommunity post originally published on Medium/IT Next by Giorgi Keratishvili
\\n\\n\\n\\nOver the last five years, GitOps has emerged as one of the most interesting ways of using Git in the Kubernetes ecosystem. When people hear about Argo they immediately associate it with Argo CD, but not many know that the project is more than just a GitOps tool. The Argo Project is a tool suite developed as a cloud- and Kubernetes-native solution that helps accelerate the SDLC and automate full DevOps workflows. Alongside Argo CD there are great tools such as Argo Workflows, Argo Rollouts, and Argo Events, which are extensively covered in this exam along with real-world examples of how to use them effectively. If you are planning to take the CAPA exam, this blog will be interesting for you: as part of my exam preparation journey, we will dive into the domains covered by the exam and how to prepare for them.
\\n\\n\\n\\nThe Argo Project is a suite of open source tools for deploying and running applications and workloads on Kubernetes. It extends the Kubernetes APIs and unlocks new and powerful capabilities in application deployment, container orchestration, event automation, progressive delivery, and more. There is also a very interesting blog from the Argo Project creator that explains why they created it; Argo Workflows was actually the first tool, before Argo CD.
\\n\\n\\n\\n\\n\\n\\n\\n\\nThe CAPA is an associate-level certification designed for engineers, data scientists, and others interested in demonstrating their understanding of the Argo Project ecosystem.
\\n\\n\\n\\nThose who earn the CAPA certification will demonstrate to current and prospective employers they can navigate the Argo Project ecosystem, including when and why to use each tool, understanding the fundamentals around each toolset, explaining when to use which tool and why, and integrating each tool with other tools.
\\n\\n\\n\\nCAPA will demonstrate a candidate’s solid understanding of the Argo Project ecosystem, terminology, and best practices for each tool and how it relates to common DevOps, GitOps, Platform Engineering, and related practices. Learn more about the exam and sign up for updates on the Linux Foundation Training and Certification website.
\\n
Compared to other certifications I wouldn't say it's the easiest, but it still falls under the pre-professional level of difficulty and is a great way to test your knowledge of the Argo Project. For me, the order of difficulty felt like this: KCNA/CGOA/CAPA/CKAD/PCA/KCSA/CKA/CKS. One thing to keep in mind is that I had already passed the CGOA before taking the CAPA, and when I was preparing for the exam there were no tutorials or blogs to refer to, only some suspiciously scam-like dumps, so don't fall for them. Below I will mention all the new courses and materials that should help in preparation.
\\n\\n\\n\\nRegarding who would benefit: SysAdmins, Devs, Ops, SREs, Managers, Platform Engineers, or anyone running anything in production should consider it, as knowing basic GitOps is always a good thing; when managing multiple clusters, it is a handy skill.
\\n\\n\\n\\nSo, are we ready to patch every security hole in our cluster, kick hackers out of our production systems, and make it hard for them to compromise your cluster? Then we have a long path ahead until we reach that point. First, we need to understand what kind of exam this is compared to the CKAD, CKA, and CKS. This is an exam where the CNCF has adopted multiple-choice questions, and compared to other multiple-choice exams, this one, I would say, is not easy-peasy. However, it is still classified as pre-professional, on par with the KCNA/PCA/CGOA/KCSA.
\\n\\n\\n\\nThis exam is conducted online, proctored similarly to other Kubernetes certifications, and is facilitated by PSI. As someone who has taken more than 15 exams with PSI, I can say that every time it's a new journey. I HIGHLY ADVISE joining the exam 30 minutes before taking the test because there are pre-checks of your ID, and the room in which you are taking it needs to be checked against the exam criteria. Please check these two links for the exam rules and the PSI portal guide.
\\n\\n\\n\\nYou'll have 90 minutes to answer 60 questions, which is generally considered sufficient time; the passing score is 75% or higher. Be prepared for some questions that can be quite tricky. I marked a couple of them for review and would advise doing the same, because sometimes you can find a hint or partial answer in a later question and then refer back. Regarding pricing, the exam costs $250, but you can often find it at a discount, such as during Black Friday promotions or around CNCF events like KubeCon, Open Source Summit, etc.
\\n\\n\\n\\nAt this point, we understand what we have signed up for and are ready to dedicate time to training, but where should we start? Before taking this exam, I had good experience with Kubernetes, Flux CD, and the Argo Project ecosystem and had passed the CGOA exam, yet I still learned a lot from preparing for this one.
\\n\\n\\n\\nAt first glance, this list might seem simple and easy; however, we need to learn the fundamentals of GitOps, CI/CD, and the SDLC first in order to understand higher-level concepts such as branching strategies, event-driven architecture, and many more.
\\n\\n\\n\\n**Argo Workflows 36%**\\n\\n\\n\\n
Understand Argo Workflow Fundamentals
Generating and Consuming Artifacts
Understand Argo Workflow Templates
Understand the Argo Workflow Spec
Work with DAG (Directed-Acyclic Graphs)
Run Data Processing Jobs with Argo Workflows
**Argo CD 34%**
Understand Argo CD Fundamentals
Synchronize Applications Using Argo CD
Use Argo CD Application
Configure Argo CD with Helm and Kustomize
Identify Common Reconciliation Patterns
**Argo Rollouts 18%**
Understand Argo Rollouts Fundamentals
Use Common Progressive Rollout Strategies
Describe Analysis Template and AnalysisRun
**Argo Events 12%**
Understand Argo Events Fundamentals
Understand Argo Event Components and Architecture
The Argo Project is headlined by Argo CD and Argo Workflows, two mainstay powerhouses that have become the de facto tools in their respective spaces. Joining Argo CD and Workflows are Argo Events and Rollouts, with large successful followings of their own. One of the brilliant decisions made early on was that these tools, each with its own uses, can be used with or without the rest of the Argo suite of tooling.
\\n\\n\\n\\nArgo CD is the world’s most popular and fastest growing GitOps tool. It allows users to define a git source of truth for an application and keep that in sync with a destination Kubernetes cluster. This powerful tool gets even more powerful when combined with tools like Argo Rollouts, which handles progressive delivery, and other open source tools like Crossplane for managing infrastructure, or OPA and Kyverno for security policy.
\\n\\n\\n\\nArgo Workflows provides a powerful workflow engine built for Kubernetes where each step operates in its own pod. This provides for massive scale and flexible multi-step workflow tasks. Argo Workflows has been especially popular for data pipelines as well as Kubernetes-native CI/CD pipelines. Workflows becomes especially powerful when paired with Argo Events, a Kubernetes-native event engine. It can be used to detect events in Kubernetes and trigger actions, either in Argo Workflows, or other services, as well as provide a general interface for webhooks and api-calls.
\\n\\n\\n\\nArgo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
\\n\\n\\n\\nAn in-depth guide on Argo Workflows, covering its basics, core concepts, and a quick tutorial. It explains how Argo Workflows orchestrates parallel jobs on Kubernetes, details on defining workflows with containers and DAGs, and offers practical steps for installing and running Argo Workflows. It also highlights the integration of Argo Workflows with Codefresh for CI/CD pipelines, emphasizing ease of use and cloud-native capabilities. For more details, visit Argo Workflows: The Basics and a Quick Tutorial
\\n\\n\\n\\nThis Pipekit guide explains how to configure an artifact repository for Argo Workflows. It provides step-by-step instructions on setting up and managing artifact repositories, enabling efficient handling and storage of artifacts generated during workflow executions. This guide is valuable for ensuring smooth and effective workflow operations in Kubernetes environments. For more information, visit How to Configure an Artifact Repo for Argo Workflows (pipekit.io).
\\n\\n\\n\\nDirected Acyclic Graph (DAG) — A Directed Acyclic Graph (DAG) is a finite graph with directed edges where no cycles exist. It’s used in computer science to model various systems, including task scheduling, data flow, and dependency management. DAGs facilitate efficient algorithms like topological sorting and are central to concepts in computational complexity and data structures. Directed acyclic graph
\\n\\n\\n\\nOverview and Use Cases of Directed Acyclic Graphs (DAGs) — Directed Acyclic Graph (DAG) is a structure consisting of vertices and directed edges, representing activities and their order. Each edge has a direction, and there are no cycles, meaning you cannot return to a vertex once you leave it. DAGs are useful in modeling data processing flows, ensuring tasks are performed in a specific sequence without repetition. Visit Directed Acyclic Graph (DAG) Overview & Use Cases
\\n\\n\\n\\nDirected Acyclic Graphs (DAGs) in Argo Workflows
The Argo Workflows DAG guide explains how to create workflows using directed-acyclic graphs (DAGs). It covers defining task dependencies for enhanced parallelism and simplicity in complex workflows. Examples include a sample workflow with sequential and parallel task execution and features like enhanced dependency logic and fail-fast behavior for error handling. Visit DAG — Argo Workflows
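To make the DAG idea concrete, here is a minimal illustrative Workflow in which task A runs first and tasks B and C then run in parallel; the image and parameter names are arbitrary and not taken from the exam material:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-example-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: "A"}]
      - name: B
        dependencies: [A]          # B starts only after A succeeds
        template: echo
        arguments:
          parameters: [{name: message, value: "B"}]
      - name: C
        dependencies: [A]          # C also waits only for A, so B and C run in parallel
        template: echo
        arguments:
          parameters: [{name: message, value: "C"}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.19           # placeholder image
      command: [echo, "{{inputs.parameters.message}}"]
```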
Argo CD follows the GitOps pattern of using Git repositories as the source of truth for defining the desired application state. Kubernetes manifests can be specified in several ways:
\\n\\n\\n\\nArgo CD automates the deployment of the desired application states in the specified target environments. Application deployments can track updates to branches, tags, or pinned to a specific version of manifests at a Git commit. See tracking strategies for additional details about the different tracking strategies available.
\\n\\n\\n\\nArgo CD is implemented as a Kubernetes controller which continuously monitors running applications and compares the current, live state against the desired target state (as specified in the Git repo). A deployed application whose live state deviates from the target state is considered OutOfSync. Argo CD reports & visualizes the differences, while providing facilities to automatically or manually sync the live state back to the desired target state. Any modifications made to the desired target state in the Git repo can be automatically applied and reflected in the specified target environments.
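A minimal Application manifest sketch shows how the Git source of truth, revision tracking, and automated sync fit together; the repository URL, path, and namespaces below are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/guestbook-manifests.git  # placeholder repo
    targetRevision: main          # track a branch; a tag or a pinned commit also works
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: guestbook
  syncPolicy:
    automated:
      prune: true                 # delete resources that were removed from Git
      selfHeal: true              # revert manual drift back to the Git-defined state
```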
The article “Argo CD Fundamentals” by Nandhabalan Marimuthu introduces the basics of Argo CD, a declarative continuous delivery tool for Kubernetes. It covers key concepts such as GitOps, application management, synchronization, and automation workflows, providing practical insights and examples for effectively utilizing Argo CD in DevOps environments. Understand The Basics — Argo CD
\\n\\n\\n\\nArgo CD — Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes. It automates application deployments, allowing users to manage Kubernetes resources through Git repositories. By monitoring and syncing application states, Argo CD ensures that the desired state in Git matches the live state in clusters. Understanding Argo CD
\\n\\n\\n\\nExplore Argo CD with Codefresh: A comprehensive guide to deploying applications using Argo CD, a GitOps continuous delivery tool. Learn about its features, benefits, and how to set up automated application deployment pipelines seamlessly integrating with Kubernetes. Master the power of GitOps for efficient, scalable software delivery.
\\n\\n\\n\\nSynchronize Applications Using Kubectl with Argo CD — The link provides guidance on syncing Kubernetes resources with Argo CD using kubectl. It highlights kubectl’s compatibility with Argo CD for synchronizing and managing Kubernetes resources. Both imperative and declarative methods are supported, ensuring seamless integration and efficient resource management within Kubernetes clusters. Sync Applications with Kubectl — Argo CD.
\\n\\n\\n\\nSynchronization Choices in Argo CD: Declarative GitOps Continuous Delivery for Kubernetes — Argo CD’s synchronization options streamline Kubernetes deployments. With features like automatic sync, manual sync, and hooks, it ensures seamless application updates. Customize sync policies, including pruning and sync wave limitations, for efficient resource management. Monitor sync status and history effortlessly for comprehensive deployment control. Sync Options — Argo CD — Declarative GitOps CD for Kubernetes
\\n\\n\\n\\nSetting Up Declaratively with Argo CD — The link provides a comprehensive guide to Argo CD’s synchronization options. It details various strategies for syncing Kubernetes resources, including automated and manual methods. Users can learn about hooks, resource filtering, and customization, ensuring efficient management and deployment of applications within Kubernetes clusters. Declarative Setup — Argo CD
\\n\\n\\n\\nArgo Rollouts is a Kubernetes controller and set of CRDs which provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery features to Kubernetes.
\\n\\n\\n\\nArgo Rollouts (optionally) integrates with ingress controllers and service meshes, leveraging their traffic shaping abilities to gradually shift traffic to the new version during an update. Additionally, Rollouts can query and interpret metrics from various providers to verify key KPIs and drive automated promotion or rollback during an update.
\\n\\n\\n\\nThe native Kubernetes Deployment Object supports the RollingUpdate strategy which provides a basic set of safety guarantees (readiness probes) during an update. However the rolling update strategy faces many limitations:
\\n\\n\\n\\nFor these reasons, in large scale high-volume production environments, a rolling update is often considered too risky of an update procedure since it provides no control over the blast radius, may rollout too aggressively, and provides no automated rollback upon failures.
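For contrast with the plain RollingUpdate strategy, a minimal illustrative Rollout with a canary strategy might look like the sketch below; the image, labels, and step durations are placeholders rather than recommendations:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.27         # placeholder image
  strategy:
    canary:
      steps:
      - setWeight: 20             # shift 20% of traffic/replicas to the new version
      - pause: {duration: 10m}    # hold for observation before continuing
      - setWeight: 50
      - pause: {}                 # indefinite pause until manually promoted
```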
\\n\\n\\n\\nArgo Rollouts is a progressive delivery tool for Kubernetes. It enables advanced deployment strategies like Canary and Blue-Green deployments, ensuring smooth and controlled updates of applications. With features like traffic shifting and analysis, it empowers teams to deliver software with confidence and reliability in Kubernetes environments. Argo Rollouts and https://argo-rollouts.readthedocs.io/en/stable/
Argo Rollouts: A Brief Overview of Concepts, Setup, and Operations — Argo Rollouts: Quick Guide to Concepts, Setup & Operations
Progressive Delivery Controller for Kubernetes — Argo Rollouts simplifies Kubernetes deployments by offering advanced deployment strategies like blue-green, canary, and experimentation. Automated rollback and rollback analysis ensure reliability. This documentation introduces its concepts, empowering users to leverage its powerful features for seamless, controlled application deployments in Kubernetes environments. Kubernetes Progressive Delivery Controller
\\n\\n\\n\\nExecuting Progressive Deployment Strategies Using Argo Rollouts — Progressive Deployment Strategies using Argo Rollout
\\n\\n\\n\\nAnalysis Template — https://docs.opsmx.com/opsmx-intelligent-software-delivery-isd-platform-argo/user-guide/delivery-verification/analysis-template
Creating Analysis Runs for Rollouts — https://argo-rollouts.readthedocs.io/en/release-1.5/generated/kubectl-argo-rollouts/kubectl-argo-rollouts_create_analysisrun/
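As a sketch of how analysis drives automated promotion or rollback, the hypothetical AnalysisTemplate below queries Prometheus for a success rate; the Prometheus address, metric names, and thresholds are assumptions, not values from the documentation:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.95   # keep promoting only while >= 95% of requests succeed
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # placeholder address
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

A Rollout references a template like this from an analysis step in its canary strategy, and Argo Rollouts creates an AnalysisRun from it for each evaluation.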
Argo Events is an event-based dependency manager for Kubernetes which helps you define multiple dependencies from a variety of event sources like webhook, s3, schedules, streams etc. and trigger Kubernetes objects after successful event dependencies resolution.
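As a rough sketch of those building blocks, an EventSource can expose a webhook and a Sensor can declare a dependency on it and fire a trigger (here the simple log trigger; a real setup would typically submit an Argo Workflow). Names, ports, and endpoints are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: webhook
spec:
  service:
    ports:
    - port: 12000
      targetPort: 12000
  webhook:
    example:                      # named event emitted by this source
      port: "12000"
      endpoint: /example
      method: POST
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: webhook-sensor
spec:
  dependencies:
  - name: payload
    eventSourceName: webhook
    eventName: example
  triggers:
  - template:
      name: log-trigger
      log: {}                     # simplest trigger: log the event payload
```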
\\n\\n\\n\\nExplore Argo Events on Codefresh: Discover event-driven automation for Kubernetes. Learn to build resilient, scalable workflows triggered by events from various sources. Harness the power of Argo CD, Events, and Workflows to streamline your CI/CD pipelines and optimize Kubernetes deployments. Empower your DevOps with seamless automation. Argo Events: the Basics and a Quick Tutorial
\\n\\n\\n\\nThe Argo Events tutorial introduces Argo, an open-source workflow engine, focusing on event-driven architecture. It details installation, configuration, and usage, emphasizing event-driven workflows for Kubernetes. Learn to automate tasks, trigger workflows based on events, and streamline Kubernetes operations for efficient, scalable development and operations. Argo Events — The Event-Based Dependency Manager for Kubernetes
\\n\\n\\n\\nArgo Events- The Event-Driven Dependency Manager for Kubernetes — Argo Events, part of the Argo project, orchestrates Kubernetes-native event-driven workflows. Its architecture leverages triggers, gateways, and sensors to detect events from various sources like Pub/Sub systems or HTTP endpoints, initiating workflows. This decentralized, cloud-native system enables scalable, resilient event processing and automation within Kubernetes clusters. Argo Events — The Event-Based Dependency Manager for Kubernetes
\\n\\n\\n\\nEvent-Driven Architecture on Kubernetes — Discover the dynamic world of event-driven architecture on Kubernetes with Argo Events. This blog delves into orchestrating microservices, leveraging Kubernetes’ power, and streamlining workflows with Argo Events. Explore seamless event handling and scalable solutions, unlocking the potential for efficient, responsive systems in the modern tech landscape. Event-Driven Architecture on Kubernetes with Argo Events
\\n\\n\\n\\nYou can explore and learn about the CAPA certification and related topics for free through the following materials, which I have used; and of course, the Argo Project documentation is our best friend.
\\n\\n\\n\\nAt the moment there are not many courses, as it is a new exam, and I would highly advise not clicking on every course that pops up in a Google search: because the exam is new, there are plenty of scams.
\\n\\n\\n\\nI hope in the near future we will see more courses from bigger platforms such as KodeKloud or killer.sh.
\\n\\n\\n\\nThe exam is not easy. Among other certs I would rank it in this order: KCNA/CGOA/CAPA/CKAD/PCA/KCSA/CKA/CKS. Within 24 hours after taking the exam you will receive your grade, and passing it feels pretty satisfying overall. I hope this was informative and useful 🚀
\\n\\nMember post originally published on Tetrate’s blog
\\n\\n\\n\\nThe industry is embracing Generative AI functionality, and we need to evolve how we handle traffic on an industry-wide scale. Keeping AI traffic handling features exclusive to enterprise licenses is counterproductive to the industry’s needs. This approach limits incentives to a single commercial entity and its customers. Even single-company open-source initiatives do not promote open multi-company collaboration.
\\n\\n\\n\\nA shared challenge like this presents an opportunity for open collaboration to build the necessary features. We believe bringing together different use cases and requirements through open collaboration will lead to better solutions and accelerate innovation. The industry will benefit from diverse expertise and experiences by openly collaborating on software across companies and industries.
\\n\\n\\n\\nThat is why Tetrate and Bloomberg have started an open collaboration to bring critical features for this new era of Gen AI integration: collaborating openly in the Envoy community, bringing AI traffic handling features to Envoy via Envoy Gateway and Envoy Proxy.
\\n\\n\\n\\nWhat makes traffic to LLM models different from traditional API traffic?
\\n\\n\\n\\nOn the surface it appears similar. Traffic comes from a client app that is making an API request, and this request has to get to the provider that hosts the LLM model.
\\n\\n\\n\\nHowever, it is different. Managing LLM traffic from multiple apps, to multiple LLM providers, introduces new and different challenges where traditional API Gateway features fall short.
\\n\\n\\n\\nFor example, traditional rate limiting based on the number of requests doesn't work for controlling usage of LLM providers, because they are computationally complex services. To measure usage, LLM providers tokenize the words in the request and response messages and count the number of tokens used. This count gives a good approximation of the computational complexity and cost of serving the request.
\\n\\n\\n\\nBeyond controlling usage of LLMs there are many more challenges relating to ease of integration and high-availability architectures. It’s no longer enough to just optimize for quality of service alone, adopters must consider costs of usage in real time. As adopters of Gen AI look for Gateway solutions to handle these challenges for their system, they often find the necessary features locked behind enterprise licenses.
\\n\\n\\n\\nNow, let’s look at how handling AI traffic poses new challenges for Gateways. There are several features we discussed together with our collaborators at Bloomberg, and together we decided on three key features for the MVP:
\\n\\n\\n\\nWhat other features are you looking for? Get in touch with us to share your use case and define the future of Envoy AI Gateway.
\\n\\n\\n\\nWe are really excited about these features being part of Envoy. They will benefit those integrating with LLM providers and, ultimately, also Gateway users for general API request traffic.
\\n\\n\\n\\nWhen it comes to AI Gateway features, we have chosen to collaborate and build within the CNCF Envoy project because we believe multi-company, open-source projects benefit the entire industry by enabling innovation without creating single vendor risk.
\\n\\n\\n\\nVia our Newsletter: Sign up to our mailing list to stay updated.
\\n\\n\\n\\nAttend Online CNCF Panel Event: To learn more about integrating AI in the enterprise, attend the live CNCF-hosted event “Enabling AI adoption at scale – The AI Platform with Envoy AI Gateway” on October 17th with panelists from Tetrate and Bloomberg. You will be able to ask questions and interact with the panel live.
\\n\\n\\n\\nSee a demo at the Tetrate booth at KubeCon NA: Visit Tetrate at booth #Q2 at KubeCon NA to see a demo and talk to the engineers working on the AI traffic handling features.
\\n\\n\\n\\n###
\\n\\n\\n\\nIf you’re new to service mesh, Tetrate has a bunch of free online courses available at Tetrate Academy that will quickly get you up to speed with Istio and Envoy.
\\n\\n\\n\\nAre you using Kubernetes? Tetrate Enterprise Gateway for Envoy (TEG) is the easiest way to get started with Envoy Gateway for production use cases. Get the power of Envoy Proxy in an easy-to-consume package managed by the Kubernetes Gateway API. Learn more ›
\\n\\n\\n\\nGetting started with Istio? If you’re looking for the surest way to get to production with Istio, check out Tetrate Istio Subscription. Tetrate Istio Subscription has everything you need to run Istio and Envoy in highly regulated and mission-critical production environments. It includes Tetrate Istio Distro, a 100% upstream distribution of Istio and Envoy that is FIPS-verified and FedRAMP ready. For teams requiring open source Istio and Envoy without proprietary vendor dependencies, Tetrate offers the ONLY 100% upstream Istio enterprise support offering.
\\n\\nMember post originally published on Devtron’s blog by Bhushan Nemade
\\n\\n\\n\\nIn the previous blog on Jenkins, we already covered how to set up a Jenkins pipeline and the pros and cons of Jenkins for CI/CD pipelines in Kubernetes environments. In this blog, we will dive deep into the potential challenges of managing Continuous Deployment (CD) through Jenkins. We will also look at how to execute robust Continuous Deployments in complex Kubernetes environments while keeping Jenkins for Continuous Integration (CI) and integrating it with Devtron for a complete CI/CD experience.
\\n\\n\\n\\n\\n\\n\\n\\n\\nWe recommend avoiding Jenkins for Continuous Deployments.
\\n
Jenkins was originally designed for Continuous Integration (CI). Still, given its flexibility, Jenkins has also been used for Continuous Deployment. It comes with multiple plugins that can be integrated into pipelines to execute Continuous Deployments through Jenkins. CD pipelines through Jenkins can be managed in two ways:
\\n\\n\\n\\nIntegrating as a Stage into Jenkins CI Pipeline: With this approach, you can add a new stage to your existing build pipeline which will execute deployment to the Kubernetes environments. Typically, you will have to write your custom scripts to apply the configurations on the target environment.
\\n\\n\\n\\nSetting Up a Dedicated Jenkins Agent: Configure a separate Jenkins agent on your target Kubernetes environment. This agent will be responsible for executing deployments to the target environments.
\\n\\n\\n\\nEven though Jenkins has been used widely for Continuous Integration (CI) and Continuous Deployment (CD), there are certain pitfalls where you would not want to use Jenkins. Here are some of the disadvantages of Jenkins for Continuous Deployment on Kubernetes.
\\n\\n\\n\\nJenkins pipelines are executed through Groovy scripts; when it comes to deploying applications/services to multiple environments, writing and managing these scripts becomes tedious and error-prone. Additionally, there is a learning curve to understanding Groovy and writing those scripts.
\\n\\n\\n\\nManaging crucial configurations through Jenkins scripts poses risks of misconfigurations and potential leaks of sensitive information. It also makes it challenging to audit and track configuration changes effectively. It heavily relies on plugins, and managing a large number of plugins can become challenging as they need to be frequently updated to avoid compatibility issues.
\\n\\n\\n\\nIn Jenkins, every process/step gets handled through a Jenkins pipeline in the form of a script, and as the scale increases it becomes more complex to manage those scripts, which may introduce human errors. Additionally, Jenkins has a complicated permission and access control system, especially when working with large teams. Managing fine-grained access control across multiple teams and projects is not as straightforward as in some other modern Continuous Deployment (CD) tools like Argo CD or Devtron. Since Jenkins relies heavily on third-party plugins, it can introduce security vulnerabilities, especially if plugins are not regularly updated, and managing the complex web of plugins itself is a big challenge. Here are some references to previous zero-day vulnerabilities in Jenkins caused by its vast plugin ecosystem.
\\n\\n\\n\\nJenkins was originally designed for Continuous Integration (CI), but it has been extended to continuous deployment with some hacks and plugins. Even with custom scripts and plugins, it lacks built-in production capabilities which are highly important for a stable production pipeline, such as policy-driven pipelines, approval workflows, production configuration protection, DR mechanisms, configuration drift handling, release management, and much more.
\\n\\n\\n\\nJenkins was originally designed for Continuous Integration, and it has limitations in its deployment capabilities. It lacks built-in approval workflows and gates for production deployments, as well as native support for advanced deployment strategies like canary releases or blue-green deployments. Jenkins also lacks visibility and rollback capabilities for production deployments.
\\n\\n\\n\\nExecuting CI/CD pipeline operations using Jenkins is handled by multiple plugins and extensive scripts. These plugins and pipelines need to be verified from time to time to make sure that the current version is supported. With time and scale, this becomes a tedious process and overhead work to keep everything up to date. Additionally, Jenkins can become resource-intensive, especially in larger setups with multiple pipelines and concurrent builds. It often requires dedicated servers and frequent performance tuning, as performance degrades over time because of concurrent builds and jobs.
\\n\\n\\n\\nJenkins doesn’t natively support advanced deployment strategies such as canary releases, blue-green deployments, or rolling updates. Achieving these requires custom scripting or integrating third-party tools.
\\n\\n\\n\\nJenkins does not come with advanced monitoring, logging, or observability features out of the box. To gain insight into deployment health or performance, you need to integrate external monitoring tools and create custom dashboards in a tool like Grafana. For workloads deployed on Kubernetes, you can't see their real-time status (whether they are deployed or in a pending state), which is one of the important aspects of any continuous deployment and is available in modern continuous deployment (CD) tools like Argo CD or Devtron.
\\n\\n\\n\\nJenkins is not designed specifically for Kubernetes, so managing Kubernetes deployments can be clunky and inefficient compared to Kubernetes-native tools like ArgoCD, FluxCD, or Devtron. For Kubernetes deployment, Jenkins often requires additional tools or scripts for integration, leading to more complex configurations. Additionally, Jenkins is not cloud-native. It was originally designed for on-premise setups. While it can be adapted for cloud environments, it doesn’t offer the seamless integration, autoscaling, and cost optimization features of cloud-native CD tools designed specifically for Kubernetes environments. In cloud-native environments, Jenkins often needs manual scaling for build agents and resources, unlike modern CD tools that automatically scale based on load.
\\n\\n\\n\\nJenkins doesn't natively support GitOps workflows, which are increasingly popular for continuous deployment in Kubernetes environments. This requires the integration of additional layers of configuration and tools to achieve a GitOps model, which introduces the management of different tools and thus increases complexity. Additionally, it doesn't have an automated/SLO-based rollback mechanism, which is highly critical for continuous deployment pipelines.
\\n\\n\\n\\nContinuous Deployment is a crucial component of the software development lifecycle, providing the capability to execute lightning-fast deployments. Every deployment going to the production servers affects the status of applications/services for users; for instance, if a single deployment goes wrong, the impact can be as severe as the disruption of all services. To strike a balance between stability and innovation, every deployment going to any server should pass through a checklist. An ideal checklist for continuous deployment can include:
\\n\\n\\n\\nHaving a robust RBAC is crucial for the deployment process, as it ensures that only authorized stakeholders can take crucial actions, such as triggering deployments at production servers.
\\n\\n\\n\\nEnforcing certain policies ensures that the best practices are being followed before executing the deployments. For instance, the policies can be like a mandatory scan of Docker images before deployments, on which if vulnerabilities are detected, the deployment will be aborted.
\\n\\n\\n\\nThe approval process reduces the chances of the wrong deployment going to the production environments and disrupting the services across regions. Setting up an approval process where the stakeholders need to approve the deployment reduces the chances of service disruptions.
\\n\\n\\n\\nIt helps you gain control over deployments, ensuring stability and preventing disruptions during peak hours. Define time slots for planned deployments or block deployments entirely for critical business hours.
\\n\\n\\n\\n\\n\\n\\n\\n\\nNOTE: Devtron supports a robust CI, which is precisely crafted for Kubernetes environments and offers multiple features and flexibility. However, in this blog, we will focus exclusively on Devtron’s CD capabilities.
\\n
The CD capabilities of Devtron include features such as choosing the deployment type (Automatic or Manual), the method of deployment (Helm or GitOps), multiple deployment strategies, approval-based deployments, and pre- and post-deployment stages. Let's get our hands on the Devtron dashboard and deploy the application.
\\n\\n\\n\\nFor this blog, we will use Jenkins in collaboration with Devtron to build, deploy, and manage our Kubernetes application. In the previous blog we built an image and pushed it to Docker Hub; we will be using that same image (built by Jenkins) and deploying it through Devtron using all the CD capabilities of the dashboard.
\\n\\n\\n\\nTo install Devtron, the only prerequisite is a running Kubernetes cluster. For the complete installation of Devtron, follow their documentation.
\\n\\n\\n\\nOnce we have installed Devtron, we can start by creating a new application and configuring the CI/CD workflows. Before setting up the CI/CD pipelines we need to set some configurations, i.e., the Git Repository, Build Configurations, and Base Configurations; to learn how to set them up, you can follow Devtron's documentation. Now that we are done with all the required configurations, let's create our workflows, i.e., the CI/CD pipelines. Since we are going to use Jenkins as our CI pipeline for this blog, Devtron offers two ways to execute deployments. Let's check them one by one. The first one is Deploy Image from External Service, where Devtron provides us an external webhook to trigger the deployments. The other one is Deployment Using Job CI, which is executed using Devtron's Job CI and Devtron's plugins.
\\n\\n\\n\\nOnce the configurations are done, choose the type of pipeline for deployment; for this instance, we are choosing Deploy Image from External Service, as we already have a Docker image built through Jenkins.
\\n\\n\\n\\nAfter selecting the type of pipeline, i.e., Deploy Image from External Service, we are redirected to the configuration of the CD pipeline. At the CD pipeline configuration, we need to provide the Environment and Namespace, and choose the execution type, Automatic or Manual. Moreover, with Devtron we also get the capability to execute advanced deployment strategies, i.e., Rolling, Recreate, Canary, and Blue-Green.
\\n\\n\\n\\nOn completing the above configurations, click Create Pipeline; now we will be able to visualize the application workflow, i.e., the CI/CD pipeline for the application. The CI part will be executed on the Jenkins server, and the generated Docker image will be fetched via the webhook. The CD pipeline of Devtron will help us perform robust CD operations.
\\n\\n\\n\\nLet's do some configuration for our external webhook: we need to use a cURL request in our Jenkins pipeline, which will let Devtron know that the image has been built by an external service.
\\n\\n\\n\\nOn Workflow Editor click External Source (Webhook) > Select or auto-generate token with required permissions > Auto-generate token > {Enter the name for token} > Generate token.
\\n\\n\\n\\nClick Sample cURL request > copy the Sample cURL request and integrate it into your Jenkins pipeline along with the auto-generated token and the tag for the Docker image.
\\n\\n\\n\\nNow that we have configured the External webhook for CI, let’s set the CD pipeline through Devtron.
\\n\\n\\n\\nClick Deploy (deployment pipeline), and you will be navigated to the Edit deployment pipeline, where you can configure a robust CD pipeline.
\\n\\n\\n\\nTo make sure that the deployment going to the critical environments is vulnerability-free, you can navigate to the Pre-Deployment stage. At this stage, you can configure Devtron’s plugins for scanning the Docker image, in case any vulnerabilities are detected you can stop the deployments.
\\n\\n\\n\\nDevtron also allows you to set up an approval process, where approval from the relevant stakeholders is required before the execution of each deployment. To set up an approval, navigate to the Deployment stage > Approval for deployment; here you can select the number of approvals required before executing the deployment to specific environments.
\\n\\n\\n\\nAlso, you can select the way you want to execute the deployment pipeline, i.e. Automatic or Manual.
\\n\\n\\n\\nTo set up notifications on completion of the deployment process, you can navigate to the Post Deployment stage and configure plugins like Custom Email Notifier, which will send notifications when the deployment process completes.
\\n\\n\\n\\nOnce every configuration is done click Update Pipeline, and trigger your application pipeline accordingly. Once the application is deployed you can manage your applications through Devtron’s dashboard where you can leverage multiple other features like Deployment Window for controlled deployments, configuration management across environments, robust user access management, and many more. You can follow Devtron’s documentation to explore this further.
\\n\\n\\n\\nAnother way we can deploy applications using Devtron and Jenkins is by using a Job CI pipeline. In Devtron, we need to select Create a job.
\\n\\n\\n\\nTo configure the Job CI and CD pipeline, navigate to New Workflow > Create a job > You will be redirected to the Basic configuration > Set the basic configurations like Pipeline Name, Source Type, and Branch Name.
\\n\\n\\n\\nOnce the Basic Configurations are done, navigate to the Task to be executed. Here, you need to add Devtron's plugins, which will facilitate triggering the Jenkins pipeline and deploying applications through Devtron.
\\n\\n\\n\\nDevtron's Jenkins plugin helps us remotely trigger our Jenkins pipelines and fetch the logs into the Devtron dashboard. Once the pipeline is executed successfully, the generated container image will be available for us to deploy on the Devtron dashboard (provided that your Jenkins pipeline is pushing the Docker image to the container registry configured with Devtron).
\\n\\n\\n\\nIn case the container registry is not configured with Devtron, you can utilize Devtron's Pull Images from Container Repository plugin. This plugin continuously polls the specified container registry and pulls images from the container repository into Devtron for deployments.
\\n\\n\\n\\nClick the Update Pipeline button, and that’s it. We have configured Job CI, which will trigger our Jenkins CI pipeline and fetch the container image for deployment.
\\n\\n\\n\\nFor the CD pipeline with Devtron, you can refer to Step No. 4, where it’s been described how we can configure and execute the robust CD pipeline with Devtron.
\\n\\n\\n\\nWhile Jenkins is a powerful CI/CD tool, it has notable disadvantages as a continuous deployment tool, particularly for modern, cloud-native, and Kubernetes-centric environments. Alternatives like ArgoCD, FluxCD, or Devtron offer better out-of-the-box integrations for advanced deployment strategies, GitOps workflows, and Kubernetes management, making them more suitable for many organizations looking for scalable and automated CD solutions. In this blog, we have covered the potential disadvantages of Jenkins when it comes to continuous deployment (CD) for Kubernetes, capabilities for an ideal CD pipeline, and how Devtron helps us to set up a robust CI/CD pipeline integrating with Jenkins.
\\n\\n\\n\\nDevtron stands out as a complete solution when it comes to managing Kubernetes, with the capabilities to handle robust continuous deployment (CD) operations such as robust RBAC, approval workflows, Devtron's plugins, vulnerability scanning, configuration drift management, release orchestration, and team collaboration. When it comes to managing multiple Kubernetes clusters and Day 2 operations, like managing multiple applications spread across multiple Kubernetes clusters, Devtron comes as a savior.
\\n\\n\\n\\nIf you have any queries, don’t hesitate to connect with us. Join the lively discussions and shared knowledge in our actively growing Discord Community.
\\n\\n\\n\\nCommunity post originally published on Medium by Dotan Horovits
PromCon Europe 2024 just wrapped up in Berlin, and this year’s edition was a big one. Not just because the Prometheus community gathered in full force, but also because we got the long-anticipated unveiling of Prometheus 3.0! The maintainers literally hit the ‘merge’ button for 3.0-Beta on stage!
\\n\\n\\n\\nFor the occasion, I sat down with Julius Volz, the creator of Prometheus, for a walk through the major announcements and what they mean for the future of the Prometheus ecosystem. I’ll recap the highlights in this blog, and you can find the full fireside chat on the latest episode of OpenObservability Talks:
\\n\\n\\n\\nPrometheus v2 has been around for almost 7 years, with no less than 54 minor releases. It was high time to bump that major release. Major versions are also the time that software usually has breaking changes, and Julius says this was indeed part of the reason for this major release. So should we expect things to break when we upgrade from 2.x to 3.0?
\\n\\n\\n\\nThe good news, according to Julius, is that for most people, when they upgrade from 2.x to 3.0, nothing much will break. He said it was about getting rid of "old crufty stuff", primarily removing deprecated experimental feature flags, so the majority of users should not experience any breaking changes. Don't worry, Prometheus keeps its longstanding backwards-compatibility commitment to its users.
\\n\\n\\n\\nThis major release is not just about incremental changes — it represents a shift in how Prometheus will be used moving forward. Julius explained that one of the major themes for v3.0 is OpenTelemetry (OTel) compatibility. This is not trivial, given different design principles employed by each project around metric naming (dots or underscores? with or without units?) and encoding (UTF-8 now supported on Prometheus v3), pull vs. push mode, delta vs. cumulative temporality, and more. But there’s a clear upside to it.
\\n\\n\\n\\nPrometheus has always been known for its robust metric collection, but with the move to support native OpenTelemetry metrics, Prometheus maintainers want to position it as the go-to backend for OpenTelemetry. This is a strategic step to make Prometheus more interoperable and future-proof as OpenTelemetry continues to gain adoption across the industry.
\\n\\n\\n\\nWe also discussed the revamping of the Prometheus UI. While Prometheus has traditionally been loved for its backend performance and flexibility, the frontend has sometimes lagged behind. Julius walked us through how the new UI in 3.0, which he developed himself, will improve usability without sacrificing the core simplicity that Prometheus users value. It’s based on Mantine, a modern React component library, and achieves a slick and cleaner look.
\\n\\n\\n\\nThe new Prometheus UI also offers new functionality, including a PromLens-style tree view, a better metrics explorer and an “Explain” tab. The revamped metrics explorer enables easy visual exploration of metrics and their labels, cardinality and more. The tree view was inspired by PromLens, a PromQL query builder tool which was recently donated to Prometheus.
\\n\\n\\n\\n\\nJust ported over the metrics and labels explorer from PromLens into the new @PrometheusIO UI: pic.twitter.com/7jfDB6NdDt
— Julius Volz (@juliusvolz) September 2, 2024
To my question, Julius said that the Alertmanager UI is not being touched at the moment, and he’s not sure when that would happen. He did confirm that the goal is to adopt the same UI framework (currently it’s not even React-based) and look and feel, as the revamped Prometheus UI. For more on the new UI, check out Julius’s blog post.
\\n\\n\\n\\nOne of the standout features of Prometheus 3.0 is Remote Write 2.0. Remote Write format serves to transmit bulk metrics from Prometheus to analytics backends. Julius shared insights into how this new iteration of remote write enhances the way Prometheus handles long-term storage. Prometheus, by design, is all about short-term, high-performance metric collection. But the ecosystem has evolved to handle long-term storage solutions like Thanos, Cortex, and Mimir, which integrate with Prometheus.
\\n\\n\\n\\nWith Remote Write 2.0, the focus is on reliability and efficiency. It drastically reduces the chance of data loss during network outages or downtime and allows for better streaming of data to remote storage. This is a big leap forward for teams that rely on Prometheus for critical monitoring and need bulletproof data pipelines.
\\n\\n\\n\\nAlso important is the exposition format, which defines how metrics are exposed by different components for scraping by Prometheus. A while back this format was spun off into its own project called OpenMetrics, in the hope of making it an independent standard. This hasn't succeeded, and it cluttered the Prometheus ecosystem. Now OpenMetrics is officially archived and merged back into Prometheus, where it belongs.
\\n\\n\\n\\nAnother exciting feature coming with Prometheus 3.0 is the introduction of native histograms. This feature significantly enhances how Prometheus can handle high-cardinality data, making it easier to manage large data sets without sacrificing performance.
\\n\\n\\n\\nWith v3.0, native histograms now support out-of-order ingestion, which addresses various scenarios brought up by OpenTelemetry, and more broadly by network disconnects and similar temporary gaps in metric data, gaps that can now be filled.
\\n\\n\\n\\nBy natively supporting histograms, Prometheus 3.0 reduces the complexity of metric aggregation and makes queries faster and more efficient. It’s one of those under-the-hood improvements that might not make headlines, but it will have a big impact for users at scale.
\\n\\n\\n\\nWe couldn’t talk about Prometheus without touching on the broader ecosystem. One important open source project in the Prometheus ecosystem is Thanos, which offers long-term scalable storage for Prometheus. Similar to Cortex and Mimir, Thanos is introducing native multi-tenancy support, enabling sending data from different tenants, and then tracking and controlling access to it by tenant.
\\n\\n\\n\\nAnother interesting development in Thanos is distributed query execution, which improves query performance by pushing some of the processing down to leaf nodes, in a hub-and-spoke (or map-reduce) fashion.
\\n\\n\\n\\nOn the visualization side, one of the interesting developments coming out of PromCon this year was Perses project that recently joined the CNCF. Julius hinted that while Grafana has been the default for many, Perses is offering a lightweight, Prometheus-native dashboarding experience, with GitOps capabilities and foundational open source philosophy. Perses is still in its early days, but it’s worth keeping an eye on as it evolves.
\\n\\n\\n\\nAs Prometheus continues to grow, the governance model has also undergone changes. We touched on the project’s shift to a more formalized governance structure, lowering the barrier to entry for people to get involved, and making it easier for contributors to reach key positions in the project and collaborate on the future of Prometheus. This move aligns with other CNCF projects that have embraced more structured, tiered, transparent governance models as they mature.
\\n\\n\\n\\nJulius pointed out that with Prometheus now being used by countless organizations, it was important to ensure that decisions about its direction are made inclusively. The new governance framework will offer more tiers and empower a broader community to get more responsibility, accountability and permissions, and ultimately influence the roadmap while maintaining the high standards that have made Prometheus a cornerstone of the observability space.
\\n\\n\\n\\nPromCon was full of updates, such as automatic reloading of the Prometheus configuration, regex and query functionality improvements, the new Service Discovery Manager, and the Agent mode reaching stability, and we couldn't cover it all here.
\\n\\n\\n\\nThe 3.0 Beta is out and you’re welcome to try it out and check the release notes. The Prometheus 3.0 GA release is expected towards KubeCon North America 2024 in November, according to the outcomes of the Prometheus maintainers’ DevSummit that followed PromCon.
\\n\\n\\n\\nAs we wrapped up our discussion, Julius emphasized that while Prometheus 3.0 is a huge milestone, the work is far from over. The team is already looking ahead to what’s next, including improvements to scalability, deeper integrations with OpenTelemetry, and continued enhancements to the overall user experience.
\\n\\n\\n\\nPromCon Europe 2024 was a reminder of just how far Prometheus has come. With the launch of Prometheus 3.0, the project is poised to remain a dominant force in the observability ecosystem for years to come.
\\n\\n\\n\\nWant to learn more? Check out the OpenObservability Talks episode: Prometheus 3.0 Unveiled: PromCon Highlights with Julius Volz
\\n\\nCloud native technology adoption continues to increase across all enterprises, with most new applications being built on cloud native platforms and, in particular, on Kubernetes. This is largely driven by the increasing maturity of and trust in Kubernetes as a core cloud native technology, the de facto standard for managing cloud native environments. According to the 2023 CNCF Annual Survey, 84% of the cloud native community is using or evaluating Kubernetes.
\\n\\n\\n\\nAt CNCF, we’re seeing a significant increase in new members, cloud native ambassadors, and Kubestronauts who have passed all Kubernetes certifications, indicating strong community support. Those who run applications on Kubernetes are trusting it more than ever to host mission critical elements, like databases, real-time analytics, and AI/ML workloads. In a recent survey of organizations that have already adopted Kubernetes, 80% report they are planning to build most of their new applications on cloud native. This means that their cloud native data platforms need to be able to provide enterprise-grade capabilities for these applications to run efficiently and securely.
\\n\\n\\n\\nHowever, adopting a cloud-native approach can introduce new challenges, including security and skills gaps. Practicing good cloud security protects organizations from a variety of risks that can have severe financial, operational, legal, and reputational consequences. Proper cloud security measures are essential to safeguarding data, maintaining compliance, and ensuring the continuity of business operations. In the 2023 CNCF Annual Survey, 40% of organizations reported security as their leading challenge, while 46% said lack of training is the biggest challenge facing organizations that have not started, or are just beginning, their cloud native journey.
\\n\\n\\n\\nCNCF is committed to growing the community of specialists knowledgeable in Kubernetes security to enable continued growth across organization and industries. Certification is a key step in that process, allowing certified security specialists to quickly establish their credibility and value in the job market, and also allowing companies to more quickly hire high-quality teams to support their growth.
\\n\\n\\n\\nCloud security certifications like CKS and KCSA are especially important for Kubernetes professionals in the current digital landscape, where cloud environments must be secure to protect data and maintain business processes.
\\n\\n\\n\\nThe Certified Kubernetes Security Specialist (CKS) certification, created by the Cloud Native Computing Foundation (CNCF), in collaboration with Linux Foundation Education and subject matter experts, is considered one of the most valuable certifications in the DevOps space because it’s a hands-on, performance-based certification exam that tests candidates’ knowledge of Kubernetes and cloud security in a simulated, real world environment. CKS certification also provides these benefits:
\\n\\n\\n\\nThe CKS exam was updated on October 15, 2024, to ensure that Certified Kubernetes Security Specialists have up-to-date skills and knowledge in securing Kubernetes and container-based applications. While the CKS domains (i.e., Cluster Setup, Cluster Hardening, etc.) remain the same, the update brings changes to competencies, including additions, deletions, and updated language, as well as adjustments to the percentage weight of some domains. These updates align with the latest developments in Kubernetes and cloud security.
\\n\\n\\n\\nAchieving your Certified Kubernetes Administrator (CKA) is a prerequisite to registering and taking your CKS exam, although your CKA certification does not have to be active.
\\n\\n\\n\\nThe CKS certification is especially valuable for those looking to specialize in the security aspects of Kubernetes – a critical area as container adoption continues to grow. As Kubernetes continues to dominate as the platform of choice for managing cloud-native applications, staying current with the latest security practices is essential. The changes to the CKS exam confirm that certified professionals possess the most relevant skills and knowledge to protect container-based applications. The certification also lets professionals validate their expertise and enhance their career prospects, contributing to the overall security and success of cloud-native initiatives across industries.
\\n\\nCo-chairs: Paula Kennedy, Stacey Potter, Vijay Chintha
\\n\\n\\n\\nNovember 12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nPlatform Engineering Day focuses on solutions over tooling. We believe that Platform Engineering is a vital practice that helps organizations to increase their speed and efficiency of delivering software. The focus for the day will be to look at the challenges that teams are tackling and sharing use cases and advice from practitioners and consultants that will give attendees practical takeaways. This is the second time we’ve had a Platform Engineering Day; the first time was at KubeCon + CloudNativeCon Europe 2024.
\\n\\n\\n\\nThis will be relevant for any engineers working in DevOps, Platform Teams, SRE teams or any team that is providing services or tooling to another team. There will also be key ideas and suggestions that will be useful for those working in infrastructure or operations teams, and for decision makers looking to understand how to solve bottlenecks within their organization caused by internal tech debt and platform sprawl.
\\n\\n\\n\\nWe have packed in even more amazing content than we had in Paris, so it is going to be a fantastic day for learning.
\\n\\n\\n\\nThe event is a full-day, single-track event with a mixture of keynotes, lightning talks, breakout sessions and panels. We’ll kick off with an update from the CNCF Platforms Working Group (part of TAG App Delivery), and from then on it will be an agenda packed with awesome stories.
\\n\\n\\n\\nSubmitted by Paula Kennedy, who is looking forward to catching up with old friends and making new ones.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
Member post originally published on ngrok’s blog by Mike Coleman
\\n\\n\\n\\nMicroK8s is a lightweight, efficient, and easy-to-use Kubernetes distribution that enables users to deploy and manage containerized applications. ngrok, on the other hand, provides a secure and scalable universal gateway that enables secure access to your workloads regardless of where they are hosted.
\\n\\n\\n\\nBoth MicroK8s and ngrok excel in the arena of edge computing. This blog post examines how leveraging MicroK8s alongside ngrok can simplify and accelerate edge computing use cases.
\\n\\n\\n\\nMicroK8s is a lightweight Kubernetes distribution from Canonical. MicroK8s is extremely simple to set up. According to the documentation, it can be up and running in as little as 60 seconds. Additionally, it’s optimized to run in low-resource environments, requiring as little as 540 MB of RAM; however, 4 GB of RAM is recommended for running actual workloads. Because of its low overhead, MicroK8s is especially well suited to operating on edge compute devices, including point-of-sale systems, controllers, and lightweight hardware.
\\n\\n\\n\\nDespite its ability to run in low-resource environments, MicroK8s is a full-featured Kubernetes distribution. It provides several advanced features, such as GPU support, automated HA configurations, and strict confinement to provide additional isolation between workloads and the underlying resources they run on.
\\n\\n\\n\\nngrok is a unified ingress platform that provides access to workloads running across a wide variety of infrastructure. By combining a lightweight agent with a powerful global network, ngrok makes it extremely easy to bring workloads online, including APIs, services, devices, and applications. With ngrok, you get a single platform that can replace multiple disjointed services, including reverse proxy, API gateway, firewall, and global load balancer.
\\n\\n\\n\\nBy automating and standardizing network access, ngrok simplifies providing access to edge computing workloads. Specific to MicroK8s, ngrok features a Kubernetes operator that can provide access to cluster-based workloads. The ngrok Kubernetes operator provides both a traditional ingress controller and a Gateway API implementation.
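\\n\\n\\n\\nAs a rough illustration (not taken from the original post), exposing a MicroK8s-hosted service through the ngrok operator can be as simple as a standard Kubernetes Ingress that references the operator’s ingress class. The class name, hostname, and service below are assumptions for this sketch; check the ngrok operator documentation for the exact values your installation registers:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: store-frontend                  # hypothetical edge workload
  namespace: default
spec:
  ingressClassName: ngrok               # assumed ingress class registered by the ngrok operator
  rules:
    - host: store.example.ngrok.app     # placeholder ngrok domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend          # existing Service inside the MicroK8s cluster
                port:
                  number: 80

Applying a manifest like this lets the operator create the public endpoint, so no inbound firewall rules or port forwarding are needed on the edge device.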
\\n\\n\\n\\nMicroK8s and ngrok are a powerful combination for hosting and accessing edge-based workloads, but what makes those workloads challenging to manage?
\\n\\n\\n\\nEdge computing presents several challenges, including managing and orchestrating distributed infrastructure, ensuring security and data privacy, and addressing variability in network connectivity and latency. Additionally, edge computing environments often require specialized hardware and software, which can be costly and difficult to maintain. Furthermore, edge applications require low-latency processing and real-time decision-making, making it essential to optimize performance and minimize downtime. Moreover, the edge ecosystem is highly fragmented, with diverse technologies and standards, making interoperability a significant challenge. Finally, edge computing also raises concerns about data management, including data processing, storage, and analytics, which must be addressed to unlock the full potential of edge computing.
\\n\\n\\n\\nAs mentioned before, MicroK8s is a full-featured, lightweight Kubernetes distribution. This combination allows MicroK8s to run on edge devices while handling intensive workloads such as AI inference and data analytics. By having MicroK8s deployed at the edge, latency can be greatly reduced. An additional benefit is reducing the unnecessary flow of data, which helps to provide additional layers of security.
\\n\\n\\n\\nWith MicroK8s handling the processing duties, ngrok can focus on managing access to the running workloads. In traditional environments, a mishmash of hardware and software provides access to running workloads. This results in significant friction in deploying and managing access, along with security concerns.
\\n\\n\\n\\nngrok provides a standardized platform that removes the complexity of providing access to edge-based workloads. ngrok standardizes connectivity into external networks hosting IoT devices without requiring support from partners or changes to their network configurations. ngrok does all this while working with any operating system or platform. Additionally, ngrok supports security policies to ensure compliance with necessary security requirements. Finally, ngrok’s global network helps guard against networking failures by providing multiple points of presence with automated failover.
\\n\\n\\n\\nIn addition to edge computing, MicroK8s, ngrok, and the ngrok Kubernetes Operator can be combined to enable other use cases, including:
\\n\\n\\n\\nUltimately, deploying and managing edge computing workloads can be challenging. However, by leveraging ngrok, our Kubernetes Operator, and MicroK8s, these issues can be greatly reduced. Getting started is simple: ngrok has a generous free tier (sign up here!), and you can follow our ngrok and MicroK8s integration guide to get started today.
\\n\\nNico Verbert is a Senior Staff Technical Marketing Engineer at Isovalent at Cisco and one of the creators of the Cilium Certified Associate Certification (CCA). Nico is a leading cloud and networking technologist, with over 18 years’ experience in the industry. He’s passionate about making customers successful with their business strategy through the use of innovative technology. We spoke to Nico about his role, Cilium and its importance in the Kubernetes ecosystem, the CCA exam and how to prepare, and the benefits of getting the certification.
\\n\\n\\n\\nMy official title, Senior Staff Technical Marketing Engineer, is quite a mouthful. In simple terms, my role is to make Kubernetes networking – an intimidating topic for many – a bit more approachable.
\\n\\n\\n\\nGiven that folks like to learn in different ways, I use different methods to teach and educate: blog posts, books, videos, online labs, social media, conferences and online workshops, etc.
\\n\\n\\n\\nIsovalent was founded by the creators of Cilium (the most popular cloud native networking platform) and acquired by the networking giant Cisco, so I get to work with the brightest minds in networking. I’m very privileged to do this for a living.
\\n\\n\\n\\nCilium is a cloud native networking, security and observability solution powered by a revolutionary technology called eBPF. Over the past few years, Cilium has become the de facto networking platform for Kubernetes. It’s been adopted by the likes of AWS, Azure and Google to power their Kubernetes services and distributions, and it is so widely used that it is the only “Graduated” platform in the CNCF Cloud Native Networking category.
\\n\\n\\n\\nWhenever you face a networking challenge in Kubernetes, it’s likely that you will consider using Cilium to address it.
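\\n\\n\\n\\nFor readers who have not used Cilium yet, here is a minimal sketch of what its policy model looks like in practice; the labels and port are illustrative and assume a cluster where Cilium is installed as the CNI:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      app: backend                # the policy applies to backend pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend         # only frontend pods may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP

Identity-based policies like this, rather than IP-based firewall rules, are exactly the kind of concept the CCA expects you to be comfortable with.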
\\n\\n\\n\\nIt is the Linux Foundation entry-level certification for Cilium. The CCA certification was built in collaboration with Isovalent and Cilium experts from across the industry, from companies such as Microsoft, Accenture, Datadog, Solo.io, Conoa, etc. The CCA is a 90-minute multiple-choice exam covering Cilium and Kubernetes networking topics such as architecture, network policy, service mesh, installation and configuration. Eventually, I hope we’ll have a multi-tier certification path and follow the CCA with an expert-level hands-on exam. But I’d recommend folks get started with the CCA first.
\\n\\n\\n\\nWhen you consider the list of CNCF certifications, you’ll see that the established ones like the CKA and CKAD are quite broad. The only “Specialist” certification is the Certified Kubernetes Security Specialist (CKS).
\\n\\n\\n\\nIf you’re interested in networking, you may wonder whether there will ever be a networking-focused “CKN”. I think it’s unlikely – the Kubernetes networking model is a plug’n’play one where you – platform engineers and architects – have to decide which container network interface you should install in your cluster. There are too many networking options to come up with a CKN. But given that Cilium is approaching the status of “Kubernetes networking by default,” I think the CCA is the closest exam to what a “CKN” might look like.
\\n\\n\\n\\nAnd as I mentioned earlier, networking is often perceived as hard. If you can demonstrate a good understanding and foundation knowledge through passing the CCA, it’s going to help you stand out from the crowd of other applicants and candidates.
\\n\\n\\n\\nThis Study Guide is a really good start. In addition, I would recommend many of the videos on our YouTube channel but best of all, I would highly recommend our free online hands-on labs. We now have 30+ labs that cover various Cilium use cases and we’ve designed them to be short but also entertaining. Let me know what you think – you can find me on LinkedIn.
\\n\\n\\n\\nLevel up your skills, sign up for the Cilium Certified Associate Certification (CCA).
\\n\\nCommunity post originally published on Medium by Giorgi Keratishvili
Over the last five years, security has emerged as one of the most in-demand skills in IT. When combined with the equally sought-after skill of containers, we get a prestigious certification known as the CKS. But what if someone is interested in all of the above yet seeks an entry-level certification or is just starting out? Suppose this person has a basic understanding of container operations and architecture, is passionate about vendor-neutral solutions and might even hold a KCNA certification? Ah, then my dear friend, you are in the right place. In this blog, we will explore who might find this certification appealing, how to prepare for it and the knowledge one is expected to acquire upon completion…
\\n\\n\\n\\nWhen people hear the word “security” in a certification title, their first thought is often that it must be very challenging and require great expertise to study for and pass the examination. However, don’t let such thoughts deter you: diving into the details might reveal that the concepts we imagined were difficult are not so hard in reality. The goal of this certification is to provide individuals with basic knowledge of implementing container security best practices in a vendor-neutral way.
\\n\\n\\n\\nCompared to other certifications, I wouldn’t say it’s the easiest, but it still falls under the pre-professional level of difficulty and is a great way to test your knowledge before tackling the CKS. For me, the order of difficulty felt like this: KCNA/CGOA/CKAD/PCA/KCSA/CKA/CKS. One thing to keep in mind is that I had already passed the CKS before taking the KCSA, and when I was preparing for the exam, there were no tutorials or blogs to refer to, only some suspiciously scam-like dumps, so don’t fall for them. Below I will mention all the new courses and materials that should help in preparation.
\\n\\n\\n\\nRegarding who would benefit: SysAdmins, Devs, Ops, SREs, Managers, Platform Engineers, or anyone doing anything in production should consider it, as knowing basic security is always a good thing. It is also a good fit for anyone who wants to become a Kubestronaut 😉 (more about that in the next blog)…
\\n\\n\\n\\nSo, are we ready to patch every security hole in our cluster, kick hackers out of our production systems and make it hard for them to compromise your cluster? Well, we have a long path ahead until we reach that point. First, we need to understand what kind of exam it is compared to the CKAD, CKA and CKS. This is an exam where the CNCF has adopted multiple-choice questions, and compared to other multiple-choice exams, I would say this one is not easy-peasy. However, it is still classified as pre-professional, on par with the KCNA/PCA/CGOA.
\\n\\n\\n\\nThis exam is conducted online, proctored similarly to other Kubernetes certifications, and is facilitated by PSI. As someone who has taken more than 15 exams with PSI, I can say that every time it’s a new journey. I HIGHLY ADVISE joining the exam session 30 minutes before the test, because there are ID pre-checks and the room in which you are taking it needs to be checked against the exam criteria. Please check these two links for the exam rules and the PSI portal guide.
\\n\\n\\n\\nYou’ll have 90 minutes to answer 60 questions, which is generally considered sufficient time; the passing score is >75%. Be prepared for some questions that can be quite tricky. I marked a couple of them for review and would advise doing the same, because sometimes you can find a hint or a partial answer in a later question and then go back to the marked ones. Regarding pricing, the exam costs $250, but you can often find it at a discount, such as during Black Friday promotions or around CNCF events like KubeCon, Open Source Summit, etc.
\\n\\n\\n\\nAt this point, we understand what we have signed up for and are ready to dedicate time to training, but where should we start? Before taking this exam, I had good experience with Kubernetes and its ecosystem and had already taken the CKS, yet I still learned a lot from the preparation.
\\n\\n\\n\\nAt first glance, this list might seem too simple and easy; however, we need to learn the fundamentals of security first in order to understand higher-level concepts such as RBAC, the Shared Responsibility Model, the 4Cs and many more.
\\n\\n\\n\\n**Overview of Cloud Native Security 14%**\\n\\n\\n\\n
The 4Cs of Cloud Native Security
Cloud Provider and Infrastructure Security
Controls and Frameworks
Isolation Techniques
Artifact Repository and Image Security
Workload and Application Code Security
**Kubernetes Cluster Component Security 22%**
API Server
Controller Manager
Scheduler
Kubelet
Container Runtime
KubeProxy
Pod
Etcd
Container Networking
Client Security
Storage
**Kubernetes Security Fundamentals 22%**
Pod Security Standards
Pod Security Admissions
Authentication
Authorization
Secrets
Isolation and Segmentation
Audit Logging
Network Policy
**Kubernetes Threat Model 16%**
Kubernetes Trust Boundaries and Data Flow
Persistence
Denial of Service
Malicious Code Execution and Compromised Applications in Containers
Attacker on the Network
Access to Sensitive Data
Privilege Escalation
**Platform Security 16%**
Supply Chain Security
Image Repository
Observability
Service Mesh
PKI
Connectivity
Admission Control
**Compliance and Security Frameworks 10%**
Compliance Frameworks
Threat Modelling Frameworks
Supply Chain Compliance
Automation and Tooling
Kubernetes is based on a cloud-native architecture and draws on advice from the CNCF about good practice for cloud native information security. Read Cloud Native Security and Kubernetes for the broader context about how to secure your cluster and the applications that you’re running on it.
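\\n\\n\\n\\nTo make one of the fundamentals above concrete, here is the kind of minimal manifest worth being able to read for this exam: a default-deny ingress NetworkPolicy (the namespace name is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: demo                 # illustrative namespace
spec:
  podSelector: {}                 # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress                     # no ingress rules are listed, so all inbound traffic is denied

You will not write YAML during the multiple-choice exam, but being able to reason about manifests like this makes many of the questions much easier.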
\\n\\n\\n\\nYou can explore and learn about the KCSA certification and related topics for free through the following GitHub repositories, which I have used; and of course, the Kubernetes documentation is our best friend.
\\n\\n\\n\\nFor structured and comprehensive KCSA exam preparation, consider investing in the paid courses from Michael Levan on LinkedIn and O’Reilly. I have been following his content; it is very useful and I recommend it. However, I would highly advise against clicking on every course that pops up in a Google search: as this is a new exam, there are plenty of scams.
\\n\\n\\n\\nI hope that in the near future we will see more courses from bigger platforms such as KodeKloud or killer.sh.
\\n\\n\\n\\nThe exam is not easy; among the other certifications, I would rank it in this order: KCNA/CGOA/CKAD/PCA/KCSA/CKA/CKS. You will receive your grading within 24 hours of taking the exam, and passing feels pretty satisfying. Overall, I hope this was informative and useful 🚀
\\n\\nChair: Sebastian Stadil
\\n\\n\\n\\nNovember 12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nOpenTofu Day is the best place to meet and learn from OpenTofu developers and users from around the world. This is the second time this event has taken place. The first time was during KubeCon + CloudNativeCon Europe 2024 in Paris this past March.
\\n\\n\\n\\nAlthough anyone is welcome, platform and DevOps engineers interested in migrating to or expanding their use of OpenTofu are the target audience for OpenTofu Day.
\\n\\n\\n\\nAttendees should expect “more” from this event: the community is larger, the project is more mature, more new features will be presented, and more tips and tricks will be shared.
\\n\\n\\n\\nOpenTofu Day is a half-day event for the community, with talks from the OpenTofu Core Team as well as from experienced OpenTofu users.
\\n\\n\\n\\nNo prep work is required for this event!
\\n\\n\\n\\nExpect the talks at this event to be better than anything the community has seen before!
\\n\\n\\n\\nSubmitted by Sebastian Stadil.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nComing to KubeCon + CloudNativeCon North America in Salt Lake City next month? Members of the CNCF End User Technical Advisory Board (TAB) pulled together their top talk recommendations, with insights into why they chose them 🙂 Worth a look!
\\n\\n\\n\\nPractical Supply Chain Security: Implementing SLSA Compliance from Build to Runtime – Enguerrand Allamel, Ledger
\\n\\n\\n\\nWhy Mario is interested:
\\n\\n\\n\\nModern software depends increasingly on open-source packages and libraries. Attacks on the supply chain have a huge impact and can give attackers extensive authorizations in your own system. Therefore, it’s becoming increasingly important that open-source projects increase their supply chain security and that we, as users, adopt the same patterns in our build and release processes.
\\n\\n\\n\\nThe Policy Engines Showdown – Gabriel L. Manor, Permit.io; Andres Aguiar, Okta; Omri Gazitt, Aserto; Anders Eknert, Styra; Sarah Cecchetti, AWS
\\n\\n\\n\\nWhy Mario is interested:
\\n\\n\\n\\nBuilding platforms still requires an increasing number of security policies, best practices, or other governing rules to reach a common internal standard. Enforcing and maintaining these kinds of policies works best if they are treated as code. It will be super interesting to see how the maintainers of all these different frameworks argue their cases, to hopefully get a rough idea of which engine best fits my needs.
\\n\\n\\n\\nWhy Joseph is interested:
\\n\\n\\n\\nI’m particularly excited about The Maintainer Monologues because it dives into project maintainers’ real, behind-the-scenes journeys within the CNCF ecosystem. Discussing how maintainers are “made”—through hard work, trial and error, and navigating a complex landscape—feels deeply relatable. Maintainers are vital to our community, and hearing their stories is important.
\\n\\n\\n\\nThe Node Tetris Rabbit Hole: Why Your Binpacking Might Be Underperforming – Hannah Taub (Adobe)
\\n\\n\\n\\nWhy Joseph is interested:
\\n\\n\\n\\nOkay, yes, this is a shameless plug for my colleague on the Adobe Ethos team, but I’m genuinely excited about The Node Tetris Rabbit Hole. If you’ve ever wondered why your autoscaling isn’t maximizing node capacity, this talk dives into real-world strategies for improving bin packing efficiency at scale. It’s a must-watch for anyone dealing with Kubernetes resource optimization.
\\n\\n\\n\\nBest of Both Worlds: Integrating Slurm with Kubernetes in a Kubernetes Native Way – Eduardo Arango Gutierrez, NVIDIA & Angel Beltre, Sandia National Laboratories
\\n\\n\\n\\nWhy Ricardo is interested:
\\n\\n\\n\\nThe world is thirsty for GPUs. While cloud providers increase their offerings, a large number of old and new HPC centers are being built or upgraded with the latest generations of these accelerators, but they remain mostly closed off to the cloud native world. Bridging the gap between cloud and scientific computing infrastructure will open up new doors for end users requiring this type of hardware.
\\n\\n\\n\\nOrchestrating Quasi-Real Time Data Processing in the Computing Farm of the ATLAS Experiment at CERN – Giuseppe Avolio, CERN
\\n\\n\\n\\nWhy Ricardo is interested:
\\n\\n\\n\\nScaling problems appear in different areas and with different sizes. But when scaling clusters to 5000 nodes and ensuring 25000 pods are up and running in less than a minute, things get entertaining. Add 5TB/s of data processing plus a multi-billion dollar scientific experiment and it’s guaranteed this is an awesome story to tell your cloud native friends.
\\n\\n\\n\\nOptimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes – David Gray, Red Hat
\\n\\n\\n\\nWhy Mike is interested:
\\n\\n\\n\\nAs generative AI and large language models (LLMs) become more integral to business applications, managing the compute-intensive nature of LLM inference is crucial. David’s talk introduces the KServe platform for deploying LLMs on Kubernetes, focusing on efficient use of GPU hardware to manage costs and power usage.
\\n\\n\\n\\nAutomated Multi-Cloud Blue-Green Cluster Rotations: Zero Downtime Upgrades at Scale – Sourav Khandelwal, Databricks
\\n\\n\\n\\nWhy Mike is interested:
\\n\\n\\n\\nI look forward to hearing Sourav present Databricks’ system for managing cluster rotations across an extensive fleet of over a thousand cloud-managed Kubernetes clusters on AWS, Azure, and GCP. There is no shortage of complexity executing blue-green cluster rotations or cluster swaps at scale. I am excited to learn more about the methodologies employed and challenges they faced to identify the significant benefits of automating multi-cloud Kubernetes upgrades at scale.
\\n\\n\\n\\nOpenTelemetry and Prometheus sessions
\\n\\n\\n\\nI am excited to see the progress and new features that have been added to the OTEL landscape. As Prometheus continues to mature it is great to see how user pain points have been addressed as well as new features that will provide business value inside of Boeing.
\\n\\n\\n\\nLastly, I like to find out things I wasn’t aware of (but should have been). There is just so much going on, it’s always a challenge to keep up.
\\n\\n\\n\\nEvolving Reddit’s Infrastructure via Principled Platform Abstractions – Karan Thukral & Harvey Xia, Reddit
\\n\\n\\n\\nWhy Henrik is interested:
\\n\\n\\n\\nKubernetes is table stakes, but complexity still looms large. At Intuit, like many others, we’re focused on simplifying that complexity by abstracting infrastructure, enabling developers to focus on innovation, not operations. The target is self-service and AI-driven automation, avoiding the need for large infrastructure teams while keeping things efficient and focused on the developers’ needs. Reddit’s journey in tackling these challenges is sure to resonate, offering valuable insights for anyone facing—or about to face—similar hurdles, and I’m excited to see how they’ve tackled it.
\\n\\n\\n\\nSpace Age GitOps: Lifting off with Argo Promotions (Live Demo!) – Michael Crenshaw & Zach Aller, Intuit
\\n\\n\\n\\nWhy Henrik is interested:
\\n\\n\\n\\nGitOps has cemented itself as the process to follow when developing cloud native applications, but as we learn, adapt and evolve, so must our technologies and tools. Led by two Argo CD maintainers from Intuit, this talk addresses some of our challenges and how we envision the future of Argo CD in our GitOps tool chain and will offer valuable insights for other GitOps practitioners and Argo CD users.
\\n\\n\\n\\n(and live demos are always cool!)
\\n\\n\\n\\nPlatform Engineering for Software Developers and Architects – Daniel Bryant, Syntasso
\\n\\n\\n\\nWhy Henrik is interested:
\\n\\n\\n\\nThe best saved for almost last… Platform engineering is all about creating the optimal platform that truly empowers developers, but it goes beyond just delivering cool features. It’s about setting the right expectations on how the platform will drive developer productivity and providing ways to measure that impact effectively. Equally important is how the platform, architecture, and development process align to maximize efficiency. This session promises to cover all of that, and I’m really excited to see how Daniel will address these challenges.
\\n\\nCo-chairs: Bill Mulligan and Vlad Ungureanu
November 12, 2024
Salt Lake City, Utah
Cilium + eBPF Day will offer a deep dive into how Cilium and eBPF are revolutionizing networking, security, and observability for cloud native environments. From real-world case studies like SamsungAds live-migrating production clusters from Calico to Cilium, to how eBay is using Cilium for network observability, the day is packed with practical insights. We’ve got a stellar lineup of experts, and you’ll leave with a clear understanding of why eBPF is at the center of cloud native innovation.
\\n\\n\\n\\nCilium + eBPF Day started as CiliumCon in Amsterdam 2023 and this is the fourth time we are hosting it. Cilium + eBPF Day has been steadily growing since, with each edition bringing in more real-world use cases from end users and deeper technical dives. We’ve had successful events in both North America and Europe, with the goal of showcasing how eBPF and Cilium are driving innovation in cloud native networking, observability, and security.
\\n\\n\\n\\nIf you’re a platform engineer, Kubernetes operator, cloud architect, or security expert, this event is made for you. You’ll benefit most if you’re hands-on with building and securing cloud native infrastructure. Expect to walk away with actionable knowledge on everything from scaling network policy beyond clusters to accelerating IPSec with eBPF. Seriously, this is the place to be if you’re thinking about or already working with Cilium, eBPF, or modern cloud native platforms.
\\n\\n\\n\\nYou’ll see more practical, real-world use cases of Cilium and eBPF in production environments, including live migrations and multi-cloud implementations. We’re also going deeper into the kernel with talks on eBPF internals and XDP for boosting throughput. Oh, and let’s not forget the panel on cloud native security use cases with eBPF—this year is about hands-on, real-world applications that you can take back to your team.
\\n\\n\\n\\nWe’ve structured the day into key sections. We’ll start with user stories and dive into talks like Confluent’s journey to Cilium in a multi-cloud setup and eBay’s use of Cilium for network observability. You’ll also get technical deep dives on kernel programming and security policies, with experts like Liz Rice breaking down the magic of eBPF. After lunch, it’s a mix of panel discussions, hands-on case studies like SamsungAds’ use of Cilium, and lightning talks that cover everything from avoiding configuration gotchas to Cilium at the edge. We’ll wrap it all up with closing remarks, but the conversation will keep going long after!
\\n\\n\\n\\nIf you’ve got some knowledge of Kubernetes, Linux networking, or eBPF, you’ll be in a great spot to follow along with the talks. That said, if you’re new to eBPF, check out the eBPF documentary we created to get the full backstory on why this tech is a game-changer. And yeah, I’m biased, but my children’s guide to eBPF, Buzzing Across Space, is a fun way to get a light intro too.
\\n\\n\\n\\nI’m excited to see the community come together again. Cilium + eBPF Day is always packed with smart people who are pushing the limits of cloud native infrastructure. Whether you’re here to learn, share what you know, or network with some of the brightest minds in the space, it’s going to be a great warm up for KubeCon. I can’t wait to see how people are putting these technologies to work in the wild.
\\n\\n\\n\\nSubmitted by Bill Mulligan, who is looking forward to the people of KubeCon because every time he leaves inspired by the energy, innovation, and collaboration in the cloud native community.
Don’t forget to register for KubeCon + CloudNativeCon North America 2024.
Member post by Sameer Danave, Senior Director of Marketing at MSys Technologies
\\n\\n\\n\\nDid you know that half of global storage capacity will be deployed as Software-Defined Storage (SDS)? It is a remarkable number! And it’s not just enterprises moving towards it; small companies are making this shift too. So, what does SDS mean in practical terms, what are the IT benefits, and how is it transforming hybrid-cloud architectures?
\\n\\n\\n\\nWell, the answer to these questions came to me when I recently met 20 IT professionals at the 2024 Conference and fourth annual Storage Technology Showcase. These interactions gave me some insightful information about software-defined storage and how it transforms hybrid cloud architectures in organizations worldwide. And that’s what we’ll discuss today.
\\n\\n\\n\\nSoftware-Defined Storage (SDS) encompasses various interpretations depending on the vendor. However, Wikipedia provides a concise definition: “the abstraction of storage software from the underlying hardware, along with the provision of a unified management platform and data services across heterogeneous or homogeneous enterprise storage assets.”
\\n\\n\\n\\nIn simpler terms, SDS separates software and hardware, enabling efficiencies in cost and performance. For instance, organizations can leverage cost-effective industry-standard servers instead of expensive proprietary storage solutions. This decoupling empowers IT departments to accomplish several significant objectives.
\\n\\n\\n\\nHere’s how SDS is transforming hybrid cloud architectures, and the IT benefits of software-defined storage that are leading organizations to make this shift.
\\n\\n\\n\\nConsider the example of a natural disaster: Hurricane Sandy, which forced thousands of IT companies to close due to disruption of their operations. This is an extreme example, but it demonstrates the disastrous effects natural disasters can have on IT infrastructure. So, what can businesses do to achieve continuous operations even during such incidents and to develop disaster recovery solutions?
\\n\\n\\n\\nWell, the answer is SDS solutions hosted in the cloud. Such cloud solutions can provide a separate disaster recovery region without the need to invest in the creation, installation, and maintenance of those DR regions. Data is replicated to the cloud infrastructure from on-prem systems, and resources can be automatically scaled up on demand.
\\n\\n\\n\\nSuch solutions also help with off-site cloud backups and provide a viable option for cold storage that is kept off-site and scales automatically without any additional investment in hardware. The SDS management plane also enables an administrator to implement these critical DR functions from a single interface.
\\n\\n\\n\\nCompanies need to move data among storage systems without disruption for many reasons, such as lease expirations, performance optimization, tiering to place the correct data on the proper hardware, and technology or vendor changes. With hybrid cloud storage solutions built on the latest technologies and capabilities, companies can move data sets between cloud providers and on-premises resources almost instantaneously to improve their data economics. In other words, companies can manage data across on-premises, public cloud, hybrid-cloud, and multi-cloud environments from a single pane of glass.
\\n\\n\\n\\nSoftware-defined storage solutions come with enterprise-grade features like provisioning, deduplication, and data compression, enhancing storage efficiency and reducing costs. SDS storage solutions also abstract the underlying storage in cloud environments, allowing for the provisioning of cloud storage and infrastructure similar to on-premises settings. This capability is valuable when leveraging storage tiers within cloud environments, such as between Amazon EBS and Amazon S3 storage in AWS or between Azure disks and Azure Blob storage.
\\n\\n\\n\\nThe cloud SDS solution automates the tiering process, further reducing storage costs. Additionally, it provides technical agility across cloud, mobile, social, and analytics infrastructures.
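\\n\\n\\n\\nAs a loose illustration (not from the original article) of how storage tiers can be expressed declaratively in a Kubernetes environment, the sketch below defines a hot and a cold tier using the AWS EBS CSI driver; the class names are placeholders, and a real SDS platform would typically layer its own policy engine on top:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hot-tier                  # placeholder name for latency-sensitive data
provisioner: ebs.csi.aws.com      # AWS EBS CSI driver
parameters:
  type: gp3                       # general-purpose SSD
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cold-tier                 # placeholder name for infrequently accessed data
provisioner: ebs.csi.aws.com
parameters:
  type: sc1                       # cold HDD volume type

Workloads then simply request a class by name in their volume claims, and the tiering decision stays a configuration detail rather than an application concern.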
\\n\\n\\n\\nAfter meeting with 20 IT professionals, I learned how software-defined storage solutions are transforming hybrid cloud architectures. Moreover, I learned about adopting best practices, including leveraging SDS solutions for heterogeneous IT environments, which assist organizations in surmounting the challenges associated with storage management.
\\n\\n\\n\\nTo address these data management challenges effectively, we at MSys Technologies offer cutting-edge SDS solutions to help companies with data storage management. We provide automated and on-demand storage management services to organizations across the globe.
\\n\\n\\n\\nConnect with us to learn how MSys can revolutionize your data storage and cloud management practices.
\\n\\n\\n\\nAuthor
\\n\\n\\n\\nSameer Danave, Senior Director of Marketing, MSys Technologies
\\n\\n\\n\\nBio
\\n\\n\\n\\nSameer is a seasoned technology marketing professional with 16 years of full-stack marketing experience. He believes in the two Cs—‘Customer Value’ and ‘Communications’—and all his marketing campaigns and projects are packaged with them.
\\n\\n\\n\\nHe drives phygital (physical + digital) campaigns that attract and pull customers towards the brand’s value. His marketing strategies apply omnichannel, conversational marketing tactics (Storytelling, social, and chatbot), and AI-enabled inbound marketing, backed by solid analytics and insights with ‘content’ as a core part of the strategy.
Sameer is a team player with meticulous planning, attention to detail, and the ability to perform effortlessly under pressure.
Community post by Adam Korczynski, Ada Logics, and Jan Dubois, Lima maintainer
\\n\\n\\n\\nLima, a CNCF sandbox project for launching virtual machines with automatic file sharing and port forwarding, recently completed a fuzzing audit. As part of the audit, Lima was integrated into Google’s OSS-Fuzz project and added fuzz coverage of various packages in the Lima code base. OSS-Fuzz is a service by Google that runs critical open source projects’ fuzzers continuously and with large amounts of compute power. OSS-Fuzz also handles the infrastructure of building the fuzzers against integrated projects’ latest source tree, and it reports crashes to maintainers and automatically marks them as fixed when it can no longer reproduce them. As a result, OSS-Fuzz will regularly build Lima’s fuzzers against its latest source code and run them with excess compute. If bugs make it into Lima’s code base – either directly or via a dependency – OSS-Fuzz can catch them before they make it into a release.
\\n\\n\\n\\nThe build files for Lima’s OSS-Fuzz integration and Lima’s fuzz tests live in Lima’s own repository, so that the project itself can manage the build and add new fuzz tests in the future.
\\n\\n\\n\\nAda Logics, who carried out the audit, added fuzz coverage for multiple packages in Lima. The fuzzers found several issues in third-party libraries, including crashes in dependencies that handle YAML parsing. Coincidentally, these crashes sparked a discussion about why Lima imports three different libraries to process YAML. The fuzzers also found crashes in image conversion routines in Lima’s own underlying image processing library.
\\n\\n\\n\\nLima has opened a public tracker for the crashes that have come out of the audit here. In addition, Ada Logics has published a report about the audit, which can be found here.
\\n\\n\\n\\nWith the completion of its fuzzing audit, Lima joins many other CNCF projects that have integrated fuzzing into their testing efforts and are integrated into OSS-Fuzz. Other notable projects include Kubernetes, containerd, Helm, Dapr, Envoy, Vitess and Linkerd. The CNCF has led adoption efforts across its ecosystem, which has led to the discovery of both bugs and vulnerabilities in sandbox and graduated projects.
\\n\\nWe’re very excited to announce the Keynote Speakers and Daily Themes for KubeCon + CloudNativeCon North America 2024 in Salt Lake City, November 12-15. If you haven’t registered yet, it’s not too late.
\\n\\n\\n\\nGet ready to deep dive into the cloud native universe, where industry experts will share insights on the hottest topics shaping the future. Each day will be dedicated to a unique theme, ensuring a rich experience.
\\n\\n\\n\\nKicking off keynotes on Wednesday, November 13, the spotlight will be on groundbreaking advancements in artificial intelligence and platform engineering, with industry leaders from NVIDIA, CoreWeave, Capital One, CERN, Google, Lunar, Intel, and more.
\\n\\n\\n\\nThursday’s keynotes will center on security, one of the most critical aspects of the cloud native ecosystem. Expect to hear from experts from Microsoft, Broadcom, Red Hat, NYU, and Purdue University.
\\n\\n\\n\\nFinally, Friday is all about the community and celebrating KuberTENes with thoughtful presentations from industry leaders representing Adobe, Heroku, Google, Solo.io, Red Hat, and more. Each will share their visions for the future of cloud native.
Whether you want to level up your knowledge or connect with like-minded professionals, these sessions will inspire and empower you. Explore the entire schedule, and keep reading for a detailed keynote description.
Multicluster Batch Jobs Dispatching with Kueue at CERN
\\n\\n\\n\\nRicardo Rocha, Lead Platforms Infrastructure, CERN & Marcin Wielgus, Staff Software Engineer, Google
\\n\\n\\n\\nWith the skyrocketing demand for GPUs and problems with obtaining the hardware in requested quantities in desired locations, the need for multicluster batch jobs is stronger than ever. During this talk we will show how you can automatically find the needed capacity across multiple clusters, regions or clouds, dispatch the jobs there and monitor their status. We will discuss the setup in fixed-size on-prem environments, fully autoscaled clusters running on clouds, and mixed, hybrid environments. In the end we will present what a recent effort for a multi-cluster setup looks like at CERN, do a quick (but impressive) demo, and share the lessons learned during the deployment.
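\\n\\n\\n\\nAs a rough sketch of the submission side of the model described above (assuming Kueue is installed and a LocalQueue named gpu-queue exists; the image and resource values are placeholders), a batch Job only needs to reference the queue and start suspended, and Kueue admits it once quota is available:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue      # assumed LocalQueue name
spec:
  suspend: true                               # Kueue unsuspends the Job when it is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: example.com/trainer:latest   # hypothetical training image
          resources:
            requests:
              nvidia.com/gpu: "1"             # one GPU per pod

Where that admitted capacity actually lives, across clusters, regions, or clouds, is what the dispatching machinery covered in the talk takes care of.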
\\n\\n\\n\\nOpen Source-powered Cloud Native Creates the Future (sponsored)
\\n\\n\\n\\nDirk-Peter van Leeuwen, Chief Executive Officer, SUSE
\\n\\n\\n\\nWhen Kubernetes first started out, it was a place of potential, and early adopters jumped all over it because of its openness. Now, more than 10 years later – an eternity in software – the cloud native space approaches adulthood, where early adopters still abound, but large regulated industries now depend on its technologies, and everyone in between wants to find their place in this world. Along the way, edge solutions, AI, virtualization, and other technologies started building on top of them. SUSE has always been at the heart of the open source community, and CEO DP van Leeuwen sees that commitment only getting deeper in the future. SUSE developers have made significant contributions for decades, and DP will explain why he believes so passionately in open, community-based standards based on Kubernetes for the new innovations in AI and the edge.
\\n\\n\\n\\nTake a Peek Under the Hood of Cloud-Native AI at Scale
\\n\\n\\n\\nChen Goldberg, Senior Vice President of Engineering, CoreWeave & Peter Salanki, Chief Technology Officer, CoreWeave
\\n\\n\\n\\nTraining large-scale foundation models on Kubernetes brings a new set of challenges compared to traditional workloads. With tens of thousands of interconnected GPUs, even small hardware failures can lead to significant performance bottlenecks. This talk will dive into real-world lessons learned while building Kubernetes clusters at scale, including tackling hardware failures, optimizing GPU scheduling, and improving observability. We’ll also explore how CNCF projects and Kubernetes provide the best platform for managing the complex infrastructure required for generative AI, making it easier to monitor and maintain AI workloads with the right observability tools. Attendees will walk away with actionable insights into how to navigate these challenges and build robust, scalable systems for training foundation models.
\\n\\n\\n\\nBuild Intelligent Applications with an Open, Flexible, and Data Driven AI Platform (sponsored)
\\n\\n\\n\\nSudha Raghavan, Senior Vice President of Oracle Cloud Development Platform, Oracle
\\n\\n\\n\\nIn today’s technology landscape, defined by rapid advancements in AI and data technologies, both openness and flexibility are imperatives for driving innovation. Cloud-native technologies empower developers to build and run massively scalable AI services and applications on a flexible platform, where they can leverage familiar open-source tools, data technologies, and AI models of their choice. Join Sudha Raghavan, SVP of the Oracle Cloud Development Platform, as she shares key insights, best practices, and the tangible benefits of using open source technologies to build AI platforms and cloud native applications at Oracle, deployed at scale across several cloud environments.
\\n\\n\\n\\nPaving the Way for AI Through Platform Engineering
\\n\\n\\n\\nKasper Borg Nissen, Lead Platform Architect, Lunar
\\n\\n\\n\\nIn today’s ever-evolving world, organizations struggle to integrate AI efficiently. Is AI a department, a team, or a tool for a select few? No, it should be woven into the fabric of the business. Drawing from the principles of platform engineering in cloud native ecosystems, we explore how organizations can reuse experiences and investments in cloud native and platform engineering to democratize AI, enabling every team to access its potential. Real-world examples, like Lunar, demonstrate how businesses can implement AI without needing complex infrastructure and harness the power of AI through simplified platforms and evolutionary architectures.
\\n\\n\\n\\nThe Future of GenAI: Cloud Native Blueprints with OPEA (sponsored)
\\n\\n\\n\\nEzequiel Lanza, AI Open Source Evangelist, Intel
\\n\\n\\n\\nEnterprises are eager to adopt generative AI to boost productivity. The Open Platform for Enterprise AI (OPEA), a Linux Foundation project, offers a framework of composable microservices for creating advanced GenAI systems, including LLMs, data stores, and prompt engines. OPEA provides blueprints for popular workflows, such as ChatQnA, CodeGen, and RAG systems, all designed to simplify deployment of cloud native architectures. Featuring a user-friendly pipeline definition language for Kubernetes, this keynote will cover how to get started running GenAI applications on a K8s cluster utilizing a microservices architecture for component flexibility.
\\n\\n\\n\\nNVIDIA Case Study: The Many Facets of Building + Delivering AI in the Cloud Native Ecosystem
\\n\\n\\n\\nChris Lamb, Vice President, Computing Software Platforms, NVIDIA Corporation
\\n\\n\\n\\nNVIDIA provides accelerated computing, networking infrastructure and platforms at the vanguard of today’s most rapidly evolving fields (artificial intelligence, robotics, autonomous systems, virtual world simulation, etc.), while also operating cloud services and a global gaming network and powering the development effort to create the next wave – all of which are powered by the Cloud Native Ecosystem and enabled by CNCF community software in some form! This keynote provides a glimpse into how we use, derive from, and contribute to a wide variety of CNCF projects to build, deliver, and advance technology’s most rapidly advancing fields. We’ll also offer our perspective on the state of the ecosystem, call attention to remaining challenges, and help inspire further community collaboration to power the AI revolution onward!
\\n\\n\\n\\nEngineering the Future of Generative AI Platforms on Kubernetes
\\n\\n\\n\\nAparna Sinha, SVP, Head of AI Product, Capital One
\\n\\n\\n\\nThe evolution of machine learning and AI has raised the stakes on how technology organizations extend their enterprise platforms to deploy generative AI applications at scale. In this keynote, Capital One’s SVP, Head of AI Product – and early contributor to Kubernetes at Google – Aparna Sinha, will share considerations and principles for leveraging Kubernetes and other open source technologies to create open enterprise platforms that power generative AI applications. Attendees will walk away with actionable insights and best practices on areas including: extending existing machine learning platforms to create modern generative AI platforms; new platform layers that need to be built on top of Kubernetes when evolving machine learning platforms for generative AI; in-house vs. third-party build and deployment components and decisions; risk and governance implications and considerations; and articulating platform requirements to enable co-creation and contribution from user communities.
\\n\\n\\n\\nAbove the Clouds: Mountainous Achievements with End Users
\\n\\n\\n\\nTaylor Dolezal, Head of Ecosystem, CNCF
\\n\\n\\n\\nGet ready, innovators! We’re embarking on an exciting journey through the CNCF ecosystem. We’ll discover valuable insights in the End User stream, explore new areas with our End User TAB progress, and examine the thriving hubs of our ever-changing ecosystem. This talk will remind you that in the vast expanse of cloud native technology, our strength lies not in isolated efforts, but in the community we build together.
\\n\\n\\n\\nA Developer’s Guide to Securing Your Software Supply Chain (sponsored)
\\n\\n\\n\\nToddy Mladenov, Principal Product Manager, Microsoft
\\n\\n\\n\\nContainer images, AI weights, WebAssembly modules, and software packages – what’s the link? They are all examples of some of the many artifacts found throughout a software supply chain. With so many different artifacts, the real question becomes, “Is your software supply chain as secure as your production environment?” In this keynote, we will navigate the journey of these artifacts from source to production, and showcase how to secure your software at each step of the supply chain using cloud native open-source tooling. With the help of key CNCF projects like in-toto, Notary Project, Ratify, and Copa, you will learn how to ensure your software is secure, consistent, and reliably delivered to production.
\\n\\n\\n\\nCloud Native’s Next Decade: Stable, Secure, and…Ready for Disruption?
\\n\\n\\n\\nNikhita Raghunath, Principal Engineer, Broadcom
\\n\\n\\n\\nIn the cloud native world, we’ve come a long way – after a decade of driving Kubernetes adoption, it feels like we’ve hit a milestone. With the pace of new Kubernetes features slowing down, some might think that we’ve reached a “steady state”. But have we really? While the ecosystem feels more stable and secure than ever, the next decade holds challenges that demand our attention, especially in the realm of security. Sure, we all know security is vital, and there’s been a lot of fantastic work across the board, from OSS initiatives to innovative startups. But the threat landscape is shifting fast. As we peer into the future, it’s not just about refining what we already know. It’s about tackling new challenges that are emerging on the horizon, like securing AI systems. Some may say it’s too early — that AI security is just hype. But think again. AI introduces complexities we haven’t seen before, offering both new vulnerabilities and fresh opportunities for defense. Are we ready to face the coming wave of security threats that could reshape the digital landscape? This keynote will dive into what’s next in cloud native security, while showing why this “boring” phase is just the calm before a new storm of innovation and challenges that will shape the next decade.
\\n\\n\\n\\nApplication Development’s Great Cloud Native Disruption (sponsored)
\\n\\n\\n\\nColin Walters, Senior Principal Software Engineer, Red Hat & Preethi Thomas, Senior Manager, Engineering, Red Hat
\\n\\n\\n\\nArtificial Intelligence is exposing technological and operational gaps in our industry faster than ever. Newer workloads are forcing application developers to innovate in ways that open source is uniquely positioned to help guide. This talk will discuss the current state of open source technologies and how application developers can collectively harness and contribute to transparent, open innovation and guide the next generation of cloud native development.
\\n\\n\\n\\nOpen Source Security Is Not A Spectator Sport
\\n\\n\\n\\nJustin Cappos, Professor, NYU & Santiago Torres Arias, Assistant Professor, Purdue University
\\n\\n\\n\\nThe CNCF has been a trailblazer in resilient open source software security by enabling innovation, coordination and community building. We will highlight some of the efforts and resources provided by TAG Security including security assessments for CNCF projects, one of the first supply chain security recommendations, A Reference Architecture to Securing the Software Supply Chain, and the Cloud Native Security Whitepaper. We’ve done this all by fostering an open and welcoming community of security professionals. Come and join our community and help us improve cloud-native security for all!
\\n\\n\\n\\nHonoring the Past to Forge Ahead
\\n\\n\\n\\nBob Wise, CEO, Heroku
\\n\\n\\n\\nTwelve-Factor was published by Heroku founder Adam Wiggins over a decade ago and has served as a guiding principle for many software engineers and tech founders of SaaS companies. In that time, cloud native and Kubernetes have fundamentally transformed technology. As we look to the next decade of technology innovation and the millions of apps we’ll build and run – how durable are these Twelve Factors? In this talk, Bob Wise, CEO of Heroku, will reflect on the journey of the Twelve Factors and Heroku’s journey as a platform designed around them, what that means for the future, and announce the open sourcing of Twelve-Factor as a community project, inviting participation in the ongoing refreshing and revisiting of the Twelve Factors to guide us through the next decades.
\\n\\n\\n\\nKubernetes in the Second Decade: Balancing Innovation with Stability (sponsored)
Jago Macleod, Engineering Director, Kubernetes & GKE, Google
Kubernetes has come a long way since Google introduced the project to the world a decade ago. We are so proud of what Kubernetes has become. But like the rest of the technology landscape, the explosive growth of Batch, ML, and GenAI workloads alongside existing, often business-critical, workloads creates new challenges for end users running on Kubernetes. In previous technology cycles, the most likely outcome would be the rise of a new platform. However, due to the declarative API, the modular and extensible nature of the platform, and the strength of the ecosystem, Kubernetes can evolve in ways that were more difficult for previous platforms. Google is significantly re-focusing and increasing our already industry-leading investment in Kubernetes to refactor, extend, and even re-invent significant aspects of Kubernetes to meet the needs of the next decade. This is the fundamental challenge: innovation to address new opportunities without disruption to the installed base.
\\n\\n\\n\\nFive Cloud Native Technology Areas to Watch For
\\n\\n\\n\\nLin Sun, Head of Open Source, solo.io & Karena Angell, Chief Architect, Red Hat
\\n\\n\\n\\nAs part of the CNCF Technical Oversight Committee (TOC), we’ve been busy reviewing projects that are coming into the CNCF, promoting projects from various stages (sandbox, incubation and graduation) and identifying areas of greater collaboration. Along with the reviews, we are innovating the technical review processes, and collecting end user feedback. We would like to take you through five key cloud native technology areas to watch for in the next few years, reflecting what we see in the cloud native ecosystem.
\\n\\n\\n\\nRethinking Kubernetes Connectivity (sponsored)
\\n\\n\\n\\nIdit Levine, Founder and CEO, Solo.io
\\n\\n\\n\\nCloud traffic management is broken. Any time you put a request on the network you need to secure, control, and observe its behavior. Today, ingress traffic is treated differently than east west traffic. Outgoing requests to services like GenAI, LLMs and SaaS are treated differently. The technology used to solve these challenges is outdated, inconsistent, overlapping, and does not fit our platform engineering principles of automation. New innovations in the cloud-networking space allow us to re-imagine the solutions to these problems. In this keynote, we’ll discuss the future of API connectivity and cloud networking that goes beyond Kubernetes.
\\n\\n\\n\\nKubernetes Family Feud: A Decade of Architecture and Evolution
\\n\\n\\n\\nRags Srinivas and Tim Hockin
\\n\\n\\n\\nKubernetes is 10 years old! While its growth has been remarkable, it has also accrued technical debt, familiar to long-standing members, and there are still many issues that remain as top concerns. In this fun Family Feud-style format, team leads and a moderator will engage experienced community members in discussing these architectural and design issues. By polling the community ahead of time, we aim to surface the most pressing, or at least bothersome, concerns. Two teams, each headed by a captain on stage, will compete in the typical Family Feud style. After attending this session, participants will gain awareness of early design and architectural decisions and compromises, understanding which have been fixed and how to work around others. Along the way, attendees might enjoy a hearty laugh or two, appreciating the collective wisdom of the community members polled.
\\n\\n\\n\\nWe know you’re as excited as we are to hear what these experts have to share in Salt Lake City. Register today and start planning your days at KubeCon + CloudNativeCon North America 2024.
\\n\\nHeading to KubeCon a bit early, or planning on staying around for the weekend? The options for outdoor fun are endless, even if it’s not quite ski season. From winter hiking to snowshoeing, bobsledding, winter camping or even soaking in a geothermal cave, there are so many options, including skiing if the weather cooperates. Here’s everything you need to know.
\\n\\n\\n\\nIf body-temperature water in an underground cave with the chance to swim, soak, stand-up paddleboard or even scuba sounds good, look no further than Utah’s Homestead Crater. Located about an hour from Salt Lake City, this geothermal hot spring is an unexpected, but totally relaxing, adventure inside a 55-foot dome made of limestone. Make reservations in advance and be prepared to shed stress in a very unique setting.
\\n\\n\\n\\nTechnically the ski season around Salt Lake City doesn’t start until later in November, but because the weather often has a mind of its own we wanted to include the many, many snow-filled adventures available. In an hour or less, you’ll have your choice of 11 different ski resorts offering what the locals call “the greatest snow on earth.” Skiers, snowboarders, snowmobilers, cross-country skiers, and those wanting to try helicopter-powered skiing or boarding all have an abundance of choices.
\\n\\n\\n\\nFor indoor and outdoor chilly fun, the Salt Lake area has places where you can lace up and have some fun. The Gallivan Center, an outdoor rink near downtown, opens up when KubeCon does on November 12. Or head indoors to The Steiner/SLC Sports Complex with two Olympic-size skating rinks.
\\n\\n\\n\\nBundle up with winter trail gear and get out there for five “bucket list” hikes that are made better by cold weather, i.e., fewer people, sweeping high-contrast views, and an incredible feeling of self-satisfaction that you got out and did it despite the weather.
\\n\\n\\n\\nEven if there isn’t enough snow for good skiing, it could easily be possible to snowshoe. Here’s a list of snowshoe-friendly trails, ranging from easy to difficult.
\\n\\n\\n\\nAgain, it might not be prime powder, but it could still be a fun sledding day. Find 6 tubing parks and countless sledding hills in the Salt Lake area and beyond.
\\n\\n\\n\\nWho needs the perfect powder when you can experience a BOBSLED? Located in nearby Park City (about half an hour from Salt Lake), the Utah Olympic Park offers visitors the opportunity to experience a 60MPH bobsled ride under the guidance of a professional driver. This opens up to the public again on November 8, so please make reservations in advance and know there are some age/weight minimums…but the cool factor cannot be topped! Also, there are guided tours of the Olympic Park and other offerings including a ropes course.
\\n\\n\\n\\nFor those planning a longer stay in Utah, a number of the state parks have yurt camping or “glamping” available. Obviously reservations need to be made in advance, and amenities vary. But just imagine the views!
\\n\\n\\n\\nIf it’s really cold, but *not* snowy, ice fishing might be a possibility. Experts suggest the fishing is best early in the season, but of course they also recommend the ice be at least 4 inches thick. Enthusiasts can explore more here.
\\n\\nMember post originally published on Redpill Linpro’s blog by Torbjørn Gjøn
\\n\\n\\n\\nRead more here or contact us for a cloud chat through our contact form.
\\n\\nMember post by Chelsio Communications
\\n\\n\\n\\nAs Kubernetes continues transforming the cloud-native infrastructure, high-performance networking has become essential for maintaining seamless operations in containerized applications. Chelsio T6 Unified Wire Adapters provide advanced networking capabilities that meet the demands of modern Kubernetes deployments, especially for handling data-intensive workloads. With features like SR-IOV, hardware offloading, and acceleration for versatile protocols such as iWARP, NVMe/TCP, and NVMe-oF, T6 adapters allow Kubernetes clusters to scale efficiently while ensuring optimal performance.
\\n\\n\\n\\nIn this blog post, we explore the integration of Chelsio T6 adapters into Kubernetes environments using CNI plugins, highlighting how their flexibility improves networking across a wide range of applications.
\\n\\n\\n\\nChelsio T6 adapters support Single Root I/O Virtualization (SR-IOV), allowing multiple Virtual Functions (VFs) to be exposed to Kubernetes pods and enabling direct I/O access to the network hardware. This delivers low-latency, high-throughput networking, which is essential for Kubernetes clusters handling critical workloads. By offloading network tasks to hardware, T6 adapters reduce the load on the host CPU, allowing more resources to be available for containerized applications, which enhances their performance, particularly when managing large volumes of data.
\\n\\n\\n\\nWith SR-IOV, T6 adapters enable Kubernetes deployments to scale efficiently without compromising performance, making them the ideal choice for demanding high-performance workloads such as AI, analytics, and machine learning.
\\n\\n\\n\\nChelsio T6 adapters are fully compatible with Kubernetes’ Container Networking Interface (CNI) plugins, which manage pod networking across clusters. In the T6 testing environment, Flannel and Multus are used as CNI plugins. This demonstrates the seamless integration of T6 adapters into Kubernetes for managing Layer 3 IPv4 networking and multiple networks per pod.
\\n\\n\\n\\nThe SR-IOV CNI plugin and the SR-IOV Network Device Plugin are used to attach Chelsio Virtual Functions (VFs) directly to pods. This integration allows Kubernetes administrators to effectively manage high-performance networking while maintaining flexibility in their network configurations. With these plugins, T6 adapters can deliver near-line-rate performance across multiple Kubernetes nodes, ensuring efficient communication between the pods.
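\\n\\n\\n\\nAs a rough illustration of that wiring, the sketch below shows how a Chelsio VF might be attached to a pod through Multus and the SR-IOV CNI plugin. The NetworkAttachmentDefinition name, the chelsio.com/t6_vf resource name, and the pod image are placeholders: the real resource name is whatever your SR-IOV Network Device Plugin ConfigMap defines, and the IPAM block is trimmed for brevity.
\\n\\n\\n\\napiVersion: k8s.cni.cncf.io/v1\\nkind: NetworkAttachmentDefinition\\nmetadata:\\n  name: chelsio-sriov-net  # illustrative name\\n  annotations:\\n    k8s.v1.cni.cncf.io/resourceName: chelsio.com/t6_vf  # must match the device plugin resource name\\nspec:\\n  config: '{ \\"cniVersion\\": \\"0.3.1\\", \\"type\\": \\"sriov\\", \\"ipam\\": { \\"type\\": \\"host-local\\", \\"subnet\\": \\"192.168.10.0/24\\" } }'\\n---\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: sriov-test-pod\\n  annotations:\\n    k8s.v1.cni.cncf.io/networks: chelsio-sriov-net  # attach the secondary SR-IOV network via Multus\\nspec:\\n  containers:\\n    - name: app\\n      image: my-workload:latest  # illustrative image\\n      resources:\\n        limits:\\n          chelsio.com/t6_vf: '1'  # one VF handed directly to the pod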
\\n\\n\\n\\nChelsio T6 adapters support hardware offloading for multiple networking protocols, such as iWARP, NVMe/TCP, and NVMe-oF (NVMe over Fabrics). This flexibility makes the T6 adapters ideal for Kubernetes clusters with diverse networking requirements.
\\n\\n\\n\\nNVMe/TCP and NVMe-oF provide high-performance data transfer capabilities for storage-intensive applications, making T6 adapters ideal for environments that demand quick and reliable access to storage resources. T6 adapters also support TLS/SSL and IPsec offloading, improving the security of data transfers while lowering CPU usage during encryption and decryption. These offloading features make T6 adapters beneficial in environments where secure and high-speed data transfers are essential.
\\n\\n\\n\\nFor workloads that require significant data throughput, such as AI and machine learning applications, Chelsio T6 adapters provide the necessary performance. During testing, T6 adapters achieved 98 Gbps line-rate throughput while utilizing only 35% of CPU resources. This leaves ample CPU headroom for additional container deployments, making T6 adapters ideal for real-time and data-driven Kubernetes environments.
\\n\\n\\n\\nIn addition to improving network performance, the offloading capabilities of T6 adapters significantly reduce CPU utilization, enabling Kubernetes clusters to scale without requiring expensive CPU upgrades. This makes the T6 adapters an ideal solution for data centers handling high-throughput applications or seeking to consolidate networking, storage, and computing onto a unified platform.
\\n\\n\\n\\nIn the test environment described in the technical report, CRI-O served as the container runtime, while Flannel provided Layer 3 IPv4 networking between nodes. Multus was used to configure multiple networks per pod, enabling more advanced network configurations. Chelsio T6 VFs were allocated to pods using the SR-IOV CNI plugin, achieving near-line-rate performance across a 3-node Kubernetes cluster.
\\n\\n\\n\\nThis setup illustrates how easily Kubernetes administrators can deploy T6 adapters in real-world environments, utilizing SR-IOV to enhance performance and efficiency.
\\n\\n\\n\\nChelsio T6 adapters provide cutting-edge networking solutions for Kubernetes, utilizing SR-IOV and CNI plugins, such as Flannel and Multus, to achieve high-throughput and low-latency connectivity. An integration of protocol offloading and acceleration, including NVMe/TCP, NVMe-oF, iWARP, TLS/SSL, and IPsec, makes T6 adapters an excellent choice for Kubernetes environments running high-performance and data-intensive applications.
\\n\\n\\n\\nChelsio T6 adapters, with their seamless scalability and support for multiple networking protocols, empower Kubernetes clusters to effortlessly manage modern workloads.
\\n\\n\\n\\nTo learn more, explore the full technical report on T6 Kubernetes performance here or visit the Chelsio Communications website to discover how T6 adapters can enhance your Kubernetes deployments.
\\n\\nBy TAG Environmental Sustainability
\\n\\n\\n\\nGet ready for the CNCF Cloud Native Sustainability Week 2024, which will take place from October 7th to 13th, 2024. This global event, organized by the CNCF Technical Advisory Group for Environmental Sustainability (TAG ENV), aims to bring together communities worldwide to discuss, learn, and contribute to a more sustainable cloud-native future.
\\n\\n\\n\\nWith in-person meetups across 20+ cities, including Berlin, Tokyo, and Aarhus, and a Global Virtual Mini-Conference on October 8th, 2024, this week promises to deliver rich discussions on topics like green software and optimizing IT infrastructures to reduce environmental impact. The online conference will bring together experts from across the globe, with sessions streamed live on YouTube, so no matter where you are, you can participate and engage with the latest innovations in sustainable tech.
\\n\\n\\n\\nThese meetups provide a platform for tech enthusiasts, developers, and sustainability advocates to connect, share insights, and collaborate on cloud-native solutions that reduce energy consumption and optimize IT infrastructure. Local challenges and global perspectives will be explored, all with the goal of making the tech industry more sustainable.
\\n\\n\\n\\nThroughout the week, you can also expect a series of YouTube livestreams featuring discussions and insights from sustainability leaders in cloud-native technology. The recordings will be available afterward, ensuring anyone can catch up on the key takeaways.
\\n\\n\\n\\nWhether you’re a seasoned professional in the tech world, a sustainability enthusiast, or someone just starting your journey, there’s something for everyone at Cloud Native Sustainability Week 2024. This event is an opportunity to learn, contribute, and be part of a global movement dedicated to making the technology sector more environmentally friendly. You can explore the full event details, including how to participate in both local meetups and the virtual mini-conference, on the official pages here.
\\n\\n\\n\\nSustainability Week offers a unique opportunity to:
\\n\\n\\n\\nBy attending or hosting a session, you can help drive the conversation forward, sharing ideas and best practices that will shape the future of sustainable cloud-native infrastructure. Let’s work together to make a meaningful impact and create a greener, more responsible tech ecosystem!
\\n\\n\\n\\nJoin us in making tech more sustainable, one innovation at a time!
\\n\\n\\n\\nEngage in Environmental Sustainability: let’s discuss, consider joining us, and take action for change!
\\n\\nThis week’s Kubestronaut in Orbit, Phong Nguyen Van, is a full-stack software engineer in Ho Chi Minh, Vietnam with over 7 years of experience and a passion for cloud technologies and Kubernetes. Phong also holds 5 AWS certifications along with his 5 K8s certifications and is in the top 3% of Stack Overflow users.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Phong, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nMy first project was creating and supporting Kubernetes clusters in production environments for a multi-tenant LMS built as microservices.
\\n\\n\\n\\nI primarily use Kubernetes.
\\n\\n\\n\\nThe certs helped me get an in-depth understanding of Kubernetes architecture and security.
\\n\\n\\n\\nTo prepare, I used Udemy (check out the CNCF endorsed content), and content on Cloud Guru and KodeKloud.
\\n\\n\\n\\nI like to study new technologies and spend time with my family – my wife Nhu Dang and my daughter Khanh Vy.
\\n\\n\\n\\nMake sure you have a deep understanding of the fundamentals and once you have that, you can move on to the higher and more complicated levels. CNCF recommends exploring the CNCF Kubernetes project to get started.
\\n\\n\\n\\nAbsolutely, I am thinking about observability certs like:
\\n\\n\\n\\nWe’re excited to share the updated etcd Project Journey Report! etcd is one of CNCF’s longest-standing graduated projects. We initially looked at the project’s growth back in 2021, and are happy to see continued growth in innovation from old and new contributors and end users.
\\n\\n\\n\\netcd is a key-value store for distributed systems that provides a way for applications of any complexity — from simple web applications to large software platforms — to store and access critical data. It is the primary datastore for CNCF’s highest open source velocity project, Kubernetes.
\\n\\n\\n\\netcd was accepted into the CNCF Incubator in 2018, where it remained for two years before graduating in November 2020. It has demonstrated steady and impressive growth since joining CNCF in 2018.
\\n\\n\\n\\nSome of the highlights of the report include:
\\n\\n\\n\\nHave a look at the full updated etcd Project Journey Report to learn more about these accomplishments in more detail.
\\n\\nMember post originally published on the Middleware blog by Keval Bhogayata
\\n\\n\\n\\nIn distributed applications with complex, resource-intensive microservices—each of which generates a mountain of telemetry data—collecting and managing telemetry within your application can be cumbersome and inefficient. It may also lead to high CPU consumption and latency.
\\n\\n\\n\\nAs a Software Development Engineer (SDE), you can tackle this telemetry overload by assigning the job to the OpenTelemetry Collector.
\\n\\n\\n\\nRead on to understand what the OpenTelemetry Collector is, how it works, its benefits, and its deployment methods.
\\n\\n\\n\\nOpenTelemetry (OTel) is an open-source project that provides a set of APIs, libraries, SDKs, and tools for instrumenting applications and generating, collecting and exporting telemetry data (like metrics, traces and logs) to backend systems. These backend systems can include observability platforms, logging systems, and tracing tools.
\\n\\n\\n\\nOTel spawned out of the OpenTracing and OpenCensus merge with an aim to provide universal observability instrumentation for software apps while creating a unified, vendor-neutral standard for telemetry gathering, processing and collection.
\\n\\n\\n\\nThe OpenTelemetry Collector processes telemetry data from instrumented applications, acting as a vendor-agnostic hub within OTel. It eliminates the necessity for multiple agents, supporting various open-source formats and exporting data to diverse backends within the ecosystem.
\\n\\n\\n\\nThe Collector receives, processes and exports telemetry through one or more pipelines. Each pipeline contains receivers, processors, exporters, and connectors. The Collector filters, aggregates and transforms within the pipeline.
\\n\\n\\n\\nLet’s take a look at the collector’s architecture to see how it works:
\\n\\n\\n\\nInstrumentation is the first step. Instrumented apps are software applications integrated with the OpenTelemetry instrumentation library to generate telemetry data. SDEs can instrument their app manually or automatically.
\\n\\n\\n\\nManual instrumentation involves adding observability code to the app, then using the OTel SDK to initialize the OTel Collector and the API to instrument the app code. Automatic instrumentation involves attaching a language-specific agent to the application. The OTel agent used depends on the app’s programming language.
\\n\\n\\n\\nThis is the first component of the OTel Collector. A Collector Pipeline must have at least one configured receiver to be valid. Receivers are responsible for ingesting telemetry data from various sources into the OTel Collector.
\\n\\n\\n\\nThey accept telemetry in preconfigured formats, translate it into formats that the OTel Collector can understand, and then send it to processors.
\\n\\n\\n\\nThe OpenTelemetry Collector supports over 40 types of receivers, including the OTLP Receiver, which collects telemetry data using the OpenTelemetry Protocol (OTLP), Zipkin, Jaeger, and Prometheus.
\\n\\n\\n\\nProcessors are optional OpenTelemetry Collector Pipeline components used to perform specific tasks on telemetry data.
\\n\\n\\n\\nWhere available, telemetry is sent from the receivers to processors. A processor can aggregate metrics, enrich trace data, or apply custom filtering rules.
\\n\\n\\n\\nProcessors can be chained together to create a processing pipeline and can be used, for example, to remove users’ personally identifiable information (PII) from the collected data for regulatory compliance.
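\\n\\n\\n\\nAs a hedged sketch of what such a step might look like, an attributes processor can scrub example PII fields before export; the attribute keys below are placeholders for whatever fields your telemetry actually carries.
\\n\\n\\n\\nprocessors:\\n  attributes/scrub-pii:\\n    actions:\\n      - key: user.email  # example attribute, deleted outright\\n        action: delete\\n      - key: http.url  # example attribute, keep only a hash\\n        action: hash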
\\n\\n\\n\\nThe OTel Collector supports several processors, three of which are:
\\n\\n\\n\\nProcessed telemetry data is sent to exporters, which transmit it to the desired backends on a pull or push basis.
\\n\\n\\n\\nSDEs can configure different exporters for various backends. Each exporter converts telemetry from the OTLP format to the preconfigured backend-compatible format.
\\n\\n\\n\\nThe OpenTelemetry Collector supports various exporter plugins and allows SDEs to customize their data based on their needs and requirements.
\\n\\n\\n\\nConnectors function as bridges between components of the telemetry pipelines, allowing each component to communicate with the next.
\\n\\n\\n\\nConnectors serve as both exporters and receivers: they can summarize high volumes of telemetry from instrumented apps for faster identification of app behavior issues, or consume telemetry in one format (e.g., traces, acting as an exporter for one pipeline) and produce it in a different format (e.g., metrics, acting as a receiver for the next pipeline) as required.
\\n\\n\\n\\nConnectors may also be used to control telemetry routing or replicate telemetry across different pipelines.
\\n\\n\\n\\nSimply put, once apps are instrumented using tools and frameworks from the instrumentation libraries, they generate telemetry data. The OTel Collector then receives this data (via the receiver), performs any necessary transformations or filtering (with the processor), and exports it to various backends.
\\n\\n\\n\\nThe connector links each stage and converts data from one telemetry type to another where necessary.
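\\n\\n\\n\\nTo make that flow concrete, here is a minimal Collector configuration sketch that wires a single trace pipeline together; the backend endpoint is a placeholder, and real deployments usually add more receivers, processors, and exporters.
\\n\\n\\n\\nreceivers:\\n  otlp:\\n    protocols:\\n      grpc:\\n        endpoint: 0.0.0.0:4317\\nprocessors:\\n  batch:  # group telemetry before export\\nexporters:\\n  otlp:\\n    endpoint: backend.example.com:4317  # placeholder backend\\nservice:\\n  pipelines:\\n    traces:\\n      receivers: [otlp]\\n      processors: [batch]\\n      exporters: [otlp]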
\\n\\n\\n\\nHere are four important reasons to use the OpenTelemetry Collector in your observability environment.
\\n\\n\\n\\nThe OpenTelemetry Collector allows you to separate the instrumentation logic from your applications. This simplifies observability and eliminates the complexity of configuring individual instrumentation points.
\\n\\n\\n\\nOnce you instrument your applications with the OpenTelemetry SDK, the Collector handles data collection, transformation, and exporting. This allows you to focus on analyzing and gaining insights from the collected data without worrying about the intricacies of data collection and export processes.
\\n\\n\\n\\nThe OpenTelemetry Collector supports various data formats and backends, enabling you to switch between observability platforms or tools without modifying your applications’ instrumentation. This interoperability and compatibility feature makes it easier to integrate with your existing systems or change platforms in the future.
\\n\\n\\n\\nThe Collector optimizes the telemetry data by filtering or aggregating it before exporting it, thereby reducing telemetry traffic, making it easier to identify and resolve performance issues and improving the overall performance of your monitoring infrastructure.
\\n\\n\\n\\nThe OpenTelemetry Collector offers a pluggable architecture that allows you to customize and extend its functionalities, such as adding or developing your own exporters, processors, or plugins as required. This flexibility makes it easy for you to adapt the Collector to different use cases.
\\n\\n\\n\\nOpenTelemetry Collector can be used in various scenarios. Listed below are some examples:
\\n\\n\\n\\nIn a distributed application, the OTel Collector can be used as a centralized component to collect telemetry data from all microservices and export telemetry data to the desired backends.
\\n\\n\\n\\nWhen working with applications that use different programming languages, the OpenTelemetry Collector provides a consistent instrumentation approach. It can be used to standardize telemetry data collection across various languages to enable seamless observability across the entire application stack.
\\n\\n\\n\\nIf you have legacy systems that lack telemetry capability, the OpenTelemetry Collector can retroactively add observability without requiring significant modifications to the existing codebase.
\\n\\n\\n\\nIf you need to perform data aggregation, filtering, masking or transformation, the Collector’s flexible processing capabilities allow you to tailor the telemetry data to your specific needs.
\\n\\n\\n\\nIf you are building an application that needs to work with different observability platforms or tools, using the OTel Collector ensures vendor neutrality.
\\n\\n\\n\\nAt the recent Open Source Summit North America, Chris Featherstone and Shubhanshu Surana stated that they use OTel Collector to track massive amounts of observability data their company collects, including metrics—330 million unique series a day, span data of 3.6 terabytes a day, and log data of over 1 petabyte a day.
\\n\\n\\n\\nAdobe implemented OpenTelemetry Collector in 2020, starting with trace ingestion and expanding to include metrics in 2021, with plans for log data integration.
\\n\\n\\n\\nInstrumentation played a pivotal role in Adobe’s strategy, with a focus on auto-instrumentation, primarily in Java, using OpenTelemetry libraries. The team used custom extensions and processors, managing configurations through GitOps.
\\n\\n\\n\\nThe dynamic nature of the OTel collector, which extends data to multiple destinations, earned it the moniker “Swiss Army knife of observability” in Adobe’s toolkit.
\\n\\n\\n\\nAdobe boosted operational efficiency through configuration management using Git, alongside the adoption of OpenTelemetry Operator Helm charts for infrastructure use cases.
\\n\\n\\n\\nAuto-instrumentation using OpenTelemetry Operator significantly streamlined the process, allowing engineers to instrument services automatically and marking a notable enhancement in developer productivity.
\\n\\n\\n\\nFurthermore, Adobe implemented data management and enrichment processes to handle the vast amounts of observability data. Reduction and custom processors in the OpenTelemetry Collector facilitated the enrichment and elimination of sensitive information.
\\n\\n\\n\\nAdobe employed a unique service registry to prevent service name collisions, ensuring each service’s unique identification in the tracing backend.
\\n\\n\\n\\nIn terms of data distribution, the OTel Collector proved instrumental in sending data to multiple export destinations, simplifying the process for engineering teams accustomed to different processes and libraries.
\\n\\n\\n\\nLooking forward, Adobe outlined several key initiatives. Firstly, a focus on improving data quality involves eliminating unnecessary data and implementing rules to limit spans at the edge.
\\n\\n\\n\\nSecondly, rate limiting spans at the edge aims to manage the substantial data volume efficiently. Adobe envisions a shift towards trace-first troubleshooting to accelerate issue resolution within its intricate service ecosystem.
\\n\\n\\n\\nAs Adobe explores further integration, including OpenTelemetry logging libraries with core application libraries, running OTel collectors as sidecars, and building trace sampling extensions at the edge, the OpenTelemetry Collector remains a crucial component in their evolving observability practices.
\\n\\n\\n\\nThe OpenTelemetry Collector can be deployed using any of the following three methods.
\\n\\n\\n\\nIn a standalone deployment, the OpenTelemetry Collector is deployed as a separate process consisting of one or more instances of the Collector.
\\n\\n\\n\\nDistribute the load among all available collector instances using a load balancer to ensure reliable telemetry collection in standalone deployments. Standalone deployment allows SDEs to allocate dedicated resources specifically for the Collector.
\\n\\n\\n\\nIt also allows for the independent management of the collector, making it easier to update or scale as per requirement. However, this method prevents the Collector from having full visibility into the state of the system. Standalone Collectors can be run on any host or in a containerized environment, such as Docker or Kubernetes.
\\n\\n\\n\\nDeploy the Collector as a standalone process by using the Docker image or binary distribution provided on the official OpenTelemetry website. To run the Collector with a YAML configuration file, a Docker Compose definition like the following can be used:
\\n\\n\\n\\nversion: '3.8'\\n\\nservices:\\n otel-collector:\\n image: otel/opentelemetry-collector\\n volumes:\\n - ./config.yaml:/etc/otel/config.yaml\\n ports:\\n - 4317:4317
\\n\\n\\n\\nIn a sidecar deployment, the OpenTelemetry Collector is deployed alongside your application as a separate container or process.
\\n\\n\\n\\nEach microservice in your app has its own Collector instance, which the application reaches over localhost. The Collector collects telemetry data directly from the application and exports it to the desired backends.
\\n\\n\\n\\nAn advantage of this method is the Collector’s improved visibility into the state of the system.
\\n\\n\\n\\nWith a sidecar deployment, SDEs can encapsulate the telemetry collection logic within the Collector, reducing the complexity of their application code.
\\n\\n\\n\\nIt also provides isolation between the telemetry collection and the application, making it easier to add or remove instrumentation without modifying the application itself.
\\n\\n\\n\\nConfigure the sidecar deployment using the following code:
\\n\\n\\n\\nversion: '3.8'\\n\\nservices:\\n my-app:\\n image: my-app-image:latest\\n ports:\\n - 8080:8080\\n otel-collector:\\n image: otel/opentelemetry-collector\\n volumes:\\n - ./config.yaml:/etc/otel/config.yaml\\n depends_on:\\n - my-app
\\n\\n\\n\\nThe OpenTelemetry Collector is embedded within an existing monitoring agent or agent framework in an agent deployment. This deployment method provides a unified solution for monitoring and telemetry data collection.
\\n\\n\\n\\nThe agent utilizes existing infrastructure and monitoring tools to collect and export telemetry data. This approach consolidates data collection and reduces the complexity of managing separate components for monitoring and telemetry collection.
\\n\\n\\n\\nConfigure the collector by extending the agent’s configuration file or utilizing specific configuration properties provided by the agent framework.
\\n\\n\\n\\nHere’s a sample configuration file for the agent with an embedded OpenTelemetry Collector.
\\n\\n\\n\\nagent:\\n collectors:\\n telemetry:\\n otel:\\n config:\\n receivers:\\n otlp:\\n protocols:\\n grpc:\\n exporters:\\n jaeger:\\n endpoint: http://jaeger:14268/api/traces\\n processors:\\n batch:\\n timeout: 1s
\\n\\n\\n\\nHere is a sample application code for an agent deployment with an embedded OpenTelemetry Collector using Python:
\\n\\n\\n\\n# Note: the OTLP exporter import path may vary with your opentelemetry-exporter-otlp version,\\n# and 'agent' below is a placeholder for the agent framework's own Python module.\\nfrom opentelemetry import trace\\nfrom opentelemetry.exporter.otlp.trace_exporter import OTLPSpanExporter\\nfrom opentelemetry.sdk.trace import TracerProvider\\nfrom opentelemetry.sdk.trace.export import BatchSpanProcessor\\n\\n# Initialize the agent with an embedded OpenTelemetry Collector\\nagent_config = {\\n # Specify the agent configuration\\n \\"config\\": \\"\\"\\n}\\n\\nagent.init_agent(agent_config)\\n\\n# Configure the OpenTelemetry SDK\\nprovider = TracerProvider()\\ntrace.set_tracer_provider(provider)\\n\\n# Configure the OTLP exporter\\nexporter = OTLPSpanExporter(endpoint=\\"http://localhost:4317\\")\\nspan_processor = BatchSpanProcessor(exporter)\\nprovider.add_span_processor(span_processor)\\n\\n# Your application code here
\\n\\n\\n\\nEnsure you have your `config.yaml` file properly configured and adjust paths, image names, and ports based on your specific setup.
\\n\\n\\n\\nAgent deployments are secure, flexible and efficient for telemetry collection. Agents can be configured to operate within set boundaries to avoid security breaches.
\\n\\n\\n\\nThey can be configured to work with other tools or customized as required. Since they are lightweight and do not require multiple Collector instances, they consume minimal system resources.
\\n\\n\\n\\nMiddleware is an AI-powered monitoring and observability platform that allows developers to visualize and analyze collected telemetry data.
\\n\\n\\n\\n1. Once you have logged in, you will start off on the unified view dashboard:
\\n\\n\\n\\n2. Navigate to the installation page in the bottom left corner to install the Middleware agent (MW Agent) on your host infrastructure and begin to explore your data.
\\n\\n\\n\\nThe MW Agent is a lightweight software that runs on your host infrastructure, collecting and sending events and metrics to Middleware. It collects real-time data at the system and process levels to provide detailed insights into host performance and behavior.
\\n\\n\\n\\n3. Copy and run the installation command. If you’re using docker, run the command example below. Copying the command directly from the installation page ensures your API key and UID are accurately inputted.
\\n\\n\\n\\n\\n\\n\\n\\n\\nCheck our documentation for installation commands for other environments such as Windows, Linux, etc.
\\n
MW_API_KEY= MW_TARGET=https://.middleware.io:443 bash -c \\"$(curl -L https://install.middleware.io/scripts/docker-install.sh)\\"
\\n\\n\\n\\n4. Verify the status of the MW Agent with the following command.
\\n\\n\\n\\ndocker ps -a --filter ancestor=ghcr.io/middleware-labs/mw-host-agent:master
\\n\\n\\n\\nA successful installation returns the status ‘Up’ or ‘Exited’. If the installation is unsuccessful, the status will be blank.
\\n\\n\\n\\nThe OpenTelemetry Ingestion API has two endpoints: metrics and logs. For both endpoints, the resource type attribute groups the ingested data under the specified label on Middleware dashboards and reports.
\\n\\n\\n\\nPOST https://<UID>.middleware.io/v1/metrics
\\n\\n\\n\\ncurl -X POST \\"https://demo.middleware.io/v1/metrics\\" \\\\\\n-H \\"Accept: application/json\\" \\\\\\n-H \\"Content-type: application/json\\" \\\\\\n-d @- << EOF\\n{\\n \\"resource_metrics\\": [\\n {\\n \\"resource\\": {\\n \\"attributes\\": [\\n {\\n \\"key\\": \\"mw.account_key\\",\\n \\"value\\": {\\n \\"string_value\\": \\"xxxxxxxxxx\\"\\n }\\n },\\n {\\n \\"key\\": \\"mw.resource_type\\",\\n \\"value\\": {\\n \\"string_value\\": \\"custom\\"\\n }\\n }\\n ]\\n },\\n \\"scope_metrics\\": [\\n {\\n \\"metrics\\": [\\n {\\n \\"name\\": \\"swap-usage\\",\\n \\"description\\": \\"SWAP Usage\\",\\n \\"unit\\": \\"bytes\\",\\n \\"gauge\\": {\\n \\"data_points\\": [\\n {\\n \\"attributes\\": [\\n {\\n \\"key\\": \\"device\\",\\n \\"value\\": {\\n \\"string_value\\": \\"nvme0n1p4\\"\\n }\\n }\\n ],\\n \\"start_time_unix_nano\\": 1673435153000000000,\\n \\"time_unix_nano\\": 1673435153000000000,\\n \\"asInt\\": 400500678\\n }\\n ]\\n }\\n }\\n ]\\n }\\n ]\\n }\\n ]\\n}\\nEOF
\\n\\n\\n\\n5. There are two components for metrics: metadata and datapoints. The metadata fields are attributes that define the metric and determine how it will appear in Middleware. The datapoint fields are defined within the data attribute and are consistent across all data attribute types, which can be Gauges, Sums, Histograms, or Summaries.
\\n\\n\\n\\n6. The Logs endpoint lets you send custom logs to the Middleware backend. To send custom logs to Middleware, POST to the following endpoint.
\\n\\n\\n\\nPOST https://<UID>.middleware.io:443/v1/logs
\\n\\n\\n\\n7. Navigate to OpenTelemetry Logs to view the following example of a curl request sending custom logs to Middleware.
\\n\\n\\n\\n\\ncurl -X POST \\"https://demo.middleware.io:443/v1/logs\\" \\\\\\n-H \\"Accept: application/json\\" \\\\\\n-H \\"Content-type: application/json\\" \\\\\\n-d @- << EOF\\n{\\n \\"resource_logs\\": [\\n {\\n \\"resource\\": {\\n \\"attributes\\": [\\n {\\n \\"key\\": \\"mw.account_key\\",\\n \\"value\\": {\\n \\"string_value\\": \\"xxxxxxxxxx\\"\\n }\\n },\\n {\\n \\"key\\": \\"mw.resource_type\\",\\n \\"value\\": {\\n \\"string_value\\": \\"custom\\"\\n }\\n },\\n {\\n \\"key\\": \\"service_name\\",\\n \\"value\\": {\\n \\"string_value\\": \\"nginx-123\\"\\n }\\n }\\n ]\\n },\\n \\"scope_logs\\": [\\n {\\n \\"log_records\\": [\\n {\\n \\"severity_text\\": \\"WARN\\",\\n \\"body\\": {\\n \\"string_value\\": \\"upstream server not accepting request\\"\\n },\\n \\"severity_number\\": \\"11\\",\\n \\"attributes\\": [\\n {\\n \\"key\\": \\"server\\",\\n \\"value\\": {\\n \\"string_value\\": \\"nginx\\"\\n }\\n }\\n ],\\n \\"time_unix_nano\\": \\"1694030143000\\"\\n }\\n ]\\n }\\n ]\\n }\\n ]\\n}\\nEOF \\n
\\n\\n\\n\\n8. Navigate to the Middleware app’s homepage and scroll to “view dashboard.” Click on it to ensure metrics, traces, and logs are appearing in the Unified Dashboard.
\\n\\n\\n\\n9. Click on Unified Dashboard:
\\n\\n\\n\\n10. You’ll see your metrics, logs, and traces on the Middleware Unified Dashboard with end-to-end visibility.
\\n\\n\\n\\nThe OTel Collector simplifies and harmonizes the SDE’s observability responsibilities. Middleware further improves the Collector’s efficiency with its unified end-to-end view of telemetry data in a single user-friendly dashboard.
\\n\\n\\n\\nThe AI-based platform allows you to customize telemetry gathering and visualization, ensuring that the telemetry you collect serves your specific use case.
\\n\\n\\n\\nMiddleware regularly updates its agent with new, exciting features for improved observability into app performance. Try Middleware for free!
\\n\\nProject post by Volcano maintainers
\\n\\n\\n\\nOn September 19, 2024 (UTC+8), the Volcano community officially released version 1.10.0, introducing the following new features:
\\n\\n\\n\\nIn traditional big data processing scenarios, users can directly set queue priorities to control the scheduling order of jobs. To ease the migration from Hadoop/Yarn to cloud-native platforms, Volcano supports setting priorities at the queue level, reducing migration costs for big data users while enhancing user experience and resource utilization efficiency.
\\n\\n\\n\\nQueues are a fundamental resource in Volcano, each with its own priority. By default, a queue’s priority is determined by its share value, which is calculated by dividing the resources allocated to the queue by its total capacity. This is done automatically, with no manual configuration needed. The smaller the share value, the fewer resources the queue has, making it less saturated and more likely to receive resources first. Thus, queues with smaller share values have higher priority, ensuring fairness in resource allocation.
\\n\\n\\n\\nIn production environments—especially in big data scenarios—users often prefer to manually set queue priorities to have a clearer understanding of the order in which queues are scheduled. Since the share value is dynamic and changes in real time as resources are allocated, Volcano introduces a priority field to allow users to set queue priorities more intuitively. The higher the priority, the higher the queue’s standing. High-priority queues receive resources first, while low-priority queues have their jobs reclaimed earlier when resources need to be recycled.
Queue Priority Definition:
\\n\\n\\n\\ntype QueueSpec struct {\\n...\\n // Priority define the priority of queue. Higher values are prioritized for scheduling and considered later during reclamation.\\n // +optional\\n Priority int32 `json:\\"priority,omitempty\\" protobuf:\\"bytes,10,opt,name=priority\\"`\\n}
\\n\\n\\n\\nTo ensure compatibility with the share mechanism, Volcano also considers the share value when calculating queue priorities. By default, if a user has not set a specific queue priority or if priorities are equal, Volcano will fall back to comparing share values. In this case, the queue with the smaller share has higher priority. Users have the flexibility to choose between different priority strategies based on their specific needs—either by using the priority or the share method.
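\\n\\n\\n\\nFor example, a queue with an explicitly raised priority can be declared as follows; the queue name and priority value are illustrative.
\\n\\n\\n\\napiVersion: scheduling.volcano.sh/v1beta1\\nkind: Queue\\nmetadata:\\n  name: high-priority-queue  # illustrative name\\nspec:\\n  reclaimable: true\\n  priority: 100  # higher value: scheduled earlier, reclaimed later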
\\n\\n\\n\\nFor queue priority design doc, please refer to: Queue priority
\\n\\n\\n\\nVolcano introduced the elastic queue capacity scheduling feature in version v1.9, allowing users to directly set the capacity for each resource dimension within a queue. This feature also supports elastic scheduling based on the deserved value, enabling more fine-grained resource sharing and recycling across queues.
\\n\\n\\n\\nFor detailed design information on elastic queue capacity scheduling, refer to the Capacity Scheduling Design Document.
\\n\\n\\n\\nFor a step-by-step guide on using the capacity plugin, see the Capacity Plugin User Guide.
\\n\\n\\n\\nConfigure each dimension deserved resource samples for the queue:
\\n\\n\\n\\napiVersion: scheduling.volcano.sh/v1beta1\\nkind: Queue\\nmetadata:\\n name: demo-queue\\nspec:\\n reclaimable: true\\n deserved: # set the deserved field.\\n cpu: 64\\n memory: 128Gi\\n nvidia.com/a100: 40\\n nvidia.com/v100: 80
\\n\\n\\n\\nIn version v1.10, Volcano extends its support to include reporting different types of GPU resources within elastic queue capacities. NVIDIA’s default Device Plugin does not distinguish between GPU models, instead reporting all resources uniformly as nvidia.com/gpu. This limits AI training and inference tasks from selecting specific GPU models, such as A100 or T4, based on their particular needs. To address this, Volcano now supports reporting distinct GPU models at the Device Plugin level, working with the capacity plugin to enable more precise GPU resource sharing and recycling.
\\n\\n\\n\\nFor instructions on using the Device Plugin to report various GPU models, please refer to the GPU Resource Naming Guide.
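\\n\\n\\n\\nOnce distinct models are advertised, a workload can request a specific GPU type in the usual Kubernetes way. The sketch below assumes the Device Plugin has been configured to expose nvidia.com/a100 as described in the guide; the pod name and image are illustrative.
\\n\\n\\n\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: a100-training  # illustrative name\\nspec:\\n  schedulerName: volcano\\n  containers:\\n    - name: trainer\\n      image: training-image:latest  # illustrative image\\n      resources:\\n        limits:\\n          nvidia.com/a100: 1  # request a specific GPU model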
Note:
\\n\\n\\n\\nIn version v1.10.0, the capacity plugin is the default for queue management. Note that the capacity and proportion plugins are incompatible, so after upgrading to v1.10.0, you must set the deserved field for queues to ensure proper functionality.
For detailed instructions, please refer to the Capacity Plugin User Guide.
\\n\\n\\n\\nThe capacity plugin allocates cluster resources based on the deserved value set by the user, while the proportion plugin dynamically allocates resources according to queue weight. Users can select either the capacity or proportion plugin for queue management based on their specific needs.
For more details on the proportion plugin, please visit: Proportion Plugin.
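\\n\\n\\n\\nAs a simplified sketch of what that choice looks like in practice, the volcano-scheduler configuration enables one of the two plugins in its tiers; the plugin list below is illustrative, and your deployment’s configuration may differ.
\\n\\n\\n\\nactions: \\"enqueue, allocate, backfill, reclaim\\"\\ntiers:\\n  - plugins:\\n      - name: priority\\n      - name: gang\\n      - name: conformance\\n  - plugins:\\n      - name: drf\\n      - name: predicates\\n      - name: capacity  # use either capacity or proportion, not both\\n      - name: nodeorder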
\\n\\n\\n\\nOnce a Pod is created, it is considered ready for scheduling, and kube-scheduler tries its best to find a suitable node for every pending Pod. In reality, however, some Pods may remain in a “lack of necessary resources” state for a long time. These Pods interfere with the decision-making and operation of the scheduler (and downstream components such as Cluster Autoscaler) in unnecessary ways, causing problems such as resource waste. Pod Scheduling Readiness is a kube-scheduler feature that became stable (GA) in Kubernetes v1.30; it controls when a Pod becomes eligible for scheduling through the Pod’s schedulingGates field.
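\\n\\n\\n\\nA Pod opts into this behavior simply by listing one or more gates; the scheduler ignores the Pod until every gate has been removed. In the minimal sketch below, the gate name and image are illustrative.
\\n\\n\\n\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: gated-pod\\nspec:\\n  schedulingGates:\\n    - name: example.com/resource-check  # illustrative gate name\\n  containers:\\n    - name: app\\n      image: my-app:latest  # illustrative image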
\\n\\n\\n\\nIn previous versions, Volcano has integrated all algorithms of the K8s default scheduler, fully covering the native scheduling functions of Kube-scheduler. Therefore, Volcano can completely replace Kube-scheduler as a unified scheduler under the cloud native platform, supporting unified scheduling of microservices and AI/big data workloads. In the latest version v1.10, Volcano has introduced Pod Scheduling Readiness scheduling capability to further meet users’ scheduling needs in diverse scenarios.
\\n\\n\\n\\nFor the documentation of Pod Scheduling Readiness features, please refer to: Pod Scheduling Readiness | Kubernetes
\\n\\n\\n\\nFor the Pod Scheduling Readiness design doc of volcano, please refer to: Proposal for Support of Pod Scheduling Readiness by ykcai-daniel · Pull Request #3581 · volcano-sh/volcano (github.com)
\\n\\n\\n\\nA Sidecar container is an auxiliary container designed to support the main business container by handling tasks such as logging, monitoring, and network initialization.
\\n\\n\\n\\nPrior to Kubernetes v1.28, the concept of Sidecar containers existed only informally, with no dedicated API to distinguish them from business containers. Both types of containers were treated equally, which meant that Sidecar containers could be started after the business container and might end before it. Ideally, Sidecar containers should start before and finish after the business container to ensure complete collection of logs and monitoring data.
\\n\\n\\n\\nKubernetes v1.28 introduces formal support for Sidecar containers at the API level, implementing unified lifecycle management for init containers, Sidecar containers, and business containers. This update also adjusts how resource requests and limits are calculated for Pods, and the feature will enter Beta status in v1.29.
\\n\\n\\n\\nThe development of this feature involved extensive discussions, mainly focusing on maintaining compatibility with existing APIs and minimizing disruptive changes. Rather than introducing a new container type, Kubernetes reuses the init container type and designates Sidecar containers by setting the init container’s restartPolicy to Always. This approach addresses both API compatibility and lifecycle management issues effectively.
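\\n\\n\\n\\nIn practice, that looks like the following minimal sketch: the init container marked with restartPolicy: Always runs as a sidecar for the Pod’s whole lifetime (container names and images are illustrative).
\\n\\n\\n\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: app-with-sidecar\\nspec:\\n  initContainers:\\n    - name: log-shipper  # illustrative sidecar\\n      image: log-shipper:latest\\n      restartPolicy: Always  # marks this init container as a sidecar\\n  containers:\\n    - name: business-app\\n      image: business-app:latest  # illustrative main container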
\\n\\n\\n\\nWith this update, Pod scheduling now counts the Sidecar container’s resource requests as part of the Pod’s total requests. Consequently, the Volcano scheduler has been updated to support this new calculation method, allowing users to schedule Sidecar containers with Volcano.
\\n\\n\\n\\nFor more information on Sidecar containers, visit Sidecar Containers | Kubernetes.
\\n\\n\\n\\nvcctl is a command line tool for operating Volcano’s built-in CRD resources. It can be conveniently used to view/delete/pause/resume vcjob resources, and supports viewing/deleting/opening/closing/updating queue resources. Volcano has enhanced vcctl in the new version, adding the following features:
\\n\\n\\n\\nFor detailed guidance documents on vcctl, please refer to: vcctl Command Line Enhancement.
\\n\\n\\n\\nVolcano closely follows the pace of Kubernetes community releases and supports every major version of Kubernetes. The latest supported version is v1.30, and Volcano runs complete UT and E2E suites against it to ensure functionality and reliability.
\\n\\n\\n\\nIf you want to participate in the development of Volcano adapting to the new version of Kubernetes, please refer to: adapt-k8s-todo for community contributions.
\\n\\n\\n\\nVolcano has always attached great importance to the security of the open source software supply chain. It follows the specifications defined by OpenSSF for license compliance, security vulnerability disclosure and remediation, repository branch protection, CI checks, and more. Volcano recently added a new workflow to GitHub Actions that runs OpenSSF security checks when code is merged and updates the software security score in real time to continuously improve software security.
\\n\\n\\n\\nAt the same time, Volcano has reduced the RBAC permissions of each component, retaining only the necessary permissions, avoiding potential risks of unauthorized access and improving the security of the system.
\\n\\n\\n\\nRelated PRs:
\\n\\n\\n\\nIn large-scale scenarios, Volcano has done a lot of performance optimization work, mainly including:
\\n\\n\\n\\nThe new version of Volcano optimizes and enhances the GPU monitoring metrics, fixes inaccurate GPU monitoring, and adds node information to the GPU compute and video memory metrics, allowing users to more intuitively view each GPU’s compute capacity and the total and allocated video memory on every node.
\\n\\n\\n\\nRelated PR: Update volcano-vgpu monitoring system by archlitchi · Pull Request #3620 · volcano-sh/volcano (github.com)
\\n\\n\\n\\nVolcano has optimized the Helm chart installation and upgrade process and now supports setting more custom parameters when installing the chart, mainly including:
\\n\\n\\n\\nThe Volcano 1.10.0 release includes hundreds of contributions from 36 community contributors. Thank you for your contributions.
\\n\\n\\n\\nContributors on GitHub:
\\n\\n\\n\\n@googs1025 | @WulixuanS | @SataQiu |
---|---|---|
@guoqinwill | @lowang-bh | @shruti2522 |
@lukasboettcher | @wangyysde | @bibibox |
@Wang-Kai | @y-ykcir | @lekaf974 |
@yeahdongcn | @Monokaix | @Aakcht |
@yxxhero | @babugeet | @liuyuanchun11 |
@MichaelXcc | @william-wang | @lengrongfu |
@xieyanker | @lx1036 | @archlitchi |
@hwdef | @wangyang0616 | @microyahoo |
@snappyyouth | @harshitasao | @chenshiwei-io |
@TaiPark | @Aakcht | @ykcai-daniel |
@lekaf974 | @JesseStutler | @belo4ya |
Release note: v1.10.0
\\n\\n\\n\\nhttps://github.com/volcano-sh/volcano/releases/tag/v1.10.0
\\n\\n\\n\\nBranch: release-1.10
\\n\\n\\n\\nhttps://github.com/volcano-sh/volcano/tree/release-1.10
\\n\\n\\n\\n\\n\\nCommunity post by Saqib Jan
\\n\\n\\n\\nAs technologies become more advanced year on year, the complexity of software testing increases, too.
\\n\\n\\n\\nWhen building a testing strategy, companies typically map their operations into three segments: the people, the process, and the technology involved. And the hardest challenges companies face are often around the technology itself, like dealing with complex test environments that have various frameworks and APIs at different levels.
\\n\\n\\n\\nIt’s exceedingly important to create a realistic test environment with the right APIs to test how everything works together from start to finish. But managing a large testing infrastructure is a major hurdle for all companies – startups and enterprises alike.
\\n\\n\\n\\nThis is because every month dozens of mobile devices are launched by various vendors, along with new operating system releases. That means your potential customers could land on your website or app from anywhere, and buying all those devices and operating systems for testing is not economical, which makes testing inordinately hard.
\\n\\n\\n\\nBut engineering leaders see in TestOps an opportunity to deliver high-quality software at an accelerated pace. “By taking a TestOps approach to software testing, organizations can determine the most cost-effective strategies in the long term—removing blockers, empowering development teams to deliver better features and apps, and enabling cross-functional conversations about high availability testing practices,” says Mayank Bhola, CTO of the software testing platform LambdaTest.
\\n\\n\\n\\nYou want a testing environment that is not limited by the devices, operating system, or programming language, and can quickly allow you to create real-time machines with the exact testing environment you need. But building and maintaining this kind of infrastructure requires significant resources and bandwidth.
\\n\\n\\n\\nYou could build your own in-house device lab, but the reality is that if you buy ten devices today, a few months later, ten more will hit the market. You’ll then need to buy and maintain these new devices, along with your older ones. And even then your ambitious IT requirements will inevitably have to grapple with scaling problems, let alone changes to tooling and processes.
\\n\\n\\n\\nShay Elmualem, Principal Tech Lead at Legit Security, shares, “To prioritize test coverage, we first focus on the most commonly used browsers among our users—typically Chrome and Firefox. We use a matrix setup in GitHub Actions to run our tests across these different browsers simultaneously, which helps us cover multiple platforms in parallel without sacrificing speed.” This systematic approach helps ensure consistent cross-browser compatibility and is by practice crucial for user experience.
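\\n\\n\\n\\nA matrix of that shape looks roughly like the following GitHub Actions sketch; the browser list and the test command are illustrative rather than Legit Security’s actual pipeline.
\\n\\n\\n\\nname: cross-browser-tests\\non: [push]\\njobs:\\n  e2e:\\n    runs-on: ubuntu-latest\\n    strategy:\\n      matrix:\\n        browser: [chrome, firefox]  # run the same job once per browser\\n    steps:\\n      - uses: actions/checkout@v4\\n      - run: npm ci\\n      - run: npm test -- --browser=${{ matrix.browser }}  # illustrative test command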
\\n\\n\\n\\nEvery company with a growing product portfolio can find it increasingly difficult to keep up with the latest devices and operating systems while ensuring consistent testing across different teams. It’s exhausting, and when you try to automate at scale, things usually fail.
\\n\\n\\n\\nBhola, in our email conversation, discussed how LambdaTest empowers software testing with highly available instant infrastructure by dynamically updating testing capabilities based on real-time metrics and usage patterns. “We consistently monitor device, platform, browser, and version usage in production through a monitoring tool over a rolling window of 30 days. And based on this data, we dynamically update our testing capabilities to align with customer needs. This involves prioritizing test cases that target the most widely used configurations and eliminating outdated device configurations that are no longer in demand.”
\\n\\n\\n\\nThis ensures that testers have access to the most relevant testing environments. Bhola further enthused, “We also analyze the usage patterns of commands to identify the most frequently executed commands by customers and ensure that our test suite includes relevant test cases to cover these scenarios comprehensively.”
\\n\\n\\n\\nTo ensure testing supports high availability and high user satisfaction, targeted testing and leveraging customer telemetry are growing trends for achieving sustainable app performance. It’s not about testing everything, but about focusing on the areas that often have the most significant impact, according to Jeff Friedman, Director of Engineering at RAD Security.
\\n\\n\\n\\nBy focusing on the most critical areas of the application, teams can optimize their testing efforts and ensure that the highest-value features are delivered efficiently. “We can maintain high velocity in the dev environment with targeted tests, to ensure feature functionality, while relying on more stable, higher-level tests in the CI and pre-production environments to ensure that system-level functionality isn’t impacted by code changes. We use customer telemetry and the Pareto Principle (80-20 rule) to focus development and testing efforts on the highest-payoff configurations, using CI compatibility matrix testing to guide fast-follow compatibility fixes,” Friedman elaborated in an email interview.
\\n\\n\\n\\n“Our CI environment can test across multiple cloud and Kubernetes configurations for backend services, and different browser types and form factors for the frontend,” Friedman explained. “We standardize our testing process with uniform open source telemetry tooling and testing platforms to manage a standard set of CI checks as well.” This helps ensure a high quality level across the board, as well as a high level of consistency across applications.
\\n\\n\\n\\nHow teams prioritize which environments (devices, platforms, browsers) to focus on has remained unchanged for decades. “This is an area where I see a missed opportunity,” says Kohsuke Kawaguchi, famous for creating Jenkins, the open-source automation server. “The abundance of data,” he exposits, “combined with machine learning, should enable us to make much more intelligent, fine-grained decisions about which tests on which environments are worthwhile.”
\\n\\n\\n\\n“Infrastructure as code, device clouds, and containers are great—until they’re not. These approaches often scale testing without improving efficiency. In other words, you’re just burning money,” Kawaguchi pointed out in an email interview. “To truly enhance efficiency, teams need to do more with the same budget. Focus on selecting tests that matter and use AI to screen test failures. You can spend 10 times the money to run 10 times more tests, but if the people processing the results can’t handle 10 times the input, it’s not effective.”
\\n\\n\\n\\nEveryone wants a highly available testing environment. Speed is critical, not just for rapid development and efficient deployment, but also for testing. Realistically, speed can be both the solution and the problem when developing higher-quality software. So, you need to make sure you’re not chasing speed and exhausting resources to the point where you unintentionally compromise on quality.
\\n\\n\\n\\nMatthew Jones, a distinguished engineer and Chief Ansible Architect at Red Hat, shares his experience in managing software delivery. “Because of the nature of how we deliver software, we work really hard to control our requirements and dependencies so that we can limit our test matrix. We specifically normalize all of our deployment types so that they look and act the same no matter where we’re deploying them.”
\\n\\n\\n\\nBy emphasizing the need to control requirements, dependencies, and deployment types, Jones underscores how a streamlined approach can significantly reduce the complexity of the testing process. “This simplification,” he affirms, “allows organizations to trust their layered testing with fewer artifacts, leading to faster development and deployment cycles.”
\\n\\n\\n\\nTestOps is the most economical and best way to effectively productionize testing. “It is emerging as a critical requirement in cross-functional teams because it offers businesses an efficient way to manage testing and operations throughout their software development life cycle,” Bhola remarks. “You can solve for managing infrastructure complexities while keeping up with evolving technologies to better meet performance expectations and offer more value to meet complex user needs.”
\\n\\n\\n\\nMost interestingly, to this end, more businesses today are adopting modern TestOps methodologies and leveraging the benefits of cloud-based testing platforms that provide instant high-performance infrastructure to ensure consistency across multiple test environments—staging, QA, and production—and accelerate the delivery of high-quality, reliable applications with more efficient release cycles.
\\n\\n\\n\\nAuthor: Saqib Jan
\\n\\n\\n\\nEmail: sakimjan8@gmail.com
\\n\\n\\n\\nLinkedIn: https://linkedin.com/in/s-jan
\\n\\n\\n\\nBIO: Saqib Jan is a technology analyst with experience in application development, FinOps, and cloud technologies.
\\n\\nMember post originally published on the Netris blog
\\n\\n\\n\\nNetris version 4.3.0 has been recently released, enabling a number of functionalities for GPU-based AI cloud providers and operators. Most pieces have been designed to support the NVIDIA Spectrum-X networking platform reference guidelines and factor in best practices and field experience.
\\n\\n\\n\\nThe NVIDIA Spectrum-X networking platform, featuring NVIDIA Spectrum-4 switches and NVIDIA BlueField-3 SuperNICs, is the world’s first Ethernet fabric built for AI, accelerating generative AI network performance by 1.6X over traditional Ethernet fabrics. Spectrum-X was developed specifically for GPU-to-GPU connectivity, often referred to as east-west data center traffic.
\\n\\n\\n\\nMany private and public cloud operators have been using Netris in CPU cloud scenarios to achieve network automation, abstraction, and multi-tenancy. These deployments often include NVIDIA Spectrum switches running Cumulus Linux.
\\n\\n\\n\\nIt is challenging for a cloud services provider to develop the network automation, abstraction, and multi-tenancy software in-house and deliver typical cloud native constructs such as VPCs, Internet Gateways, NAT Gateways, Elastic IPs, and Elastic Load Balancers.
\\n\\n\\n\\nWith Netris, these functionalities are available for CPU cloud providers as well as for GPU cloud providers.
\\n\\n\\n\\nNetris switch-fabric management functionality for NVIDIA Spectrum switches is designed to automate the day-0, day-1, and day-2 phases of switch fabric operations.
\\n\\n\\n\\nDay-0:
\\n\\n\\n\\nA Netris controller initialization workflow comes as a Terraform module to help providers generate initial data for Inventory, IPAM, and network topology. The initialization module has knowledge of rail-optimized network topologies for Spectrum-X. It calculates the appropriate number of switches, IP addressing, and rail-optimized topology and creates the necessary blueprints in the Netris controller automatically based on the number of GPU servers.
\\n\\n\\n\\nNVIDIA provides clear guidelines for east-west AI fabrics with Spectrum-X, and Netris software helps users consistently adhere to these standards. However, the north-south fabric offers more flexibility. To accommodate this, the Netris initialization module accepts additional parameters, such as the number of leaf, management, and spine switches, the number of leaf-to-spine links, and various other optional settings. This approach gives users an easy and flexible way to create validated blueprints from day one of network deployment.
\\n\\n\\n\\nUsers can also make custom changes to the topology – Netris does not force topology constraints; it only makes suggestions based on the validated designs.
\\n\\n\\n\\nDay-1:
\\n\\n\\n\\nNetris introduced the Netris2Air plugin, which allows users to leverage the NVIDIA Air networking simulation tool to automatically create a digital twin of the network based on inventory, IPAM, and the topology blueprint declared in the Netris controller. This helps during the design and staging phases to evaluate the resulting network before applying it to the production hardware.
\\n\\n\\n\\nNVIDIA Base Command Manager users can also leverage the integration of BCM and Netris. The BCM built-in ZTP is capable of bootstrapping switches based on MAC addresses defined in the Netris blueprint, binding between a physical network switch and logical switch in the Netris controller, and handing over further automatic management to Netris. This model allows for the use of a single bootstrapping mechanism for both GPU servers and network switches.
\\n\\n\\n\\nNetris supports the parallel management of multiple, physically separate fabrics in a given site. This is critical for cloud providers because GPU-based AI clouds always have multiple switch fabrics.
\\n\\n\\n\\nNetris automatically generates configurations for both east-west and north-south fabrics to bring up the underlay BGP/EVPN fabrics according to the blueprint generated in the Netris controller.
\\n\\n\\n\\nThe Spectrum-X fabric, the east-west fabric for AI networking over Ethernet, requires slightly different configurations in order to enable AI-specific functionalities such as QoS, RoCE, Adaptive Routing, Congestion Control, ASIC monitoring, and others. Netris software algorithms know how to handle these AI-specific configurations and are able to distinguish the Spectrum-X fabric from the north-south fabric, automatically configuring both fabrics safely and appropriately.
\\n\\n\\n\\nMonitoring
\\n\\n\\n\\nBasic monitoring features are built in, and Netris will alert users to wiring mismatches, link status errors, or switch health issues. For more comprehensive monitoring, the NVIDIA NetQ network operations toolset can be used alongside Netris, providing deeper insights. Netris is also working on further integrating with NetQ to offer even more detailed analytics and a smoother experience.
\\n\\n\\n\\nNVIDIA InfiniBand Support
\\n\\n\\n\\nA Netris plugin for the NVIDIA UFM network management platform will be available in Netris version 4.4.0, acting as the “glue” between the Netris controller and NVIDIA UFM. This functionality is for cloud providers that use Ethernet networking as their TAN (Tenant Access Network) and NVIDIA Quantum InfiniBand as their compute network.
\\n\\n\\n\\nBottom Line
\\n\\n\\n\\nIn these examples, by acting as both the fabric manager and the source of truth for the NVIDIA Spectrum-X Ethernet fabric, and as the source of truth for the NVIDIA Quantum InfiniBand fabric (where NVIDIA UFM is the fabric manager), Netris can deliver cloud networking constructs for the entire cluster through a single abstract API.
\\n\\n\\n\\nHost Networking and DPU/SuperNIC
\\n\\n\\n\\nNetris can optionally manage host networking, including dynamic IP address assignments and static route configurations. It also handles various DPU/SuperNIC setups necessary for optimal GPU performance in a Spectrum-X environment. All configurations are managed by Netris software, helping ensure a secure solution for multi-tenant deployments.
\\n\\n\\n\\nFrom a GPU-based AI cluster use case perspective, a common operation is to carve out clusters isolated at the network switch level, allowing each tenant to access only the resources assigned to them. It is challenging for cloud providers to automate this part in-house even in CPU networks, and it is more challenging still in GPU networks because there are multiple fabrics (East-West, North-South/TAN, OOB-management), each requiring different low-level isolation techniques (Layer-2 VXLAN, Layer-3 VXLAN, VRFs, or pKeys).
\\n\\n\\n\\nNetris streamlines network isolation and multi-tenancy for cloud providers. Netris provides simple APIs where the user (or user-facing portal) can request a new “Cluster” and list GPU servers. The user (or user-facing portal) can choose the cluster to be either in a new VPC (Virtual Private Cloud – a unit of isolation in the cloud native world) or in one of the existing VPCs. Such an API request does not need to contain switch-fabric-level details – only a simple list of servers. Netris software will figure out and implement the necessary configuration dynamically, without conflicts, following all NVIDIA Spectrum-X deployment guidelines and best practices.
\\n\\n\\n\\nOnce the API request has been submitted to the Netris controller, Netris agents running on every switch and, optionally, on GPU hosts will automatically reconfigure the network to deliver the required access and isolation across VPCs and groups of GPU servers.
\\n\\n\\n\\nAll Netris functionality is accessible through (1) a web console – for viewing and ongoing changes, (2) RestAPI – commonly used by cloud services providers to consume Netris API from their customer-facing user portals, and (3) Terraform – usually used by the cloud services providers network and DevOps engineers.
\\n\\n\\n\\nThese methods are ideal for cloud providers that would like to offer dynamic multi-tenancy for their customers.
\\n\\n\\n\\nWhen NCPs (NVIDIA Cloud Partners) are building a cloud, they need to mimic cloud networking constructs beyond isolation, VPCs, and multi-tenancy: isolated tenants and VPCs need secure access to and from the Internet, and sometimes peering with end users’ other remote networks.
\\n\\n\\n\\nNetris SoftGate HS (hyper-scale) is a software gateway designed to utilize regular servers to provide multi-tenant, VPC-aware, and hyper-scalable cloud networking services, such as (1) Internet Gateway – provides Internet access to the hosts in the VPC, (2) NAT Gateway – provides 1:1 NAT, port-forwarding, or elastic IP services, (3) Elastic Load Balancing – critical for inference workloads to load balance incoming requests across multiple servers, and (4) Direct Connect – allows a tenant to connect their VPC cluster to their remote data center or remote office network.
\\n\\n\\n\\nThese cloud networking constructs are essential services that every public cloud provider offers to their users, so when NCPs (NVIDIA Cloud Partners) evaluate network automation and abstraction strategies, they should make sure to factor in these critical services.
\\n\\n\\n\\nEarning the Kubernetes and Cloud Native Security Associate (KCSA) certification is valuable for both organizations and IT professionals. This certification signifies a strong understanding of basic security configurations for Kubernetes clusters, crucial for embedding security across all roles in an organization. KCSA certification holders not only increase their value to current and future employers but also enhance their problem-solving skills in security incidents and gain confidence in cloud native security discussions. This certification ensures a thorough grasp of best practices, significantly improving an organization’s security posture and fostering a culture of continuous learning and development in a rapidly evolving tech landscape.
\\n\\n\\n\\nSecurity is in everyone’s interest across every organization, and obtaining a security-focused certification is increasingly valuable. Like the soon-to-be-updated Certified Kubernetes Security Specialist (CKS), the Kubernetes and Cloud Native Security Associate (KCSA) certification confirms a person’s deep understanding of basic cluster security configuration and ensures they have the knowledge to build security into their work, understanding what security configurations are in place and why, regardless of their role.
\\n\\n\\n\\nKey benefits of having a Kubernetes security certification
\\n\\n\\n\\nOverall, security is an essential part of every business, and as more and more organizations embrace cloud native computing it becomes necessary to extend security understanding to all roles. Kubernetes security certification demonstrates that you are continuously learning and developing – in itself a valuable trait. It shows employers that you understand the seriousness of security in cloud native environments and highlights your diligence.
\\n\\n\\n\\nRunning applications in a safe, scalable, and efficient manner is really difficult. Managing hardware failures, deployments, networking, and all the other components of modern software can be tricky. Kubernetes provides a means for more and more people to deploy applications in a simple way, knowing that Kubernetes can manage the complexity. However, the complexity doesn’t go away! It is just handled by the platform or orchestrator. It’s therefore essential to understand the security implications and best practices, which demands specialized knowledge and awareness of containers, data, and the types of threats you might encounter. That is why specialized knowledge in Kubernetes security is so important.
\\n\\n\\n\\nKubernetes security certification demonstrates that the holder has gone through comprehensive training on the best practices and tools available to help mitigate risks in a containerized environment. Being certified in Kubernetes security also differentiates you from other IT professionals: organizations and individuals in the industry recognise that Kubernetes environments need specialized knowledge, and certification calls out your in-depth understanding of the specialized threats on that platform.
\\n\\n\\n\\nToday’s tech landscape is in a state of constant change, from new ways of working and technologies to novel threats and risks. Continuous learning is essential if you want to remain relevant in the cloud native security space. Certification in Kubernetes security demonstrates your commitment to keeping your skills relevant in this fast-paced environment.
\\n\\n\\n\\nBut Kubernetes security certification is not just a “checkbox” exercise; the training and exposure to the tools and techniques needed to secure a Kubernetes environment are varied. The KCSA certification ensures that holders have covered all areas of risk rather than just one or two specific areas.
\\n\\n\\n\\nKCSA certification is particularly valuable for DevOps teams, as it enables the team to communicate using a common understanding built on shared learning. Where only some members of the team can be certified, they can use this learning to enhance and expand their colleagues’ security knowledge.
\\n\\n\\n\\nLearning the first principles of security in a Kubernetes cluster means that real-world security challenges can be examined through a knowledgeable lens. The certification makes you an effective member of a team, able to contribute understanding and best practices to any challenge.
\\n\\n\\n\\nThere is a large demand for individuals with security knowledge in the cloud native space. Regardless of your area of specialization within IT, your certification is a building block that could take you further into operations, secure development, DevSecOps, or even into security teams.
\\n\\n\\n\\nUnderstanding the components and risks in each area of Kubernetes is essential for effective problem-solving in security incidents. The certification brings together your organizational knowledge, your application knowledge, your platform knowledge, and your security knowledge, allowing you to effectively understand the stack and to troubleshoot, with a suitably experienced team, a security incident. As much as your certification will make you a vital team member, it is essential to remember that security incidents must be investigated in a specific way – especially when law enforcement may need to be involved.
\\n\\n\\n\\nAdditionally, Kubernetes security certification is vital for working in multi-cloud environments. As Kubernetes becomes the standard for multi-cloud deployments, having this certification proves an individual’s capability in managing security across hybrid, private, or multi-cloud Kubernetes environments. Kubernetes security certification isn’t specific to cloud environments either (cloud native doesn’t mean it has to run in the cloud!).
\\n\\n\\n\\nBeyond technical skills, the certification fosters a deeper engagement with the cloud-native community. It encourages and equips you to have deeper discussions on critical topics, leading to further collaboration and innovation within the industry. This involvement not only enhances personal growth but also contributes positively to the broader cloud-native ecosystem, driving progress and setting new standards in security practice, and it may inspire you to do more work with the CNCF. That leads to growth in the community, which can only be good for the entire industry!
\\n\\n\\n\\nKubernetes and Cloud Native Security Associate certification is a powerful asset for IT professionals. It enhances career prospects, bolsters problem-solving skills, and ensures comprehensive security knowledge in managing cloud-native applications. For organizations, it signifies a commitment to security excellence, aiding in recruitment, retention, and delivering secure solutions. This certification is essential for staying relevant in the dynamic tech landscape and contributing to a stronger, more secure future in cloud-native computing.
\\n\\nMember post by Anshul Sao, Co-founder & CTO, Facets.cloud
\\n\\n\\n\\nIn today’s tech landscape, organizations frequently face the need to migrate—whether from on-premise to the cloud, from one cloud provider to another, or managing multiple cloud environments. While cloud service providers (CSPs) offer equivalent services and often provide funding and resources to facilitate these migrations, a common challenge persists: many migrations result in a target state that is not significantly better, and sometimes even worse, than the initial state.
\\n\\n\\n\\nWhen it comes to cloud migration, a one-size-fits-all approach doesn’t work. Your strategy should be tailored to your starting point—your initial state. Understanding your current infrastructure before diving into a migration can make the difference between a smooth transition and a series of costly headaches.
\\n\\n\\n\\nFurther, such a migration may be a missed opportunity to clear out legacy mistakes and sub-optimal tools and practices.
\\n\\n\\n\\nFor CIOs, CTOs, and Heads of DevOps, this is especially crucial. Let’s explore why it’s essential to grasp your initial state and how it can shape your cloud migration journey, whether you’re moving from one cloud provider to another (like AWS to GCP) or transitioning from on-premises to the cloud.
\\n\\n\\n\\nBefore you can plan your migration, you need to understand the state of your existing infrastructure. There are three initial states most organizations fall into: (1) fully automated with modernized infrastructure, (2) fully automated but running on legacy systems such as VMs and self-hosted databases, and (3) largely manual processes with little automation.
\\n\\n\\n\\nEach of these scenarios requires a different approach to migration. Let’s break them down.
\\n\\n\\n\\nIn this scenario, your organization has already embraced modern practices. You’re running Kubernetes with cloud-native databases, and have well-defined CI/CD pipelines in place. Infrastructure as Code (IaC) is used for everything from infra creation to observability and other concerns.
\\n\\n\\n\\nBut here’s the catch: when you migrate to a different cloud provider, much of the tooling you’ve relied on becomes obsolete. Cloud-specific Terraform scripts, CDK, networking configurations, and security policies all need to be rewritten to fit the new environment. The same goes for your FinOps tools, account management systems, reporting, ETLs, and analytics. Cloud functions like AWS Lambda, Google Cloud Functions (GCF), and Azure Functions (AFs) need a fresh approach.
\\n\\n\\n\\nWhat Stays the Same: Thankfully, not everything changes. Your Kubernetes manifests, deployment strategies, config management, and CI/CD tooling can remain largely intact, providing continuity in an otherwise disruptive process.
\\n\\n\\n\\nAdopting Platform Engineering: This is the perfect moment to rethink how you approach automation. Instead of building out individual projects with repetitive processes, as in your current cloud, consider adopting platform engineering. By creating reusable building blocks that can be deployed across multiple projects, you’ll not only streamline your migration but also set your organization up for long-term scalability and success. Most of your existing tooling can be remodeled in this way, addressing often overlooked but important aspects such as DevOps and developer productivity, stable infrastructure, and cost optimization by design.
\\n\\n\\n\\nIf your infrastructure is fully automated but still running on legacy systems—think VMs and self-hosted databases—your migration will involve a steep learning curve. The automation you’ve developed for your current environment may not translate well to the target cloud provider.
\\n\\n\\n\\nChallenges: Moving from VMs and self-hosted databases to a new cloud environment means you must re-engineer everything. Autoscaling, health checks, load balancers, security groups, route tables—all fundamental components work differently across cloud providers. It’s almost like starting from scratch.
\\n\\n\\n\\nTooling Obsolescence: In this case, almost all of your tooling will need to be replaced. For example, your existing CloudFormation templates or Terraform scripts may be useless in the target cloud. Even your CI systems might require adjustments.
\\n\\n\\n\\nThe Modernization Hook: Now there are two choices: rebuild the posture as it was in the target cloud, or invest the same effort in modernizing your stack. Since you’re already redesigning and re-automating everything, you might as well future-proof your operations. The delta in effort required for modernization at this stage is minimal, as the kinds of testing and qualification—such as performance testing, sanity checks, and other validation processes—will be similar for both migration and modernization activities. By aligning these efforts, you can streamline the process and ensure that your infrastructure is ready for the future.
\\n\\n\\n\\nThis is the toughest spot, but it’s a reality for many organizations. If your processes are still largely manual and you haven’t embraced automation, migrating to the cloud will be a significant challenge.
\\n\\n\\n\\nChallenges: The main problem here is that your current operating method won’t cut it in the cloud. Manual processes are inefficient, error-prone, and impossible to scale in a modern cloud environment.
\\n\\n\\n\\nWhy Now is the Time to Change: Cloud migration is your chance to transform how your organization works. It’s the right moment to evaluate new tools and platforms to help you manage workloads more effectively in the cloud. All aspects, from provisioning to deployment to monitoring, can be automated, making your operations more efficient and reliable. Reskilling your team, adopting new practices, and investing in automation will make the migration smoother and set your organization up for future success. That said, it will require significant investment in upskilling your current workforce, seeking consultation, and finding partners to ease your journey of automation.
\\n\\n\\n\\nCloud migration is more than just a technical shift; it’s an opportunity to question your old practices, embrace new trends like platform engineering, and reboot your operations for a more efficient, scalable future. Assessing your initial state is the first step in this journey. Whether you’re fully automated with modernized infrastructure or struggling with manual processes, understanding where you stand will help you confidently navigate the complexities of cloud migration.
\\n\\n\\n\\nIf you’re ready to take the next step, consider exploring how platform engineering can affect your migration strategy. It could be the key to unlocking the full potential of your cloud operations.
\\n\\n\\n\\n\\n\\nEnd user post by Dan Williams, Senior Infrastructure Engineer at loveholidays
\\n\\n\\n\\nIn this blog post, we’ll share how loveholidays was able to utilise Linkerd to provide uniform metrics across all services, leading to a decrease in incident Mean Time To Discovery (MTTD) and an increased customer conversion rate by reducing search response times.
\\n\\n\\n\\nFounded in 2012, loveholidays is the largest and fastest growing online travel agency in the UK and Ireland. Having launched in the German market in May 2023, we’re on a mission to open the world to everyone. Our goal is to offer our customers unlimited choice with unmatched ease and unmissable value, providing the perfect holiday experience. To achieve that, we process trillions(!) of hotel/flight combinations per day, with millions of passengers travelling with us annually.
\\n\\n\\n\\nloveholidays has around 350 employees across London (UK) and Düsseldorf (Germany), with more than 100 of us in Tech and Product, and we are constantly growing. Our current engineering headcount sits at around 60 software engineers and 5 platform engineers.
\\n\\n\\n\\nEngineering at loveholidays is scaled based on 5 simple principles, embodying our “you build it, you run it” engineering culture. Our engineers are empowered to own the full software delivery lifecycle (SDLC), meaning each and every engineer in our company is responsible for their services at every step of the journey, from initial design through to deploying to production, building monitoring alerts/dashboards and being on-call for all day-2 operations.
\\n\\n\\n\\nThe platform infrastructure team at loveholidays works to enable developers to operate in a self-serve manner. We do this by identifying common problems and friction points, then solving them with infrastructure, process and tooling. We are huge advocates for open source tooling, meaning we are constantly pushing the boundaries and working on the cutting edge of technology.
\\n\\n\\n\\nWe are a team of Google Cloud Platform evangelists. In 2018, we migrated from on-prem to GCP (see our Google case study here!), with 100% of our infrastructure now in the cloud. We run 5 GKE clusters spread across production, staging and development environments, with all services running in one primary region in London. We are actively working on introducing a multi-cluster, multi-region architecture. Learn more about our multi-cluster expansion in this blog post.
\\n\\n\\n\\nWe run somewhere in the region of 5000 production pods with around 300 Deployments / StatefulSets, all managed by our development teams. The languages, frameworks, and tooling used are all governed by the teams themselves, so we have services in Java, Go, Python, Rust, JavaScript, TypeScript, and more. One of our engineering principles is “Technology is a means to an end”, meaning teams are empowered to pick the correct language for the task rather than having to conform to existing standards.
\\n\\n\\n\\nWe use Grafana’s LGTM Stack (Loki, Grafana, Tempo, Mimir) for all of our observability needs, along with other open-source tools such as the Prometheus-Operator to scrape our application’s metrics, ArgoCD for GitOps along with Argo Rollouts and Kayenta powering our canary deployments.
\\n\\n\\n\\nWith GitOps powered deployments, we deploy to production over 1500 times a month. We move fast and we occasionally break stuff.
\\n\\n\\n\\nSearch response time has a direct correlation with customer conversion rate — the likelihood for a customer to complete their purchase on our website. In other words, it’s absolutely vital for us to monitor the latency and uptime of our services to spot regressions or poorly behaving components. Latency also correlates with infrastructure cost as faster services are cheaper to run.
\\n\\n\\n\\nSince teams are in charge of creating their own monitoring dashboards and alerts, historically, each team reinvented the wheel for monitoring their applications. This resulted in inconsistent visualisations, incorrect queries/calculations, or completely missing data. A simple example of this is one application reporting throughput in requests per minute, another with requests per second. Some applications would record latency in averages, some with percentiles such as P50/P95/P99, but not even consistently the same set of percentiles.
\\n\\n\\n\\nThese small details made it nearly impossible to quickly compare the performance of any given service: the observer would first have to learn the dashboards for each application before beginning to evaluate its health. We couldn’t quickly identify when one application was performing particularly badly, meaning regressions introduced with new deployments took longer to discover, directly impacting our sales / conversion rate.
\\n\\n\\n\\nEach language and framework we use brings with it a new set of metrics and observability challenges. Just creating common dashboards wouldn’t be an option, as no two applications present the same set of metrics, and this is where we decided that a service mesh could help us. Using a service mesh, we would immediately have a uniform set of Golden Signals (latency, throughput, errors) for HTTP traffic, regardless of the underlying languages or tooling.
\\n\\n\\n\\nAs well as uniform monitoring, we knew a service mesh would help us with a number of other items on our roadmap, such as mTLS authentication, automated retries, canary deploys, and even east-west multi-cluster traffic.
\\n\\n\\n\\nThe next question was which service mesh? Like most companies, we decided to evaluate Linkerd and Istio, the only two CNCF-graduated service meshes. Our team had brief previous exposure to both Istio and Linkerd, having tried and failed to implement both meshes in the very early days of our GKE migration. We knew Linkerd had been completely rewritten with 2.0, so we decided to pursue this first as our Proof of Concept (PoC).
\\n\\n\\n\\nOne of our core principles is “Invest in simplicity”, and this is embodied throughout Linkerd’s entire philosophy and product. For our initial PoC, we installed Linkerd to our dev cluster using their CLI, added the Viz plugin (which provides real-time traffic insights in a nice dashboard), and finally added the `linkerd.io/inject: enabled` annotation to a few deployments. That was it. It just worked. We had standardised golden metrics being generated by the sidecar Linkerd proxies, mTLS securing pod-to-pod traffic, and we could listen to live traffic using the tap feature, all within about 15 minutes of deciding to install Linkerd in dev.
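\\n\\n\\n\\nAs a rough illustration of how lightweight that step is, the sketch below shows a Deployment with the proxy-injection annotation on its pod template. The service name and image are placeholders rather than one of our real workloads.
\\n\\n\\n\\napiVersion: apps/v1\\nkind: Deployment\\nmetadata:\\n  name: search-api                 # hypothetical service\\nspec:\\n  replicas: 2\\n  selector:\\n    matchLabels:\\n      app: search-api\\n  template:\\n    metadata:\\n      labels:\\n        app: search-api\\n      annotations:\\n        linkerd.io/inject: enabled   # ask Linkerd to inject its sidecar proxy\\n    spec:\\n      containers:\\n        - name: search-api\\n          image: example/search-api:1.0.0\\n          ports:\\n            - containerPort: 8080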
\\n\\n\\n\\nLinkerd as a product aligned exactly with how we approach engineering: by investing in simplicity. We had a clear goal in mind, without additional complexity or overhead, and our findings during the PoC stage made it immediately clear that Linkerd was the correct solution to the problems we set out to solve. The entire open source community around Linkerd also deserves a mention, as everyone involved is incredibly open and willing to help.
\\n\\n\\n\\nTo roll out Linkerd into production, we took a very slow and calculated approach. It took us over six months to get to full coverage, as we onboarded a small number of applications at a time and then carefully monitored these in production for any regressions. Our implementation journey is detailed in a three-part series on our tech blog, starting with “Linkerd at loveholidays — Our journey to a production service mesh”.
\\n\\n\\n\\nThis slow approach may seem counterintuitive given we went with Linkerd for both ease and speed of deployment. The reality is that edge cases exist, and things will always break in ways you don’t expect as no two production environments are the same.
\\n\\n\\n\\nThroughout our onboarding journey, we identified and fixed a number of edge cases, such as connection and memory leaks, requests being dropped due to malformed headers, and a number of other network-related issues that were previously hidden until the Linkerd proxy exposed them.
\\n\\n\\n\\nI like the following excerpt from the Debugging 502s page in Linkerd’s documentation: “Linkerd turns connection errors into HTTP 502 responses. This can make issues which were previously undetected suddenly visible.”
\\n\\n\\n\\nOur efforts, combined with the Linkerd maintainers’ receptiveness and willingness to help, led to fixes in both our code and Linkerd’s to address some of the issues we found. As we progressed through our onboarding journey and identified different failure modes, we added new Prometheus and Loki (logs) alerts to quickly alert us to issues related to the service mesh before they became problems.
\\n\\n\\n\\nSince finishing our onboarding in early 2023, `linkerd.io/inject: enabled` has become the default setting for all new applications deployed via our common Helm Chart and isn’t even something we think about anymore.
\\n\\n\\n\\nWe built a common dashboard based on Linkerd metrics, automatically showing for all meshed applications:
\\n\\n\\n\\n…and more. All new services are automatically meshed, meaning we provide near-full observability for all services from the moment they are deployed.
\\n\\n\\n\\nFor each service, we can see the traffic from and to different services, so we can quickly identify if, for example, a particular downstream service is responding slowly or erroring. Previous dashboards might tell us requests were failing, but couldn’t easily identify the source or destination service for those failing requests.
\\n\\n\\n\\nWith Linkerd’s throughput metrics, combined with some pod labels from Kube State Metrics, we have been able to produce dashboards identifying cross-zone traffic inside the Cluster, which we’ve then used to optimise our traffic and save significant amounts on our networking spend. We plan to use Linkerd’s HAZL to further reduce our cross-zone traffic.
\\n\\n\\n\\nUsing a combination of Argo Rollouts and Kayenta, we use the metrics provided by Linkerd to perform Canary deployments with automated rollbacks in case of failure. A new deployment will create 10% of pods using the new image, then we use the Linkerd metrics captured from the new and old pods to perform statistical analysis and identify issues before the new version is promoted to stable. This means if, for example, an application’s P95 response time increases by a certain % compared to the previous stable version, we can consider it a failed deployment and rollback automatically.
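\\n\\n\\n\\nA minimal sketch of this pattern with Argo Rollouts is shown below, assuming a hypothetical search-api workload and an analysis step backed by a Prometheus query over Linkerd’s response_total metric. Our production setup also involves Kayenta and Spinnaker, which are not shown here, and the exact label names depend on how the proxy metrics are scraped.
\\n\\n\\n\\napiVersion: argoproj.io/v1alpha1\\nkind: Rollout\\nmetadata:\\n  name: search-api\\nspec:\\n  replicas: 10\\n  selector:\\n    matchLabels:\\n      app: search-api\\n  template:\\n    metadata:\\n      labels:\\n        app: search-api\\n    spec:\\n      containers:\\n        - name: search-api\\n          image: example/search-api:2.0.0\\n  strategy:\\n    canary:\\n      steps:\\n        - setWeight: 10              # 10% of pods run the new version\\n        - analysis:                  # promote to 100% only if the analysis passes\\n            templates:\\n              - templateName: linkerd-success-rate\\n---\\napiVersion: argoproj.io/v1alpha1\\nkind: AnalysisTemplate\\nmetadata:\\n  name: linkerd-success-rate\\nspec:\\n  metrics:\\n    - name: success-rate\\n      interval: 1m\\n      count: 5\\n      successCondition: result[0] >= 0.99\\n      provider:\\n        prometheus:\\n          address: http://prometheus.monitoring:9090   # assumed Prometheus address\\n          query: |\\n            sum(rate(response_total{deployment="search-api", direction="inbound", classification="success"}[2m]))\\n            /\\n            sum(rate(response_total{deployment="search-api", direction="inbound"}[2m]))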
\\n\\n\\n\\nWe have caught hundreds of failed deployments with this approach, significantly reducing our MTTD and incident response time. Oftentimes, issues that would have become full-scale production outages are caught and rolled back long before an engineer would have noticed them. This type of metrics-based analysis is only possible when you collect data from your applications in a consistent way.
\\n\\n\\n\\nWe’ve also built automated latency and success SLIs and SLOs based on Linkerd metrics, using a combination of open source tools: Pyrra and Sloth.
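\\n\\n\\n\\nTo give a flavour of what those SLO definitions look like, here is a minimal sketch using Sloth’s PrometheusServiceLevel resource with an availability SLO derived from Linkerd’s response_total metric. The service name, objective, and label set are illustrative rather than our production values.
\\n\\n\\n\\napiVersion: sloth.slok.dev/v1\\nkind: PrometheusServiceLevel\\nmetadata:\\n  name: search-api\\n  namespace: monitoring\\nspec:\\n  service: search-api\\n  slos:\\n    - name: requests-availability\\n      objective: 99.9\\n      description: Proportion of successful inbound requests, as seen by the Linkerd proxy.\\n      sli:\\n        events:\\n          errorQuery: sum(rate(response_total{deployment="search-api", direction="inbound", classification="failure"}[{{.window}}]))\\n          totalQuery: sum(rate(response_total{deployment="search-api", direction="inbound"}[{{.window}}]))\\n      alerting:\\n        name: SearchApiAvailability\\n        pageAlert:\\n          labels:\\n            severity: critical\\n        ticketAlert:\\n          labels:\\n            severity: warning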
\\n\\n\\n\\nThe final use case we’ll share is that we export all of the Linkerd metrics to BigQuery, and have an Application Performance Monitoring dashboard which shows us latency and RPS for all applications since we started collecting Linkerd metrics. This is very useful for identifying regressions (or improvements, as shown below!) over time:
\\n\\n\\n\\nOf course, Linkerd offers so much more than just metrics, and we have been able to utilise these features in other areas but we are barely scratching the surface of what is possible. We’ve utilised Service Profiles to define Retries and Timeouts at the Service level, meaning this becomes a uniform configuration to make an endpoint automatically retryable, allowing us to shift this logic out of application code.
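\\n\\n\\n\\nFor illustration, a ServiceProfile along these lines marks a route as retryable and gives it a timeout, with a retry budget capping how much extra load retries may add. The service and route names here are hypothetical.
\\n\\n\\n\\napiVersion: linkerd.io/v1alpha2\\nkind: ServiceProfile\\nmetadata:\\n  name: search-api.default.svc.cluster.local   # must match the Service FQDN\\n  namespace: default\\nspec:\\n  routes:\\n    - name: GET /api/search\\n      condition:\\n        method: GET\\n        pathRegex: /api/search\\n      isRetryable: true          # idempotent endpoint, safe to retry\\n      timeout: 500ms\\n  retryBudget:\\n    retryRatio: 0.2              # retries may add at most 20% extra load\\n    minRetriesPerSecond: 10\\n    ttl: 10s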
\\n\\n\\n\\nWith the new advances in HTTPRoutes from Linkerd, we are excited to see how we can utilise new features such as Traffic Splitting and Fault Injection to further enhance our use of the service mesh.
\\n\\n\\n\\nYou can read more about our experience using Linkerd for monitoring at Linkerd at loveholidays — Monitoring our apps using Linkerd metrics.
\\n\\n\\n\\nA great user journey is at the heart of everything we do, with a fast search being crucial to the holiday browsing experience for our customers. And as mentioned above, search response time has a direct correlation with customer conversion rate. A refactor of our internal content repository tooling that decreased P95 search time resulted in a 2.61% increase in conversion, based on a 50% traffic split A/B test. Based on the numbers published in the ATOL report, with loveholidays flying nearly 3 million passengers in 2023, this means an additional ~75,000 passengers travelled with us as a result of a faster search, all monitored and powered with the metrics produced by the Linkerd proxy.
\\n\\n\\n\\nAt loveholidays, we are big open source fans, and as such, we use many CNCF projects. Here’s a quick overview of our current CNCF stack. At the core of it all is, of course, Kubernetes. All of our production applications are hosted in Google Kubernetes Engine, with Gateway API used for all ingress traffic.
\\n\\n\\n\\nWe’ve recently migrated from Flux to ArgoCD, which we use as our GitOps controller with a combination of Kustomize and Helm to deploy our manifests. We have canary analysis and automated rollbacks using Argo Rollouts, Kayenta and Spinnaker all powered by Linkerd metrics. We’ve also developed an in-house “common” Helm chart, powering all of our applications. This ensures consistent labels and resources for all applications.
\\n\\n\\n\\nSome of our workloads use KEDA to scale pods based on GCP’s Pub/Sub and/or RabbitMQ. We have processes which dump thousands of messages at once into a Pub/Sub Queue, so we use KEDA to scale up hundreds of pods at a time to rapidly handle this load.
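\\n\\n\\n\\nA minimal sketch of that pattern with KEDA’s gcp-pubsub scaler is shown below. The subscription name, thresholds, and credential wiring are assumptions, and the exact trigger metadata fields vary slightly between KEDA versions.
\\n\\n\\n\\napiVersion: keda.sh/v1alpha1\\nkind: ScaledObject\\nmetadata:\\n  name: message-processor\\nspec:\\n  scaleTargetRef:\\n    name: message-processor      # Deployment to scale\\n  minReplicaCount: 0             # scale to zero when the queue is empty\\n  maxReplicaCount: 200\\n  triggers:\\n    - type: gcp-pubsub\\n      metadata:\\n        subscriptionName: message-processor-sub\\n        mode: SubscriptionSize   # scale on the number of undelivered messages\\n        value: "100"             # target roughly 100 messages per replica\\n      authenticationRef:\\n        name: gcp-credentials    # TriggerAuthentication holding the GCP service account key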
\\n\\n\\n\\nWe use Conftest / OPA as part of both our Kubernetes and Terraform pipelines. This enables us to enforce best practices as a platform team (check out Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code for more details).
\\n\\n\\n\\nWe store our secrets in Hashicorp Vault, with Vault Secret Operator for Cluster integration.
\\n\\n\\n\\nOur monitoring system is fully open source. We built it with Grafana’s Mimir, Grafana, Prometheus, Loki, and Tempo. We use the prometheus-operator via the kube-prometheus-stack. We also use OpenTelemetry and collect distributed tracing with otel-collector and Tempo, and Pyroscope for Application Performance Monitoring.
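\\n\\n\\n\\nAs a rough sketch of that tracing pipeline, an otel-collector configuration along the following lines receives OTLP spans from applications and forwards them to Tempo; the Tempo endpoint shown is an assumed address, not our actual one.
\\n\\n\\n\\nreceivers:\\n  otlp:\\n    protocols:\\n      grpc:\\n      http:\\nprocessors:\\n  batch: {}                      # batch spans before export to reduce overhead\\nexporters:\\n  otlp:\\n    endpoint: tempo.monitoring.svc.cluster.local:4317   # assumed Tempo OTLP endpoint\\n    tls:\\n      insecure: true\\nservice:\\n  pipelines:\\n    traces:\\n      receivers: [otlp]\\n      processors: [batch]\\n      exporters: [otlp]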
\\n\\n\\n\\nKnown internally as Devportal, we use Backstage to keep track of our services as an internal service catalogue. We have tied Backstage “service” resources into our common Helm chart to ensure every deployment inside Kubernetes has a Backstage / Devportal reference, with quick links to the Github repository, logs, Grafana dashboards, Linkerd Viz, and so on.
\\n\\n\\n\\nVelero is used for our cluster backups, and Trivy is used to scan our container images. Kubeconform is used as part of our CI pipelines for K8s manifest validation. A combination of cert-manager and Google-managed certificates, both using Let’s Encrypt, power loveholidays’ certificates, and external-dns automates DNS record creation via Route53. We also use cert-manager to generate the full certificate chain for Linkerd, which is covered in the third part of our Linkerd blog series – Linkerd at loveholidays — Deploying Linkerd in a GitOps world.
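\\n\\n\\n\\nThe sketch below shows the general shape of that cert-manager setup, loosely following the pattern in Linkerd’s documentation for automatically rotating control plane TLS credentials: a CA Issuer backed by the trust anchor, plus a Certificate that keeps the identity issuer certificate rotated. Durations and names are illustrative.
\\n\\n\\n\\napiVersion: cert-manager.io/v1\\nkind: Issuer\\nmetadata:\\n  name: linkerd-trust-anchor\\n  namespace: linkerd\\nspec:\\n  ca:\\n    secretName: linkerd-trust-anchor     # secret holding the root CA key pair\\n---\\napiVersion: cert-manager.io/v1\\nkind: Certificate\\nmetadata:\\n  name: linkerd-identity-issuer\\n  namespace: linkerd\\nspec:\\n  secretName: linkerd-identity-issuer\\n  duration: 48h\\n  renewBefore: 25h\\n  issuerRef:\\n    name: linkerd-trust-anchor\\n    kind: Issuer\\n  commonName: identity.linkerd.cluster.local\\n  dnsNames:\\n    - identity.linkerd.cluster.local\\n  isCA: true\\n  privateKey:\\n    algorithm: ECDSA\\n  usages:\\n    - cert sign\\n    - crl sign\\n    - server auth\\n    - client auth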
\\n\\nProject post by Karmada Maintainers
\\n\\n\\n\\nKarmada is an open multi-cloud and multi-cluster container orchestration engine designed to help users deploy and operate business applications in a multi-cloud environment. With its compatibility with the native Kubernetes API, Karmada can smoothly migrate single-cluster workloads while still maintaining coordination with the surrounding Kubernetes ecosystem tools.
\\n\\n\\n\\nThis version includes the following new features: cross-cluster rolling upgrades for federated workloads, an enhanced operational experience with Karmadactl, standardized generation semantics for federated workloads, and custom CRD download strategies for the Karmada Operator.
\\n\\n\\n\\nIn the latest released v1.11 version, Karmada has added the feature of cross-cluster rolling upgrades for federated workloads. This feature is particularly suitable for workloads deployed across multiple clusters, allowing users to adopt more flexible and controllable rolling upgrade strategies when releasing new versions of their workloads. Users can finely control the upgrade process to ensure a smooth transition for each cluster during the upgrade, minimizing the impact on the production environment. This feature not only enhances the user experience but also provides more flexibility and reliability for complex multi-cluster management.
\\n\\n\\n\\nBelow is an example to demonstrate how to perform a rolling upgrade on federated workloads:
\\n\\n\\n\\nAssuming that the user has already propagated the Deployment to three member clusters through PropagationPolicy: `ClusterA`, `ClusterB`, `ClusterC`:
\\n\\n\\n\\napiVersion: policy.karmada.io/v1alpha1\\nkind: PropagationPolicy\\nmetadata:\\n  name: nginx-propagation\\nspec:\\n  resourceSelectors:\\n    - apiVersion: apps/v1\\n      kind: Deployment\\n      name: nginx\\n  placement:\\n    clusterAffinity:\\n      clusterNames:\\n        - ClusterA\\n        - ClusterB\\n        - ClusterC
\\n\\n\\n\\nAt this point, the version of the Deployment is v1. To upgrade the Deployment resource version to v2, users can perform the following steps in sequence.
\\n\\n\\n\\nFirstly, the user configures the PropagationPolicy to temporarily halt the propagation of resources to `ClusterA` and `ClusterB`, so that the deployment changes will only occur in `ClusterC`:
\\n\\n\\n\\napiVersion: policy.karmada.io/v1alpha1\\nkind: PropagationPolicy\\nmetadata:\\n  name: nginx-propagation\\nspec:\\n  #...\\n  suspension:\\n    dispatchingOnClusters:\\n      clusterNames:\\n        - ClusterA\\n        - ClusterB
\\n\\n\\n\\nThen, update the PropagationPolicy resource to allow the system to synchronize the new version of the resources to the `ClusterB` cluster:
\\n\\n\\n\\n  suspension:\\n    dispatchingOnClusters:\\n      clusterNames:\\n        - ClusterA
\\n\\n\\n\\nFinally, remove the `suspension` field from the PropagationPolicy resource to allow the system to synchronize the new version of the resources to the `ClusterA` cluster:
\\n\\n\\n\\nFrom the example above, we can see that by using the cross-cluster rolling upgrade capability of federated workloads, the new version of the workload can be rolled out cluster by cluster, and precise control can be achieved.
\\n\\n\\n\\nAdditionally, this feature can also be applied to other scenarios:
\\n\\n\\n\\nIn this version, the Karmada community has focused on enhancing Karmadactl capabilities to provide a better multi-cluster operations experience, thereby reducing users’ reliance on kubectl.
\\n\\n\\n\\nA More Extensive Command Set
\\n\\n\\n\\nKarmadactl now supports a richer command set including `create`, `patch`, `delete`, `label`, `annotate`, `edit`, `attach`, `top node`, `api-resources`, and `explain`. These commands allow users to perform more operations on resources either on the Karmada control plane or member clusters.
\\n\\n\\n\\nEnhanced Functionality
\\n\\n\\n\\nKarmadactl introduces the `--operation-scope` parameter to control the scope of command operations. With this new parameter, commands such as `get`, `describe`, `exec`, and `explain` can flexibly switch between cluster perspectives to operate on resources in the Karmada control plane or member clusters.
\\n\\n\\n\\nMore Detailed Command Output Information
\\n\\n\\n\\nThe output of the `karmadactl get cluster` command now includes additional details such as the cluster object’s `Zones`, `Region`, `Provider`, `API-Endpoint`, and `Proxy-URL`.
\\n\\n\\n\\nThrough these capability enhancements, the operational experience with karmadactl has been improved. New features and more detailed information about karmadactl can be accessed using `karmadactl --help`.
\\n\\n\\n\\nIn this version, Karmada has standardized the generation semantics of workload at the federation level. This update provides a reliable reference for the release system, enhancing the accuracy of cross-cluster deployments. By standardizing generation semantics, Karmada simplifies the release process and ensures consistent tracking of workload status, making it easier to manage and monitor applications across multiple clusters.
\\n\\n\\n\\nThe specifics of the standardization are as follows: the observedGeneration value in the status of the federated workload is set to its own `.metadata.generation` value only when the state of resources distributed to all member clusters satisfies `status.observedGeneration` >= `metadata.generation`. This ensures that the corresponding controllers in each member cluster have completed processing of the workload. This move aligns the generation semantics at the federation level with those of Kubernetes clusters, allowing users to more conveniently migrate single-cluster applications to a multi-cluster setup.
\\n\\n\\n\\nThe following resources have been adapted in this version:
\\n\\n\\n\\nIf you need to adapt more resources (including CRDs), you can provide feedback to the Karmada community or extend using the Resource Interpreter.
\\n\\n\\n\\nCRD (Custom Resource Definition) resources are key prerequisite resources used by the Karmada Operator to configure new Karmada instances. These CRD resources contain critical API definitions for the Karmada system, such as PropagationPolicy, ResourceBinding, and Work.
\\n\\n\\n\\nIn version v1.11, the Karmada Operator supports custom CRD download strategies. With this feature, users can specify the download path for CRD resources and define additional download strategies, providing a more flexible offline deployment method.
\\n\\n\\n\\nFor a detailed description of this feature, refer to the proposal: Custom CRD Download Strategy Support for Karmada Operator.
\\n\\n\\n\\nThe Karmada v1.11 release includes 223 code commits from 36 contributors. We would like to extend our sincere gratitude to all the contributors:
\\n\\n\\n\\n@08AHAD | @a7i | @aditya7302 | @Affan-7 |
@Akash-Singh04 | @anujagrawal699 | @B1F030 | @chaosi-zju |
@dzcvxe | @grosser | @guozheng-shen | @hulizhe |
@iawia002 | @mohamedawnallah | @mszacillo | @NishantBansal2003 |
@jabellard | @khanhtc1202 | @liangyuanpeng | @qinguoyi |
@RainbowMango | @rxy0210 | @seanlaii | @spiritNO1 |
@tiansuo114 | @varshith257 | @veophi | @wangxf1987 |
@whitewindmills | @xiaoloongfang | @XiShanYongYe-Chang | @xovoxy |
@yash | @yike21 | @zhy76 | @zhzhuang-zju |
More Information:
\\n\\n\\n\\nKarmada Website: https://karmada.io/
\\n\\n\\n\\nGitHub: https://github.com/karmada-io/karmada
\\n\\n\\n\\nSlack: https://slack.cncf.io/ (#karmada)
\\n\\n\\n\\nKarmada v1.11: https://github.com/karmada-io/karmada/releases/tag/v1.11.0
\\n\\n\\n\\nCustom CRD Download Strategy Support for Karmada Operator: https://github.com/karmada-io/karmada/tree/master/docs/proposals/operator-custom-crd-download-strategy
\\n\\nCommunity post originally published on Medium by Maryam Tavakkoli
\\n\\n\\n\\nThis article will explore CNCF projects that directly contribute to green technology, helping organizations align with their sustainability goals.
\\n\\n\\n\\nIn recent years, the conversation around sustainability has expanded beyond traditional industries and entered the realm of technology. As data centers, cloud platforms, and digital services become more vital to modern life, their environmental impact has become impossible to ignore. By enabling businesses to optimize resource consumption, reduce carbon footprints, and improve operational efficiency, CNCF projects are paving the way for more sustainable tech practices.
\\n\\n\\n\\nThe environmental footprint of cloud computing is significant. Powering millions of servers across the globe requires a tremendous amount of energy, and inefficient resource utilization can lead to waste. According to studies, data centers contribute around 1% of global electricity use. However, cloud-native technologies — like Kubernetes, and other CNCF projects — offer a new path forward by optimizing resource usage, reducing waste, and supporting more sustainable infrastructure management.
\\n\\n\\n\\nAt the heart of CNCF’s green tech ecosystem is Kubernetes, the de facto platform for container orchestration. Kubernetes allows organizations to run containerized applications efficiently by dynamically scaling resources to meet demand.
\\n\\n\\n\\nCapabilities such as Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler ensure that workloads consume only the resources they need, reducing energy consumption by avoiding over-provisioning. Kubernetes’ ability to optimize infrastructure usage plays a crucial role in minimizing the energy footprint of large-scale cloud environments, ensuring that data centers run efficiently and sustainably.
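\\n\\n\\n\\nFor example, a Horizontal Pod Autoscaler like the following sketch (the Deployment name and target utilization are illustrative) keeps replica counts proportional to actual CPU demand instead of a fixed worst-case number:
\\n\\n\\n\\napiVersion: autoscaling/v2\\nkind: HorizontalPodAutoscaler\\nmetadata:\\n  name: web-frontend\\nspec:\\n  scaleTargetRef:\\n    apiVersion: apps/v1\\n    kind: Deployment\\n    name: web-frontend\\n  minReplicas: 2\\n  maxReplicas: 20\\n  metrics:\\n    - type: Resource\\n      resource:\\n        name: cpu\\n        target:\\n          type: Utilization\\n          averageUtilization: 70   # add replicas only when average CPU crosses 70%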
\\n\\n\\n\\nKEDA (Kubernetes Event-Driven Autoscaling) is another CNCF project that plays a key role in resource efficiency. KEDA allows Kubernetes to scale workloads based on event-driven patterns, ensuring that resources are dynamically adjusted in response to actual demand. By autoscaling applications, KEDA helps avoid over-provisioning and minimizes energy consumption.
\\n\\n\\n\\nWith KEDA, applications only consume resources when triggered by specific events, making it an ideal solution for organizations seeking to optimize their infrastructure and reduce energy waste in cloud-native environments.
\\n\\n\\n\\nKubeGreen is a project within the Kubernetes ecosystem specifically designed to reduce the energy consumption of clusters. It achieves this by scaling down or stopping non-essential workloads during periods of low demand, such as off-peak hours or weekends.
\\n\\n\\n\\nBy pausing workloads and conserving resources when they are not needed, KubeGreen allows organizations to significantly reduce the energy footprint of their Kubernetes clusters. This feature makes KubeGreen a key tool for businesses looking to align their infrastructure management practices with sustainability goals.
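\\n\\n\\n\\nA minimal example of this, using the project’s SleepInfo resource (the project is published as kube-green; the schedule and timezone here are illustrative), scales the Deployments in a namespace down in the evening on weekdays and restores them in the morning:
\\n\\n\\n\\napiVersion: kube-green.com/v1alpha1\\nkind: SleepInfo\\nmetadata:\\n  name: working-hours\\n  namespace: dev            # applies to workloads in this namespace\\nspec:\\n  weekdays: "1-5"           # Monday to Friday\\n  sleepAt: "20:00"          # scale Deployments down in the evening\\n  wakeUpAt: "08:00"         # restore the original replica counts in the morning\\n  timeZone: "Europe/Helsinki"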
\\n\\n\\n\\nThe rise of edge computing is another significant factor in reducing the energy demands of cloud infrastructure. KubeEdge, a CNCF project, extends Kubernetes to the edge, allowing applications to run closer to where data is generated, reducing the need for long-distance data transfers.
\\n\\n\\n\\nBy processing data at the edge, KubeEdge reduces the energy and bandwidth consumption of centralized cloud data centers. This is particularly important for IoT environments, where large volumes of data are continuously produced. KubeEdge helps organizations optimize power consumption, contributing to more sustainable tech operations at the edge.
\\n\\n\\n\\nPrometheus, the leading open-source monitoring and alerting toolkit, plays a vital role in making cloud infrastructure more sustainable. By providing real-time metrics on resource usage, Prometheus enables organizations to monitor CPU, memory, and network consumption closely. This visibility helps teams identify inefficiencies and optimize workloads to minimize energy consumption.
\\n\\n\\n\\nPrometheus’ ability to provide insights into resource bottlenecks and underutilized infrastructure allows businesses to take proactive steps in reducing their environmental impact, such as shutting down idle servers or optimizing workloads for efficiency.
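\\n\\n\\n\\nAs a sketch of how such insight can be turned into action, a Prometheus rule along these lines (the threshold and duration are arbitrary, and the expression assumes cAdvisor and kube-state-metrics are being scraped) flags pods that use only a small fraction of their requested CPU:
\\n\\n\\n\\ngroups:\\n  - name: resource-efficiency\\n    rules:\\n      - alert: PodCPUOverProvisioned\\n        expr: |\\n          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))\\n            /\\n          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})\\n            < 0.2\\n        for: 6h\\n        labels:\\n          severity: info\\n        annotations:\\n          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} uses under 20% of its requested CPU"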
\\n\\n\\n\\nWhile not part of CNCF, Karpenter is an open-source project by AWS that enhances cluster autoscaling in Kubernetes, making it highly relevant to organizations pursuing sustainability. By right-sizing compute resources in real-time and optimizing node placement, Karpenter reduces idle infrastructure and energy waste. Its dynamic autoscaling policies allow workloads to consume only what they need, leading to significant reductions in both costs and energy consumption.
\\n\\n\\n\\nThe Green Software Foundation complements CNCF’s sustainability goals by focusing on energy-efficient software development. This nonprofit initiative promotes sustainable coding practices, helping developers build applications with lower energy consumption and carbon footprints.
\\n\\n\\n\\nThough not a CNCF project, its best practices, toolkits, and research contribute to the broader green tech movement, ensuring that the software powering cloud-native infrastructure is as energy-efficient as the infrastructure itself.
\\n\\n\\n\\nWhile CNCF projects such as Kubernetes, KEDA, and Prometheus are fostering innovation in green cloud-native infrastructure, the broader open-source ecosystem also plays a critical role. Tools like Karpenter and initiatives like the Green Software Foundation show that sustainability is not limited to any one organization or project. Collaboration across the tech landscape is key to reducing the environmental footprint of digital infrastructure, and as more projects embrace green practices, the future of cloud-native technology promises to be more sustainable.
\\n\\n\\n\\n\\n\\n\\n\\n\\nI would love to hear your thoughts and feedback on this article. Let’s continue learning, sharing, and evolving together! Until next time!
\\n\\n\\n\\n\\n\\nCommunity post by Ronald Petty and Tom Thorley of the Internet Society US San Francisco Bay Area Chapter (original post)
\\n\\n\\n\\nWhen you hear the word encryption, what comes to mind? Take a moment…
\\n\\n\\n\\nUpon asking this question to others, answers range from nice and clear, such as “safety”, to challenging and obtuse, such as “hard to use.” It’s not surprising that such a range of responses comes to mind. Encryption is an idea, a set of tools, part of a toolchain, or an action that one takes. Turning to a practical description of encryption, from Cloudflare, we see:
\\n\\n\\n\\n\\n\\n\\n\\n\\nWhat is encryption?
\\n\\n\\n\\nEncryption is a way of scrambling data so that only authorized parties can understand the information. In technical terms, it is the process of converting human-readable plaintext to incomprehensible text, also known as ciphertext. In simpler terms, encryption takes readable data and alters it so that it appears random. Encryption requires the use of a cryptographic key: a set of mathematical values that both the sender and the recipient of an encrypted message agree on.
\\n
This definition already closely matches our ad-hoc survey from earlier: “safety” === “only authorized parties can understand the information” and “hard to use” === the rest of the description.
\\n\\n\\n\\nWhy should we care about encryption? After all, if you do nothing wrong, what’s there to hide? This is a common sentiment behind why people are lax about security. Computer crime is often invisible to end-users. Ever receive a mailing from a service provider stating that your data, along with that of many other customers, “may” have been stolen? It’s usually at these moments that we wonder whether our data is safe in the hands of those holding it.
\\n\\n\\n\\nIn today’s interconnected landscape, where our lives unfold online—from social interactions to financial transactions—encryption has become a vital safeguard. Every day, we trust digital platforms with personal details. Encryption serves as a protective barrier, defending these interactions against cyber threats. In an era marked by data breaches and identity theft, strong encryption is more important than ever. It acts as a shield, deterring cybercriminals and ensuring that our private lives remain just that—private.
\\n\\n\\n\\nBut encryption isn’t solely about protection; it represents a broader principle of freedom. In places where information is controlled, encryption empowers individuals to communicate without the fear of surveillance or retaliation. It underscores the belief that privacy is not just a luxury but a fundamental right.
\\n\\n\\n\\nEvents such as Global Encryption Day (GED) are opportunities for the Internet community to come together and rally around best practices in protecting our data. Even the most well-intentioned developer can make mistakes when implementing encryption as part of a security process. GED events provide ways to train and share knowledge around encryption usage.
\\n\\n\\n\\nGED is for everyone, not only developers or security experts. It’s for anyone placing data in harm’s way. Does your mobile app encrypt its data? GED events meet you where you’re at, from concepts to complex implementation. GED keeps encryption alive by helping us use it.
\\n\\n\\n\\nBy attending a local GED event (e.g. https://gedsf.org), you not only invest in yourself but all of us, the Internet community.
\\n\\n\\n\\nGlobal Encryption Day is an annual event organized by the Global Encryption Coalition (GEC). GED is on 21 October; local editions happen throughout the month of October. GEDSF is a local GED hybrid event based in San Francisco, put on by the local chapters of the Internet Society and the Association for Computing Machinery.
\\n\\n\\n\\nRonald Petty – Internet Society – San Francisco Bay Area Chapter & SF Bay ACM
\\n\\n\\n\\nTom Thorley – Internet Society – Los Angeles Southern California Chapter
\\n\\n\\n\\nCommunity post by Abby Bangser, Christophe Fargette, Piotr Kliczewski, Valentina Rodriguez Sosa
\\n\\n\\n\\nThe term IDP can be confusing, as part of the industry refers to Internal Developer Portals while others speak about Internal Developer Platforms. The fact that these are two different things actually helps us define the portal, which is our focus today. An Internal Developer Portal (IDP from now on in this article) is an interface that combines the essential tools, information, and support to enable software developers to succeed in their jobs. These interfaces are often webpages that give users a familiar experience while integrating all the tool capabilities and actions they depend on. An IDP is an addition to your existing tools, not a replacement for your current solution.
\\n\\n\\n\\nWe all know too well that different services are scattered around the infrastructure. Knowing how to get things done is considered tribal knowledge, acquired over time. It is not surprising that the most productive developers are usually those who have been around the longest. Thanks to IDPs, it doesn’t need to be this way. They bring a single pane of glass to interact with the infrastructure and reduce cognitive load. Additionally, they provide organizations with self-service capabilities. IDPs offer a way to share best practices across different teams and standardize infrastructure management.
\\n\\n\\n\\nInternal Developer Portals are a powerful tool for developers to build, deploy, and manage any application without looking into the underlying infrastructure. Additionally, developers can access other resources, such as virtual machines, namespaces, and clusters, in a secure and controlled self-service approach.
\\n\\n\\n\\nIDPs are very powerful and can integrate diverse tools called plugins, enabling developers to access platform capabilities without needing to learn and manage these tools directly. Platform Engineers and Operations can shape the developer experience by integrating the IDPs with business specific plugins and building software templates to be consumed by developer teams.
\\n\\n\\n\\nIt is important to keep in mind that software engineers are the most common consumers of portals, but given their visual and web based nature, there are many other roles that find value in them. This includes managers who want to get an overview of the system, QAs or SMEs who want access to testing environments and to validate features, and Customer Support agents who may want to track down the state of an issue.
\\n\\n\\n\\nSome portals try to build the data and tooling logic into their interface. This often leads to challenges with scaling both horizontally (to other interface types such as CLIs or CI/CD pipelines) and vertically (as in composing more complex experiences within the portal).
\\n\\n\\n\\nAnother concern is that the data in the portal can become stale. This may be because systems get out of sync or because a lack of adoption leads to reduced investment. In either case, the immense value of a portal is single-pane-of-glass access to information, so it must be easy to keep current with what is available on the platform.
\\n\\n\\n\\nPlatform Engineering is the practice of taking common elements available in the market (such as cloud resources or SaaS products) and engineering them to fit your own business, prioritizing the use cases that are most common across the organization. The way to be successful here is to apply the same software practices as any engineering team, and treating the platform as a product is therefore a key part of long-term success. This means investing in understanding your customers, drawing clear boundaries around what the product is trying to achieve (and therefore also what it is not going to help with), and constantly looking for ways to improve, whether by adding new features or by deprecating ones that are no longer useful and cost a lot to maintain. A portal is frequently identified as a platform feature that will support adoption and usability, and it is an area the team will invest in.
\\n\\n\\n\\nAn Internal Developer Portal is needed to help bootstrap new engineers and projects and maintain the entire software lifecycle at the organization. This includes streamlining day-to-day operations, which can be done by using orchestration. Successful orchestration needs to consider everyday business needs including Infrastructure as Code automation, incorporation of all business processes, and the reality that success often depends on managing handoffs across several different teams. The orchestration tooling can interact with a ticketing system and any services needed for approval processes or manual steps that are hard to automate. Good examples of orchestration are onboarding and offboarding users, infrastructure management or application modernization.
\\n\\n\\n\\nThe biggest trend I see is that people are getting value from portals due to their user-friendliness and visual nature, which means teams are ready to (and need to) invest in longer-term operability. This includes decoupling logic from the front end, building their portal via Infrastructure as Code and declarative techniques, and sharing the load with the community through sharing Open Source plugins.
\\n\\n\\n\\nApplication modernization, development, and maintenance are part of many organizations’ daily activities, and developers are still a crucial part of this process. Providing a self-service approach where developers can easily access legacy applications and provision virtual machines removes barriers and accelerates the software development lifecycle. A few examples:
\\n\\n\\n\\nUsually, when a new person joins a company or a new consultant needs to be onboarded, there is a lengthy list of steps containing account creation and infrastructure provisioning, and most of those steps require interactions with many teams. Effective Internal Developer Portals offer self-service ways to automate onboarding steps, so teams maintaining systems provide an action to onboard, which significantly accelerates the time needed for new employees to become effective.
\\n\\n\\n\\nThe same building blocks can be used to orchestrate infrastructure management like migrating virtual machines from one infrastructure to another, application modernization, and container or virtual machine management combined with business-related actions like approvals.
\\n\\n\\n\\nSecurity will become more important in IDPs. The templates used to create new applications, or provision infrastructure, will have security baked in, so developers don’t have to add it later. The supply chain will be entirely secured and provide features not only to scan for vulnerabilities in the source code and dependencies but also to generate signatures, SBOM and more.
\\n\\n\\n\\nPortals will continue to broaden their impact by serving more use cases, users, and technologies.
\\n\\n\\n\\nAI features such as assistant/chatbot will add very powerful capabilities to developer-centric IDPs.
\\n\\n\\n\\nSuccessful portals will focus on their extensibility and community engagement, allowing them to onboard both existing tooling (yes, even mainframes use cases!) and future use cases.
\\n\\n\\n\\nTo conclude, the future of Internal Developer Portals is to continue building a user-centric approach that drives the business further through innovation. That means reducing friction, thinking about the personas involved, and bringing together their knowledge, processes, and best practices into one single point of contact. The goal is an excellent user experience that reaches beyond developers to other teams, such as data scientists and anyone else who builds applications and components for end users.
\\n\\nMember post originally published on the Syntasso blog by Cat Morris
\\n\\n\\n\\nWhile building an internal developer platform sounds like something an engineering organisation would do – and often tries to do – from scratch, the reality is, most are working with a complex web of legacy systems and tools.
\\n\\n\\n\\nOrganisations often turn to platform engineering to alleviate inefficiencies in or limitations of existing systems. This means that most of the time, a platform initiative isn’t a greenfield endeavour, and it isn’t in most cases realistic to build something brand new or replace existing solutions wholesale.
\\n\\n\\n\\nPlatforms evolve from the responses to challenges identified in existing systems, making most such projects brownfield in nature – and this can, in fact, be advantageous.
\\n\\n\\n\\nAs I shared in my PlatformCon 2024 talk, “The Power of a Brownfield Mindset when Building Your Platforms”, I’ve learned a number of lessons in my years working with developer platforms in brownfield environments, and this article will cover five of the key realisations I’ve had in building successful platforms:
\\n\\n\\n\\nPlatforms emerge from what already exists. I’ve been on a number of platform development journeys with different organisations, and almost invariably, I’ve discovered that most platforms are brownfield. In the software development space, “brownfield development” describes areas that need the development and deployment of new software systems and solutions alongside existing software applications and systems.
\\n\\n\\n\\nThe term “brownfield” was borrowed from civil engineering where brownfield developments require you to work with or around existing structures, environments and hazards. And this is where the majority of platform development takes place. It flies in the face of conventional greenfield thinking, which imagines that platforms can be developed as brand-new, standalone tools to help developers ship software consistently, swiftly and more safely.
\\n\\n\\n\\nThe greenfield way of thinking doesn’t account for the reality facing most organisations: the considerable footprint of existing systems and legacy software around which most platforms will end up being built. And most platforms aren’t built from scratch and brought in to replace whole systems in their entirety. Instead, the old and new co-exist. I’ve routinely supported organisations in adapting their platform development “to meet the existing complexities of the legacy enterprise and its infrastructure and systems.” Starting with a brownfield mindset and planning accordingly creates more realistic expectations and conditions for building a platform that will actually serve the needs of its users.
\\n\\n\\n\\nAs an example of my own experience, I worked as a part of a consulting team helping a financial company/neobank build their developer platform. The company had a DevOps team that struggled to scale the necessary governance and security checks across teams. As the needs of the business grew, DevOps could not hire people fast enough to meet the new demands. They saw platform engineering as an efficient way to meet those needs and offload the DevOps burden without having to double or triple the size of their team.
\\n\\n\\n\\nThis isn’t a unique or even unusual situation. Many organisations balance growing scalability, compliance and governance demands against legacy infrastructure and systems that won’t fit neatly into a shiny new platform. Naturally, this introduces unintentional layers of complexity and is an automatic ticket to brownfield development. What becomes important in these cases is figuring out how to work with what you have alongside mapping out what you actually need. What actually needs to get done, and how can you make this happen with what you have and what you can adopt and adapt?
\\n\\n\\n\\nTo illustrate the complexity in real terms, when I worked with the neobank to create their platform, we started mapping the processes for what we wanted to build. We went in enthusiastically, thinking we could begin with a lightweight, minimum viable product (MVP), add a bit of observability, deploy one service and include a few governance checks. Digging a little deeper, we discovered that there were entrenched interests we would need to consider first. For example, the team had no capacity to do anything new. Also, each of the development teams followed slightly different processes and would not or could not change what they were doing. For example, some teams were using obscure load-balancing features and would not migrate to a new platform without bringing those along. Another had super admin permissions to all the databases and used them daily, meaning we couldn’t get rid of them.
\\n\\n\\n\\nThe bottom line: We spent three months trying to free up DevOps capacity by automating existing systems and processes before even beginning to tackle platform building. And we realised we were not actually building something new (see “Most platforms are brownfield” above). The organisation didn’t have a platform or platform team, but what it did have was a host of existing ways of doing the things a platform would do, along with embedded preferences and a lot of accompanying complexity. We could also see that continuing on the same path would lead to the all-too-common and almost comical problem of multiple competing platforms all trying to accomplish the same things.
\\n\\n\\n\\nWe needed to rethink and adapt platform requirements to accommodate this complexity and legacy while still scaling up. At the same time, we needed to create something fit for purpose (ideally, we thought, a universal platform that would work for everyone) and not just another thing that we would have to support and maintain without the payoff of actually delivering value for its users.
\\n\\n\\n\\nRelated to the idea that most platforms are not built as blank canvases, one key lesson is that platforms are emergent. They emerge from a need to scale and a need for consistency in a complex environment. And what they emerge from is stuff that already exists. What this means is that developers are already deploying code onto environments that already exist. This code is being maintained somewhere. Errors are already being found and fixed in production. Accounts are being configured and managed. All of this is already happening now, and the need for a unifying platform emerges from these circumstances – potentially highlighting the chaos and inconsistency (and the need for scalability) that necessitates a platform.
\\n\\n\\n\\nWith the neobank’s platform project, we reached a critical mass of services and necessary business capabilities that needed to be supported by someone but were unique to the organisation, so it was not possible to just buy a platform off the shelf. Keeping in mind that a big part of scaling up hinges on getting the product into the hands of customers – fast, the real needs of the platform take shape. Consider that once an organisation starts putting products out there, the products are out there. They can’t be reeled back in. With that in mind, it’s clear that the initial processes and implementations needed to be supported or migrated to the new platform, which is where working around existing structures, environments and hazards becomes important.
\\n\\n\\n\\nBecause of the brownfield, emergent nature of platforms, we need to change how we think about MVPs. The common way we think about building an MVP is often through a lens that doesn’t take real-world considerations into account. For example, if we are building a car, we might start with a skateboard that becomes a scooter that becomes a bike, which eventually becomes a car. With an MVP, we may be testing our iterations in perfect conditions – a beautiful day, clear road, pothole-free, no traffic. But with a platform, you’re driving into a veritable spaghetti junction overloaded with cars driving in the wrong lane at speed, trying to merge into the right lane and get to where you want to go on time.
\\n\\n\\n\\nWhat this tells us is that the standard approach to building an MVP isn’t going to cut it. Standard MVPs just don’t work in the platform context. But why? One explanation is loss aversion.
\\n\\n\\n\\nDaniel Kahneman and Amos Tversky published a paper on prospect theory and loss aversion that showed that when offered two options, people choose a guaranteed but lower amount of cash over a higher amount at a lower probability of winning. To illustrate, a person could choose between either a 99% chance of winning $1,400 (and a 1% chance of winning nothing) OR a guaranteed $900; most people take the guaranteed $900 even though the expected value of the gamble, roughly $1,386, is higher. The bias toward avoidance of loss (also detailed in Kahneman’s book Thinking, Fast and Slow) is significant.
\\n\\n\\n\\nIn the context of platform building, we draw on the idea of the pain of loss versus positive gains. What is the smallest gain that would need to be made to balance out an equal chance of loss? More specifically, for platform builders, even if people don’t really use or like a feature, they still feel the pain of giving it up more than any potential gain they get from the new platform. If we frame this again using our MVP example, it is like building a bike that may do everything your users need and have other benefits as well – but users don’t want to give up the heated seats in their car, even though they never turn the seat heating on.
\\n\\n\\n\\nInstead, we need to reframe what we are doing and focus on Tony Ulwick’s jobs-to-be-done theory, which is an approach to innovation that argues that customers buy products and services to get a specific job or jobs done. This helps companies streamline the innovation process by uncovering the true purpose of a product or service. In our platform, that means understanding all the steps needed for users to get their job done. The important takeaway for us is that we want to build the platform to allow a user to complete one job from beginning to end. This means that all the capabilities needed to do one thing in its entirety, from configuration to integration checks and rules and so on, need to be included. In this way we create more incrementally than if we tried to build a whole platform, but we are still thinking end to end, only focusing on a smaller but complete task.
\\n\\n\\n\\nEven once you have built a platform that should be embraced and used widely, there will be people who will continue to use all the tools they used before. Partly, this is attributable to loss aversion and partly to the endowment principle, which dictates that people tend to perceive what they already have as more valuable than something similar that they don’t have. For example, a brand-loyal iPhone user believes that the iPhone is inherently more valuable than a comparable Android phone.
\\n\\n\\n\\nIn terms of our platform, users will see what they already have or built as more valuable than the new platform, regardless of whether this is objectively true. Organisations investing in platforms can avoid this by helping users see the platform as something they own – not as competition for their existing solution. Users can embrace both the old and the new ways of getting their job done, but by fostering a sense of platform ownership, the organisation can get users on board and provide active ways for users to contribute to and make the platform better, such as pathways for giving feedback, ways to contribute code or resources and active platform support.
\\n\\n\\n\\nIn a brownfield world, we are building a platform with live systems and processes that need to be supportable in the long term. The platform has to scale exponentially, which is where fleet management comes in: your platform can manage all of your services as one unit, upgrading them seamlessly and automatically. If the platform manages these things, it builds user trust and develops the sense of ownership we’re looking for.
\\n\\n\\n\\nBuilding a platform is typically a brownfield exercise. By the time an organisation is intentionally looking to build a platform, developers are already shipping applications to production and all the mechanisms that support that are already in place. Serve these users and build trust in a platform by taking this into account and eschewing the standard approach to MVP development. Think about the jobs to be done and create a thin but complete solution for one discrete job, and invite users to contribute and give feedback to the platform to make it both more valuable and successful in the long run.
\\n\\nCommunity blog post by Reza Ramezanpour, developer advocate at Tigera
\\n\\n\\n\\nKubernetes is known for its modularity and its integration with cloud environments. Throughout its history, Kubernetes provided in-tree cloud provider integrations for most major providers, allowing us to create cloud-related resources via API calls without requiring us to jump through hoops to deploy a cluster that utilizes the power of the underlying networking infrastructure. However, this behavior will change with the release of Kubernetes v1.31, and right now is the best time to plan for it.
\\n\\n\\n\\nIn this blog post, we will examine cloud-provider integrations with Google Cloud Provider infrastructure, how it works, and how we can upgrade to later versions of Kubernetes without breaking our environment.
\\n\\n\\n\\nThe cloud provider controller in Kubernetes is responsible for establishing communication between the cluster and cloud services through HTTP/S requests. It handles tasks like integrating cloud-based resources, such as storage or load balancers, into the cluster’s operations. For example, persistent volumes are often tied to disk claims managed by the cloud provider. Additionally, the controller supports other cloud integrations, such as automatically provisioning load balancers for external/internal service exposure.
\\n\\n\\n\\nThe most familiar example of cloud provider integration is creating a service of type LoadBalancer in your cluster and waiting for the magic that changes its external-ip from <pending> to an actual IP address.
\\n\\n\\n\\nThe following image illustrates the moment that we’re all waiting to see; everything looks as it should:
\\n\\n\\n\\nThe primary goal of this move is to give cloud providers the ability to develop and release their services independently from Kubernetes’ core release cycle, and to level the playing field for clouds that do not have in-tree providers. By separating cloud provider-specific code from the core of Kubernetes, there’s a clear delineation of responsibilities between “Kubernetes core” and cloud providers within the ecosystem. This also ensures consistency and flexibility in how cloud providers integrate their services with Kubernetes.
\\n\\n\\n\\nBy housing each cloud provider’s code in its own repository or module, several advantages emerge:
\\n\\n\\n\\nThis approach fosters more agile development and a streamlined core Kubernetes ecosystem.
\\n\\n\\n\\nNow, if you are ready, let’s continue building a cluster with out-of-tree GCP cloud provider support.
\\n\\n\\n\\nThis blog post has no hard requirements. You can read through it like your favorite novel and learn the necessary steps to up your game. However, if you’d like to get your hands dirty, there are a couple of requirements that allow you to build the environment in your own Google Cloud account.
\\n\\n\\n\\nTo help you better understand the process, I’ve developed a demo script project using Terraform, which is available on GitHub. It helps you quickly set up a testing environment and allows you to follow the steps of this tutorial.
\\n\\n\\n\\nClone the repository.
\\n\\n\\n\\ngit clone https://github.com/frozenprocess/demo-cluster.git\\ncd demo-cluster
\\n\\n\\n\\nCopy the gcp template from the examples folder to the current directory.
\\n\\n\\n\\ncp examples/main.tf-gcpmulticluster main.tf
\\n\\n\\n\\nOpen “main.tf” in your favorite editor and disable the automatic cloud integration by setting “disable_cloud_provider” to “true”.
\\n\\n\\n\\nNote: If you are on a trial account, adjust your cloud instance in the “main.tf” file to the ones permitted for trial use (line #23 and #28).
\\n\\n\\n\\nChange line 25 from:
\\n\\n\\n\\n disable_cloud_provider = false
\\n\\n\\n\\n… to:
\\n\\n\\n\\n disable_cloud_provider = true
\\n\\n\\n\\nUse the following command to install the required provider:
\\n\\n\\n\\nterraform init
\\n\\n\\n\\nNote: Completing the following step will populate cloud resources in your account and you will be charged for the duration you use them.
\\n\\n\\n\\nAfter that, issue the following command, review the resources that will be populated in your Google Cloud account, and submit the prompt.
\\n\\n\\n\\nterraform apply
\\n\\n\\n\\nUse the “demo-connection” from the output to ssh into the instance.
\\n\\n\\n\\nThe environment created by the demo-cluster project embeds permissions in the virtual machine instance that are crucial for enabling cloud-provider integration. The roles and IAM permissions associated with the instance form the identity that issues the requests to create the necessary cloud resources.
\\n\\n\\n\\nNote: You can examine the list of minimum permissions required to enable load balancer creation associated with the demo instance here.
\\n\\n\\n\\nIf you are trying to accomplish the cloud provider integration in your own environment, please make sure that you have a role that provides the permissions listed in the link above.
\\n\\n\\n\\nAfter creating this role, you will need to create an IAM service account identity to associate with the roles. This can be done via Google Cloud Console.
\\n\\n\\n\\nAfter creating the role and the IAM identity, head over to your VM instance in GCP and edit the resources. Here, make sure you assign a network tag (used to automate firewall rule deployment) to the instance, and in the API identity management, select the newly created service account.
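\\n\\n\\n\\nIf you prefer the command line over the console, the same setup can be sketched with gcloud. This is only a rough outline; the service account name, custom role ID, instance name, network tag, and zone below are placeholders, so substitute your own values:
\\n\\n\\n\\n# Create a service account to act as the instance identity (all names here are placeholders)\\ngcloud iam service-accounts create ccm-demo --display-name=ccm-demo\\n# Bind the custom role that holds the load balancer permissions to the service account\\ngcloud projects add-iam-policy-binding <YOUR-GCP-PROJECT-ID> --member=serviceAccount:ccm-demo@<YOUR-GCP-PROJECT-ID>.iam.gserviceaccount.com --role=projects/<YOUR-GCP-PROJECT-ID>/roles/<YOUR-CUSTOM-ROLE>\\n# Add the network tag used for the automated firewall rules\\ngcloud compute instances add-tags <YOUR-INSTANCE> --tags=<YOUR-INSTANCE-NETWORK-TAG> --zone=<YOUR-ZONE>\\n# Attach the service account to the instance (the instance must be stopped first)\\ngcloud compute instances set-service-account <YOUR-INSTANCE> --service-account=ccm-demo@<YOUR-GCP-PROJECT-ID>.iam.gserviceaccount.com --scopes=cloud-platform --zone=<YOUR-ZONE>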
\\n\\n\\n\\nAt this point, we have configured everything necessary for the cloud provider controller to issue resource creation requests via GCP APIs.
\\n\\n\\n\\nPreparing kubelet
\\n\\n\\n\\nBefore installing the cloud provider integration, we must take a few steps to prepare our cluster. First, we need to run kubelet with the appropriate argument.
\\n\\n\\n\\nFor K3s, the kubelet argument is passed as:
\\n\\n\\n\\n--kubelet-arg=cloud-provider=external
\\n\\n\\n\\nIf you have configured Kubernetes before, “cloud-provider=external” will look familiar. Prior to version v1.29, we could set the cloud-provider argument to “gce”, “aws”, and so on to use an in-tree cloud provider. However, using those values with recent versions will result in errors and, in some cases, prevent the Kubernetes processes from running. With the new releases, the only valid option for cloud-provider is “external”.
\\n\\n\\n\\nNote: Keep in mind that if you bootstrap the cloud provider integration your nodes will have a taint that will not be lifted until the provider completely initializes. You can read about this here.
\\n\\n\\n\\nBy default, K3s is shipped with an internal cloud-controller component, which must be disabled before we can use the cloud-specific integration. This can be accomplished by appending the following argument to the K3s command.
\\n\\n\\n\\n--disable-cloud-controller
\\n\\n\\n\\nNote: You can accomplish both steps in a fresh install using your K3s installation command. Here is an example.
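\\n\\n\\n\\nAs a rough sketch (assuming the standard K3s install script; adjust it to your own bootstrap process), such a fresh install could look like this:
\\n\\n\\n\\n# Install a K3s server with its built-in cloud controller disabled and kubelet pointed at the external provider\\ncurl -sfL https://get.k3s.io | INSTALL_K3S_EXEC=\\"server --disable-cloud-controller --kubelet-arg=cloud-provider=external\\" sh -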
\\n\\n\\n\\nIf you are following the blog example and you have disabled the provider in your Terraform, use the following command to enable the external provider:
\\n\\n\\n\\nsed -i \'/^$/d\' /etc/systemd/system/k3s.service\\necho -e \\"\\\\t\'--kubelet-arg=cloud-provider=external\' \\\\\\\\\\" | sudo tee -a /etc/systemd/system/k3s.service\\necho -e \\"\\\\t\'--disable-cloud-controller\'\\" | sudo tee -a /etc/systemd/system/k3s.service
\\n\\n\\n\\nUse the following commands to reload the service and restart the K3s server:
\\n\\n\\n\\nsudo systemctl daemon-reload\\nsudo systemctl restart k3s
\\n\\n\\n\\nGCP cloud provider can be configured using a config file. This file instructs the provider on how to interact with your cloud resources and where to generate the necessary resources. A comprehensive list of options can be found here.
\\n\\n\\n\\nThe following example shows the minimum required bits to run the controller:
\\n\\n\\n\\n[Global]\\nproject-id=<YOUR-GCP-PROJECT-ID>\\nnetwork-name=<YOUR-VPC-NAME>\\n# node-tags is the network tag we assigned in the GCP permissions step\\nnode-tags=<YOUR-INSTANCE-NETWORK-TAG>
\\n\\n\\n\\nIn the test environment that you created using the “demo-cluster” projects, there is already a generated config file called “cloud.config” in the “/tmp/” directory.
\\n\\n\\n\\nUse the following command to create the necessary options for the provider:
\\n\\n\\n\\nkubectl create cm -n kube-system cloud-config --from-file=/tmp/cloud.config
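\\n\\n\\n\\nIf you want to confirm the configuration landed in the cluster before moving on, you can inspect the configmap:
\\n\\n\\n\\nkubectl get configmap -n kube-system cloud-config -o yaml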
\\n\\n\\n\\nNow that we have prepared all the necessary configs and permissions, it’s time to deploy the cloud controller. First download the daemonset manifest from here and open it in your favorite text editor.
\\n\\n\\n\\nNote: If you are using the demo environment, there is a daemonset template file called “cloud-controller.yaml” already generated for you in your control plane’s “tmp” folder.
\\n\\n\\n\\nIn the default manifest that we downloaded in the previous section, the GCP cloud provider configuration is mounted as a file from the host. While that approach is a perfectly good way to load your configuration, for this blog post we are going to use the config map that we created in the previous step.
\\n\\n\\n\\nSimply remove the following lines from the template:
\\n\\n\\n\\nvolumeMounts:\\n\\n - mountPath: /etc/kubernetes/cloud.config\\n\\n name: cloudconfig\\n\\n readOnly: true\\n\\n hostNetwork: true\\n\\n priorityClassName: system-cluster-critical\\n\\n volumes:\\n\\n - hostPath:\\n\\n path: /etc/kubernetes/cloud.config\\n\\n type: \\"\\"\\n\\n name: cloudconfig
\\n\\n\\n\\nNow that these lines are removed, add the following in their place, mounting the cloud-config configmap into the container (the mount path matches the --cloud-config argument we will set shortly):
\\n\\n\\n\\nvolumeMounts:\\n - mountPath: /etc/cloud-config\\n   name: cloudconfig\\n   readOnly: true\\nhostNetwork: true\\npriorityClassName: system-cluster-critical\\nvolumes:\\n - configMap:\\n     name: cloud-config\\n   name: cloudconfig
\\n\\n\\n\\nThis change allows us to use the configmap as a file inside the controller container.
\\n\\n\\n\\nWe need to change the image of our daemonset. Each image corresponds to a version of Kubernetes, and we can verify that from the repository’s release page.
\\n\\n\\n\\nNote: A list of available image tags can be found here.
\\n\\n\\n\\nChange the image from:
\\n\\n\\n\\nimage: k8scloudprovidergcp/cloud-controller-manager:latest
\\n\\n\\n\\n… to:
\\n\\n\\n\\nimage: registry.k8s.io/cloud-provider-gcp/cloud-controller-manager:v27.1.6
\\n\\n\\n\\nWe will use v27.1.6 since it is the version most compatible with the Kubernetes release running in our K3s test environment.
\\n\\n\\n\\nThe following image illustrates the relationship between the cloud manager and Kubernetes versions.
\\n\\n\\n\\nThe cloud controller manager binary accepts many arguments and parameters that change its behavior depending on the underlying environment.
\\n\\n\\n\\nFor example, the demo-cluster I provided with this article is configured with Calico CNI, which provides networking and security for the cluster. In this setup, Calico is configured to provide IP addresses to the pods.
\\n\\n\\n\\nBack to our manifest, find the following:
\\n\\n\\n\\n # ko puts it somewhere else... command: [\'/usr/local/bin/cloud-controller-manager\']\\n args: [] # args must be replaced by tooling
\\n\\n\\n\\n… and change it to:
\\n\\n\\n\\ncommand:\\n - /cloud-controller-manager\\n - --cloud-provider=gce\\n - --leader-elect=true\\n - --use-service-account-credentials\\n - --cloud-config=/etc/cloud-config/cloud.config\\n - --controllers=*,-nodeipam\\n - --allocate-node-cidrs=false\\n - --configure-cloud-routes=false
\\n\\n\\n\\nLet’s quickly go over some important changes in the previous command and the thought process behind them.
\\n\\n\\n\\n`--cloud-provider=gce` tells the manager that we are using Google Cloud infrastructure.
\\n\\n\\n\\n`--use-service-account-credentials` is used since our VMs have the necessary permissions via the service account assigned to them (this step happened in the Terraform here).
\\n\\n\\n\\n`--controllers=*,-nodeipam` disables the node IPAM controller, since Calico Open Source (the CNI bundled in the demo) assigns IP addresses in our environment. This allows us to expand our cluster using independent private IPs that are not part of the VPC, which provides flexibility beyond what is possible with the default provider IPs and helps circumvent IP exhaustion issues in the future.
\\n\\n\\n\\n`--cloud-config=/etc/cloud-config/cloud.config` points at the configuration file that tells the manager where to deploy resources; the path matches the configmap mount we set up earlier.
\\n\\n\\n\\nIf you would like to learn about the other options used in the command, click here.
\\n\\n\\n\\nIf you are following up using the demo-cluster environment, a working copy is provided in the tmp folder.
\\n\\n\\n\\nkubectl create -f /tmp/cloud-controller.yaml
\\n\\n\\n\\nWait for the cloud provider pod to come up.
\\n\\n\\n\\nkubectl get pod -n kube-system
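\\n\\n\\n\\nIf you prefer a command that waits until the controller is ready, the daemonset rollout status works as well (assuming the manifest keeps the default cloud-controller-manager name):
\\n\\n\\n\\nkubectl rollout status -n kube-system daemonset/cloud-controller-manager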
\\n\\n\\n\\nNow that we have configured everything it is finally time to create a service.
\\n\\n\\n\\nUse the following command to create a load balancer service:
\\n\\n\\n\\nkubectl create -f -<<EOF\\napiVersion: v1\\nkind: Service\\nmetadata:\\n name: example-service\\nspec:\\n selector:\\n app: example\\n ports:\\n - port: 8765\\n targetPort: 9376\\n type: LoadBalancer\\nEOF
\\n\\n\\n\\nAt this point, your service should acquire an IP address from the provider. 🎉
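\\n\\n\\n\\nOne simple way to watch the external IP change from <pending> to a real address is to keep an eye on the service:
\\n\\n\\n\\nkubectl get service example-service --watch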
\\n\\n\\n\\nAlthough this blog post is designed to provide the impression that everything will function seamlessly, in real-world scenarios, you may encounter various challenges, such as changes in permissions or updates to components that may not behave as expected.
\\n\\n\\n\\nThe best place to troubleshoot cloud-provider integration issues is the logs. You can do this by running the following command:
\\n\\n\\n\\nkubectl logs -n kube-system ds/cloud-controller-manager
\\n\\n\\n\\nFor example, if you forgot to disable the K3s internal cloud-controller component, you will see the following error in your pod:
\\n\\n\\n\\nI0906 17:14:08.469610 1 serving.go:348] Generated self-signed cert in-memory\\n\\nI0906 17:14:09.062781 1 serving.go:348] Generated self-signed cert in-memory\\n\\nW0906 17:14:09.062820 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.\\n\\nfailed to create listener: failed to listen on 0.0.0.0:10258: listen tcp 0.0.0.0:10258: bind: address already in use\\n\\nError: failed to create listener: failed to listen on 0.0.0.0:10258: listen tcp 0.0.0.0:10258: bind: address already in use
\\n\\n\\n\\nIf you are missing a GCP permission, you will see something similar to the following:
\\n\\n\\n\\nE0906 17:24:43.925274 1 gce_loadbalancer_external.go:140] ensureExternalLoadBalancer(aaaaaaaaaaaaaaaaaaaaaaaaa(default/example-service)): Failed to release static IP 34.68.226.31 in region us-central1: googleapi: Error 403: Required \'compute.addresses.delete\' permission for \'projects/calico-rocks/regions/us-central1/addresses/aaaaaaaaaaaaaaaaaaaaaaaa\', forbidden.
\\n\\n\\n\\nSince cloud resources are invoiced by the minute, we have to clean up our environment as soon as possible. To do this, first make sure you have removed all the “loadbalancer” services from your Kubernetes cluster. Then, log out of the SSH box and issue the following command:
\\n\\n\\n\\nterraform destroy
\\n\\n\\n\\nIf you don’t clean up your services, you may see an error similar to the following:
\\n\\n\\n\\nmodule.cluster-a.google_tags_tag_key.tag_key: Still destroying... [id=tagKeys/281480681984856, 10s elapsed]\\n\\nmodule.cluster-a.google_tags_tag_key.tag_key: Destruction complete after 10s\\n\\n╷\\n\\n│ Error: Error waiting for Deleting Network: The network resource \'projects/calico-rocks/global/networks/k3s-demo-olmxliag\' is already being used by \'projects/calico-rocks/global/firewalls/k8s-ef18857c633faeda-node-http-hc\'\\n\\n│ \\n\\n│
\\n\\n\\n\\nIn such a case, you have to manually delete the resources from your Google Cloud account and re-run the Terraform clean-up step.
\\n\\n\\n\\nKubernetes is a vibrant project, and with each release it offers more flexibility and features. While creating a cluster nowadays is easier than ever, it is important to keep an eye on the Kubernetes blog to stay up to date with the latest Kubernetes news.
\\n\\n\\n\\nThis blog post can be used to integrate out-of-tree providers in GCP. However, the same procedure can be used for other providers such as AWS, Azure, Alibaba, IBM, etc.
\\n\\nMentorship blog by Nate Waddington, Head of Mentorship & Documentation at CNCF
\\n\\n\\n\\nWe are thrilled to share that 45 CNCF mentees with the LFX Program have successfully completed their mentorship.
\\n\\n\\n\\nNumerous CNCF projects across Graduated, Incubating, Sandbox projects, and TAGs (Technical Advisory Groups) got involved, providing our mentees with invaluable experience within the open source and cloud native ecosystem. Projects involved include Harbor, Jaeger, Kubescape, KubeEdge, Kubernetes, Meshery, TAG Network, Vitess, TUF, Thanos, and many more!
\\n\\n\\n\\nAdditional details on the CNCF projects, mentors, and students who successfully completed the program can be found below and on GitHub.
\\n\\n\\n\\nKWOK is a Kubernetes sub-project; people often refer to these sub-projects as Kubernetes Special Interest Group (SIG) projects. It allows developers to simulate large clusters with hundreds of nodes and pods that consume few resources. It uses real Kubernetes components but operates without a Kubelet.
\\n\\n\\n\\nMentee: Charles Uneze (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Zhenghao Zhu and Shiming Zhang
\\n\\n\\n\\n“The LFX Mentorship program broadened my understanding of Kubernetes and how other cloud-native projects test the Kubernetes control plane or their custom control plane using a project like KWOK. While studying the CKA topics, I only knew about the Kubernetes control plane. I also understood how to actively collaborate using Git. I encourage others who’d like to break into the cloud-native industry to leverage the LFX mentorship program to get hands-on experience, you won’t regret it.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nEnhancing the Kafka integration within Jaeger V2, a leading distributed tracing platform.
\\n\\n\\n\\nMentee: Harshith Mente (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Yuri Shkuro, Jonah Kowall, and Yash Sharma
\\n\\n\\n\\n“My experience with the LFX Mentorship program has been incredibly rewarding and transformative. The opportunity to work on a real-world project with the Jaeger community not only expanded my technical skills but also deepened my understanding of open-source collaboration. The mentorship from industry experts like Yuri Shkuro, Jonah Kowall, and Yash Sharma was invaluable, providing guidance and insights that greatly enhanced my learning journey. This program has given me the confidence to tackle complex challenges and the motivation to continue contributing to the open-source ecosystem. I’m immensely grateful for this experience and look forward to applying what I’ve learned in future endeavors.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nImplementing a unified telemetry container for both V1 and V2 components of Jaeger.
\\n\\n\\n\\nMentee: Saransh Shankar (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Yuri Shkuro, Jonah Kowall, and Yash Sharma
\\n\\n\\n\\n“The guidance I received from my mentor Yuri Shkuro and the Jaeger community was invaluable. They were always ready to assist, providing feedback and sharing their extensive knowledge of the field. This collaborative environment not only facilitated my learning but also boosted my confidence as a developer. I gained insights into best practices for coding, debugging, and collaborating in a distributed team setting, all of which are crucial for a successful career in DevOps and open-source development.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nEnhancing Meshery’s design capabilities by implementing support for versioning Meshery designs.
\\n\\n\\n\\nMentee: Saurabh Kumar Singh (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Lee Calcote and Uzair Shaikh
\\n\\n\\n\\n“My experience with the LFX Mentorship program was incredibly rewarding, offering me invaluable exposure to the cloud-native ecosystem and tools. Through this mentorship, I was able to develop and enhance a wide range of technical skills that are essential for a career in the software industry. The hands-on experience with real-world projects, such as contributing to Meshery, significantly boosted my confidence and prepared me for future career opportunities. Additionally, the stipend provided by the program was a great support, further motivating me to excel and pursue my career goals with greater determination. Overall, the LFX Mentorship program has been a pivotal experience in my professional development.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nThis project aims to support more features of the IDE in the tree-sitter-kcl, adding the entire syntax of KCL so that the parser can pass all the test integrations. The primary motive is the addition of all the syntax present in KCL docs, which will, in turn, help the developers in their productivity and enhance the code quality by implementing extensive test cases and robust error handling.
\\n\\n\\n\\nMentee: Korada Vishal (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Zheng Zhang and Zong Zhe
\\n\\n\\n\\n“My mentors and the KCL project are very good. I had wanted to work with projects where I could learn a lot about the new upcoming technologies and I did learn about web3, Rust, and KCL by building small projects related to it. I will continue contributing to KCL-lang after completion.”
\\n\\n\\n\\n+++
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nPrometheus has a “Remote Write” feature that allows the metrics it collects to be sent to other time series databases. In Remote Write 2.0, the implementation and wire format will significantly change to improve transmission efficiency. My task was to extend Prometheus’ benchmarking tool, Prombench to support the new feature.
\\n\\n\\n\\nMentee: Moeka Mishima (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Jesús Vázquez and Callum Styan
\\n\\n\\n\\n“The LFX Mentorship Program was a significant milestone for me. Contributing to Prometheus, a crucial open-source project, and acquiring new skills has been a significant boost to my career. I intend to continue working on this project and leverage what I’ve learned to collaborate with more developers and contribute to the growth of the open-source community.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nMentee: Dahyeon Kang (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Vedant Shrotria, Sayan Mondal, and Raj Babu Das
\\n\\n\\n\\n“I have gained a lot from the LFX Mentorship, not only in open source, but also in English, communication skills (such as communicating my opinions), and meaningful contributions. This experience will help me grow in the future, both technically and professionally.”
\\n\\n\\n\\n+++
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nEnhancing and improving the test coverage for KubeEdge by adding Unit and End-to-endEnd to End tests for the project. I also migrated the tests from the standardfrom standard library to a new assertion library and wrotewritten multiple blogs for KubeEdge releases.
\\n\\n\\n\\nMentee: Shubham Singh (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Fisher Xu and Shelley Bao
\\n\\n\\n\\n“The 3 months of LFX mentorship was one of the best learning experiences I ever had in tech. In these 3 months, I explored how to use KubeEdge, worked with my mentors to improve test coverage of the project as well as contributed to the documentation. I am really grateful that I got to work on an emerging CNCF project with such experienced engineers as my mentors. Working closely with such great engineers has taught me a lot about cloud native application development, testing, and a lot more. I am thankful to the Linux Foundation for conducting such an amazing open source program and doing it consistently over the years and I highly recommend to at least apply for LFX if you are a student or somebody interested in FOSS, even the application process and contributions will teach you a lot about FOSS development.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nAdding new features (such as Gateway API installation), fixing CI issues (like CentOS 7 EOL), and updating application versions (including OpenStack Cloud Controller Manager and Cert Manager).
\\n\\n\\n\\nMentee: ChengHao Yang (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Kai Yan and Mohamed Zaian
\\n\\n\\n\\n“Participating in the LFX Mentorship program has provided me with invaluable growth opportunities. First and foremost, I would like to thank my mentors, Kai Yan and Mohamed Zaian, as well as other maintainers like Antoine Legrand, for giving me the freedom to explore my ideas and for offering guidance on how to fix the CI of different branches at the right moments. Collaborating with such experts in the open-source community has been a great honor. In addition to gaining hands-on experience, I have broadened my technical horizons and feel confident about my future career plans. Although this Mentorship has ended, my contributions will not stop here. I will continue to dedicate my efforts to maintaining Kubespray in the future.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nIntegrating an upgrade agent into Litmus 3.x to streamline Chaoscenter upgrades, eliminating the need for fresh installations. This feature ensures seamless transitions between versions, which is especially useful when facing significant changes.
\\n\\n\\n\\nMentee: Kartikay (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Sarthak Jain and Saranya Jena
\\n\\n\\n\\n“The LFX Mentorship was an amazing, transformative journey that equipped me with better industry practices and helped me upskill to become a confident contributor to the open-source ecosystem. The mentorship also enabled me to learn and grow with the knowledge of experienced developers while contributing to a large-scale codebase and enhancing my technical skills, particularly in Golang. Thank you so much for this initiative!”
\\n\\n\\n\\n+++
\\n\\n\\n\\nThe project aimed to improve the system test coverage of KubeArmor and add tests for host protection in various modes. These tests were written using the Ginkgo framework and automated via GitHub workflows.
\\n\\n\\n\\nMentee: Navin Chandra (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Barun Acharya and Rudraksh Pareek
\\n\\n\\n\\n“The LFX Mentorship program was very impactful for my career. From learning new concepts in runtime security like eBPF and LSMs to enhancing my previous skills in golang, GitHub actions and Kubernetes, it was great learning experience for me.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nImproving the onboarding experience for new users of Knative Eventing. Through a combination of user research, surveys, and in-depth interviews, the project identified key pain points that users face when first interacting with Knative Eventing.
\\n\\n\\n\\nMentee: Firat Bezir (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Leo Li and Mariana Mejia
\\n\\n\\n\\n“The LFX Mentorship program has been an incredibly rewarding experience, offering me the opportunity to work on a real-world open-source project while learning from industry experts. Throughout the mentorship, I gained hands-on experience with Knative Eventing, deepened my understanding of event-driven architectures, and improved my skills in user research and documentation. The supportive and collaborative environment fostered by my mentors and the broader community made the journey both educational and enjoyable. This experience has not only strengthened my technical skills but also my confidence in contributing to the open-source ecosystem. I am grateful for the chance to contribute to CNCF and look forward to future contributions.”
\\n\\n\\n\\n+++
\\n\\n\\n\\nArewefastyet is a benchmarking system for Vitess.
\\n\\n\\n\\nMentee: Jad Chahed (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentor: Florent Poinsard
\\n\\n\\n\\n“Incredible experience!”
\\n\\n\\n\\n+++
\\n\\n\\n\\nImproving the developer experience of Crossplane users. We used and analyzed the tooling being used by developers so that we could create a better experience. My focus was mostly on the beta validate command. We added very important and helpful features to the command and improved its overall performance. After this mentorship, the beta validate command’s execution times went from 50-60 seconds (uncached) to 10-20 seconds (uncached) and from 2-3 seconds (cached) to 500-800ms (cached). We also added features like unknown field validation to the tool.
\\n\\n\\n\\nMentee: Mehmet Enes (Personal blog post about mentorship experience)
\\n\\n\\n\\nMentors: Jared Watts and Ezgi Demireal
\\n\\n\\n\\n“I want to thank my mentors for their kindness and help throughout the mentorship. This mentorship was a great source of self motivation and high quality technical insight for me. At some times I really felt lost and thought “I don’t even understand how things work, and how am I going to contribute??” But my mentors’ guidance made me learn so much on golang and Crossplane. In a very short time I learned how to look at the software as a source of user experience and I feel this greatly improved my perspective on software development. My overall experience with the Crossplane community was great and this mentorship made it perfect thanks so much for everything!”
\\n\\n\\n\\nEarlier this year, The Linux Foundation surveyed 200 organizations to understand how they’re tackling security in cloud native application development.
\\n\\n\\n\\nAt a time when security breaches are increasing in frequency and in impact – the average breach now costs $4.88 million according to IBM’s 2024 Cost of a Data Breach Report – it’s instructive to see how cloud native practitioners feel they’re handling the often sisyphean task of security. After all, security concerns are so widespread there’s now a list of 35 statistics to lose sleep over in 2024. And even the US government issued a stark warning in its 2024 Report on the Cybersecurity Posture of the United States: “It is now clear that a reactive posture cannot keep pace with fast-evolving cyber threats and a dynamic technology landscape.”
\\n\\n\\n\\nSecurity will also be top of mind during KubeCon + CloudNativeCon North America 2024 in Salt Lake City, Utah, in November, with 25 sessions devoted to the topic (but more on that below).
\\n\\n\\n\\nTo start on a positive note, 84% of organizations report their cloud native apps are more secure than they were two years ago, and 76% said “much” or “nearly all” of their application development was cloud native. Companies acknowledging “nearly all” cloud native techniques were also the most likely (54%) to report their applications were significantly more secure.
\\n\\n\\n\\nWhy are those organizations saying they’re so much more secure? They are certainly doing more testing. These much more secure organizations are far more likely to be running static application security tests (SAST) – 69% vs. 55% reported by companies that said their security posture was largely unchanged from two years ago. The significantly more secure group is also 22% more likely to be doing software composition analysis than the unchanged group (67% vs. 45%), and 25% more likely to be running web application scans (WAS), 63% vs. 38%.
\\n\\n\\n\\nThese significantly more secure organizations are also doing more of what every organization should be doing: they’re checking all the security assessment “must do” boxes.
\\n\\n\\n\\nAlmost half of cloud native-forward organizations (49%) reported their biggest security challenge was keeping up with emerging threats, data which certainly tracks with the broader trends. Other problems included complexity of software and infrastructure (37%), and time constraints and secure deployment and operations (both 35%).
\\n\\n\\n\\nWhere have companies experienced breaches? The largest share of respondents (40%) said cloud infrastructure and services, followed by configuration and secrets management (25%), application runtime environment (23%), and data storage/management and user access/identity management (both 22%).
\\n\\n\\n\\nIf this has you thinking, KubeCon + CloudNativeCon North America 2024 has 25 in-depth sessions focused solely on security, meaning you can soak in all the advice, and then get even more good ideas networking in the hallways between sessions. Here’s a taste of what you can expect:
\\n\\n\\n\\nPreventing privilege escalation in GitOps
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nHow to leverage WASM to secure cloud native apps
\\n\\n\\n\\n“Why perfect compliance is the enemy of real security”
\\n\\n\\n\\nDon’t wait – register today for KubeCon + CloudNativeCon North America 2024
\\n\\nCo-chairs: Naina Singh, Mark Fussell, Evan Anderson
November 12, 2024
Salt Lake City, Utah
AppDeveloperCon is specifically targeting software developers who are using cloud native technologies to solve problems for their end-user customers. While much of KubeCon focuses on how the technologies work, this co-located event aims to help developers figure out how to use those technologies in the applications that run on top of cloud native platforms. These are developers who write in Go, Java, Node, Rust, C#, and a myriad of other languages and frameworks.
\\n\\n\\n\\nApplication developers are aware of operational needs and get involved with operations, but their primary focus is writing code and using developer tools to solve business problems. This conference is not about deployments, platform upgrades, code packaging, rollouts, failover policies, networking, storage or any of the other operational concerns.
\\n\\n\\n\\nAppDeveloperCon was first introduced at KubeCon NA 2023. This is the third AppDeveloperCon co-located event.
\\n\\n\\n\\nWho will get the most out of attending this event?
\\n\\n\\n\\nAppDeveloperCon is designed for developers at all levels who are involved in the architecture, design, and development (using any programming language) of cloud native applications.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nWe have several talks addressing how to leverage event-driven architecture alongside traditional client/server APIs as part of the talk schedule. We also have talks covering testing and the application development pipeline, in addition to running applications in production and at scale.
\\n\\n\\n\\nWhat will the day look like?
\\n\\n\\n\\nThe morning schedule is mostly focused on event-driven applications and user stories; after lunch we will focus more on traditional applications and how cloud native technologies can help solve problems at all stages from development through build and test to deployment and authorization.
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nWhile no formal prep work is required, a little planning can go a long way. Consider identifying key sessions that align with your interests and building a personalized schedule. This can help you make the most of your time and reduce stress.
\\n\\n\\n\\nMost importantly, come with a curious mind and don’t forget to share your thoughts afterward by letting us know which talks were most impactful and what you’d like to see in future events.
\\n\\n\\n\\nSubmitted by the co-chairs, each of whom is looking forward to different things at KubeCon + CloudNativeCon in Salt Lake City:
\\n\\n\\n\\nNaina is eager to learn about the latest innovations in Kubernetes, especially those related to cloud native application development and deployment.
\\n\\n\\n\\nEvan wants to understand what new tools most devs are using now. A few years ago it was GitOps; Gateway API seems to be adding a bunch of new routing features. Container CSI seems to solve a real data-push problem in an easy way.
\\n\\n\\n\\nMark is interested in how developers are blending together ML models with traditional programming model frameworks and where this trend will go.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nCommunity post by Shon Harris (Linkedin, X)
\\n\\n\\n\\nWelcome to Salt Lake City, KubeCon + CloudNativeCon attendees! You’ll see the beautiful Wasatch Mountain range to the east as you take in the sights. This area is known for its stunning ski resorts and picturesque fall and winter landscapes. It’s often called the ‘Silicon Slopes’ due to the presence of major universities and big tech companies along the Wasatch Range from Logan to Provo. With over 100 miles of trails for hiking, biking, and exploring, I’m here to help you make the most of your time in Salt Lake City, where we do things a little differently!
\\n\\n\\n\\nWhen you arrive at Salt Lake International Airport (SLC), you have several transportation options to get to downtown. You can hop on the Green Line directly from the airport, or you can use ride-share apps. A useful tip for leaving the airport: when looking for the bus/light rail, taxi, or limo at Baggage Claim, head downstairs instead of crossing the bridge.
\\n\\n\\n\\nIf you use a ride-share app or have someone pick you up, head across the sky bridge after grabbing your bags at Baggage Claim and then take the escalator or elevator downstairs. If you find yourself by the Starbucks or the “Things to do in Utah” info booths, you will have to head back up and over because outside, you won’t be able to cross over to the Ride-Share / Pickup lanes from that side.
\\n\\n\\n\\nThe Grid System: When exploring Salt Lake City, you’ll encounter distinctive addresses like “1320 East 200 South” for various gatherings and events. This is due to the unique street numbering system used in Salt Lake City and most of Utah, centered around a specific point. Salt Lake City’s central point is at the intersection of South Temple and Main Street. For example, if you go thirteen blocks to the east and two blocks to the south from this central point, you will reach the address we used as an example (which just happens to be home to the best pizza you will find in Utah). This numbering system can seem daunting initially, but once you get the hang of it, it really simplifies navigation in the downtown area, making it easier to find your way around and back to your hotel.
\\n\\n\\n\\nPublic Transit: Get to know the fantastic public transit system in Utah! The Utah Transit Authority (UTA) runs an extensive network of light rail, high-speed trains, and buses, with convenient service to and from the airport. Hop on and off in the free fare zone without buying a ticket, and easily transfer between buses and trains using just one ticket. For just $25.00, you can grab a five-day transit pass that works seamlessly with the Ride UTA App. This handy app lets you use your pass and provides access to schedules, real-time updates on train and bus locations, and route information.
\\n\\n\\n\\nRemember, Sundays have different operating hours and restrictions, so plan accordingly. Take some time to explore the various services available – it’s worth it!
\\n\\n\\n\\nYou can learn more about UTA and their services here.
\\n\\n\\n\\nWhat should you wear? The weather can be quite diverse, with temperatures ranging from the mid-40s to the mid-60s during the day and getting as low as the mid-30s at night. It’s always good to dress in layers and don’t forget to bring a coat for those chilly nights. And if you’re exploring the canyons, wear cozy socks and shoes to stay warm!
\\n\\n\\n\\nThe dry desert air can be a hurdle for your skin, so be sure to pack moisturizers like chapstick and lotion to keep dryness in check. Also, keep a saline spray handy to avoid any unexpected nosebleeds caused by the dry air.
\\n\\n\\n\\nHeading to the canyons? Don’t forget to pack sunscreen to shield yourself from the sun’s rays, especially if you’ll be rocking sunglasses. Trust me, you don’t want to end up with raccoon eyes from a sunburn! Bring an umbrella in case of a surprise shower, sleet, or snow during your outdoor adventure.
\\n\\n\\n\\nWhere should you go? One of the best things about the Wasatch Front is the amount of stuff you can see and do. Sure, there is Temple Square and the headquarters of the Church of Jesus Christ of Latter-Day Saints. But there is so much more to Salt Lake and the surrounding area!
\\n\\n\\n\\nWhat About Where to Get a Drink? You may have heard rumors about Utah and its strange approach to liquor laws. Fear not. I am here to help. Basically, there are two sorts of places you can find yourself at, and a sign will usually tell you which.
\\n\\n\\n\\nThe Bars in Utah are easy to spot. They will have a full selection of food and drink to keep you going. They are allowed by law to operate from 10:00 a.m. to 1:00 a.m.
\\n\\n\\n\\nRestaurants with full-service (beer, wine, liquor) licenses open at 11:30 a.m. You must be in the restaurant to partake, and you can only have one drink per legal-aged person at the table. Utah does not allow “sidecars” (a beer and a shot), nor will they make you a “double” drink. These restaurants will require you to order some food with your drink, or even before you can order a drink, to ensure compliance with the law.
\\n\\n\\n\\nBeer and Wine Only: These places follow the same laws as above, but can only serve beer that is 5% ABV or lower and will not have liquor available. Similar to a full-service license, you will still need to order some kind of food off the menu before you can order a drink.
\\n\\n\\n\\nWait. Where is the bar?! Some restaurants will have a Full Service License, but you may not be able to see the Liquor selection, or there will be some door to the side where your server goes in and comes out with your drink. This is one of those Quirky Utah laws where the liquor has to be hidden from sight. We lovingly call it the “Zion Curtain”.
\\n\\n\\n\\nDo I need a membership? Previously, you needed to be a member of a private club at each bar to drink full liquor. If you were from out of town or visiting, you needed someone to sponsor your membership. Utah did away with this unique law during the ramp-up for the 2002 Olympics.
\\n\\n\\n\\nJust remember your ID. Utah is very strict about IDs being scanned. This can be your driver’s license or your passport, but if you don’t have it, you won’t be allowed into a bar or to order drinks.
\\n\\n\\n\\nI hope you enjoy your time here and enjoy everything our vibrant city offers. From the stunning natural landscapes and rich cultural heritage to the diverse museums and exciting sporting events, there’s something for everyone. Whether exploring the trails, visiting a museum, catching a game, or simply enjoying the local cuisine, Salt Lake City is ready to welcome you and the rest of the KubeCon/CloudNativeCon attendees with open arms.
\\n\\nGet to know Camila
\\n\\n\\n\\nThis week’s Kubestronaut in Orbit, Camila Soares Câmara, is a Senior Cloud Engineer at Wellhub in Brazil with experience in Cloud and DevOps, working with technologies such as Kubernetes, CI/CD, AWS, and Infrastructure as Code (IaC). In her role as a platform engineer, Camila is focused on implementing scalable and secure solutions to improve the end user experience.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Camila, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nWhen did you get started with Kubernetes–what was your first project?
\\n\\n\\n\\nMy first project was creating and supporting Kubernetes clusters in production environments and creating stress tests for the main applications.
\\n\\n\\n\\nWhat are the primary CNCF projects you work on or use today? What projects have you enjoyed the most in your career?
\\n\\n\\n\\nKubernetes is the primary CNCF project I’m working with right now.
\\n\\n\\n\\nHow have the certs helped you in your career?
\\n\\n\\n\\nThe certs helped me to get an in depth understanding of Kubernetes architecture and security.
\\n\\n\\n\\nHow has CNCF helped you or influenced your career?
\\n\\n\\n\\nCNCF resources and training have helped me to get recognized in the tech industry, keep up-to-date with new technologies, and my certifications have opened up career opportunities for me.
\\n\\n\\n\\nWhat are some other books/sites/courses you recommend for people who want to work with k8s?
\\n\\n\\n\\nI recommend Kubernetes documentation, Udemy (check out the CNCF endorsed content), and Cloud Guru platforms as well as KodeKloud for practical tests.
\\n\\n\\n\\nWhat do you do in your free time?
\\n\\n\\n\\nI like to study new technologies and spend time with my family.
\\n\\n\\n\\nWhat would you tell someone who is just starting their K8s certification journey? Any tips or tricks?
\\n\\n\\n\\nLearn from the recommended courses in Udemy or use Cloud Guru, take the practical tests on KodeKloud, but most importantly, take the time to study the Kubernetes documentation and get as much education as you can exploring the CNCF Kubernetes project.
\\n\\n\\n\\nToday the cloud native ecosystem is way more than Kubernetes. Do you plan to get other cloud native certifications from the CNCF?
\\n\\n\\n\\nI am thinking about observability certs like:
\\n\\n\\n\\nHow did you get involved with cloud native and Kubernetes in general?
\\n\\n\\n\\nMy involvement starts with Kubernetes, but I’ve really enjoyed participating in the CNCF events.
\\n\\nMember post by Kyuho Han, SK Telecom
\\n\\n\\n\\nSince the World Economic Forum (WEF) 2021, the Great Reset of our society through digital transformation has been accelerating.
\\n\\n\\n\\nIn Korea, digital transformation is accelerating not only in the tech industry such as games, search, and telecommunications, but also in traditional industries such as education, real estate, and finance. In particular, in the financial industry, various financial services based on data analysis can be provided by various entities through the MyData project, which was revamped in 2021.
\\n\\n\\n\\n“MyData is a policy initiated by the South Korean government that enables individuals to directly manage and utilize their own data. The initiative aims to return data sovereignty to individuals, enabling them to receive a variety of customized services. As such, service providers are required to pass personal data to third parties via standardized APIs upon customer request.”
\\n\\n\\n\\nIn traditional industries, it is not possible to develop services with in-house staff alone, so multiple development entities need to build services as microservices and cooperate. Kubernetes is fulfilling the role of PaaS for building microservices in this era.
\\n\\n\\n\\nSK Telecom is providing TKS (SkTelecom Kubernetes Service) to traditional business fields such as finance and broadcasting.
\\n\\n\\n\\nAs services grow and become more personalized, more and more microservices are being developed, deployed, and operated. This naturally led to the demand for multi-tenant configurations, where a single Kubernetes cluster is shared by microservices developed by multiple development entities, sometimes with different development and operations organizations.
\\n\\n\\n\\nIn most cases, many Kubernetes clusters are created and managed based on geo-redundancy and purpose (DEV/STG/PRD). Therefore, it is necessary to apply different security policies for each micro service and each Kubernetes cluster purpose.
\\n\\n\\n\\nAs a basic feature, TKS supports user management with multi-cluster support, RBAC by user/group, and single sign on (SSO) to the dashboard, various services, and Kube APIs.
\\n\\n\\n\\nHowever, this feature alone does not allow for more granular control. Therefore, various security guides were created, and based on these guides, the CI/CD pipeline, Kubernetes RBAC, and various IT services needed to be configured to comply with them; some of this was implemented in the form of guidelines for developers and operators to follow.
\\n\\n\\n\\nAlthough this configuration seems to solve everything, there was a high possibility of hard-to-find security threats due to misunderstandings of the security guides or incorrect configurations of individual systems. In addition, making security policy a duty of developers and operators put a burden on them, making it difficult for them to fully utilize their individual capabilities.
\\n\\n\\n\\nEspecially difficult was the fact that the security configurations that interpreted the security policy were distributed across multiple IT systems. This not only made configurations hard to manage, but also required interworking with multiple IT systems to establish a unified monitoring and auditing system for security violations. This increase in management points negatively affected the maintainability of the system.
\\n\\n\\n\\nBefore we explain the solution we chose, let’s start with the image we have in our minds. Many construction sites or factories use Travel Restraint Systems. These are devices that protect workers from hazards by restricting their radius of motion. They are very simple, intuitive, and minimally restrictive of the worker’s movement. They can also provide safety without much effort or awareness on the part of the worker.
\\n\\n\\n\\n
I think it’s similar to building microservices with Kubernetes, where various stakeholders collaborate with each other. To innovate, we need to ensure maximum convenience for developers/operators. We also need to provide them with a system that ensures they don’t breach security and create security holes without even realizing it.
We call this system a governance system. The requirements of a governance system can be summarized in three main points.
\\n\\n\\n\\nDuring the policy formulation phase, it’s important to clearly state what the policy is – you don’t want people looking at the policy and having different interpretations. However, a clear policy description may not express the intent of the policy. We struggled with the choice between clarity and intent. In the end, we decided that intent can lead to differences in interpretation, which can be confusing for developers/operators, so we prioritized clarity.
\\n\\n\\n\\nIn the end, we decided to adopt a form of Policy as Code that clearly describes the policy itself as code.
\\n\\n\\n\\nThe existing approach works by applying a defined policy to multiple systems. The settings of each system are designed for the unique purpose of that system. Therefore, no matter how clearly a policy is described, there is a possibility of a gap in translating the policy into settings for the systems to which it should be applied. The size of this gap increases with the number of policy enforcement points (individual systems), and with it the possibility of malfunction due to incorrect settings.
\\n\\n\\n\\nTherefore, the best practice is to have a single policy enforcement target. With Kubernetes, everything runs through the Kube API, so you can control everything right before final execution through the admission extension points provided by the Kube API Server.
\\n\\n\\n\\nAfter policies are applied, you need to observe how well they are adhered to, so you can remove unnecessary restrictions and continue to add necessary ones. The structure for visibility should be built on the same principles as the structure for accurate policy enforcement. Fewer points of policy enforcement, ideally a single point of policy enforcement, makes it simpler to monitor policy enforcement.
\\n\\n\\n\\nBelow is our proposed governance system.
\\n\\n\\n\\nThe Kube API Server in Kubernetes provides an extension of functionality through the Admission webhook.
\\n\\n\\n\\nTwo representative pieces of software that can provide governance in the form of policy as code at the admission stage are OPA Gatekeeper, which uses admission webhooks, and Validating Admission Policy, which is available as a stable feature in Kubernetes 1.30.
\\n\\n\\n\\nThe two solutions can be briefly compared as follows.
\\n\\n\\n\\nBoth technologies enforce policies through the deployment of Kubernetes Custom Resources, so you can manage the deployment of policies through a general pipeline like GitOps.
\\n\\n\\n\\nOPA Gatekeeper | Validating Admission Policy |
Policy as a CodeRego FeaturesValidatingMutatingAudit External Data Support Rich policy library | Policy as a CodeCEL FeaturesValidating |
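\\n\\n\\n\\nTo give a feel for the CEL-based option, here is a minimal sketch of a Validating Admission Policy (not one of the policies discussed in this post; the policy name, the replica limit, and the namespace label are illustrative assumptions) that rejects Deployments with more than five replicas in namespaces labeled environment: dev:
\\n\\n\\n\\napiVersion: admissionregistration.k8s.io/v1\\nkind: ValidatingAdmissionPolicy\\nmetadata:\\n  name: limit-replicas                # illustrative name\\nspec:\\n  failurePolicy: Fail\\n  matchConstraints:\\n    resourceRules:\\n      - apiGroups: ["apps"]\\n        apiVersions: ["v1"]\\n        operations: ["CREATE", "UPDATE"]\\n        resources: ["deployments"]\\n  validations:\\n    - expression: "object.spec.replicas <= 5"   # CEL evaluated in-process by the API server\\n      message: "replicas must be 5 or fewer"\\n---\\napiVersion: admissionregistration.k8s.io/v1\\nkind: ValidatingAdmissionPolicyBinding\\nmetadata:\\n  name: limit-replicas-binding\\nspec:\\n  policyName: limit-replicas\\n  validationActions: ["Deny"]\\n  matchResources:\\n    namespaceSelector:\\n      matchLabels:\\n        environment: dev                # assumed label used to scope the policy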
We were able to develop 20 policies in advance through interviews with our customers. Looking at the content of the 20 policies, there was a requirement for not only validation but also mutating. There was also a requirement for validation in conjunction with external data. Therefore, we selected OPA Gatekeeper as the final solution.
\\n\\n\\n\\nAs mentioned earlier, the products/services that organizations offer, the technology they use, and the way people work are constantly changing, so policies need to be constantly changing as well. We need the ability to easily change and create policies, and easily import best practices.
\\n\\n\\n\\nThere is a learning curve with OPA Gatekeeper, as it requires you to know both Kubernetes and the less popular Rego language, so further development is needed to make editing and creating policies easier and to let policies be updated naturally.
\\n\\nCo-chairs: Melissa Logan and Adam Durr
November 12, 2024
Salt Lake City, Utah
Organizations like Etsy, Grab, Dish Network, and Chick-fil-A have standardized on Kubernetes and shared best practices for running different types of stateful workloads. Our aim for the Data on Kubernetes Day event is to bring you the resources you need to get started or advance on your DoK journey.
\\n\\n\\n\\nRunning data workloads on Kubernetes was once unthinkable, but the technology has matured significantly in the past few years to safely support data and stateful workloads such as databases, streaming, AI/ML, analytics, and CI/CD. Read more in this recent six-part series in Computer Weekly.
\\n\\n\\n\\nData on Kubernetes Day started as a virtual event in 2021 hosted by the Data on Kubernetes Community to coincide with KubeCon in Europe and North America, and became an official colocated event in 2023. You can stream all content from the DoK Events playlist.
Who will get the most out of attending this event?
\\n\\n\\n\\nIf you manage databases, AI/ML apps, big data, streaming apps on Kubernetes – this event will help you see how others are managing stateful workloads and pitfalls to avoid. You’ll hear technical best practices, as well as interesting ideas like using Kubernetes as your DBA. We also made space for an intriguing panel on Kubernetes and GPU trends for AI in financial services – with a stellar lineup of panelists from The Hartford, JP Morgan Chase, Deutsche Bank, and the Royal Bank of Canada.
\\n\\n\\n\\nDoKC community members will also be hosting a panel at KubeCon on day one: The Future of DBaaS on Kubernetes.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nThis year we added DBaaS content, which is an increasingly hot topic in the Kubernetes space. And, while we’ve always featured AI/ML talks, this year we saw a surge in AI/ML-related submissions and added a couple to the roster. We were also thrilled to receive so many submissions from women and people of color. This is the most diverse group of submissions and selected speakers we’ve had the privilege to organize.
\\n\\n\\n\\nWhat will the day look like?
\\n\\n\\n\\nYou can see the schedule here. To kick off the event, we’ll share new data from the 2024 Data on Kubernetes survey that’s being fielded now. (Learn more and take the survey here.)
\\n\\n\\n\\nDoKC Ambassadors will be onsite and would love to answer your questions.
\\n\\n\\n\\nAnd we will be giving away brand new DoK t-shirts at the end; stickers will be available for all attendees. Our current shirt is an homage to one of our favorite 80s hip hop groups Run DMC (the shirt says “RUN DOK”). In keeping with our musical swag theme, this year we’ve gone from rap to rock.
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nIf you’re just learning about stateful workloads, check out last year’s Stateful Workloads in Kubernetes: A Deep Dive video to get oriented. DoKC also collaborated with the CNCF Storage TAG to create the Database Patterns – Data on Kubernetes Whitepaper (GitHub). Those are great starting points for newbies.
\\n\\n\\n\\nThe DoK Special Interest Group also recently started a new project – the Getting Started Guide. It’s a work in progress, but has a few great resources to help people get started. If you want to contribute, join the DoK Slack #dok-sig channel.
\\n\\n\\n\\nFind your community!
\\n\\n\\n\\nThe Data on Kubernetes Community is a welcoming, inclusive community that anyone can join. We host monthly virtual meetups, a Slack channel, maintain a resource library, host Data on Kubernetes Days and meetups, and other activities. We hope to see you on the Slack!
\\n\\n\\n\\nSubmitted by the co-chairs. Don’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nEnd user post by Alolita Sharma, Engineering Leader at Apple, CNCF Board & EndUser TAB, OpenTelemetry GC, CNCF Observability TAG Co-Chair
\\n\\n\\n\\nThe CNCF End User Technical Advisory Group (TAB) was formally announced at KubeCon + CloudNativeCon North America 2023 in Chicago last year. The TAB is one of the three central governing pillars of the Cloud Native Computing Foundation (CNCF).
\\n\\n\\n\\nThe mission of the TAB includes representing the interests and perspectives of end users in the cloud native ecosystem and ensuring that this ecosystem continues to evolve in a direction that meets the needs and expectations of its users. The TAB also bridges end users and the other governing pillars of the CNCF – the Governing Board (GB) and the Technical Oversight Committee (TOC).
\\n\\n\\n\\nIn 2024, the End User TAB kicked off its first year, electing Alolita Sharma as the TAB Chair and Henrik Blixt as the TAB Vice Chair. In the initial meetings, the TAB identified multiple areas to focus on, which include –
\\n\\n\\n\\nThe TAB selected the first three focus areas to be led by TAB working groups.
\\n\\n\\n\\nThe TAB Reference Architecture working group is a collaborative effort within the TAB aimed at providing practical guidance and examples for building cloud-native applications and infrastructure. This group plays a crucial role in promoting best practices, fostering interoperability, and accelerating the adoption of cloud-native technologies. Garry Cairns and Sergiu Peteau lead this group.
\\n\\n\\n\\nThe TAB End User Feedback working group is focused on gathering feedback from end users on various aspects of CNCF projects, including usability, reliability, performance, and documentation, providing constructive feedback from end users to project maintainers, and tracking feedback resolution. The TAB hopes to achieve increased collaboration between end users and projects, providing more feedback from end users to projects and greater end user satisfaction. Chad Beaudin and Joe Sandoval lead this group.
\\n\\n\\n\\nThe TAB Project Health Visibility working group is focused on improving the visibility and transparency of CNCF projects. This group plays a crucial role in helping end users understand the health and status of CNCF projects, making informed decisions about their adoption, and contributing to the overall success of the CNCF ecosystem. This working group aims to achieve stronger and well-balanced project governance, improved end user technology decision making and enhanced community engagement in projects. Joe Sandoval and Ricardo Rocha lead this group.
\\n\\n\\n\\nWe welcome end users interested in participating in these initiatives to join one of the working groups. Please do not hesitate to contact me or any of us on the CNCF End User TAB. You can reach us on the CNCF Slack channel at #tab or in person at the upcoming KubeCon NA in Salt Lake City on Nov 12-16, 2024.
\\n\\n\\n\\nReference Links:
\\n\\n\\n\\nCommunity post by Gerardo Lopez Falcon
\\n\\n\\n\\nIn today’s software development world, containers have transformed how companies and developers deploy and manage applications. However, alongside this new technology come new security threats. The proliferation of containers and their shared nature make runtime security a primary concern. This is where technologies like gVisor come into play, offering an additional layer of security and isolation to protect applications and their data from runtime attacks.
\\n\\n\\n\\nContainers provide a lightweight and efficient environment for running applications. However, like any technology, they are not without risks. Traditional containers, while more secure than bare-metal or virtualized applications, share the host system’s kernel. This means that if one container is compromised, the attacker could potentially gain access to the kernel and other containers or shared resources.
\\n\\n\\n\\nCommon container attacks include:
\\n\\n\\n\\nTo address these risks, several security solutions have emerged within the container space, with gVisor being one of the standout technologies.
\\n\\n\\n\\ngVisor is a container runtime developed by Google, designed to provide greater isolation between containers and the host kernel. Instead of sharing the kernel directly with the host, gVisor acts as an intermediary layer, implementing a significant portion of system calls (syscalls) without fully relying on the kernel. This reduces the attack surface, as containers do not directly interact with the underlying operating system.
\\n\\n\\n\\nIn addition to gVisor, other solutions in the cloud-native ecosystem also contribute to improving container security:
\\n\\n\\n\\nEach of these technologies offers different approaches to enhancing security in container environments and can be used together for better application protection.
\\n\\n\\n\\nIntegrating gVisor into your Kubernetes or Docker-based infrastructure is straightforward:
\\n\\n\\n\\nRun your containers using runsc as the runtime:
\\n\\n\\n\\ndocker run --runtime=runsc -it alpine
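\\n\\n\\n\\nFor Kubernetes, you also define a RuntimeClass so that selected pods use gVisor as their runtime. A minimal sketch, assuming your container runtime (for example, containerd) has already been configured with a handler named runsc:
\\n\\n\\n\\napiVersion: node.k8s.io/v1\\nkind: RuntimeClass\\nmetadata:\\n  name: gvisor\\nhandler: runsc   # must match the runtime handler configured in containerd/CRI-O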
\\n\\n\\n\\nUpdate your pod manifests to use the new RuntimeClass:
\\n\\n\\n\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n name: secure-pod\\nspec:\\n runtimeClassName: gvisor\\n containers:\\n - name: my-app\\n image: alpine
\\n\\n\\n\\nWith these configurations, your containers will run using gVisor for enhanced security.
\\n\\n\\n\\nIt’s time for companies and developers to take the next step in container security evolution. As more organizations migrate their applications to the cloud and adopt containers, threats also evolve. Adopting solutions like gVisor not only mitigates runtime risks but also provides greater confidence in the security of your applications.
\\n\\n\\n\\nIf you’re a developer or company using containers, don’t wait for an attack to strengthen your security. Implement gVisor today and protect your applications throughout their lifecycle. Explore how you can improve your cloud-native architecture with enhanced isolation and runtime security.
\\n\\n\\n\\nStart exploring gVisor and other container security technologies today! Improve your application security, protect your infrastructure, and ensure business continuity in an increasingly threat-prone environment.
\\n\\nCommunity post originally published on Medium by Dotan Horovits
\\n\\n\\n\\nLast month the OpenMetrics project was officially archived and folded into Prometheus. That’s the end of an open source project journey that concludes exactly where it all started.
\\n\\n\\n\\nIt’s an interesting story. OpenMetrics was originally born as an attempt to spin off Prometheus exposition format into an independent and tool-agnostic open specification.
\\n\\n\\n\\nIt was even placed under a new independent umbrella repo on GitHub called OpenObservability (no relation to my podcast OpenObservability Talks 😊)
\\n\\n\\n\\nAt some point a few years ago there was even an attempt to turn it into an official IETF open standard (RFC2119), which hasn’t come to fruition.
\\n\\n\\n\\nBut ultimately, Prometheus itself is today a de-facto standard, at least in the cloud-native space. Many tools today provide out-of-the-box support for exporting metrics in Prometheus format.
\\n\\n\\n\\nAnd as to tools outside the Prometheus ecosystem, they have their own formats and haven’t jumped to switch. Because, let’s face it, as elegant as the notion of an abstract universal exposition format is, in reality these formats are quite coupled to the way the data is stored and represented internally in the tool.
\\n\\n\\n\\nThe attempt to make OpenMetrics bigger than “just the Prometheus format” also caused some confusion among Prometheus users as to which format to use for exporting and receiving metric time-series data. It was even more confusing since the two are fairly similar, yet have their divergent points.
\\n\\n\\n\\nNot to mention the confusion in the broader community with the abundance of Open<X> projects.
\\n\\n\\n\\nLast month, July 2024, the Technical Oversight Committee of the Cloud Native Computing Foundation (CNCF TOC for short) approved and signed off on archiving OpenMetrics and migrating it under Prometheus.
\\n\\n\\n\\nUltimately it’s a good thing. A project can be both a tool and a specification. Just like we do with OpenTelemetry. No need for separate projects here. This merge will realign the efforts around Prometheus, simplify things and reduce confusion and overhead.
\\n\\n\\n\\nOpenMetrics is dead, long live OpenMetrics (as Prometheus format).
\\n\\n\\n\\nYou can read more about open specifications in observability here.
\\n\\n\\n\\n\\n\\nCo-chairs: David Hirsch, Michael Beemer
November 12, 2024
Salt Lake City, Utah
The Open Feature Summit focuses on the use of feature flags and experimentation in cloud-native environments. It’s an event designed to help developers, architects, and decision-makers leverage feature management to accelerate release velocity and improve the reliability of deployments. The summit dives deep into the OpenFeature project, which provides a unified standard for feature flags, making it easier to manage flags across diverse environments, tools, and programming languages. This is the first time the Open Feature Summit is being held as a co-located event at KubeCon + CloudNativeCon.
\\n\\n\\n\\nWho will get the most out of attending this event?
\\n\\n\\n\\nAnyone who is involved in the development, deployment, or operation of cloud-native applications will benefit from attending. This includes developers, DevOps engineers, platform teams, and product managers. The event is particularly suited for those already using or considering the use of feature flags and looking to enhance their progressive delivery practices.
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nNo formal prep work is required, but familiarity with the concept of feature flags and progressive delivery would be helpful. Attendees might want to explore the Open Feature documentation or check out the latest updates on GitHub to get an idea of what the project offers. Bringing specific challenges or use cases from your work can also help make discussions more valuable.
\\n\\n\\n\\nAnything else you would like to say about this event?
\\n\\n\\n\\nThe Open Feature Summit is a unique opportunity to connect with others who are passionate about improving software delivery through feature management. We’re excited to offer an interactive and practical experience, with sessions that will cater to everyone from beginners to advanced users. The goal is to leave attendees with actionable insights they can immediately apply to their own development workflows.
\\n\\n\\n\\nSubmitted by David Hirsch, who is looking forward to seeing how the Open Feature ecosystem is evolving, particularly how it’s being adopted across different cloud-native tools.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\n\\n\\n\\n\\nThe CNCF Technical Oversight Committee (TOC) has voted to accept Artifact Hub as a CNCF incubating project.
\\n\\n\\n\\nArtifact Hub is a web-based application that enables finding, installing, and publishing cloud native packages and configurations. Discovering useful cloud native artifacts like Helm charts can be difficult with general-purpose search engines. Artifact Hub makes finding artifacts easier by providing targeted searches.
\\n\\n\\n\\n\\n\\n\\n\\n\\n“Artifact Hub was created to bring together the discovery of cloud native artifacts. Prior to Artifact Hub, people had to use general search engines or targeted ones for a specific type (like the now deprecated Helm Hub) to find artifacts,” said Matt Farina, Artifact Hub maintainer and Distinguished Engineer at SUSE. “The experience had room for improvement. Dan Kohn, the founding executive director of CNCF, noticed this problem and brought together people involved with it at KubeCon + CloudNativeCon North America 2019 in San Diego. Artifact Hub was born from those conversations and became a sandbox project in 2020.”
\\n
Numerous types of artifacts are supported, including Argo templates, Backstage plugins, Container images, CoreDNS plugins, Falco rules, Headlamp plugins, Helm charts and plugins, Inspektor gadgets, KCL modules, KEDA scalers, Keptn integrations, Knative client plugins, Kubectl plugins, KubeArmor policies, Kubewarden policies, Kyverno policies, Meshery designs, OLM operators, OPA and Gatekeeper policies, OpenCost plugins, Tekton packages, and Tinkerbell actions.
\\n\\n\\n\\nSince joining the CNCF Sandbox, Artifact Hub has:
\\n\\n\\n\\n\\n\\n\\n\\n\\n“We are thrilled to provide an intuitive and easy to use solution that allows our users to discover and publish multiple kinds of Cloud Native artifacts from a single place. At the moment we support 26 different types of artifacts (most from other CNCF projects), and we’re looking forward to adding more in the future!” – Sergio Castaño Arteaga and Cintia Sanchez Garcia, Software Engineers at CNCF
\\n
Notable Milestones
\\n\\n\\n\\nThe main public deployment is available at https://artifacthub.io.
\\n\\n\\n\\nFor the future, Artifact Hub’s roadmap is focused on three categories:
\\n\\n\\n\\nAs a CNCF-hosted project, Artifact Hub is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. Artifact Hub joins incubating technologies Backstage, Buildpacks, cert-manager, Chaos Mesh, Cloud Custodian, Container Network Interface (CNI), Contour, Cortex, Crossplane, CubeFS, Dapr, Dragonfly, Emissary-Ingress, gRPC, in-toto, Karmada, Keptn, Keycloak, Knative, KubeEdge, Kubeflow, KubeVela, KubeVirt, Kyverno, Litmus, Longhorn, NATS, Notary, OpenFeature, OpenKruise, OpenMetrics, OpenTelemetry, Operator Framework, Strimzi, Thanos, and Volcano. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria.
\\n\\nCo-chairs: Iris Ding & Keith Mattix
November 12, 2024
Salt Lake City, Utah
Istio Day is the biannual community event for the industry’s most widely adopted and feature-rich service mesh, where attendees will find lessons learned from running Istio in production, the latest updates on Istio’s ambient mode developments, and the opportunity to meet and learn from maintainers across the Istio ecosystem.
\\n\\n\\n\\nThis will be the fourth Istio Day. The first Istio Day was co-located with KubeCon Europe 2023, in Amsterdam. This colocated event replaces ServiceMeshCon, which was first held in 2019, before Istio was a CNCF project.
\\n\\n\\n\\nWho will get the most out of attending this event?
\\n\\n\\n\\nDevelopers, operators, platform engineers, architects, product managers, open-source enthusiasts, and C-level executives who are keen on enhancing their microservice management and increasing productivity within their companies.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nThis year, Istio Day will be putting first things first, focusing on the most important part of the project: our users. Attendees will hear from project maintainers and world-class Istio practitioners alike, equipping them with new skills, industry insights, and more!
\\n\\n\\n\\nWhat will the day look like?
\\n\\n\\n\\nThis year, Istio Day will be a half-day event filled with a plethora of engaging sessions for all skill levels. It will feature lightning talks, a panel, and technical deep dives on topics such as seamless upgrades, multi-cluster, ambient mode, and security. The Istio community will also have a project kiosk at the main KubeCon conference; please come and meet our community members!
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nWhile no advanced preparation is needed, having a basic understanding of the concept of a service mesh and the goals of Istio in particular will allow participants to take full advantage of the opportunities presented. Sessions are designed for all levels of experience – from novice to knowledgeable.
\\n\\n\\n\\nFind your community!
\\n\\n\\n\\nIstio Day offers a wonderful learning and collaboration opportunity to attendees. Most of the project maintainers, steering committee, and technical oversight committee will be there. Many people-years of real-world experience, hard-won war stories, and uniquely creative insights make this a guaranteed accelerator wherever someone is in their service mesh journey.
\\n\\n\\n\\nContributed by the co-chairs.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nNow is the time for the open source ecosystem to band together and find strength in numbers
\\n\\n\\n\\nCNCF and The Linux Foundation are expanding their partnership with Unified Patents to protect open source software from non-practicing entities (NPEs), commonly referred to as “patent trolls.” This enhanced partnership brings new benefits to LF and CNCF members in terms of access to enhanced NPE deterrence mechanisms. In this blog, Joanna Lee, Vice President of Strategic Programs and Legal at CNCF and the Linux Foundation, explains what those benefits are, why they are needed, and why it is so important that open source users and vendors join together to protect the cloud native and open source ecosystems.
\\n\\n\\n\\nJL: Patent trolls are entities whose sole purpose is to buy patents and threaten companies (both vendors and adopters) with patent litigation to extract money. Companies on the defense will often pay settlement fees to avoid the even higher cost of litigation, even when the troll’s patents and legal arguments are of questionable validity. Trolls use each of these wins to bolster the impression that they have an enforceable patent, and this helps them convince other companies to also settle.
\\n\\n\\n\\nJL: Patent trolls go after successful, broadly adopted technologies–whether closed or open source–because it’s a numbers game. Broad adoption and success equate to higher settlement payments and more companies to shake down for money. Any software that is pervasive and widely used is a target of NPE aggression.
\\n\\n\\n\\nJL: Unified Patents is a membership-based organization that uses a range of tools and strategies to deter NPEs from targeting specific technology areas, referred to as patent protection “zones.” In 2019, the Linux Foundation partnered with Unified Patents to establish an Open Source Zone. In addition to directly challenging NPE patents through invalidity proceedings (with a 90% success rate), Unified hosts crowdsourced prior art searches, shares intelligence about NPE campaigns, negotiates royalty-free licenses to benefit all companies who participate in the impacted zone, and arms companies with tools and information to strengthen their defense against NPE threats.
\\n\\n\\n\\nJL: Ultimately, we want patent trolls to conclude “the open source ecosystem is not worth our time because it has banded together and is too hard to shake down.” When it comes to NPE deterrence, there is strength in numbers. When organizations join forces to safeguard open source innovation through Unified Patents’ programs, we can achieve far more with fewer resources than when individual companies act in isolation. Companies are far more vulnerable to NPEs’ predatory behavior acting alone than when they work together to deter invalid assertions. Additionally, many companies find that it’s more cost-effective to sponsor certain types of deterrence activities through Unified than to pursue similar efforts on their own.
\\n\\n\\n\\nJL: As a result of this expanded partnership, members of the Linux Foundation and CNCF–over 1300 companies–will gain access to a suite of benefits based on their membership level to assist in proactive NPE deterrence, including:
\\n\\n\\n\\nJL: We encourage LF/CNCF members to take full advantage of the new benefits offered through this partnership. However, these benefits are just a starting point. We encourage all companies in our ecosystem to also participate in the broader set of NPE deterrence programs that Unified Patents offers, both to strengthen the collective defense and to support expansion of Unified’s deterrence activities to counter the rise in NPE aggression.
\\n\\n\\n\\nJL: Open source developers can help by contributing evidence of prior art to Unified’s crowdsourced PATROLL prior art contests for the Open Source Zone. Prior art–evidence that the claimed invention was publicly known about and therefore not “new” at the time the patent application was filed–can be used to invalidate an NPE’s patent. CNCF and Unified will co-host an in person PATROLL contest at KubeCon + CloudNativeCon North America 2024. The winner will be awarded a cash prize and recognized on the KubeCon keynote stage. More details will be announced soon.
\\n\\n\\n\\n\\n\\nMember post originally published on the Devtron blog by Bhushan Nemade
\\n\\n\\n\\nAs organizations rush toward the cloud-native paradigm, most face an unexpected issue: skyrocketing infrastructure expenses. Inefficient resource management and the lack of demand-driven provisioning result in continuously active resources regardless of actual usage patterns. These factors are the main reason the cloud providers’ pay-as-you-go model can turn into a cost nightmare for organizations. In this blog post, we will discuss Winter Soldier, an open-source tool crafted by Devtron for time-based scaling of Kubernetes workloads that can help save some money, and look at how to optimize resource utilization.
\\n\\n\\n\\nTime-based autoscaling scales your workloads according to a defined schedule; it can also be used to execute batch processes such as hibernating microservices. Time-based autoscaling is most effective when we know the exact pattern of incoming traffic for our services. It aligns perfectly with our cost optimization goal: by exploiting our traffic pattern, we can scale our infrastructure down or up at the right times.
\\n\\n\\n\\nSo, should we now scale down the production environment to optimize the cost…? No. There is one hidden culprit that contributes significantly to cloud costs for every organization without even being noticed: the non-production environments (dev, staging, testing, preview). They exist in every organization, and in large numbers. All of these environments keep running 24/7, and along with them, the cost meter keeps spinning.
\\n\\n\\n\\nIn the case of non-production environments, it’s easy to track the traffic pattern, so time-based scaling is a good option. For instance, dev environments can be scaled down at night and scaled up during the day; similarly, all of these environments can be scaled down at the start of the weekend, say Friday evening, and scaled back up on Monday morning. Scaling down these environments can have a significant impact on cost, as every organization maintains multiple non-production servers behind a single production server.
\\n\\n\\n\\nFor instance, imagine scaling down your non-production environments every Saturday and Sunday throughout the year. Let’s do some quick calculations to see how much that could save.
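\\n\\n\\n\\nAs a rough back-of-the-envelope sketch (the monthly bill below is an assumed figure for illustration, not real data):
\\n\\n\\n\\n48 weekend hours / 168 hours per week ≈ 28.6% of weekly compute hours\\nAssumed non-production compute bill: $1,000 per month\\nApproximate saving: $1,000 × 0.286 ≈ $286 per month, or roughly $3,400 per year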
\\n\\n\\n\\nWinter Soldier is an open-source tool from Devtron that enables time-based scaling for Kubernetes workloads. Time-based scaling with Winter Soldier helps reduce cloud cost; it can be deployed to execute things such as:
\\n\\n\\n\\nWinter Soldier can operate in three modes: “scale”, “sleep”, and “delete”. These are the actions Winter Soldier can execute; for example, if the action is defined as scale, the Kubernetes workloads will be scaled according to the pre-defined time.
\\n\\n\\n\\nAs we have seen above, non-production environments often contribute significantly to cloud costs. By implementing Winter Soldier for these environments, we can automatically scale down our non-production infrastructure during off-peak hours like nights and weekends.
\\n\\n\\n\\nScaling down the infrastructure can also be done manually, but that takes a lot of the system administrators’ time and effort.
\\n\\n\\n\\nIs time-based scaling the only way to do this? No, it can also be done with the HPA or event-driven autoscaling, but time-based scaling is recommended when we already know the incoming traffic pattern. The HPA and event-driven autoscaling scale the workloads only when requests arrive, so it takes time to scale the workloads up and down.
\\n\\n\\n\\nLet’s explore how Winter Soldier can help and how to implement it in our Kubernetes infrastructure.
\\n\\n\\n\\nWinter Soldier comes as an operator for Kubernetes that requires a Custom Resource Definition (CRD) named Hibernator. Devtron provides a Helm chart for deploying Winter Soldier in our Kubernetes cluster, which makes the whole process simpler.
\\n\\n\\n\\nNote: Winter Soldier can be used independently with any Kubernetes cluster. Still, I’ll proceed with Devtron for this blog, as it provides multiple additional features for managing my Kubernetes-native applications and supports seamless CI/CD operations, including visibility into Helm applications.
\\n\\n\\n\\nIn this section, we will deploy Winter Soldier and configure it to scale down our dev environment from Friday midnight to Monday morning. The current state of the dev environment can be seen in Figure 1: as of now, we have 2 applications up and running with their pods visible in the Resource Browser of Devtron.
\\n\\n\\n\\n
Step 1: Installation of Devtron
Devtron is an open-source, modular Kubernetes dashboard designed to ease Kubernetes operations. It is built on top of popular open source tools like ArgoCD, Grafana, and Trivy, and is built in a modular fashion where its capabilities can be extended from an advanced Kubernetes dashboard to Kubernetes-native CI/CD pipelines, DevSecOps, Continuous Delivery, and GitOps, depending upon the requirements. Its installation is pretty straightforward.
\\n\\n\\n\\nhelm repo add devtron https://helm.devtron.ai\\n\\nhelm repo update devtron\\n\\nhelm install devtron devtron/devtron-operator \\\\\\n--create-namespace --namespace devtroncd\\n
\\n\\n\\n\\nCheck out the Devtron documentation for more details about the installation and integrations.
\\n\\n\\n\\nStep 2: Helm Chart for Winter Soldier
\\n\\n\\n\\nNavigate to the Chart Store.
\\n\\n\\n\\nSelect the Helm Chart for Winter Soldier: devtron/winter-soldier
Once you click the chart, you will be able to see a brief description of the chart, README, and an option to Configure & Deploy.
\\n\\n\\n\\nStep 3: Configuring Winter Soldier
\\n\\n\\n\\nLet’s take a look at the Winter Soldier configuration for our environment. Here we want the dev environment to scale down on weekends, i.e., from Friday night to Monday morning.
\\n\\n\\n\\nDefault values for winter-soldier.\\n\\n\\nreplicaCount: 1\\nimage: quay.io/devtron/winter-soldier:abf5a822-196-14744\\ngraceperiod: 10\\n\\n\\nresources: {}\\n limits:\\n cpu: 100m\\n memory: 128Mi\\n requests:\\n cpu: 100m\\n memory: 128Mi\\n\\n\\nnodeSelector: {}\\n\\n\\ntolerations: []\\n\\n\\naffinity: {}\\n\\n\\n Provide the list of Hibernator objects in the yaml format with your custom requirements.\\nhibernator: []\\n - apiVersion: pincher.devtron.ai/v1alpha1\\n kind: Hibernator\\n metadata:\\n name: sleep-hibernator\\n spec:\\n timeRangesWithZone:\\n timeZone: \\"Asia/Kolkata\\"\\n timeRanges:\\n - timeFrom: 00:00\\n timeTo: 06:59:59\\n weekdayFrom: Fri\\n weekdayTo: Mon\\n selectors:\\n - inclusions:\\n - objectSelector:\\n name: \\"all\\"\\n type: \\"deployment,rollout,StatefulSet\\"\\n exclusions:\\n - namespaceSelector:\\n name: “devtron-ci,devtron-cd,argo,kube-system,devtroncd”\\n objectSelector:\\n name: \\"\\"\\n type: \\"deployment,rollout,StatefulSet\\"\\n action: sleep
\\n\\n\\n\\nIn the `resources` section, you can set the resource limits and requests for Winter Soldier itself. You can adjust these according to your cluster.
\\n\\n\\n\\nIn the `hibernator` section, we define how Winter Soldier manages your resources: in `timeRangesWithZone` we define the `timeZone` (for instance, we are taking `Asia/Kolkata`). In `timeRanges` we define the start time in `timeFrom` and the end time in `timeTo`, and similarly `weekdayFrom` and `weekdayTo`. In `selectors`, `inclusions` specifies which resources to manage and `exclusions` defines the exceptions.
\\n\\n\\n\\nIn the `spec` section of the `hibernator` we can also define `pause: true` and `pauseUntil: "Jan 2, 2026 3:04pm"`. By defining the `pause` action, we can put an already scheduled hibernator on pause for a specific time window.
\\n\\n\\n\\nIn `action` we define the goal for Winter Soldier; for the above example, we have set it to sleep. We can also set the `action` to `delete` or `scale` according to need.
\\n\\n\\n\\nOnce the configurations are set, we can proceed with the deployment of Winter Soldier.
\\n\\n\\n\\nStep 4: Winter Soldier in Action
\\n\\n\\n\\nOur application from the dev environment is up and running fine before the deployment of Winter Soldier.
\\n\\n\\n\\nHere is the effect of the Winter Soldier sleep action on our applications in the dev environment. Before the deployment of Winter Soldier, we had frontend-app and backend-app deployed in the dev environment, and both were up and running. Let’s see what actions Winter Soldier takes.
Let’s look at other applications in the same environment.
\\n\\n\\n\\nNow that Winter Soldier has been deployed and we have checked the application events, let’s navigate to the Resource Browser of Devtron to verify the result and gain visibility into our dev environment.
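\\n\\n\\n\\nIf you prefer the command line over the dashboard, the same check can be done with plain kubectl (the dev namespace name below is an assumption for illustration):
\\n\\n\\n\\nkubectl get deployments -n dev   # replica counts should read 0/0 while the sleep window is active\\nkubectl get pods -n dev          # no pods should be running for the hibernated workloads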
\\n\\n\\n\\nFigure 9 shows that Winter Soldier has scaled down the pods previously visible in Figure 1, resulting in no active pods currently running in the dev environment.
\\n\\n\\n\\nTime-based scaling allows organizations to automatically adjust their resource allocation based on predictable traffic patterns, such as scaling down during nights and weekends. Winter Soldier is a powerful open-source tool that helps organizations implement time-based scaling for their environments. By leveraging Winter Soldier, organizations can significantly reduce their cloud infrastructure cost, by around 28% of the yearly total, just by scaling workloads down on weekends, particularly in non-production settings like development, staging, and testing environments.
\\n\\n\\n\\nIf you liked Winter Soldier, feel free to give it a ⭐️ on GitHub. Join our actively growing Discord community and ask any questions you may have.
\\n\\nMember post from Swisscom by Lea Brühwiler, Ashan Senevirathne, Joel Studler, Alexander North, Henry Chun-Hung Tseng, Fabian Schulz
\\n\\n\\n\\nWe have adopted the GitOps model and leveraged Kubernetes to revolutionize the management of network services and infrastructure, enhancing both operational efficiency and reliability. The NetBox Operator leverages the power of the Kubernetes API, allowing users to directly manage essential resources such as IP addresses and prefixes within the Kubernetes environment. This integration ensures automated network maintenance through reconciliation mechanisms, thereby significantly improving both the simplicity and robustness of operations.
\\n\\n\\n\\nThe NetBox Operator uses the Kubernetes “claim” model to differentiate between the desired and the actual states. Users can conveniently express their high-level intent through a claim, such as reserving a specific prefix within a parent prefix. The operator then queries NetBox to identify the most suitable prefix that matches your requirements. Subsequently, a prefix CR (Custom Resource) is created in the Kubernetes cluster, representing the actual resource claimed by the prefix claim CR. Similarly, IP addresses can be claimed from a parent prefix using the same mechanism. The operator uses the leaselocker library (inspired by client-go) to provide distributed locks and prevent race conditions where the same resource gets assigned to different claims. With the leaselocker library, the parent prefix is locked until the reservation of a resource is completed.
\\n\\n\\n\\nFurthermore, for applications requiring sticky IP addresses, the NetBox Operator offers a convenient feature: the ability to keep an IP address reserved in NetBox even if the corresponding Custom Resource (CR) is deleted from the Kubernetes cluster – just set the .spec.preserveInNetbox flag of your Claim CR to true. If the CR is later recreated, it will be assigned the same IP as before. By seamlessly ensuring IP address consistency, this feature significantly enhances application reliability, enabling redeployment without the need to worry about IP addresses while keeping IPs sticky. The same mechanism is available to keep prefixes sticky.
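\\n\\n\\n\\nTo make the claim model more concrete, here is a rough sketch of what a prefix claim could look like. The API group, version, and most field names below are illustrative assumptions (only .spec.preserveInNetbox is taken from the description above), so please check the examples in the netbox-operator repository for the exact schema:
\\n\\n\\n\\n# Illustrative only: claim a /28 out of a parent prefix and keep it reserved in NetBox.
kubectl apply -f - <<'EOF'
apiVersion: netbox.dev/v1            # assumed API group/version
kind: PrefixClaim
metadata:
  name: example-prefix-claim
spec:
  parentPrefix: "192.168.0.0/24"     # high-level intent: carve a prefix out of this parent
  prefixLength: "/28"                # assumed field for the desired size
  tenant: "my-tenant"                # assumed NetBox metadata fields
  description: "claimed from Kubernetes"
  preserveInNetbox: true             # keep the reservation in NetBox if this CR is deleted
EOF
\\n\\n\\n\\nOnce the claim is reconciled, the operator creates the corresponding Prefix CR that represents the actual reservation in NetBox.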
\\n\\n\\n\\nUltimately, the NetBox Operator can be applied across a broad variety of Kubernetes-driven infrastructures, including 5G core infrastructure. The operator provides simplified resource management, enhanced operational efficiency, and improved reliability. By automating crucial aspects of network service and resource management, it empowers engineers to focus on higher-level tasks, driving greater scalability, flexibility, and agility within their infrastructure.
\\n\\n\\n\\nClone the NetBox Operator code from https://github.com/netbox-community/netbox-operator and follow the instructions in the README.md file to run the NetBox Operator and NetBox on a local kind cluster and test it using examples. Please also feel free to provide feedback and contribute.
\\n\\n\\n\\nThe NetBox Operator is just one component of our broader GitOps strategy, leveraging the Kubernetes Resource Model to drive operational excellence. Consuming IPAM from within Kubernetes allows us to hydrate (or generate) configuration from values which also live on the cluster. If you want to see some examples of this and how we approach this, please have a look at the examples at https://github.com/swisscom/containerdays-2024-krm.
\\n\\n\\n\\nTo understand the bigger picture of our journey, feel free to watch our KubeCon presentation outlining our GitOps evolution.
\\n\\n\\n\\nSwisscom actively invites collaboration from the tech community to enhance and expand the NetBox Operator. Meet us at the upcoming conference Open Source Summit on September 17th, where we will present and engage with peers on Kubernetes-driven network infrastructure.
\\n\\n\\n\\nTogether, we can push the boundaries of cloud-native transformation!
\\n\\nCommunity post by Danielle Cook, Cartografos Working Group
\\n\\n\\n\\nAs organizations continue their journey toward digital transformation, cloud native technologies are increasingly critical for achieving agility, scalability, and resilience. However, the path to cloud native maturity is not uniform across organizations. Some have embraced the model fully, reaping its benefits, while others are still navigating the complexities.
\\n\\n\\n\\nTo better understand where companies stand in this journey, the Cartografos Working Group wants to hear from you to evaluate end users against the cloud native maturity model. You can complete the Google Form here. All data is anonymous!
\\n\\n\\n\\nThe Cloud Native Maturity Model serves as a framework to assess an organization’s progress in adopting cloud native practices. It identifies different levels of maturity, ranging from initial adoption to full-scale, automated operations that align with business goals. While the model provides a clear pathway for organizations, our experience tells us that the reality is often more nuanced. This quick five minute survey aims to validate the levels of maturity outlined in the model and identify where end users need additional support and information.
\\n\\n\\n\\nBy understanding the current state of cloud native adoption, we can better serve the community. We want to help organizations bridge the gap between where they are and where they need to be, offering targeted resources and guidance that align with their specific needs.
\\n\\n\\n\\nThe anonymous survey focuses on several key areas that are critical to cloud native maturity:
\\n\\n\\n\\nThe insights gained from this anonymous survey will be invaluable in shaping the future of the Cloud Native Maturity Model. By participating, you’ll contribute to a better understanding of where the community stands and help identify the areas that require more attention and resources. In return, you’ll be able to read about where the industry is in cloud native maturity and review recommendations in the updated maturity model on how to advance further.
\\n\\n\\n\\nOnce the survey is complete, we will analyze the data, publish the results and update the Cloud Native Maturity Model accordingly.
\\n\\n\\n\\nThe Cartografos Working Group is committed to helping organizations navigate the complexities of cloud native technologies. Please take just five minutes to help support this effort!
\\n\\n\\n\\n\\n\\nProject post originally published on Github by Sascha Grunert
\\n\\n\\n\\nThe CRI-O maintainers are happy and proud to announce that CRI-O v1.31.0 has been released! This brand new version contains a large list of cool new features, bug fixes and smaller enhancements. I would like to take the opportunity to guide you through CRI-O’s latest and greatest enhancements in the field of Kubernetes compliant container runtimes.
\\n\\n\\n\\nThe CRI-O community voted to use the OCI runtime crun as the new default, replacing runc. That’s actually not too new, because crun has been used as the default runtime in the packages and static binary bundles for quite a while. The runtime offers overall better performance and a lower memory footprint than runc. Its faster container start times and lower memory usage make it a more optimized runtime for modern workloads, for example edge use cases as well as running WebAssembly (Wasm) workloads.
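\\n\\n\\n\\nIf you want to pin the default runtime explicitly (or switch back to runc), a small drop-in configuration is enough. This is a minimal sketch, assuming the crun runtime is already defined in your CRI-O configuration, as it is in the official packages:
\\n\\n\\n\\n# Minimal drop-in (file name is arbitrary); CRI-O merges files from /etc/crio/crio.conf.d/
cat <<'EOF' | sudo tee /etc/crio/crio.conf.d/10-default-runtime.conf
[crio.runtime]
default_runtime = "crun"
EOF
sudo systemctl restart crio          # restart CRI-O so the new default takes effect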
\\n\\n\\n\\nCRI-O v1.31 also features support for fine-grained SupplementalGroups (KEP-3619), which allows you to control and track how supplemental groups are applied to a container process. If you’d like to learn more about the feature itself, feel free to read through the corresponding Kubernetes v1.31 blog post.
\\n\\n\\n\\nBesides that, the CRI-O maintainers also added support for the Kubernetes image volume source alpha feature (KEP-4639). This feature allows users to utilize OCI images and artifacts as a custom volume source and mount them into containerized workloads. There is another Kubernetes v1.31 blog post available which covers more details about the functionality and usage of the feature.
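\\n\\n\\n\\nAs a quick illustration, a pod using the image volume source looks roughly like the sketch below. The image references are placeholders, and the ImageVolume feature gate has to be enabled on a Kubernetes v1.31 cluster since the feature is alpha:
\\n\\n\\n\\nkubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-example
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.10          # placeholder workload image
      volumeMounts:
        - name: artifact
          mountPath: /data
          readOnly: true                          # image volumes are mounted read-only
  volumes:
    - name: artifact
      image:                                      # the alpha image volume source
        reference: quay.io/example/artifact:v1    # placeholder OCI image or artifact
        pullPolicy: IfNotPresent
EOF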
\\n\\n\\n\\nCRI-O now supports sigstore (cosign) signature verification for policies corresponding to a certain Kubernetes namespace. This means that policies in the (default) directory /etc/crio/policies/[NAMESPACE].json will be validated for each pod of the corresponding NAMESPACE. This will also happen on container creation, which is a huge step forward in enforcing sigstore policies for a dedicated Kubernetes namespace compared to policies which apply only to the whole cluster.
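\\n\\n\\n\\nFor illustration, a minimal per-namespace policy could look like the following sketch. It uses the containers/image policy.json format; the namespace name, registry, and key path are placeholders:
\\n\\n\\n\\n# Illustrative policy for pods in a namespace called "production" (placeholder values).
cat <<'EOF' | sudo tee /etc/crio/policies/production.json
{
  "default": [{"type": "insecureAcceptAnything"}],
  "transports": {
    "docker": {
      "registry.example.com/myorg": [
        {
          "type": "sigstoreSigned",
          "keyPath": "/etc/crio/cosign.pub",
          "signedIdentity": {"type": "matchRepository"}
        }
      ]
    }
  }
}
EOF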
Besides the support for bigger Kubernetes features, CRI-O v1.31 ships a bunch of cool smaller enhancements, for example:
\\n\\n\\n\\n– A new --no-sync-log / no_sync_log option to disable fsync on container log rotation and container exit. This can improve performance at the cost of potential data loss on machine crashes.
– /dev/net/tun has been added to the default allowed devices, which helps users run Podman inside containers.
– A new crio check subcommand.
\\n\\n\\n\\nIt’s worth mentioning some deprecations and removals in CRI-O v1.31 which may affect existing users:
\\n\\n\\n\\n– Removal of the registries config in crio.image as well as the --registry CLI argument, which had already been deprecated.
– The crio config --migrate-defaults subcommand has been removed (deprecated in v1.28).
\\n\\n\\n\\nBesides features and removals, the CRI-O maintainers fixed bugs and addressed CVEs to ensure CRI-O’s stability over the past releases. A full list of them can be found in the official release notes.
\\n\\n\\n\\nI would like to take this opportunity to give a huge shoutout to all contributors and maintainers of CRI-O for this awesome job! 🙌
\\n\\n\\n\\nIf you want to give CRI-O v1.31 a try, then feel free to head over to our official packaging repository, which supports most deb- and rpm-based distributions.
\\n\\n\\n\\nIf you have any questions or feedback, feel free to reach out using the Kubernetes Slack #crio channel or create an issue in the official repository.
\\n\\nMember post originally published on CyberArk’s blog by Shlomo Heigh
\\n\\n\\n\\nIn today’s fast-paced world of DevOps and cloud-native applications, managing secrets securely is critical. CyberArk Conjur, a trusted solution for secrets management, has taken a significant step by integrating seamlessly with the External Secrets Operator (ESO). This collaboration brings together the best of both worlds: Conjur’s robust secret management capabilities and ESO’s flexibility in handling secrets across Kubernetes clusters.
\\n\\n\\n\\nAt CyberArk, we strive to provide tools that companies can use to keep their most sensitive information secure. We also do our best to ensure information can be accessed when needed and only by an authorized person or process. In some ways, that’s the more challenging part of secrets management, especially when considering all the “places” where processes may be running – in public, private or hybrid clouds; SaaS services; on-prem workloads; and many more. This is why the Conjur team provides several APIs and native integrations to allow applications to seamlessly retrieve the secrets they need just as they need them.
\\n\\n\\n\\nWith the massive growth of the cloud native ecosystem, particularly Kubernetes and its various flavors, such as OpenShift and Tanzu, one open source project has become a major player in handling this challenge – External Secrets Operator (ESO). ESO is an open source project under the Cloud Native Computing Foundation (CNCF), the same foundation that oversees Kubernetes’ development. ESO is popular because it’s a plug-and-play system that can be added to a Kubernetes cluster to handle the fetching of secrets from many secrets management systems and seamlessly provide them to workloads running in the cluster.
\\n\\n\\n\\n\\n\\n\\n\\n\\n(Side note: There is another way to achieve similar results using the Kubernetes Secrets Store CSI Driver, which Conjur also supports. The pros and cons of each are out of the scope of this article. Still, the primary differences are that the CSI driver mounts secrets into pods, requires volumes and doesn’t use Kubernetes secrets, while ESO is at the cluster/namespace level and uses Kubernetes secrets.)
\\n
We’re pleased to announce that we’ve recently worked with the ESO maintainers team to increase the support for fetching secrets from Conjur using ESO. In this post, we’ll walk you through setting up a Kubernetes environment with an application that uses secrets stored in Conjur and provided with ESO.
\\n\\n\\n\\nLet’s jump right in!
\\n\\n\\n\\n\\n\\n\\n\\n\\nNote: This walkthrough will be quite technical and have plenty of code. It’s intended for those already familiar with Kubernetes and shell scripts.
\\n
A scripted demo is available at https://github.com/conjurdemos/Accelerator-K8s-External-Secrets/, simplifying the process of creating a proof of concept without needing to delve into all the intricacies. This post goes into all the details for those who want to understand exactly what’s happening at each step.
\\n\\n\\n\\nIn this demo, we will use the distribution of Kubernetes included in Docker Desktop. You can enable it in the Settings screen, as seen here:
\\n\\n\\n\\nTo illustrate using Conjur and ESO in a Kubernetes environment, we’ll deploy an application that relies on database connection details to run. We will then have ESO fetch those details from Conjur and inject them into Kubernetes Secrets, where the app can read them.
\\n\\n\\n\\nWe will use our Pet Store Demo app for the application. You can see the code at https://github.com/conjurdemos/pet-store-demo, and its image is on DockerHub as cyberark/demo-app. This app requires four secrets as environment variables: DB_URL, DB_USERNAME, DB_PASSWORD, and DB_PLATFORM. We want to store these values in a secure location, so we’ll start by setting up Conjur as our secret store.
\\n\\n\\n\\nWe can easily install an instance of Conjur OSS right in our Kubernetes cluster using the Conjur OSS Helm charts. If you already have a Conjur instance, you can skip to the next section (Configuring Conjur).
\\n\\n\\n\\n$ CONJUR_NAMESPACE=conjur\\n$ kubectl create namespace \\"$CONJUR_NAMESPACE\\"\\n$ DATA_KEY=\\"$(docker run --rm cyberark/conjur data-key generate)\\"\\n$ helm repo add cyberark https://cyberark.github.io/helm-charts\\n$ helm install -n \\"$CONJUR_NAMESPACE\\" --set dataKey=\\"$DATA_KEY\\" --set authenticators=\\"authn\\\\,authn-jwt/eso\\" conjur cyberark/conjur-oss\\n$ CONJUR_ACCOUNT=demo\\n$ POD_NAME=$(kubectl get pods --namespace \\"$CONJUR_NAMESPACE\\" \\\\\\n -l \\"app=conjur-oss,release=conjur\\" \\\\\\n -o jsonpath=\\"{.items[0].metadata.name}\\")\\n# This will create an account and print the API key. Store this in a safe place.\\n$ kubectl exec --namespace $CONJUR_NAMESPACE \\\\\\n $POD_NAME \\\\\\n --container=conjur-oss \\\\\\n -- conjurctl account create $CONJUR_ACCOUNT | tail -1\\n\\n
\\n\\n\\n\\nYou now have an instance of Conjur running in your local Kubernetes cluster. To configure Conjur with the secrets the Pet Store Demo app will need, we must connect to it using the Conjur CLI. Since we’ve installed Conjur in Kubernetes, let’s go ahead and create a pod in the same namespace where we can run the CLI.
\\n\\n\\n\\nSave the following file as cli.yml:
\\n\\n\\n\\napiVersion: apps/v1\\nkind: Deployment\\nmetadata:\\n name: conjur-cli\\n labels:\\n app: conjur-cli\\nspec:\\n replicas: 1\\n selector:\\n matchLabels:\\n app: conjur-cli\\n template:\\n metadata:\\n name: conjur-cli\\n labels:\\n app: conjur-cli\\n spec:\\n containers:\\n - name: conjur-cli\\n image: cyberark/conjur-cli:8\\n command: [\\"sleep\\"]\\n args: [\\"infinity\\"]\\n
\\n\\n\\n\\nNow run the following command to create the Conjur CLI deployment defined in the file:
\\n\\n\\n\\n$ kubectl apply -f cli.yml -n $CONJUR_NAMESPACE\\n
\\n\\n\\n\\nNow, let’s get the name of the newly created CLI pod and log in to our Conjur instance.
\\n\\n\\n\\n$ CLI_POD_NAME=$(kubectl get pods --namespace \\"$CONJUR_NAMESPACE\\" -l \\"app=conjur-cli\\" -o jsonpath=\\"{.items[0].metadata.name}\\")\\n$ CONJUR_URL=https://conjur-conjur-oss.$CONJUR_NAMESPACE.svc.cluster.local\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -ti -- conjur init -a $CONJUR_ACCOUNT -u $CONJUR_URL --self-signed\\n# Now login. When prompted for a password, paste in the API key returned from when you created the account above\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -ti -- conjur login -i admin\\n
\\n\\n\\n\\nWe now have a pod running with the CLI and are logged into our Conjur instance as the administrator. We’re now ready to start creating policies and variables!
\\n\\n\\n\\nWe need to configure Conjur to:
\\n\\n\\n\\nTo avoid storing an API key to access Conjur, another secret to manage, we will use the Kubernetes native Service Account Tokens to allow ESO to authenticate to Conjur. Let’s start by creating a Conjur policy file that defines a JWT authenticator that ESO will use to authenticate. Create the following file and save it as authn-jwt.yml:
\\n\\n\\n\\n- !policy\\n id: conjur/authn-jwt/eso\\n annotations:\\n description: Configuration for AuthnJWT service\\n body:\\n - !webservice\\n\\n # - !variable jwks-uri\\n - !variable public-keys\\n - !variable issuer\\n - !variable token-app-property\\n - !variable identity-path\\n - !variable audience\\n\\n # Group of applications that can authenticate using this JWT Authenticator\\n - !group users\\n \\n - !permit\\n role: !group users\\n privilege: [ read, authenticate ]\\n resource: !webservice\\n\\n - !webservice status\\n \\n # Group of users who can check the status of the JWT Authenticator\\n - !group operators\\n \\n - !permit\\n role: !group operators\\n privilege: [ read ]\\n resource: !webservice status
\\n\\n\\n\\nNow, create another file called authn-jwt-apps.yml. We’re going to use “demo-app” for the namespace name and “demo-app-sa” for the service account name:
\\n\\n\\n\\n- !policy\\n id: conjur/authn-jwt/eso/apps\\n annotations:\\n description: Identities permitted to authenticate with the AuthnJWT service\\n body:\\n - !group\\n\\n - &hosts\\n - !host\\n id: system:serviceaccount:demo-app:demo-app-sa\\n annotations:\\n authn-jwt/eso/sub: system:serviceaccount:demo-app:demo-app-sa\\n\\n - !grant\\n role: !group\\n members: *hosts\\n\\n- !grant\\n role: !group conjur/authn-jwt/eso/users\\n member: !group conjur/authn-jwt/eso/apps
\\n\\n\\n\\nNow, create a third and final policy file that will define the secrets the application needs and grant ESO access. Save this one as secrets.yml:
\\n\\n\\n\\n- !policy\\n id: secrets\\n body:\\n - &variables\\n - !variable db/url\\n - !variable db/username\\n - !variable db/password\\n - !variable db/platform\\n\\n - !group users\\n\\n - !permit\\n resources: *variables\\n role: !group users\\n privileges: [ read, execute ]\\n\\n- !grant\\n role: !group secrets/users\\n members:\\n - !host conjur/authn-jwt/eso/apps/system:serviceaccount:demo-app:demo-app-sa\\n
\\n\\n\\n\\nWe must now copy these files to the CLI pod and load them into Conjur. Assuming you’re using the Conjur instance we created in the previous step, the commands would be as follows:
\\n\\n\\n\\n$ kubectl cp -n $CONJUR_NAMESPACE authn-jwt.yml $CLI_POD_NAME:/\\n$ kubectl cp -n $CONJUR_NAMESPACE authn-jwt-apps.yml $CLI_POD_NAME:/\\n$ kubectl cp -n $CONJUR_NAMESPACE secrets.yml $CLI_POD_NAME:/\\n\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur policy load -b root -f authn-jwt.yml\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur policy load -b root -f authn-jwt-apps.yml\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur policy load -b root -f secrets.yml\\n
\\n\\n\\n\\nNow, let’s populate the values of all those variables we just created. First we’ll do the ones necessary for the JWT authenticator. This will allow Conjur to verify that Kubernetes has issued the JWT presented by ESO.
\\n\\n\\n\\n# Get the necessary JWT info from the Kubernetes API\\n$ ISSUER=\\"$(kubectl get --raw /.well-known/openid-configuration | jq -r \'.issuer\')\\"\\n$ JWKS_URI=\\"$(kubectl get --raw /.well-known/openid-configuration | jq -r \'.jwks_uri\')\\"\\n$ kubectl get --raw \\"$JWKS_URI\\" > jwks.json\\n$ kubectl cp -n $CONJUR_NAMESPACE jwks.json $CLI_POD_NAME:/\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"conjur/authn-jwt/eso/token-app-property\\" -v \\"sub\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"conjur/authn-jwt/eso/issuer\\" -v \\"$ISSUER\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"conjur/authn-jwt/eso/public-keys\\" -v \\"{\\\\\\"type\\\\\\":\\\\\\"jwks\\\\\\", \\\\\\"value\\\\\\":$(cat jwks.json)}\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"conjur/authn-jwt/eso/identity-path\\" -v \\"conjur/authn-jwt/eso/apps\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"conjur/authn-jwt/eso/audience\\" -v \\"https://conjur-conjur-oss.$CONJUR_NAMESPACE.svc.cluster.local\\"
\\n\\n\\n\\nNote: We need to make sure that the JWT authenticator is enabled in the Conjur configuration. In our setup above, we did this by providing the “authenticators” property in the `helm install` command. If you’re using a different Conjur instance, make sure you enable it by following the steps in the “Step 2: Allowlist the authenticators” section of the documentation (TL;DR: add “authn-jwt/eso” to the “CONJUR_AUTHENTICATORS” environment variable on the Conjur container).
\\n\\n\\n\\nWe can now add the values for the demo app’s secrets. Later, we’ll use these same values to configure the database service.
\\n\\n\\n\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"secrets/db/url\\" -v \\"postgresql://db.demo-app.svc.cluster.local:5432/demo-app\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"secrets/db/username\\" -v \\"db-user\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"secrets/db/password\\" -v \\"P0stgre5P@ss%\\"\\n$ kubectl exec -n $CONJUR_NAMESPACE $CLI_POD_NAME -- conjur variable set -i \\"secrets/db/platform\\" -v \\"postgres\\"
\\n\\n\\n\\nLet’s install ESO in our cluster to act as a broker that pulls secrets from Conjur. You can use any version since v0.9.17; by default, this will install the newest release.
\\n\\n\\n\\n$ helm repo add external-secrets https://charts.external-secrets.io\\n$ helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
\\n\\n\\n\\nWe’ll need a Kubernetes namespace to put all of our demo app-related objects. Let’s create one now. This will allow us to use separate ESO stores if we want to provide secrets to a different app running in a different namespace. Remember, it’s always best to limit the access to secrets to the minimum number of processes. In this case, we only need them in the demo app’s namespace.
\\n\\n\\n\\n$ kubectl create namespace demo-app
\\n\\n\\n\\nNow, let’s load some configuration, so ESO knows how to connect to Conjur using a Kubernetes service account token. Copy the following into an editor and save it as service-account.yml:
\\n\\n\\n\\n---\\napiVersion: v1\\nkind: ServiceAccount\\nmetadata:\\n name: demo-app-sa\\n namespace: demo-app\\n---\\napiVersion: v1\\nkind: Secret\\ntype: kubernetes.io/service-account-token\\nmetadata:\\n name: demo-app-sa-secret\\n namespace: demo-app\\n annotations:\\n kubernetes.io/service-account.name: \\"demo-app-sa\\"
\\n\\n\\n\\nNow we need to create a SecretStore that will tell ESO how to connect to Conjur using the Service Account Token “demo-app-sa” we created and gave access to the demo app’s secrets. But first, you’ll need to replace the value of the CA bundle with the CA cert used by Conjur since we’re using a self-signed certificate that ESO won’t know to trust. You can get the value by running:
\\n\\n\\n\\nkubectl get secret -n conjur conjur-conjur-ssl-ca-cert -o jsonpath=\\"{.data[\'tls\\\\.crt\']}\\"
\\n\\n\\n\\nNow save the following into a “eso-jwt-provider.yml” file. Note that you’ll need to replace the URLs and other connection details if you’re using different settings.
\\n\\n\\n\\n---\\napiVersion: external-secrets.io/v1beta1\\nkind: SecretStore\\nmetadata:\\n name: conjur-jwt\\n namespace: demo-app\\nspec:\\n provider:\\n conjur:\\n url: https://conjur-conjur-oss.conjur.svc.cluster.local\\n caBundle: # Paste output of the previous command here\\n auth:\\n jwt:\\n account: demo\\n serviceID: eso\\n serviceAccountRef:\\n name: demo-app-sa\\n audiences:\\n - https://conjur-conjur-oss.conjur.svc.cluster.local
\\n\\n\\n\\nLoad these manifest files into Kubernetes:
\\n\\n\\n\\n$ kubectl apply -f service-account.yml\\n$ kubectl apply -f eso-jwt-provider.yml
\\n\\n\\n\\nNow that we have the basic configuration for ESO set up, let’s install our demo app.
\\n\\n\\n\\nBefore creating the demo application, let’s create the database service it’ll connect to.
We’re using the same credentials that we populated in the Conjur variables for the demo app.
$ helm repo add bitnami https://charts.bitnami.com/bitnami\\n$ helm install postgresql bitnami/postgresql -n demo-app --set \\"auth.username=db-user\\" --set \\"auth.password=P0stgre5P@ss%\\" --set \\"auth.database=demo-app\\" --set \\"fullnameOverride=db\\" --set \\"tls.enabled=true\\" --set \\"tls.autoGenerated=true\\"
\\n\\n\\n\\nSave the following manifest to a file called “demo-app.yml”:
\\n\\n\\n\\n---\\napiVersion: v1\\nkind: Service\\nmetadata:\\n name: demo-app\\n namespace: demo-app\\n labels:\\n app: demo-app\\nspec:\\n ports:\\n - protocol: TCP\\n port: 8080\\n targetPort: 8080\\n selector:\\n app: demo-app\\n type: NodePort\\n---\\napiVersion: apps/v1\\nkind: Deployment\\nmetadata:\\n labels:\\n app: demo-app\\n name: demo-app\\n namespace: demo-app\\nspec:\\n replicas: 1\\n selector:\\n matchLabels:\\n app: demo-app\\n template:\\n metadata:\\n labels:\\n app: demo-app\\n spec:\\n serviceAccountName: demo-app-sa\\n containers:\\n - name: demo-app\\n image: cyberark/demo-app:latest\\n imagePullPolicy: IfNotPresent\\n ports:\\n - name: http\\n containerPort: 8080\\n readinessProbe:\\n httpGet:\\n path: /pets\\n port: http\\n initialDelaySeconds: 15\\n timeoutSeconds: 5\\n env:\\n - name: DB_URL\\n valueFrom:\\n secretKeyRef:\\n name: db-credentials\\n key: url\\n - name: DB_USERNAME\\n valueFrom:\\n secretKeyRef:\\n name: db-credentials\\n key: username\\n - name: DB_PASSWORD\\n valueFrom:\\n secretKeyRef:\\n name: db-credentials\\n key: password\\n - name: DB_PLATFORM\\n valueFrom:\\n secretKeyRef:\\n name: db-credentials\\n key: platform
\\n\\n\\n\\nYou can see that this app will take its environment variables from Kubernetes secrets. But we haven’t created any yet! Let’s add configuration to ESO to instruct it to sync those secrets from Conjur into Kubernetes, where the app can reach them.
\\n\\n\\n\\nSave the following as “external-secret.yml”:
\\n\\n\\n\\n---\\napiVersion: external-secrets.io/v1beta1\\nkind: ExternalSecret\\nmetadata:\\n name: external-secret\\n namespace: demo-app\\nspec:\\n # Optional: refresh the secret at an interval\\n # refreshInterval: 10s\\n secretStoreRef:\\n name: conjur-jwt\\n kind: SecretStore\\n target:\\n # The Kubernetes secret that will be created\\n name: db-credentials\\n creationPolicy: Owner\\n data:\\n # The keys in the Kubernetes secret that will be populated,\\n # along with the path of the Conjur secret that will be used\\n - secretKey: url\\n remoteRef:\\n key: secrets/db/url\\n - secretKey: username\\n remoteRef:\\n key: secrets/db/username\\n - secretKey: password\\n remoteRef:\\n key: secrets/db/password\\n - secretKey: platform\\n remoteRef:\\n key: secrets/db/platform
\\n\\n\\n\\nNow, for the moment of truth, let’s load these files into Kubernetes.
\\n\\n\\n\\n$ kubectl apply -f external-secret.yml\\n$ kubectl apply -f demo-app.yml
\\n\\n\\n\\nIf all goes well, you should be able to see a successful startup message in the demo app’s logs:
\\n\\n\\n\\n$ DEMO_POD_NAME=$(kubectl get pods --namespace demo-app -l \\"app=demo-app\\" -o jsonpath=\\"{.items[0].metadata.name}\\")\\n$ kubectl logs $DEMO_POD_NAME -n demo-app
\\n\\n\\n\\n
This means that the app is successfully connecting to the database using the credentials stored in Conjur.
You can even verify that the app is working by deploying a pod containing ‘curl’ and running a test query to the app:
---\\napiVersion: v1\\nkind: Pod\\nmetadata:\\n name: curl\\n labels:\\n name: curl\\nspec:\\n containers:\\n - name: curl\\n image: curlimages/curl:latest\\n imagePullPolicy: Always\\n command: [\\"sh\\", \\"-c\\", \\"tail -f /dev/null\\"]
\\n\\n\\n\\nAfter applying that, run:
\\n\\n\\n\\n$ kubectl exec curl -- curl -X POST -H \'Content-Type: application/json\' --data \'{\\"name\\":\\"Accelerator Alice\\"}\' http://demo-app.demo-app.svc.cluster.local:8080/pet\\n$ kubectl exec curl -- curl http://demo-app.demo-app.svc.cluster.local:8080/pets
\\n\\n\\n\\nIt should successfully create a new “pet” in the first request and return it when you run the second. This illustrates that the app can perform database operations using the credentials stored and retrieved from Conjur.
\\n\\n\\n\\nAnd that’s it! There are tons more options – for example, you can adjust your external-secret.yml spec to fetch several Conjur secrets by searching with a regex or matching on annotations, but for that, you’ll need to check out the documentation! Hopefully, this provides a helpful starting point, and you will now have a deep understanding of how the Conjur – ESO integration works.
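\\n\\n\\n\\nAs a quick taste of those options, an ExternalSecret that bulk-fetches Conjur variables via a name regex might look roughly like this sketch (the exact matchers supported by the Conjur provider are described in the ESO documentation):
\\n\\n\\n\\n# Replace the per-key "data" entries with a bulk "dataFrom" lookup (illustrative).
kubectl apply -f - <<'EOF'
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: external-secret-bulk
  namespace: demo-app
spec:
  secretStoreRef:
    name: conjur-jwt
    kind: SecretStore
  target:
    name: db-credentials-bulk        # the Kubernetes secret that will be created
    creationPolicy: Owner
  dataFrom:
    - find:
        name:
          regexp: "^secrets/db/.*"   # pull every Conjur variable whose name matches
EOF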
\\n\\n\\n\\nIf you’ve read to the end, thank you! You’re a champ 💪
Now, go secure your software!
Shlomo is a senior software engineer at CyberArk working on Conjur Secrets Manager. He’s an open source and AppSec enthusiast, a member of the CNCF TAG Security and a contributor to multiple OWASP projects. In his free time, you can find him spending time with his wife and daughter, 3D printing, woodworking or hiking.
\\nWith KubeCon + CloudNativeCon North America 2024 just a few months away we thought it would be fun to ask our ambassadors and other locals about where to go and what to do while we’re all in Salt Lake City. Time to start planning the before and after conference fun now!
\\n\\n\\n\\nOne of the most “Utah things to do,” according to Ambassador Taylor Thomas, is to visit a soda shop. Recommendations include Sodalicious, Swig and Fiiz ”for lots of sugar and something unique.” Ambassador Matt Asay seconds this: “Utah is also home to a very peculiar other way of getting caffeine that everyone should try. We have a wide array of “dirty soda” places. Think Coke Zero + half-and-half + raspberry puree. Trust me: sounds strange but tastes great.”
\\n\\n\\n\\nOur ambassadors *love* the local coffee scene and here are some of their favorite choices:
\\n\\n\\n\\nCaffe d’Bolla – Ambassador Matt likens their “siphon coffee” experience to what you might find in San Francisco or Seattle. Allow an hour for the siphon and you do need to book in advance.
\\n\\n\\n\\nThere’s a lot in Salt Lake City, but there is also much to see around it. Ambassador Alan Clark suggests scheduling in time to visit at least one of Utah’s national or state parks because it is what the state is best known for. “Arches, Canyonlands, Zions and Bryce Canyon can be visited in a very long day trip. Bryce or Arches are the closest – both about a four-hour drive from Salt Lake. Bryce has the highest elevation (8,000 feet / 2,438 m). And November is also a great time to visit the parks as the crowds are gone, the temperatures are good for hiking, and the colors in the red rocks are more vibrant.”
\\n\\n\\n\\nIf you want to get away but stay closer to Salt Lake, Ambassador Matt suggests Park City. “It’s fun to drive up Big Cottonwood Canyon and over Guardsman Pass into Park City (<30 mins away if you drive directly to PC from SLC, but longer if you take the Big Cottonwood way) and walk Main Street there. Great vibe, great restaurants, and there will be snow on the ground in November. It’s a pretty drive.”
\\n\\n\\n\\nOpportunities to hike and explore abound around Salt Lake, but remember it’s going to be the second week of November so it will likely be cold and potentially snowy in some areas. Always check the weather before heading out.
\\n\\n\\n\\nHikers will want to explore this guide to local hiking around Salt Lake, and here’s a look at rock climbing options near the city. Locals let us know this isn’t a bad time perhaps to visit Great Salt Lake Park, and Big Cottonwood Canyon is always popular.
\\n\\n\\n\\nUtah is a fossil and dinosaur lover’s dream, so the Natural History Museum of Utah is a must-see. Don’t miss the world’s largest collection of horned dinosaur skulls.
\\n\\n\\n\\nIt may come as a surprise to find that the stretch from Provo, Utah to Salt Lake City is known as the “Silicon Slopes” and it’s considered a high tech haven. In fact, the Wall Street Journal said Salt Lake had the country’s hottest job market in 2023. And research firm FDi Intelligence is predicting Utah is going to have the highest percentage of tech job growth between 2023 and 2033. Learn more about this new “tech paradise” you’ll be visiting during KubeCon + CloudNativeCon North America 2024.
\\n\\nCo-chairs: Eduardo Silva, Chronosphere, Austin Parker, Honeycomb, Anna Kapuscinska, Isovalent at Cisco
November 12, 2024
Salt Lake City, Utah
Observability is a journey, and in a diverse ecosystem like ours, it’s easy to get confused or use the wrong tools for specific problems. Observability Day is an open space where we can learn together, make valuable connections, and shape the future of observability. And you’ll be surprised by how many new project features are born from the feedback and ideas discussed in the hallways after this event.
\\n\\n\\n\\nThe origins of Observability Day go back to KubeCon Europe in Valencia in 2022, when we hosted FluentCon, a co-located event dedicated solely to the Fluentd and Fluent Bit projects. The event was a success, and after receiving valuable feedback from our community, we realized we could offer even more to observability practitioners at KubeCon. To do this, we decided to broaden our focus, inviting other CNCF projects to join us.
\\n\\n\\n\\nJust a few months later, at KubeCon US 2022, we launched Open Observability Day, bringing together maintainers and community members from projects like Jaeger, Fluentd, Thanos, Prometheus and OpenTelemetry. The event was a great success, and it became clear that we needed to formalize it with a more structured approach, working closely with the CNCF, and that’s when Observability Day was born.
\\n\\n\\n\\nBy 2023 and 2024, the event had grown organically to the point where Observability Day now features two rooms and boasts a high attendance rate.
\\n\\n\\n\\nWho will get the most out of attending this event?
\\n\\n\\n\\nThis event is perfect for anyone interested in observability, whether you’re an experienced practitioner or just getting started. If you’re navigating the complexities of monitoring and using various tools, looking for innovative solutions, or simply eager to learn more about observability, this event will be valuable for you.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nThis year, what’s new and different is the variety and quality of presentations. We were pleasantly surprised by the high number of submissions and the balanced range of topics covered. This marks a significant improvement from previous years, offering attendees an even richer and more diverse learning experience.
\\n\\n\\n\\nWhat will the day look like?
\\n\\n\\n\\nThe event will begin in the main room with a brief welcome from the co-chairs. Following that, maintainers from various CNCF projects will provide updates on their latest developments. Afterward, we’ll dive into the technical sessions, which will run simultaneously in two dedicated rooms. This setup ensures a dynamic and engaging experience, offering a wide range of topics for attendees to choose from.
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nNo specific preparation is needed, but we strongly recommend that attendees review the schedule in advance to plan their day and make the most of the sessions, especially when switching between rooms. If you’re interested in connecting with any of the Observability Day speakers, this is a great opportunity! Be sure to reach out to them on Slack or through their preferred channels to start the conversation.
\\n\\n\\n\\nAnything else you would like to say about this event?
\\n\\n\\n\\nThis event is FOR YOU! Come with an open mindset, ready to exchange ideas and connect with fellow observability practitioners. Whether you’re new to the space or a seasoned veteran, there’s always something new to learn and share. So dive in, engage with others, and make the most of this opportunity! 🙂
\\n\\n\\n\\nSubmitted by the co-chairs who are looking forward to continuing to help build a strong community.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nCommunity post by Alexander Schwartz, Keycloak maintainer
\\n\\n\\n\\nKeyConf24, our 2024 Keycloak Identity Summit, will happen on September 19th, which is just around the corner! This year’s event promises to be even bigger and better, with a program packed full of relevant, cutting-edge topics.
\\n\\n\\n\\nThe event is organized by the Keycloak OAuth2 Special Interest Group which contributed to building OpenID FAPI standards into Keycloak (CNCF Incubating project). There is more standards-related work in progress, and it is a great opportunity to interact with the contributors and be part of such amazing contributions.
\\n\\n\\n\\nThis year, we’ll meet in Vienna and the whole event will be live streamed for the first time. We would like to invite the Keycloak community to join us and contribute to the discussions remotely: due to high demand and limited space on-site, in-person tickets are already gone.
\\n\\n\\n\\nOur program committee selected great talks and the program is now online at https://keyconf.dev/
\\n\\n\\n\\nYou can expect talks about:
\\n\\n\\n\\nYou can register for the live stream at https://keyconf.dev/
\\n\\n\\n\\nWe’re excited and are looking forward to meeting you at our event on Sep 19. Let’s continue to shape the future of identity together!
\\n\\nCommunity post originally published on Dev.to by Syed Asad Raza
\\n\\n\\n\\nKubernetes plugins, or “kubectl plugins,” are tools that extend the functionality of the kubectl command-line tool. These plugins can be developed by the community or Kubernetes administrators to add specific features or automate tasks. They are designed to be seamlessly integrated into your existing Kubernetes setup, providing extra capabilities while maintaining the core functionality of Kubernetes.
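\\n\\n\\n\\nUnder the hood, a kubectl plugin is simply an executable on your PATH whose name starts with kubectl-. As a tiny illustration (the plugin name and install path are arbitrary):
\\n\\n\\n\\n# Create a trivial "kubectl hello" plugin (any executable named kubectl-<name> works).
cat <<'EOF' | sudo tee /usr/local/bin/kubectl-hello >/dev/null
#!/usr/bin/env bash
echo "Hello from a kubectl plugin! Current context: $(kubectl config current-context)"
EOF
sudo chmod +x /usr/local/bin/kubectl-hello

kubectl hello            # kubectl discovers the plugin by its kubectl- name prefix
kubectl plugin list      # shows every plugin found on your PATH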
\\n\\n\\n\\nPlugins can help you:
\\n\\n\\n\\nTo install plugins in Kubernetes, follow these steps:
\\n\\n\\n\\nEnsure you have the following prerequisites:
\\n\\n\\n\\nKrew is a plugin manager for kubectl that makes it easy to discover and install plugins. Follow these steps to install Krew:
\\n\\n\\n\\n(\\n set -x; cd \\"$(mktemp -d)\\" &&\\n OS=\\"$(uname | tr \'[:upper:]\' \'[:lower:]\')\\" &&\\n ARCH=\\"$(uname -m | sed \'s/x86_64/amd64/;s/arm.*/arm/;s/aarch64$/arm64/\')\\" &&\\n KREW=\\"krew-${OS}_${ARCH}\\" &&\\n curl -fsSLO \\"https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz\\" &&\\n tar zxvf \\"${KREW}.tar.gz\\" &&\\n ./\\"${KREW}\\" install krew\\n)\\n
\\n\\n\\n\\nexport PATH=\\"${KREW_ROOT:-$HOME/.krew}/bin:$PATH\\"\\n
\\n\\n\\n\\nkubectl krew\\n
\\n\\n\\n\\nYou should see a list of Krew commands if the installation was successful.
\\n\\n\\n\\nWith Krew installed, you can now search for and install plugins. Here’s how:
\\n\\n\\n\\nkubectl krew search\\n
\\n\\n\\n\\nThis command lists all available plugins.
\\n\\n\\n\\nkubectl krew install neat\\n
\\n\\n\\n\\nkubectl neat -f my-pod.yaml\\n
\\n\\n\\n\\nThis command will clean up the my-pod.yaml file to make it more readable.
\\n\\n\\n\\nHere are some essential plugins that every Kubernetes user should consider installing:
\\n\\n\\n\\nInstallation:
\\n\\n\\n\\nkubectl krew install neat\\n
\\n\\n\\n\\nUsage:
\\n\\n\\n\\nkubectl get pod my-pod -o yaml | kubectl neat\\n
\\n\\n\\n\\nInstallation:
\\n\\n\\n\\nkubectl krew install ctx\\nkubectl krew install ns\\n
\\n\\n\\n\\nUsage:
\\n\\n\\n\\nkubectl ctx # List all contexts\\nkubectl ctx my-context # Switch to \'my-context\'\\nkubectl ns my-namespace # Switch to \'my-namespace\'\\n
\\n\\n\\n\\nInstallation:
\\n\\n\\n\\nkubectl krew install who-can\\n
\\n\\n\\n\\nUsage:
\\n\\n\\n\\nkubectl who-can create pods\\n
\\n\\n\\n\\nInstallation:
\\n\\n\\n\\nkubectl krew install view-secret\\n
\\n\\n\\n\\nUsage:
\\n\\n\\n\\nkubectl view-secret my-secret\\n
\\n\\n\\n\\nInstallation:
\\n\\n\\n\\nkubectl krew install replace-image\\n
\\n\\n\\n\\nUsage:
\\n\\n\\n\\nkubectl replace-image deployment/my-deployment container-name=new-image:tag\\n
\\n\\n\\n\\nPlugins are a powerful way to extend Kubernetes’ functionality and streamline workflows. Using Krew to install and manage plugins, you can easily add new features to your Kubernetes toolkit and improve your cluster management capabilities. Start with the essential plugins mentioned in this guide and explore the extensive list of available plugins to find tools that best suit your needs.
\\n\\nGet to know Daiki
\\n\\n\\n\\nThis week’s Kubestronaut in Orbit, Daiki Takasao, is a Japanese IT infrastructure engineer at NRI. He works with CNCF technologies to build financial IT systems and has been using Kubernetes, Linkerd, and Prometheus since 2021.
\\n\\n\\n\\nIf you’d like to be a Kubestronaut like Daiki, get more details on the CNCF Kubestronaut page.
\\n\\n\\n\\nWhen did you get started with Kubernetes–what was your first project?
\\n\\n\\n\\nIn 2021, I was assigned to a project to build a microservices infrastructure for a Japanese financial institution using Kubernetes. That was when I started using Kubernetes for the first time. Before that, I had done some research on Kubernetes for our own R&D, but this project was the first time I started using Kubernetes in earnest.
\\n\\n\\n\\nIn addition to Kubernetes, this project used many other CNCF technologies such as Linkerd and Prometheus. It was very difficult to learn these technologies and design the architecture, but the system is still running stably and I now have a lot of confidence in the high quality of Kubernetes and CNCF technologies.
\\n\\n\\n\\nWhat are the primary CNCF projects you work on or use today? What projects have you enjoyed the most in your career?
\\n\\n\\n\\nHere are some of the CNCF projects that I have used:
\\n\\n\\n\\n– Linkerd
– Fluentd
– gRPC
– Keycloak
\\n\\n\\n\\n\\n\\n\\n\\nThe most enjoyable project is Kubernetes. There are so many components that I had a hard time understanding at first, but now that I have a better understanding of it, the scalability of Kubernetes really amazes me.
\\n\\n\\n\\nPrometheus is another project that I have a lot of fondness for. When I learned that I could automatically monitor pods generated by service discovery, I was surprised at how different it was from the static monitoring I had been doing.
\\n\\n\\n\\nI’m also very attached to Linkerd. Linkerd automatically injects itself into application pods without having to prepare a special manifest, and it automatically handles mTLS and gRPC communications. I think it is a good OSS that configures a service mesh without any hassle.
\\n\\n\\n\\nHow have the certs helped you in your career?
\\n\\n\\n\\nThe Kubernetes certifications CKS, CKA, and CKAD are hands-on exams. Therefore, studying for the certification requires not only theoretical study, but also hands-on learning. Through the certification, you will acquire not only knowledge of Kubernetes, but also the basic skills to build and troubleshoot application environments on Kubernetes by actually doing the work. It was very valuable for me.
\\n\\n\\n\\nThe certification is also a way for me to prove my skills in Kubernetes. I believe that the certification was very effective in gaining more trust from my clients.
\\n\\n\\n\\nHow has CNCF helped you or influenced your career?
\\n\\n\\n\\nBecause Kubernetes and CNCF technologies are not dependent on a specific cloud vendor, I have been able to form a multi-cloud skill set by becoming familiar with these technologies.
\\n\\n\\n\\nParticularly in the Japanese financial industry, I believe that the introduction of Kubernetes and CNCF technology systems is in a transitional state. In this context, I believe that the fact that I was able to develop these skills was very effective in increasing the scarcity value of my skills as an IT infrastructure engineer.
\\n\\n\\n\\nWhat are some other books/sites/courses you recommend for people who want to work with k8s?
\\n\\n\\n\\nHere are some websites and learning courses that have been very helpful to me in learning Kubernetes.
\\n\\n\\n\\nWebsites
\\n\\n\\n\\nOnline Learning Courses
\\n\\n\\n\\nWhat do you do in your free time?
\\n\\n\\n\\nI enjoy going out with my family. I especially like to go camping with my family.
\\n\\n\\n\\nWe have two daughters, ages 9 and 6, and I am thinking that as they get older, they may not want to go out with their father. Therefore I always want to go on many outings while I can.
\\n\\n\\n\\nWhat would you tell someone who is just starting their K8s certification journey? Any tips or tricks?
\\n\\n\\n\\nWhen I first started learning about Kubernetes, I was overwhelmed by the complexity of the many components and API objects that make up Kubernetes and the relationships between them. I believe this is why Kubernetes is often referred to as an OSS with a high learning cost.
\\n\\n\\n\\nFrom there, I read the documentation and executed kubectl commands to repeatedly manipulate API objects. I think my understanding of Kubernetes has gradually increased.
\\n\\n\\n\\nIn that sense, I think that studying for the Kubernetes certification is a very efficient way to learn Kubernetes skills. This is because the CKS, CKA, and CKAD are hands-on exams and require repeated hands-on Kubernetes operation to achieve certification.
\\n\\n\\n\\nBesides the many elements there are to learn, Kubernetes also evolves very quickly. It is important to be prepared to learn Kubernetes in a steady and continuous manner without being in a hurry. Otherwise, you will lose heart and will not be able to enjoy learning.
\\n\\n\\n\\nToday the Cloud native ecosystem is way more than Kubernetes.
\\n\\n\\n\\nDo you plan to get other cloud native certifications from the CNCF?
\\n\\n\\n\\nI am currently interested in the following certifications from CNCF. I have already applied for PCA and am currently studying for the certification. I have also recently become interested in Cilium and would like to focus my studies on acquiring certifications with an eye toward future system implementations.
\\n\\n\\n\\nHow did you get involved with cloud native and Kubernetes in general?
\\n\\n\\n\\nPreviously, as an infrastructure engineer, I designed and maintained infrastructure for a number of mission-critical systems. Then I had the opportunity to design and maintain infrastructure using Kubernetes and other cloud-native technologies.
\\n\\n\\n\\nAt first, I was very puzzled by the big change in the way of thinking about infrastructure. In the past, infrastructure design usually involved carefully designing and taking care of each server individually. The cloud-native infrastructure is constantly in flux, and the software automatically tracks and repairs these changes. This requires a more holistic approach to management than managing each individual server separately.
\\n\\n\\n\\nI would like to continue to educate Japanese engineers about the fundamental differences in the way of thinking when building cloud-native systems and the great benefits that can be gained from such systems. There are still many Japanese systems with legacy architectures. I would like to contribute to improving the productivity of Japanese companies by promoting the shift to cloud-native systems.
\\n\\n\\n\\nIn addition, the progress of CNCF technologies, especially Kubernetes, is very fast, and it is very difficult to keep up with them. However, it is always an interesting task that stimulates my curiosity. I would like to continue to actively introduce these latest technologies to Japanese companies and develop their systems.
\\n\\n\\n\\nSome final thoughts
\\n\\n\\n\\nAs other Kubestronauts have said, Kubernetes is an OSS that honestly takes time to learn. There are many elements that make up a Kubernetes cluster and many API resources that can be created on top of it. In addition, you can create your own API resources with Custom Resource Definitions, and the Kubernetes ecosystem is endless with these resources.
\\n\\n\\n\\nHowever, there is a huge benefit to learning these resources in acquiring portable skills that are applicable across multiple clouds, independent of a specific cloud vendor.
\\n\\n\\n\\nThere are no shortcuts to learning Kubernetes. I believe that by continuing to learn steadily while operating a Kubernetes cluster, you will gradually increase your understanding and develop the skills to design system architectures using Kubernetes.
\\n\\n\\n\\nI really love CNCF technology centered on Kubernetes, and I would like to continue to study it diligently. I hope this article will interest some of you in Kubernetes and CNCF technology. Let’s enjoy our voyage into the Kubernetes ocean together!
\\n\\nMember post originally published on the Taikun blog
\\n\\n\\n\\nIn the ever-evolving landscape of cloud-native technologies, managing deployments in Kubernetes clusters has become increasingly complex. Enter ArgoCD, a powerful tool that simplifies and automates the deployment process using the GitOps methodology. This blog post will dive deep into ArgoCD, exploring its features, benefits, and how to get started with it.
\\n\\n\\n\\nArgoCD is an open-source, declarative, GitOps continuous delivery tool for Kubernetes. It automates the deployment of applications to Kubernetes clusters by monitoring Git repositories and syncing the desired state defined in Git with the actual state in the cluster.
\\n\\n\\n\\nArgoCD offers several significant advantages for managing Kubernetes deployments:
\\n\\n\\n\\nBy leveraging these features, ArgoCD streamlines Kubernetes deployments, enhances collaboration between development and operations teams, and provides a robust foundation for implementing continuous delivery in cloud-native environments.
\\n\\n\\n\\nNow that we understand what ArgoCD is and why it’s useful, let’s walk through the process of setting it up and deploying an application using the ArgoCD Helm chart.
\\n\\n\\n\\nBefore we begin, ensure you have the following:
\\n\\n\\n\\nWe’ll use the official ArgoCD Helm chart to install ArgoCD in our Kubernetes cluster.
\\n\\n\\n\\nhelm repo add argo https://argoproj.github.io/argo-helm
helm repo update
kubectl create namespace argocd
helm install argocd argo/argo-cd --namespace argocd
Once ArgoCD is installed, you can access its web interface:
\\n\\n\\n\\nkubectl port-forward svc/argocd-server -n argocd 8080:443
https://localhost:8080
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath=\\"{.data.password}\\" | base64 -d
Now that we have ArgoCD set up, let’s deploy a sample application:
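\\n\\n\\n\\nOne straightforward way to do this is declaratively, by creating an Application resource that points at a Git repository. The sketch below uses Argo’s public guestbook example; adjust the repository URL, path, and destination for your own application:
\\n\\n\\n\\nkubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook                  # directory in the repo containing the manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:                       # keep the cluster in sync with Git automatically
      prune: true
      selfHeal: true
EOF
\\n\\n\\n\\nYou can achieve the same result through the ArgoCD web UI or the argocd CLI if you prefer a more interactive workflow.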
\\n\\n\\n\\nArgoCD will now sync the application from the Git repository to your Kubernetes cluster. You can monitor the deployment progress in the ArgoCD UI.
\\n\\n\\n\\nArgoCD simplifies Kubernetes deployments by leveraging GitOps principles. It provides a powerful set of tools for managing applications across multiple clusters, ensuring consistency, and improving collaboration among team members. By following the steps outlined in this guide, you can start using ArgoCD to streamline your Kubernetes deployments and embrace the GitOps workflow.
\\n\\n\\n\\nAs you become more familiar with ArgoCD, you can explore its advanced features, such as custom health checks, resource hooks, and integration with CI/CD pipelines. With its robust ecosystem and active community, ArgoCD is well-positioned to become an essential tool in your Kubernetes deployment strategy.
\\n\\nChair: Matt Turner
November 12, 2024
Salt Lake City, Utah
EnvoyCon is a practitioner-driven event which emphasizes end-user case studies and technical talks from the Envoy developers. We do not have product pitches, but we will hear about solutions which build on top of Envoy. So join us for an exciting day of technical content, knowledge sharing, and engagement with project maintainers! The first EnvoyCon was at KubeCon North America 2018 in Seattle, following Envoy’s donation to the CNCF in September 2017.
\\n\\n\\n\\nWho will get the most out of attending this event?
\\n\\n\\n\\nEnvoyCon is aimed at everyone using or building Envoy. We have a range of topics and levels so there’s content for everyone in the Envoy ecosystem.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nThis year we’re covering a host of new and exciting topics, as folks continue to push the boundaries of what can be built with Envoy. We’ve got talks on Zero Trust, resilience in production, performance, extension points, and even a tale of Envoy being used as a VPN concentrator!
\\n\\n\\n\\nWhat will the day look like?
\\n\\n\\n\\nWe’ve got a lot to pack into a short time! We’re only running for the morning this year, so we’ll have non-stop talks save a short coffee break. There will be traditional talk sessions, followed by a really interesting panel discussion and rounded off with some quick-fire lightning talks.
\\n\\n\\n\\nShould I do any homework first?
\\n\\n\\n\\nNo prep is needed to enjoy EnvoyCon and get the most out of it. We’re a single-track event, so you don’t even need to plan your schedule, just turn up and soak it in!
\\n\\n\\n\\nContributed by Matt Turner who is genuinely interested to hear about all of the things built on top of Envoy.
\\n\\n\\n\\nDon’t forget to register for KubeCon + CloudNativeCon North America 2024.
\\n\\nMember post by Abhijeet Kakade, Senior Marketing Expert at MSys Technologies
\\n\\n\\n\\nMotorcycle riding is my passion, and as an avid motorcycle enthusiast, I really know the importance of regular inspection to keep my bike in top shape. From my early days with a basic minibike powered by a lawn mower engine, I’ve learned that proper inspection not only ensures optimal performance but also prolongs the life of my motorcycle. I constantly strive to improve its reliability, safety, and overall functionality through meticulous care and attention. In the realm of technology, I’ve found a similar parallel in the concept of observability.
\\n\\n\\n\\nObservability, much like motorcycle inspection, is about gaining deep insights into the internal states of a complex system by examining its outputs to ensure they operate smoothly and efficiently. Using robust observability practices, organizations can proactively monitor and analyze system data, detect anomalies, and optimize performance. Just as regular motorcycle maintenance helps prevent breakdowns and improves overall performance, Observability allows organizations to identify and address issues before they escalate, leading to enhanced reliability and operational efficiency.
\\n\\n\\n\\nIn motorcycle inspection, paying attention to the hardware and components is crucial. Regular checks, such as inspecting the engine, brakes, tires, and electrical systems, help identify potential issues and ensure everything is in working order. Similarly, in observability, analyzing the underlying infrastructure and systems is essential. By analyzing data, organizations can gain visibility into system health, resource utilization, and potential bottlenecks. This proactive approach allows for the timely identification of performance issues and the implementation of necessary optimizations.
\\n\\n\\n\\nMoreover, motorcycle maintenance involves tracking and analyzing performance metrics. Monitoring factors like fuel efficiency, engine temperature, and tire wear helps identify areas for improvement and informs maintenance decisions. Observability follows a similar principle by monitoring key performance indicators (KPIs) specific to the organization’s systems and applications. By tracking metrics such as error rates, response times, and resource utilization, organizations can gain valuable insights into system behavior, identify performance bottlenecks, and optimize operations accordingly.
\\n\\n\\n\\nJust as motorcycle enthusiasts invest in specialized tools and equipment to perform maintenance tasks, observability requires the right technologies to effectively analyze system data. From diagnostic tools for motorcycles to sophisticated monitoring and logging solutions for observability, having the appropriate tools is crucial for accurate and comprehensive insights. These tools enable organizations to collect, visualize, and analyze data, providing actionable information to optimize system performance and address any maintenance or operational needs.
\\n\\n\\n\\nWhen it comes to motorcycle maintenance, regular inspections and adherence to manufacturer guidelines are essential. Following recommended maintenance schedules, changing fluids, replacing worn-out parts, and addressing potential issues early on are all part of ensuring the longevity and optimal performance of the motorcycle. Similarly, in Observability, organizations must establish robust practices, define alert thresholds, and regularly review and refine their monitoring strategies. By continuously monitoring and maintaining Observability solutions, organizations can ensure they capture accurate and relevant data for effective system analysis and troubleshooting.
\\n\\n\\n\\nIn this blog, I’ve explained observability by drawing parallels between observability and motorcycle inspection. The goal was to make the concept accessible to everyone, regardless of their technical background. By using relatable examples from everyday life, such as maintaining a motorcycle, I aimed to demystify observability and make the concept easy to understand for anyone, not just those from a tech background.
\\n\\n\\n\\nWe at Msys Technologies provide DevOps observability services aimed at helping you build a robust and efficient foundation for today’s cloud-based solutions.
\\n\\n\\n\\nContact us today if you’re looking for support to scale up and scale out your storage infrastructure.
\\n\\nKubernetes and the rest of the Cloud Native ecosystem are both evolving fast. The velocity report that is conducted by the CNCF each year is a great demonstration of those changes.
\\n\\n\\n\\nAre you convinced yet? Simply look back three years: Kubernetes was still running on top of Dockershim, Gateway API was not even beta, and only 16 CNCF projects had graduated (out of 27 now).
\\n\\n\\n\\nTo cope with such evolutions, Platform Engineers and Administrators are updating their Kubernetes platforms and their applications on a regular basis and we, at the CNCF and Linux Foundation Education, will also continue to update our exams. CKS updates are already planned, and the update of our beloved CKA exam is next – but not before November 25th!
\\n\\n\\n\\nAs stated in the Linux Foundation Education announcement, “The CKA domains (i.e. Storage, Troubleshooting, etc.) will remain unchanged,” but there will be “changes to competencies.”
\\n\\n\\n\\nHere are 2 examples of changes:
\\n\\n\\n\\nThe announcement provides the list of all new competencies that will be expected if you want to successfully pass your Certified Kubernetes Administrator exam.
\\n\\n\\n\\nPlease note that the exact date when the CKA exam will be updated to reflect these changes is not yet confirmed, but we know it will not happen sooner than November 25th. We wanted to mention the change as early as possible so that you can adjust your plans if you intend to take the exam soon.
\\n\\n\\n\\nWe are confident that you will find the updated exam remains aligned with the knowledge a Kubernetes Administrator needs to have. And you can count on us to continue to adapt our certifications to reflect the ongoing evolution of our ecosystem! We welcome your comments at our dedicated email for training and certification: training@cncf.io!
\\n\\nMember post originally published on InfraCloud’s blog by Shreyas Mocherla
\\n\\n\\n\\nAccelerated by the pandemic, online tech communities have grown rapidly. With new members joining every day, it’s tough to keep track of past conversations. Often, newcomers ask questions that have already been answered, causing repetition and redundancy. To tackle this, we built an intelligent assistant that tracks past conversations, searches Stack Overflow for technical help, and browses the web for relevant information.
\\n\\n\\n\\nInSightful is a ReAct (Reasoning and Action) Agent with access to multiple tools, such as a web searcher and a context retriever, to accomplish a given task. In our case, it is a question-and-answer application. We will delve deeper into the workings of the application as we progress. By the end, you will understand how agents work and how you can develop your own. These agents are highly customizable, giving you a lot of flexibility in picking a use case.
\\n\\n\\n\\nInSightful is meant to be an AI Agent that can smartly retrieve conversations from a workspace, search Stack Overflow with a given context, and browse the web whenever the requested information is not available in the other two places. This ensures that InSightful has a safety net to fall back on in case the intermediary response (thought and observation) does not match the user’s question. Let us see how we can develop such an application.
\\n\\n\\n\\nNote: This tutorial assumes the developer has experience with Docker and understands how to use it appropriately. If the developer is using Kubernetes, this tutorial assumes they have a multi-node GPU cluster.
\\n\\n\\n\\nFor the demo, we will be using a readily available HuggingFace dataset consisting of conversations from a workplace, simulating the real environment. The dataset mimics actual interactions amongst colleagues in tech communities like Slack workspaces, Reddit threads, or Discord servers. Downloading the dataset does not require any additional steps as it is included in the Python code.
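\\n\\n\\n\\nAs a rough sketch of what loading such a dataset looks like with the HuggingFace datasets library (the dataset identifier below is a placeholder, since the demo code already references the actual dataset):
\\n\\n\\n\\n# Sketch: pulling a conversation dataset from the HuggingFace Hub.
# The dataset ID below is a placeholder; InSightful's code ships with the real one.
from datasets import load_dataset

dataset = load_dataset("your-org/workplace-conversations", split="train")  # hypothetical ID

# Each record is a dict; inspect one to see the conversation fields.
print(dataset[0])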
\\n\\n\\n\\nIn our setup, we are using a Kubernetes cluster that is already provisioned and fully configured to run AI workloads.
\\n\\n\\n\\nIf that is not your case, run the ai-stack separately on a GPU-enabled machine. If your machine is not equipped with a GPU, inference will be extremely slow, which is not recommended.
\\n\\n\\n\\n$ kubectl get svc -n ai-stack
NAME       TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
tei        LoadBalancer   10.233.22.94    192.168.0.203   80:32215/TCP     21h
tgi        LoadBalancer   10.233.50.250   192.168.0.202   80:31963/TCP     21h
vectordb   LoadBalancer   10.233.27.106   192.168.0.201   8000:30992/TCP   21h
\\n\\n\\n\\n(Kubernetes Services deployed using our AI stack chart)
\\n\\n\\n\\nOnce each service is deployed and running, we can access their respective IPs and ports to send and receive data. Go ahead and clone the repo and follow the steps in the InSightful README file.
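\\n\\n\\n\\nAs a rough sketch of what “accessing their respective IPs and ports” can look like in code, here is one way to wire the three services into a Python app, assuming the LangChain HuggingFace integrations for TGI/TEI and a Chroma server behind the vectordb Service. The class choices are assumptions rather than the repo’s exact code; the endpoints come from the Service IPs above:
\\n\\n\\n\\n# Sketch (assumed wiring, not the repository's exact code).
# Endpoints correspond to the EXTERNAL-IPs of the Services listed above.
import chromadb
from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings

llm = HuggingFaceEndpoint(endpoint_url="http://192.168.0.202:80")             # tgi: text generation
embeddings = HuggingFaceEndpointEmbeddings(model="http://192.168.0.203:80")   # tei: embeddings
vectordb = chromadb.HttpClient(host="192.168.0.201", port=8000)               # vectordb: assumed Chroma server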
\\n\\n\\n\\nAfter successfully following the steps in the README, if all goes well, you should be greeted with this page on your browser:
\\n\\n\\n\\nNow, let’s see InSightful in action by asking it a question. You can ask different types of questions, depending on the information you need.
\\n\\n\\n\\nNow that we know what InSightful can do, let’s understand what it is, and how it works. As mentioned earlier, InSightful is powered by an AI Agent, so let’s begin with Agents.
\\n\\n\\n\\nWhen ChatGPT was released, it quickly became popular, surpassing 1 million users in just 5 days. Natural Language Processing (NLP) has been around for a while, but ChatGPT took it to the next level by turning a Language Model into a Large Language Model (LLM).
\\n\\n\\n\\nHowever, LLMs, while powerful language models, are like parrots: they repeat what they hear without real understanding. To make them more applicable to specific tasks and problems, we have to fine-tune them, provide detailed prompts, and set guardrails to align their responses closely with the demands of the user. Agent development exploits some of these approaches, enabling agents to reason and act accordingly. With a set of tools, an agent can perform actions on an environment, something a regular base LLM cannot achieve. Agents integrate these additional tools and frameworks, allowing them to perform specific tasks, make decisions, and interact dynamically with their environment.
\\n\\n\\n\\nFor the sake of simplicity and to limit the scope of the InSightful demo, we will only talk about the techniques we employed to adapt the LLM to our own use case. We integrated the LLM with web search and context retrieval tools and provided it with a well-defined prompt. The combination of the prompt and the availability of the tools is what brings our agent to life. Something to note is that we don’t necessarily have to create an agent to utilize tools; instead, we can provide tools directly to a compatible LLM. We say “compatible LLM” because not all LLMs are tool-calling enabled. However, with the help of LangChain, any LLM can be used to develop an agent that achieves the same results as tool-calling-enabled LLMs. Ultimately, in a LangChain agent, the LLM is used solely for reasoning over and accessing the set of provided tools.
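\\n\\n\\n\\nTo make that distinction concrete, here is a minimal sketch of the direct approach, where a tool-calling-enabled chat model is handed a tool without building a full agent. The model choice and the stubbed demo_search tool below are illustrative assumptions, not InSightful’s actual configuration:
\\n\\n\\n\\n# Sketch only: a tool-calling-enabled chat model using a tool directly,
# without an agent wrapper. Model and tool body are placeholders.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def demo_search(query: str) -> str:
    """Search the web and return a short summary of the top results."""
    return "stubbed search results for: " + query   # placeholder implementation

llm = ChatOpenAI(model="gpt-4o-mini")            # any tool-calling-enabled chat model
llm_with_tools = llm.bind_tools([demo_search])

msg = llm_with_tools.invoke("Latest news on Kubernetes")
print(msg.tool_calls)   # which tool the model chose to call, and with what arguments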
\\n\\n\\n\\nWe prompted the agent carefully and refined the prompt over several iterations, making it almost foolproof. The prompt makes the agent much more reliable in solving tasks, consistently abiding by a structure while forming its final response.
\\n\\n\\n\\nFor InSightful, we took the Reasoning and Action (ReAct) Agent approach: a type of agent that reasons through a task, forms an action plan to guide its actions, and finally responds after thorough self-reflection. The incorporation of tools into the agent enables it to form effective responses.
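\\n\\n\\n\\nTo show what that structure looks like in practice, here is an illustrative ReAct-style prompt skeleton in the spirit of LangChain’s standard ReAct template. It is not InSightful’s exact prompt; the wording and variable names are assumptions:
\\n\\n\\n\\n# Illustrative ReAct prompt skeleton (not InSightful's exact prompt).
# The agent fills in Thought/Action/Observation turns until it can answer.
REACT_TEMPLATE = """Answer the following question as best you can.
You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: {input}
Thought: {agent_scratchpad}"""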
\\n\\n\\n\\nFrom the diagram, we can understand that the task at hand is the question provided by the user and the environment is defined by the tools’ descriptions. A tool’s description gives the agent concise context for deciding whether that tool is useful for the task. The “Thought-Action-Observation” cycle is repeated a defined number of times until a satisfactory observation is made.
\\n\\n\\n\\nThere are alternatives to ReAct that are linear rather than cyclic, such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT). However, these alternatives do not support actions and therefore cannot learn from the consequences of actions on an environment; they are purely reasoning-backed approaches.
\\n\\n\\n\\nFundamentally, tools are exactly what they sound like. They supplement the agent, helping it generate accurate and contextual responses to the user’s question. LangChain has an exhaustive list of such tools that can be integrated with agents.
\\n\\n\\n\\nInSightful applies several tools to enhance the agent: a context retriever for past workspace conversations, a Stack Overflow search for technical questions, and a web search for anything not covered by the other two.
\\n\\n\\n\\nAfter creating these tools as callable functions, we compile them into a simple Python list like so:
\\n\\n\\n\\ntools = [retriever_tool, stack_overflow_search_tool, web_search_tool]
\\n\\n\\n\\n(List of tools provided to the agent)
\\n\\n\\n\\nLangChain iterates over this list, using the name and description provided in each tool’s docstring to understand their purposes. Setting a name and description for each custom tool is mandatory as it provides LangChain with the context of what each function achieves.
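\\n\\n\\n\\nFor illustration, here is a minimal sketch of what one of these custom tools might look like. The decorator exposes the function name and the docstring as the tool’s name and description; the body shown here is an assumption rather than the repo’s implementation, and `retriever` stands in for a vector-store retriever built from the conversation dataset:
\\n\\n\\n\\n# Sketch of a custom LangChain tool (illustrative body, not the actual code).
# The function name and the docstring become the tool's name and description.
from langchain_core.tools import tool

@tool
def retriever_tool(query: str) -> str:
    """Search past workspace conversations for messages relevant to the query."""
    docs = retriever.invoke(query)   # assumed: a vector-store retriever built earlier
    return "\n\n".join(doc.page_content for doc in docs)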
\\n\\n\\n\\nOnce we have our list of tools, we set it as a parameter when creating our agent. This setup indicates to the agent that it has access to these tools. When the agent determines which tool to use, it runs the corresponding function to trigger the tool. Ideally, the tool returns the desired results, and the agent uses that as additional context to augment its responses.
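\\n\\n\\n\\nUnder assumed prompt and configuration choices (LangChain’s standard ReAct prompt from the hub, plus the TGI-backed LLM and the tools list from above), a minimal sketch of that setup looks like this:
\\n\\n\\n\\n# Sketch: building a ReAct agent over the tools list (assumed configuration,
# not necessarily InSightful's exact setup). `llm` and `tools` are defined above.
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent

prompt = hub.pull("hwchase17/react")        # a standard ReAct prompt template
agent = create_react_agent(llm, tools, prompt)

executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=5,                # bounds the Thought-Action-Observation cycle
    handle_parsing_errors=True,
    verbose=True,
)

result = executor.invoke({"input": "Latest news on Kubernetes"})
print(result["output"])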
\\n\\n\\n\\nBy integrating these tools, InSightful can track past conversations, search Stack Overflow for technical help, and browse the web for relevant information.
\\n\\n\\n\\nLet’s break down each of the components and how they fit into the big picture:
\\n\\n\\n\\nHere’s an example of the process when a user query is entered. The query “Latest news on Kubernetes” yields this result, visible only to the developer:
\\n\\n\\n\\nThe user sees this on the front-end:
\\n\\n\\n\\nInSightful works well in various online tech communities, such as Slack workspaces, Reddit threads, and Discord servers. The on-premise approach offers several advantages:
\\n\\n\\n\\nInSightful uses state-of-the-art Generative AI to provide an intelligent assistant for tech communities and enterprises. By keeping track of past and present conversations, accessing Stack Overflow for technical questions, and browsing the web for additional research, InSightful reduces redundancy and improves the efficiency of information retrieval. This on-premise approach shows how integrating LLMs with specialized tools can create a practical and powerful AI assistant for enterprises. You can follow this article to build an AI Agent to power a conversation search and retrieval engine for your communication channels like Discord and Slack; however, having an efficient and scalable cloud infrastructure is crucial for building an AI application. AI & GPU Cloud experts can help you achieve this.
\\n\\n\\n\\nHopefully, this has been as informative as we set out to make it. Stay connected with us by subscribing to our weekly newsletter. Please share your valuable feedback with us; we would love to hear from you on LinkedIn!
\\n\\n\\n\\nMember post originally published on Fairwinds’ blog by Stevie Caldwell
\\n\\n\\n\\nIt’s hard to believe, but Kubernetes, our favorite container orchestration tool, turned ten this year! It feels like just yesterday when it was just an internal project at Google spinning up its first pod, and now it’s at the heart of cloud native architecture. In the spirit of our many years managing, optimizing, and deploying Kubernetes for our clients, let’s celebrate the 10th anniversary the best way we know how: with a song (sung to the tune of Come on Eileen by Kevin Rowland & Dexys Midnight Runners).
\\n\\n\\n\\nVideo not loading? Click here 🙂
\\n\\n\\n\\nWe wrote and performed it together! I’m singing and doing electronic drums, Robert Brennan (our former VP of Product Development) is playing guitar, James DeSouza (Senior Software Engineer) is on recorder, and Brian Bensky (Site Reliability Engineer) is on bass.
\\n\\n\\n\\nPicture this: it’s the middle of the night, and your SRE team is snoozing peacefully. Suddenly, the PagerDuty alert goes off, interrupting a good night’s sleep yet again. Poor old SREs! They wake up, stumble over to their desks, and get to work troubleshooting, still asleep despite a mug of strong coffee. Meanwhile, it feels like half of their containers have decided to play dead in a dramatic fashion, sending applications into a tailspin and leaving end users frustrated and confused. In the post-mortem (‘cause who doesn’t love a little root cause analysis?) everyone’s trying to determine what’s to blame. If you’ve ever been on-call this probably sounds familiar because, let’s be honest, who hasn’t had a bad tech day (or night)?
\\n\\n\\n\\nTrying to find a way to make it all easier, someone decides to set up EKS. “Oh, it’ll be a breeze,” they say. “It’s just Kubernetes,” they say. But instead it takes everything to figure out how it works! Even though EKS is full of features and integrates seamlessly with the Amazon ecosystem, there are still a lot of knobs to turn to get a cluster to production-ready status. Figuring out Amazon EKS may make your head hurt. But it’s worth it to get that scalability and availability!
\\n\\n\\n\\nKubernetes promised us a dream. Containers that heal themselves? Automations and scheduling so everything always runs on time? Sounds like a breeze! But let’s be real. Learning Kubernetes is like learning a whole new tech language, one where “pod” doesn’t mean a group of whales (despite the Docker logo) and “nodes” take on a whole new meaning.
\\n\\n\\n\\nThey were beaten down and buried under a mountain of technical debt, resigned to the stress and instability of trying to build in-house automations to auto scale, manage failover, and enable load balancing. With the advent of Kubernetes, CTOs embraced the chaos of ephemeral environments and we all came out on the other side with K8s clusters that could run forever, thanks to our orchestration skills and capabilities.
\\n\\n\\n\\nKubernetes, now you’re full-grown, and you’ve shown us the sheer power of proper scheduling and controllers that handle everything. You’ve grown up so fast, running our workloads like a pro, surrounded by an awesome cloud native community. But everything continues to change, so keep an eye on your APIs and add-ons when you upgrade because it’s all just so much to keep up with!
\\n\\n\\n\\nKubernetes, it’s a dream, handling everything from controllers to nodes, making it all seem like we’re just deploying seamlessly. Containers crash, but in pods, they come back, right? We run into trouble sometimes – containers in CrashloopBackoff, resource contention – but you’ve made our lives easier, enabled us to deploy our workloads more smoothly, and our failures… well, a lot more recoverable. Thanks for ten years of enabling us to orchestrate our containers! And here’s to many more years of scaling, scheduling, and even some midnight PagerDuty alerts.
\\n\\n\\n\\nHappy 10th Anniversary, Kubernetes! We’re looking forward to the next decade of innovation, headaches, and a dash of orchestration magic.
\\n\\nNovember 11-12, 2024
\\n\\n\\n\\nSalt Lake City, Utah
\\n\\n\\n\\nWasmCon is a two-day event focused on all things WebAssembly. This is the first time WasmCon is being held in conjunction with KubeCon + CloudNativeCon North America 2024, and only the second WasmCon overall; the first was held as a standalone event in 2023.
\\n\\n\\n\\nFrom performance optimization to security to integration, a wide range of sessions, workshops and keynotes offer content for everyone. Discover the best practices and latest insights from the industry’s leading experts, developers, and users, and have unparalleled opportunities for networking and relationship-building in between sessions.
\\n\\n\\n\\nWasmCon is part of the All Access Pass for KubeCon + CloudNativeCon North America 2024 or attendees can register separately.
\\n\\n\\n\\nWho will get the most out of attending WasmCon?
\\n\\n\\n\\nA wide variety of technical and business professionals will benefit from attending WasmCon, including web developers, tech developers and engineers wanting to build faster and more secure applications, software architects seeking to understand how to build Wasm into their roadmap, and business leaders and professionals exploring its potential business value.
\\n\\n\\n\\nWhat is new and different this year?
\\n\\n\\n\\nExpect a variety of exciting new offerings at WasmCon this year including workshops, a track devoted solely to “Practical Wasm” and another that is called “Powered by Wasm.” Attendees will also be able to take a deep dive into the Wasm ecosystem with 9 different presentations focusing just on that topic.
\\n\\n\\n\\nWhat will the days look like?
\\n\\n\\n\\nDay one starts with workshops, including what promises to be an invigorating Wasm-spin on “choose your own adventure.” Attendees should expect to have plenty of in-depth sessions to choose from as well as time to network in between sessions and during lunch. Keynotes will be interspersed through the two days, and the always popular lightning talks will wrap up the conference at the end of the second day.
\\n\\n\\n\\nShould I do a bit of homework first?
\\n\\n\\n\\nIt *never* hurts to make sure you’re up to date with the latest in the Wasm community. Beginners can get a great overview, while developers can check out the comprehensive guide.
\\n\\n\\n\\nFind your community!
\\n\\n\\n\\nNeed more information? Check out the YouTube playlist from WasmCon 2023.
Register for KubeCon + CloudNativeCon North America 2024 today!