Application Transformation - Observe Ability

Practical OpenTelemetry Review

Leigh Finch — Tue, 24 Oct 2023 01:54:11 +0000

OpenTelemetry is something I’ve been watching for a while now, and reviewing Practical OpenTelemetry was the perfect excuse to dive deeper. In my lab I’ve been running a project called SigNoz, and I’m writing a companion article to this one to show some practical examples.

When I first looked into OpenTelemetry about 4 years ago, I was primarily focused on the Tracing aspects, such as how OpenTelemetry could replace vendor-specific code profilers. I concluded that vendor-specific profilers won’t necessarily be replaced by OpenTelemetry, but can be augmented by it. Most existing APM Observability vendors are looking at integrating with OpenTelemetry to reach languages and systems they previously hadn’t invested in: those focused on Java and DotNet, for example, can now include data from PHP, Python, and NodeJS (to name a few).

OpenTelemetry provides a standard and software to instrument many different things using Traces, Logs, and Metrics (with the Events of the MELT framework converged across the other telemetry types). Using a standard for instrumenting applications and systems means that you can switch backends relatively easily. As a result, less time is spent on how systems are instrumented, and more on visualising and analysing the telemetry they produce.

Practical OpenTelemetry is an easy read designed for developers, DevOps, SRE, and Observability practitioners to get familiar with where OpenTelemetry fits in their ecosystem. The practical aspect includes reference architectures with Prometheus, Elastic, and Kafka, as well as examples using Java which are simple enough that anyone with any programming experience should be able to port them to their language of choice.

To achieve that, it starts with an analysis of why OpenTelemetry is needed, focusing on strengths like openness, standards, and Observability (MTTRes, MTTI, MTTK, MTTF). This is my favourite style of introduction because it explains the need before throwing solutions at the problem.

The second part of the book breaks down what OpenTelemetry is (and isn’t), focusing on telemetry types and how they are ingested (including Traces, Metrics, and Logs), the OTLP protocol, the role of the collector, schemas, and conventions. This is where the concept of Spans becomes important. A Span is a unit of work that is done. That unit of work may be composed of multiple sub Spans, with each Span having a context and attributes. A Span could be the root of a Trace with a defined entry point such as a webpage, or something manually defined and coded within the application itself. See the below image of SigNoz representing a series of Spans.
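To make the parent and sub Span relationship concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK and not an example from the book: the Span class, the render helper, the request names, and the attribute keys are all hypothetical and only illustrate the idea of a shared context (trace ID, span ID, parent ID) plus attributes.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass, field


@dataclass
class Span:
    """Illustrative Span model; not the OpenTelemetry SDK class."""

    name: str
    trace_id: str  # context: shared by every Span in the same Trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: str | None = None  # context: None marks the root Span
    attributes: dict[str, str] = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, attributes: dict[str, str] | None = None) -> Span:
        """Create a sub Span that inherits this Span's trace context."""
        span = Span(
            name=name,
            trace_id=self.trace_id,
            parent_id=self.span_id,
            attributes=dict(attributes or {}),
        )
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the Span tree as indented lines, one per Span."""
    lines = ["  " * depth + span.name]
    for sub_span in span.children:
        lines.extend(render(sub_span, depth + 1))
    return lines


# Hypothetical request: a web page entry point with two units of work inside it.
root = Span("GET /checkout", trace_id=secrets.token_hex(16),
            attributes={"http.method": "GET"})
query = root.child("SELECT orders", {"db.system": "postgresql"})
charge = root.child("call payment-service")
```

Calling render(root) returns the entry point followed by its two indented sub Spans, which is roughly the shape a tracing UI draws as a waterfall.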

The book then goes on to describe Tracing instrumentation styles, such as auto-instrumentation vs manual, as well as local vs distributed Tracing. Distributed Tracing is the ability to correlate Spans across systems. Distributed Tracing is incredibly important as it allows you to create real-time service maps and see how an individual transaction performed over multiple systems.
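One common way Spans are correlated across systems is by passing the trace context in a request header. The sketch below assumes the W3C Trace Context traceparent header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); the review does not say which propagation format the book uses, and the inject and extract helper names are hypothetical, not a library API.

```python
import re

# version "00" - 32 hex trace-id - 16 hex parent span-id - 2 hex trace-flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Write the caller's trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Read (trace_id, parent_span_id) from incoming headers, or None if invalid."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # All-zero identifiers are defined as invalid by the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id


# Service A: about to call service B, so it injects its current context.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
span_a = "00f067aa0ba902b7"
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B: continues the same Trace, using A's Span as the parent.
context = extract(outgoing)
```

Because service B reuses the trace ID and records service A's span ID as its parent, a backend can stitch the two systems' Spans into one transaction.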

The chapters on Metrics and Logs cover how they are implemented in OpenTelemetry. I liked how the author was able to discuss the convergence of logs and tracing and why it won’t happen overnight.

Blanco finishes the book with adoption and institutionalisation, covering the challenges of Brownfield vs Greenfield environments. This is especially valuable, as relevancy and compatibility with other systems that have overlapping capabilities is a constant battle for enterprises.

Practical OpenTelemetry on Amazon.

The post Practical OpenTelemetry Review first appeared on Observe Ability.

The CrUX Of Web Performance

Leigh Finch — Tue, 17 Oct 2023 05:36:01 +0000

The world of web performance is tougher than most internal applications because you are dealing with third parties that you don’t control: the Internet, browsers (with a thousand tabs open), endpoints (with agents running), and poor connectivity in the form of cellular/wireless/consumer grade connections. This means that we need to be able to understand and manage what is within our control.

This article is tangential to SEO (Search Engine Optimisation), but it focuses on User Experience rather than the SEO aspects.

I like to think about performance using the point of consumption and the point of distribution when looking at a network like a WAN or the Internet serving applications/services/web pages to consumers over some latency. What we usually control is the point of distribution of an application (which could be a load balancer, Application Delivery Controller, or reverse proxy such as Apache, NGINX, or Varnish).

The point of consumption (and everything in between) in an Internet delivered application is something we may have little control over. If we are lucky it might be a corporate managed device where tools like End User Experience Management (EUEM) agents might be installed.

If we go back to the point of distribution for the application, we need to ensure that it is operating as optimally as possible in terms of server response time and transport performance (e.g. TCP and QUIC). But we also need to consider the rendering at the point of consumption which is where Core Web Vitals and CrUX come into play.

Core Web Vitals is an initiative that looks at the performance of a web page from the perspective of the user, to improve User Experience. CWV is made up of the following meaningful metrics:

Largest Contentful Paint

Largest contentful paint looks at the time it takes to draw the largest content on the page which could be an image, video, or other visible object in the browser. LCP is favoured over earlier metrics such as the page load time.

Cumulative Layout Shift

Cumulative layout shift looks at how much the layout of a page shifts during its lifetime. The more things move around, the poorer the user experience and the worse the score (for CLS, a higher value is worse). Think about pages that pop up an advertisement covering the content as you scroll through an article, or pages where buttons and links move around as the page renders.

First Input Delay

First input delay is how long it takes from the user’s first interaction with the page until the browser is able to start running the event handlers for it. In practice, this is how long it takes before you can type in a form field or click a button and have the page react.

Interaction to Next Paint

Interaction to next paint is a pending metric for Core Web Vitals (at the time of writing, October 2023). INP is how long it takes from an interaction such as a click until the browser renders the next thing, such as a drop-down box, login form, or even the next page. This can be considered how responsive a page is to interaction. Everyone I know has had a poor experience with a web application that is unresponsive to clicks, resulting in them clicking again and again and adding to the queue of interactions waiting to be processed.
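As a rough illustration of how these four metrics turn into a rating, here is a short Python sketch. The article itself does not list thresholds; the values below are the "good" and "poor" boundaries Google has published for each metric, and the rate function name is hypothetical.

```python
# (good_up_to, poor_above) per metric. Times are in milliseconds; CLS is unitless.
# Thresholds are Google's published boundaries, not figures from this article.
THRESHOLDS = {
    "LCP": (2500, 4000),
    "FID": (100, 300),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
}


def rate(metric, value):
    """Classify a measurement as 'good', 'needs improvement', or 'poor'."""
    good_up_to, poor_above = THRESHOLDS[metric]
    if value <= good_up_to:
        return "good"
    if value <= poor_above:
        return "needs improvement"
    return "poor"


# Hypothetical measurements for a single page.
page = {"LCP": 3100, "FID": 80, "INP": 620, "CLS": 0.05}
report = {metric: rate(metric, value) for metric, value in page.items()}
```

For the hypothetical page above, LCP needs improvement, INP is poor, and FID and CLS are good, so rendering and responsiveness would be the places to start.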

CrUX (the Chrome User Experience Report) is the service for Chrome that reports these metrics back to Google for the purposes of benchmarking and baselining websites. Arguably it could be used to influence a website’s ranking in search results, but this may just be because Google likes to rank websites with better user experience higher. There are caveats in CrUX reporting:

The website has to be popular enough.
CrUX data is only collected by Chrome.
The user has to have opted in to metric collection.
Other geographical and privacy reasons.

So what do you do if there isn’t enough data coming in, or you want more detailed analysis? JavaScript snippets like those provided by Blue Triangle and APM vendors can provide detailed user experience data from any browser that interacts with your page. OpenTelemetry can also be used to collect metrics using the JavaScript package.
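If you do collect your own measurements, you still need to summarise them. CrUX reports against the 75th percentile of page loads, so a simple starting point is to do the same with your own samples. The sketch below uses the nearest-rank method, which is an assumption (CrUX’s exact calculation may differ), and both the percentile_75 name and the sample values are made up for illustration.

```python
import math


def percentile_75(samples):
    """Return the 75th percentile of the samples using the nearest-rank method."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank of the p75 sample
    return ordered[rank - 1]


# Hypothetical LCP samples in milliseconds, as reported by a page snippet.
lcp_samples = [1800, 2100, 2300, 2600, 3900, 1200, 2450, 5100]
lcp_p75 = percentile_75(lcp_samples)
```

Summarising at the 75th percentile rather than the average means one very slow page load does not dominate the result, while a consistently slow quarter of your users still shows up.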

Some Interesting Tools

Here are some tools that can help you with your End User Experience Management:

Google Search Console (https://developers.google.com/search/) for monitoring web page performance over time.
PageSpeed (https://pagespeed.web.dev/analysis/) tests website performance, SEO, and accessibility, and gives recommendations.
Google Lighthouse, embedded in the Chrome developer tools, provides performance, SEO, and CWV details (see below).

Thanks for reading this quick introduction to Web User Performance. If you made it this far please consider buying a book in the Book Recommendations Page.

The post The CrUX Of Web Performance first appeared on Observe Ability.

SaaS to on-prem and back again! Interview With a Professional Services Director

Leigh Finch — Sat, 07 Oct 2023 05:04:51 +0000

Earlier this week I met with a good friend of mine (who we’ll call Martin) who works as a Director of Professional Services at a global enterprise software company. We’d been discussing a recent article I wrote on Cloud Migrations in Reverse, and we talked about why he had seen some of his customers migrate from their SaaS offering back to their on-premises products.

For the last 10 or so years, traditional enterprise software companies have been working with their customers towards their SaaS offerings. The biggest reason for software vendors to do this is that it is easier for the vendor to support their customers when they also manage the environment. The strategy for Martin has been to retrain his professional services consultants to become (Technical) Customer Success Managers. The nature of SaaS means that customers can churn at any point before the customer lifetime value exceeds the cost to acquire. CSMs work with customers to achieve and maintain value from the as-a-service offering while it is in play.

Martin explained to me that his company’s SaaS solution is a win-win for customers and vendors. He also sees a trend with certain types of customers reversing their decision, or indefinitely delaying go-live, for three main reasons:

Poor planning
Changes in user experience
Data sovereignty and legal challenges

Poor Planning

I think we could consider all three of these as poor planning, but what Martin meant by this was that the customer assumes that migrating is an easy task and neglects to consult across teams, including LOB (Line Of Business) heads, and to bring in professional help to perform the migration. The outcome can be data that is not migrated correctly, and data inconsistencies.

Changes in User Experience

Changes in where the application is hosted, or insufficient user training for newer releases, can result in users having a poor user experience. I’ve talked a lot about how latency can really impact application performance in Performance Diagnostics Part 1.

Data Sovereignty and Legal Challenges

Martin outlined that some of his customers moved to their SaaS offering and found that they ran into problems with Personally Identifiable Information (PII) being stored in localities outside the locations specified by legislation. Not every SaaS provider will have a tenancy in, or can guarantee that data will stay in, a specific location. Sometimes the law, and people’s interpretations of the law, can lag behind technology.

Martin did say that almost all of his customers are looking to eventually reach SaaS as an end state. His Professional Services team are busy advising and working with their customers to achieve that.

Thanks for reading. If you enjoyed this article you can buy me a coffee or buy yourself a book.

The post SaaS to on-prem and back again! Interview With a Professional Services Director first appeared on Observe Ability.

Why Are We Seeing Cloud Migrations in Reverse?

Leigh Finch — Sat, 23 Sep 2023 05:46:43 +0000

I’ve always loved using excess or old computers and network infrastructure to lab things up or run a PoC for an application or service. I still have a Dell R710 and HP ML10v2 I use to run services like Home Assistant and various observability tools for testing.

So what does this have to do with Cloud migrations? I can’t afford to run my labs in IaaS for long periods of time. But this is the easy answer to a phenomenon that seems to be happening more and more. Let’s take a look.

When cloud services started becoming a thing in the mid-to-late 2000s, there were a few options available: storage, compute, and some XaaS offerings like unified communications. The promise of the cloud was redundancy, scalability, reduced reliance on employing specialists, and OpEx spend as opposed to CapEx spend. Why would a legal firm need a large IT team and its own equipment when it can outsource it to the experts?

This sounds pretty attractive to technology leaders because they no longer need to own assets or manage the related service contracts and renewals, and they can also reduce the number of staff required to manage their technology investment. If the company had a launch or other event where scale was needed, they could simply scale their services temporarily.

I’d personally leveraged hosted dedicated servers in the US and Australia primarily for reliability and high bandwidth for my personal projects, as well as recommending this to my customers at the time as a way to avoid the risk of localised power and connectivity issues for those hosting out of makeshift datacentres or worse, basements.

As technology evolved through the 2010s, we saw cloud native technology starting to appear, which had the promise of autoscaling and automating the build of applications and microservices. Many organisations jumped on this and migrated some of their applications to become cloud native. Others did not; they simply took their on-premises architecture and moved it into cloud service providers like AWS, GCP, and Azure.

Without application transformation, running these legacy architectures can become extraordinarily expensive over the long term. Investing developer resources into transforming a ‘good enough’ application on a legacy architecture doesn’t make sense, and neither do the costs of running it in the cloud.

Some organisations simply took their legacy applications and placed them into containers, which then adds the overhead of running the service container. Containerisation has the benefit of abstracting the service from the underlying platform, but there has to be some re-think of how the application works.

Put simply, if you are building something new, build it cloud native. If it is legacy and you have no intention of transforming the application to run on a cloud native architecture, you may not realise the expected savings long term.

When organisations go through the process of troubleshooting a complex application performance issue, they will often go for quick wins:

Increase bandwidth
Add CPU cores
Add Memory
War room?

These changes are never reversed and these costs add up quickly. Nothing is more permanent than a temporary fix.
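To put a number on how quickly this adds up, here is a back-of-the-envelope sketch in Python. The monthly prices are entirely hypothetical placeholders, not real provider rates, and the cumulative_cost helper is only there to show the arithmetic.

```python
# Hypothetical extra monthly cost of each quick win; substitute your own rates.
QUICK_WINS = {
    "Increase bandwidth": 300.0,
    "Add CPU cores": 150.0,
    "Add memory": 90.0,
}


def cumulative_cost(monthly_costs, months):
    """Total spend if none of the changes are reversed for the given months."""
    return sum(monthly_costs.values()) * months


one_year = cumulative_cost(QUICK_WINS, 12)
three_years = cumulative_cost(QUICK_WINS, 36)
```

With these made-up figures, three small "temporary" changes cost 6,480 after one year and 19,440 after three, for a single troubleshooting exercise on a single application.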

Additionally, it can be cheaper to prototype solutions internally on excess and legacy hardware; the same prototype tested in the cloud has ongoing costs that would not otherwise be seen. Hands up if you ever left something running in the cloud by accident?

I think we will continue to see hybrid environments. If I were starting a business today, I would certainly still run a hybrid environment where prototyping could be done on-prem, and the majority of anything built would be cloud native.

If you got this far, thanks for reading. Feel free to buy a book from here!

The post Why Are We Seeing Cloud Migrations in Reverse? first appeared on Observe Ability.