At Candu, we care deeply about providing the highest availability and reliability to our customers. Because Candu is meant to be a seamless part of your UI, our explicit objective is that Candu must be at least as reliable as our customers' own infrastructure. Of course, we never want our app to go down, but more importantly, we never want anything to negatively impact your users' experience within your product.
This article discusses the architectural decisions we have made and implemented to provide our customers with a highly reliable product. You can learn more about Candu's Architecture here.
To read how we handle failure in the SDK, please read this guide.
Candu is deployed over AWS, and we run most of our servers in ECS. We use blue-green deployments powered by a custom-made framework. At any given point, we can roll back to any commit within 3 minutes.
We also have the ability to create a staging environment in less than a minute for any commit that is pushed to our backend. Developers can then test any change in a production-like environment.
For databases, we use a combination of DynamoDB and RDS across multiple availability zones to ensure maximum uptime and reliability.
High-throughput data processing code is hosted on AWS Lambda to ensure high availability and scalability.
Finally, we use a message-driven architecture to ensure that critical customer data is processed at least once.
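At-least-once delivery means the same message can be redelivered, so consumers have to deduplicate. The sketch below illustrates the idea with an idempotent consumer; the `Message` shape and class names are illustrative assumptions, not Candu's actual services.

```typescript
// A message as it might arrive from a queue. At-least-once delivery can
// hand us the same message more than once, so each carries a unique id.
interface Message {
  id: string;
  payload: string;
}

// Illustrative consumer: processes each message id exactly once, even if
// the queue redelivers it.
class IdempotentConsumer {
  private processed = new Set<string>();
  public handled: string[] = [];

  handle(msg: Message): void {
    if (this.processed.has(msg.id)) return; // duplicate redelivery: skip
    this.processed.add(msg.id);
    this.handled.push(msg.payload); // stand-in for real processing
  }
}
```

This is why at-least-once (rather than exactly-once) delivery is safe for critical data: nothing is lost, and duplicates are filtered at the consumer.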
In order to provide high availability and fast delivery, Candu uses a CDN to publish content.
Whenever a user edits content in our dashboard, the changes are saved on our servers.
Once content has been published, Candu uploads a version of that content to the CDN. We use this publishing mechanism in order to:
- Ensure that you can safely edit content in a draft state before publishing.
- Increase the upload speed and availability of all content that you create.
- Provide a versioning process and audit trail for any content.
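The publishing mechanism above can be sketched as writing an immutable, versioned object per publish, which is what yields the audit trail. The key scheme and class below are assumptions for illustration, not Candu's real storage layout.

```typescript
// Minimal sketch of versioned publishing. Each publish appends a new
// immutable object under contentId/vN.json; drafts never reach the CDN.
type PublishedObject = { key: string; body: string };

class CdnPublisher {
  private objects: PublishedObject[] = []; // stand-in for an S3 bucket

  publish(contentId: string, body: string): string {
    const prior = this.objects.filter((o) => o.key.startsWith(`${contentId}/`));
    const key = `${contentId}/v${prior.length + 1}.json`; // immutable versioned key
    this.objects.push({ key, body });
    return key;
  }

  // Every published version remains addressable: the audit trail.
  history(contentId: string): string[] {
    return this.objects
      .filter((o) => o.key.startsWith(`${contentId}/`))
      .map((o) => o.key);
  }
}
```

Because old versions are never overwritten, rolling content back is just pointing at a previous key.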
All of the assets served through Candu to your customers are hosted on an enterprise-grade CDN.
We use a CDN for the following reasons:
- CDNs have extremely high availability and reliability. We selected S3 + Cloudflare as our primary vendors.
- CDNs are distributed and geographically close to our customers, meaning we can serve content quickly.
The Candu SDK is installed within your application to provide dynamic segmentation and user analytics, and to render UI.
The SDK's main functionalities are:
- Rendering Content
- Collecting analytics via eventing
Our SDK is engineered to handle multiple failure points. The first request the SDK makes on initialization retrieves the "customer segment." This is the only SDK request that is not served from the CDN, so customer segments are cached to local storage in case the network fails.
If the network call fails or is too slow for any reason, the customer sees whatever content version they saw last, preserving a consistent user experience. If the call fails on the very first load, before anything has been cached, the customer defaults to the Everyone Segment.
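The fallback chain above can be sketched as: try the network, cache on success, and on failure serve the last cached segment or the Everyone Segment. The function and storage names here are illustrative, not Candu's real API; a plain key-value store stands in for the browser's localStorage.

```typescript
const EVERYONE_SEGMENT = "everyone"; // assumed identifier for the default segment

// Minimal key-value interface; in the browser this would be localStorage.
interface KVStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

class MemoryStorage implements KVStore {
  private data = new Map<string, string>();
  get(key: string) { return this.data.get(key); }
  set(key: string, value: string) { this.data.set(key, value); }
}

async function resolveSegment(
  userId: string,
  fetchSegment: (userId: string) => Promise<string>, // the one non-CDN request
  storage: KVStore,
): Promise<string> {
  const cacheKey = `candu:segment:${userId}`;
  try {
    const segment = await fetchSegment(userId);
    storage.set(cacheKey, segment); // cache for future network failures
    return segment;
  } catch {
    // Network failed: fall back to the last segment this browser saw,
    // or to the Everyone Segment if nothing was ever cached.
    return storage.get(cacheKey) ?? EVERYONE_SEGMENT;
  }
}
```

The important property is that `resolveSegment` never rejects: every failure mode degrades to a segment the user has already seen, or to the safe default.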
After the SDK resolves Segment membership, Candu fetches the Content. Content is cached locally as an additional failsafe on top of CDN hosting, so a transient network failure does not prevent rendering.
In the worst-case scenario where AWS S3 or Cloudflare fails and there is no local content cached, then no content will be served and the page remains intact, looking as though Candu was never installed.
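The content failsafe chain reads: CDN first, then local cache, then render nothing at all so the page stays intact. A hedged sketch, where `fetchFromCdn`, the `Content` shape, and the in-memory cache are assumptions for illustration:

```typescript
// Assumed shape of a published content payload.
type Content = { version: number; body: string };

async function loadContent(
  contentId: string,
  fetchFromCdn: (id: string) => Promise<Content>,
  cache: Map<string, Content>, // stand-in for a local-storage cache
): Promise<Content | null> {
  try {
    const content = await fetchFromCdn(contentId);
    cache.set(contentId, content); // refresh the local failsafe
    return content;
  } catch {
    // CDN unreachable: serve the last version this browser saw, if any.
    // null means render nothing, leaving the host page untouched.
    return cache.get(contentId) ?? null;
  }
}
```

Returning `null` rather than throwing is what makes the worst case invisible: the surrounding page renders as though Candu was never installed.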
Before release, each new SDK version must pass an extensive and continuously expanding set of unit and integration tests to catch potential regressions. Additionally, we aim for 100% TypeScript coverage to minimize type-related errors and to help developers integrate the SDK into their projects.
When a breaking change is identified in an SDK release, we bump the major version for that release according to SemVer. This prevents accidental installation of the SDK without proper migration steps in place.
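Under SemVer, a breaking change is visible in the version string itself: the major number increases. A minimal sketch of the check a consumer (or install tooling) could perform; real projects would typically use a semver library rather than this hypothetical helper:

```typescript
// Returns true when upgrading from `installed` to `candidate` crosses a
// SemVer major boundary, i.e. when migration steps are required.
function requiresMigration(installed: string, candidate: string): boolean {
  const major = (v: string) => parseInt(v.split(".")[0], 10);
  return major(candidate) > major(installed);
}
```

For example, `1.4.2 → 2.0.0` signals a required migration, while `1.4.2 → 1.5.0` is a safe drop-in upgrade.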
At Candu, we take our SLA and partner operations extremely seriously. We strive to maintain a 99.9% SLA in all of our APIs & frontend assets.
SLA monitoring is done through third-party monitoring integrations. We currently ping 10+ APIs for uptime, as well as other critical parts of the infrastructure we use to provide our services.
All the tests are performed from 7 different locations around the world (Canada Central, Ohio, Oregon, Sydney, Tokyo, Frankfurt, London) to ensure we maintain availability within and throughout different regions.
All critical integration tests are performed each minute.
If any check fails, our team is notified immediately as outlined in our escalation policy.