Choosing an Experimentation Platform: A Retrospective

There comes a moment, in every company that wants to ship products people love, when “we should experiment more” turns into “we cannot keep experimenting like this.” Hand-tuned holdouts; traffic-allocation tickets bouncing between PMs and engineers; analyst calendars booked weeks out. The desire to be data-driven kind of outgrows the machinery that was supposed to make it so.

That was where we sat at ManyChat last year. We chose Eppo, but that decision is the smallest part of the story, and the part you can least transplant to your company. What I want to share instead is the process I walked through to get there, what I got wrong along the way, and what surprised me on the other side of the contract (yep, doctors hate me for this trick).

A note on timing. We picked Eppo at an unusually exciting moment in the industry, as the vendor map was shifting beneath us mid-evaluation. Eppo itself had been acquired by Datadog some months before. Statsig had recently been acquired by OpenAI, and would later be sold on to Amplitude. I don’t think any of what I describe below depends on that particular news cycle, but I want to acknowledge that some of it shaped our mood while we were deciding.

I break what follows into three acts: before the decision, during it (making the decision), and after.

Before

Let me put you in the mood we were in at the start. As I onboarded to the company, an engineer told me that if there were two simultaneous opportunities to run experiments, his team would simply postpone the second idea to a later sprint because of the technical headache of configuring the two allocations. The risk of getting it wrong ultimately outweighed the excitement to test. That is quite literally anti-velocity at best, and no experiment at worst. And for the one experiment that could be configured, copy-pasting boilerplate allocation logic was their bread and butter.
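To make the allocation headache concrete, here is a minimal sketch of the kind of logic that tends to get copy-pasted. It is not ManyChat's code or any vendor's implementation; `assign_variant`, the experiment keys, and the variant names are hypothetical. The sketch assumes that hashing the user ID together with an experiment-specific key is an acceptable way to bucket users, which lets two experiments run at the same time without hand-coordinating their splits.

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment.

    Salting the hash with the experiment key (an assumption of this
    sketch) makes a user's bucket in one experiment effectively
    unrelated to their bucket in another.
    """
    payload = f"{experiment_key}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    # Use the first 32 bits of the digest as a bucket number.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user can be enrolled in two concurrent experiments.
checkout_arm = assign_variant("user-42", "checkout_button_color")
onboarding_arm = assign_variant("user-42", "onboarding_copy")
```

Because the assignment is a pure function of the user ID and the experiment key, no allocation table has to be maintained by hand, and a second experiment is a new key rather than a new ticket.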

An analyst on the other side of that same pipe described herself as a “human microservice”; she meant the holdout groups, defined by hand, refreshed by hand, passed on to the engineer, and so on … an exciting opportunity to experience the entire flow in first-person POV, indeed. But, irony aside, that was the moment the case for a platform stopped being abstract.

I had seen versions of this room before. At Marktplaats, some years earlier, I had written the in-house Python libraries that try to absorb this kind of pain, and we saw time-to-insight go down from days to hours, in the tail cases.

I watched the same build-or-buy debate play out again at Adevinta, globally, at a larger scale, where it landed on building rather than buying. Lucky for us at ManyChat, by the end of 2025 the platform options had matured enough that, for a company our size and at that moment, buying was the obvious move.

We wanted the tool that would give us the best shot at getting our experimentation program where we want it: state-of-the-art statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default; product managers included.

Two things stood between us and the choice. The first was simple: we had named the pain, but it was only anecdotal so far. Leadership had a (very good) notion of what was broken, and I had heard devs and product managers grumble about the current stack when I first met them. But none of that was the same kind of object as a vendor requirements list. Until we could put the two side by side, we couldn’t tell which capabilities were nice-to-haves and which were the point.

The second was harder. The decision carried a lot of weight because, no matter how you put it, there is always a lock-in element to any platform; culturally, if not technically. And resources are finite: we couldn’t POC every platform on the market, let alone absorb the opportunity cost of having to reverse the decision and start over. Choosing one to bet on, in a single sitting, with no chance to course-correct, would have been asking to be wrong. And with the options being so similar in most ways, finding the best one for us was a matter of precision. We needed a way to break a single high-stakes decision into smaller, lower-stakes ones that built on each other.

Interviews, and de-risking the decision

I started with interviews. PMs, product analysts, engineers, marketers. The goal was to convert anecdote into something we could hold up against a vendor’s feature list. The engineer’s calendar story, the analyst’s “human microservice”, the PM who had given up on running atomic experiments and was bundling changes into bigger releases instead, postponing some of them entirely: these became the job description for the tool. I cannot overstate how much this paid me back later. Every time the process drifted, and it drifted, the interviews were the anchor we came back to. They were also what made the whole effort credible inside the organization: explaining to my CPO why we were spinning up a POC was a different conversation when I could quote a specific friction back to her.

For the single-shot problem, we phased the discovery into three layers, each going one level deeper in the evaluation:

Desk research. Read the vendor docs, sketch a long list. Most platforms self-eliminated here, before we ever opened a sales funnel. A lot of Claude Code at this step, too.
Demos. A focused conversation with each shortlisted vendor. A little sales pitch, sure, but mostly us probing the areas we had decided mattered most.
POC. Hands on the platforms, with real data and real evaluators, only for the two finalists.

Each layer narrowed the field and bought us information at a “price” we could afford. By the time we reached the POC we were down to two, and the decision in front of us had shrunk to something we could actually hold. Statsig, or Eppo?

There is one part of this I would repeat on day one of any future platform decision, in any category: the interviews that define the pain points. They were the single biggest unlock of the whole stage. Running close behind them, sponsorship. And I don’t mean just from my director, who asked me to push it forward. I kept the peers and stakeholders who would have to back or adopt the decision in the loop the whole way through. By the time the POC ended, the decision surprised no one.

At the end of “before” we had a shortlist of two, and the discipline of how we had narrowed to them. We knew what worked for us. The harder question was still waiting: between two platforms that both cleared our bar, which was actually better for us? How would we define “better” conceptually, and how would we agree on it practically?

During

It was the debrief, after the POC, and the analysts on the panel were taking turns talking. Two of them, who knew our stack best, finished their summaries with a sentence along these lines:

“As a product analyst, I would be really happy to move forward with either of them.”

I sat with that for a moment. The consolidated scores agreed with them: the two platforms came in at 4.36 and 4.47 on a five-point scale, across more than twenty weighted criteria. By any reasonable read, it was a tie. I had spent weeks building a process that would point clearly at one platform, and the process had just told me, in the voice of the peers I trusted most to spot a meaningful difference, that there was no meaningful difference from their seat.
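For readers who have not built one, a weighted scorecard of this kind reduces to a weighted average per platform. The sketch below shows the arithmetic only; the criteria, weights, scores, and the tie threshold are all made up for illustration and are not the ones we used.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (e.g. on a 1-5 scale)."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total_weight


def is_tie(score_a: float, score_b: float, threshold: float = 0.25) -> bool:
    """Treat two totals as a tie when they differ by less than `threshold`.

    The threshold is an assumption of this sketch; it should be agreed
    on before the scores come in, not after.
    """
    return abs(score_a - score_b) < threshold


# Hypothetical criteria, weights, and panel scores.
weights = {"stats_engine": 3, "pm_workflow": 2, "warehouse_integration": 2, "governance": 1}
platform_a = {"stats_engine": 4.5, "pm_workflow": 4.0, "warehouse_integration": 4.5, "governance": 4.5}
platform_b = {"stats_engine": 4.5, "pm_workflow": 4.5, "warehouse_integration": 4.5, "governance": 4.0}

total_a = weighted_score(platform_a, weights)
total_b = weighted_score(platform_b, weights)
```

With these illustrative numbers the two totals land within a tenth of a point of each other, which is the situation the scorecard alone cannot resolve.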

What I learned in that moment, and wouldn’t have learned without the panel, is that analyst-grade rigor has become table stakes. The marginal value of choosing one modern experimentation platform over another doesn’t accrue to your scorecard; it accrues somewhere else. Where, exactly, was the question I now had to answer.

So I needed a decision I could defend; to myself first, then to my data director and CPO, then to the teams who would inherit it. Coin flips and personal preferences are bad foundations for a multi-year contract. And the tie meant the tiebreaker couldn’t be invented after the fact; it had to reflect what we actually wanted from the next few years of experimentation at ManyChat.

Specifically, we weren’t choosing between two snapshots; we were choosing between two trajectories. Eppo’s bet was on guided, opinionated, PM-shaped *cough* proof *cough* workflows; Statsig’s was on power-user flexibility. Both were defensible, for sure. But recall what we had said:

We wanted the tool that would give us the best shot at getting our experimentation program where we want it: state-of-the-art statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default (…)

I noticed what didn’t happen. The POC plan called for PMs to trial both platforms and feed scores back into the matrix. They mostly didn’t, because of bandwidth. One head of marketing operations and one PM gave me unprompted impressions, and the rest of the PM-side evidence and input stayed thin. The absence of PM feedback did something counterintuitive: it increased the weight I gave to PM-facing UX and workflows, and to governance, in the final call. The logic is asymmetric. Analysts are adaptable, power users if you will; they will work their way through whatever interface you hand them. PM onboarding is not adaptable in the same way. If the platform our analysts rated equally is also the one that lowers the barrier for our PMs, that is a decision; the reverse, picking the analyst-equivalent platform our PMs would have struggled with, would have been quiet self-sabotage.

In short, we could finally say: everything else being near-equal, usability for non-technical folks is what sets the two platforms apart.

So we picked Eppo. The trajectory question is what tipped it: on a long horizon, Eppo lined up better with where we wanted experimentation to live; closer to experimenting teams, and beyond just the analyst. Knowledge management as a first-class object. Reporting that doesn’t need a deck rebuilt around it. Statsig had its advantages too: CUPED (a variance-reduction technique) inside its power calculator, a standalone metrics explorer, a more flexible analysis surface. We accepted these as Year 1 gaps to work around, while Eppo was being revamped inside Datadog and acquiring those features too.
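Since CUPED comes up as a differentiator, a short sketch of the textbook adjustment may help. It is not either vendor's implementation: `cuped_adjust` and the simulated data are hypothetical, and the sketch assumes each user has a pre-experiment value of the same metric to use as the covariate.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    theta = cov(x, y) / var(x), estimated on the data passed in.
    In practice theta is usually estimated on the pooled sample
    (all variants together) and then applied to each variant.
    """
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    cov_xy = fmean([(x - mean_x) * (y - mean_y) for x, y in zip(pre_metric, metric)])
    var_x = pvariance(pre_metric, mu=mean_x)
    if var_x == 0:
        return list(metric)  # covariate carries no information
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


# Simulated users whose in-experiment metric correlates with their past value.
rng = random.Random(0)
pre = [rng.gauss(10, 2) for _ in range(5000)]
post = [x + rng.gauss(0, 1) for x in pre]
adjusted = cuped_adjust(post, pre)
```

The adjusted values keep the same mean but have a smaller variance, which is why a power calculator that accounts for CUPED can report shorter required run times for the same detectable effect.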

Looking back, the lesson I take away from it is double-edged. The decision needed more rigour than instinct wanted, and then less faith in that rigour than I expected. The scorecard mattered because it forced everyone to be specific, and because it created a sense of trust and credibility in the outcome. It gave me 360-degree coverage, but the call came from the moments inside it: the analyst tie, and the vision question. Six months after signing, a curious colleague could ask me how we had picked, and I could walk them through the panel, the scorecard, the corrections, and the vision/framing question. That’s a win for me.

After

I think I expected, somewhere I would not admit aloud, that signing the contract was the finish line. I had spent weeks building a credible decision system, a process, and had spent hours on vendor calls. The week we signed I had a quiet day. I sat down at my desk and started a working document about what would happen next. Legend has it that I’m still writing it.

The clean-water metaphor I had used in the proposal kept coming back to me. We had laid the pipes; that was the SDK integration, the data plumbing, the warehouse connections. The platform itself too, if you will. Pipes get you flow, but not clean water. In the worst case, pipes contaminate it instead (more crap output, faster). Clean water is what comes out of pipes when the rest of the system (the source, the treatment, the people who maintain it) does its job. Experiments work the same way: a platform gets you the flow, but the trustworthy results come from governance and process, from people, and from how seriously the organization treats the difference between testing an idea and launching a feature.

The tool is ready; the organization is not yet ready for the tool.

Until that point I had been deep in the cost of the contract, but not in the cost of bridging the gap between “the tool is here now” and “the organization is ready to use it.”

I had told colleagues, in the weeks leading up to signing, that a chunk of the analytics team’s capacity would slowly ramp up to a new equilibrium once Eppo was live. As of writing, I’m still hopeful that will materialise a quarter or two from now; but not before we get some things in place first. Velocity, the mere act of experimenting more in a given period, also has to wait.

Signing didn’t buy time back yet, nor did it bring us more experiments immediately. The work that started the day after signing had to happen before the platform could deliver the velocity potential we knew it had: forming a cross-functional integration group, drafting the experiment lifecycle, configuring Eppo protocols (part of its governance framework), certifying our first success metrics and guardrails, migrating a knowledge base, and designing a training curriculum. In short, what lay ahead was not a tool problem. Rather, it was a governance, process, and people one.

Three legs of a stool

For experiments to actually be trustworthy at ManyChat, three things must be present at the same time: tooling and engineering integration, so experiments can flow through the platform; process and governance, so the experiments that flow through are properly designed and decided; and people and skills, so the best practices are followed in practice and not only on paper. Drop any one of the three and the whole thing leans.

We had the tool and the connections now. Process and governance was mostly on the data science team: a five-stage experiment lifecycle (Propose, Design, Run, Analyse, Decide); a certified set of success and guardrail metrics; all of it encoded into the platform’s own protocol templates so that the rails were not a Notion page but a feature of the tool. People and skills are to materialise through ad hoc, Eppo-delivered tool quick-starts, and an Experimentation 101 and 102 curriculum in the long run. There is an ongoing argument for a graduated autonomy model, with PMs paired with analysts at first and more independence over time; that is the dot on the horizon.
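One way to picture “rails as a feature of the tool” is a lifecycle that refuses to skip stages. The sketch below is purely illustrative and says nothing about how Eppo protocols are configured; the strictly sequential ordering and the names `Stage` and `advance` are assumptions of the example.

```python
from enum import Enum


class Stage(Enum):
    """The five lifecycle stages, in the order named in the text."""
    PROPOSE = 1
    DESIGN = 2
    RUN = 3
    ANALYSE = 4
    DECIDE = 5


def advance(current: Stage, target: Stage) -> Stage:
    """Move an experiment to `target` only if it is the very next stage.

    Assumption of this sketch: stages are strictly sequential, so an
    experiment cannot jump from Propose straight to Run.
    """
    if target.value != current.value + 1:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


# An experiment that follows the rails.
stage = Stage.PROPOSE
stage = advance(stage, Stage.DESIGN)
stage = advance(stage, Stage.RUN)
```

The point of encoding the order is that skipping the design step becomes an error the system raises, not a guideline someone has to remember.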

The other thing

A milder lesson: signing Eppo was where my job description changed. I had walked into the project as the Staff responsible for picking a tool. I walked out doing change management: onboarding teams, teaching, leaning on PMs about lifecycle compliance, spending credibility I had banked for other things. It was totally worth it for me, though.

Closing notes

If I had to compress all of this, these would be the few lines I’d fit it in:

A credible decision is the deliverable, not the platform. The platform is an artifact. The decision is what your organization will live inside for years.

In the same spirit, pipes are not water. A tool is necessary infrastructure for trustworthy experimentation, but not sufficient. The work starts, not ends, on the day the contract is signed.

I’m writing all of this knowing the experimentation tools market is in motion; the vendor churn I flagged up top has not stopped. Whatever the map looks like by the time you read this, the bits of process that survived for me are probably the bits worth borrowing: the interviews, the phased discovery, the vision framing, and the honest budgeting for what comes after.

If you want to dive into the details over an online cup of coffee, feel free to ping me on LinkedIn! I’d be happy to share ideas with you.

Also check out my personal page for more pieces like this.
