Experiment, don’t argue! A pozitive technologies speaker on experimenting in production

This coming Thursday, September 26, Willy Picard will speak at the pozitive technologies conference in Poznań. To mark the occasion, we asked him to write an article not only about machine learning, but also about collaboration between client and system. The main principle that guides Picard in his work is: Experiment, don’t argue! You can read more in the article below, which he wrote in English.

Willy Picard. Head of the Datalab at Egnyte. Willy Picard is an Engineering Manager at Egnyte. As the head of the Datalab at Egnyte, he is responsible for investigating innovative ideas based on machine learning and putting them into production. He started his career in academia, obtaining his Ph.D. in 2002 and his “habilitation” degree in 2013. He then moved to industry to work on natural language processing and text mining. He has also led innovation initiatives for the defense sector, working on conversational bots, data analysis of Bitcoin, and applications of machine learning for fraud detection.


Arguing is not discussing!

Discussion and collaboration are key elements for any serious IT endeavor to be successful. When teams are working together to develop software, there is a need for a variety of perspectives and opinions, which necessitates discussion of things like design choices, architecture, UI, and even implementation specifics. 

It’s important to keep these discussions from turning into unproductive arguments. Rather than getting stuck in a continuous cycle of meetings and pointless discussions, I suggest a simple but important rule when it comes to technical decision-making: Experiment, don’t argue! Before we jump into the details of this framework, let’s go through the most common issues I encounter during technical discussions.

Fallacious arguments: we’ve all heard them before

What distinguishes a pointless quarrel from a productive discussion? The main characteristic of the former is that people mostly rely on fallacious arguments to convince the other side. Arguments can be fallacious for various reasons, including the following:

  • The “well-known” solution: this type of argument is based on the idea that someone has already tested some ideas and published the results somewhere. However, those results may be irrelevant today or in a different environment. Think about all the benchmarks proving that JavaScript on the server-side does not make sense (and what the creators of Node.js think about this).
  • The cherry-picking validation: this type of reasoning, also known as “it does not work for me,” is a popular approach to invalidate an argument. What’s wrong with it? In many cases, an evaluation based on a single example is in no way valid: it could be pure (bad) luck that a given “not working” case has been chosen from a large set of successful ones.
  • The biased opinion: this is usually based on former experience leading to an inclination for or against certain outcomes. Some have biases towards particular text editors, towards tabs or spaces, or towards specific IDEs.

All of these arguments have something in common: they rely on opinions based on prior knowledge or a small number of data points. They are not based on data coming from tests with measurable results. 

In other words, these types of arguments are not the results of experiments, defined as “a test done in order to learn something or to discover if something works or is true” (according to the Cambridge Dictionary).

So, instead of arguing about which solution is the best one among the various options available, one may decide to conduct experiments to test the options in a measurable manner. Here’s how to get started. 

How to design an experiment

The goal of every experiment should be to check if a hypothesis is true or false. In a way, an experiment is a tool to validate an opinion in a sound, serious, measurable manner.

For each experiment, there is a common ground upon which the stakeholders of the experiment agree. Say you want to determine the appropriate color and size of a button on your web page. The common ground here is that the button is needed in the first place.

Besides the common ground, there should be at least two variants to be confronted. A variant is one proposal for the solution of a given problem. Following on the former example, the first variant could be a red solid button while the second variant could be a red-yellow gradient button. In some cases, the first variant is the current state. In other cases, especially for new features, the variants are all new.

Having identified the potential solutions as variants, we need a way to compare them in a measurable manner. This is the role of metrics. Metrics allow us to measure the performance of each variant. In our case, the number of clicks a button receives could be a good metric, provided our goal is to get more clicks on this particular button.

Finally, we need to define rules for decision making to create an actionable experiment. These rules should clearly outline what actions will be taken depending on the values of the metrics. Back to our button example: if we observe a 5% increase in the number of clicks on the gradient button, should we then replace the current red button with the gradient one?
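To make these building blocks concrete, here is a minimal sketch in Python of the button experiment: two variants, one metric, and a decision rule fixed up front. All names (VariantResult, decide, min_uplift) and all numbers are hypothetical. The sketch also assumes clicks are normalized by impressions, which the example above does not state, and it ignores statistical significance, which a real experiment would have to check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantResult:
    """Observed data for one variant (hypothetical field names)."""
    name: str
    impressions: int  # how many times the button was shown
    clicks: int       # how many times it was clicked

    @property
    def click_rate(self) -> float:
        # The metric: clicks per impression, so variants shown to
        # different numbers of users remain comparable (an assumption).
        return self.clicks / self.impressions if self.impressions else 0.0


def relative_uplift(baseline: VariantResult, candidate: VariantResult) -> float:
    """Relative change of the candidate's click rate versus the baseline."""
    if baseline.click_rate == 0:
        raise ValueError("baseline click rate is zero; uplift is undefined")
    return (candidate.click_rate - baseline.click_rate) / baseline.click_rate


def decide(baseline: VariantResult, candidate: VariantResult,
           min_uplift: float = 0.05) -> str:
    """Decision rule agreed on before the experiment starts:
    switch to the candidate only if it beats the baseline by min_uplift."""
    uplift = relative_uplift(baseline, candidate)
    return candidate.name if uplift >= min_uplift else baseline.name


# Made-up numbers, for illustration only.
red = VariantResult("red solid", impressions=10_000, clicks=400)
gradient = VariantResult("red-yellow gradient", impressions=10_000, clicks=430)
winner = decide(red, gradient)
```

The point of writing the rule down as code (or in any other explicit form) before looking at the data is that nobody can move the threshold once the results are in.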

Experiments ≠ A/B testing

The implementation of experiments as a framework to make technical decisions is too often reduced to A/B testing. A/B testing is a technique to choose one variant out of two potential solutions. You have to pick two groups of users and analyze their reaction to variant A or B. The scope of experiments is wider than A/B testing: you can, for instance, compare the speed of two different implementations of a given algorithm. No user is involved in this experiment, so it is not A/B testing. 
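A user-free experiment of this kind can be sketched with Python's standard timeit module. The two implementations below (a loop and a closed-form formula for a sum of squares) are hypothetical stand-ins for "two implementations of a given algorithm", and median_runtime is an illustrative helper, not part of any library.

```python
import statistics
import timeit


def sum_of_squares_loop(n: int) -> int:
    """Variant A: straightforward loop over 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_formula(n: int) -> int:
    """Variant B: closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def median_runtime(func, arg, repeats: int = 5, number: int = 100) -> float:
    """Median time in seconds over `repeats` batches of `number` calls."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeats, number=number)
    return statistics.median(timings)


# Common ground: both variants must return the same result.
same_result = sum_of_squares_loop(1_000) == sum_of_squares_formula(1_000)

# Metric: median runtime of each variant on the same input.
loop_time = median_runtime(sum_of_squares_loop, 1_000)
formula_time = median_runtime(sum_of_squares_formula, 1_000)
```

Checking that both variants agree before timing them keeps the comparison honest, and taking the median of several batches makes the measurement less sensitive to a single noisy run.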

How to get data for experiments

As previously mentioned, experiments should rely on large datasets with many examples. One source that’s usually widely available is historical data. You probably already have a lot of it, since it is gathered along the way. You have to be careful with it, however: it may point to information that is no longer available (links to old MySpace pages), the structure of the data may have changed (tweets are no longer limited to 140 characters), or user behavior might have changed.

You can also use synthetic data. Synthetic data is usually similar to real data, but it rarely has the diversity of the data you can find in production. So the best source of data for most experiments is the data available in production.

Note that conducting experiments with data in production is a risky business that should be done with great care: you don’t want to pollute your production data with results of experiments, especially when testing the bad variants. Therefore, it is crucial to have a dedicated and secured experiment environment to run the experiments without any impact on production. Having such an experiment environment in place is the key to a wide acceptance of experiments as a technical decision-making tool.

Need for experiment platforms

Your experiment environment should not only mirror data and separate experiments from production, it should also take care of versioning. Versioning the code alone is not enough: you should also version the datasets used to perform your experiments, along with the configurations and the evaluation results. When you start to run dozens of experiments, you need a tool that captures the fact that on a given day you started an experiment with a given group of users, a given configuration of your services, and a given version of your algorithm, as well as the values of the metrics for this particular experiment.
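As a rough illustration of what such a tool has to record, here is a minimal Python sketch of a single experiment record. The class, its fields, and the fingerprint helper are all hypothetical, and the values are made up; a real experiment platform would store far more and persist it somewhere.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass(frozen=True)
class ExperimentRecord:
    """One experiment run and everything needed to reproduce it (hypothetical)."""
    name: str
    started_on: date
    user_group: str
    code_version: str      # e.g. a commit hash of the algorithm
    dataset_version: str   # identifier of the dataset snapshot used
    service_config: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Short hash of the setup (everything except the metrics), so
        runs with an identical setup can be recognized and compared."""
        payload = asdict(self)
        payload["started_on"] = self.started_on.isoformat()
        payload.pop("metrics")
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()[:12]


# Made-up values, for illustration only.
record = ExperimentRecord(
    name="button-gradient",
    started_on=date(2019, 9, 26),
    user_group="group-a",
    code_version="abc1234",
    dataset_version="clicks-v3",
    service_config={"button_style": "gradient"},
    metrics={"click_rate": 0.043},
)
setup_id = record.fingerprint()
```

Excluding the metrics from the fingerprint is a design choice in this sketch: the hash identifies what was run, while the metrics describe what came out of it.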

Many companies, such as Microsoft, Google, Uber, Airbnb, Pinterest, and Netflix already have experiment platforms in place and write about these platforms on their technical blogs. Other companies offer an experiment platform as a product: Google Optimize, Adobe Target, ax.dev (by Facebook) or Optimizely.

To conclude, introducing experiments as a technical decision-making framework leads to better, data-driven decisions. It also reduces the frustration that comes from arguing during lengthy technical meetings. One may focus on discussing which experiments to conduct instead of arguing about one’s biases. As Richard Feynman said, “it doesn’t matter how beautiful your theory is, it doesn’t matter how smart you are. If it doesn’t agree with experiment, it is wrong.”

Do you conduct experiments in production or are you a fervent supporter of heated debates in your daily workflow? I’d love to hear about your experience with team quarrels and experiments in production. Feel free to share your decision-making failures, and brag about your successes in the comments!


The main photo for this article comes from unsplash.com.
