Marketing and product teams are tasked with understanding customers. To do so, they look at customer preferences — motivations, expectations and inclinations — which in combination with customer needs drive their purchasing decisions.

In my years as a data scientist I learned that customers — their preferences and needs — rarely (or never?) fall into simple objective buckets or segmentations we use to make sense of them. Instead, customer preferences and needs are complex, intertwined and constantly changing.

While understanding customers is already challenging enough, many modern digital businesses don’t know much about their products either. They operate digital platforms to facilitate the exchange between producers and consumers. The digital platform business model creates markets and communities with network effects that allow their users to interact and transact. The platform business does not control their inventory via a supply chain like linear businesses do.

A good way to describe platform businesses is that they do not own the means of production but instead create the means of connection. Examples of platform businesses are Amazon, Facebook, YouTube, Twitter, eBay, Airbnb, property portals like Zillow, and aggregator businesses like travel booking websites. Over the last few decades, platform businesses have come to dominate the economy.

How can we use AI to make sense of our customers and products in the age of the platform business?


This blog post is a continuation of my previous discussion on the new gold standard of behavioural data in Marketing:


In this blog post we use a more advanced Deep Neural Network to model customers and products.


We use a deep Neural Network with the following elements:


  1. Encoder: takes input data describing products or customers and maps it into Feature Embeddings. (An embedding is defined as a projection of some input into another more convenient representation space)

  2. Comparator: combines customer and product feature embeddings into a Preferences Tensor.

  3. Predictor: turns the preferences into a predicted purchase propensity.

We train the neural network with product purchases as the target because we know that purchase decisions are driven by a customer’s preferences and needs. In doing so we teach the encoders to extract such preferences and needs from customer behavioural data and from customer and product attributes.
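To make the three elements more concrete, here is a minimal sketch of a single forward pass in plain Python. It is a toy illustration rather than the actual model: the weights are hypothetical hand-picked numbers instead of learned parameters, each encoder is assumed to be a single dense layer with a ReLU activation, and the comparator is assumed to be an element-wise product of the two embeddings.

```python
import math


def dense_relu(inputs, weights, bias):
    """One fully connected layer with ReLU: one output per row of weights."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def encoder(raw_features, weights, bias):
    # Encoder: maps raw input features into a feature embedding.
    return dense_relu(raw_features, weights, bias)


def comparator(customer_embedding, product_embedding):
    # Comparator (assumption: element-wise product) combines both embeddings
    # into a preference vector.
    return [c * p for c, p in zip(customer_embedding, product_embedding)]


def predictor(preferences, weights, bias):
    # Predictor: a logistic unit turning preferences into a propensity in (0, 1).
    z = sum(w * v for w, v in zip(weights, preferences)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy weights; in the real model these are learned in training.
CUSTOMER_W, CUSTOMER_B = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]
PRODUCT_W, PRODUCT_B = [[0.4, 0.4], [-0.3, 0.9]], [0.0, 0.0]
PREDICTOR_W, PREDICTOR_B = [1.0, -1.0], 0.0

customer_features = encoder([1.0, 0.0, 2.0], CUSTOMER_W, CUSTOMER_B)
product_features = encoder([1.0, 1.0], PRODUCT_W, PRODUCT_B)
preferences = comparator(customer_features, product_features)
propensity = predictor(preferences, PREDICTOR_W, PREDICTOR_B)
print(customer_features, product_features, preferences, propensity)
```

The intermediate values `customer_features`, `product_features` and `preferences` are the quantities we later capture for clustering, while `propensity` is what the model is trained on.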

We can analyse and cluster the learned customer and product features to derive a data driven segmentation. More on this later.

The following code uses TensorFlow 2 and Keras to implement our Neural Network architecture:

The code creates TensorFlow feature columns and can use numerical as well as categorical features. We are using the Keras functional API to define our customer preference neural network, which can be compiled with the Adam optimiser and binary cross-entropy as the loss function.
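For readers less familiar with the loss function, the following is a small plain-Python sketch of binary cross-entropy. It only illustrates the formula; Keras ships its own implementation, and the function name, the clipping constant `eps` and the toy labels below are my own assumptions for the example.

```python
import math


def binary_cross_entropy(labels, propensities, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted propensities."""
    total = 0.0
    for y, p in zip(labels, propensities):
        # Clip predictions away from 0 and 1 to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy data: 1 = customer purchased the product, 0 = no purchase.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # propensities close to the true labels
uncertain = [0.5, 0.5, 0.5, 0.5]   # a model that has learned nothing

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
print(loss_confident, loss_uncertain)
```

The loss shrinks as the predicted purchase propensities move towards the observed purchases, which is what pushes the encoders to learn useful customer and product features.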

We will need training data for our customer preference model. As a platform business your raw data will fall into the Big Data category. To prepare terabytes of raw data from click streams, product searches and transactions we use Spark. The challenge is to bridge the two technologies and feed the training data from Spark into TensorFlow.

The best format for large amounts of TensorFlow training data is the TFRecord file format, TensorFlow’s own binary storage format based on Protocol Buffers. The binary format greatly improves the performance of loading data and feeding it into model training. If you were to use CSV files, for example, you would spend significant compute resources on loading and parsing your data rather than on training your neural network. The TFRecord file format helps ensure that your data pipeline does not bottleneck your neural network training.

The Spark-TensorFlow connector allows us to save TFRecords with Spark. Simply add it as a JAR to a new Spark session as follows:

spark = (
  SparkSession.builder
  .master("yarn")
  .appName(app_name)
  .config("spark.submit.deployMode", "cluster")
  .config("spark.jars.packages", "org.tensorflow:spark-tensorflow-connector_2.11:1.15.0")
  .getOrCreate()
)

and write a Spark DataFrame to TFRecords as follows:

(
  training_feature_df
  .write.mode("overwrite")
  .format("tfrecords")
  .option("recordType", "Example")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(path)
)

To load the TFRecords with TensorFlow, you define the schema of your records and parse the dataset into an iterator of Python dictionaries using the TensorFlow Dataset API:


import tensorflow as tf

# Schema of the records: column name -> feature type and default value
SCHEMA = {
  "col_name1": tf.io.FixedLenFeature([], tf.string, default_value="Null"),
  "col_name2": tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

# Read the gzipped TFRecord files, parse each record and batch the results
data = (
  tf.data.TFRecordDataset(list_of_file_paths, compression_type="GZIP")
  .map(
    lambda record: tf.io.parse_single_example(record, SCHEMA),
    num_parallel_calls=num_of_workers
  )
  .batch(num_of_records)
  .prefetch(num_of_batches)
)

After training our Neural Network there are obvious real-time scoring applications, for example, scoring search results in a product search to address choice paralysis on platforms with thousands or even millions of products.


There is also an advanced analytics use case: examining the product and customer features and preferences for insights, and creating a data driven segmentation to help with product development. For this we score our entire customer base and product catalogue and capture the outputs of the Encoders and the Comparator of our model for clustering.


To capture the output of intermediate neural network layers we can reshape our trained TensorFlow model as follows:


trained_customer_preference_model = tf.keras.models.load_model(path)

customer_feature_model = tf.keras.Model(
  inputs=trained_customer_preference_model.input,
  outputs=trained_customer_preference_model.get_layer(
    "customer_features").output
)

We score our users with Spark, using a Pandas UDF to score a batch of users at a time for performance reasons:


from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Broadcast the wrapped model once so that every worker can rebuild it locally
customerFeatureModelWrapper = CustomerFeatureModelWrapper(path)
CUSTOMER_FEATURE_MODEL = spark.sparkContext.broadcast(customerFeatureModelWrapper)

@F.pandas_udf("array<float>", F.PandasUDFType.SCALAR)
def customer_features_udf(*cols):
  # Each column arrives as a pandas Series holding one batch of customers
  model_input = dict(zip(FEATURE_COL_NAMES, cols))
  # Score the batch with the Keras model held by the broadcast wrapper
  model_output = CUSTOMER_FEATURE_MODEL.value.model.predict(model_input)
  return pd.Series([np.array(v) for v in model_output.tolist()])

(
  customer_df
  .withColumn(
    "customer_features",
    customer_features_udf(*model_input_cols)
  )
)

We have to wrap our TensorFlow model into a wrapper class to allow serialisation, broadcasting across the Spark cluster and de-serialisation of the model on all workers. I use MLflow to track model artifacts but you could store them simply on any cloud storage without MLflow. Implement a download function fetching model artifacts from S3 or wherever you store your model.

class CustomerFeatureModelWrapper(object):
  def __init__(self, model_path):
    self.model_path = model_path
    self.model = self._build(model_path)

  def __getstate__(self):
    return self.model_path

  def __setstate__(self, model_path):
    self.model_path = model_path
    self.model = self._build(model_path)

  def _build(self, model_path):
    local_path = download(model_path)
    return tf.keras.models.load_model(local_path)
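The serialisation trick in the wrapper can be tried out without Spark or TensorFlow. The following sketch uses Python's `pickle` module, which is one way to exercise the same `__getstate__` and `__setstate__` hooks. The names `load_model_stub`, `PathOnlyWrapper` and `LOADS` are hypothetical stand-ins I made up for this example; the stub returns a local function, which, like many model objects, cannot be pickled directly.

```python
import pickle

LOADS = []  # records every (re)build of the stand-in model


def load_model_stub(path):
    """Hypothetical stand-in for downloading and loading a Keras model."""
    LOADS.append(path)

    def predict(batch):
        # Toy scoring function; a local function like this cannot be pickled.
        return [[float(len(path)), float(x)] for x in batch]

    return predict


class PathOnlyWrapper:
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = load_model_stub(model_path)

    def __getstate__(self):
        # Only the path is serialised, never the model object itself.
        return self.model_path

    def __setstate__(self, model_path):
        # On de-serialisation the model is rebuilt from the path.
        self.model_path = model_path
        self.model = load_model_stub(model_path)


wrapper = PathOnlyWrapper("models/customer_features")
payload = pickle.dumps(wrapper)    # roughly what happens when broadcasting
restored = pickle.loads(payload)   # roughly what happens on each worker
print(restored.model_path, restored.model([1, 2]), len(LOADS))
```

The design choice is that only a small string travels across the cluster, while every worker pays the cost of loading the model once when it de-serialises the wrapper.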

You can read more about how MLflow can help you with your Data Science Projects in my previous article:


Clustering and Segmentation

After scoring our customer base and product inventory with Spark we have a dataframe with feature and preference vectors as follows:


+-----------+---------------------------------------------------+
|product_id |product_features                                   |
+-----------+---------------------------------------------------+
|product_1  |[-0.28878614, 2.026503, 2.352102, -2.010809, ...   |
|product_2  |[0.39889023, -0.06328985, 1.634547, 3.3479023, ... |
+-----------+---------------------------------------------------+

As a first step, we have to create a representative but much smaller sample of customers and products to use in clustering. It is important that you stratify your sample with equal numbers of customers and products per stratum. Commonly, we have many anonymous customers with few customer attributes, such as demographics, to use for stratification. In such a situation, we can stratify customers by the attributes of the products they interact with as a proxy. This follows our general assumption that their preferences and needs drive their purchase decisions. In Spark you create a new column with the strata key. Get the total counts of customers and products by stratum and calculate the sampling fraction per stratum so that you sample approximately even counts from each stratum. You can use Spark’s

DataFrameStatFunctions.sampleBy(col_with_strata_keys, dict_of_sample_fractions, seed)

to create a stratified sample.
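The sampling fractions themselves are simple to derive. Here is a minimal plain-Python sketch, assuming a hypothetical target of 1,000 rows per stratum and made-up strata keys; in practice the counts would come from a Spark `groupBy` on the strata column rather than from a Python list.

```python
from collections import Counter


def sample_fractions(strata_keys, target_per_stratum):
    """Fraction to sample from each stratum to get roughly equal counts."""
    counts = Counter(strata_keys)
    # Fractions are capped at 1.0: a small stratum is simply kept in full.
    return {
        stratum: min(1.0, target_per_stratum / count)
        for stratum, count in counts.items()
    }


# Hypothetical strata keys, e.g. the product category customers interact with.
strata = ["houses"] * 8000 + ["flats"] * 2000 + ["land"] * 500
fractions = sample_fractions(strata, target_per_stratum=1000)
print(fractions)
```

The resulting dictionary has the shape expected by the `dict_of_sample_fractions` argument. Because Spark samples each row independently, the counts per stratum will only be approximately equal.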


To create our segmentation we use t-SNE to visualise the high-dimensional feature vectors of our stratified data sample. t-SNE is a stochastic ML algorithm that reduces dimensionality for visualisation purposes in such a way that similar customers and products cluster together. This is also called a neighbour embedding. We can use additional product attributes to colour the t-SNE results, which helps us interpret our clusters and generate insights as part of our analysis. After we obtain the results from t-SNE, we run DBSCAN on the t-SNE neighbour embeddings to find our clusters.

With the cluster labels from the DBSCAN output we can calculate cluster centroids:


import numpy as np

# Average the feature vectors of all products in each cluster
centroids = products[["product_features", "cluster"]].groupby(
    ["cluster"])["product_features"].apply(
    lambda x: np.mean(np.vstack(x), axis=0))

cluster
0     [0.5143338, 0.56946456, -0.26320028, 0.4439753...
1     [0.42414477, 0.012167327, -0.662183, 1.2258132...
2     [-0.0057945233, 1.2221531, -0.22178105, 1.2349...
...
Name: product_embeddings, dtype: object

Once we have our cluster centroids, we assign our entire customer base and product catalogue to their representative clusters, because so far we have only worked with a stratified sample of maybe 50,000 customers and products.

We again use Spark to assign all our customers and products to their closest cluster centroid. We use the L1 norm (or taxicab distance) to calculate the distance of customers and products to the cluster centroids in order to emphasise the per-feature alignment.

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
import numpy as np

# Distance between two vectors under the norm of order i (i=1 is the L1 norm)
distance_udf = F.udf(lambda x, y, i: float(np.linalg.norm(np.array(x) - np.array(y), axis=0, ord=i)), FloatType())

customer_centroids = spark.read.parquet(path)

# Pair every customer with every centroid and keep only the closest one
customer_clusters = (
    customer_dataframe
    .crossJoin(
        F.broadcast(customer_centroids)
    )
    .withColumn("distance", distance_udf("customer_centroid", "customer_features", F.lit(1)))
    .withColumn("distance_order", F.row_number().over(Window.partitionBy("customer_id").orderBy("distance")))
    .filter("distance_order = 1")
    .select("customer_id", "cluster", "distance")
)

+-----------+-------+---------+
|customer_id|cluster| distance|
+-----------+-------+---------+
| customer_1|      4|13.234212|
| customer_2|      4| 8.194665|
| customer_3|      1|  8.00042|
| customer_4|      3|14.705576|
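The nearest-centroid assignment is easy to check by hand on a small in-memory example. This plain-Python sketch mirrors the logic of the cross join and the window function; the centroids and customer vectors are made-up three-dimensional toy values, not model outputs.

```python
def l1_distance(x, y):
    """Taxicab distance: sum of absolute per-feature differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def nearest_cluster(features, centroids):
    """Return (cluster, distance) of the closest centroid under the L1 norm."""
    return min(
        ((cluster, l1_distance(features, centroid))
         for cluster, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy centroids and customer feature vectors.
centroids = {
    0: [0.0, 0.0, 0.0],
    1: [1.0, 1.0, 1.0],
    2: [-1.0, 2.0, 0.5],
}
customers = {
    "customer_1": [0.9, 1.2, 0.8],
    "customer_2": [-0.8, 1.5, 0.0],
    "customer_3": [0.1, -0.2, 0.0],
}

assignments = {
    customer_id: nearest_cluster(features, centroids)
    for customer_id, features in customers.items()
}
for customer_id, (cluster, distance) in assignments.items():
    print(customer_id, cluster, round(distance, 3))
```

Because the L1 norm adds up the absolute difference of every feature, every feature that deviates from the centroid contributes in proportion to its deviation, rather than a few large deviations dominating as they would with a squared distance.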

We can then summarise our customer base to get the cluster prominence:


total_customers = customer_clusters.count()

(
    customer_clusters
    .groupBy("cluster")
    .agg(
        F.count("customer_id").alias("customers"),
        F.avg("distance").alias("avg_distance")
    )
    .withColumn("pct", F.col("customers") / F.lit(total_customers))
)

+-------+---------+------------------+-----+
|cluster|customers|      avg_distance|  pct|
+-------+---------+------------------+-----+
|      0|     xxxx|12.882028355869513| xxxx|
|      5|     xxxx|10.084179072882444| xxxx|
|      1|     xxxx|13.966814632296622| xxxx|

This completes all the steps needed to derive a data driven segmentation from our Neural Network embeddings:


Read more about segmentation and ways to extract insights from our model in my previous article:


To learn more about how to deploy a model for real-time scoring I recommend my previous article on the topic:


  • Compared to the collaborative filtering approach in the linked article, the Neural Network learns to generalise, and a trained model can be used with new customers and new products. The Neural Network has no cold start problem.

  • If you use at least some behavioural data as input for your customers in addition to historic purchases and other customer profile data, your trained model can make purchase propensity predictions even for new customers without any transactional or customer profile data.


  • The learned product feature embeddings will cluster into a larger number of distinct clusters than your customer feature embeddings. It’s not unusual that most customers fall into one big cluster. This does NOT mean 90% of your customers are alike. As described in the introduction, most of your customers have complex, intertwined and changing preferences and needs. This means that they cannot be separated into distinct groups. It doesn’t mean that they are the same. The simplification of a cluster is not able to capture this, which just reiterates the need for machine learning to make sense of customers.

  • While many stakeholders will love the insights and segmentation the model can produce, the real value of the model is in its ability to predict a purchase propensity.


Jan is a successful thought leader and consultant in the data transformation of companies and has a track record of bringing data science into commercial production usage at scale. He has recently been recognised by dataIQ as one of the 100 most influential data and analytics practitioners in the UK.

