Why clickstream is so important to your online business
Clickstream data lets you see what actions customers take on your website. As commerce shifts more and more online, this data is becoming essential for your business to stay competitive. Before defining what kind of data this is, let’s look at the main reasons why a business needs to own it in the first place.
No data science without data
The first reason to collect and own clickstream data is to be able to take advantage of data science. As the name implies, data comes before any science can be done; without it, even the most sophisticated models won’t work. This is why you should pursue strategic data acquisition, which will make your business more defensible in the long run.
Understanding customer – key advantage
Clickstream is often associated with web analytics because it lets you analyze your customers’ behavior. For example, you can find out how many customers drop off between the landing page and completing a purchase. The advantage of owning such data is that you can filter by any trackable metric down to the individual visitor level, without the limitations of the reporting dashboards provided by web analytics tools.
You are also free to combine reports with any other data source at your disposal. For example, you can stitch together orders, paid advertisement reports, geo and other sources, which increases the utility of your data assets. Of course, this is possible only when you have full access to the collected dataset and it’s available in one unified location.
Going beyond charts and dashboards
Tracking KPIs with charts and dashboards is helpful for monitoring business health and detecting problems in real time. However, this is mostly useful for making high-level business decisions. To truly bring a business to the next level, the data can be used to optimize activity down to the level of each customer. One of the most popular examples is personalizing the customer experience.
Personalization can be done at different customer touch points. For example, when a customer visits your website, we know from the data what he has bought before and which pages he has visited. By combining a single customer’s data with that of other customers, we can recommend relevant products or content tailored specifically to the customer who is browsing your website. The same approach can be extended to email, advertisement campaigns or even a physical store. This way the customer experience stays consistent across all touch points. For any business, this can serve as a key differentiator.
A good case study showing how owned data can drive a business is Zara. Using data as its backbone, the company manages the inventory of each of its 2000 stores, and what’s on display, on a daily basis. This would be impossible without full access to the collected dataset.
What is clickstream data
To understand how we can use a clickstream dataset, we first need to define what kind of data it contains and how it is collected. We can define clickstream as a sequence of events that represent visitor actions on a website.
The most common and useful event is the ‘click’, which indicates what the visitor has viewed. Of course, we are not limited to collecting just clicks; we can also collect impressions, purchases and any other events relevant to the business.
Furthermore, an event can include multiple contexts that enrich it, such as how long the page took to load or what type of browser or device the visitor is using. Essentially, good clickstream data clearly defines a full set of events that allows us to infer a complete picture of customer behavior. Conceptually, we can think of events as having their own grammar.
Later in the article we’ll take a look at different options for tracking events.
Example data output
The best way to gain a deeper understanding of clickstream data is to have a look at particular examples. Below we provide a sample event for page view:
|Time||Thu, 25 Apr 2019 08:33:03 GMT|
|Event Type||string Pageview|
|Application ID||string joes_bikes|
|Event ID||string e8468c4a-5d95-42aa-81e1-c72d27a5018a|
|Device Created Timestamp||string 2019-04-25T08:33:03.200Z|
|Device Sent Timestamp||string 2019-04-25T08:33:03.204Z|
|Tracker Name||string cf2|
|Tracker Version||string js-2.8.2|
|searchTag||string front lamps|
|Browser User Agent||string Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36|
|Browser Language||string en-US|
The table above shows a sample of data sent from a fictional online store, joesbikes.com, based on a real tracking event. The most essential field is the event timestamp, which allows us to analyze events as a time series.
Another important part is the custom page context, which describes details of the viewed page. A notable field is searchTag: it shows what the user searched for, so we can check whether it matches the page he viewed. Combining such events into a sequence lets us see whether the path a user takes to a purchase is optimal, or whether it can be improved along with the conversion rate.
Lastly, we also get browser information. This is useful for understanding what types of devices your visitors use and, especially, whether there are problems rendering certain pages. For instance, we can analyze whether mobile visitors convert at the same rate as desktop visitors. Given how important the mobile experience is today, it’s critical for a business to have this visibility.
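To make the structure of such an event more concrete, here is a minimal Python sketch of how a page view like the one above could be assembled and serialized before being sent to a collector. The class name `PageViewEvent`, its field names and the `is_mobile` heuristic are assumptions made for illustration; they are not the schema of any particular tracker.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now_iso() -> str:
    # Current UTC time as an ISO 8601 string.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class PageViewEvent:
    # Hypothetical schema, loosely mirroring the sample table above.
    app_id: str
    user_agent: str
    language: str
    search_tag: Optional[str] = None  # custom page context
    event_type: str = "pageview"
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    device_created_ts: str = field(default_factory=_now_iso)

    def to_json(self) -> str:
        # Serialize all fields into the payload a tracker would send.
        return json.dumps(asdict(self))


def is_mobile(user_agent: str) -> bool:
    # Crude heuristic (an assumption): mobile browsers usually include "Mobi".
    return "Mobi" in user_agent


event = PageViewEvent(
    app_id="joes_bikes",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) Chrome/73.0.3683.86",
    language="en-US",
    search_tag="front lamps",
)
payload = event.to_json()
```

Keeping the event ID and timestamp as defaults means every event is unique and can be ordered in time without the caller having to remember to set them.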
Now let’s look at a different event sample: a product impression.
|name||string Joes Leather Gloves|
Here we can see the main attributes of a product shown on the page. A captured impression event should help us determine which product was displayed, where on the page, and with which variable attributes. From the above event, we can see that the gloves were displayed in the 2nd row and 1st column of a container on a page called bestsellers. We can also see the price and review score shown for the product. This information alone is enough to determine how well displayed products perform relative to their exposure across the whole website. We can also determine how well they “compete” with each other given the same or different variables (price, location, etc.).
As the examples above show, the information being tracked is fairly trivial from the perspective of a single event. The power comes from having access to these events across all the pages that visitors interact with, over a period of time. Then you can measure which pages might need improvement, or whether the overall website could perform much better. We’ll take a look at a few use cases in the next section.
The easiest way to utilize clickstream data is to see where a website is getting its traffic from. This sounds trivial given how many online tools serve this purpose, but getting true numbers down to the individual visitor level requires owning the clickstream data. We can analyze not just which source brings the most traffic, but also:
- determine which keywords are most popular,
- see conversion rates for visitors from different traffic sources,
- do a cohort analysis,
- even determine which marketing campaign brought the most traffic. This is possible thanks to automatic parsing of UTM query parameters, which are made available in the unified data warehouse. It allows us to track any kind of campaign, from paid advertisement to email.
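As a rough illustration of the UTM parsing mentioned above, the sketch below pulls the standard campaign parameters out of a landing page URL using Python’s standard library. The function name `extract_utm` and the example URL are made up for this example; a real pipeline would typically do this during enrichment.

```python
from urllib.parse import parse_qs, urlparse

# The five conventional UTM parameter names.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")


def extract_utm(url: str) -> dict:
    """Return the UTM parameters present in the URL's query string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


campaign = extract_utm(
    "https://joesbikes.com/lamps"
    "?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale&page=2"
)
# campaign == {"utm_source": "newsletter", "utm_medium": "email",
#              "utm_campaign": "spring_sale"}
```

Non-UTM parameters such as `page` are ignored, so the result can be stored as dedicated campaign columns next to each event.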
Besides the above, we can extend tracking to measure the open and click rates of email campaigns. This is especially useful for making your analytics independent of any ESP (email service provider), which makes it easier to migrate from one email provider to another without losing performance data.
Sales funnel analysis
A sales funnel is often used to determine how well a website converts visitors into sales. We define the stages of the customer journey, from landing on your website (or app) to paying for a product. Each stage usually has a drop-off percentage, which can occur for many reasons. Clickstream data can expose these problems. For example, if visitors on one product page have a much larger CTR than on another, we can investigate the reason and try to improve the weaker page, for instance by updating its content. Later, in the section on experiments, we’ll see how such improvements can be tested.
Beyond single-stage problems, a sales funnel can serve as a health metric that quickly shows when conversion at a certain stage starts dropping. Such a drop could mean that part of our system has stopped working and requires quick action. For an online business, where every lost hour can cost thousands of dollars, having this visibility is critical.
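The funnel idea can be sketched in a few lines of Python. The stage names `landing_view` and the in-memory list of `(visitor_id, event)` pairs are assumptions for illustration, and for simplicity the sketch ignores the order in which a visitor triggered the events; in practice this would usually be a query against the data warehouse.

```python
from collections import defaultdict

# Hypothetical funnel stages, from landing to completed order.
FUNNEL = ["landing_view", "product_view", "checkout_form_view", "order_confirmation_view"]


def funnel_counts(events, stages=FUNNEL):
    """Count visitors who reached each stage and all stages before it."""
    seen = defaultdict(set)
    for visitor_id, event in events:
        seen[visitor_id].add(event)
    counts = []
    for i in range(len(stages)):
        required = set(stages[: i + 1])
        counts.append(sum(1 for visitor_events in seen.values() if required <= visitor_events))
    return counts


def drop_off_rates(counts):
    """Share of visitors lost between each pair of consecutive stages."""
    return [0.0 if prev == 0 else 1 - cur / prev for prev, cur in zip(counts, counts[1:])]


events = [
    ("v1", "landing_view"), ("v1", "product_view"),
    ("v1", "checkout_form_view"), ("v1", "order_confirmation_view"),
    ("v2", "landing_view"), ("v2", "product_view"),
    ("v3", "landing_view"),
    ("v4", "landing_view"), ("v4", "product_view"), ("v4", "checkout_form_view"),
]
counts = funnel_counts(events)   # [4, 3, 2, 1]
rates = drop_off_rates(counts)   # 25%, ~33% and 50% lost at each step
```

Tracking these rates over time is what turns the funnel into a health metric: a sudden jump in one stage’s drop-off is a signal to investigate that stage.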
Browse/Cart abandonment and recovery
Whenever a shopper puts a product in a cart, there is a high likelihood that the cart will be abandoned. As you can see in the chart below, up to 80% of online customers abandon their shopping carts.
This is quite significant for any online business, especially if some of those abandoned carts can be recovered. To act on this event, we need a way to track when a customer has added items to a cart and no order was made within a certain period of time. With clickstream data we can capture these events as follows:
SELECT email, mobile, first_name, last_name
FROM customers -- table name assumed for illustration
WHERE visitor_id IN (SELECT visitor_id FROM cart_abandoners)
One doesn’t need to be an SQL expert to understand what’s happening above. We’re just fetching all customers who are in the cart abandoner segment. We still need to define what a cart abandoner is in a clickstream dataset, which we can do by relying on visitor events:
WITH cart_abandoners AS (
  SELECT DISTINCT visitor_id FROM customer_clickstream
  WHERE event = 'checkout_form_view'
    AND visitor_id NOT IN (
      SELECT visitor_id FROM customer_clickstream
      WHERE event = 'order_confirmation_view'
        AND ts_event > TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 2 HOUR) )
    AND ts_event > TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 2 HOUR) )
The script above is a little more involved, but it clearly shows how clickstream data can be used to segment visitors by their actions. We simply find all customers who visited the checkout page but haven’t viewed the order confirmation page, which is shown after a purchase has completed. With this segment in hand, we can use it for email or SMS campaigns that try to recover a portion of the abandoners. Such marketing campaigns are out of scope for this article, but the most obvious action would be sending these customers a discount voucher code for your products, or recommending other products the customer might like.
A nice thing about the above approach is that it can easily be adapted to browse abandonment, where a customer is just browsing product pages but not buying anything. We just need to swap the ‘checkout_form_view’ event for ‘product_view’ and exclude carters and buyers. For this to work, the clickstream data has to be updated fairly frequently, so that marketing automation has a better chance of recovering customers before they forget about the purchase.
Cart and browse abandoners are just one subset of customer segmentation. With access to the full clickstream dataset we can create segments by any number of parameters, such as recency, average purchase amount, geolocation or specific products that a customer has viewed or bought in the past. With data, the only limit to segmentation is the marketer’s imagination.
Once simple analysis is in place, it is possible to utilize clickstream data for more difficult tasks, like improving customer experience. A classical use case is product recommendations.
When a store sells a lot of products, finding the right one can be difficult. The biggest online retailers, like Amazon, look for similarities between products and customers and use them for recommendations. This helps in two ways: it makes product discovery easier, and it tailors the shopping experience to the customer’s interests.
Implementing a simple recommendation model using clickstream data is not difficult. We find customers who have viewed certain products and what they bought afterwards. Then we compare all purchases of a given product with purchases of related products. Once we have those values computed for each product, we can rank them and show the top results when a visitor lands on a product page. Here, the advantage of owning the data is that we can use any product attributes that might be relevant for recommendations.
The same recommendations can be extended to email or other marketing campaigns without any changes to the model’s logic or data.
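A very simple version of the “viewed this, bought that” idea can be sketched as a co-occurrence count. This is only one possible reading of the approach described above, not a production model: the function name, the event names and the tuple layout are assumptions, and events are expected to be sorted by time.

```python
from collections import Counter, defaultdict


def build_recommendations(events, top_n=3):
    """events: (visitor_id, event, product_id) tuples ordered by time.

    For every viewed product, count which other products the same
    visitor bought afterwards, then rank them by that count.
    """
    viewed = defaultdict(list)      # visitor -> products viewed so far
    scores = defaultdict(Counter)   # viewed product -> bought product counts
    for visitor_id, event, product_id in events:
        if event == "product_view":
            if product_id not in viewed[visitor_id]:
                viewed[visitor_id].append(product_id)
        elif event == "purchase":
            for seen_product in viewed[visitor_id]:
                if seen_product != product_id:
                    scores[seen_product][product_id] += 1
    return {
        product: [bought for bought, _ in counter.most_common(top_n)]
        for product, counter in scores.items()
    }


events = [
    ("v1", "product_view", "gloves"), ("v1", "purchase", "helmet"),
    ("v2", "product_view", "gloves"), ("v2", "purchase", "helmet"),
    ("v3", "product_view", "gloves"), ("v3", "purchase", "lamp"),
]
recommendations = build_recommendations(events)
# recommendations["gloves"] == ["helmet", "lamp"]
```

Because the ranking is just a lookup table keyed by product, the same output can be reused on a product page, in an email or in an ad campaign.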
Tracking Experiments (A/B testing)
Another useful type of optimization analysis is running and tracking A/B experiments. An experiment can help you decide whether a particular change has any effect on business-relevant KPIs. Suppose we decide to change the design of a certain page to improve its conversion rate. The simplest approach would be to update the design and see whether the conversion rate on that page improves after some time. However, conversion rates change over time, and comparing against different historical periods can lead to inaccurate conclusions. The best approach is to run two designs simultaneously for different visitors and track the outcome of each. Then, if the conversion rate improves for one design versus the other, we can be confident that it really is better.
Tracking experiments is not too different from tracking any other event. We just need to record which variation a visitor is viewing. The harder part is making sure that a visitor sees only one variation across multiple visits, otherwise the results might be skewed. To do this, we split visitors using their user agent and IP address, and consistently serve each visitor one variation or the other.
The advantage of tracking experiments together with other events is that it makes it easy to compare effects on visitor behaviour across all situations. For example, we can find out whether a new design works as well for mobile visitors as for desktop visitors, and how their conversion or click-through rates differ. Of course, there is no limit to the kinds of experiments that can be run, tracked and analyzed.
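One common way to keep the split stable is to hash the visitor’s attributes and derive the variation from the hash. The sketch below assumes SHA-256 and a hypothetical `assign_variation` helper; the article only states that user agent and IP address are used, so the hashing scheme itself is an illustration.

```python
import hashlib


def assign_variation(user_agent, ip_address, experiment, variations=("A", "B")):
    """Deterministically map a visitor to one variation of an experiment."""
    # Including the experiment name gives each experiment its own split.
    key = f"{experiment}|{user_agent}|{ip_address}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variations)
    return variations[bucket]


variation = assign_variation(
    "Mozilla/5.0 (X11; Linux x86_64)", "203.0.113.7", "new_checkout_design"
)
# The same visitor always gets the same variation for this experiment,
# so the value can simply be recorded alongside every tracked event.
```

No state has to be stored to keep the assignment consistent, although a visitor whose IP address or browser changes will be treated as a new visitor.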
Another clickstream use case, which is becoming more relevant in the mobile internet era, is stitching customers into a single profile. For example, a customer may open a marketing email on mobile and browse some products, but switch to a desktop when it comes to purchasing. In this case we would want to know whether this is the same customer or a different one.
If we track everything with one pipeline, we can find this customer by matching his IP address, assuming that his mobile phone most likely shares the same Wi-Fi connection as his desktop. We can also use other “marks”, such as a cookie ID. When a customer opens an email, we track this with a hash of his email address; if the same customer comes back to the website, we can find the same hash even when he is using a different device.
The idea behind identity stitching is to match customers to as many available identifiers as possible in order to build an accurate profile. A business can then tailor the customer experience to that profile at all touch points.
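One way to think about stitching is as grouping identifiers that were ever observed together. The sketch below uses a small union-find structure for this; the `IdentityGraph` class, the `cookie:` and `email:` prefixes and the use of SHA-256 for the email hash are all assumptions chosen for illustration.

```python
import hashlib


def email_hash(email: str) -> str:
    # Normalize before hashing so the same address always yields the same hash.
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


class IdentityGraph:
    """Groups identifiers (cookie IDs, email hashes, ...) into profiles."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        self._parent.setdefault(item, item)
        while self._parent[item] != item:
            # Path halving keeps lookups fast as groups grow.
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def link(self, *identifiers):
        """Record that these identifiers were seen in the same event."""
        roots = [self._find(identifier) for identifier in identifiers]
        for root in roots[1:]:
            self._parent[root] = roots[0]

    def same_customer(self, a, b):
        return self._find(a) == self._find(b)


graph = IdentityGraph()
hashed = "email:" + email_hash("joe@example.com")
graph.link("cookie:mobile-123", hashed)    # email opened on a phone
graph.link("cookie:desktop-456", hashed)   # later visit from a desktop
stitched = graph.same_customer("cookie:mobile-123", "cookie:desktop-456")  # True
```

Weak signals such as a shared IP address could be linked in the same way, but they are more likely to merge different people, so it is worth treating them with more caution than a cookie ID or email hash.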
How to collect clickstream data
The data collection process is fairly similar for almost all tools and can be visualized as in the following example:
- A customer visits a web page with his mobile device.
- The tracker code tracks the events that the customer is performing on his device and sends them to a collector server.
- Those events then get saved, validated and enriched.
- Finally, each event is sent to a unified data warehouse.
- Once events are available, they can be used by different stakeholders, such as business analysts, data scientists or executives.
The above steps are used by most companies collecting clickstream events; the differences are in storage, data availability and the variety of events that can be collected. To help you choose which tools are better suited to your business needs, let’s describe the major categories available.
Vendors – Paid
The biggest category of clickstream data collection providers is of course paid ones. Let’s list several major ones available:
One drawback of using a custom tracker is that it imposes no structure on event schemas, which makes the tracking code error-prone to maintain as the variety of events grows.
Looking at the documentation, there is also no support for email tracking via pixel, though this might be intentional given that Mixpanel supports its own messaging. This can be a plus or a minus depending on whether a business already uses an existing ESP platform.
Mixpanel does support exporting data to a cloud data warehouse, but with some limitations: the expected data latency is around 24 hours, and there is no control over what gets exported or over which historical period. The schema is fairly limited, mostly to the custom event fields sent via the tracker (no page referrer, user agent, IP address, location, campaign fields, etc.).
Kissmetrics is a customer analytics tool that allows you to track, report on and message customers. Its features are somewhat similar to Mixpanel’s, but not all charting and segmenting capabilities are the same. It supports additional default event fields like campaigns, location, page referrers and user agent. Similarly, Kissmetrics doesn’t have a way to track emails with a pixel, due to the availability of its own emailing feature.
Their API supports SQL queries, which allows you to run arbitrary reports. However, raw event data is exportable only as JSON files, which means that if raw data is needed you would have to implement a manual data pipeline to process those files and keep them in sync with a specific data warehouse.
The biggest advantage of Amplitude is the ability to sync all events directly with a data warehouse with a moderate delay: 30 minutes for Snowflake and 3 hours for Redshift. You can also map charts and dashboards directly to SQL queries, which allows unlimited customization in terms of reporting.
From the data perspective, all expected data points like page referrer, location and campaign params are included by default. Heap also provides a way to export data to all major cloud data warehouses with a complete schema, though there is a delay of 24 hours until new data becomes available.
The major difference is the ability to access raw data, which is supported only in the GA 360 version. Data gets exported to BigQuery, a cloud data warehouse managed by Google Cloud Platform. Another advantage compared to the free version is no data sampling. According to this report, the free version of GA caps collection at 500k events per month, while GA 360 can collect up to 2 billion events per month. GA 360 also has extra integrations with Google Ads, making it easier for companies to optimize their ad spend.
Even though data is exported automatically, there is still a possible delay of up to 4 hours before it is accessible. This makes it harder to use for more real-time, actionable analytics such as browse abandonment. The other big hurdle is the price tag: GA 360 costs $150k per year, making it available only to bigger companies.
Adobe Marketing Cloud
The last paid vendor we need to take a look at is Adobe. Whenever GA 360 is compared with alternatives, Adobe is referenced as its main competitor. According to its offering overview, it provides a flexible reporting dashboard for creating arbitrary reports tailored to business needs. By default, the available schema tracks the expected columns like location, referrer and browser details.
When it comes to exporting data, the only option available is raw files to either FTP or S3. It does support pushing data hourly, but getting continuous access to the data would require setting up and maintaining a manual data pipeline. Given that the product is targeted at enterprise customers, this is a significant limitation compared to GA 360 or any of the other solutions mentioned above.
Given the above vendors we can sum up their pros/cons as a whole:
Pros:
- Minimal or zero development cost
- Usually comes with a feature-rich UI
- Easy to get started even with little data analytics background
- Integrates with many other SaaS providers in other categories
Cons:
- The cost grows quickly with data volumes
- Limits on collected event customization
- Full history data access is either not available or requires costly manual integrations or has a substantial delay
- No way to extend data collection features
- Migrating to a different vendor is limited or impossible
- PI and/or PII personal data collection/storage is prohibited or limited
Some of the cons can be reduced when integrating with additional products like Segment, but that comes with an additional cost both in vendor fees and maintenance overhead.
Vendors – Free
Most vendors provide a free tier for their service. It can be a reasonable approach if a business collects only a limited number of events per month or tracks a small number of sessions or customer profiles. The following options are available:
GA is the web analytics provider most widely used by companies large and small. By some statistics, more than 50% of the top 1 million ranked websites use GA. It provides essential dashboards for traffic analysis, customer segmentation and traffic source attribution. The biggest limitations are not being able to access the raw data and the sampling of events once a threshold is reached. As we mentioned above, GA 360 addresses these limits.
At first glance, the limits are quite generous for most businesses, as not many websites reach over a million visitors a month, which is when the sampling rate can become a problem. The bigger issue is ownership of data: if a business is growing, it will eventually need the data for more than just traffic analysis. The risk of using GA in this case is that migrating to owned data will be almost impossible without paying the price of GA 360 (and even then it’s not exactly clear whether the full history can be recovered). This is why a business should decide early whether giving up data ownership is worth the free price.
All paid vendors with free plan tier
All other vendors in the paid category covered above (except Adobe) have a free plan which, unlike GA’s, comes with lower limits. For example, Mixpanel allows 5M data points per month before charging and Amplitude 10M, while Heap calculates its free tier based on sessions, allowing 5k per month. Each vendor also imposes other restrictions on its analytics offering. Compared to GA, though, getting the full historical data when transitioning from free to paid can be possible, provided there is no limit on data retention; Heap, for example, retains only 3 months of history on the free tier. Therefore, a decision should be made early on whether clickstream data is, or will become, valuable for the business.
Pros:
- No cost
- Easy to get started
- Suitable for low data volumes
Cons:
- All the same cons as for paid vendors
- Inflexible data collection
- Sampled data
- No data ownership
- Can be impossible to recover historical data
Open source – Self hosted / Managed
In the last 5 years, the importance of data has grown a lot and with it, a lot of new open source projects were either created or made available by bigger companies. Most of the new products are centered around data processing, storage, and management, but there are 2 major ones, tailored for clickstream data collection as well:
- Track – capture an event and send it to the collector
- Collect – receive event and save it in a raw event store
- Enrich – process, validate, enrich the event with extra data and send for storing to a data warehouse
- Store – save valid events to a cloud data warehouse
Each step is decoupled from the others, allowing the platform to update, scale and replace processing steps independently. This is important when dealing with large data volumes that can take a fair amount of time to process, since a large data pipeline would be difficult to manage as a single piece. This architecture also allows streams to be processed without interruption: as long as the collector is available, we can safely restart the enrich and store modules without worrying about losing data. This is possible thanks to intermediary stores for raw data.
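The value of the intermediary raw store can be shown with a toy Python model of the collect, enrich and store steps. This is not Snowplow’s actual API: the `MiniPipeline` class, its in-memory queue and the required field names are assumptions, with the queue standing in for a durable stream.

```python
import json
from collections import deque


class MiniPipeline:
    """Toy model of decoupled collect -> enrich -> store steps."""

    def __init__(self):
        self.raw_store = deque()  # intermediary store for raw payloads
        self.warehouse = []       # valid, enriched events
        self.bad_rows = []        # payloads that failed validation

    def collect(self, payload: str) -> None:
        # The collector only persists the raw payload; it does no processing.
        self.raw_store.append(payload)

    @staticmethod
    def enrich(payload: str) -> dict:
        event = json.loads(payload)  # raises ValueError on malformed JSON
        if not isinstance(event, dict) or "event" not in event or "visitor_id" not in event:
            raise ValueError("missing required fields")
        # Example enrichment: derive a device type from the user agent.
        user_agent = event.get("user_agent", "")
        event["device_type"] = "mobile" if "Mobi" in user_agent else "desktop"
        return event

    def run_enrich_and_store(self) -> int:
        """Drain the raw store; can be stopped and restarted at any time."""
        processed = 0
        while self.raw_store:
            payload = self.raw_store.popleft()
            try:
                self.warehouse.append(self.enrich(payload))
                processed += 1
            except ValueError:
                self.bad_rows.append(payload)
        return processed


pipeline = MiniPipeline()
pipeline.collect('{"event": "product_view", "visitor_id": "v1"}')
pipeline.collect("not json")
pipeline.run_enrich_and_store()
```

Events collected while the enrich step is down simply wait in the raw store and are picked up on the next run, which is the property described above.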
When it comes to tracking events, Snowplow collects a large set of default fields: page referrer, geolocation, user agent, device type and campaign params. We can also use a pixel tracker to receive email open events.
The benefit of open source is that your business is not locked in to a single data storage or a single way of collecting events. At the moment Snowplow supports 3 data warehouses: Amazon Redshift, Google BigQuery and Snowflake. Even if none of these options is suitable, you are free to store events in a Kafka stream and push them to any other storage based on business requirements.
Of course, to be able to configure, deploy and operate a data pipeline, one must understand the underlying pieces and how they work together. This requires data engineering skills, which for some companies might be challenging to attain on their own, especially if a company wants to start using data from day one. Therefore, Snowplow provides a managed service where data pipeline is run on your cloud account but managed by Snowplow engineers.
The biggest benefit of running a Snowplow pipeline is data ownership. All the data that’s collected is stored in your own cloud data warehouse. Another thing is data availability which, when running stream collector, is near real-time down to data warehouse level. At the moment, this is not achievable even with expensive paid vendor options, which have at least 30 min delay.
The major benefit of using Matomo is data ownership. There is no limit on historical data retention when running Matomo on-premise. As an alternative, a managed service is provided where data is collected on Matomo servers. In this case, retention can be limited or custom depending on data volumes.
One drawback of hosting Matomo on-premise is the storage. Currently it is based on a MySQL database, which can require specialized maintenance and deployment because it is designed as a vertically scalable data store. The other limitation is the schema itself. By design, there is no option to store custom events without manually changing the database schema.
From a pricing perspective, Matomo’s managed service is very competitive: for around a million pageviews it costs a fraction of GA 360. However, there is no information about whether raw SQL access is available in the Enterprise plan, or about query performance as data size increases.
Pros:
- Full data ownership
- Very flexible data collection options, able to extend/add new features
- Able to integrate with other datasets once data is in a unified data warehouse
- No vendor lock-in
Cons:
- Can incur high maintenance costs
- Requires in-house data engineering expertise or a paid managed service
- Only data collection, without a rich reporting suite
Difference compared to other solutions
At StackTome we combine the best of both worlds of managed and paid services. We differ from Google Analytics, for example, in the following ways:
- All data that’s collected is complete and not sampled.
- Data is not locked in, but exportable to any cloud data warehouse of your choosing
- Independence of UI functionality allows you to collect data for a variety of purposes. Some examples already predefined – impressions (email, product), A/B experiments, form edits/link clicks
In comparison to just open source products, we provide:
- Fully managed solution – no extra setup or infrastructure costs
- Predefined schemas to cover basic tracking needs (e.g. eCommerce)
- Raw data cleansing and identity stitching, to make data useful out of the box for a variety of applications – customer segmentation, sales funnel analysis, conversion attribution and any other data usage scenario relating to visitor behavior
How does it work
Under the hood, we are combining open source technologies like Snowplow, Kafka, HDFS, and Kubernetes to achieve flexible clickstream collection in a fully managed fashion. This way we enable your business to focus on actual applications instead of data plumbing complexities.
Ownership of all data
Having universal access to clean clickstream data is useful for different stakeholders:
- Marketers to see the performance of their email or advertisement campaigns
- Data analysts for analyzing sales funnel performance, traffic source impact to conversion, cohort analysis
- Data scientists building any model that optimizes visitor experience and lifts conversion rates
Also, available data can be matched to any other dataset available in your cloud data warehouse. For example, sales, SEO, paid advertisement, giving you the ability to see your business from many angles down to individual visitor level.
More than just data
Besides just data access StackTome also utilizes the collected clickstream data to provide solutions on top of it. Once you start collecting events, you can use other products like:
- Customer data platform – customer segmentation and marketing via email, social and search
- Product recommendations – personalizing customer experience on your website
- User-generated content – leveraging relevant customer feedback on semantically relevant pages
Cost of collecting clickstream data
Now, having gone through the most popular options for clickstream data collection, let’s compare their feature sets and pricing. To compare pricing more accurately, we assume that 10M events per month need to be collected, with the ability to export data and history retention of no less than 12 months.
|Features / Vendors||Mixpanel||Kissmetrics||Amplitude||Heap||GA 360||Adobe MC||GA||Snowplow||Matomo||StackTome|
|Default field collection||partial||X||X||X||X||X||X||X||yes||X|
|Dashboards / Reporting||X||X||X||X||X||X||X||–||X||–|
As you can see above, cost varies a lot based on volume, features and data availability. It’s best to look at business needs first and then review the available options to meet them. This can be done by asking questions:
- What’s the expected event volume per month?
- Is it expected to grow?
- Do we need to store/access full history on a regular basis?
- Is export to a data warehouse, and with it ownership of the data, necessary?
- What kind of reporting tools are required?
- Does the business need only reports, or other solutions as well?
- How fresh should the data be to meet all business reporting and optimization/solution needs?
This is not a comprehensive questionnaire. Each business should define its own based on its priorities and strategy.
If you managed to get through the article, congratulations! Hopefully, you now have a better understanding of what clickstream data is, how it can be collected and utilized, and how much it costs for a business. To summarize, a business should consider owning clickstream data if it can answer the question of why doing so would be beneficial to the business.
Of course, data is not a magic wand that will answer all questions, but without it competing with companies that use data to their advantage in today’s online market will be more challenging than ever. If you want to know more about how StackTome can help you with your data needs, don’t hesitate to contact us.