A guide to data warehousing clickstream data

Why clickstream is so important to your online business

Clickstream data lets you see what actions customers take on your website. As commerce shifts more and more online, this data is becoming essential for a business to stay competitive. Before defining what kind of data this is, let's look at the main reasons why a business needs to own it in the first place.

No data science without data

The first reason to collect and own clickstream data is to be able to take advantage of data science. As the name implies, data comes before any science can be done, and without it even the most sophisticated models won't work. This is why you would want to pursue strategic data acquisition, which will make your business more defensible in the long run.

Understanding customer - key advantage

Clickstream is often associated with web analytics because it lets you analyze your customers' behavior. For example, you can find out how many customers drop off between the landing page and completing a purchase. The advantage of owning such data is that you can filter by any trackable metric down to the individual visitor level, without the limitations of the reporting dashboards provided by web analytics tools.

You are also free to combine reports with any other data source at your disposal. For example, you can stitch together orders, paid advertisement reports, geo data and other sources, which increases the utility of your data assets. Of course, this is possible only when you have full access to the collected dataset and it's available in one unified location.

Going beyond charts and dashboards

Tracking KPIs with charts and dashboards is helpful for monitoring business health and detecting problems in real time. This is mostly useful for high-level business decisions, though. To truly bring a business to the next level, the data can be used to optimize activity down to the level of each customer. One of the most popular examples is personalizing the customer experience.

Personalization can be done at different customer touch points. For example, when a customer is visiting your website, we know from the data what he has bought before or what pages he has visited. Combining a single customer's data with that of other customers, we can recommend relevant products or content tailored specifically to the customer who is browsing your website. The same approach can be extended to email, advertisement campaigns or even a physical store. This way the customer experience can stay consistent across all touch points. For any business, this can serve as a key differentiator.

A good case study showing how owned data can drive a business is Zara. Using data as its backbone, the company manages the inventory and displays of each of its 2000 stores on a daily basis. This would be impossible without full access to the collected dataset.

What is clickstream data

To understand how we can use a clickstream dataset, we first need to define what kind of data it contains and how it is collected. We can define clickstream as a sequence of events that represent visitor actions on the website.

The most common and useful event is the ‘click’, which indicates what the visitor has viewed. Of course, we are not limited to collecting just clicks; we can also collect impressions, purchases and any other events relevant to the business.

Furthermore, an event can include multiple contexts that enrich it, such as how long the page took to load or what type of browser/device the visitor is using. Essentially, good clickstream data clearly defines a full set of events that allows inferring a complete picture of customer behavior. Conceptually, we can think of events as having their own grammar.

Traditionally, such events are collected using a JavaScript tracker which is loaded with the page on every request. The tracker sends a JSON POST request to a collector server, which stores the event, validates it, enriches it with additional data and finally sends it to a data warehouse for further analysis. It can be visualized as below:

Transformation data pipeline — Event transformation real time data pipeline -
Image credit: https://github.com/snowplow/snowplow

Later in the article we’ll take a look at different options for tracking events.

Example data output

The best way to gain a deeper understanding of clickstream data is to have a look at particular examples. Below we provide a sample event for page view:

APP joes_bikes 
EVENT Pageview 
TIME Thu, 25 Apr 2019 08:33:03 GMT 
COLLECTOR collector.stacktome.com 
METHOD POST

Beacon

Event Type	string Pageview
Application ID	string joes_bikes
Event ID	string e8468c4a-5d95-42aa-81e1-c72d27a5018a
Device Created Timestamp	string 2019-04-25T08:33:03.200Z
Device Sent Timestamp	string 2019-04-25T08:33:03.204Z
Platform	string web
Tracker Name	string cf2
Tracker Version	string js-2.8.2

Context

iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0
iglu:com.stacktome/page/jsonschema/1-0-2

id	number 242894
language	string en
country	string uk
productSection	string lamps
deliveryTiming	string next-day
searchTag	string front lamps
type	string HomePage
canonicalUrl	string https://www.joesbikes.com/lamps
domainName	string https://www.joesbikes.com

Browser

Browser User Agent	string Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36
Browser Language	string en-US

The table above shows a sample of the data sent from a fictional online store, joesbikes.com, based on a real tracking event. The most essential field is the event timestamp, which allows analyzing events as a time series.

Another important part is the custom page context, which describes the details of the viewed page. A notable field is searchTag: it shows what the user is searching for, so we can check whether that matches the page he has viewed. Combining such events into a sequence allows us to see whether the path a user takes to purchase is optimal or whether there are ways to improve it and, at the same time, improve the conversion rate.

Lastly, we can see that we also get browser information. This can be useful to understand what type of devices your visitors are using and especially whether there are problems with rendering certain pages. For instance, we can analyze whether our mobile visitors convert at the same rate as desktop visitors. Given how important the mobile experience is today, it's critical for a business to have this visibility.

Now let's have a look at a different event sample, a product impression.

Context

iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0

iglu:com.stacktome/product_impression/jsonschema/1-0-2

id	number 104463
name	string Joes Leather Gloves
type	string Gloves
regularPrice	number 29.99
currentPrice	number 19.99
currency	string usd
nReviews	number 11732
avReviews	number 4.3
row	number 2
column	number 1
containerName	string bestsellers

Here we can see the main attributes of a product shown on the page. The captured impression event should help us determine which product was displayed, at which location on the page and with which variable attributes. From the above event, we can see that the gloves were displayed in the 2nd row and 1st column of a container called bestsellers. We can also see the price and review score shown for the product. This information alone is enough to determine how well displayed products perform relative to their exposure across the whole website. We can also determine how well they “compete” with each other given the same or different variables (price, location, etc.).

As you can see from the examples above, the information being tracked is fairly trivial from a single event perspective. The power comes from having access to these events across all the pages that visitors are interacting with, over a period of time. Then you can measure which pages might need improvement or whether the overall website can perform much better. We’ll take a look at a few use cases in the next section.

Clickstream analysis

When it comes to data analysis, clickstream can be one of the hardest and most attractive datasets to use for a variety of purposes. The variety comes from the ability to track all kinds of events that are not strictly limited to a single domain. On the other hand, it can be difficult to reconcile with other, more accurate datasets like orders, because the information is never 100% complete. The inaccuracies mostly depend on how well the JavaScript tracker is working and on the ability to filter out website crawlers, remove duplicate events and identify unique visitors. Let's say we manage to get over those hurdles and see what analysis we can do with the data.
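Two of the hurdles mentioned above, crawler traffic and duplicate events, can be illustrated with a minimal sketch. The field names (event_id, user_agent) and the list of crawler keywords are assumptions made for the example, not a complete bot detection method:

```python
import re

# Hypothetical keywords that often appear in crawler user agents (an assumption,
# real bot filtering usually relies on maintained lists).
BOT_PATTERN = re.compile(r"bot|crawler|spider|headless", re.IGNORECASE)

def clean_events(events):
    """Drop likely crawler traffic and duplicate events, keeping first occurrences."""
    seen_ids = set()
    cleaned = []
    for event in events:
        if BOT_PATTERN.search(event.get("user_agent", "")):
            continue  # likely a crawler
        if event["event_id"] in seen_ids:
            continue  # the same event was delivered more than once
        seen_ids.add(event["event_id"])
        cleaned.append(event)
    return cleaned

events = [
    {"event_id": "a1", "user_agent": "Mozilla/5.0 Chrome/73.0"},
    {"event_id": "a1", "user_agent": "Mozilla/5.0 Chrome/73.0"},  # duplicate
    {"event_id": "b2", "user_agent": "Googlebot/2.1"},            # crawler
    {"event_id": "c3", "user_agent": "Mozilla/5.0 Safari/537.36"},
]
cleaned = clean_events(events)
```

Deduplicating on a tracker-generated event ID only works if the tracker assigns one per event, as in the sample beacon shown earlier.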

Traffic analysis

The easiest way to utilize clickstream data is to see where a website is getting its traffic from. This sounds trivial given how many online tools serve this purpose, but getting true numbers down to the individual visitor level requires owning the clickstream data. We can analyze not just which source brings us the most traffic, but also:

find which keywords are most popular,
compare conversion rates of visitors from different traffic sources,
do a cohort analysis,
even determine which marketing campaign brought the most traffic. This is possible due to automatic parsing of UTM query parameters, which are made available in the unified data warehouse. It allows us to track any kind of campaign, from paid advertisement to email.
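As an illustration of the UTM parsing mentioned above, here is a small sketch using only the Python standard library. The URL and the extract_utm helper are made up for the example; a real pipeline would typically do this during enrichment:

```python
from urllib.parse import urlparse, parse_qs

# The five conventional UTM parameters.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")

def extract_utm(url):
    """Return the UTM parameters present in a page URL as a flat dict."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each UTM key.
    return {key: query[key][0] for key in UTM_KEYS if key in query}

url = ("https://www.joesbikes.com/lamps"
       "?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale")
campaign = extract_utm(url)
```

Once these values are stored as columns next to each event, traffic and conversions can be grouped by source, medium or campaign.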

Besides the above, we can extend the tracking to measure email campaign performance in terms of open and click rates. This is especially useful for making your analytics independent of any ESP (email service provider), which makes it easier to migrate from one email provider to another without losing performance data.

Sales funnel analysis

A sales funnel is often used to determine how well a website converts visitors into sales. In this case, we define the stages of the customer journey from landing on your website (or app) to paying for a product. Each stage usually has a drop-off percentage, which can occur for many reasons. Clickstream data can expose these problems. For example, if visitors on one product page have a much larger CTR than on another, we can investigate the reason and try to improve the weaker page, for instance by updating its content. Later, in the section on tracking experiments, we'll see how we can test such improvements.

Beyond single-stage problems, a sales funnel can serve as a health metric to quickly determine if conversion at a certain stage starts dropping. Such problems could mean that parts of our system have stopped working and require quick action. For an online business, where every lost hour can cost thousands of dollars, having this visibility is critical.
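A funnel report of this kind can be sketched in a few lines. The stage names below are assumptions for the example (only the checkout and order confirmation events appear elsewhere in this article), and a visitor counts for a stage only if he also reached all previous stages:

```python
# Hypothetical funnel stages, ordered from landing to purchase.
FUNNEL = ["landing_view", "product_view", "checkout_form_view", "order_confirmation_view"]

def funnel_report(events):
    """Return (stage, visitors, drop_off) per stage for (visitor_id, event) pairs."""
    visitors_by_stage = {stage: set() for stage in FUNNEL}
    for visitor_id, event in events:
        if event in visitors_by_stage:
            visitors_by_stage[event].add(visitor_id)

    report = []
    reached = None          # visitors who reached every stage so far
    previous_count = None
    for stage in FUNNEL:
        stage_visitors = visitors_by_stage[stage]
        reached = stage_visitors if reached is None else reached & stage_visitors
        count = len(reached)
        # Share of visitors lost compared to the previous stage.
        drop_off = None if not previous_count else 1 - count / previous_count
        report.append((stage, count, drop_off))
        previous_count = count
    return report

events = [
    ("v1", "landing_view"), ("v1", "product_view"),
    ("v1", "checkout_form_view"), ("v1", "order_confirmation_view"),
    ("v2", "landing_view"), ("v2", "product_view"), ("v2", "checkout_form_view"),
    ("v3", "landing_view"), ("v3", "product_view"),
    ("v4", "landing_view"),
]
report = funnel_report(events)
```

Running the same report per hour or per day is one way to turn the funnel into the health metric described above.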

Browse/Cart abandonment and recovery

Whenever a shopper puts a product in a cart, there is a high likelihood that the cart will be abandoned. As you can see in the chart below, up to 80% of online customers abandon their shopping carts.

This is quite significant for any online business, especially if some of those abandoned carts can be recovered. To act on this event, we need to have a way to track when a customer has added some items to cart and if after a certain period of time there was no order made. With clickstream data we can capture these events as follows:

SELECT email, mobile, first_name, last_name
FROM customer_clickstream
WHERE visitor_id IN (SELECT visitor_id FROM cart_abandoners)

One doesn't need to be an SQL expert to understand what's happening above. We're just fetching all customers who are in the cart abandoner segment. We still need to define what a cart abandoner is in a clickstream dataset, though. We can do that by relying on visitor events:

WITH cart_abandoners AS (
  -- visitors who viewed the checkout form in the last 2 hours...
  SELECT DISTINCT visitor_id
  FROM customer_clickstream
  WHERE event = 'checkout_form_view'
    AND ts_event > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
    -- ...but did not reach the order confirmation page in the same window
    AND visitor_id NOT IN (
      SELECT visitor_id
      FROM customer_clickstream
      WHERE event = 'order_confirmation_view'
        AND ts_event > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
    )
)

The script above is a little more involved, but it clearly shows how clickstream data can be used to segment visitors by different actions. We simply find all customers that visited the checkout page but haven't viewed the order confirmation page, which is shown after a purchase has completed. With this segment in hand, we can easily use it for email or SMS campaigns that try to recover a portion of the abandoners. Such marketing campaigns are out of scope for this article, but the most obvious action would be sending these customers a discount voucher code for your products or recommending other products that they might like.

A nice thing about the above approach is that it can easily be adapted to browse abandonment, meaning a customer is browsing product pages but not buying anything. We just need to swap the ‘checkout_form_view’ event with ‘product_view’ and exclude carters/buyers. To make it work, the clickstream data has to be updated fairly frequently, so that marketing automation has a better chance of recovering customers before they forget about the purchase.
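The browse abandonment variant can be sketched outside of SQL as well. The snippet below is an illustration only: the event names follow the cart abandonment query, while the tuple layout and the browse_abandoners helper are assumptions made for the example:

```python
from datetime import datetime, timedelta

def browse_abandoners(events, now, window=timedelta(hours=2)):
    """Visitors who viewed a product within the window but neither carted nor bought.

    events: iterable of (visitor_id, event_name, timestamp) tuples.
    """
    cutoff = now - window
    recent = [(visitor, name) for visitor, name, ts in events if ts > cutoff]
    browsers = {visitor for visitor, name in recent if name == "product_view"}
    carters_or_buyers = {
        visitor for visitor, name in recent
        if name in ("checkout_form_view", "order_confirmation_view")
    }
    return browsers - carters_or_buyers

now = datetime(2019, 4, 25, 10, 0)
events = [
    ("v1", "product_view", now - timedelta(minutes=30)),
    ("v2", "product_view", now - timedelta(minutes=50)),
    ("v2", "checkout_form_view", now - timedelta(minutes=40)),
    ("v3", "product_view", now - timedelta(hours=5)),  # outside the 2 hour window
]
abandoners = browse_abandoners(events, now)
```

Passing the current time in as a parameter keeps the segment reproducible, which is handy when re-running it for a past point in time.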

Cart/browse abandoners are just one subset of customer segmentation. When we have access to the full clickstream dataset, we can create segments by any number of parameters, like recency, average purchase amount, geolocation or specific products that the customer has viewed or bought in the past. With data, the only limit to segmentation is the marketer's imagination.

Personalization

Once simple analysis is in place, it is possible to utilize clickstream data for more difficult tasks, like improving customer experience. A classical use case is product recommendations.

When a store sells a lot of products, finding the right product can be difficult. The biggest online retailers, like Amazon, try to find similarities between products and customers and use them for recommendations. This helps in two ways: it allows for easier product discovery, and it tailors the customer's shopping experience to his interests.

Implementing a simple recommendation model using clickstream data is not difficult. We find customers who have viewed certain products and what they bought afterwards. Then we compare all purchases of a certain product to purchases of related products. Once we have those values computed for each product, we can rank them and show the top results on the website when a visitor lands on a product page. In this case, the advantage of owning the data is that we can use any attributes related to the product that might be relevant for recommendations.
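A minimal sketch of this idea is a "viewed this, bought that" count. It is a simplification chosen for illustration, not a production model: the event tuple layout and product names are assumptions, and product attributes are ignored:

```python
from collections import Counter, defaultdict

def build_recommendations(events, top_n=3):
    """Rank, for each viewed product, the other products bought after viewing it.

    events: (visitor_id, timestamp, event, product_id) tuples,
    where event is either "view" or "purchase".
    """
    viewed = defaultdict(list)                # visitor -> products viewed so far
    bought_after_view = defaultdict(Counter)  # viewed product -> purchase counts
    for visitor, _ts, event, product in sorted(events, key=lambda e: e[1]):
        if event == "view":
            viewed[visitor].append(product)
        elif event == "purchase":
            for seen in set(viewed[visitor]):
                if seen != product:
                    bought_after_view[seen][product] += 1
    return {
        product: [other for other, _ in counts.most_common(top_n)]
        for product, counts in bought_after_view.items()
    }

events = [
    ("v1", 1, "view", "gloves"), ("v1", 2, "view", "helmet"), ("v1", 3, "purchase", "helmet"),
    ("v2", 1, "view", "gloves"), ("v2", 2, "purchase", "helmet"),
    ("v3", 1, "view", "gloves"), ("v3", 2, "purchase", "lamp"),
]
recommendations = build_recommendations(events)
```

Here visitors who viewed gloves bought a helmet twice and a lamp once, so the helmet ranks first on the gloves page.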

The same recommendations can be extended to email or other marketing campaigns without any additional changes to the model's logic or data.

Tracking Experiments (A/B testing)

Another useful type of optimization analysis is running and tracking A/B experiments. An experiment can help you decide whether a particular change has any effect on business-relevant KPIs. For example, say we decide to change the design of a certain page to improve its conversion rate. The simplest approach would be to update the design and see if, after some time, there is any improvement in the conversion rate on that page. However, the conversion rate might change over time for other reasons, and comparing it with different historical periods can lead to inaccurate conclusions. The best approach is to run two different designs simultaneously for different visitors and track the outcome of each. Then, if the conversion rate improves for one design versus the other, we can be more confident that it is really better.

Experiment metric analysis for A/B tests — A/B test variation metrics by device type

Tracking experiments is not too different from tracking any other events. We just need to record which variation a visitor is viewing. The harder part is making sure that a visitor sees only one variation across multiple visits; otherwise the results might be skewed. To do this, we split our visitors by their user agent and IP address and consistently serve each either one variation or the other.
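One way to keep the split consistent is to hash the visitor's user agent and IP address into a bucket. The sketch below assumes a hypothetical experiment name and two variations; the hashing scheme is an illustrative choice, not a description of any specific tool:

```python
import hashlib

def assign_variation(user_agent, ip_address, experiment, variations=("A", "B")):
    """Deterministically map a visitor to one variation of an experiment."""
    # Including the experiment name gives independent splits per experiment.
    key = f"{experiment}|{user_agent}|{ip_address}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variations)
    return variations[bucket]

first_visit = assign_variation("Mozilla/5.0 (X11; Linux x86_64)", "203.0.113.7", "new_checkout_design")
second_visit = assign_variation("Mozilla/5.0 (X11; Linux x86_64)", "203.0.113.7", "new_checkout_design")
```

Because the assignment depends only on its inputs, the same visitor gets the same variation on every visit without any server-side state. The trade-off is that a visitor whose IP address or browser changes may land in a different bucket.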

The advantage of tracking experiments together with other events is that it makes it easy to compare effects on visitor behaviour across all situations. As an example, we can find out if a new design works as well for mobile visitors as for desktop and how their conversion rate or click-through rate differs. Of course, there is no limit to what kind of experiments can be run, tracked and analyzed.
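To judge whether a difference in conversion rate between two variations is more than noise, a common choice is a two-proportion z-test. The article does not prescribe a particular statistical test, so treat this as one possible sketch; the visitor and conversion numbers are made up:

```python
import math

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variations convert equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical experiment: 10.0% vs 13.0% conversion with 1000 visitors each.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

A small p-value (0.05 is a conventional threshold) suggests the difference is unlikely to be due to chance alone. The same function can be applied per device type to compare mobile and desktop results.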

Identity Stitching

Another clickstream use case, which is becoming more relevant in the mobile internet era, is stitching customer activity into a single profile. For example, a customer may open a marketing email on mobile and browse some products, but when it comes to purchasing he might switch to a desktop. In this case we would want to know if this is the same customer or a different one.

If we track everything with one pipeline, we can find this customer by matching his IP address, assuming that his mobile phone most likely shares the same Wi-Fi connection as his desktop. We can also use other “marks” like a cookie ID. When a customer opens an email, we track this with a hash of his email address. If the same customer comes back to the website, we can find the same hash even when he is using a different device.

The idea of identity stitching is to match customers to as many available identifiers as possible in order to have an accurate profile. A business can then tailor the customer experience to that profile at all touch points.
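Stitching can be sketched as grouping identifiers that were ever observed together. The example below uses a union-find structure, which is an implementation choice made here for illustration; the identifier formats (cookie:..., email:...) are hypothetical:

```python
def stitch_identities(observations):
    """Group identifiers into profiles.

    observations: iterable of tuples of identifiers seen together in one event,
    e.g. a cookie ID and an email hash.
    """
    parent = {}

    def find(identifier):
        parent.setdefault(identifier, identifier)
        while parent[identifier] != identifier:
            parent[identifier] = parent[parent[identifier]]  # path halving
            identifier = parent[identifier]
        return identifier

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for identifiers in observations:
        ids = list(identifiers)
        for identifier in ids:
            union(ids[0], identifier)  # also registers single identifiers

    profiles = {}
    for identifier in parent:
        profiles.setdefault(find(identifier), set()).add(identifier)
    return list(profiles.values())

observations = [
    ("cookie:mobile-1", "email:ab12"),   # email opened on mobile
    ("cookie:desktop-7", "email:ab12"),  # same email hash seen on desktop
    ("cookie:other-9",),                 # unrelated anonymous visitor
]
profiles = stitch_identities(observations)
```

Strong identifiers such as an email hash are safer to merge on than an IP address, since a shared office or public Wi-Fi network could otherwise merge unrelated visitors into one profile.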

Customer stitch identity table — Example fields of customer stitched identity data

How to collect clickstream data

The data collection process is fairly similar for almost all tools and can be visualized as in the following example:

Example of clickstream data collection system

A customer visits a web page with his mobile device.
The website returns a page to render together with a tracker JavaScript tag.
The tracker code tracks the events that the customer performs on his device and sends them to a collector server.
Those events are then saved, validated and enriched.
Finally, each event is sent to a unified data warehouse.
Once events are available, they can be used by different stakeholders, like a business analyst, data scientist or executive.
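The collector side of these steps can be sketched as a small in-memory pipeline. Everything here is an assumption made for illustration: the required fields, the enrichment rule and the use of plain lists in place of a raw event store and a data warehouse:

```python
from datetime import datetime, timezone

# Hypothetical minimal set of fields every event must carry.
REQUIRED_FIELDS = ("event_id", "event_type", "app_id", "device_created_ts")

def validate(event):
    """An event is valid when all required fields are present."""
    return all(field in event for field in REQUIRED_FIELDS)

def enrich(event, collector_ts):
    """Add server-side data without mutating the raw event."""
    enriched = dict(event)
    enriched["collector_ts"] = collector_ts.isoformat()
    user_agent = event.get("user_agent", "")
    enriched["device_class"] = "mobile" if "Mobile" in user_agent else "desktop"
    return enriched

def process(raw_events, collector_ts):
    """Validate and enrich events; return (warehouse rows, rejected events)."""
    warehouse, rejected = [], []
    for event in raw_events:
        if validate(event):
            warehouse.append(enrich(event, collector_ts))
        else:
            rejected.append(event)  # kept aside for inspection, not silently dropped
    return warehouse, rejected

raw_events = [
    {"event_id": "e8468c4a", "event_type": "Pageview", "app_id": "joes_bikes",
     "device_created_ts": "2019-04-25T08:33:03.200Z",
     "user_agent": "Mozilla/5.0 (Linux; Android 9) Mobile Safari/537.36"},
    {"event_type": "Pageview"},  # incomplete event
]
warehouse, rejected = process(raw_events, datetime(2019, 4, 25, 8, 33, 4, tzinfo=timezone.utc))
```

Keeping rejected events instead of discarding them makes tracking bugs visible and lets you reprocess the events once the problem is fixed.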

The above steps are used by most companies collecting clickstream events; the differences are in the storage, data availability and variety of events that can be collected. To help you choose which tools are better suited to your business needs, let's describe the major categories available.

Vendors - Paid

The biggest category of clickstream data collection providers is, of course, the paid ones. Let's list several of the major ones:

Mixpanel

A user behavior and retention tracking platform. It provides charting and dashboarding tools to analyze user activity on websites or apps. Out of the box, it also provides sales funnel, customer segmentation and cohort analysis. All analytics are centered around the user profile, which can be extended with custom events. Tracking is done via a JavaScript tag. It has predefined fields that are collected with all events. One can also track arbitrary fields with each event using so-called “super properties”.

One drawback of the custom tracker is that it has no structure for event schemas, which makes the tracking code error-prone to maintain as the variety of events grows.

Configuration of segments for Mixpanel — Mixpanel - setting up segmenting by events

Looking at the documentation, there is also no support for email tracking via pixel, though this might be intentional given that Mixpanel supports its own messaging. This can be a plus or a minus depending on whether a business already uses an existing ESP platform.

Mixpanel does support exporting data to a cloud data warehouse, but with some limitations: the expected data latency is around 24 hours, and there is no control over what gets exported or over the historical period. The schema is fairly limited, mostly to custom event fields which are sent via the tracker (no page referrer, user agent, IP address, location, campaign fields, etc.).

Kissmetrics

A customer analytics tool that allows tracking, reporting on and messaging customers. It has somewhat similar features to Mixpanel, but not all charting/segmenting capabilities are the same. It supports additional default event fields like campaigns, location, page referrers and user agent. Similarly, Kissmetrics doesn't have a way to track emails with a pixel, due to the availability of its own emailing feature.

Their API supports SQL queries, which allows you to run arbitrary reports. However, raw event data is exportable only as JSON files. This means that if raw data is needed, you would need to implement a manual data pipeline to process those files and keep them in sync with a specific data warehouse.

Amplitude

A tool designed for product analytics. Compared to Mixpanel and Kissmetrics, Amplitude's features are oriented towards understanding how a product is performing. Event collection is fairly similar, though: a JavaScript tracker tracks default and custom events alike. Events can be analyzed through the UI for sales funnel, cohort or segment analysis.

The biggest advantage of Amplitude is the ability to sync all events directly with a data warehouse with moderate delay - 30 minutes for Snowflake and 3 hours for Redshift. Also, you can map charts and dashboards directly to SQL queries, which allows unlimited customization in terms of reporting.

Amplitude sql example — Amplitude SQL editor for custom reports

Heap

A user analytics tool that allows tracking everything without explicit tracking code. At first glance, it has all the usual reports you can find in Mixpanel or Amplitude, but the main differentiator of Heap is the ability to match events without having to define them beforehand. This allows non-developers to choose what constitutes an event by matching HTML elements on their website, which eliminates the need for maintaining tracking code. There is one risk in relying on this approach: if your design changes, it might break the matching logic. The same can be said about JavaScript events, though, which tend to rely on some kind of data embedded into the website or app anyway.

From the data perspective, all expected data points like page referrer, location and campaign params are included by default. Heap also provides a way to export data to all major cloud data warehouses with a complete schema, though there is a 24 hour delay until new data becomes available.

Google 360

There is hardly a marketer who has never used Google Analytics, also referred to as GA. Far fewer businesses, though, are familiar with its premium offering, GA 360, which is tailored to enterprise customers. At its core, this is the same product, and it doesn't differ in how data is collected: it uses a JavaScript tracker. The data collected by default is the same as well.

The major difference is the ability to access raw data, which is only supported in GA 360. Data gets exported to BigQuery, a cloud data warehouse managed by Google Cloud Platform. Another advantage compared to the free version is no data sampling. According to this report, the free version of GA caps collection at 500k events per month, while GA 360 can collect up to 2 billion events per month. Also, GA 360 has extra integrations with Google Ads, making it easier for companies to optimize their ad spend.

Even though data is exported automatically, there is still a possible delay of up to 4 hours before it is accessible. This makes it harder to use for more real-time actionable analytics, e.g. browse abandonment. The other big hurdle is the price tag: GA 360 costs $150k per year, making it available only to bigger companies.

GA 360 example schema structure — Some columns available in GA 360 tracking

Adobe Marketing Cloud

The last paid vendor we need to take a look at is Adobe. When GA 360 is compared to alternatives, Adobe is referenced as the main competitor. According to its offering overview, it provides a flexible reporting dashboard for creating arbitrary reports tailored to business needs. By default, the available schema tracks expected columns like location, referrer and browser details.

When it comes to exporting data, the only option available is raw files delivered to either FTP or S3. It does support pushing data hourly, but it would require setting up and maintaining a manual data pipeline to get access to the data on a continuous basis. Given that the product is targeted at enterprise customers, this is a large limitation compared to GA 360 or any of the other solutions mentioned above.

Given the above vendors we can sum up their pros/cons as a whole:

Pros:

Minimal or zero development cost
Usually comes with a feature-rich UI
Easy to get started even with little data analytics background
Integrates with many other SaaS providers in other categories

Cons:

The cost grows quickly with data volumes
Limits on collected event customization
Full history data access is either not available or requires costly manual integrations or has a substantial delay
No way to extend data collection features
Migrating to a different vendor is limited or impossible
PI and/or PII personal data collection/storage is prohibited or limited

Some of the cons can be reduced when integrating with additional products like Segment, but that comes with an additional cost both in vendor fees and maintenance overhead.

Vendors - Free

Most of the vendors provide a free tier option for their service. It can be a reasonable approach if a business collects only a limited number of events per month or has a small number of sessions/customer profiles being tracked. The following options are available:

Google Analytics

GA is the web analytics provider most widely used by large and small companies. By some statistics, more than 50% of all websites ranked in the top 1 million use GA. It provides essential dashboards for traffic analysis, segmenting customers and attributing traffic sources. The biggest limitations are not being able to access the raw data and the sampling of events once a threshold is reached. As we mentioned above, GA 360 addresses these limits.

At first glance, for most businesses the limits are quite generous, as not many websites reach over a million visitors a month, which is when the sampling rate can become a problem. The bigger issue, though, is ownership of data: if a business is growing, eventually the data will be required for more than just traffic analysis. The risk of using GA in this case is that migrating to owned data will be almost impossible without paying the price of GA 360 (and even then it's not exactly clear if the full history can be recovered). This is why a business should decide early whether giving up data ownership is worth the free price.

All paid vendors with free plan tier

All other vendors in the paid category covered above (except Adobe) have a free plan, with lower limits than GA's. For example, Mixpanel allows 5M data points per month before charging and Amplitude 10M, while Heap calculates its free tier based on sessions, allowing 5k per month. Each vendor also imposes other restrictions on its analytics offerings. Compared to GA, though, getting full historical data when transitioning from free to paid can be possible if there is no limit on data retention. Heap, for example, only retains 3 months of history on the free tier. Therefore, a decision should be made early on whether clickstream data is, or will be, valuable for the business.

Pros:

No cost
Easy to get started
Suitable for low data volumes

Cons:

All same cons as for paid vendors
Inflexible data collection
Sampled data
No data ownership
Can be impossible to recover historical data

Open source - Self hosted / Managed

In the last 5 years, the importance of data has grown a lot, and with it a lot of new open source projects were either created or made available by bigger companies. Most of the new products are centered around data processing, storage and management, but there are two major ones tailored for clickstream data collection as well:

Snowplow

This is an event data collection platform designed for scale. To track clickstream events, we can choose from a JavaScript tracker in the browser, SDK trackers on the server side or mobile trackers for iOS and Android. Snowplow has a predefined workflow of how events should be defined, processed and stored. We can break it down into the following steps:

Track - capture an event and send it to the collector
Collect - receive event and save it in a raw event store
Enrich - process, validate, enrich the event with extra data and send for storing to a data warehouse
Store - save valid events to a cloud data warehouse

Each step is decoupled from the others, allowing the platform to update, scale and replace processing steps independently. This is important when dealing with large data volumes that can take a fair amount of time to process, since a large data pipeline is difficult to manage as a single piece. This architecture also allows processing streams without interruptions: as long as the collector is available, we can safely restart the enrich and store modules without having to worry about losing data. This is possible thanks to intermediary stores for raw data.

Snowplow open source project structure at Github

When it comes to tracking events, Snowplow collects a large set of default fields, such as page referrer, geolocation, user agent, device type and campaign params. We can also use a pixel tracker for receiving email open events.

To extend standard events, Snowplow uses custom contexts, which can be tracked together with pageviews or any other predefined event. One major difference from all other trackers used by both paid and open source vendors is the ability to define a custom event schema. This not only makes it explicit what we can collect but also validates events against the schema. As JavaScript tracking code is prone to mistakes like typos, it’s beneficial to catch those mistakes early on. Snowplow provides a tool called schema repo, which does exactly that - stores our schemas and validates whether events match them.
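To show why schema validation catches tracking mistakes early, here is a hand-rolled sketch. It is a stand-in for illustration only and does not use Snowplow's actual schema format or tooling; the schema below is an assumed subset of the page context fields from the earlier sample:

```python
# Hypothetical schema: field name -> accepted Python type(s).
PAGE_SCHEMA = {
    "id": (int, float),
    "language": str,
    "country": str,
    "productSection": str,
    "searchTag": str,
    "type": str,
}

def validate_context(context, schema):
    """Return a list of problems found in a context; an empty list means valid."""
    errors = []
    for field, expected in schema.items():
        if field not in context:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(context[field], bool) or not isinstance(context[field], expected):
            errors.append(f"wrong type for field: {field}")
    for field in context:
        if field not in schema:
            errors.append(f"unknown field: {field}")
    return errors

good = {"id": 242894, "language": "en", "country": "uk",
        "productSection": "lamps", "searchTag": "front lamps", "type": "HomePage"}
typo = dict(good)
typo["serchTag"] = typo.pop("searchTag")  # a typo in the tracking code

good_errors = validate_context(good, PAGE_SCHEMA)
typo_errors = validate_context(typo, PAGE_SCHEMA)
```

Without validation, the misspelled field would quietly end up as a new, mostly empty attribute in the warehouse; with it, the event is flagged at collection time.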

The benefit of open source is that your business is not locked in to a single data store or a single way of collecting events. Snowplow at the moment supports 3 data warehouses - Amazon Redshift, Google BigQuery, and Snowflake. Even if none of these options is suitable, you are free to store events in a Kafka stream and push them to any other storage based on business requirements.

Of course, to be able to configure, deploy and operate a data pipeline, one must understand the underlying pieces and how they work together. This requires data engineering skills, which for some companies might be challenging to attain on their own, especially if a company wants to start using data from day one. Therefore, Snowplow provides a managed service where the data pipeline runs in your cloud account but is managed by Snowplow engineers.

The biggest benefit of running a Snowplow pipeline is data ownership. All the data that's collected is stored in your own cloud data warehouse. Another benefit is data availability, which, when running the stream collector, is near real-time down to the data warehouse level. At the moment, this is not achievable even with expensive paid vendor options, which have a delay of at least 30 minutes.

Matomo (Piwik)

Matomo can be considered the father of open source web analytics. Formerly named Piwik, it was founded in 2007 and has since grown to over a million sites using it. From a tracking perspective, it uses a JavaScript tracker which is able to send custom or default events. By default, referrer, location and browser details are tracked. The overall schema is tailored to e-commerce type events.

A/B testing example: A/B test settings in the Matomo UI

The major benefit of using Matomo is data ownership. There is no limit on historical data retention when running Matomo on-premise. As an alternative, a managed service is offered in which data is collected on Matomo servers. In this case, retention can be limited or custom, depending on data volumes.

One drawback of hosting Matomo on-premise is the storage. It is currently based on a MySQL database, which is designed as a vertically scalable data store and can therefore require specialized deployment and maintenance. The other limitation is the schema itself. By design, there is no option to store custom events without manually changing the database schema.

From a pricing perspective, the Matomo managed service is very competitive: at around a million pageviews it costs a fraction of GA 360. However, there is no information on whether raw SQL access is available in the Enterprise plan, or on query performance as data size grows.

Pros:

Full data ownership
Very flexible data collection options, able to extend/add new features
Able to integrate with other datasets once data is in a unified data warehouse
No vendor lock-in

Cons:

Can incur high maintenance costs
Requires in-house data engineering expertise or a paid managed service
Data collection only, without a rich reporting suite

StackTome approach

Difference compared to other solutions

At StackTome we combine the best of both worlds: open source and paid managed services. We differ from Google Analytics, for example, in the following ways:

All collected data is complete and not sampled
Data is not locked in, but exportable to any cloud data warehouse of your choosing
Data collection is independent of UI functionality, so you can collect data for a variety of purposes. Some examples are already predefined: impressions (email, product), A/B experiments, form edits and link clicks

In comparison to just open source products, we provide:

Fully managed solution - no extra setup or infrastructure costs
Predefined schemas to cover basic tracking needs (e.g. eCommerce)
Raw data cleansing and identity stitching, to make data useful out of the box for a variety of applications - customer segmentation, sales funnel analysis, conversion attribution and any other data usage scenario relating to visitor behavior
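Identity stitching, mentioned in the list above, means recognizing that several identifiers (for example browser cookies and a login ID) belong to the same visitor. A minimal sketch of one common way to do this is a union-find structure over identifiers that appear together in an event. This is an illustration under assumptions, not a description of StackTome's actual implementation; the event shape `(cookie_id, user_id)` and all identifier values are hypothetical.

```python
def stitch_identities(events):
    """Group identifiers that belong to the same visitor using union-find.

    Each event is a (cookie_id, user_id) pair; user_id is None for anonymous
    events. Identifiers seen together in one event are treated as one visitor.
    Cookie and user IDs are assumed to never collide with each other.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cookie_id, user_id in events:
        find(cookie_id)  # register anonymous cookies as well
        if user_id is not None:
            union(cookie_id, user_id)

    # Collect identifiers by their root to form visitor groups.
    groups = {}
    for identifier in list(parent):
        groups.setdefault(find(identifier), set()).add(identifier)
    return sorted(sorted(group) for group in groups.values())


events = [
    ("cookie-1", None),       # anonymous browsing on a laptop
    ("cookie-1", "user-42"),  # same laptop after login
    ("cookie-2", "user-42"),  # same user on a phone
    ("cookie-3", None),       # a different anonymous visitor
]
visitors = stitch_identities(events)
print(visitors)
```

Here the two cookies that were both seen with `user-42` end up in one group, while `cookie-3` stays a separate visitor. Once identifiers are grouped, earlier anonymous events can be attributed to the same visitor for funnel or attribution analysis.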

How it works

Data collection works the same way as in any other web analytics tool. We provide a JavaScript tracker script that can be included either directly in the target website or in your tag manager bundle. Tracked events are then collected, enriched, cleansed and stored in our underlying data store, and become available through continuous export to a target cloud data warehouse of your choice (e.g. BigQuery).

Under the hood, we are combining open source technologies like Snowplow, Kafka, Spark, HDFS, and Kubernetes to achieve flexible clickstream collection in a fully managed fashion. This way we enable your business to focus on actual applications instead of data plumbing complexities.
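To make the enrich and cleanse steps more concrete, here is a small Python sketch. It only illustrates the kind of processing such a pipeline performs and is not StackTome's or Snowplow's code; the event fields (`visitor_id`, `user_agent`, `page_url`), the example URLs and the bot markers are all assumptions, and real bot detection is considerably more involved.

```python
from urllib.parse import urlparse, parse_qs

BOT_MARKERS = ("bot", "crawler", "spider")  # assumed, deliberately simplistic


def enrich(event):
    """Add fields derived from the raw page URL (path and UTM campaign parameters)."""
    parsed = urlparse(event["page_url"])
    query = parse_qs(parsed.query)
    enriched = dict(event)
    enriched["page_path"] = parsed.path or "/"
    for key in ("utm_source", "utm_medium", "utm_campaign"):
        # parse_qs returns a list per key; keep the first value or None.
        enriched[key] = query.get(key, [None])[0]
    return enriched


def is_clean(event):
    """Reject events without a visitor id and events sent by obvious bots."""
    if not event.get("visitor_id"):
        return False
    user_agent = event.get("user_agent", "").lower()
    return not any(marker in user_agent for marker in BOT_MARKERS)


raw_events = [
    {"visitor_id": "v1", "user_agent": "Mozilla/5.0",
     "page_url": "https://shop.example.com/cart?utm_source=newsletter&utm_medium=email"},
    {"visitor_id": "v2", "user_agent": "ExampleBot/1.0",
     "page_url": "https://shop.example.com/"},
    {"visitor_id": "", "user_agent": "Mozilla/5.0",
     "page_url": "https://shop.example.com/product/1"},
]

processed = [enrich(e) for e in raw_events if is_clean(e)]
print(processed)
```

Only the first event survives cleansing, and it gains `page_path` and the UTM fields that later make campaign reporting a simple query.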

Ownership of all data

Having universal access to clean clickstream data is useful for different stakeholders:

Marketers to see the performance of their email or advertisement campaigns
Data analysts for analyzing sales funnel performance, the impact of traffic sources on conversion, and cohorts
Data scientists building any model that optimizes visitor experience and lifts conversion rates

Also, the available data can be matched with any other dataset in your cloud data warehouse, for example sales, SEO or paid advertising data, giving you the ability to see your business from many angles down to the individual visitor level.
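As an illustration of matching clickstream data with another dataset, the sketch below joins visits with orders to get conversion and revenue per traffic source. It uses an in-memory SQLite database purely as a stand-in for a cloud data warehouse; the table names, columns and values are hypothetical, and the `visits` table is assumed to hold one row per visitor with their first traffic source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse
conn.executescript("""
    -- One row per visitor with their first traffic source (assumed shape).
    CREATE TABLE visits (visitor_id TEXT, utm_source TEXT);
    CREATE TABLE orders (visitor_id TEXT, amount REAL);

    INSERT INTO visits VALUES
        ('v1', 'newsletter'), ('v2', 'google'), ('v3', 'google'), ('v4', 'newsletter');
    INSERT INTO orders VALUES
        ('v1', 60.0), ('v3', 25.0), ('v3', 15.0);
""")

# LEFT JOIN keeps visitors without orders so they still count as visitors.
rows = conn.execute("""
    SELECT v.utm_source,
           COUNT(DISTINCT v.visitor_id) AS visitors,
           COUNT(DISTINCT o.visitor_id) AS buyers,
           COALESCE(SUM(o.amount), 0)   AS revenue
    FROM visits AS v
    LEFT JOIN orders AS o ON o.visitor_id = v.visitor_id
    GROUP BY v.utm_source
    ORDER BY v.utm_source
""").fetchall()

for utm_source, visitors, buyers, revenue in rows:
    print(utm_source, visitors, buyers, revenue)
```

The same query shape works in BigQuery, Redshift or Snowflake once both datasets live in the same warehouse, which is the practical payoff of owning the raw data.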

More than just data

Besides data access, StackTome also uses the collected clickstream data to provide solutions on top of it. Once you start collecting events, you can use other products such as:

Customer data platform - customer segmentation and marketing via email, social and search
Product recommendations - personalizing customer experience on your website
User-generated content - leveraging relevant customer feedback on semantically relevant pages

Cost of collecting clickstream data

Now that we have gone through the most popular options for clickstream data collection, let’s compare their feature sets and pricing. To compare pricing more accurately, we assume that 10M events per month need to be collected, with the ability to export data and at least 12 months of history retention.

Features / Vendors	Mixpanel	Kissmetrics	Amplitude	Heap	GA 360	Adobe MC	GA	Snowplow	Matomo	StackTome
Default field collection	partial	X	X	X	X	X	X	X	X	X
Dashboards / Reporting	X	X	X	X	X	X	X	-	X	-
Data warehouse export	X	-	X	X	X	-	-	X	-	X
Data retention	12 months	?	∞	X	∞	?	-	∞	24 months	∞
Data delay	24 h	-	30 min	24 h	4 h	?	-	1 min	1 min	1 min
Pricing	$800/month	?	?	$500+/month	$12500/month	?	free	$5000/month	$1400/month	$500/month

As you can see above, cost varies a lot with volume, features and data availability. It’s best to look at business needs first and then review the available options that meet them. This can be done by asking questions such as:

What’s the expected event volume per month?
Is it expected to grow?
Do we need to store and access the full history on a regular basis?
Is export to a data warehouse, and with it ownership of the data, necessary?
What kind of reporting tools are required?
Does the business need only reports, or other solutions as well?
How fresh should the data be to fit all business reporting and optimization needs?

This is not a comprehensive questionnaire. Each business should define its own based on priorities and strategy.
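To put the pricing row of the comparison into perspective, the short Python sketch below converts the monthly prices into a cost per million events and a yearly cost at the assumed volume of 10M events per month. The prices are the ones listed in the comparison; vendors with unknown, free or open-ended ("$500+") pricing are left out, and actual quotes will vary with volume and contract terms.

```python
EVENTS_PER_MONTH = 10_000_000  # volume assumed in the comparison

# Monthly prices in USD as listed in the comparison table.
MONTHLY_PRICE_USD = {
    "Mixpanel": 800,
    "GA 360": 12500,
    "Snowplow": 5000,
    "Matomo": 1400,
    "StackTome": 500,
}


def cost_summary(monthly_price, events_per_month):
    """Return cost per million events and cost per year for one vendor."""
    millions = events_per_month / 1_000_000
    return {
        "per_million_events": monthly_price / millions,
        "per_year": monthly_price * 12,
    }


summary = {
    vendor: cost_summary(price, EVENTS_PER_MONTH)
    for vendor, price in MONTHLY_PRICE_USD.items()
}

# Print vendors from cheapest to most expensive per year.
for vendor, costs in sorted(summary.items(), key=lambda item: item[1]["per_year"]):
    print(f"{vendor:<10} ${costs['per_million_events']:>8.2f} per 1M events  "
          f"${costs['per_year']:>7} per year")
```

Changing `EVENTS_PER_MONTH` shows how the unit cost moves when the price stays flat, but keep in mind that most vendors price in volume tiers, so the monthly figure itself is likely to change as volume grows.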

Conclusion

If you managed to get through the article - congratulations! Hopefully, you now have a better understanding of what clickstream data is, how it can be collected and utilized, and how much it costs a business. To summarize, a business should consider owning clickstream data once it can answer the question of why doing so would be beneficial to the business.

Of course, data is not a magic wand that will answer all questions, but without it, competing with companies that use data to their advantage in today's online market will be more challenging than ever. If you want to know more about how StackTome can help you with your data needs, don’t hesitate to contact us.