Using Public Data Sources for Accurate Segmentation

Public data from governmental agencies often come at a low cost, and so-called Open Data are even free. Their integration into segmentation projects may look cumbersome for a variety of technical reasons. Nevertheless, the added value they bring in many situations makes them a highly valuable complement to most supervised segmentation tasks.

When preparing for a segmentation project, no matter whether the targets are consumers, HCPs or companies, the usual first step consists of listing all data sources that are deemed relevant to the exercise. Typically, one starts with the data already available in house: past segmentation results, information from sales reps, CRM systems, etc. The next natural step consists of looking at offers from private data providers. Depending on country-specific regulations, such data will provide information at different levels of aggregation, albeit rarely at the consumer/HCP/company level. Still, these data are often considered the holy grail for segmentation, as they often relate directly to the objective of the segmentation. A common mistake, however, consists of ignoring the aggregation level of such data. If it is not taken into account, all targets within a given aggregation group will end up with the same value in the segmentation. The issue does not affect the entire population, as other data sources help remove the ambiguity in several cases, but it is not uncommon for many potential targets to remain undifferentiated for lack of data. This is where public data become interesting.
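As a minimal illustration of this aggregation pitfall, the sketch below (with made-up identifiers and a hypothetical sales "brick" grouping) shows how every target within the same aggregation group inherits exactly the same value after a join:

```python
# Minimal sketch (hypothetical column names) of the aggregation pitfall:
# the vendor measure is only available per sales "brick", so every HCP in
# the same brick inherits exactly the same value after the join.
import pandas as pd

hcps = pd.DataFrame({
    "hcp_id": [101, 102, 103, 104],
    "brick":  ["B1", "B1", "B1", "B2"],
})

brick_sales = pd.DataFrame({
    "brick":     ["B1", "B2"],
    "rx_volume": [5400, 1200],   # aggregated measure from a private vendor
})

merged = hcps.merge(brick_sales, on="brick", how="left")
print(merged)
# The three HCPs in brick B1 are indistinguishable on rx_volume alone,
# which is why complementary, finer-grained sources are needed.
```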

A Clear Advantage: Their Low Cost

In many countries, governments publish a wealth of data for free (Open Data) or at a very limited cost. They can be found either on dedicated government portals (see for instance France, Germany, USA) or on the websites of the respective statistical agencies (for example Australia, Canada). Many countries actually offer data on both types of sites. You should always make sure that the license allows use for commercial purposes, but this is usually not an issue.

On top of their limited cost, most of these data also present the advantage of having been produced by experts in data collection, so the usual questions you should ask whenever you purchase data, such as completeness, coverage and representativeness, to name a few, have been addressed in the best possible way. If any weakness remains, it is likely to be disclosed.

An Important Asset: Relevant Public Sources Exist in Many Cases

A visit to sites like the ones mentioned above might prove slightly overwhelming at first. Still, by following a few simple guidelines, one can relatively quickly isolate some interesting datasets.

A common requirement for data to be useful in the context of segmentation is geographical granularity. Obviously, if the interesting data are only available at the state or regional level, they will not be of much help. On the other hand, you cannot expect to get data at the individual level, for obvious confidentiality reasons. As a rule of thumb, any data available at least at the postcode level usually meets the granularity requirements.

The other aspect to look at is obviously the topic covered by the data, which will of course depend on the object of your segmentation. The good news is that most governmental sites offering data are usually easy to navigate in this regard. In the case of HCP segmentation, topics of interest will usually cover a variety of socio-demographic measurements (the census being the obvious source here) and any health-related information. This type of information is among the most commonly available in many countries.

The Bottleneck: Integrating These Data

Even though the data themselves come at a very limited cost, integrating them is definitely not as straightforward as integrating data from private vendors. They are typically not at the consumer level, and they will often come at different levels of aggregation when several complementary data sources have been identified. Still, as these data are geography-based, including them does not raise major issues in most cases.
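As a rough sketch, and assuming hypothetical postcode and municipality lookup tables, aligning open datasets published at different geographical levels typically boils down to a chain of joins onto the target list:

```python
# Minimal sketch, with made-up column names, of aligning two open datasets
# published at different geographical levels (postcode and municipality)
# before attaching them to the target list.
import pandas as pd

targets = pd.DataFrame({
    "practice_id": [1, 2, 3],
    "postcode":    ["75001", "75002", "69001"],
})

# Open dataset 1: census counts per postcode
census = pd.DataFrame({
    "postcode":   ["75001", "75002", "69001"],
    "population": [16000, 20000, 29000],
})

# Open dataset 2: health indicator per municipality, plus a postcode lookup
municipal_health = pd.DataFrame({
    "municipality": ["Paris", "Lyon"],
    "gp_density":   [1.4, 1.2],   # GPs per 1,000 inhabitants
})
postcode_to_municipality = pd.DataFrame({
    "postcode":     ["75001", "75002", "69001"],
    "municipality": ["Paris", "Paris", "Lyon"],
})

features = (
    targets
    .merge(census, on="postcode", how="left")
    .merge(postcode_to_municipality, on="postcode", how="left")
    .merge(municipal_health, on="municipality", how="left")
)
print(features)
```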

A second, more complex issue arises from the number of characteristics that can be obtained from open data websites. One can easily identify hundreds if not thousands of potentially interesting features, and it is not uncommon for the number of extracted characteristics to be much larger than the number of all originally selected measures (in-house data, private vendors, …). Nonetheless, you should not refrain from selecting a large number of measures, as statistical tools known as data reduction techniques can be applied to any number of features. At the expense of a minimal loss of information, a smaller set of characteristics can be extracted and easily included in any machine learning algorithm.
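A minimal sketch of such a reduction, using principal component analysis as one common data reduction technique (the random feature matrix and the choice of 20 components are purely illustrative):

```python
# Minimal sketch: reduce hundreds of open-data features to a handful of
# components with PCA. The feature matrix is simulated for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 800))      # 5,000 postcodes x 800 open-data features

X_scaled = StandardScaler().fit_transform(X)   # put features on a common scale
pca = PCA(n_components=20)
components = pca.fit_transform(X_scaled)

print(components.shape)                        # (5000, 20)
print(pca.explained_variance_ratio_.sum())     # share of information retained
```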

Last but not least, the way these data will be used in the segmentation needs to be taken into account during the preparation phase, so that one ends up with features that represent potentially meaningful drivers in the segmentation process. To illustrate this, think of how to segment medical practices by incorporating socio-demographic information. In order to evaluate the impact of this information on the segmentation, it would not make sense to only consider the exact geographical unit where a given practice is located: the practice will likely also attract people from neighbouring geographical units, and this needs to be accounted for. A catchment model does exactly that. So, in practice, on top of the dimension reduction discussed above, the resulting components will be transformed through the catchment model.
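The sketch below illustrates one simple form a catchment transformation can take, namely a distance-weighted average of the surrounding units' components; the exponential decay and the coordinates are illustrative assumptions rather than a specific published model:

```python
# Minimal sketch of a simple catchment transformation: each practice's
# feature vector is a distance-weighted average over surrounding
# geographical units.
import numpy as np

def catchment_features(practice_xy, unit_xy, unit_features, decay_km=5.0):
    """Weight each unit by exp(-distance / decay_km) and average its features."""
    d = np.linalg.norm(unit_xy - practice_xy, axis=1)   # distances in km
    w = np.exp(-d / decay_km)
    w /= w.sum()
    return w @ unit_features                            # weighted average

unit_xy = np.array([[0.0, 0.0], [3.0, 0.0], [10.0, 0.0]])           # unit centroids (km)
unit_features = np.array([[0.2, 55.0], [0.5, 40.0], [0.9, 30.0]])   # e.g. reduced components
practice_xy = np.array([1.0, 0.0])                                   # practice location

print(catchment_features(practice_xy, unit_xy, unit_features))
```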

The type of transformation required may vary according to the final objective, but the goal remains in all cases to end up with data that are available at the most disaggregated level while providing a workable set of meaningful measures to add to the process.

All in All: A Huge Potential for Increasing Segmentation Accuracy

Of course, you should not expect any of the extracted measures, taken separately, to become the main driver of your analysis. However, based on our experience, taken as a whole such a set of measures can easily be as important as, if not more important than, any of the other data sources you are using.

As they are geography-based, if the individuals you are segmenting are grouped in location-based clusters, like doctors in a practice, these measures will not remove ambiguity within clusters, but they will already contribute a lot to distinguishing across clusters.

Therefore, it would not be advisable to run a segmentation on these data alone, as using at least a minimal set of data available at the lowest granularity level is strongly recommended. But given their incredibly low cost, and despite the time investment needed to prepare them, they should in our opinion be part of the standard set of data sources for all your segmentation and targeting projects.