Creating a Tag Cloud with Solr and PHP

by Jason on May 26, 2011

Tag clouds are a great way to summarize the most prominent keywords, tags, or words used in a system. Visually, it’s hard to find a more compact way to represent this data.

So, how do we create a Tag Cloud using Solr?

The first step is to create a field of the type “text”.

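The default Solr schema configures the text type with a whitespace tokenizer, so when we facet on the field later, Solr will return individual words rather than whole paragraphs. Keep in mind that any filters configured on the type (lowercasing or stemming, for example) may also change how the returned words look.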

Let’s assume that our text field is named “product_description” and it contains paragraphs of text.

Now, we use a Facet Query to gather the most prominent keywords

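Solr facet queries can be compared to a relational database’s GROUP BY aggregate query: for each distinct term in the field, Solr returns the number of matching documents that contain it.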
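
To make the GROUP BY comparison concrete, here is a rough PHP sketch of the kind of counting a facet on a tokenized field performs. The sample descriptions and the count_terms() helper are made up for illustration; Solr does this work inside the index, not in PHP.

```php
<?php
// Hypothetical sample data standing in for indexed product_description values.
$descriptions = [
    'red cotton shirt',
    'blue cotton shirt',
    'red wool scarf',
];

/**
 * Count how many documents contain each whitespace-separated word.
 * This only approximates what faceting on a tokenized field reports.
 */
function count_terms(array $documents): array
{
    $counts = [];
    foreach ($documents as $text) {
        // Split on whitespace; array_unique makes a word count once per document.
        $words = array_unique(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));
        foreach ($words as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // highest count first
    return $counts;
}

$counts = count_terms($descriptions);
```

Each key of $counts is a word and each value is the number of descriptions containing it, which is the same shape of data the tag cloud needs.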

Our query parameters:

  • q: *:*
  • facet: true
  • facet.field: product_description
  • wt: json

Full example query:
http://my-solr-server:8983/solr/select?q=*:*&facet=true&facet.field=product_description&wt=json

This example query will select all records (*:*), facet on the product_description field, and return the data set in JSON format.
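
If you would rather build the URL in PHP than type it by hand, something like the following sketch should work. The build_facet_url() helper is hypothetical, and the extra parameters (facet.limit, facet.mincount, rows) are my own additions to keep the response small; they are not required for the example above.

```php
<?php
/**
 * Build a Solr facet query URL.
 * The function name and default limit are illustrative assumptions.
 */
function build_facet_url(string $base_url, string $field, int $limit = 50): string
{
    $params = [
        'q'              => '*:*',   // match all records
        'facet'          => 'true',  // a string, because http_build_query turns boolean true into 1
        'facet.field'    => $field,
        'facet.limit'    => $limit,  // cap the number of keywords returned
        'facet.mincount' => 1,       // skip terms that match no documents
        'rows'           => 0,       // only the facet counts are needed, not the documents
        'wt'             => 'json',
    ];

    // http_build_query URL-encodes the values, so *:* becomes %2A%3A%2A.
    return $base_url . '?' . http_build_query($params, '', '&');
}

$url = build_facet_url('http://my-solr-server:8983/solr/select', 'product_description');
```

The resulting $url can be handed to cURL or any other HTTP client.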

Normalize the Facet Data

The first thing we need to do when creating a tag cloud is decide the maximum font size and minimum font size.

I personally prefer my maximum font to be 41px and minimum to be 14px, so we’ll go with that.

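The hit counts in our results can be extremely high and/or extremely low, so we’ll need to normalize them to fall between 14 and 41. We’ll do this with a simple ratio: the maximum font size divided by the largest hit count, so the most prominent keyword renders at 41px and nothing is allowed to drop below 14px.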

Note: For simplicity, I’m going to assume you know how to use cURL, so I will leave out the code that actually executes the query and assume the raw JSON response lives inside $data.


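/* interpret raw JSON data returned from Solr as associative arrays */
$data = json_decode($data, true);

/* define minimum and maximum font size */
$max_font = 41.0;
$min_font = 14.0;

//extract facet information
//by default Solr's JSON writer returns a flat list of alternating
//keywords and hit counts: ["keyword", 12, "other", 7, ...]
$facets = $data['facet_counts']['facet_fields']['product_description'];

//pair each keyword with its hit count
$tags = array();
for ($i = 0; $i + 1 < count($facets); $i += 2) {
    $tags[$facets[$i]] = $facets[$i + 1];
}

//solr returns the results sorted by the facet count in descending order
//this means the most prominent keyword is first
//extract the first hit count to determine the weight ratio
//(max() guards against dividing by zero when there are no hits)
$keyword_weight_ratio = $max_font / max(1.0, (float) reset($tags));

//loop through returned result and normalize keyword hit counts
//anything that would fall below the minimum font size is raised to it
foreach ($tags as $keyword => $weight) {
    $tags[$keyword] = max($min_font, round($weight * $keyword_weight_ratio));
}

//return the normalized array (keyword => font size in px)
return $tags;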
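
A simple ratio pins the top keyword to the maximum font size, but it tends to bunch low-count keywords together near the bottom. One alternative, which is not what the code above does, is linear min-max scaling: the smallest count maps to the minimum font size, the largest to the maximum, and everything else falls in between. The scale_font_sizes() function and the sample counts below are hypothetical.

```php
<?php
/**
 * Linearly map hit counts onto the range [$min_font, $max_font].
 * A sketch of min-max scaling, offered as an alternative to the simple ratio.
 */
function scale_font_sizes(array $counts, float $min_font = 14.0, float $max_font = 41.0): array
{
    if ($counts === []) {
        return [];
    }

    $lowest  = min($counts);
    $highest = max($counts);
    $spread  = $highest - $lowest;

    $sizes = [];
    foreach ($counts as $keyword => $count) {
        // Position of this count between the lowest (0.0) and highest (1.0);
        // if every count is identical, treat them all as the highest.
        $position = $spread > 0 ? ($count - $lowest) / $spread : 1.0;
        $sizes[$keyword] = (int) round($min_font + $position * ($max_font - $min_font));
    }
    return $sizes;
}

// Made-up hit counts for illustration.
$sizes = scale_font_sizes(['solr' => 120, 'php' => 60, 'cloud' => 10]);
```

With these sample counts, "solr" gets 41px, "cloud" gets 14px, and "php" lands at 26px.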

Create the Tag Cloud

Now that we’ve queried Solr and Normalized the Facet Data, we are ready to create our tag cloud. The approach I’m taking to display the Tag Cloud is the same approach used by Last.fm.


<div class="tag_cloud" style="width: 500px; min-height: 400px; line-height: 1.5;">
<?php foreach ($tags as $keyword => $font_size): ?>
<span style="font-size: <?= (int) $font_size ?>px">
<a href="#"><?= htmlspecialchars((string) $keyword) ?></a>
</span>
<?php endforeach; ?>
</div>

With a little styling, the end result is a cloud of keywords in which the most prominent ones appear largest.
