At Spool, we calculate our key metrics in real time. Traditionally, metrics are performed by a batch job (running hourly, daily, etc.). Redis backed bitmaps allow us to perform such calculations in realtime and are extremely space efficient. In a simulation of 128 million users, a typical metric such as “daily unique users” takes less than 50 ms on a MacBook Pro and only takes 16 MB of memory. Spool doesn’t have 128 million users yet but it’s nice to know our approach will scale. We thought we’d share how we do it, in case other startups find our approach useful.
Crash Course on Bitmap and Redis Bitmaps
Bitmap (aka Bitset)
A Bitmap or bitset is an array of zeros and ones. A bit in a bitset can be set to either 0 or 1, and each position in the array is referred to as an offset. Operations such as logical AND, OR, XOR, etc. and other bitwise operations are fair game for Bitmaps.
Population Count
The population count of a Bitmap is the number of bits set to 1. There are efficient algorithms for calculating population count. For instance, the population count of a 90% filled bitset containing 1 billion bits took 21.1 ms on a MacBook Pro. There is even a hardware instruction in SSE4 for the population count of an integer.
Bitmaps in Redis
Redis allows binary keys and binary values. Bitmaps are nothing but binary values. The setbit(key, offset, value) operation, which takes O(1) time, sets the value of a bit to 0 or 1 at the specified offset for a given key.
A simple example: Daily Active Users
To count unique users that logged in today, we set up a bitmap where each user is identified by an offset value. When a user visits a page or performs an action, which warrants it to be counted, set the bit to 1 at the offset representing user id. The key for the bitmap is a function of the name of the action user performed and the timestamp.
In this simple example, every time a user logs in we perform a redis.setbit(daily_active_users, user_id, 1). This flips the appropriate offset in the daily_active_users bitmap to 1. This is an O(1) operation. Doing a population count on this results in 9 unique users that logged in today. The key is daily_active_users and the value is 1011110100100101.
Of course, since the daily active users will change every day we need a way to create a new bitmap every day. We do this by simply appending the date to the bitmap key. For example, if we want to calculate the daily unique users who have played at least 1 song in a music app for a given day, we can set the key name to be play:yyyy-mm-dd. If we want to calculate the number of unique users playing a song each hour, we can name the key name will be play:yyyy-mm-dd-hh. For the rest of the discussion, we will stick with daily unique users that played a song. To collect daily metrics, we will simple set the user’s bit to 1 in the play:yyyy-mm-dd key whenever a user plays a song. This is an O(1) operation.
redis.setbit(play:yyyy-mm-dd, user_id, 1)
The unique users that played a song today is the population count of the bitmap stored as the value for the play:yyyy-mm-dd key.To calculate weekly or monthly metrics, we can simply compute the union of all the daily Bitmaps over the week or the month, and then calculate the population count of the resulting bitmap.
You can also extract more complex metrics very easily. For example, the premium account holders who played a song in November would be:
(play:2011-11-01 ∪ play:2011-11-02 ∪...∪play:2011-11-30) ∩ premium:2011-11
Performance comparison using 128 million users
The table below shows a comparison of daily unique action calculations calculated over 1 day, 7 days and 30 days for 128 million users. The 7 and 30 metrics are calculated by combining daily bitmaps.
| Period | Time (ms) |
|---|---|
| Daily | 50.2 |
| Weekly | 392.0 |
| Monthly | 1624.8 |
Optimizations
In the above example, we can optimize the weekly and monthly computations by caching the calculated daily, weekly, monthly counts in Redis.
This is a very flexible approach. An added bonus of caching is that it allows fast cohort analysis, such as weekly unique users who are also mobile users — the intersection of a mobile users bitmap with a weekly active users bitmap. Or, if we want to compute rolling unique users over the last n days, having cached daily unique counts makes this easy — simply grab the previous n-1 days from your cache and union it with the real time daily count, which only takes 50ms.
Sample Code
A Java code snippet below computes unique users for a given user action and date.
import redis.clients.jedis.Jedis;
import java.util.BitSet;
...
Jedis redis = new Jedis("localhost");
...
public int uniqueCount(String action, String date) {
String key = action + ":" + date;
BitSet users = BitSet.valueOf(redis.get(key.getBytes()));
return users.cardinality();
}
The code snippet below computes the unique users for a given given user action and a list of dates.
import redis.clients.jedis.Jedis;
import java.util.BitSet;
...
Jedis redis = new Jedis("localhost");
...
public int uniqueCount(String action, String... dates) {
BitSet all = new BitSet();
for (String date : dates) {
String key = action + ":" + date;
BitSet users = BitSet.valueOf(redis.get(key.getBytes()));
all.or(users);
}
return all.cardinality();
}



This is possibly the best tech article I’ve seen in months. Thanks for sharing!
What do you use this for?
We use this for all of our metrics. Daily/weekly/monthly actives, recordings going through our system, how many users from a particular type of mobile device are using Spool, how this changes over time, how this varies over different cohorts.
In the build vs. buy analysis, why did you build? And what analytics packages did you consider ‘buying’?
I’m already itching to apply this somewhere. Thanks.
Anna,
Handy Gist on githubs to try it out: https://gist.github.com/1384211
Thank you
This is great stuff
I’m not sure what ONE_BITS actually does. Could you elaborate?
I like the simplicity of this technique. It’s quite similar in some ways to how I index and look up the availability of 135 million domain names at Domize.
I guess the caveat would be that because the only values allowed are 0 or 1 it is only useful for things you want to count once, or not at all (i.e. uniques). Have you thought about how you could use a similar method to collate multiple visits/actions within a time window? e.g. the number of favourites a user creates in a given day.
Anson,
There are couple of ways you can do it in redis.
WATCH favorite:yyy-mm-dd
count = GETRANGE favorite:yyy-mm-dd 2*user_id
MULTI
SETRANGE favorite:yyy-mm-dd 2*user_id (count +1).get_bytes
EXEC
If you don’t care about atomicity, you can drop WATCH, MULTI, EXEC. It’s similar to the approach used by counting bloom filters.
Great reply, Chandra. So, essentially you extend the 1-bit to have a width (2 in your case).
This is absolutely fantastic. We have about 700K users, and are finding our existing usage tracking systems are just not up to it. We’re absolutely going to try this.
What is the max. length of bit you can hold (i.e. max number of users?)
Redis allows max offset of 2^32 -1. So, 4 billion users. At that scale, you would use shading and other tricks in a realistic scenario.
if you’re interfacing with redis via ruby this might useful once you need to start sharding: https://github.com/taf2/redi
How do you handle a set of rather sparse ids? Like Facebook ids or something which are quite large now (billions?) but you’re only likely to see a rather small set of those…
m
I guess the obvious solution is to map sparse ids to dense ones using some sort of auto increment field in your db of choice…
Matt, that’s a good solution. That’s what we do, and redis INCR command is very useful in this regard for doing auto increments on user registration.
Cool post. We’ve been using redis extensively (including bitmaps) and something like this makes a lot of sense for us to adopt/adapt.
Pingback: async I/O News » Fast, easy, realtime metrics using Redis bitmaps
Hi. Maybe I misunderstood something.
128 millions of bits is nearly 15 megabytes. If last reigistered user listen some some tracks each day, he fills database with 15 megabytes values in database. Each of them represents only one listening.
In another hand, I heard that Redis likes when data fits memory.
Help me to deal with this thoughts. And thank you for this cool bitwise article
Alex,
If you have 128 million users and only last few listens to song everyday, using Redis sets would work better. Bitmaps approach works well in a scenario like this: http://www.avc.com/a_vc/2011/07/301010.html
Our schema design employs uuids as the primary id. Besides user primary id, we have an analytics id for each user. Analytics ids are used in bitmaps and also allow future optimizations relatively easy.
Great! Thanks for sharing such a great tech article. It’s nice to carry some tricks like that in our utility belt
Ok,
possible dumb question here, but here goes…
In your example you are showing a 16 bit word with the MSB = user_id 15 and the LSB = user_id 0, so with a 16 bit word you can store a binary value for 16 users?
How does that scale to 128 million users, surely you don’t use a 128 million bit wordlength?
I am still on first cup of coffee here so go easy on me!
These are binary strings, aka byte arrays and binary blob. For 128 million users, you need a binary string 16MB wide. You can think of this as the word length of 128 million bits. It only takes few milliseconds to count 1 bits of this word.
Awesome post – a very creative solution. Is it possible to apply bitwise operators on redis keys inside redis, or do you have to GET the values and do it in code?
Pingback: What the Facebook FTC Settlement Means for Social Media « V E X E D
Pingback: Cheatsheet: 2011 11.16 ~ 11.30 - gOODiDEA.NET
Great stuff. Thanks for sharing. It’s very helpful.
Few thoughts:
1) Requires JDK 7 since it uses BitSet.valueOf()
2) Awesome for counting uniques, but I hit issue when trying to get User IDs back with following code:
jedis.setbit(key, 44428L, true);
jedis.setbit(key, 8L, true);
jedis.setbit(key, 7L, true);
jedis.setbit(key, 1L, true);
gives me back BitSet with following bits set – please notice it does not correspond to correct offset.
BitSet bset = BitSet.valueOf(jedis.get(key.getBytes()));
BitSet: {0, 6, 15, 44427}
Is there a way to get it in right order? Thanks a lot!
What do you mean by “right order”? Descending instead of ascending?
This is an old post but vladaman is right, the values are not going to come out properly using the methods mentioned in this post. I opened an issue with the Jedis client folks to confirm:
https://github.com/xetorthio/jedis/issues/301
Using Java’s built-in BitSet.valueOf() will not give you the bit set that you expect if you just build up a bitset using setbit(). Redis stores those bit values in very literal left-to-right order within the byte, which valueOf() will read backwards. You will need to use a helper method as described in this gist to reverse the bits:
https://gist.github.com/2773136
Very clever. I didn’t know about any of the bit operations.
What cool, clever technique.
Thanks for writing it up, can think of lots of uses it for it already for my own apps.
Pingback: Hírek, érdekességek [2011-12-01] | Kódfejtők
This is really cool. Thank you for sharing. (Awesome product too!)
My only concern with this approach (unless I missed something) is that your bitmap doesn’t represent the number of current users but rather the number of users that have ever created an account on your site (since offsets are set using user ids). So your bitmap would eventually grow much larger than it needs to be (as users disappear/delete accounts) unless user ids are periodically re-counted or new users are added into the holes.
Matt,
This is the idea behind creating specific bitmaps with timestamps, so a daily bitmap would be something like users:2011-12-01 and flip the bits there.
Pingback: The Swiss, Your Brain, and Knowing That You’ll Be Dead Soon – Links: 11/28 – 12/2 | Conduit
Nice post.
About counting daily users, why not keep 3 separate integers and increment them whenever you’re setting a bit for the first time? Then, reset them daily, weekly and monthly depending on the counter type.
JF
Yep, exactly right. That’s another way to do the caching mechanism we talk about as an optimization.
Can this technique be modified to have an array of positive numbers instead of 0 and 1 and calculate more complicated tasks.
For example, calculate number of some actions (which can happen multiple times for each user during each day) performed by users today.
Here, at nk.pl (foremly nasza-klasa.pl), we use much more efficient appoach influenced by streaming algorithms. If you use Chernoffbounds (and other helpful probability theorems) you will quickly realise, that you can estimate with very small error the number of unique users if you restrict your attention to those which have user id which is in some small pseudorandomly chosen set S, and then multiply the result to compensate the size difference between |S| and total number of users in the system.
In particular you can focus on users with id mod 2^k == 0, which is quite efficient to check. This mean that the set S has 1/2^k size of all users. Now you can simply store all ids which pass the check in memory as a HashSet. You have to choose k carefully (according to the aformentioned theorems) to have correct results with high probability. The trick is, that since the number of daily users monotonically increases during the day, the k that you need also increases, so whenver you hit the moment when k must be incremented, you simply can drop all elements from the HashSet which have k-th bit equal to 1, which is also simple.
This allowed us to store many different statistics (not only “unique daily visits”) in realtime on single machine (in our implementation: memcache, but it seems to me easier to implement on Redis).
Jakub, could you elaborate a little bit more on your approach? What other theorems should one look into, to get a deeper understanding of approach described above?
That approach sounds very similar to so-called “Flajolet-Martin sketches”, described in the paper “Probabilistic Counting Algorithms for Data Base Applications.”, listed on the following page: http://algo.inria.fr/flajolet/Publications/
@Julius: They basically assume that one in 1024 users has an id with 10 bits set to one. If you encounter such a user id, you can just assume that you’ve seen 1024 unique users–and set bit ’10′ in a word. Since you’re storing the binary logarithm of the number of visitors, rather than the actual number of visitors, you can count huge numbers of visitors with only a few bits–if you forsake accuracy, and are comfortable with the possibility of a statistical fluke upsetting your numbers.
The bitmaps used in the approached are amenable to the same kind of manipulation (union) you describe in your approach above.
Pingback: Rounded Corners 310 – Gangster /by @assaf
Pingback: Trevor Turk — Links for 12-9-2011
This is really cool. I am going to definitely add redis bitmaps to my sites for enhance analytics. Thanks!
Pingback: “The Little Redis Book”読んでみた « Siguniang's Blog
Here you have a useful snipped that I’ve written down in PHP, if someone is interested.
It basically loops through all the redis keys and return the population (the number of 1′s found) in a bitmap.
It takes around 15 seconds to get each key population considering that we are storing 32 bit integers.
// include Redis
require 'lib/Predis/Autoloader.php';
Predis\Autoloader::register();
$redis = new \Predis\Client();
// get the list of all the keys
$list = $redis->keys("*");
// loop through the list
foreach ($list as $key)
{
// get the strin of the given key
$string = $redis->get($key);
// get the hex representation of that string
$data = implode(unpack("H*", $string));
// loaded into gmp
$bitmap = gmp_init($data, 16);
// count the population of the bitmap
echo $key . " = " . gmp_popcount($bitmap) . "\n";
}
A bit late to the party, but here’s another example in PHP that utilizes the PECL Bitset functionality–>http://avad.hu/t/content/realtime-metrics-using-bitmaps-redis-and-php.
How do you backup these values? Are you saving the bitmaps to a database every minute?
Redis does this automatically.
Pingback: Fast, easy, realtime metrics using Redis bitmaps « « turnings :: daniel berlinger
Pingback: Redis и области его применения | Записки программиста
Thanks for the nice docu. I struggle a bit with the translation to python.
Finally it seems ok.
I’m happy for possible improvements.
Christian
import redis
import random
import bitstring
import time
def byte2hex(byteStr):
return ”.join( [ "%02X " % ord( x ) for x in byteStr ] ).strip(“”).replace(” “,”")
def popcount(redis,bitkey):
a = bin(int(byte2hex(r.get(bitkey)),16))[2:]
return a.count(’1′)
def random_user_offset():
uid = random.randint(0,10000000)
visit = int(round(random.betavariate(1,1),0))
return [uid,visit]
if __name__ == “__main__”:
r = redis.StrictRedis()
r.delete(‘daily_active_users’)
ar = []
start_time = time.time()
for i in range(100):
res = random_user_offset()
#print res
r.setbit(‘daily_active_users’,int(res[0]),int(res[1]))
ar.append(res[1])
print time.time() – start_time, “seconds”
print sum(ar)
print popcount(r,’daily_active_users’)
print time.time() – start_time, “seconds”
Finally something clever and not just a regurgitation of the same old techniques, keep up the great info!
Pingback: 2012년 3월 26일 it 기술 동향 |
I’ve posted a related blog post on probabilistic counting. Counters like LogLog and linear probabilistic counters are more space efficient than bitmap counters and useful when you are counting very large sets (billions) or require a large number of counters. http://www.addthis.com/blog/2012/03/26/probabilistic-counting/#.T3GZjvBSQpo
Pingback: Probabilistic Counting | AddThis Blog
Pingback: Fast, easy, realtime metrics using Redis bitmaps « async I/O News
Pingback: Picking a Cache Server | Brent Ozar PLF
Pingback: What are the underlying data structures used for Redis? | PHP Developer Resource
Pingback: Redis: What are you planning on using the new Redis command BITOP for? - Quora
Pingback: MongoDB for Real-Time Analytics Without MapReduce | DevOpsANGLE
Pingback: Be Careful With your Redis BitSets and Java | Grant Muller