How to hire the best engineers – part 2

In my previous blog post, I came to the conclusion that

If you’re a hiring manager with 100 candidates, just interview a sample pool of 7 candidates, choose the best, and then interview the rest one at a time until you reach one that is better than the best in the sample pool. You will have a 90% chance of choosing someone in the top 25% of those 100 candidates.

I mentioned that there are two worst-case scenarios for this hiring strategy. First, if by coincidence we selected the 7 worst candidates out of the 100, then the next candidate we interview will be hired regardless of how good or bad he or she is. Second, if the best candidate is already in the sample pool, then we’ll go through the rest of the population without ever finding anyone better. Let’s talk about these scenarios now.

The first is easily resolved: those outcomes are already included in our simulation’s probabilities, so we don’t need to worry about it any more. The second is a bit trickier. To find the failure probability, we tweak the simulation again. This time, instead of iterating over every sample pool size, we focus on the one that gave the best results, i.e. a sample pool size of 7.

population_size = 100
size = 7
iteration_size = 100000
frequencies = []

iteration_size.times do
  # create the population and randomize it
  population = (0..population_size - 1).to_a.sort_by { rand }
  # the sample pool we interview first, and everyone else after it
  sample = population.first(size)
  rest_of_population = population.drop(size)
  # the benchmark: the best candidate in the sample pool
  best_sample = sample.max
  # record how far into the rest we go before beating the benchmark;
  # if nobody beats it, the best was in the sample pool and we record nothing
  rest_of_population.each_with_index do |p, i|
    if p > best_sample
      frequencies << i
      break
    end
  end
end

puts "Success probability : #{frequencies.size.to_f * 100.0 / iteration_size.to_f}%"

Notice that we’re essentially counting the number of successes over all the iterations we made. The result is a success probability of 93%!
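As a quick sanity check on that figure (my own arithmetic, not part of the original simulation): the strategy fails exactly when the best of the 100 candidates happens to land in the 7-person sample pool, and the best candidate is equally likely to be in any position.

population_size = 100
size = 7
# the best candidate falls inside the sample pool with probability size/population_size
puts 1.0 - size.to_f / population_size # => 0.93, matching the simulation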

Let’s look at some basic laws of probability now. Let A be the event that the strategy works (the sample pool does not contain the best candidate), so the probability of A is P(A), and this value is 0.93. Let B be the event that the candidate we find with this strategy is within the top quartile (regardless of whether A happens), so the probability of B is P(B). We don’t know the value of P(B), and we can’t easily work it out, since A and B are not independent. What we really want is P(A ∩ B), the probability that both A and B happen. What we do know, however, is P(B|A), the probability of B given that A happens. This value is 0.9063 (from the previous post).

Using the rule of multiplication:

P(A ∩ B) = P(A) × P(B|A)

We get the answer:

P(A ∩ B) = 0.93 × 0.9063 ≈ 0.843

which is an 84.3% probability that the sample pool does not contain the best candidate and we still get a candidate in the top quartile of the total candidate population.
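Or, spelled out in Ruby (just the multiplication above):

p_a = 0.93           # P(A): the best candidate is not in the sample pool
p_b_given_a = 0.9063 # P(B|A): a top-quartile hire, given the strategy works
puts p_a * p_b_given_a # => 0.842859, i.e. roughly 84.3%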

Next, we want to find out: given that we have a high probability of getting the right candidate, how many candidates will we have to interview before we find one that is better than the best in the sample pool? Naturally, if we have to interview a lot of candidates before finding one, the strategy isn’t practical. Let’s tweak the simulation again, this time counting only the successful runs and building a histogram of the number of interviews before we hit the jackpot.

What kind of histogram do you think we will get? My initial thought was that it would be a normal distribution, which would be bad news for this strategy: a normal distribution peaks around the mean, which would put the number of candidates we need to interview at around 40 or more.

require 'rubygems'
require 'faster_csv'

population_size = 100
size = 7
iteration_size = 100000
frequencies = []

iteration_size.times do
  # create the population and randomize it
  population = (0..population_size - 1).to_a.sort_by { rand }
  sample = population.first(size)
  rest_of_population = population.drop(size)
  best_sample = sample.max
  # record the 0-based position of the first candidate who beats
  # the best of the sample pool
  rest_of_population.each_with_index do |p, i|
    if p > best_sample
      frequencies << i
      break
    end
  end
end

# write the histogram: for each interview position, the fraction of
# successful runs that found the candidate at that position
FasterCSV.open('optimal_duration.csv', 'w') do |csv|
  (population_size - size).times do |i|
    count = frequencies.find_all { |f| f == i }.size
    csv << [i + 1, count.to_f / frequencies.size.to_f]
  end
end

The output is a CSV file called optimal_duration.csv. Charting the data as a histogram, we surprisingly get a power-law-like graph:


This is good news for our strategy! Looking at our dataset, the probability that the very first candidate we interview after the sample pool is the one is 13.58%, while the probability of getting our man (or woman) within the first 10 candidates we interview is a whopping 63.25%!
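For the curious, this curve can also be cross-checked analytically (my own derivation, not from the original post): the first candidate to beat the sample’s best shows up at 0-based position i of the remaining pool exactly when the best of the first 7+i candidates sits inside the sample pool and the next candidate is the best of the first 7+i+1.

s, n = 7, 100

# unconditional probability that the first candidate better than the
# sample's best is the (i+1)-th one interviewed after the sample
first_better_at = lambda { |i| (s.to_f / (s + i)) * (1.0 / (s + i + 1)) }

# summing over all positions recovers the overall success probability
p_success = (0...(n - s)).inject(0.0) { |acc, i| acc + first_better_at.call(i) }
puts p_success # => 0.93, as before

# conditioned on success, as in the CSV output
puts first_better_at.call(0) / p_success # => ~0.134, close to the simulated 13.58%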

So have we nailed the question of whether this strategy is a good one? No, there is still one last question unanswered, and it’s the deal-breaker. Stay tuned for part 3! (For those reading this, an exercise: what do you think the deal-breaker question is?)

How to hire the best engineers without killing yourself

Garena, where I work now, is in expansion mode, and I have been hiring engineers, sysadmins and so on to feed the development frenzy of platform revamps and product roadmaps. A problem I face when hiring engineers is that we’re not the only company doing so. This is especially true now that many companies have issued their annual bonuses (or lack thereof), and the ranks of the dissatisfied have joined the swelling exodus out of many tech companies. In other words, it’s mass musical chairs between tech companies and engineers.

Needless to say, this makes hiring a challenge. The good news is that there are plenty of candidates. The bad news is that it’s hard to secure a correctly skilled engineer with the right mindset for a growing startup. At the same time, identification and confirmation need to be swift, because if you are slow, even a loosely-fitting candidate can be lost within a day or two. This caused me to wonder: what is the best way to go through a large list of candidates and successfully pick the best engineer, or at least someone in the top percentile of the list?

In Why Flip a Coin: The Art and Science of Good Decisions, H.W. Lewis wrote about a similar (though stricter) problem involving dating. Instead of choosing candidates, the book talked about choosing a wife, and instead of conducting interviews, the problem involved dating. The difference is that in the book you can only date one person at a time, while in my situation I can obviously interview more than one candidate. Nonetheless, the problems are pretty much the same, since if I interview too many candidates and take too long to decide, they will be snapped up by other companies. Not to mention that I would probably foam at the mouth and die of interview overdose before then.

In the book, Lewis suggested this strategy: say we are choosing from a pool of 20 candidates. Instead of interviewing each and every one of them, we randomly choose and interview 4 candidates, and pick the best of that sample pool of 4. Then, armed with the best candidate from the sample pool as a benchmark, we go through the rest of the candidates one by one until we hit one who is better, and hire that candidate.
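To make the mechanics concrete, here’s a minimal sketch of the strategy in Ruby (my own illustration with made-up names, not code from the book):

# interview `sample_size` candidates to set a benchmark, then hire the
# first later candidate who beats it (nil if no one ever does);
# with an empty sample pool we simply take the first candidate
def hire(candidates, sample_size)
  benchmark = candidates.first(sample_size).max
  candidates.drop(sample_size).find { |c| benchmark.nil? || c > benchmark }
end

pool = (1..20).to_a.sort_by { rand } # candidates in random interview order
puts hire(pool, 4).inspect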

As you might guess, this strategy is probabilistic and doesn’t guarantee the best candidate. In fact, there are two worst-case scenarios. First, if we happen to choose the 4 worst candidates of the lot as the sample pool, and the first candidate we interview outside the sample pool is the 5th worst, then we end up hiring the 5th-worst candidate. Not good.

Conversely, if the best candidate is in the sample pool, then we run the risk of doing all 20 interviews and still losing the best candidate, because it took too long to get through them. Bad again. So is this a good strategy? And what are the best population pool (total number of candidates) and sample pool sizes to make the most of this strategy? Let’s be good engineers and do a Monte Carlo simulation to find out.

Let’s start with a population pool of 20 candidates, then iterate through sample pool sizes of 0 to 19. For each sample pool size, we find the probability that the candidate we choose is the best candidate in the population.

Actually, we already know the probability when the sample pool size is 0 or 19. When the sample pool is 0, we simply choose the first candidate we interview (since there is no one to compare against!), so the probability is 1/20, which is 5%. Similarly, with a sample pool of 19, we are forced to choose the last candidate, and the probability is again 1/20, or 5%. Here’s the Ruby code to simulate this. We run 100,000 iterations to make the probability estimates as accurate as possible, then save them into a CSV file called optimal.csv.

require 'rubygems'
require 'faster_csv'

population_size = 20
sample_sizes = 0..population_size - 1
iteration_size = 100000

FasterCSV.open('optimal.csv', 'w') do |csv|
  sample_sizes.each do |size|
    is_best_choice_count = 0
    iteration_size.times do
      # create the population and randomize it
      population = (0..population_size - 1).to_a.sort_by { rand }
      # get the sample pool and the remaining candidates
      sample = population.first(size)
      rest_of_population = population.drop(size)
      # this is the best of the sample pool (nil if the pool is empty)
      best_sample = sample.max
      # find the candidate chosen by this strategy; with an empty
      # sample pool we simply take the first candidate
      best_next = rest_of_population.find { |i| best_sample.nil? || i > best_sample }
      best_population = population.max
      # is this the best choice? count how many times it is the best
      is_best_choice_count += 1 if best_next == best_population
    end
    best_probability = is_best_choice_count.to_f / iteration_size.to_f
    csv << [size, best_probability]
  end
end

The code is quite self-explanatory (especially with the in-code comments) so I won’t go into the details. The results are shown in the line chart below, after opening the CSV in Excel and charting it. As you can see, if you choose 4 candidates as the sample pool, you have roughly a 1-in-3 chance of choosing the best candidate. The best odds come from a sample pool of 7 candidates, which gives around a 38.5% probability of choosing the best candidate. Doesn’t look good.
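For comparison, the classic secretary-problem analysis gives a closed form for this curve (a cross-check I’m adding; it’s not in the original post): with a sample pool of s out of n candidates, the success probability is (s/n) × (1/s + 1/(s+1) + … + 1/(n−1)).

# exact probability that the strategy picks the single best candidate
def p_best(n, s)
  return 1.0 / n if s == 0 # no benchmark: we just take the first candidate
  (s.to_f / n) * (s...n).inject(0.0) { |acc, k| acc + 1.0 / k }
end

puts p_best(20, 4) # => ~0.3429, the roughly 1-in-3 chance
puts p_best(20, 7) # => ~0.3842, matching the simulated 38.5%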

But to be honest, for some positions I don’t really need the candidate to be the ‘best’ (such evaluations are subjective anyway). Let’s say I just want a candidate in the top quartile (top 25%). What are my odds then? Here’s the revised code for that simulation.

require 'rubygems'
require 'faster_csv'

population_size = 20
sample_sizes = 0..population_size - 1
iteration_size = 100000
# the top quartile: the 5 best of 20 candidates
top = (population_size - 5)..(population_size - 1)

FasterCSV.open('optimal.csv', 'w') do |csv|
  sample_sizes.each do |size|
    is_best_choice_count = 0
    is_top_choice_count = 0
    iteration_size.times do
      population = (0..population_size - 1).to_a.sort_by { rand }
      sample = population.first(size)
      rest_of_population = population.drop(size)
      best_sample = sample.max
      best_next = rest_of_population.find { |i| best_sample.nil? || i > best_sample }
      best_population = population.max
      top_population = population.sort[top]
      is_best_choice_count += 1 if best_next == best_population
      # also count hits anywhere in the top quartile
      is_top_choice_count += 1 if top_population.include?(best_next)
    end
    best_probability = is_best_choice_count.to_f / iteration_size.to_f
    top_probability = is_top_choice_count.to_f / iteration_size.to_f
    csv << [size, best_probability, top_probability]
  end
end

The optimal.csv file now has a new column, showing the probability of choosing a top-quartile (top 5) candidate. The new line chart is shown below, with the results of the previous simulation included for comparison.

Things look brighter now: the optimal sample pool size is 4 (though for practical purposes 3 is good enough, since the difference between 3 and 4 is small), and the probability of choosing a top-quartile candidate shoots up to 72.7%. Pretty good! Now, this is with 20 candidates. How about a larger candidate pool? How does this strategy stand up with, say, a population pool of 100 candidates?
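To rerun the simulation at that scale, only two lines of the script above need to change (the top quartile now being the best 25 of 100):

population_size = 100
top = (population_size - 25)..(population_size - 1) # the 25 best of 100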

As you can see, this strategy doesn’t work well for getting the very best out of a large pool (the required sample pool is too large, and the probability of success too low); it does worse than with a smaller population pool. However, if we want the top quartile or so (i.e. we are less picky), we only need a sample pool of 7 candidates, and we have a 90.63% probability of getting what we want. These are amazing odds! It means that if you’re a hiring manager with 100 candidates, you don’t need to kill yourself trying to interview everyone. Just interview a sample pool of 7 candidates, choose the best, and then interview the rest one at a time until you reach one who is better than the best in the sample pool. You will have a 90% chance of choosing someone in the top 25% of those 100 candidates (which is probably what you want anyway)!