New algorithm for autobalance=auto

A place to post suggestions for new features, new bugs, and comments about the existing code.
the.ynik
Posts: 101
Joined: Fri Apr 17, 2009 7:23 pm
Location: Germany

Post by the.ynik »

MrChaos wrote:QUOTE (MrChaos @ Jul 30 2010, 08:23 PM) If CSS is about to go live then wait, please wait, and then let's look at aspects of the.ynik's work, MSR's TS AB, and of course crabby old Baker's too. I'm not sure what dovetails with the earlier work and what does not with the.ynik's work. You can check the results, and beta testing with accurate ranks is good too.
This is the main problem with 'please wait': there's no time plan that tells us how long we will be waiting.
As far as I can tell from the CSS forum, it's inactive and we could be waiting Two Weeks™.
MrChaos wrote:QUOTE (MrChaos @ Jul 30 2010, 08:23 PM) Again, AB and any qualifiers will absolutely affect rank. Positively or negatively, it all depends; but the less you allow them to stack, AND the more accurately you sort them, the better the rank will work for all.
True. But this actually is a good argument for not letting R6 ship with the old 'sum of ranks' autobalance. Provide at least some alternative to get people away from max. Team Imbalance 1 as soon as possible.
I can hardly imagine the stacks getting any worse than they are now (and even if they do get worse, the game commanders can go back to using max. Team Imbalance 1).

My take on it: get Turkey's implementation of my algorithm into Beta testing ASAP, so that it can get enough testing before the release. When doing some large games on the beta server (this will be necessary anyway to ensure R6 works for everybody), try out the new autobalance. Most people log into Beta with their true rank (they use ASGS to start Beta mode).
After some Beta testing is done, we'll have to discuss whether to revert those changes or keep them in for the R6 release.
Last edited by the.ynik on Fri Jul 30, 2010 6:39 pm, edited 1 time in total.
MrChaos
Posts: 8352
Joined: Tue Mar 21, 2006 8:00 am

Post by MrChaos »

the.ynik, you are presenting a moving target. First it's 'I can't figure out a way to see the results of AB with choices'; I say sure you can, then we get 'but I can't see a pure no-choices AB system'; then we get 'I don't like it when they can't choose at all' :lol: . It's not a laugh at you, it's a laugh with you.

Go read the work of others on how to introduce autobalance correctly. The literature exists, it's by Microsoft Research, and no, I no longer have the links (two computers and three houses ago).

Between the two of us I'm positive my subjective view is better than your subjective view, because I've worked with the data, am a ten-year vet, was part of AS' implementation, and am older, handsomer, cooler; I've got awesomesauce cats, and pants-stealing laser monkeys too. Feel me? I hope that got you to laugh.

I'm telling you those things because you asked for something, anything at all, from my mind, and they were in the data as I combed it with Baker. Over and over and over. A clever fellow would instead have asked me, "So what did the statistical review tell you, MrChaos?" *crickets chirp* *points left* *runs off stage right*

edit: You need to test it first, and you have to figure out how to test that your idea works, not just the code. I appreciate your passion and enthusiasm. So OK, go do the coding, run the beta testing, and be prepared for it never to be used, because that's what I think is going to happen. As a peace offering, let me find the link to the MSR document.

The absolute best layman's explanation of TrueSkill I've ever seen
Which he stole whole cloth from this paper

No can do damn it, sorry.




Peace All
MrChaos
Last edited by MrChaos on Fri Jul 30, 2010 8:04 pm, edited 1 time in total.
Ssssh
the.ynik
Posts: 101
Joined: Fri Apr 17, 2009 7:23 pm
Location: Germany

Post by the.ynik »

I think I understand how AllegSkill works and why using the team's AllegSkill is better than anything we can do with conservative ranks. But at the same time I want something good (or at least acceptable) for R6, not something perfect later. Though certainly I'm also interested in doing it correctly once CSS is ready.

OK, at the risk of hijacking my own thread [remember, I want a solution for R6, not for R_x where ReleaseDate(R_x) >= ReleaseDate(CSS)]:
MrChaos wrote:QUOTE (MrChaos @ Jul 30 2010, 08:37 PM) Go read the work of others on how to introduce autobalance correctly. The literature exists, it's by Microsoft Research, and no, I no longer have the links (two computers and three houses ago).
Apart from the TrueSkill website and the Herbrich/Minka/Graepel paper introducing TrueSkill, I couldn't find anything balance-related there (though that may be the fault of the crappy search). Did you see any papers building on top of TrueSkill? (edit: thanks for sneak-editing the links into your previous post)
Nothing seems to deal with post-launch balancing. And that's where in my impression most stacking is happening, though I'm sure you have hard facts on that. :)

Also, I have some problems understanding the math-heavy parts - I skipped the statistics course at university because it was too early in the morning :blush: (getting up at 6 AM? no way! :P )
I did take stochastics instead, though, so I'm not completely lost when a normal distribution or Mr. Bayes is involved.

What comes up repeatedly is the 'probability of a draw' - do you have any idea of how to handle that? Alleg doesn't have draws (except prior to SG time :P ), though maybe q_draw (equation 7) from the TrueSkill paper could still be used to measure the balance?
If yes, we could try to use my algorithm and set imbalance = 1/q_draw (and possibly keep the 'number of players' imbalance measurement with lower weight to keep balancing the number of newbies).
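
Roughly what I have in mind, as a sketch (the β value and the generalization of the two-player draw probability to teams follow what I've seen in public TrueSkill write-ups; nothing here is Allegiance's actual code or constants):
[code]
import math
from dataclasses import dataclass

BETA = 25.0 / 6.0  # default TrueSkill performance-noise parameter (placeholder)

@dataclass
class Player:
    mu: float      # estimated mean skill
    sigma: float   # skill uncertainty

def q_draw(team_a, team_b, beta=BETA):
    """Two-team draw probability / match quality, generalized from the
    TrueSkill paper's two-player formula the way public implementations
    usually do: close to 1 for an even match, near 0 for a mismatch."""
    n = len(team_a) + len(team_b)
    mu_a = sum(p.mu for p in team_a)
    mu_b = sum(p.mu for p in team_b)
    var = sum(p.sigma ** 2 for p in team_a) + sum(p.sigma ** 2 for p in team_b)
    denom = n * beta ** 2 + var
    return math.sqrt(n * beta ** 2 / denom) * math.exp(-(mu_a - mu_b) ** 2 / (2 * denom))

def imbalance(team_a, team_b):
    """The imbalance = 1/q_draw measure suggested above (lower is more balanced)."""
    return 1.0 / q_draw(team_a, team_b)
[/code]
One thing to note: a newbie's huge σ inflates the variance term and drags q_draw down even when the μ sums are identical, so how newbies are handled matters here too.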

There are lots of possibilities; it could even be possible to make the 'flexibility' depend on the player's stack rating (yet another piece of information CSS would have to transmit), so that stackers have no choice but anti-stackers get to choose when teams are approximately even. No idea what the social aspects of that would be. Probably lots of stackers complaining loudly on the forums :lol:

But I'm interested to know what your ideas for post-launch balance are.
MrChaos wrote:QUOTE (MrChaos @ Jul 30 2010, 08:37 PM) Between the two of us I'm positive my subjective view is better than your subjective view, because I've worked with the data, am a ten-year vet, was part of AS' implementation, and am older, handsomer, cooler; I've got awesomesauce cats, and pants-stealing laser monkeys too. Feel me? I hope that got you to laugh.
:thumbsup:
MrChaos wrote:QUOTE (MrChaos @ Jul 30 2010, 08:37 PM) I'm telling you those things because you asked for something, anything at all, from my mind, and they were in the data as I combed it with Baker. Over and over and over. A clever fellow would instead have asked me, "So what did the statistical review tell you, MrChaos?" *crickets chirp* *points left* *runs off stage right*
So what did the statistical review tell you, MrChaos? :D
Last edited by the.ynik on Sat Jul 31, 2010 2:19 pm, edited 1 time in total.
MrChaos
Posts: 8352
Joined: Tue Mar 21, 2006 8:00 am

Post by MrChaos »

The code isn't on this machine, I'm drawing a blank atm, and Baker isn't around either... I may have forgotten something important.


Again, I've not changed my stance that we should not implement until you have proof it will work, and there currently is no path forward to that goal. There is a need, but no rush, to implement it. Does it work? *shrug* And therein lies the rub. Testing it in beta without the actual player skills is a non-starter, or a herculean task to ensure everyone is using their real one; you'll need enough games that qualify too, along with representative numbers. The task isn't insurmountable, bud; it just needs some proof of concept, and work. Most definitely work, and I do NOT want you to walk the path I did, where others have, and see you quit or become jaded. Listen to silly old MrChaos... ahhh who am I kidding :lol:

Peace All

EDIT:

You do realize just giving random people random ranks on beta is a bad thing, right? You can validate the code, but not the AB system you are proposing, given that the ranks don't correspond to the person's actual skill level and rank. Hope that made sense; thanks *redacted name* for pointing out to me that it most likely STILL wasn't clear to those reading the thread. That was a point I was trying to make... stoopid English.
Last edited by MrChaos on Fri Jul 30, 2010 10:06 pm, edited 1 time in total.
Ssssh
HSharp
Posts: 5192
Joined: Fri Aug 11, 2006 11:18 am
Location: Brum, UK

Post by HSharp »

MrC, how did you and Baker test the TrueSkill autobalance?

From my uneducated viewpoint it looks almost impossible to test the balancing algorithm on past game data, because all the algorithm can do is balance players (side note: is this algorithm suitable for games with more than two teams?). You can't predict what team a player will want to join, so while having sets of thousands of games of data is very useful indeed for working out an accurate ranking system, I can't see how it can work for testing a balancing system. The only way I can see a test of the balancing system working is simply calculating the millions of possible outcomes of various ranked players joining teams, and then going through the arduous process of human examination of each result to see if it is a fair balance. I am thick though.
cashto
Posts: 3165
Joined: Mon Sep 10, 2007 5:40 am
Location: Seattle

Post by cashto »

You don't want to go off half-cocked, but I disagree with the distinguished gentleman from Detroit: you do not have to "validate it 100% at every level" before proceeding. First off, there is no such thing as 100% proof. Second, it's an option selectable by the GC. And if I'm not mistaken, wasn't it this same attitude that you and sarge so detested from TE: the insistence that all your ducks must be in a row, while failing to define what "ducks in a row" means, and then of course never having the time to check your work?

Don't lapse into analysis paralysis. Don't let the perfect be the enemy of the good. The fact is that our current autobalance options have zero theoretical basis behind them and went into effect with zero playtesting. You're holding these new ideas to a ridiculously high standard compared to the status quo.

To be honest, autobalancing according to team mu (instead of mu-3*sigma, as it is today) is going to impact very little. Sigma is randomly distributed and it would take an unusual configuration of high sigma on one side, and low sigmas on the other for it to make a difference. Summing mu-3*sigma is a pretty close approximation to summing mu most of the time. Summing mu is likely to fail for the same reason autobalance today fails: team sizes often become imbalanced, and you end up with something that looks straight out of 300: an army of newbies vs. a small cadre of veterans. According to Allegskill it's fair, but Allegskill was based on an algorithm that assumed games of EQUAL size.
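
To put some made-up numbers on that (purely illustrative rosters, not real player data):
[code]
# Hypothetical rosters as (mu, sigma) pairs -- illustrative numbers only.
team_a = [(30, 1.0), (28, 1.2), (12, 1.1)]   # settled accounts, low sigma
team_b = [(32, 6.0), (25, 5.5), (13, 1.0)]   # two unsettled accounts, high sigma

def sum_mu(team):
    return sum(mu for mu, _ in team)

def sum_conservative(team):
    # mu - 3*sigma, i.e. summing the displayed conservative-style ranks
    return sum(mu - 3 * sigma for mu, sigma in team)

print(sum_mu(team_a), sum_conservative(team_a))   # 70, ~60.1
print(sum_mu(team_b), sum_conservative(team_b))   # 70, ~32.5
[/code]
Both rosters have the same mu sum; only the lopsided sigmas make the conservative sums diverge, which is exactly the "unusual configuration" case.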

And Turkey makes a good point -- Trueskill team rank assumes that the default mu of 25 (15 in Allegskill terms) is a reasonable approximation of a newbie's skill. But we know that it's almost invariably an overestimate. Unless it's a fake newbie that's played before, it's invariably going to drop. Many long-time players don't even have a 25 mu. If I take a first-time newbie, and you take a rank-15 vet, in no sane world should our team ranks go up by the same amount.

We should be cautious to avoid giving too much deference to what Microsoft currently does. By itself, "you're ignoring years of research" is just an unconvincing argument from authority. Tell me what those years of research have to say about the specific issue at stake, and I'll listen attentively. Otherwise it's just an intellectually lazy dismissal of a legitimate line of inquiry.

TrueSkill guarantees nothing more than that two players of equal mu and low sigma are probably interchangeable. A team of 15, 13, 7, and 2 is likely an equal match for another team of 15, 13, 7, and 2 (when sigma is low). Pair off equally rated players, and the remainder is what is going to end up winning or losing the game.

TrueSkill does not prove that a team of five 20s and five 0s is an equal match for a team of ten 10s. There are probably some nonlinear effects there; TrueSkill doesn't attempt to model them, which is probably okay because those effects may be pretty small and this sort of rank distribution doesn't crop up very often anyway. TrueSkill is also silent about the scenario of five 20s fighting off ten 10s. As anyone who's played a world game knows, those 0s aren't exactly non-entities. They do occasionally wander into enemy sectors in a scout and eye things, and that has an effect on the game. Are seven 10s really an equal match for ten 7s? Is an army of Persians an equal match for 300 Spartans?

I have not dug into the specifics of ynik's proposal, but the general direction seems correct to me: it's not enough to balance mu; team size must be balanced as well. An attempt to do both (even if it is as ad hoc as HELO was) is bound to be better than one that focuses on one to the total exclusion of the other.
Last edited by cashto on Sat Jul 31, 2010 5:51 am, edited 1 time in total.
Globemaster_III wrote:QUOTE (Globemaster_III @ Jan 11 2018, 11:27 PM) as you know i think very little of cashto, cashto alway a flying low pilot, he alway flying a trainer airplane and he rented
notjarvis
Posts: 4629
Joined: Tue Jun 03, 2008 11:08 am
Location: Birmingham, UK

Post by notjarvis »

cashto wrote:QUOTE (cashto @ Jul 31 2010, 06:30 AM) To be honest, autobalancing according to team mu (instead of mu-3*sigma, as it is today) is going to impact very little. Sigma is randomly distributed and it would take an unusual configuration of high sigma on one side, and low sigmas on the other for it to make a difference. Summing mu-3*sigma is a pretty close approximation to summing mu most of the time. Summing mu is likely to fail for the same reason autobalance today fails: team sizes often become imbalanced, and you end up with something that looks straight out of 300: an army of newbies vs. a small cadre of veterans. According to Allegskill it's fair, but Allegskill was based on an algorithm that assumed games of EQUAL size.
Who suggested summing mu? No one in this thread that I saw :o .

It's generally agreed that combining mu and sigma through that, umm, statistical sum (or whatever it was) that Microsoft used for TrueSkill is the best way to work this; I linked the paper earlier in this thread, and I believe Baker has already written it. Whether it's worth doing a quick and dirty implementation in the meantime because it seems better is beyond what I will comment on.

At the heart of any AB system is an estimate of the relative strengths of the two teams. What you can do with all the historical data is go through it and see if your AB system predicts the stronger team winning in most cases. I believe this is part of what MrC and the Sgt did. Please correct me if I'm wrong.
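
Something along these lines, assuming the historical data can be read back as (team A, team B, winner) records (the record layout and names below are invented, not whatever format MrC and the Sgt actually used):
[code]
def prediction_accuracy(games, strength):
    """Fraction of historical games in which the team rated stronger by the
    given `strength` estimate actually won.

    `games` is assumed to be an iterable of (team_a, team_b, winner) tuples,
    where winner is 'a' or 'b'; `strength` is any function mapping a team
    roster to a number (sum of mu, sum of conservative ranks, 1/q_draw, ...).
    """
    correct = total = 0
    for team_a, team_b, winner in games:
        predicted = 'a' if strength(team_a) >= strength(team_b) else 'b'
        correct += (predicted == winner)
        total += 1
    return correct / total if total else float('nan')

# e.g. prediction_accuracy(games, lambda team: sum(p.mu for p in team))
[/code]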
Last edited by notjarvis on Sat Jul 31, 2010 8:42 am, edited 1 time in total.
pkk
Posts: 5419
Joined: Tue Jul 01, 2003 7:00 am
Location: Germany, Munich

Post by pkk »

I've read some stuff about TrueSkill yesterday and found this interesting sentence about matchmaking:
Ralf Herbrich wrote:QUOTE (Ralf Herbrich @ Gamefest 2007)Matchmaking is done based on mean skill (Mu = μ) and not on TrueSkill (μ-3σ)!
Current Autobalance uses AllegSkill to balance teams.
the.ynik's Autobalance uses relative AllegSkill to balance games.
TrueSkill uses relative μ to balance games.

My suggestions for improving the current autobalance system:

First you need the relative μ weight of the teams. It is calculated this way:

relative μ(Team 1) = Σ over Players 1–5 of [ μ(Player i) · Playtime(Player i) / currentGameTime ]

Same for Team 2.

This result should be shown instead of the sum of all ranks on the Lobby/F6 screen.


Now to Autobalance:

You have the relative μ of both teams. Now you have to predict how the μ of the new player affects the relative μ of the team. This can be done by looking into the future: you calculate the relative μ of the teams after another 15 minutes (just an example).

relative μ(Team 1, prediction + new player) = Σ over Players 1–5 of [ μ(Player i) · Playtime(Player i) / (currentGameTime + 15) ] + μ(new player) · 15 / (currentGameTime + 15)

relative μ(Team 1, prediction) = Σ over Players 1–5 of [ μ(Player i) · Playtime(Player i) / (currentGameTime + 15) ]

Now you have four values: two represent the current match quality and two represent the match quality after the player joins.

You compare μ(Team 1, prediction) with μ(Team 2, prediction + new player), and μ(Team 2, prediction) with μ(Team 1, prediction + new player).

The player can only join the team where he improves the match quality (i.e. changes it less).
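
A minimal sketch of that rule as I read it, with placeholder data structures (each team as a list of (μ, playtime-in-minutes) pairs; this is not actual lobby code):
[code]
def weighted_mu(team, game_time, horizon=0.0, joiner_mu=None):
    # Playtime-weighted relative mu, optionally projected `horizon` minutes
    # ahead and optionally including a player who would join right now.
    t = game_time + horizon
    total = sum(mu * playtime / t for mu, playtime in team)
    if joiner_mu is not None:
        total += joiner_mu * horizon / t   # the joiner would have played `horizon` minutes
    return total

def allowed_team(team1, team2, game_time, joiner_mu, horizon=15.0):
    # Compare mu(Team 1, prediction) vs mu(Team 2, prediction + new player)
    # and vice versa; allow only the join that leaves the teams closer together.
    gap_if_join_1 = abs(weighted_mu(team1, game_time, horizon, joiner_mu)
                        - weighted_mu(team2, game_time, horizon))
    gap_if_join_2 = abs(weighted_mu(team2, game_time, horizon, joiner_mu)
                        - weighted_mu(team1, game_time, horizon))
    return 1 if gap_if_join_1 <= gap_if_join_2 else 2
[/code]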


Some facts about newbies:
Newbies start with a μ of 25; this will end up with "unbalanced games".
Newbies lose μ faster than anyone else because of their high σ.

Question to Baker:
How is the current match quality calculated if noobs & vets (μ ≤ 25) play against average players (μ > 25)?
The Escapist (Justin Emerson) @ Dec 21 2010, 02:33 PM:
The history of open-source Allegiance is paved with the bodies of dead code branches, forum flame wars, and personal vendettas. But a community remains because people still love the game.
the.ynik
Posts: 101
Joined: Fri Apr 17, 2009 7:23 pm
Location: Germany

Post by the.ynik »

notjarvis wrote:QUOTE (notjarvis @ Jul 31 2010, 10:41 AM) At the heart of any AB system is an estimate of the relative strengths of the two teams. What you can do with all the historical data is go through it and see if your AB system predicts the stronger team winning in most cases. I believe this is part of what MrC and the Sgt did. Please correct me if I'm wrong.
True, but AllegSkill does this at the end of the game, looking at the time each player played, etc. Any auto-balance system should look at the current teams only (because if a vet quits an even game, he/another vet should immediately be able to rejoin that side; they shouldn't have to wait until the team's skill averaged over the whole game drops sufficiently low).
Could I do such a 'prediction' at every moment in the game and then take the average (giving equal weight to each moment)? I think AllegSkill does something similar by multiplying a player's skill by the time they played.
Then we could use the historical data to compare my algorithm with AllegSkill. Of course AllegSkill will be better, but it would be interesting to see how close I get (especially compared to simply using 'Max team imbalance = 1' or the old 'sum of ranks' autobalance).
Still, this would only test how good the 'imbalance' measure actually is (and it could help determine the weighting constants; the current values are just the result of experimentation); it doesn't help with the remainder of the algorithm (flexibility etc.) and it doesn't help with judging the effect on stacking/gameplay. The latter would be much more important to test, but that seems to be impossible without an accurate model of player behavior (incl. the behavior they would show under the new system).
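
Concretely, something like this is what I mean, assuming each historical game could be replayed as team-roster snapshots taken at regular intervals (all names here are placeholders, not ASGS data structures):
[code]
def fraction_rated_stronger(snapshots, strength):
    # `snapshots` is assumed to be a list of (roster_a, roster_b) pairs sampled
    # at equally spaced moments of one game; `strength` is any per-team estimate.
    # Returns the fraction of moments at which team A was rated the stronger
    # team, giving equal weight to each moment.
    votes_for_a = sum(1 for a, b in snapshots if strength(a) >= strength(b))
    return votes_for_a / len(snapshots)
[/code]
Comparing that number with the recorded winner of each historical game would at least show how well the imbalance measure tracks outcomes; it still says nothing about how players would behave under the new system.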

So where is this historical data so that I can give it a try?
Last edited by the.ynik on Sat Jul 31, 2010 9:57 am, edited 1 time in total.
the.ynik
Posts: 101
Joined: Fri Apr 17, 2009 7:23 pm
Location: Germany

Post by the.ynik »

pkk: hmm, how newbies are handled is a problem. Also, see my post above for why I wouldn't want the playing time to be part of the calculation.
Your calculation isn't really correct because you're just ignoring sigma. What TrueSkill does is not simply balance mu, but balance the probability of a draw, where both the mu and the sigma of both teams play a part.

I'll have to do some calculations to see how adding a newbie to a team affects that.

Still, TrueSkill assumes each player starts with a fixed skill (which only varies slowly) and TrueSkill just doesn't know it yet. That's simply not true for Allegiance, where new players have almost no "skill" at all because they need to learn the game first. Assuming that a new player has average skill with high uncertainty, even if it works for ranking purposes (I think it does; though not everyone agrees on that), could pose problems for autobalance.
Conservative rank might be better (or might not, I'm completely unsure at the moment).
Last edited by the.ynik on Sat Jul 31, 2010 11:53 am, edited 1 time in total.