Add-ons client/server support for add-on ratings

Discussion of all aspects of the game engine, including development of new and existing features.


tekelili
Posts: 1039
Joined: August 19th, 2009, 9:28 pm

Re: Add-ons client/server support for add-on ratings

Post by tekelili »

GbDorn wrote:I'm against a single global score. If I'm only interested in addons with good story, I don't care about addons with poor story but with a higher score because they have better graphics. The global score would be useless but some generic categories would be helpful.
If I'm only interested in add-ons of high overall quality, some generic categories would be useless to me. I don't see why the global score should be removed, as it can also work as a generic category. Btw, I mostly agree with Dugi on the single score / reviews duality, mostly because a very quick "fill this field" doesn't scare users away from giving feedback (I myself tend to close any feedback form with too many options).
Be aware that English is not my first language and I may have explained myself badly, using wrong or simply invented words.
World Conquest II
GbDorn
Posts: 60
Joined: March 26th, 2014, 5:07 pm

Re: Add-ons client/server support for add-on ratings

Post by GbDorn »

tekelili wrote:If I'm only interested in add-ons of high overall quality, some generic categories would be useless to me. I don't see why the global score should be removed, as it can also work as a generic category.
This is exactly what I've said:
In my post I wrote:You can even have a 'global quality of addon' as a category which would be the same as your single score. The users would be able to choose which category they want to use for sorting.
tekelili wrote:I mostly agree with Dugi on the single score / reviews duality, mostly because a very quick "fill this field" doesn't scare users away from giving feedback (I myself tend to close any feedback form with too many options).
There would be only 5-7 categories and you don't have to fill all of them if you don't want to.
Dugi
Posts: 4961
Joined: July 22nd, 2010, 10:29 am
Location: Carpathian Mountains
Contact:

Re: Add-ons client/server support for add-on ratings

Post by Dugi »

GbDorn wrote:I'm against a single global score.
I know. That was never unclear.

I was told to put a single global score there; my original idea was reviews only, no numbers. Numbers are heavily subjective and there is little meaning behind them, I know. But as somebody said (I think it was fabi), you need a numeric rating to quickly sort the popular and therefore probably good ones from the unpopular and probably bad ones (this isn't like music, where even the less known ones are on the list and will be seen if they're good).

You did not write anything against this argument, and nothing you've written above amounts to a counter-argument. I am the person who wrote the code for the add-on rating/reviewing system, so I am the one you need to persuade. If you avoid replying to arguments that are hard to counter, you might persuade people who haven't read the thread properly, but it will only worsen my opinion of you.
GbDorn wrote:I don't need to have made an addon to know this. But how is this different from reading bad reviews?
If you read a bad review, you see that somebody didn't like something or that the thing reviewed is bad. But if you have cared about something for months, done your best on it and are proud of the good work you've done, and then you read an overly negative review of your own work, you feel that your work was futile and you will be greatly disappointed and frustrated. I've seen one on my own work and it was very unpleasant (the guy was probably frustrated by its crappy ending parts). I've also written a review that said a lot about the weak spots of an add-on, and its author was quite sad even though it was clear that I didn't mean it was bad overall.

If you see how many people agree with the reviews there, you'll get closer to the truth and see what is the general opinion and what is the opinion of a single person. But this is quite off topic.
GbDorn wrote:A threshold is unfair with a 100-point scale. If I missed the threshold because someone rated my addon 6.6 instead of 6.7, I would be offended because this is just insignificant at the user level. On the other hand the difference between "slightly above average" and "pretty good" is quite significant at the user level, so the threshold effect is less dependent on such fine accuracy at the population level. And in any case I think the threshold (if any) should be left to the user's choice.
You didn't get my point here. I did not mean to add a threshold somewhere, but if there were some sort of yes/no rating, there would be some add-ons just below the threshold (say at 40%) and some just above it (say at 60%). The gap between just above the threshold and just below it is great, while the gap between just above the threshold and absolutely awesome (say 95%) is non-existent, even though the quality difference is greater. The finer the steps, the better (and they are fine enough with decimals).

Furthermore, 'flawless' is a really bad label for the top rank; there are many campaigns without flaws. Flawless doesn't mean it is good at anything. A really good campaign isn't one without flaws, it must be interesting and original (and these can hide the flaws: I've played games that were good despite bad graphics, games that were great despite a bad engine, games that were awesome despite a lame story, ...).
GbDorn wrote:Or maybe I'll use the lowest score, hoping the author will improve his creation (but he'll have less work to do to earn an 8 than the guy with a pure 7).
This is @€!$^#! When you were a school kid and a teacher gave you a bad mark on an exam, did you take it as motivation to learn better next time? No, it either frustrated you and persuaded you that this teacher would never give you a good result, or it made you more desperate because getting the desired final grades just got harder. If you want to motivate somebody to keep improving some work, you must tell him it's good and show him how he can make it even better. If it's really bad, you should give him suggestions to improve it instead of telling him how it sucks (even if that would be honest).

Giving worse results for something merely frustrates the author and turns potential players away from it.
GbDorn wrote:Use of quantiles would indicate whether people agree or disagree on the score.
I know that, but most people won't understand it. This isn't an experimental physics lesson, this is a video game. And it might just end up telling you how many trolls are around.
GbDorn wrote:And then you have people speaking different languages.
I know; English isn't my native language. But are the add-on descriptions translated or translatable? Are the add-on names translated or translatable? And by far most add-ons are only in English anyway (even a translation into the author's native language is rare).
GbDorn wrote:Generic labels/categories and reviews are complementary.
I don't disagree with this. But I can't think of any good way to set this up.
Thinking about the tags:
-good graphics tag - implies a threshold that will hurt some authors
-RPG tag - add-ons can be more or less RPG-like; UtBS and IftU have a bit of AMLA that makes them a bit RPG-ish, Legend of the Invincibles has a full AMLA system and an extensive gear system but is otherwise a normal campaign, Five Fates is mostly a leader-only campaign (and is therefore more RPG-ish than Legend of the Invincibles even if its RPG system is less vast)
-skirmish, big battle, dungeon, etc... tags - many campaigns blend more of these styles
-long campaign tag - some campaigns are long because they have 20 scenarios, some campaigns are long because they have 200 scenarios

It would also need a jury to assign those tags, and a jury can be neither objective nor united in its opinions. One author will get his add-on tagged as having a good story; another will not get that tag just because he was rated by a different jury member. Somebody will find AtS RPG-ish; most people won't. A single person can't get it all done, and even a single person's opinions change depending on mood or time.

If you read a bit above, you'll see that I agreed with iceiceice's idea to create suggestions for players based on what other players who played the same add-ons also played. Its result is similar to your tagging idea. It doesn't help new players (but players who have played only mainline, which usually has good spritework and unoriginal stories, might not understand these tags much anyway), and it doesn't bring the (sort of unsolvable) problems your suggestion brings. I might add it after the current system gets accepted.

Also, I am noticing, not least from your previous topic in the Ideas forum, that you're the kind of guy who throws out plenty of suggestions without knowing much about the topic and doesn't even think about implementing the things he suggests.
iceiceice
Posts: 1056
Joined: August 23rd, 2013, 2:10 am

Re: Add-ons client/server support for add-on ratings

Post by iceiceice »

GbDorn wrote: I'm against a single global score. If I'm only interested in addons with good story, I don't care about addons with poor story but with a higher score because they have better graphics. The global score would be useless but some generic categories would be helpful.
For what it is worth, I'm actually mostly against what GbDorn is proposing. Let me try to explain why.

There are many approaches to reviews in software nowadays, especially because in the last few years we have been using such systems in our daily lives with increasing frequency. Everyone thinks they know a good way to recommend content and how to make a good list of important criteria, and one school of thought we are seeing here is that a "holistic", overall quality rating is bad and that each person should instead score individual "factors" of the content in great detail.

There are alternative approaches that still recognize that there are many different aspects to the "quality" of a campaign. For example, one way is to use machine learning to *implicitly* identify and define the factors. Why might this be a good idea? For one, making an exhaustive list of all the criteria we need to find good add-ons / help people find add-ons they will like is very hard!

1.) How would we know if we got them all, or missed some and are now making bad recommendations because of it?
2.) What if we select too many or some unimportant ones and end up creating noise for our users? Do we have a good way to tell that we did this?
3.) Why should we think that asking users to evaluate these factors is a good way to gauge them? Let me explain with an example. Surely "wine quality" is a thing that exists, there is good wine and bad wine. However when polled in tests most people are notoriously bad at identifying it. Are we sure that "art quality" is not similar in this regard? Actually I would expect that at least in some respects and settings it is.
4.) When it comes to predicting human behavior and "what people will like", often it pays to be cynical. It may be that for many users, the factors which *actually* explain what campaigns they like are quite simple and not at all about "RPG vs large armies game play", or "story quality" or "art quality". It may be, for example, that a large class of users is greatly entertained by the sounds that particular units make, and for these people, the best predictor of whether they will like a campaign is whether it involves either a dwarf thunderer or a lich (or "thundergobos" as in the swamplings campaign :) ). It may be that some users only like to play campaigns where they have access to some units with either very powerful attack specials or high resistances, or very powerful AMLAs, and they have less fun otherwise. A large set of scores for a very generic, "politically correct" list of criteria (reported with mean and variance (!)) just might not be as successful at helping users find add-ons they like as an automatic system that silently learns and makes recommendations based precisely on the observed preferences of the users. A good system would by design cluster the users so that if we get a new user whose tastes appear orthogonal to the tastes of a niche group, then their reports have very little impact on the information we give to this user.
5.) Creating a list of categories devalues campaigns that think on a totally different set of axes, when in fact we really want to reward the most creative content. (Look at it this way -- if you don't believe that there could be a new campaign that completely changes the way you think about what makes a good campaign, then why are you still playing campaigns?) Creating "flags" for a high art score may even make it look like we are giving "badges" for achievement in various aspects of add-on making, when really it should just be about fun IMO.

If it were totally up to me, i.e. if it were my job alone to design and make the most useful add-on guide, I would basically mimic amazon / netflix. That is, I would support simple text reviews with a 5-star quality rating as we always see (I don't think 5-star vs. 10-star vs. 100-star actually makes a big difference, although there are many valid arguments made about these so far), and additionally, I would make automated recommendations based on "here's what other people liked who liked the same things you did", as I described in an earlier post in this thread: http://forums.wesnoth.org/viewtopic.php ... 45#p568409
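To make the "liked the same things you did" part concrete, here is a rough sketch in plain C++ (the data structures, user and add-on names, and the 1-5 scale are all made up for illustration; this is not code from the Wesnoth tree): find the raters whose past ratings agree most with yours, then suggest the add-ons they rated highly that you haven't tried yet.

Code:

// Toy user-based collaborative filtering sketch. All names and the
// 1-5 rating scale are assumptions made up for this example.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, double> AddonRatings;     // add-on -> rating (1..5)
typedef std::map<std::string, AddonRatings> AllRatings; // user -> their ratings

// Cosine similarity over the add-ons both users have rated.
double similarity(const AddonRatings& a, const AddonRatings& b)
{
    double dot = 0, na = 0, nb = 0;
    for (AddonRatings::const_iterator it = a.begin(); it != a.end(); ++it) {
        AddonRatings::const_iterator jt = b.find(it->first);
        if (jt == b.end()) continue;
        dot += it->second * jt->second;
        na  += it->second * it->second;
        nb  += jt->second * jt->second;
    }
    return (na > 0 && nb > 0) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.0;
}

bool by_score(const std::pair<std::string, double>& x, const std::pair<std::string, double>& y)
{
    return x.second > y.second;
}

// Suggest add-ons the target user hasn't rated, weighted by how similar
// the people who did rate them are to the target user.
std::vector<std::pair<std::string, double> > recommend(const AllRatings& all, const std::string& target)
{
    const AddonRatings& mine = all.find(target)->second; // assumes the target user exists
    std::map<std::string, double> score, weight;
    for (AllRatings::const_iterator ut = all.begin(); ut != all.end(); ++ut) {
        if (ut->first == target) continue;
        double sim = similarity(mine, ut->second);
        if (sim <= 0) continue;
        for (AddonRatings::const_iterator at = ut->second.begin(); at != ut->second.end(); ++at) {
            if (mine.count(at->first)) continue;  // the user already has this one
            score[at->first]  += sim * at->second;
            weight[at->first] += sim;
        }
    }
    std::vector<std::pair<std::string, double> > out;
    for (std::map<std::string, double>::const_iterator st = score.begin(); st != score.end(); ++st)
        out.push_back(std::make_pair(st->first, st->second / weight[st->first]));
    std::sort(out.begin(), out.end(), by_score);
    return out;
}

int main()
{
    AllRatings r;
    r["alice"]["LotI"] = 5; r["alice"]["IftU"] = 4; r["alice"]["Five Fates"] = 2;
    r["bob"]["LotI"]   = 5; r["bob"]["IftU"]   = 5; r["bob"]["AtS"]          = 4;
    r["carol"]["Five Fates"] = 5; r["carol"]["AtS"] = 1;

    std::vector<std::pair<std::string, double> > recs = recommend(r, "alice");
    for (size_t i = 0; i < recs.size(); ++i)
        std::cout << recs[i].first << " ~ " << recs[i].second << "\n";
}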

I would be quite focused on exactly how the question we ask is worded. In particular I would not ask "rate the overall quality of this add-on"; instead I think I would ask "on a scale of 1-5, how much did you like this add-on" or perhaps "how much did you enjoy using this add-on", since that is precisely what I am hoping to automatically predict / make recommendations on the basis of. I think it will make a very big difference exactly what question you ask here, and it's worth it to think carefully. Asking users an unclear question, or asking them to "evaluate overall quality", will cause them to think not about *themselves* but about what they think others would like in the add-on / what would constitute a "fair grade" for the add-on... and lead them to make a lot of assumptions. I'd much rather they just tell me e.g. how much fun they personally had in their particular experience, and let *me* do the averaging / learning / generalizing. (I certainly don't want to ask a question like "rate the replay value of this add-on", i.e. "predict how much fun you or another person might have if they played this campaign several times.")

I think such a recommendation system would be the most helpful with the least amount of overall noise. The scores in various subareas of add-on quality may be more helpful when a user simply hasn't played any campaigns yet and doesn't have good recommendations, but honestly I think written reviews are their best bet then, or just blindly trying new things in order to see if they like them and rate them until they get good recommendations. If someone has never played wesnoth before and we have no idea what they like, I think the best advice is "try some different campaigns and see what you like"... I doubt that we could make a system that would change that, and the reviews are ultimately going to be written mostly by people who have played a lot of campaigns.

I'm not trying to say the other philosophy is wrong or bad, just to explain some perceived flaws and an alternative. Regardless of your ideas about human nature and what you think makes a good add-on, at least I hope we agree that the success criteria for the system should be roughly:

1.) Help users to easily find add-ons that they will like
2.) Don't show them noise / too much information
3.) It should be as easy as possible to make a useful review
4.) Reduce opportunities for political drama

I think the last one is a particularly good argument against the "flags" or "categories". I don't think that there should be a special "review moderator" or class of such officials. Everything should ideally be cleverly automated to the greatest extent possible, but an automated threshold for the averaged score, just hard coded in, seems pretty crude.

In fact IMO one thing that might be a very good idea that we are missing is a "was this review helpful" button which would help to promote / demote reviews.
GbDorn wrote: Ratings have a dual purpose IMO. They give as much info to the user as to the author.
That's one possibility, but I think it would be much simpler if the reviews are thought of as *mainly for the user* and forum posts in the thread for the add-on are for the author. No doubt the author will look at both, but this feature doesn't need to be all things to all people.
Last edited by iceiceice on April 6th, 2014, 4:06 pm, edited 6 times in total.
Reason: edited for spelling & word choice
GbDorn
Posts: 60
Joined: March 26th, 2014, 5:07 pm

Re: Add-ons client/server support for add-on ratings

Post by GbDorn »

To Dugi: I'm just answering the most important parts.
Dugi wrote:
I wrote:I'm against a single global score.
You did not write anything against this argument
Yes I did. "If I'm only interested in addons with good story, I don't care about addons with poor story but with a higher score because they have better graphics. The global score would be useless but some generic categories would be helpful." and "You can even have a 'global quality of addon' as a category which would be the same as your single score. The users would be able to choose which category they want to use for sorting."

The goal is to have several sorting options. Users can rate each category. The system computes score averages for each category, and users can use any of them to sort add-ons. They can even apply a threshold if they so desire.
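To be concrete about what I mean, a minimal sketch (the structures and category names are hypothetical, not anything from the actual patch): each rating carries optional per-category scores, the averages are computed per category, and the list is sorted by whichever category the user selects.

Code:

// Toy sketch of per-category averages with a user-selected sort key.
// The category names and structures are illustrative assumptions only.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Rating {
    std::map<std::string, double> scores;  // category -> score; unrated categories are simply absent
};

struct Addon {
    std::string name;
    std::vector<Rating> ratings;

    // Average of one category over the ratings that actually filled it in.
    double average(const std::string& category) const
    {
        double sum = 0; int n = 0;
        for (size_t i = 0; i < ratings.size(); ++i) {
            std::map<std::string, double>::const_iterator it = ratings[i].scores.find(category);
            if (it != ratings[i].scores.end()) { sum += it->second; ++n; }
        }
        return n ? sum / n : 0.0;
    }
};

struct ByCategory {
    std::string category;
    bool operator()(const Addon& a, const Addon& b) const
    {
        return a.average(category) > b.average(category);
    }
};

int main()
{
    std::vector<Addon> list(2);
    list[0].name = "Campaign A";
    Rating ra; ra.scores["story"] = 9; ra.scores["graphics"] = 4; ra.scores["overall"] = 7;
    list[0].ratings.push_back(ra);
    list[1].name = "Campaign B";
    Rating rb; rb.scores["story"] = 5; rb.scores["graphics"] = 9; rb.scores["overall"] = 8;
    list[1].ratings.push_back(rb);

    ByCategory cmp; cmp.category = "story";  // the user picked "story" as the sort key
    std::sort(list.begin(), list.end(), cmp);

    for (size_t i = 0; i < list.size(); ++i)
        std::cout << list[i].name << ": story " << list[i].average("story") << "\n";
}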
Dugi wrote:You didn't get my point here [about threshold]
Yes I did, I know you don't want a fixed threshold. But it can be set by the user. Or we can discard it entirely since sorting is fine.
Dugi wrote:Flawless doesn't mean it is good at anything.
Flawless means perfect. Perfect means it can't be made better.
Dugi wrote:A really good campaign isn't one without flaws, it must be interesting and original (and these can hide the flaws, I've played games that were good despite bad graphics, games that were great despite a bad engine, games that were awesome despite a lame story,...).
So you use categories for different aspects and you rate them independently? This is also what I want to be able to do with addons.
Dugi wrote:Thinking about the tags: [...] Its result is similar to your tagging idea
I never talked about "tags" like "RPG" or "dungeon". The "good graphics" tag/flag was based on thresholds which I have addressed above.


I'm fine with sorting, reviews, recommendations ... I'm just asking that sorting be done by a user-selected category (graphics, design, gameplay, originality, overall quality ... precise list to be determined) instead of a single unchangeable category (overall quality).
I'm also asking to use labels ("near flawless", "pretty good" ..., the actual labels may be customized depending on the category) instead of numbers, to homogenize category ratings and because, as you said, "Numbers are heavily subjective". Those labels should appear at least on the rating interface (maybe together with the label/number mapping), while the average in the main table could be displayed only as numbers.
If you don't implement labels, then I suggest you explicitly define on the UI which number is supposed to be the average, as I can guarantee it won't be the same for everyone.

---

To iceiceice:
Even Amazon & co use categories: for books it will be author or genre, for videogames it is mostly genre, etc.
Recommendations would be great, but I don't see how a machine learning algorithm is going to solve this when the only data you feed it is "do you like this addon?". Retrieving the other info (like "the user liked this addon because the hero is a dwarvish thunderer") seems horribly hard to implement unless you develop a system able to parse and understand written reviews, or without some human input somewhere in the chain (number of scenarios, genre ... which could be given by the developer).

Quick answers to some points:
iceiceice wrote:Creating a list of categories devalues campaigns that think on a totally different set of axes, when in fact we really want to reward the most creative content. (Look at it this way -- if you don't believe that there could be a new campaign that completely changes the way you think about what makes a good campaign, then why are you still playing campaigns?)
I don't see how the system could detect why such a campaign is original. An "originality" rating and category is quite simple on the other hand.
iceiceice wrote:Creating "flags" for a high art score may even make it look like we are giving "badges" for achievement in various aspects of add-on making, when really it should just be about fun IMO.
iceiceice wrote:I think the last one is a particularly good argument against the "flags" or "categories". I don't think that there should be a special "review moderator" or class of such officials. Everything should ideally be cleverly automated to the greatest extent possible, but an automated threshold for the averaged score, just hard coded in, seems pretty crude.
I have addressed the issues of flags/thresholds above: they could be set by the users for their own use, or sorting may be enough.
iceiceice
Posts: 1056
Joined: August 23rd, 2013, 2:10 am

Re: Add-ons client/server support for add-on ratings

Post by iceiceice »

GbDorn: As I alluded to in the earlier post, software recommendation systems are a hot topic in computer science. There are now many technical papers about them. One of my friends did serious research on this. I have been to two talks related to this and I looked at a survey paper once. There was an especially large amount of interest after the high-profile "Netflix Challenge" and the subsequent bust of that experiment a few years ago.

There's no doubt that the system that I described works for movies, TV shows, and other kinds of content, and we have every reason to believe it would work equally well for campaigns. "I think amazon works mostly by genre." I can't tell you exactly how their proprietary code works, but the papers that have been published suggest that SVD is really the most important idea here.

Here's how you might have thought of it yourself. Your task is this: there is a bunch of content and a bunch of users, some of whom like or dislike some parts of the content but haven't rated other parts. You want to help them find new content that they would like. After thinking about it for a bit, you realize (correctly) that different users respond to different factors because of their tastes, and place more or less importance on different aspects of the content. So you shouldn't score content on an absolute scale; that would be wrong. You instead "brainstorm" a list of factors you think people might care about, and try to find a way to measure these factors. What you've done here is break your task (C) into two steps (A) and (B), and introduced a new concept of factor scores. You've provided a way to get the factor scores for the add-ons (A) (just ask the users). In step (B), we want to take the factor scores and use them to find add-ons for a particular user. Actually in your system we don't do (B); the user has to do that themselves. But you could imagine that a machine could do this as well -- perhaps we could ask the user to "rank/score the factors" in personal importance, then use that to make a short list of recommendations for them. That would be a complete solution to (C).

However, a critique of this system is that we don't actually care about the factor scores themselves, we only care about doing (C) well, yet we force everyone to look at and think explicitly about the factors although we don't know how good or important they are. A "modern" (you could alternatively call it "trendy") way to think about it is that instead of "brainstorming" / "making up" the factors, we should do it scientifically, based on the assumption that many people are similar to one another. Concretely: if a factor like "story quality" is actually important to many people, then I should be able to see it in the data, in the table of ratings. That means there should be a large number of people whose "likes" are at least correlated with "story quality" ratings. If that's not the case then it's a bad factor and we should forget about it, even if it sounds good on paper. In a machine learning approach, this will happen automatically -- we just won't learn the factor if it isn't important.

Note that the concept of a "factor" changes slightly now. It's not an English description of an idea about add-ons anymore; instead it's just a function (vector) that assigns a number to each add-on. Any possible assignment of numbers could be a legitimate factor. This additional power actually costs us nothing -- we can use singular value decomposition to automatically find the, say, 5 or 6 factors that best explain e.g. 95% of the observations. Since the factors aren't English anymore, we don't show them to the user of course, but that's actually a good thing: it makes the system simpler and rely on fewer assumptions, while achieving (C) more successfully. Actually we don't even compute the factor vectors explicitly, although we could -- in the system I described they are rather implicit in the low-rank approximator matrix.
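To illustrate the idea, here is a small sketch using the Eigen library (the tiny ratings matrix, the rank k and the crude mean-filling of unrated entries are assumptions for the example; a real system would treat missing entries much more carefully): keep only the strongest k singular vectors and read predicted ratings out of the low-rank approximation.

Code:

// Sketch of implicit "factors" via a truncated SVD (uses the Eigen library).
// The tiny ratings matrix, the rank k and the crude mean-filling of unrated
// entries are assumptions for illustration, not part of any real Wesnoth code.
#include <Eigen/Dense>
#include <iostream>

int main()
{
    // Rows = users, columns = add-ons. 0 marks "not rated yet".
    Eigen::MatrixXd R(4, 3);
    R << 5, 4, 0,
         4, 5, 1,
         0, 1, 5,
         1, 0, 4;

    // Crudely replace unknown entries by the column mean so the SVD has
    // something to work with (real systems do this much more carefully).
    for (int j = 0; j < R.cols(); ++j) {
        double sum = 0; int n = 0;
        for (int i = 0; i < R.rows(); ++i)
            if (R(i, j) > 0) { sum += R(i, j); ++n; }
        const double mean = n ? sum / n : 0.0;
        for (int i = 0; i < R.rows(); ++i)
            if (R(i, j) == 0) R(i, j) = mean;
    }

    // Keep only the k strongest implicit factors.
    const int k = 2;
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(R, Eigen::ComputeThinU | Eigen::ComputeThinV);
    Eigen::MatrixXd approx = svd.matrixU().leftCols(k)
                           * svd.singularValues().head(k).asDiagonal()
                           * svd.matrixV().leftCols(k).transpose();

    // approx(i, j) is the predicted rating of add-on j for user i, including
    // the add-ons user i never rated - those entries are the recommendations.
    std::cout << "low-rank prediction:\n" << approx << "\n";
}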
GbDorn wrote: Retrieving the other info (like "the user liked this addon because the hero is a dwarvish thunderer") seems horribly hard to implement unless you develop a system able to parse and understand written reviews, or without some human input somewhere in the chain (number of scenarios, genre ... which could be given by the developer).
Actually the system I described will do this automatically. It will optimize efficiently over *all* possible choices of a factor, meaning any possible vector. This of course includes the "contains a dwarf thunderer" criteria, or any variation on this.
GbDorn wrote: I don't see how the system could detect why such a campaign is original. An "originality" rating and category is quite simple on the other hand.
That's not what I'm trying to talk about. I'm talking about, what if someone made a campaign that changed how we think about the old factors and e.g. set a new bar for art / story telling. That would render all your old factor ratings obsolete. The system I described would automatically update the factors silently as new data comes in -- we would actually be able to scientifically see a paradigm shift occur when the principal component vectors make a large shift.

If you tried to observe a "paradigm shift" in e.g. physics by asking readers of papers to rate the originality of the work at the time that they read the paper, I seriously doubt that e.g. it would uniquely identify the papers of Einstein on relativity as a major paradigm shift, although they undoubtedly were. Originality can't be boiled down to a number and sometimes people might think something is original when they first look at it but it doesn't stand the test of time. Getting off on a tangent here, but I'm just pointing out one reason why a machine-learning based system is thought to be more modern -- it's just inherently more flexible and may perform better without needing maintenance, even when the character of the data and the problem change suddenly.

Edit: Just to clarify: I'm just explaining how I would do it if it were all up to me. Wesnoth is a do-ocracy, and Dugi can do it however he likes of course especially if we've finally approached a consensus on this long running issue. Most people who have posted seem to want to rate explicit factors and show these, and that is fine of course, and we might even use a hybrid system later, but as I've written before there's no need to complicate the task at hand right now. But I just want to put an alternative view out there, and also at some point I become more seriously opposed to having like 10 or 12 factors displayed and voted on, as well as a ton of unnecessary labels and badges and such, because I think at some point it would be obviously harming the feature and making it ineffective. I want to make the point that you can probably ask the user for just one number in total, and still have a very effective review / recommendation system.
Dugi
Posts: 4961
Joined: July 22nd, 2010, 10:29 am
Location: Carpathian Mountains
Contact:

Re: Add-ons client/server support for add-on ratings

Post by Dugi »

GbDorn, as iceiceice said, what you need to do is persuade me, not just reply with something that makes some sense.
GbDorn wrote:Yes I did. [rest about the argument against an absolute rating]
That isn't an argument for why there shouldn't be an absolute rating.

You're suggesting to make the add-ons list confuse new players by asking them what kinds of add-ons they like (because mainline campaigns are in many ways similar, they'll have no idea what you mean), which will prevent them from finding many add-ons they would like. A player cannot know in advance what bad graphics means if they have only seen the core units, which look pretty good, and only a very few campaign-specific units; a player cannot know in advance what RPG means in the scope of Wesnoth campaigns, because there's no such thing in mainline. And it brings back the threshold problem, the problem of categorising something whose variety keeps growing, and the jury problem I mentioned before and you couldn't solve.
GbDorn wrote:Yes I did, I know you don't want a fixed threshold. But it can be set by the user. Or we can discard it entirely since sorting is fine.
How can a user set a threshold? Ask the hypothetical jury to look at the add-ons and tag them just for him? We aren't showing our add-ons system to a president.
GbDorn wrote:Flawless means perfect. Perfect means it can't be made better.
Are you sure this is the case? If an add-on can't be made better, it just means that there are no flaws, and that usually means that the author played it safe, with no originality. If something is actually original, unique and attractive to others thanks to its innovations, it is very likely to have some other issues. And I don't like to call unoriginality a flaw, but it can make things far from perfect.

For example, I have seen a review of a Call of Duty game. As usual: good graphics, a storyline full of action, fancy explosions, a setting in a horrible war as it should be... it looks like there is no flaw, but the real problem is that it was a war story that didn't essentially differ from a hundred other first-person shooters' stories. The weapons were the same. The health and ammo system was the same as in all Call of Duty games. The 'this was good, let's make something similar, that's a formula for victory' strategy creates flawless stuff... and thus completely unoriginal stuff.
GbDorn wrote:So you use categories for different aspects and you rate them independently? This is also what I want to be able to do with addons.
Where did I write that? As iceiceice said, new aspects can appear at any time, and you can easily end up with a hundred categories that just confuse everyone. For example, some time ago I created an add-on that removes all randomness from combat, both in single player and multiplayer - can there be a category for something that is new and isn't like anything created before?
GbDorn wrote:I never talked about "tags" like "RPG" or "dungeon". The "good graphics" tag/flag was based on thresholds which I have addressed above.
What did you say about the threshold problem? Hm, nothing; obviously you have no solution here and you're trying to conceal it. This also brings up the subcategory problem: good sprites and bad portraits will be seen as bad graphics by some people, while others will not mind the bad portraits and will care about the good sprites. The same goes for baseframes and animations: I think that absent animations are an eyesore and I tend to ignore the quality of the baseframes if the animations are missing, while other people don't care about animations at all.
Other categories are equally uncertain and unspecific. Good gameplay may mean maximum balance in a scenario, it may mean interesting objectives, it may mean interesting units... depending on the player. The other categories have their issues as well.

It is a fallacy to claim that because numbers are subjective, 'near flawless, pretty good, slightly above average, average, slightly below average, pretty bad, catastrophic' is better than 'one, two, three, four, five, six, seven'. Numbers expressed through words assigned to them are just as subjective as plain numbers. And numbers can also be partially regulated by letting statistics, namely the download count and the total time spent playing, affect them.

Looking back at your posts, you aren't trying to persuade me; you're just elaborating further on an idea I disagree with and picking at small inconsistencies in my replies. You aren't a member of an irrationalist movement (creationists, anti-vaccine or anti-chemtrails activists, Jehovah's Witnesses, Scientology) trying to brainwash people; you are trying to persuade me (because I am certain that, no matter what you think about me, you will not write any code yourself). To persuade me, counter my arguments. If you won't, I will not reply and will let that speak for itself.

This is what I want to read from you:
1. A list of categories that would cover everybody's desires or a suitable alternative
2. A solution to the threshold problem, that is, how to avoid frustrating people whose add-ons would end up just below a threshold, for example the threshold for being rated as an add-on with a good story
3. A solution to the jury problem, that is who would tag add-ons into these categories objectively and reliably
4. Why it is good to replace the numbers from 1 to 10 with words assigned to each number, if they mean the same thing but can't express anything in between, and it isn't clear whether flawless is better than awesome
5. Why the rating system seen in many programs and on many websites should be bad in this case. Don't repeat the fallacy that they also have categories - the housing, electronics, DIY, appliances, books and computers categories you see on websites like Amazon are the equivalent of Wesnoth's campaign, era, map pack and MP modification categories
Last edited by Dugi on April 9th, 2014, 6:12 pm, edited 1 time in total.
iceiceice
Posts: 1056
Joined: August 23rd, 2013, 2:10 am

Re: Add-ons client/server support for add-on ratings

Post by iceiceice »

Okay... Dugi, don't overplay your hand here.

There's no need to insult GbDorn even if you think you are talking past each other or he isn't responding to your points. That only hurts your credibility. If you think someone is really not being reasonable then others will see that. You should just be focused on building a consensus to support this change you have been working on, and on maintaining the progress you have made. Obviously there are a lot of people interested in this feature or we wouldn't have been debating this for more than a month now. But I do think there has been much progress towards a consensus since we began, although there are still some parts not everyone agrees on.

You deserve credit for doing the work that you have done so far and the code you have written, and Do-ocracy of course means you get to do pretty much as you please on the parts of the patch that aren't really controversial. But Do-ocracy gives even more power to the dedicated maintainer of the campaign manager code, in this case shadowm. If he doesn't feel confident he or someone else can easily maintain the code if you went on an extended break, or fears that intractable bugs may be introduced in a mission-critical feature like the add-on manager, then it probably won't get merged. If he fears the patch is controversial enough that it could lead to bad feelings or an edit war, then it probably won't get merged. So you don't have even a smidgen of the clout that Jetrel did in your last exchange with him. (And let's please not discuss that further.)

Just stay focused on what you were doing and keep trying to negotiate a consensus. Don't get mad and write things that are going to hurt your cause. I personally am quite hopeful that wesnoth may get this feature in some form eventually :roll:
GbDorn
Posts: 60
Joined: March 26th, 2014, 5:07 pm

Re: Add-ons client/server support for add-on ratings

Post by GbDorn »

Dugi, I'm getting tired of your baseless personal attacks and your illogical questions. You've contradicted yourself repeatedly and you distort what I say. You're not even trying to understand my posts even when I'm answering you honestly. Actually it seems you don't even read them, nor your own for that matter.
I'm terribly sorry it has come to this but I don't see the point of continuing this discussion with you any longer.

---
iceiceice wrote:Most people who have posted seem to want to rate explicit factors and show these, and that is fine of course, and we might even use a hybrid system later
Well I hope other devs agree with you on this.
There is no denying that the proposal is an improvement on the current situation. I just feel we could make it better. If it's not possible to patch it later, then it will be a (somewhat) lost opportunity IMO.
Also I still have concerns about non-English speakers like I said earlier. We shouldn't make a bad design decision about a future feature because of a problem in the current state of translations.


Anyway, after giving your post another read I guess machine learning could work given a sufficiently large dataset but we're not Netflix. This will be the key. Do you have any estimation of the Wesnoth userbase for UMC? How many addons does the average user download? How many for the top downloaders? Those are the most important questions in the end.
tekelili
Posts: 1039
Joined: August 19th, 2009, 9:28 pm

Re: Add-ons client/server support for add-on ratings

Post by tekelili »

Sorry for "double post", my lack of Englisk makes me condensate my ideas and I felt they were incomplete.

I really don't like offering extra "fixed fields" for feedback by default, beyond the global score. In most cases, a lot of the fields wouldn't make any sense at all. If a campaign is played with the default era, what sense is there in rating its image quality? We could still consider evaluating custom images... but who deserves the worse score: an author who, knowing his lack of skill, didn't add any custom images, or an unskilled author who added mediocre custom images with a lot of work? In other cases evaluating the plot would also be unfair. Of course there is no point in evaluating the plot of survivals, but even for some campaigns like World Conquest it would be unfair to evaluate the plot (which is probably the worst ever in a campaign), when it is an add-on that has a "decent plot for what it is designed to be". I could think of a lot more examples where fields would be irrelevant or unfair, and I don't see why it is a bad solution, if an add-on needs evaluation in extra fields, to just offer the chance of a review so the user can give feedback about what really needs feedback. Just my 2 cents.
Be aware that English is not my first language and I may have explained myself badly, using wrong or simply invented words.
World Conquest II
Dugi
Posts: 4961
Joined: July 22nd, 2010, 10:29 am
Location: Carpathian Mountains
Contact:

Re: Add-ons client/server support for add-on ratings

Post by Dugi »

iceiceice wrote:In fact IMO one thing that might be a very good idea that we are missing is a "was this review helpful" button which would help to promote / demote reviews.
The actual code has some sort of 'was this review helpful' button, it's just called differently and doesn't allow demoting reviews, only promoting (each additional function means a lot of extra data to store, and the other functions already store tons of stuff). Just look at the last screenshot I posted.
tekelili wrote:I really don't like offering extra "fixed fields" for feedback by default, beyond the global score.
The current reviews function provides text fields for these, but they can be left blank if inapplicable (story for an MP modification, for example), in which case they will not be shown at all (I know this allows writing everything in the overall field, but there are better ways to write a review badly).
iceiceice
Posts: 1056
Joined: August 23rd, 2013, 2:10 am

Re: Add-ons client/server support for add-on ratings

Post by iceiceice »

Dugi wrote: The actual code has some sort of 'was this review helpful' button, it's just called differently and doesn't allow demoting reviews, only promoting (each additional function means a lot of extra data to store, and the other functions already store tons of stuff). Just look at the last screenshot I posted.
I see... I guess I missed it before.

So one thing this makes me think about now: what is supposed to happen if people write inappropriate reviews?
I guess that no one has written anything like that on the wiki, but it's a bit different there.
The idea for the new feature is that anyone who has the game should be able to easily write reviews, and we aren't forcing them to log into a forum account, so except for an IP they are anonymous.

If they have to log into a forum account, it will cut down on this behavior just by making them less anonymous, and also mean they could be reported through the existing mechanisms I guess.

If there is a demote mechanism, at least garbage reviews can be demoted and buried, and potentially flagged this way.

Someone with more experience than me in such things would be able to make a better guess, but I would imagine that having at least *one* of these things would help with this problem. Maybe there's a better solution, or it's not that big an issue... any thoughts?
GbDorn wrote: Anyway, after giving your post another read I guess machine learning could work given a sufficiently large dataset but we're not Netflix. This will be the key. Do you have any estimation of the Wesnoth userbase for UMC? How many addons does the average user download? How many for the top downloaders? Those are the most important questions in the end.
So I've never actually built or worked with one of these systems. The keyword is "collaborative filtering". You can read about the Netflix prize here: http://en.wikipedia.org/wiki/Netflix_Prize, and some of the different techniques different people tried.

In the sample they used for this, there were 500,000 users rating 18,000 movies, the data containing 100 million ratings. They report that even the naive algorithms that Netflix originally tried got a "root mean square error" of ~1 in a 5-star rating system. So most of the time they could accurately predict any person's rating of any movie to within 1 star. The goal of the contest was to make a 10% improvement. As I understand it, the challenge is not that it is extremely hard to predict ratings at all; it is that when you have hundreds of millions of data items, it is too expensive to do SVD and many other sophisticated things, so people have to try other, cheaper things and get them to work somehow. In the years since, there have been many papers published on how to do a cheap "approximate SVD" for example, with collaborative filtering cited as a motivation.
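For reference, "root mean square error" here is just the square root of the average squared gap between the predicted rating and the actual one, taken over a held-out set T of (user, movie) pairs:

\mathrm{RMSE} = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} \left( \hat{r}_{ui} - r_{ui} \right)^{2} }

where \hat{r}_{ui} is the predicted rating of movie i by user u and r_{ui} is the rating they actually gave.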

I made a little script that adds up the download numbers on addons.wesnoth.org. It says there are 519 addons on the 1.10 server right now, with a total of 3059390 downloads. I have no idea how many users there are downloading things. I would guess that this is also about 30 times less, so maybe we have between 10k and 20k users. (Does anyone know the real numbers?)

My guess is that you could probably compute the SVD of a 20,000-by-500 matrix directly in matlab, even on a machine from a few years ago. So it's probably not an issue even if we are doing it in C++ / asking a library to do it. My guess would be that what you would actually do is compute the SVD on the server once a day -- the SVD matrix can be communicated very cheaply because it is known to have low rank. Once the client has the daily SVD it can use it to make recommendations, even if the user changes their ratings and downvotes something they liked before. The server can just process what their updates mean for other users later.

Edit: I just installed octave to test this... it found the SVD of a random 10,000-by-500 matrix in about 5 minutes. The 20,000-by-500 one took longer than that, but I didn't pay close attention, sorry. The point is, this is probably affordable, and especially if someone tried to think of ways to actually do it quickly. Note that this was a random *dense* matrix which is usually much harder to work with than a sparse matrix.
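If someone wants to repeat that test in C++ rather than octave, something like the following should work (this assumes the Eigen library is available; timings will of course vary by machine and by which SVD implementation Eigen uses):

Code:

// Rough benchmark mirroring the octave test above: thin SVD of a random
// dense 10000-by-500 matrix. Assumes the Eigen library; newer Eigen
// versions also offer faster SVD implementations (e.g. BDCSVD).
#include <Eigen/Dense>
#include <ctime>
#include <iostream>

int main()
{
    Eigen::MatrixXd M = Eigen::MatrixXd::Random(10000, 500);

    std::clock_t start = std::clock();
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(M, Eigen::ComputeThinU | Eigen::ComputeThinV);
    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;

    std::cout << "largest singular value: " << svd.singularValues()(0) << "\n"
              << "SVD took about " << seconds << " s of CPU time\n";
}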
Dugi
Posts: 4961
Joined: July 22nd, 2010, 10:29 am
Location: Carpathian Mountains
Contact:

Re: Add-ons client/server support for add-on ratings

Post by Dugi »

iceiceice wrote:So one thing this makes me think about now: what is supposed to happen if people write inappropriate reviews?
Most websites that use this kind of rating don't require logging in, and I guess that their method of avoiding spam there is based on users' rating - they just won't mark spam and crap as helpful, and it will be buried behind the more helpful ones. I think that troll posts are a bigger problem than spam, because you don't download a game to spam (and an automated bot surely wouldn't do that), and no website links work there.

I have already seen some fake add-ons on the add-on server; once it was a couple of campaigns that didn't work at all and had only one scenario file (although their descriptions described them as something much better). They were reported on the forums. I have also seen half a dozen pseudo-add-ons there, with obscene names and probably no effect, and they were deleted quite soon. There is some moderation on the add-ons server.

I think that a login system would not fully prevent troll posts and would significantly decrease the amount of feedback, but it might allow automatic recommendations based on the user's ratings of add-ons, and maybe writing reviews could require it. I think that we might do this later (I am not sure how the automatic recommendations would be done, maybe because of my very limited knowledge of statistics; I know how to evaluate and interpolate experimental results in physics, but this is a completely different kind of statistics).
iceiceice
Posts: 1056
Joined: August 23rd, 2013, 2:10 am

Re: Add-ons client/server support for add-on ratings

Post by iceiceice »

Dugi:

I looked briefly at the code of your pull request. I have a few questions about one section, the part where you compute a "general rating":

https://github.com/Dugy/wesnoth/commit/ ... d2d113e8a1

src/addon/info.cpp
Spoiler:
1.) Why do you take the log of the hours played per year ratio and not the downloads_per_year ratio?

2.) You seem to have tuned these exponentials in a pretty specific way. Why do you get nearly 20 points (max of 25) for having say 600 downloads per year? Why do you get nearly 20 points (max of 25) for having say 10000 hours played per year? How did you pick these numbers?

Also, this means that together these two factors are ten times more important than the average user rating, since you are dividing the user rating by two above. How did you decide that this would be a good idea?

If an add-on is used by many people and they all like it, shouldn't it be recommended to users, even if it is not actively being developed / reuploaded anymore? This system seems to place outsize importance on having constant updates and redownloads from the users.

3.) I don't think it's a good idea to display the general rating number, only to sort by its value if it is used at all, since it doesn't really mean anything to the user. The user is likely to assume wrongly that it reflects the average user rating.

4.) It seems that you don't inform the user that you are tracking their "hours played" number for all add-ons and sending it back to the server unencrypted. That seems like something you should state up front in terms of privacy before you do it. You should allow people to opt out of tracking like that with a preference as well.

5.) It seems that it would be incredibly easy to cheat this rating system because you allow a single user's info to have an enormous impact on the rating. For instance, if a single user just keeps clicking "download" on one add-on, their downloads number will push the general rating through the roof. Even more simply, a person could just leave a wesnoth client open on the campaign for days or weeks at a time and periodically close it to get updates to the server. Even more simply, they could adjust the time stamps in the tracking files you are using to make wesnoth think it had been playing an add-on for a very long time. A system where you use likes on a scale of 1-10, one for each user, simply doesn't have this problem, each person only has "one vote" as it were. Perhaps the add-on should only get credit for "distinct user downloads", and have a cap on how many hours a single person can contribute. Anyways it seems like perhaps some aspects of this should be rethought.

I wrote before that I wasn't too concerned about cheaters, but that depends on how easy cheating is. It looks like all these stats can be edited directly in the preferences file.

6.) Looking at how the campaign server handles the info submitted from clients:

https://github.com/Dugy/wesnoth/commit/ ... cc797a3114
src/campaign_server/campaign_server.cpp
Spoiler:
Hmm, so the server will look at any gameplay data you upload, and if it is over 1/30th of the time the add-on has been played, then an error "Somebody tried to upload too many gameplay hours for add-on: " is reported. Then you also nuke their hours_played number, replacing it with the time elapsed since the last time stamp, divided by 30...

Okay, but what if I upload large amounts of time played to someone else's add-on, to nuke their rating?

In open source software you cannot use a bunch of obscure hacks to achieve security or deter cheaters; you have to do things properly, because people will just look at your code and your commit history if they want to. Your deterrent code is actually creating a backdoor for people to manipulate the ratings in a huge way. You need to make it very difficult for one user to have an outsized impact on the ratings of their own or anyone else's add-ons, no matter what input they deliver to you.
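To illustrate the kind of thing I mean in point 5, a sketch only (the identifiers, the caps and the idea of keying the statistics by distinct clients are my own assumptions, not the patch's actual design): the server clamps what any single client can contribute, so garbage input from one user stays bounded no matter what they send.

Code:

// Sketch: bound the influence any single client can have on an add-on's
// statistics. The identifiers and limits are invented for illustration.
#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>

struct AddonStats {
    std::set<std::string> distinct_downloaders;     // each client counted once
    std::map<std::string, double> hours_by_client;  // capped per client
};

const double MAX_HOURS_PER_CLIENT = 100.0;          // arbitrary illustrative cap

void record_download(AddonStats& s, const std::string& client_id)
{
    // Clicking "download" a thousand times still counts as one download.
    s.distinct_downloaders.insert(client_id);
}

void record_hours(AddonStats& s, const std::string& client_id, double reported_hours)
{
    if (reported_hours < 0) return;                 // ignore nonsense outright
    double& h = s.hours_by_client[client_id];
    h = std::min(MAX_HOURS_PER_CLIENT, h + reported_hours);
}

double total_hours(const AddonStats& s)
{
    double t = 0;
    for (std::map<std::string, double>::const_iterator it = s.hours_by_client.begin();
         it != s.hours_by_client.end(); ++it)
        t += it->second;
    return t;
}

int main()
{
    AddonStats stats;
    record_download(stats, "client-A");
    record_download(stats, "client-A");     // still one distinct download
    record_hours(stats, "client-A", 1e9);   // absurd value, clamped to the cap
    record_hours(stats, "client-B", 12.5);
    std::cout << "downloads: " << stats.distinct_downloaders.size()
              << ", hours: " << total_hours(stats) << "\n";
}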
Dugi
Posts: 4961
Joined: July 22nd, 2010, 10:29 am
Location: Carpathian Mountains
Contact:

Re: Add-ons client/server support for add-on ratings

Post by Dugi »

1) I tried to make a wild guess at the maximum, usual and minimum hours played and downloads for add-ons. Then I made up functions that give reasonable numbers (not 0.3, not 9.8) for all expectable values. The logarithm fitted one and didn't fit the other; I'm attaching the table I based it on.
Attachment: Untitled 1.ods
2) As I said in point 1, I just put in what gave meaningful numbers - expectable maxima giving something over 9, expectable minima giving something below 2, and the values in between being distributed quite reasonably (a toy illustration of this kind of saturating curve is sketched below, after point 6).

3) That number is calculated from two values that should be objective to some extent and one value that is subjective. So it is some sort of user rating.

4) I think it says somewhere in Wesnoth that the game collects some data about gameplay and uploads it.

5) The download count was abused in the past, and the impact of this abuse will just be less significant with this patch (though the only person I know of who did it was franz_mp, and he got lazy doing it and started making new versions instead... and then failed totally when updates stopped bumping the count). When hours_played values are uploaded, the server will decrease the times if they are absurdly high (more than 10% of the time since the last upload, or a certain fixed time if no statistics from that IP address were recorded before), so it basically does something like what you suggested.

6) It should also cut overly high gameplay times from users who haven't published the add-on, or am I mistaken?
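To give an idea of the shape I meant in points 1 and 2 (a toy illustration only, with invented constants and weights, not the actual code or numbers from the pull request): each statistic goes through a curve that saturates, so expected minima land low, expected maxima land high, and extreme values cannot explode; the averaged user rating is then mixed in.

Code:

// Toy illustration of a saturating "general rating" combination. The curve
// shapes, constants and weights are invented for this example and are NOT
// the values used in the actual patch.
#include <cmath>
#include <iostream>

// Exponential saturation: 0 at x = 0, approaches 'scale' as x grows.
double saturate(double x, double growth, double scale)
{
    return scale * (1.0 - std::exp(-x / growth));
}

// downloads_per_year and hours_per_year are the collected statistics,
// avg_user_rating is the averaged subjective score on a 1..10 scale.
double general_rating(double downloads_per_year, double hours_per_year, double avg_user_rating)
{
    // Downloads saturate directly; hours are compressed with a logarithm
    // first because their spread between add-ons is much wider.
    const double d = saturate(downloads_per_year, 300.0, 10.0);
    const double h = saturate(std::log(1.0 + hours_per_year), 4.0, 10.0);
    return 0.4 * d + 0.4 * h + 0.2 * avg_user_rating;   // still in the 0..10 range
}

int main()
{
    std::cout << "small add-on:   " << general_rating(20, 50, 6) << "\n"
              << "popular add-on: " << general_rating(2000, 20000, 8) << "\n";
}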

Generally, about cheating here: there is no material profit from cheating. It merely advertises a free product. An add-on is a donation to the players' community, and advertising it is like saying 'look, I contributed more to charity than you'. Only children and morons will cheat in order to advertise their add-ons (and people who consider their work so awesome that everybody has to try it, but those will read the reviews and learn the truth). But creating an add-on needs systematic work, and children and trolls aren't quite fond of that. In order to cheat, they would need to find the source code, understand it and find its weakness (and with the trial-and-error method, it would be really annoying because failures aren't easy to see). Most users don't even know where to find the preferences file. That is why I expect the number of cheaters here to be minimal, if any.