On Review Scoring Systems

I have published many reviews at The New Leaf Journal, most notably of visual novels and anime series. Readers may have noticed that I have not, to date, published numerical scores or rating labels, opting instead to explain my impression in essay format. There are a number of reasons I have not employed review scores for the kinds of reviews that I have been writing. For example, a significant goal of my reviews of 2005-2008 English translations of doujin visual novels from Japan was simply to introduce readers to an interesting bit of visual novel localization history. But there is another reason – I have some general issues with review scoring systems, at least as they are most often implemented.

The two most popular scoring systems in my estimation are 10-point scales and Michelin-style 5-point scales. I think that both can be used effectively with proper foundation – but each comes with a set of problems.

Table of Contents

10-Point Scales vs 5-Point Scales
A Counter Argument For 10 Points
My Argument: The Issue is Defining Ratings
Rating Systems Are Not Always Necessary

10-Point Scales vs 5-Point Scales

Eurogamer is a video game review website that has had an interesting rating journey. Editor Tom Phillips explained in a post on the third iteration of Eurogamer’s rating system that Eurogamer had started with a 10-point system before switching to a badge system wherein the reviewer would indicate whether he or she recommended a game with different terms attached to what was effectively a 4-point scale. After finding that many readers did not think the badge system was intuitive, Eurogamer shifted to a 5-point system. Why did Eurogamer not return to a 10-point system? Mr. Phillips explained:

It’s been eight years since Eurogamer ditched the 10-point review scale and we’re not going back to that – the argument then that it felt broken through overuse of its upper half still stands.

I was originally inspired to write this article by a blog post by Wouter Goreneveld on the Eurogamer news. He added some detail to one tendency in 10-point systems, referencing a board game scoring community called BoardGameGeek which uses a 10-point scale:

If you’re inclined to buy something but you’re not sure, you check out BGG’s average score. Guess which ones to watch out for? Indeed, 7+ = good, 8+ = amazing, 6-ish = meh. There’s never any 9, 10, and almost never anything below 6 (or 5).

I agree that grade inflation is a problem with 10-point systems. In theory, 5 should be average. That is, a 5 should not be bad but instead something that the reviewer thinks is decent and warrants a qualified recommendation. But 5 tends to come off as sub-par instead of par, and in many cases average is really a 6 or 7.

Mr. Goreneveld and some of the articles I came across from following links in his post expressed support for the idea of a 5-point system. Mr. Goreneveld had supported the 5-point scale in an earlier 2022 blog post with the caveat that he thought of the numbers as representing labels (close to Eurogamer’s prior idea), which is in line with Goodreads’ rating guide. He explained that in this system, a score of 2/5 represents an assessment that the subject of the review was “OK” instead of “bad.” However, Mr. Goreneveld noted that 1-5 systems also have a downside:

Sometimes … the difference between a 3 and 4, a 2 and 3, or even a 4 and 5 is confusing or hard to pinpoint.

Where 10-point scales have an issue of under-using the lower half of the scale (if not everything below 6 or 7), one can object that 5-point scales do not provide enough notches to distinguish review subjects from one another. Mr. Goreneveld cited to a blog post on The Tao of Gaming wherein the author explained using a four-point scale for board games and stated that – with 1 being an “avoid” recommendation, 2 being an expression of indifference, 3 being a suggestion, and 4 being an enthusiastic suggestion – he had no difficulty classifying board games, whereas he would have difficulty deciding between a 6 or 7 on a 10-point scale. Moreover, he provided a template for translating his 4-point scale to 1-10 ratings, which were required for a board game forum community.

A Counter Argument For 10 Points

Having cited to several articles that supported 5-point scales (or similar) over 10-point scales, I will present a counter-argument in favor of 10-point scales. The Nihon Review was an anime review and forum site that was mostly active from 2004-2016. It is no longer online, but it survives in the Internet Archive’s Wayback Machine. Nihon Review had a number of anime reviewers (the roster changed over time), and they used a 0-10 scoring system. One thing I appreciated about the site was that it was cognizant of the need to explain what different scores meant. See the last full capture of its former rating scale page.

As you can see in the link, Nihon Review attempted to provide a clear explanation of what each score meant. For example, 10 represented “as good as it gets,” but Nihon Review distinguished “as good as it gets” from perfect. 5 represented average, which Nihon Review described as “absolute mediocrity” with “a good balance between what’s ‘acceptable’ and ‘not acceptable.’” Nihon Review also made a good effort to define application of the lower end of the scale, which was something that The Tao of Gaming strongly critiqued in the case of 10-point systems (arguing there was little need for a large number of low-end scores). Notably, Nihon Review described a 3 as being poor but having “things that could be liked… kind of.” While I do not think Nihon Review’s definitions fully justified the distinctions between low scores, it made more of an effort to define them than most.

While I like Nihon Review’s 10-point system in theory, I was not sure how it worked in practice. That is, did it address some of the strikes against 10-point systems noted in the essays I discussed in the previous sections? I did a quick survey to find the score distribution (see all anime reviews):

A scatterplot showing the anime review score distribution of the now defunct anime review site, The Nihon Review. — Chart created by N.A. Ferrell in LibreOffice Calc.

Note, before continuing, that it is possible there are some small tabulation errors – but the tally should be almost perfectly accurate. As you can see, even with The Nihon Review’s best efforts, the distribution still comes with some of the issues highlighted in the above essays. The most common score was 7, with scores generally becoming less common as you move further above or below 7. However, setting aside Nihon Review’s admirable restraint in giving out 10s, more anime scored either an 8 or 9 than 5 or 6, making the overall thrust of the scores a bit top-heavy. But not all was lost – Nihon Review had more 4s than 3s, more 3s than 2s, and more 2s than 1s, which suggests that the reviewers did carefully try to apply the rating system on the low end and make distinctions among anime series they considered generally sub-par. This is particularly impressive in light of the number of people who wrote reviews for The Nihon Review over the course of its run.
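For readers who want to run the same kind of survey on another review site, the tabulation can be sketched in a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the sample scores are made up for illustration. They are not The Nihon Review's actual tallies.

```python
from collections import Counter


def score_distribution(scores, low=0, high=10):
    """Count how many reviews received each score on a low..high scale."""
    counts = Counter(scores)
    # Include every step of the scale, even steps that were never used.
    return {step: counts.get(step, 0) for step in range(low, high + 1)}


def summarize(scores):
    """Report the most common score and compare the 8-9 band with the 5-6 band."""
    dist = score_distribution(scores)
    return {
        "mode": max(dist, key=dist.get),
        "count_8_or_9": dist[8] + dist[9],
        "count_5_or_6": dist[5] + dist[6],
    }


# Made-up sample for illustration only, NOT real review data.
sample_scores = [7, 7, 7, 8, 8, 9, 6, 5, 4, 3, 7, 8, 6, 2, 10]
summary = summarize(sample_scores)
print(summary)
```

If the count for 8s and 9s exceeds the count for 5s and 6s, as it does in this invented sample, the distribution is top-heavy in the sense described above.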

One reason rating scales may tilt higher is that, outside of cases where someone is forced to review something, discretion naturally leads one to be more inclined to review things he or she at least finds tolerable rather than things he or she knows or anticipates will be unpleasant. I do not know exactly how Nihon Review decided who reviewed what, but I would not be surprised if, to the extent reviewers could make requests to handle specific reviews, they were generally predisposed to want to review things they found somewhat interesting since a review necessarily requires watching something to the end. While it may sometimes be fun to write a scathing review (Nihon definitely has some of those), it is less fun to have to actually watch a “1” anime series than to finally get to unload on it. Again – I do not know for certain how much this tendency existed at Nihon Review, but it is the sort of thing that could push the average reviews up to a 7.

My Argument: The Issue is Defining Ratings

I do not think there is a universal answer to which rating system is the best. For example, I think that 10-point systems and 5-point systems serve different purposes. A well-implemented 10-point (or even more elaborate) system makes sense if one’s goal is to make fine-grained distinctions among different subjects. A well-ordered 5-point system (or simpler) makes sense if the goal is to simply deliver a recommendation or express one’s own general enjoyment or lack thereof. Rather than choosing a rating system in the abstract, one should consider what is to be rated and what his or her objectives are in sharing ratings.

(I also think review systems should note what, precisely, they are scoring. For example, reviewers may emphasize fun more than technical merits or vice versa. An anime review site may be more interested in animation quality and production than anime stories, or vice versa. Some of this can also come through in the sort of review lists I advocate below. These sorts of definitions are more important in scenarios where a set of media can be reviewed based on the same criteria – for example a site covering only kinetic, non-interactive visual novels would be able to engage in formulating a granular, category-specific rating in a way that a more general game or visual novel review project would not. I add this note to highlight that more can go into a review system than what I discuss here.)

Regardless of the rating system, however, it is important to define how the reviewer or review publication understands its own ratings. For example, THEM Anime Reviews 4.0 is one of the longest-running English anime review sites, tracing its history back to the 1990s. Like The Nihon Review, it has multiple reviewers and has had many reviewers come and go over the years. (For whatever it is worth, I generally found The Nihon Review more useful in the early 2010s). Unlike Nihon, THEM uses a simple five-star scale.

I recently read the THEM review of The Angel Next Door Spoils Me Rotten, which I reviewed in 7,900+ words. The reviewer, Allen Moody, gave Angel Next Door three stars out of five. Despite having described Angel Next Door as a borderline candidate for my top-six series of 2023, three stars sounds about right to me. However, Mr. Moody lodged several complaints against Angel Next Door and the purported misogynistic tendencies of the main protagonist with which I do not agree. The parameters of my disagreement are beyond the scope of this article and, if one reads both my review and Mr. Moody’s, the reason should be generally clear. In general, the thrust of my disagreement is that I do not think the series endorsed some of the protagonist’s sentiments in the way Mr. Moody (as I read the review) suggests. To the contrary, in fact, I think one of the main points of the series was the protagonist being forced to get over himself.

But I digress.

For the purpose of this article, however, what struck me is that if I did agree with Mr. Moody’s assessment of Angel Next Door, I would have probably been disinclined to give the series three stars since I think the final assessment of the story depends very much on whether one thinks that the two main characters and their relationship were written coherently and convincingly. Although I do not have a fine-tuned five-point scale, agreeing with Mr. Moody would give me more of a two-star feel. Thus, the interesting point here is less my disagreements with the review and more my different understanding of the review scale.

This innocuous Angel Next Door review gave me an idea. It is necessary, but not sufficient, for a reviewer to define each step of his or her rating system. Sufficiency requires an additional step: Seeding the review system with real reviews. For example, were I to start scoring anime here at The New Leaf Journal, I would first define my rating scale and then publish a list of my scores for a not insignificant number of anime series. I would not necessarily need to publish full reviews for all of these series, just example scores with multiple anime for each step on my scoring ladder.
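To make the idea concrete, here is a rough sketch in Python of how a reviewer might check that a seed list covers every step of the scoring ladder. Everything here is an assumption for illustration: the function names, the five-point range, the threshold of two examples per step, and the placeholder titles ("Series A" and so on) are hypothetical and do not represent real scores for real series.

```python
from collections import defaultdict


def group_by_score(seed_list):
    """Group seeded titles under the score each one received."""
    grouped = defaultdict(list)
    for title, score in seed_list.items():
        grouped[score].append(title)
    return dict(grouped)


def under_seeded_steps(seed_list, low=1, high=5, min_examples=2):
    """Return the steps of the scale that have fewer than min_examples titles."""
    grouped = group_by_score(seed_list)
    return [
        step
        for step in range(low, high + 1)
        if len(grouped.get(step, [])) < min_examples
    ]


# Placeholder titles and scores, invented purely for illustration.
seed_list = {
    "Series A": 5, "Series B": 5,
    "Series C": 4, "Series D": 4,
    "Series E": 3, "Series F": 3,
    "Series G": 2,
    "Series H": 1, "Series I": 1,
}
gaps = under_seeded_steps(seed_list)
print(gaps)
```

In this invented list, only one title sits at two stars, so the check flags that step as needing another example before the list is published.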

The first benefit of seeding a rating list is that it adds concrete examples to each step of the rating ladder. Using anime ratings as an example – if I see reviewed anime that I recognize I will have a clearer picture of how the particular rating system works than I would otherwise have through referencing the description of the scoring ladder alone.

Second, seeing recognized media on a rating list gives readers an idea about the specific reviewer’s preferences. For someone who is familiar with the medium – whether it happens to be anime, board games, books, movies, video games, or the like – this can be very instructive. For example, Sorrow-kun, once editor of The Nihon Review, gave Mawaru Penguindrum, a 2011 anime series, a 9 out of 10. Penguindrum is a love-it-or-hate-it series and I fall on the side of thinking that it is awful (I will refrain out of kindness from deciding between a 2 and a 3 in this forum). However, Sorrow-kun published many reviews – some of which I generally agree with. 5 points sounds right for Shigofumi (its ending song is a clean 10, however). I would go a step higher than his 8 for one of my favorite movies, 5 Centimeters Per Second, and at least one step lower than his 5 for Allison & Lillia, but I can see in the large sample that we often land in the same ballpark. Were I very invested, I could glean an understanding of the types of anime for which his recommendations may be useful to me and the types about which I should be skeptical based on my own preferences.

Another benefit of seeding a rating list comes on the low end, especially for 10-point scales and the like. For whatever it is worth, I was much more likely to finish a bad series 10 years ago than I am now (see Penguindrum…). These days I drop most series that are on pace for a below-average score, and I have even dropped some that may harbor some average potential out of my lack of interest. But I have certainly seen enough anime series that I could – were I to need to seed a 10-point style ranking – provide examples for all 10 steps on the ranking ladder. Doing so would ensure that readers would grasp what I understand to be a bad series even if I am unlikely to write about bad series in full article format – with the notable exception of School Days and its “nice boat.”

The process of seeding a rating list before reviewing things will assist the reviewer in calibrating the rating scale. Without having a rating index, one may be inclined to give an average series a 7 instead of a 5 – especially if he or she is steeped in an environment with grade inflation issues (see, e.g., major anime sites with user ratings). With a large enough sample size, one may see that he or she should move ratings for many reviewed items down a notch to ensure a proper distribution and a baseline for future reviews.
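As a rough illustration of this calibration check, the Python sketch below compares the median of a seed list against the step the scale defines as average and suggests how many notches to shift. This is a deliberately crude approach that I am assuming for demonstration: it moves every score by the same amount, whereas a careful reviewer would reconsider each item individually. The function names and sample scores are hypothetical.

```python
from statistics import median


def suggested_shift(scores, average_step=5):
    """Return how many notches to move scores so the median lands on average_step.

    A negative result means the seed list is inflated and should move down.
    """
    return average_step - round(median(scores))


def recalibrate(scores, average_step=5, low=1, high=10):
    """Apply the uniform shift, keeping each score within the scale's bounds."""
    shift = suggested_shift(scores, average_step)
    return [min(high, max(low, score + shift)) for score in scores]


# Invented seed scores where the "typical" series sits at 6 instead of 5.
inflated = [5, 6, 6, 7, 8, 4, 6]
shift = suggested_shift(inflated)
adjusted = recalibrate(inflated)
print(shift, adjusted)
```

Here the median of the invented list is 6, so the sketch suggests moving everything down one notch, which puts the typical series back at 5.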

The foregoing segues into another point: If one cannot adequately seed a rating system, perhaps he or she has not read, played, seen, or watched enough to be a credible reviewer. For example, while I have played many video games over the last 30 years, I do not think I have played enough games in a general sense – especially in the last 15 years or so – to be a credible video game reviewer. Perhaps I could create a scoring system for a subset of games – say, for example, al|together visual novel translations – but I would not consider myself a general purpose game reviewer. Conversely, I do think I could be a credible Nihon Review/THEM Anime style reviewer given the number of series I have watched to completion.

I conclude this section by noting that I think my suggestion is applicable for multi-reviewer projects such as the former Nihon Review and THEM. In this case, I would recommend that each individual reviewer provide a seed review list with scores on his or her bio page. This would provide initial context for readers to evaluate new reviewers and additional context beyond the reviewer’s published full reviews. Moreover, because a review publication should have a standard, well-defined review format, a seed list will give readers examples of how each individual reviewer interprets the format.

Rating Systems Are Not Always Necessary

Having written extensively about rating systems, I conclude by disregarding the foregoing and taking the position that not every review requires a rating. For example, The New Leaf Journal is, at its core, a writing website. I write about things when I have something to say. While The Angel Next Door Spoils Me Rotten was not a great series, I decided to give it a full review both because I had written a side article about it, and because I ultimately concluded it made an interesting review subject – flaws and all (I also had a health-of-site motive which turned out to be well-founded). My review was very long, closing in on 8,000 words. Would it have been better if I attached a score or badge to it? Would readers have gleaned more if I said it was a 6 or 7 out of 10 or a 3 out of 5? My objectives in writing the review were as follows:

Provide a useful summary of the series for people who have not watched it so they can decide whether they are interested in watching it.
Write about the series in such a way that my review will provide value to people who saw the series and are interested in reading someone else’s opinion about it, all the while analyzing it without spoiling it for people who have not seen it.
Provide a clearly articulated assessment of the series’ strengths and weaknesses with benchmarks in the form of references to other series such that people understand my evaluation of it.

Slapping a score on the review would not have hurt beyond the potential that some less curious readers may simply look at a score without taking the time to understand what the score means or why I gave it that score. But I do not think a score would have added anything meaningful to my very long, detailed review of Angel Next Door. I leave it to readers to decide how engaging my review was, but I think I generally hit all three points that I prioritized. I summarized the series without spoiling key events, I gave people who have not seen the series plenty of information to decide whether they may want to watch it in the future, and people who saw the series will understand some of the references I make in the review to plot developments despite the fact I did not explicitly describe later plot developments. In the end, I made clear that I generally liked Angel Next Door and that it landed on the side of being above average (especially in its second half), but that it was limited by its mediocre-to-poor production values and its heavy-handed manner of drilling home certain points despite trusting the viewer to pick up on nuance from other points.

Because of what The New Leaf Journal is and my limited anime review coverage, a review system is not necessary here in the way it may be for other projects. But perhaps I will undertake the onerous task of putting together a review system in accordance with my principles as a sort of demonstration at some point in the future.