Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

evaluation #22

Open
garfieldnate opened this issue Mar 7, 2020 · 4 comments
Open

evaluation #22

garfieldnate opened this issue Mar 7, 2020 · 4 comments

Comments

@garfieldnate
Copy link

Hi Ahmed, since you commented on my issue in the half-life regression repo, I was wondering if you have any plans on doing any evaluation of ebisu with the data from Duolingo or elsewhere (I only know of the one from Duolingo). It would be great to have some concrete results for comparison with other review systems.

@fasiha
Copy link
Owner

fasiha commented Mar 8, 2020

Again, thanks for coming by and commenting!

The Duolingo paper's database is partial, meaning it doesn't provide results for each quiz that a user is given for a given flashcard. Since the database skips reviews for the same user–flashcard pair, Ebisu can't readily be applied to this case.

Furthermore: it's not clear what is a robust way to measure the performance of such algorithms 😕 the Duolingo paper uses the mean absolute error (MAE) between the algorithm's prediction of quiz performance versus actual performance, but that's a crappy metric because most of the time, with old flashcards, incorrect quizzes are completely unpredictable, and are far fewer than successful quizzes. (Furthermore, optimizing this metric doesn't serve the Duolingo algorithm well, as I explained in response to your issue on the halflife regression repo. I have some not-yet-published notes on this problem and how to fix it.)

And honestly, at the end of the day, taking a step back and looking at the big picture, for real-world flashcards, all you really need is a system to predict a reasonable half-life. Half-life regression can give you that (I've adjusted the algorithm to use the binomial likelihood and balanced training data and it yields more sensible results than what the original paper gives). So can Ebisu. (Anki, not really since it schedules quizzes in advance and can't properly handle over/under-reviewing.)

Let me know what you think.

@garfieldnate
Copy link
Author

Please bear with me and my general lack of math knowledge in this area.

...the database skips reviews for the same user–flashcard pair,... Eibsu can't readily be applied to this case.

I didn't understand this point. Are you referring to the fact that the scores given are for entire review sessions instead of for each individual memory test, which is what Ebisu uses?

MAE... a crappy metric

I'm not well-versed in what types of metrics fit what circumstances, but I guess you are saying that, even though the exponential forgetting curve is supported in the psychological literature, the collected data either contained too little signal or was not well modelled by just decay curves, so it didn't make sense to try to fit the cost function so strictly. I guess for Duolingo it was probably more the business metrics (student retention and interaction) plus the positive user feedback that were really important. I think an even better metric would be how fast a student makes general progress in learning the subject area (although in Duolingo's case I can easily make it through their learning paths without learning much).

BTW while we're talking about Duolingo, I found another analysis of their paper here: https://papousek.github.io/analysis-of-half-life-regression-model-made-by-duolingo.html.

most of the time, with old flashcards, incorrect quizzes are completely unpredictable

I don't understand this point either. What do you mean by "old"? Any of these things below?

  • Flashcards that have been reviewed many times can still be forgotten because of typos or tip-of-the-tongue forgetting, which is common even in well learned subjects (like one's native language).
  • Once the predicted half-life becomes large, the prediction becomes less accurate.
  • Facts/flashcards that a student has been aware of for a long time tend to get integrated with their mind in a way that the the exponential forgetting curve is no longer a good-enough model for reviewing purposes.

Also, we could just clean the data of old flashcards and compare models with the cleaned data, right?

I have not-yet-published notes

Can't wait to see them!

I've adjusted the algorithm to use the binomial likelihood and balanced training data and it yields more sensible results than what the original paper gives.

Can't wait to see that, either! In general, I was very excited to discover Duolingo's repo, but the nature of the data there seems to make it hard to use as-is.

all you really need is a system to predict a reasonable half-life

I still think an evaluation of some kind would be valuable. It's too bad the Duolingo data is so different from what flashcard apps usually use. It would be interesting if we could somehow create a large dataset from volunteer contributions of personal Anki review data.

@fasiha
Copy link
Owner

fasiha commented Mar 9, 2020

...the database skips reviews for the same user–flashcard pair,... Eibsu can't readily be applied to this case.

I didn't understand this point. Are you referring to the fact that the scores given are for entire review sessions instead of for each individual memory test, which is what Ebisu uses?

No I mean, for each user–flashcard pair in the database, all successive review sessions aren't included. Consider the following rows in the database for user_id "u:FO" and lexeme_id "76390c1350a8dac31186187e2fe1e178", and note how the history_seen column goes from 6, to 8, to 14… that means we see the results of the sixth time this user saw this flashcard, then the eighth, the fourteenth, etc., and we don't have any data for the first five times the user saw this flashcard, etc.:

p_recall,timestamp,delta,user_id,learning_language,ui_language,lexeme_id,lexeme_string,history_seen,history_correct,session_seen,session_correct
1.0,1362076081,27649635,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,6,4,2,2
1.0,1362082044,5963,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,8,6,6,6
0.0,1362082297,253,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,14,12,1,0
1.0,1362082362,65,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,15,12,1,1
0.5,1362082389,27,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,16,13,2,1

Ebisu works with an initial memory model that it then propagates through updateRecall after each quiz session. It can't readily handle such large gaps in the data, where we don't know what happened.

MAE... a crappy metric

I'm not well-versed in what types of metrics fit what circumstances, but I guess you are saying that, even though the exponential forgetting curve is supported in the psychological literature, the collected data either contained too little signal or was not well modelled by just decay curves, so it didn't make sense to try to fit the cost function so strictly.

Not quite what I meant—actually exponential decay is baked into the algorithm but I refer more to how the algorithm works. The HLR algorithm (stochastic gradient descent, which they cite in their paper as AdaGrad) zooms in on the weights (the three numbers that you complained about in your issue) that makes the predicted session probability best match the actual session success percentage. Then, in Table 2, they compute MAE (mean absolute error between these predictions and the actual percentage over all data in the test set) and use that to say, look, HLR has lowest MAE compared to logistic regression or Pimsleur or even HLR with fewer feature sets, so it's the best. This literally happens because stochastic gradient descent minimizes the MAE.

But that doesn't mean an algorithm that gets lower MAE is that great, as you noted in your issue: low MAE might happen with stupid nonsense weights, which is what the "HLR -lex" algorithm does.

So if I compared an improved HLR algorithm or Ebisu against HLR by measuring MAE on a complete database of reviews from Anki volunteers, it wouldn't convince me of their relative merits.

BTW while we're talking about Duolingo, I found another analysis of their paper here: https://papousek.github.io/analysis-of-half-life-regression-model-made-by-duolingo.html.

Thank you!!, I hadn't seen this! "It seems that the presented model has almost no predictive power." Just so. I imagine that Duolingo published a very basic early version of whatever they actually use because the model estimated by their code is nonsense.

most of the time, with old flashcards, incorrect quizzes are completely unpredictable

Flashcards that have been reviewed many times can still be forgotten because of typos or tip-of-the-tongue forgetting, which is common even in well learned subjects (like one's native language).

This. Looking at the plot of session success rates for the first million rows in the database shows that as the number of times you've studied a fact grows, failures are ever-present.

million

HLR and Ebisu both work in such a way that as they grow more confident in your memory, they lower the predicted probability of error, so incorporating failures can jar HLR's attempt to fit them. Ebisu handles this better since it's sequential (it sees the failures after fitting a model over time) but when you update a well-established Ebisu model with a failure, you might not like the result (you might not like how big or how small the decrease in halflife is).

Once the predicted half-life becomes large, the prediction becomes less accurate.

Facts/flashcards that a student has been aware of for a long time tend to get integrated with their mind in a way that the the exponential forgetting curve is no longer a good-enough model for reviewing purposes.

Hmm, interesting! I wonder if these are restatements of the above?

I do often think about the real world result of flashcard study, and what the end result ought to be (integrating the fact totally in your mental model so you can retire the flashcard—which might only make sense in the domain I'm familiar with, of language learning).

In general, I was very excited to discover Duolingo's repo, but the nature of the data there seems to make it hard to use as-is.

Yes, it's a very unusual take on the problem—an algorithm that predicts halflife as a function of how often you've seen a flashcard & how often you've remembered it, without caring about the timing of those quizzes. So in that way, it's far less choosy than e.g. Ebisu. But I haven't at all studied the real promise of the algorithm, which is to incorporate relationships between flashcards. E.g., if you've just missed a flashcard on one conjugation of a verb, what should that do to your pending review of another conjugation? Should it be postponed or preponed or…? I don't know and am curious to figure out.

I think an even better metric would be how fast a student makes general progress in learning the subject area (although in Duolingo's case I can easily make it through their learning paths without learning much).

Yeah this would be a gold-standard experimental study but you'd need so many students to average out all other factors (about students, material, surrounding quiz app, etc.) before you could reliably say which algorithm worked better. I'm honestly not holding my breath for this. I still think that even the dumbest algorithm will work nicely with well-designed flashcards studied by energetic students with an app that allows them to tweak their confidence about a fact.

The thing I'm actually more confident about is, as mentioned above, using the network of relationships between flashcards, and figuring out how a review on one ought to shift the reviews of the others. We're just barely starting to see ideas along these lines. My own flashcard sets use foreign language sentences that are linked to dictionary entries—both sentences and dictionary entries have Ebisu models, and reviewing a sentence results counts as a passive review of all the dictionary entries it contains (a passive review means, you reuse the same model as before, just update the timestamp to now). That way, you rarely review dictionary words themselves because you're always reviewing them in the context of sentences, which I really like.

Hope this helps, happy to explain anything further.

@lars76
Copy link

lars76 commented Apr 28, 2024

Hi, I've been thinking about a new metric for evaluating spaced repetition algorithms and came across this thread. Since I have just started experimenting with these algorithms, I wanted to know what you think about it. I also haven't read exactly how your algorithm works yet. My metric is also not perfect in the sense that it actually measures the learning progress but I think it is better than comparing "predicted result of user for card 1" and "actual result of user for card 1".

Dataset:
We have samples t with the corresponding result of either 0 (forget) or 1 (retrieve) + some other features. This entire review history exists for every single card. The dataset comes from Anki: https://github.com/open-spaced-repetition/srs-benchmark

Ground truth:
We always group up to the next “forget result” (https://en.wikipedia.org/wiki/Forgetting_curve). For example, t=1 with result=1, t=3 with result=1, t=6 with result=0. Then we have the interval [1, 6]. Our ground truth CDF is P(T <= t) for t in [1, 6].

Prediction:
Let us describe a simple algorithm. The forgetting curve can be modeled as an exponential distribution P(T = t) = e^(-lambda * t). The corresponding CDF is P(T <= t) = 1 - e^(-lambda * t).

Metric:
We can use the Continuous Ranked Probability Score, which is essentially the MSE between the two CDFs: https://datascience.stackexchange.com/questions/63919/what-is-continuous-ranked-probability-score-crps

Justification:
Human memory starts with the knowledge of a fact (recall = 1) and after a certain time t we forget it. A good algorithm should accurately describe this forgetting curve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants