Greek vocab lists drawn from corpora with common words

swo/greek_corpus

To do:

  • Rework the TSV writer around objects, so that each group of words wraps its own functionality

Most common Greek words

I pulled lists of the most common Greek words from a SketchEngine corpus of web pages, then used WiktionaryParser to pull their definitions from Wiktionary. I packaged the results as a TSV file that can be imported into Anki, a flashcard app.
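
The definition lookup can be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: the helper name flatten_definitions is hypothetical, and the sample data is hand-written in the result shape shown in WiktionaryParser's README (a list of entries, each with a "definitions" list whose items carry "partOfSpeech" and "text").

```python
def flatten_definitions(entries):
    """Collect (part_of_speech, gloss) pairs from WiktionaryParser-style results.

    Assumes each entry has a "definitions" list whose items carry a
    "partOfSpeech" string and a "text" list of strings.
    """
    pairs = []
    for entry in entries:
        for definition in entry.get("definitions", []):
            pos = definition.get("partOfSpeech", "")
            # Assumption: text[0] is the headword line and text[1:] are the glosses.
            for gloss in definition.get("text", [])[1:]:
                pairs.append((pos, gloss))
    return pairs


# Hand-written sample in the assumed shape (not real Wiktionary output).
sample = [
    {
        "etymology": "",
        "definitions": [
            {"partOfSpeech": "conjunction", "text": ["και • (kai)", "and"]},
        ],
    }
]

print(flatten_definitions(sample))

# With the library installed, live results would come from something like:
#   from wiktionaryparser import WiktionaryParser
#   entries = WiktionaryParser().fetch("και", "greek")
```

Keeping the flattening step separate from the network call makes it possible to test the parsing logic without hitting Wiktionary.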

Files

Data files

  • anki.tsv is the Anki flashcard list
  • db.json is a database with the words, their frequencies, and their definitions
  • words.txt is a plain list of the words that appear in the lists
  • raw/ contains the untracked HTML files downloaded from SketchEngine, which parse_html.py parses

Script files

  • parse_html.py turns the html files in raw/ into database entries
  • fetch_definitions.py populates the database with Wiktionary definitions
  • make_anki_tsv.py translates the database into the Anki-ready file
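
The last step, turning the database into an Anki-ready file, might look something like the sketch below. The layout assumed for db.json (a mapping from each word to its "frequency" and a list of "definitions") is a guess for illustration only, as are the function name write_anki_tsv and the sample frequencies; the actual schema and script may differ.

```python
import csv
import io


def write_anki_tsv(db, out):
    """Write one tab-separated row per word: front (word), back (definitions)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Most frequent words first, so the deck follows corpus frequency.
    for word, info in sorted(db.items(), key=lambda kv: -kv[1]["frequency"]):
        writer.writerow([word, "; ".join(info["definitions"])])


# Hypothetical database contents; the frequencies are made-up numbers.
sample_db = {
    "το": {"frequency": 800, "definitions": ["the", "it"]},
    "και": {"frequency": 1000, "definitions": ["and"]},
}

buffer = io.StringIO()
write_anki_tsv(sample_db, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with encoding="utf-8" and newline="" so the Greek text and line endings come through unchanged.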

Alternative data sources

I considered scraping a Greek-English dictionary website (e.g., dict.com or Word Reference) but did not pursue it, partly because the pages are difficult to parse reliably and partly because of licensing concerns.