  • Fix pulling data sources
  • JMDict
    • Ingest data
  • Tatoeba
    • Ingest data
    • Disambiguate and connect to JMDict senses
  • NHK News
    • Ingest data
    • Disambiguate. This should be done through a combination of MeCab and Levenshtein distance to the sense glossary. (Please mention in the report that dropping the ones that are still ambiguous might be bad, because there might be a pattern to them: a single word might have lots of similar glosses and be marked as very rare as a result.)
      • TF-IDF
  • Test out weight combinations. Some notes:
    • Sentence length cost should probably increase exponentially.
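
A rough sketch of what an exponentially increasing sentence length cost could look like. Everything here is an assumption for illustration: the function name `sentence_length_cost`, the `target` length (in tokens) below which a sentence costs nothing, and the growth `rate` are all hypothetical and would need tuning alongside the other weights.

```python
import math


def sentence_length_cost(length: int, target: int = 10, rate: float = 0.2) -> float:
    """Return a cost that grows exponentially with tokens beyond `target`.

    `target` and `rate` are hypothetical tuning knobs, not values from the project.
    """
    # Sentences at or below the target length incur no cost.
    excess = max(0, length - target)
    # Subtract 1 so the cost starts at exactly 0 when there is no excess.
    return math.exp(rate * excess) - 1.0


if __name__ == "__main__":
    for n in (5, 10, 15, 20, 30):
        print(n, round(sentence_length_cost(n), 3))
```

With this shape, each extra token past the target multiplies `cost + 1` by a constant factor (`exp(rate)`), so very long sentences are penalized much harder than a linear cost would penalize them.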