
Enumerates frequency values for n-grams within text corpora of over 500 student writing samples ranging from 150-200 words per submission.


zmuhls/ngram-parser


Summary:

This program was used to tokenize and analyze approximately 500 first-year student writing samples of around 150-200 words per submission. Its output can also be coupled with findings from related NLP methods geared toward sensemaking and pattern recognition in rhetorically contained linguistic datasets.

Such was the case with the metadiscursive register of students who write about their own writing in response to the pedagogical language of an assignment description, whose parameters narrow the pool of lexical and phraseological expressions among online learners. The instructional design and rhetoric of these writing prompts thus underwent strong qualitative growth year-over-year due to the patterned insights of these learner analytics.

Files:

  1. A Python script designed to preprocess, sort, and enumerate the frequency values for bigram and trigram tokens within considerably large text corpora.
  2. A sizeable linguistic dataset drawn from a distributed set of text artifacts.
  3. A plain or rich text file for each linguistic dataset, which is to be concatenated, added as a commit, and passed to the script as input with its filename wrapped in quotes.
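
The counting step described in item 1 can be sketched roughly as follows. This is a minimal illustration rather than the repository's actual script: the function names (`tokenize`, `ngrams`, `ngram_frequencies`), the regex-based tokenizer, and the sample sentence are all assumptions made for the example, and the real preprocessing may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_frequencies(text, n):
    """Count the n-grams in text and return (ngram, count) pairs, most frequent first."""
    return Counter(ngrams(tokenize(text), n)).most_common()


# Made-up sample text standing in for a concatenated corpus file.
sample = "In this essay I argue that in this essay I show that"

for n in (2, 3):
    print(f"Top {n}-grams:")
    for gram, count in ngram_frequencies(sample, n)[:3]:
        print(" ", " ".join(gram), count)
```

For a real dataset, the sample string would be replaced by the contents of the concatenated text file from item 3, read with `open("<filename>", encoding="utf-8").read()`.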

Addendum:

This repository comprises part of my final research project in fulfillment of Kyle Gorman's Fall 2019 section of Methods in Computational Linguistics at The Graduate Center, CUNY.
