[WIP] Determine if text is about the Rust game or Rust programming language

CosmicHorrorDev/rust_text_classifier

Rust Text Classifier

Current Manifestation: u/AutoShadow0133

A Reddit bot powered by a text classifier for determining if text is about the videogame Rust or the programming language Rust

Overview

At a high level, the bot listens to the stream of new text posts on the r/rust subreddit. For each new post it runs a text classifier trained to distinguish text about the videogame from text about the programming language. If the classifier believes the title and body are about the videogame (based on a configurable prediction threshold; see Analysis for more information), the bot leaves a comment stating this conclusion along with several popular Rust game subreddits that might be a better fit for the post.
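The decision step described above can be sketched roughly as follows. This is only an illustration under assumptions, not the bot's actual code: `should_comment`, `build_comment`, the default threshold of 0.6, the `>=` comparison, and the example subreddit name are all hypothetical.

```python
from typing import List


def should_comment(game_probability: float, threshold: float = 0.6) -> bool:
    """Return True when the (assumed) game probability meets the threshold."""
    if not 0.0 <= game_probability <= 1.0:
        raise ValueError("game_probability must be between 0 and 1")
    return game_probability >= threshold


def build_comment(subreddits: List[str]) -> str:
    """Format a reply that points the poster at Rust game subreddits.

    The wording is made up for this sketch; the bot's real comment may differ.
    """
    listed = "\n".join(f"- r/{name}" for name in subreddits)
    return (
        "This post looks like it is about the Rust videogame. "
        "You may have better luck in:\n" + listed
    )


# Example: a post the classifier scores at 0.83 for the game category.
if should_comment(0.83):
    print(build_comment(["playrust"]))  # subreddit name is only an example
```

Keeping the threshold as a parameter mirrors the configurable prediction threshold mentioned above: raising it trades missed game posts for fewer mistaken comments.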

Installation

Note: This has only been tested on Linux and the following assumes a Linux environment

Required Configuration

This repo is assumed to be installed under /opt/rust_text_classifier. The only required change is adding

  • A config file called config.json (sample_config.json acts as a template)

If desired, a different posts_corpus can be used. Several files are also generated automatically:

  • posts.db, which keeps track of the classifications made on posts
  • text_classifier.pkl, a pickled form of the classifier that avoids retraining each time the program is launched
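The caching behaviour behind text_classifier.pkl can be sketched with the standard `pickle` module. This is a minimal sketch under assumptions: `load_or_train` and the `train` callback are hypothetical names, and the project's real loading logic may differ.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_train(cache_path: Path, train: Callable[[], Any]) -> Any:
    """Load a pickled classifier if the cache exists, else train and cache it."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    classifier = train()  # hypothetical callback that builds the classifier
    with cache_path.open("wb") as f:
        pickle.dump(classifier, f)
    return classifier
```

Deleting the cached file forces a retrain on the next launch, which is useful after changing the corpus. Pickle files can execute arbitrary code when loaded, so only load a cache that was produced locally.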

Dependencies

This project uses poetry to manage dependencies and virtual environments. With poetry installed, setting up the dependencies and running the bot takes two commands, run from the project directory:

poetry install --no-dev  # Only need to do this once
poetry run ./bot  # Uses the virtual environment created above

Alternatively, you can install the dependencies with your system's package manager or manually with pip (I don't think pip supports reading them from pyproject.toml yet, but I could be wrong).

Analysis

Note: The classifier always uses an equal number of posts from each category. Even though more posts about the game are available, it selects only enough of them to match the number of posts about the lang.
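The balancing in the note above amounts to truncating both categories to the size of the smaller one. A minimal sketch, assuming posts are plain strings already in the order they should be picked (`balance` is a hypothetical helper, not a function from this repo):

```python
from typing import List, Tuple


def balance(lang_posts: List[str], game_posts: List[str]) -> Tuple[List[str], List[str]]:
    """Trim both categories to the size of the smaller one."""
    count = min(len(lang_posts), len(game_posts))
    return lang_posts[:count], game_posts[:count]


lang, game = balance(["l1", "l2"], ["g1", "g2", "g3", "g4"])
print(len(lang), len(game))
```

Equal class sizes keep the classifier from favouring the game category just because more game posts were collected.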

There is an analysis script for some basic (read: hacky) analysis. It trains a classifier on 80% of the posts found in posts_corpus and then tests its accuracy on the remaining 20%. The test is run 100 times, and the results for each category are averaged and reported. This is repeated with prediction thresholds of 50%, 60%, and 70%.
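One way to tally the results of a single test run is sketched below. It rests on an assumption about the reported columns: a prediction whose confidence falls below the threshold counts as "ignored", and the rest count as "correct" or "incorrect". The `tally` function and the `(true_label, predicted_label, confidence)` tuples are hypothetical and not taken from the analysis script.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[str, str, float]  # (true_label, predicted_label, confidence)


def tally(predictions: List[Prediction], threshold: float) -> Dict[str, Dict[str, float]]:
    """Return per-category percentages of correct, incorrect, and ignored posts."""
    counts: Dict[str, Dict[str, int]] = {}
    for true_label, predicted_label, confidence in predictions:
        bucket = counts.setdefault(
            true_label, {"correct": 0, "incorrect": 0, "ignored": 0}
        )
        if confidence < threshold:
            bucket["ignored"] += 1  # not confident enough to act on
        elif predicted_label == true_label:
            bucket["correct"] += 1
        else:
            bucket["incorrect"] += 1

    # Convert raw counts into percentages of each category's total.
    return {
        label: {key: 100.0 * value / sum(bucket.values()) for key, value in bucket.items()}
        for label, bucket in counts.items()
    }


sample = [
    ("lang", "lang", 0.90),
    ("lang", "game", 0.65),
    ("lang", "lang", 0.55),
    ("game", "game", 0.80),
]
print(tally(sample, threshold=0.6))
```

Under this reading, averaging the output of `tally` over many random 80/20 splits gives one row pair per threshold. With only two categories the winning class always has at least 50% confidence, so nothing is ignored at a 50% threshold.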

The corpus I'm currently using is a set of 400 removed r/rust posts along with 400 r/rust posts about the lang (big thanks to the moderators for helping me get access to relevant removed posts).

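| Category | Threshold | Correct | Incorrect | Ignored |
|----------|-----------|---------|-----------|---------|
| Lang     | 50%       | 97.20%  | 2.80%     | 0.00%   |
| Game     | 50%       | 95.07%  | 4.93%     | 0.00%   |
| Lang     | 60%       | 91.24%  | 0.89%     | 7.87%   |
| Game     | 60%       | 85.89%  | 1.91%     | 12.20%  |
| Lang     | 70%       | 75.38%  | 0.26%     | 24.37%  |
| Game     | 70%       | 64.96%  | 1.01%     | 34.02%  |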

Notes on Posts Corpus

The posts corpus is generated by running a simple script that fetches all new text posts from r/rust, along with a number of Rust game subreddits, every 5 minutes. The posts from r/rust are then manually classified into r_rust_correct and r_rust_incorrect. This is done to best match the posts the bot will attempt to classify, although using older data would likely also work well.

When selecting game posts, the classifier prefers posts from r_rust_incorrect over all other Rust game posts, since they best match the data it's attempting to classify (the loaded posts are always shuffled, though, to guarantee better randomness when testing accuracy).
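The preference described above can be sketched as a priority fill followed by a shuffle. This is a sketch under assumptions: `select_game_posts` is a hypothetical name, posts are treated as plain strings, and whether the real code shuffles before or after selection is not stated here.

```python
import random
from typing import List, Optional


def select_game_posts(
    r_rust_incorrect: List[str],
    other_game_posts: List[str],
    count: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick `count` game posts, using r_rust_incorrect before any others."""
    if count < 0:
        raise ValueError("count must be non-negative")

    chosen = r_rust_incorrect[:count]  # preferred source first
    remaining = count - len(chosen)
    chosen += other_game_posts[:remaining]  # top up from other game subreddits

    random.Random(seed).shuffle(chosen)  # shuffle so test splits are not ordered
    return chosen


picked = select_game_posts(["a", "b"], ["c", "d", "e"], count=3, seed=0)
print(sorted(picked))
```

Here `count` would be the number of lang posts, so that both categories end up the same size.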

License

All contents in this repo, excluding the content within the posts_corpus directory, are licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
