New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spam detection engine #11319
base: develop
Are you sure you want to change the base?
Add spam detection engine #11319
Conversation
…are-analyzer-events
Add Gitlab action workflow Patch the generator Running linters Gemfiles
Add language service Normalize gems
* Add BayesStrategy * Add Bayes Analyzer * Refactor strategy intialization process
…are-analyzer-events
…dim into ale-add-spam-detection
@@ -6,7 +6,7 @@ module Decidim | |||
module Generators | |||
def self.edge_git_branch | |||
if Decidim::Generators.version.match?(/\.dev$/) | |||
"develop" | |||
"ale-add-spam-detection" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ale-add-spam-detection" | |
"develop" |
@@ -42,7 +42,7 @@ def source_paths | |||
desc: "Use a specific branch from GitHub's version" | |||
|
|||
class_option :repository, type: :string, | |||
default: "https://github.com/decidim/decidim.git", | |||
default: "https://github.com/tremend-cofe/decidim.git", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default: "https://github.com/tremend-cofe/decidim.git", | |
default: "https://github.com/decidim/decidim.git", |
@@ -433,7 +433,7 @@ def branch | |||
end | |||
|
|||
def repository | |||
@repository ||= options[:repository] || "https://github.com/decidim/decidim.git" | |||
@repository ||= options[:repository] || "https://github.com/tremend-cofe/decidim.git" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@repository ||= options[:repository] || "https://github.com/tremend-cofe/decidim.git" | |
@repository ||= options[:repository] || "https://github.com/decidim/decidim.git" |
@@ -458,7 +458,7 @@ def target_gemfile | |||
root = if options[:path] | |||
expanded_path | |||
elsif branch.present? | |||
"https://raw.githubusercontent.com/decidim/decidim/#{branch}/decidim-generators" | |||
"https://raw.githubusercontent.com/tremend-cofe/decidim/#{branch}/decidim-generators" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"https://raw.githubusercontent.com/tremend-cofe/decidim/#{branch}/decidim-generators" | |
"https://raw.githubusercontent.com/decidim/decidim/#{branch}/decidim-generators" |
@@ -17,7 +17,7 @@ module Decidim | |||
let(:test_version) { "0.27.0.dev" } | |||
|
|||
it "returns the develop branch" do | |||
expect(subject.edge_git_branch).to eq("develop") | |||
expect(subject.edge_git_branch).to eq("ale-add-spam-detection") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect(subject.edge_git_branch).to eq("ale-add-spam-detection") | |
expect(subject.edge_git_branch).to eq("develop") |
* Add event handlers and spec data * Fixing failng specs * Fix Catgeory error in untrain * fix decidim ai tests
* Add Strategy module * Add more namespaces
* Add resources to be analyzed * mend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had another round of code review and I've just left few improvement ideas. Feel free to adjust and justify to your needs. Note that this review is only regarding the code, I have not tested it in action as the previous analysis is still valid that I did in the previous PR.
The fundamental problem I see with the current approach is that the default configuration won't work for most of the users, i.e. when using the in-memory database. If configured with the Redis backend, the current approach should work fine.
Either the loading and training process need to be adjusted OR Redis has to be the default and requirement for using this module.
The training data won't be sufficient with the Naive Bayes classifier. We would need about 50k good comments/profile descriptions and 50k bad comments/profile descriptions for it to be somewhat reliable. The current datasets contain about 6k entries, most of which comes from the SMS dataset. Some analysis I did before is available in the other already closed PR.
Probably with our use cases, we will swap the classifier with some really dumb implementation that will likely work fine for our use cases. What I was thinking as a baseline after the previous round of review:
- Analyze the language of the content (English or other non-platform language is an indication of spam)
- Check if the content contains a link with a URL that is not in the allowed list of URLs (indication of spam)
- Check if the content contains any words that are commonly used by spammers (indication of spam)
- For user profiles, check if the profile has a URL (indication of a spammer)
And then draw a score based on this analysis. We've used this baseline for a couple of our instances and it has worked relatively well identifying spammer profiles, with very low error rate.
The current implementation allows doing this, so for this need, it will work fine.
I think the Naive Bayes classifier will be difficult until there is enough training data for it, both ham and spam. And the data also has to be properly licensed (AGPL compatible) for it to ship with the core modules.
Add this line to your application's Gemfile: | ||
|
||
```ruby | ||
gem "decidim-tools-ai" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gem "decidim-tools-ai" | |
gem "decidim-ai" |
@@ -0,0 +1,55 @@ | |||
# Decidim::Ai | |||
|
|||
The Decidim::AI is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Decidim::AI is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm. | |
The Decidim::Ai is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm. |
|
||
delegate :train, :reset, to: :backend | ||
|
||
def classify(content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential problem I see with this approach is something that I've also mentioned before.
If we are using the Naive Bayes classifier, it is good at classifying specific types of content if trained properly.
E.g. the content entered to profiles typically differs from the content entered to comments. Based on my prior analysis, I don't think the same dataset works for both of these content types.
I have not tested the classifier with a properly balanced (i.e. 50% spam, 50% ham) large enough dataset that would contain both, profile descriptions as well as comment data, so my assumption could be wrong. But just reading through the resources, I think the classification would need to be specific to the context about what data is being analyzed.
On the other hand, we have the problem that if we are using the in-memory database (as is the default) and start adding multiple in-memory classifiers, it will lead to the application requiring a lot of memory to run.
So this would work well if one global classifier works for all types of content but with the Naive Bayes model I'm not sure if that is the case.
While I don't expect the first iteration to be perfect, I would consider adding at least the possibility to have multiple classifiers for different types of content, as that could be required (depending on the classifier strategy).
module Importer | ||
class File | ||
def self.call(file) | ||
service = Decidim::Ai.spam_detection_instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default configuration suggest using the in-memory database which would mean that the training data is persisted only during the runtime.
When using the training strategy suggested by the documentation (i.e. the rake tasks), the classifier being used at application runtime will never receive the training data through this strategy.
So when you run the training rake tasks, the classifier is trained only during those tasks. The classifier loaded by the server application will never receive the training data.
I think a better way to think about this would be to pre-train a model, persist/dump the trained model to somewhere (file system, database, etc.) and then load that model when the application starts.
This strategy would work when using the redis backend for the classifier but with the default in-memory backend, it won't work.
After the configuration is added, you need to run the below command, so that the reporting user is created. | ||
|
||
```ruby | ||
bundle exec rake decidim:spam:data:create_reporting_user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bundle exec rake decidim:spam:data:create_reporting_user | |
bundle exec rake decidim:ai:create_reporting_user |
If you have an existing installation, you can use the below command to train the engine with your existing data: | ||
|
||
```ruby | ||
bundle exec rake decidim:spam:train:moderation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bundle exec rake decidim:spam:train:moderation | |
bundle exec rake decidim:ai:load_plugin_dataset | |
bundle exec rake decidim:ai:load_application_dataset | |
bundle exec rake decidim:ai:train_using_database |
I would actually like if this happened in only one rake task (like in the docs) but just commenting based on what I see in the rake tasks.
wrapped.untrain :ham, translated_attribute(resource.send(field)) | ||
wrapped.train :spam, translated_attribute(resource.send(field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't be persisted to the in-memory classifier when the application is run in multiple processes or load is divided onto multiple servers.
I would not train the classifier on-the-fly. Instead, I would suggest to train the model as a separate task, dump the trained model after training and then load the trained model at application startup.
Also, this trains only one language (i.e. the default language).
query.find_each(batch_size: 100) do |resource| | ||
classification = resource_hidden?(resource) ? :spam : :ham | ||
fields.each do |field_name| | ||
train classification, translated_attribute(resource.send(field_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this will still fail if the resource data is nil
.
The #train
method on the classifier is delegated directly to the backend, so there is no safe checks regarding this.
Also, this trains only one language (i.e. the default language).
module SpamDetection | ||
module Resource | ||
class UserBaseEntity < Base | ||
def fields = [:about] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just leave the same comment here as I left before, just FYI.
Sometimes the spammers are filing the about section and then removing it later after their campaign ends (and their client stops paying). This information is still available through the version history of the profile.
Another thing that is generally a high indication that the profile is a spammer is that if they have filled in the personal URL to their profile. This is not currently considered and mileage may vary depending on the website.
module SpamDetection | ||
module Resource | ||
class Comment < Base | ||
def fields = [:body] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also from the previous review, the training data shipped with the module has the user's name in the beginning of the comment. So I commented there that should we consider the same convention here too?
🎩 What? Why?
This PR adds the spam detection mechanism, created in a stand alone bundle that can be installed also in older decidim installations. Please refer to decidim-tools-ai/Readme.md for configuration details.
📌 Related Issues
Link your PR to an issue
Testing
📷 Screenshots
Please add screenshots of the changes you're proposing