Add spam detection engine #11319

alecslupu · 2023-07-22T08:36:45Z

🎩 What? Why?

This PR adds the spam detection mechanism, created in a stand alone bundle that can be installed also in older decidim installations. Please refer to decidim-tools-ai/Readme.md for configuration details.

📌 Related Issues

Link your PR to an issue

Requires Add Events support for Command #11064
Related to Use content classification systems for better SPAM detection #10038
Related to Use content classification systems for better SPAM detection #10151
Fixes Use content classification systems for better SPAM detection #10038

Testing

Follow the installation instructions in the readme file
Index the data
Create some content and check to see if it get's marked as spam

📷 Screenshots

Please add screenshots of the changes you're proposing

…are-analyzer-events

Add Gitlab action workflow Patch the generator Running linters Gemfiles

Add language service Normalize gems

* Add BayesStrategy * Add Bayes Analyzer * Refactor strategy intialization process

…m-detection

…are-analyzer-events

…tion

* Add BayesStrategy * Add Bayes Analyzer * Refactor strategy intialization process

…dim into ale-add-spam-detection

…m-detection

alecslupu · 2023-07-22T08:38:20Z

decidim-generators/lib/decidim/generators.rb

@@ -6,7 +6,7 @@ module Decidim
  module Generators
    def self.edge_git_branch
      if Decidim::Generators.version.match?(/\.dev$/)
-        "develop"
+        "ale-add-spam-detection"


Suggested change

"ale-add-spam-detection"

"develop"

alecslupu · 2023-07-22T08:38:34Z

decidim-generators/lib/decidim/generators/app_generator.rb

@@ -42,7 +42,7 @@ def source_paths
                            desc: "Use a specific branch from GitHub's version"

      class_option :repository, type: :string,
-                                default: "https://github.com/decidim/decidim.git",
+                                default: "https://github.com/tremend-cofe/decidim.git",


Suggested change

default: "https://github.com/tremend-cofe/decidim.git",

default: "https://github.com/decidim/decidim.git",

alecslupu · 2023-07-22T08:38:52Z

decidim-generators/lib/decidim/generators/app_generator.rb

@@ -433,7 +433,7 @@ def branch
      end

      def repository
-        @repository ||= options[:repository] || "https://github.com/decidim/decidim.git"
+        @repository ||= options[:repository] || "https://github.com/tremend-cofe/decidim.git"


Suggested change

@repository ||= options[:repository] || "https://github.com/tremend-cofe/decidim.git"

@repository ||= options[:repository] || "https://github.com/decidim/decidim.git"

alecslupu · 2023-07-22T08:39:05Z

decidim-generators/lib/decidim/generators/app_generator.rb

@@ -458,7 +458,7 @@ def target_gemfile
        root = if options[:path]
                 expanded_path
               elsif branch.present?
-                 "https://raw.githubusercontent.com/decidim/decidim/#{branch}/decidim-generators"
+                 "https://raw.githubusercontent.com/tremend-cofe/decidim/#{branch}/decidim-generators"


Suggested change

"https://raw.githubusercontent.com/tremend-cofe/decidim/#{branch}/decidim-generators"

"https://raw.githubusercontent.com/decidim/decidim/#{branch}/decidim-generators"

alecslupu · 2023-07-22T08:39:18Z

decidim-generators/spec/lib/generators_spec.rb

@@ -17,7 +17,7 @@ module Decidim
        let(:test_version) { "0.27.0.dev" }

        it "returns the develop branch" do
-          expect(subject.edge_git_branch).to eq("develop")
+          expect(subject.edge_git_branch).to eq("ale-add-spam-detection")


Suggested change

expect(subject.edge_git_branch).to eq("ale-add-spam-detection")

expect(subject.edge_git_branch).to eq("develop")

* Add event handlers and spec data * Fixing failng specs * Fix Catgeory error in untrain * fix decidim ai tests

* Add Strategy module * Add more namespaces

* Add resources to be analyzed * mend

…m-detection

ahukkanen

I've had another round of code review and I've just left few improvement ideas. Feel free to adjust and justify to your needs. Note that this review is only regarding the code, I have not tested it in action as the previous analysis is still valid that I did in the previous PR.

The fundamental problem I see with the current approach is that the default configuration won't work for most of the users, i.e. when using the in-memory database. If configured with the Redis backend, the current approach should work fine.

Either the loading and training process need to be adjusted OR Redis has to be the default and requirement for using this module.

The training data won't be sufficient with the Naive Bayes classifier. We would need about 50k good comments/profile descriptions and 50k bad comments/profile descriptions for it to be somewhat reliable. The current datasets contain about 6k entries, most of which comes from the SMS dataset. Some analysis I did before is available in the other already closed PR.

Probably with our use cases, we will swap the classifier with some really dumb implementation that will likely work fine for our use cases. What I was thinking as a baseline after the previous round of review:

Analyze the language of the content (English or other non-platform language is an indication of spam)
Check if the content contains a link with a URL that is not in the allowed list of URLs (indication of spam)
Check if the content contains any words that are commonly used by spammers (indication of spam)
For user profiles, check if the profile has a URL (indication of a spammer)

And then draw a score based on this analysis. We've used this baseline for a couple of our instances and it has worked relatively well identifying spammer profiles, with very low error rate.

The current implementation allows doing this, so for this need, it will work fine.

I think the Naive Bayes classifier will be difficult until there is enough training data for it, both ham and spam. And the data also has to be properly licensed (AGPL compatible) for it to ship with the core modules.

ahukkanen · 2024-05-09T13:11:06Z

decidim-ai/README.md

+Add this line to your application's Gemfile:
+
+```ruby
+gem "decidim-tools-ai"


Suggested change

gem "decidim-tools-ai"

gem "decidim-ai"

ahukkanen · 2024-05-09T13:11:16Z

decidim-ai/README.md

@@ -0,0 +1,55 @@
+# Decidim::Ai
+
+The Decidim::AI is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm.


Suggested change

The Decidim::AI is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm.

The Decidim::Ai is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm.

ahukkanen · 2024-05-09T13:24:33Z

decidim-ai/lib/decidim/ai/spam_detection/strategy/bayes.rb

+
+          delegate :train, :reset, to: :backend
+
+          def classify(content)


One potential problem I see with this approach is something that I've also mentioned before.

If we are using the Naive Bayes classifier, it is good at classifying specific types of content if trained properly.

E.g. the content entered to profiles typically differs from the content entered to comments. Based on my prior analysis, I don't think the same dataset works for both of these content types.

I have not tested the classifier with a properly balanced (i.e. 50% spam, 50% ham) large enough dataset that would contain both, profile descriptions as well as comment data, so my assumption could be wrong. But just reading through the resources, I think the classification would need to be specific to the context about what data is being analyzed.

On the other hand, we have the problem that if we are using the in-memory database (as is the default) and start adding multiple in-memory classifiers, it will lead to the application requiring a lot of memory to run.

So this would work well if one global classifier works for all types of content but with the Naive Bayes model I'm not sure if that is the case.

While I don't expect the first iteration to be perfect, I would consider adding at least the possibility to have multiple classifiers for different types of content, as that could be required (depending on the classifier strategy).

ahukkanen · 2024-05-09T13:32:21Z

decidim-ai/lib/decidim/ai/spam_detection/importer/file.rb

+      module Importer
+        class File
+          def self.call(file)
+            service = Decidim::Ai.spam_detection_instance


The default configuration suggest using the in-memory database which would mean that the training data is persisted only during the runtime.

When using the training strategy suggested by the documentation (i.e. the rake tasks), the classifier being used at application runtime will never receive the training data through this strategy.

So when you run the training rake tasks, the classifier is trained only during those tasks. The classifier loaded by the server application will never receive the training data.

I think a better way to think about this would be to pre-train a model, persist/dump the trained model to somewhere (file system, database, etc.) and then load that model when the application starts.

This strategy would work when using the redis backend for the classifier but with the default in-memory backend, it won't work.

ahukkanen · 2024-05-09T13:34:47Z

decidim-ai/README.md

+After the configuration is added, you need to run the below command, so that the reporting user is created.
+
+```ruby
+bundle exec rake decidim:spam:data:create_reporting_user


Suggested change

bundle exec rake decidim:spam:data:create_reporting_user

bundle exec rake decidim:ai:create_reporting_user

ahukkanen · 2024-05-09T13:36:09Z

decidim-ai/README.md

+If you have an existing installation, you can use the below command to train the engine with your existing data:
+
+```ruby
+bundle exec rake decidim:spam:train:moderation


Suggested change

bundle exec rake decidim:spam:train:moderation

bundle exec rake decidim:ai:load_plugin_dataset

bundle exec rake decidim:ai:load_application_dataset

bundle exec rake decidim:ai:train_using_database

I would actually like if this happened in only one rake task (like in the docs) but just commenting based on what I see in the rake tasks.

ahukkanen · 2024-05-09T13:41:50Z

decidim-ai/app/jobs/decidim/ai/train_hidden_resource_data_job.rb

+            wrapped.untrain :ham, translated_attribute(resource.send(field))
+            wrapped.train :spam, translated_attribute(resource.send(field))


This won't be persisted to the in-memory classifier when the application is run in multiple processes or load is divided onto multiple servers.

I would not train the classifier on-the-fly. Instead, I would suggest to train the model as a separate task, dump the trained model after training and then load the trained model at application startup.

Also, this trains only one language (i.e. the default language).

ahukkanen · 2024-05-09T13:48:47Z

decidim-ai/lib/decidim/ai/spam_detection/resource/base.rb

+            query.find_each(batch_size: 100) do |resource|
+              classification = resource_hidden?(resource) ? :spam : :ham
+              fields.each do |field_name|
+                train classification, translated_attribute(resource.send(field_name))


Note that this will still fail if the resource data is nil.

The #train method on the classifier is delegated directly to the backend, so there is no safe checks regarding this.

Also, this trains only one language (i.e. the default language).

ahukkanen · 2024-05-09T13:52:05Z

decidim-ai/lib/decidim/ai/spam_detection/resource/user_base_entity.rb

+    module SpamDetection
+      module Resource
+        class UserBaseEntity < Base
+          def fields = [:about]


I'll just leave the same comment here as I left before, just FYI.

Sometimes the spammers are filing the about section and then removing it later after their campaign ends (and their client stops paying). This information is still available through the version history of the profile.

Another thing that is generally a high indication that the profile is a spammer is that if they have filled in the personal URL to their profile. This is not currently considered and mileage may vary depending on the website.

ahukkanen · 2024-05-09T13:54:54Z

decidim-ai/lib/decidim/ai/spam_detection/resource/comment.rb

+    module SpamDetection
+      module Resource
+        class Comment < Base
+          def fields = [:body]


Also from the previous review, the training data shipped with the module has the user's name in the beginning of the comment. So I commented there that should we consider the same convention here too?

alecslupu added 26 commits July 13, 2023 19:00

Add events for comments

64532b9

Add events for debates

98798ea

Add events for meetings

c6a07b0

Update the proposals commands

ab8bcbd

Refactor with_events

0d11ecc

Apply review recommendations

cf9a7a0

Merge branch 'develop' of github.com:decidim/decidim into fature/prep…

2df8d59

…are-analyzer-events

Create decidim-ai module

d204dc7

Add Gitlab action workflow Patch the generator Running linters Gemfiles

change description

99e4c98

Add language detection-service

94952c1

Add language service Normalize gems

Add registry strategy (#253)

6533af8

Add SpamDetectionService class (#255)

d2f3823

Add BayesStrategy (#256)

6daeb01

* Add BayesStrategy * Add Bayes Analyzer * Refactor strategy intialization process

Merge branch 'develop' of github.com:decidim/decidim into ale-add-spa…

b27b5a8

…m-detection

Merge branch 'develop' of github.com:decidim/decidim into fature/prep…

8d5ae48

…are-analyzer-events

Merge branch 'fature/prepare-analyzer-events' into ale-add-spam-detec…

692a2fa

…tion

Change the pipeline working dir

9d57036

Fixing spam suite

c6a772a

Revert event changes

e27dc50

Revert event changes

6dc2c88

Merge branch 'fature/prepare-analyzer-events' into ale-add-spam-detec…

7dd8754

…tion

Add BayesStrategy (#256) (#257)

d587f3c

* Add BayesStrategy * Add Bayes Analyzer * Refactor strategy intialization process

Merge branch 'ale-add-spam-detection' of github.com:tremend-cofe/deci…

367ca39

…dim into ale-add-spam-detection

Running linters

8f51cc7

Merge branch 'develop' of github.com:decidim/decidim into ale-add-spa…

daec039

…m-detection

Add score calculation (#262)

f6a1dca

alecslupu commented Jul 22, 2023

View reviewed changes

alecslupu added 3 commits July 23, 2023 05:07

Add event handlers and spec data (#263)

6583933

* Add event handlers and spec data * Fixing failng specs * Fix Catgeory error in untrain * fix decidim ai tests

Refactor AI namespaces (#269)

4a1cc71

* Add Strategy module * Add more namespaces

Add resources to be indexed (#254)

215aa57

* Add resources to be analyzed * mend

andreslucena unassigned ahukkanen Mar 7, 2024

Merge branch 'develop' of github.com:decidim/decidim into ale-add-spa…

ce04e00

…m-detection

alecslupu dismissed github-actions’s stale review via ce04e00 March 8, 2024 19:40

github-actions bot previously approved these changes Mar 8, 2024

View reviewed changes

Revert wicked_pdf version change

6c90048

alecslupu dismissed github-actions’s stale review via 6c90048 March 8, 2024 19:42

github-actions bot previously approved these changes Mar 8, 2024

View reviewed changes

alecslupu added 2 commits March 11, 2024 12:16

Merge branch 'develop' of github.com:decidim/decidim into ale-add-spa…

cc8ab5d

…m-detection

Add spam csv dictionaries to the list of exceptions

06479e3

alecslupu dismissed github-actions’s stale review via 06479e3 March 11, 2024 11:12

github-actions bot previously approved these changes Mar 11, 2024

View reviewed changes

Additional CSV

b5ce0db

alecslupu dismissed github-actions’s stale review via b5ce0db March 11, 2024 11:18

github-actions bot previously approved these changes Mar 11, 2024

View reviewed changes

Spell checks

51deb6e

alecslupu dismissed github-actions’s stale review via 51deb6e March 11, 2024 11:36

github-actions bot previously approved these changes Mar 11, 2024

View reviewed changes

alecslupu added 2 commits March 11, 2024 13:55

Merge branch 'develop' of github.com:decidim/decidim into ale-add-spa…

4e354d3

…m-detection

Spell checks

b5151f3

alecslupu dismissed github-actions’s stale review via b5151f3 March 11, 2024 12:08

github-actions bot previously approved these changes Mar 11, 2024

View reviewed changes

alecslupu added 2 commits May 1, 2024 18:52

Merge branch 'develop' of github.com:decidim/decidim into ale-add-spa…

312a2c5

…m-detection

Add Autolabeler config for Ai module

e55dd3d

alecslupu dismissed github-actions’s stale review via e55dd3d May 1, 2024 15:59

probot-autolabeler bot added configuration dependencies Pull requests that update a dependency file or issues that talk about updating dependencies labels May 1, 2024

github-actions bot approved these changes May 1, 2024

View reviewed changes

aramollis mentioned this pull request May 6, 2024

Add autoblock spam users feature decidim-ice/decidim-module-decidim_awesome#299

Open

ahukkanen requested changes May 9, 2024

View reviewed changes

ahukkanen mentioned this pull request May 20, 2024

Use content classification systems for better SPAM detection #10038

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add spam detection engine #11319

Add spam detection engine #11319

alecslupu commented Jul 22, 2023 •

edited

alecslupu Jul 22, 2023

alecslupu Jul 22, 2023

alecslupu Jul 22, 2023

alecslupu Jul 22, 2023

alecslupu Jul 22, 2023

ahukkanen left a comment

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

ahukkanen May 9, 2024

	default: "https://github.com/tremend-cofe/decidim.git",
	default: "https://github.com/decidim/decidim.git",

	@repository \|\|= options[:repository] \|\| "https://github.com/tremend-cofe/decidim.git"
	@repository \|\|= options[:repository] \|\| "https://github.com/decidim/decidim.git"

	"https://raw.githubusercontent.com/tremend-cofe/decidim/#{branch}/decidim-generators"
	"https://raw.githubusercontent.com/decidim/decidim/#{branch}/decidim-generators"

	expect(subject.edge_git_branch).to eq("ale-add-spam-detection")
	expect(subject.edge_git_branch).to eq("develop")

		@@ -0,0 +1,55 @@
		# Decidim::Ai

		The Decidim::AI is a library that aims to provide Artificial Intelligence tools for Decidim. This plugin has been initially developed aiming to analyze the content and provide spam classification using Naive Bayes algorithm.

	bundle exec rake decidim:spam:data:create_reporting_user
	bundle exec rake decidim:ai:create_reporting_user

-bundle exec rake decidim:spam:train:moderation
+bundle exec rake decidim:ai:load_plugin_dataset
+bundle exec rake decidim:ai:load_application_dataset
+bundle exec rake decidim:ai:train_using_database

		wrapped.untrain :ham, translated_attribute(resource.send(field))
		wrapped.train :spam, translated_attribute(resource.send(field))

Add spam detection engine #11319

Are you sure you want to change the base?

Add spam detection engine #11319

Conversation

alecslupu commented Jul 22, 2023 • edited

🎩 What? Why?

📌 Related Issues

Testing

📷 Screenshots

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahukkanen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alecslupu commented Jul 22, 2023 •

edited