Use content classification systems for better SPAM detection #10038

Open · 9 tasks done
andreslucena opened this issue Nov 7, 2022 · 4 comments · May be fixed by #11319
Labels: module: core, type: feature (PRs or issues that implement a new feature)

andreslucena commented Nov 7, 2022

Ref: SPAM06

This proposal was originally created by @ahukkanen and is available at
https://meta.decidim.org/processes/roadmap/f/122/proposals/16256

There are a couple of changes introduced by @decidim/product.

Is your feature request related to a problem? Please describe.

SPAM users are becoming a bigger and bigger problem for all Decidim instances. They register profiles to place advertisements in their profile bio or a SPAM link in their personal URL, and they are flooding the comments section with SPAM.

This is a real issue that is causing lots of extra work for the moderators of the platform. We should apply some automation to ease their workload.

Describe the solution you'd like

There is a gem available named Classifier Reborn which provides two alternative content classification algorithms:

  • Bayes - The system is trained using a predefined set of sentences labeled as good or bad. When classifying content, it applies a word density search for the new content against this predefined database and returns a probability that the new content is good or bad.

  • Latent Semantic Indexer (LSI) - Behaves with similar logic as above but adds semantic indexing to the equation. Slower but more flexible.

More information is available in the ClassifierReborn documentation.
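For illustration, here is a minimal sketch of how the gem's Bayes classifier is used (the training sentences are made up):

```ruby
require "classifier-reborn"

# Train a Bayes classifier with a handful of labelled sentences.
classifier = ClassifierReborn::Bayes.new "Spam", "Ham"
classifier.train "Spam", "You are the lucky winner! Claim your holiday prize."
classifier.train "Spam", "Cheap backlinks and SEO services, visit my site."
classifier.train "Ham", "I support this proposal because it improves the park."
classifier.train "Ham", "When is the next meeting of the participatory process?"

classifier.classify "Claim your free prize now"
# => "Spam"
classifier.classify_with_score "See you at the next meeting"
# => ["Ham", <log score>]
```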

Based on one of these algorithms, we could calculate a SPAM probability score for any content the user enters, as well as for the user profile itself when it is updated, because in the past years we have seen many users create SPAM profiles to get a backlink to their site for improved SEO scores.

The only automated action taken would be to report the user account (and SPAM contents) to the Moderation panel, so a human can review the report and hide or block it if it is indeed SPAM. In the future, this could evolve into automatically hiding the content once we have more experience.
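To illustrate the intended flow, here is a hedged sketch: classify_with_score comes from ClassifierReborn (as in the sketch above), while the threshold value and the report_spam! helper are hypothetical placeholders for whatever command would file the report:

```ruby
SPAM_LOG_SCORE_THRESHOLD = -5.0 # hypothetical tuning value, not from the issue

# Classify user-submitted content and, above the threshold, file a report
# for the Moderation panel. A human still reviews it and decides whether
# to hide or block; nothing is hidden automatically.
def check_for_spam(reportable, text, classifier)
  category, score = classifier.classify_with_score(text)
  return unless category == "Spam" && score > SPAM_LOG_SCORE_THRESHOLD

  # `report_spam!` is a stand-in for whatever command creates the report.
  report_spam!(reportable, reason: "spam", details: "score: #{score}")
end
```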

Describe alternatives you've considered

  • Manually moderating all users/content that are considered SPAM - very work heavy
  • Using 3rd-party APIs to detect SPAM - they are likely no better than what is suggested above, and they come with a cost (or, alternatively, a privacy impact)

Additional context

The suggested content classification systems with the predefined databases are likely to work only for English. I haven't dug into whether such databases are available for other languages.

But, in our experience, most of the SPAM users are spamming in English, so I think such classification systems could solve the problem at least for English SPAM.

If the classification needs to be applied to other languages as well, there could be some way to train the system further with other datasets. By default, it could just be trained in English to get rid of most of the SPAM users.

See original proposal at Metadecidim.

Could this issue impact users' private data?

No.

Funded by

Decidim Association

Acceptance criteria

  • Given that I'm a sysadmin
    When I run the command bin/rails decidim:spam:train:moderation
    Then the algorithm is trained with the past moderated contents (a rough sketch of these training tasks follows this list).
  • Given that I'm a sysadmin
    When I run the command bin/rails decidim:spam:train:file[path/to/file]
    Then the algorithm is trained with a spam database file.
  • Given that I'm a moderator or an admin
    When I block a participant
    Then the algorithm is trained with their profile data.
  • Given that I'm a moderator or an admin
    When I hide a content
    Then the algorithm is trained with its data.
  • Given that I'm a registered, confirmed user
    When I create a proposal with some words that appear as spam (for instance, "You are the lucky winner! Claim your holiday prize.")
    Then the system automatically reports this content.
  • Given that I'm a registered, confirmed user
    When I edit a proposal with some words that appear as spam (for instance, "You are the lucky winner! Claim your holiday prize.")
    Then the system automatically reports this content.
  • Given that I'm a registered, confirmed user
    When I create a comment with some words that appear as spam (for instance, "You are the lucky winner! Claim your holiday prize.")
    Then the system automatically reports this content.
  • Given that I'm a registered, confirmed user
    When I edit a comment with some words that appear as spam (for instance, "You are the lucky winner! Claim your holiday prize.")
    Then the system automatically reports this content.
  • Given that I'm a registered, confirmed user
    When I edit my profile with some words that appear as spam (for instance, "You are the lucky winner! Claim your holiday prize.")
    Then the system automatically reports this participant.
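To make the training tasks concrete, here is a rough sketch of how they might look. Everything beyond the task names is an assumption: the query on Decidim::Moderation, the reported_searchable_content_text accessor, the one-sample-per-line file format, and the load/save persistence helpers are illustrative only.

```ruby
# lib/tasks/decidim_spam.rake -- illustrative sketch only, not the actual
# implementation from #11319.
namespace :decidim do
  namespace :spam do
    namespace :train do
      desc "Train the classifier with previously hidden (moderated) contents"
      task moderation: :environment do
        classifier = load_classifier # assumed persistence helper

        # Assumption: hidden moderations expose the reported content's text
        # through Decidim's Reportable concern.
        Decidim::Moderation.where.not(hidden_at: nil).find_each do |moderation|
          text = moderation.reportable.reported_searchable_content_text
          classifier.train "Spam", text
        end

        save_classifier(classifier) # assumed persistence helper
      end

      desc "Train the classifier from a file with one SPAM sample per line"
      task :file, [:path] => :environment do |_task, args|
        classifier = load_classifier
        File.foreach(args[:path]) { |line| classifier.train "Spam", line.strip }
        save_classifier(classifier)
      end
    end
  end
end
```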
andreslucena added the "module: core" and "type: feature" labels Nov 7, 2022
andreslucena mentioned this issue Nov 7, 2022
alecslupu self-assigned this Nov 7, 2022
alecslupu (Contributor) commented:

Considering the ClassifierReborn documentation:

Bayesian Classifier

Bayesian Classifiers are accurate, fast, and have modest memory requirements.

Latent Semantic Indexer (LSI)

Latent Semantic Indexing engines are not as fast or as small as Bayesian classifiers, but are more flexible, providing fast search, and clustering detection as well as semantic analysis of the text that theoretically simulates human learning.

As per the documentation:
This function rebuilds the index if needs_rebuild? returns true. For very large document spaces, this indexing operation may take some time to complete, so it may be wise to place the operation in another thread.

As a rule, indexing will be fairly swift on modern machines until you have well over 500 documents indexed, or have an incredibly diverse vocabulary for your documents.

alecslupu (Contributor) commented:

Using the LSI algorithm is a no-go. Even if I cleaned the data, the indexing would take far too long.

Indexing 2,000 out of 30,000 comments produced a word list of 9,100 entries and took:

real	455m24.087s
user	0m0.156s
sys	0m0.022s
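For context, the kind of LSI indexing benchmarked above looks roughly like this with ClassifierReborn (the comments collection is assumed). Note that without the GSL bindings, ClassifierReborn's LSI runs in pure Ruby, which the gem's documentation warns is much slower and may partly explain these timings:

```ruby
require "classifier-reborn"

# Disable auto_rebuild so the expensive index build happens once at the
# end instead of after every add_item call.
lsi = ClassifierReborn::LSI.new(auto_rebuild: false)

comments.each do |comment|           # `comments` is an assumed collection
  lsi.add_item(comment.body, :ham)   # categories would come from moderation data
end

lsi.build_index # the slow step measured above
```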

carolromero (Member) commented:

@alecslupu so far, everything I have tried on your staging environment (leaving comments in different languages with the word "Scort", adding and editing profiles, proposals, etc.) has worked just fine. It would be optimal to test it with Metadecidim data, being careful with personal data.
Ping @andreslucena @ahukkanen

alecslupu linked a pull request Jan 14, 2024 that will close this issue
ahukkanen (Contributor) commented:

Since my remarks may have gotten lost and scattered over multiple issues, discussions, PRs and chats, I will post this information here collected in one piece so that it is not lost in the wilderness, as I have a feeling it might take a while to develop a good enough solution for the SPAM classification.

This issue is not only technical. It has a lot to do with the strategy on how to solve it properly. There is nothing wrong with the technical implementation done by @alecslupu (given the requirements), but the main problem is that the problem and the desired solution have not been researched and specified well enough to begin with.

TL;DR / conclusions:

  • The selected algorithm (Naïve Bayes) is fine for the given task
  • Classifying needs to be context specific (i.e. different models for content SPAM and profile SPAM), and it has to be based on multiple factors, not only the content given to the classifier
  • The classifiers need large enough sample sizes with close to equal weights (SPAM/HAM) ← This is the most difficult problem to solve
  • Content SPAM is easier to solve than profile SPAM
  • Alternative ways of fighting profile SPAM could be considered
  • Classification needs to work with several languages, even languages that are not configured for the instance, because spammers use multiple languages
  • I would suggest shipping a pre-trained model with Decidim instead of training it for each and every instance specifically, as is the strategy with the current implementation. This will work fine for the majority of cases, and the rest can train their own models for specific contexts.

What are the remaining problems?

So as I have already posted in the reviews of #11319 and #10151, the automatic classification boils down to two things:

  1. Classification needs to be context specific, i.e. the same classifier is not able to detect both content SPAM and profile SPAM with the current strategy
     • One classifier can be used for detecting content SPAM (i.e. comments, proposal content, discussion content, meeting content, etc.)
     • Another classifier is needed for classifying profile spammers, as the content there is very different from the other category
  2. Having large enough sample sets for the content to be classified, with close to equal weights: 50% SPAM and 50% HAM. My assumption is that we need around 100k entries of both SPAM and non-SPAM (i.e. "HAM") per classifier for it to be reliable enough.

How to fight SPAM?

Generally, there is no single method (such as text classification) to detect SPAM; you need to combine multiple signals based on the context to be classified. There is already some good conversation on this topic, especially regarding the spammer profiles, at #8239, which also shows that the most reliable classification method relies not only on the content these users write in their profile bio but on multiple factors in how the spammers typically behave. And as time goes on, they will learn, and the classification strategy has to be adjusted accordingly.

Here is a good resource I found regarding classifying SPAM in general:
https://systemdesignschool.io/blog/spam-detection

Note that typical email SPAM filters not only rely on the content classification but they use multiple factors to make more accurate guesses. E.g. content frequency, content tone, volume and sender verification. On top of this, some other commercial SPAM classification tools can also check the sender against blocklists or check the location of the user (which is by the way one technique we’ve used successfully in the past to fight SPAM on a Decidim website).

This also applies to detecting the typical behavior of a profile spammer on a website. Most of them come from certain areas of the world and they typically behave in a certain way: they fill out their profile description, image and URL, have a very small number of logins, typically few interactions on the platform, etc. The more of these factors are taken into account during the classification, the more accurate the guess is.

One algorithm we have successfully used in the past to find profile spammers with very high accuracy (sketched in code after the lists below):

  • Does the profile have any comments? If no, increment the profile spammer score.
  • Does the profile contain a personal URL? If yes, increment the profile spammer score.
  • Is the profile description filled in? If yes, increment the profile spammer score (in the particular contexts we have used this in, most people do not fill in the description).
  • Does the profile description contain a URL? If yes, increment the profile spammer score.
  • Which language is the profile description written in? If a language that would be rare for the context, increment the profile spammer score.
  • Which country is the user from, based on their IP? If a country that would be rare for the context, increment both the spammer and profile spammer scores.
  • Which languages are the comments written in? If languages that would be rare for the context, increment the spammer score.
  • Do the comments contain links? If yes, increment the spammer score.

This results in a list of users who are likely to be spammers, which admins can analyze further and manually decide which are spammers and which are not. There are also a few checks we have added to nullify the guess, as some factors indicate the user is not likely to be a spammer, such as:

  • If the user has an official email address of the organization managing the instance, they are not likely a spammer.
  • If the user is authorized, they are not likely a spammer.
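For illustration, here is that heuristic sketched in Ruby. The Profile struct, the rarity checks, and the nullifying predicates are stand-ins, and the separate spammer/profile-spammer scores are collapsed into a single score for brevity:

```ruby
# Illustrative sketch of the scoring heuristic described above.
Profile = Struct.new(:comments, :personal_url, :description, :country,
                     :email, :authorized, keyword_init: true)

URL_PATTERN = %r{https?://}i
RARE_COUNTRIES = [].freeze # to be filled per context

# Stubs standing in for real language detection and verification checks.
def rare_language?(_text) = false
def official_email?(profile) = profile.email.to_s.end_with?("@example.org")

def profile_spammer_score(profile)
  # Factors that indicate the user is NOT likely a spammer nullify the guess.
  return 0 if official_email?(profile) || profile.authorized

  score = 0
  score += 1 if profile.comments.empty?
  score += 1 unless profile.personal_url.to_s.strip.empty?
  score += 1 unless profile.description.to_s.strip.empty?
  score += 1 if profile.description.to_s.match?(URL_PATTERN)
  score += 1 if rare_language?(profile.description)
  score += 1 if RARE_COUNTRIES.include?(profile.country)
  score += 1 if profile.comments.any? { |c| rare_language?(c) }
  score += 1 if profile.comments.any? { |c| c.match?(URL_PATTERN) }
  score
end
```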

Also, there are other ways that are already implemented to make profile spamming less interesting for the spammers, such as limiting the link juice on the profile URLs (#12827, #8779, #9047). Another idea I've had would be to remove the personal URL from the profile completely, as at least in our experience it is largely unnecessary: almost no real users use it. Even better if we could get rid of the profile description, but I understand that on some sites it might be nice to have. And by that I really mean "nice to have", because I don't think many people find it useful, at least in our experience. Currently both of these fields mostly serve as honeypot traps to catch spammers.

This is mostly about profile SPAMming. Note that comment and content SPAM is a different problem to solve, but most of the points made here also apply there.

How does the selected algorithm (Naïve Bayes) compare to alternatives?

I have been testing different algorithms and different strategies for classifying the SPAM, including:

  • Naïve Bayes classifier as in the current implementation
  • fastText algorithm with the same classifier data
  • LLMs
  • LLMs fine-tuned with the SPAM content to be classified

First of all, comparing Naïve Bayes and fastText, I see them as equally good for the problem scope of classifying SPAM. If they are fed the same training data, they are almost equally reliable. Last year I did some tests with these on the profile descriptions at MetaDecidim, and their performance on the classification task alone was comparable. fastText can do a lot more out of the box with its pre-trained models, such as language identification and word vectors (e.g. context classification for proposals). But if we only need to detect SPAM, there is really no difference between these algorithms. There might be some nuanced differences, but overall they perform equally well with the same training data.

Regarding the LLMs, while there is a lot of hype around them being able to solve this kind of problem "automatically", there are several problems with using them here:

  1. Running a large LLM (e.g. GPT-2, Llama, Mistral) on the same machine as Decidim is not suitable. They generally require a GPU and at least 16 GB of VRAM (GPU RAM) to work fast; some require even more. Running these models on a CPU is very slow and will not be suitable for the problem scope.
  2. Running a smaller LLM also requires at least one GPU and a few GB of VRAM. I don't see this as suitable for Decidim applications either, as it would add a GPU requirement for running Decidim and would therefore make hosting it very expensive. The GPU also has to be reserved for this specific task for it to have enough memory available when needed.
  3. Even if the GPU weren't a problem, the model is only as strong as the sample data it has been trained with for the specific task. For classifying SPAM reliably with an LLM, you would need to take a pre-trained LLM and fine-tune it for the specific problem, i.e. classifying SPAM. With large enough sample sizes this would work, but it would still require a GPU and it would not be any better than the more performant alternatives, such as the Naïve Bayes that was selected for the implementation.
  4. Even if all of the above were non-issues and everyone had a cost-effective 64 GB GPU available for every Decidim server, using a GPU on top of the current infrastructure would use a lot of energy and cause a lot more carbon emissions. Not very efficient for the given problem scope.

I tested some of these models (only the open versions) with the actual data to be analyzed (i.e. MetaDecidim profiles), and none of them performed any better than the simpler classification models. They can be used for the task, but they require a lot more energy and resources, without any added benefit over the alternatives.

People get too excited when they see something behaving close to a human. The truth is that these are just algorithms that have had a lot of data fed to them. They can be used for SPAM detection, but they are definitely not the best tools for it. The SPAM detection problem boils down to classifying a text into a certain category, which can be done much more easily and efficiently.

In conclusion, I think that for the specific task at hand, Naïve Bayes is a fine selection. It just needs a lot of data samples to be reliable enough. Don't eat the LLM hype regarding this particular problem without chewing.

Once there are enough content samples, and if there is still doubt about the classification algorithm, use PyCaret to compare different classification algorithms and their precision (it does not compare against LLMs, but that does not matter):
https://pycaret.gitbook.io/docs/get-started/quickstart#compare-models

Content samples

Having large enough sample sizes for the data to be classified is the key to getting this working correctly. But there are several problems here, such as not having enough clean content to train with. I believe finding SPAM is not the problem, but rather finding enough HAM (non-SPAM).

The current implementation at #11319 relies on a sample set of 230 SPAM profiles, 5,574 SMS messages and 104 SPAM comments. The only dataset that contains non-SPAM content is the SMS set. The total content distribution is about 90% SPAM and 10% HAM. I don't have to be much of a statistical researcher or a wizard to predict that this will classify about 90% of the content as SPAM.

And just to verify this, last year I ran the current model against all profiles at MetaDecidim and it classified about 94% of them as spammers. While there are a lot of spammer profiles, manual inspection shows that it makes a lot of mistakes on genuine profiles. E.g. "product manager", "product owner", "PO at company name", "building products with purpose", "Telecommunication Engineer - Developer", filling the profile description only with emojis, or writing the profile description in any language other than English will all be classified as SPAM, just as some examples. This is comparable to a model that classifies 9/10 profiles as SPAM at random.

It is likely good at detecting SMS SPAM in its current state, but that is about all it can do. There is no SMS SPAM in Decidim. It might find some comment SPAM, but probably makes a lot of mistakes there too. It won't be able to detect profile SPAM reliably in its current state. It is only as good as the data you feed it.

We need a lot more samples of data for each content category we are classifying. The more data it is fed, the more reliable the guessing will be. And the data needs to have close to equal weights of SPAM and HAM. I would start with a ballpark of 100k samples of both HAM and SPAM per category. This means the datasets needed are:

  • 200k comments, ~50% SPAM, ~50% HAM
  • 200k profile descriptions, ~50% SPAM, ~50% HAM

I believe finding SPAM will not be the biggest problem here; finding large enough sets of good content probably will be. From MetaDecidim alone you can get about 30k SPAM profile descriptions, as most of the registered users are spammers. But the data needs to be manually classified to begin with.

Here is where LLMs might be helpful. If it's not possible to find this much actual data, you can take all the data that is available and generate the missing part with LLMs by feeding them the source data. I actually tried this as an experiment: I used only 10 HAM and 10 SPAM descriptions to generate about 200k entries using the open GPT-2 model, and fed the resulting dataset to the Naïve Bayes classifier. It categorized about 70% of the profiles as SPAM and 30% as HAM. It probably still makes a lot of mistakes, as no further work was done on that data, but it could be improved by experimenting with different models and feeding them more accurate source data. Still, it shows that this approach is likely to make fewer mistakes than the current model, because a lot more data has been fed to it. And yes, on the other hand it will also miss some spammers, but I would rather aim for a more precise model than one that makes a lot of mistakes.

The problem with generating this type of content using LLMs is, again, the data they have been fed. The GPT-2 model, for instance, puts a lot of weight on news and wiki articles, which gives you some idea of the content it will generate. The lucky part regarding the profiles context is that many news articles contain content that resembles a profile description, e.g. "[...] says Dr. John Doe, who works as the research lead at XYZ". And on the other hand, it has read a lot of ads too, as most news articles contain ads. Making these models generate the needed content might work, but it needs more research on how to develop the correct prompts and which model to use.

If this approach is used, the final content still needs to be manually classified by humans into the desired categories. Some cleanup work is also needed on the datasets, and entries that make no sense need to be omitted. This is a fair amount of manual work that can also be crowdsourced.

Content language

Another problem for a working model is the content language, which is not only English. There are two alternative ways to tackle this:

  1. Detect the content language to be analyzed and translate it to English automatically for classification
  2. Train the model in all possible languages to detect SPAM reliably in all languages

The first one would be easier to implement but, again, it requires a heavy model to run and would add a GPU requirement for the system to work fast. Detecting the language is easy and can be done quickly even on a CPU (e.g. using fastText), but translating text to another language is slower and works best on a GPU.

The remaining option is then to train the model in all the languages the spammers use, translating the training content into all of those languages. These can also include languages used by the platform, as sometimes the spammers target those languages specifically.

Also note that in some cases participants can post genuine content in languages that are not configured for the instance (there may be language groups that the city serves in their language even if they don't officially translate all the website content into those languages). So posting in a language other than those configured for the system is not always a direct indication of a spammer.

In the following table I have collected all the languages used by the spammers, based on the MetaDecidim profile analysis, as well as the languages currently supported by Decidim based on the language files shipped with the Decidim gems (I know many of these languages are not actually translated). There is an open tool called Argos Translate that supports most of these languages, the only exception being Karakalpak (kaa), which is not currently translated in Decidim anyway, although its language files are shipped with the gems. It works fairly well and is based on the data available through the OpenNMT translations.

Note that this might have changed, as the analysis was done last year, but it can serve as a starting point. The table columns are:

  • Code: language code
  • Name: language name
  • Spam: this language was used by spammers
  • Decidim: this language is shipped with the Decidim gems (although may not be fully translated)
Code Name Spam Decidim
af Afrikaans X
am Amharic X
ar Arabic X X
bg Bulgarian X X
bn Bengali X
ca / val Catalan X X
cs Czech X X
cy Chuvash X
da Danish X X
de German X X
el Greek X
en English X X
eo Esperanto X
es Spanish X X
et Estonian X X
eu Basque X X
fa Persian X X
fi Finnish X X
fr French X X
ga Irish X X
gl Galician X X
gn Guarani X
he Hebrew X X
hi Hindi X
hr Croatian X X
ht Haitian X
hu Hungarian X X
id Indonesian X X
is Icelandic X X
it Italian X X
ja Japanese X X
ka Georgian X
kaa Karakalpak X
ko Korean X X
lb Luxembourgish X
lo Lao X
lt Lithuanian X X
lv Latvian X
ms Malay X
mt Maltese X X
nl Dutch X X
no / nb Norwegian X X
oc Occitan X
om Oromo X
pl Polish X X
pt Portuguese X X
ro Romanian X X
ru Russian X X
si Sinhala X
sk Slovak X X
sl Slovenian X X
so Somali X
sq Albanian X X
sr Serbian X
sv Swedish X X
sw Swahili X X
th Thai X X
ti Tigrinya X
tl Tagalog X
tr Turkish X X
uk Ukrainian X
vi Vietnamese X
zh Chinese X X

I would assume that training a model specifically for each language would be the most reliable method, and then selecting the model based on automatic detection of the source text's language, e.g. using fastText, which is fairly reliable and also fast, and has a pre-trained model for language detection.

Having the source content only in English should not be a limiting factor, as it is possible to translate the datasets to all these languages with only one (irrelevant) exception. It might not be as reliable as using real content, but I don't see it as feasible to collect enough SPAM and HAM content in all these languages, or to find an LLM that supports all of them, so I believe some corners can be cut here.

Steps forward

With all of this I just wanted to collect the relevant points for further planning on how to proceed with this issue. It requires quite a lot more work than was initially planned.

My suggestion would be as follows:

  • The first version to ship should be only the classifier, without any sample content shipped with Decidim. This should serve as the base implementation for further work, but it would not do anything by default and it would be totally useless for most people.
  • Don't enforce the utilization of the automatic spam detector. It should be opt-in at this stage, as the implementers need to train it with their own data. Also make it clear that the classifier is only as good as the data fed to it, and that it needs a lot of data in multiple languages to work reliably. Provide documentation and examples on how to train it.
  • In order to minimize profile spamming, I would also suggest removing the personal URL field from the profiles. It gives the spammers less incentive to create these profiles in the first place. If you also remove the profile description field, you automatically cut the scope of this task in half.
  • Collect enough SPAM and HAM content for each content category, i.e. profile descriptions and online comments. Use publicly available sources that allow utilization of the data, and check the data licensing of each source carefully. The minimum would be 1,000 entries per content category and per classification, i.e. 4,000 entries in total, balanced 50% SPAM and 50% HAM.
    • Note that it is easy to find profile SPAM on almost any Decidim instance, so for the SPAM profiles, fetching that data might be enough.
  • Develop a method to generate around 100k SPAM and 100k HAM entries for each content category using an LLM. Experiment with different models and see which one works best. Examine the generated content manually and confirm that most of it fits the context. If not, adjust the prompts and the model.
  • Clean up these datasets, about 400k content entries in total, e.g. through crowdsourcing. There are platforms available that let you crowdsource this type of work for a fair price given the difficulty of the work. The content entries need to make sense to begin with, and LLMs can sometimes generate content that makes no sense or has something irrelevant embedded into it. During this phase, also remove all entries that do not make any sense. For a successful crowdsourcing project, the key is to specify clearly and specifically what the expected end result is (e.g. "remove parts that contain irrelevant phrases", "flag content that does not make any sense", etc.)
  • After the cleanup, manually classify each dataset, i.e. have humans classify it through crowdsourcing. Even if the prompt was HAM, the LLM might have created unwanted content, so it is important to have it checked by humans. To be extra sure, I would suggest classifying each entry at least 3 times, because people working on these bulk tasks sometimes cut corners and the quality of the work might be low (the more you pay, the better the quality). To avoid invalid classifications, use multiple judgements and draw the conclusion from their majority; e.g. with 3 classifications, the result should be somewhat reliable if 2 of the classifications agree.
  • Once each content category is cleaned up and manually classified, run it through machine translation into all the languages listed in the table above.
  • The final result should be a clean, classified dataset for each content category and each language, which can be fed to the classifier. Create one model per content category and per language in order to avoid conflicting words (i.e. words that mean different things in different languages).
  • Once the models are ready, adjust the classification algorithm so that the source language is detected, the correct model is selected based on that language, and the classification is done against the selected model (sketched below).
  • On top of the content classification, I would also suggest considering other factors, not only the content itself. This would make the classifier more reliable. Some examples of such factors are given earlier in this post, along with an example algorithm.
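A sketch of that final classification step. The fastText CLI call uses the pre-trained lid.176 language-identification model mentioned above; the per-language model files, their Marshal-based persistence, and the fallback logic are assumptions:

```ruby
require "open3"
require "classifier-reborn"

MODELS_DIR = "spam_models" # assumed layout: one trained Bayes model per language

# Detect the source language with fastText's pre-trained lid.176 model.
# `fasttext predict lid.176.ftz -` reads text from stdin and prints a
# label such as "__label__en".
def detect_language(text)
  out, _status = Open3.capture2("fasttext", "predict", "lid.176.ftz", "-",
                                stdin_data: text)
  out[/__label__(\w+)/, 1] || "en"
end

def classify_spam(text)
  lang = detect_language(text)
  path = File.join(MODELS_DIR, "#{lang}.dat")
  # Fall back to the English model when no model exists for the language.
  path = File.join(MODELS_DIR, "en.dat") unless File.exist?(path)
  classifier = Marshal.load(File.binread(path)) # assumed persistence format
  classifier.classify(text)
end
```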

Note that this process is not a guaranteed way to get a working classification system. A lot of the work that goes into machine learning is based on trial and error. This should give a good starting point for further work and evaluation, but there is no way to know how well it works in action other than trying it. If it fails, it will be a lot of wasted effort. If it works, it will be great. The only thing I can say is that the first version will definitely not work perfectly, but it should be more reliable than the current model. And once again, there will never be a perfect classifier, as the spammers will also learn. But if it works 80% of the time, it is 80% less work for the administrators (where this problem is not already solved through other means).

And finally, spammers are also capable of learning. Once the method is created and shipped, it will only work for a certain period of time. As time goes on, the model needs to be adjusted as the spammers adjust their methods.
