fractalliter/stackoverflow-language-clustering

Distributed K-means clustering of Stack Overflow questions and answers

The aim is to compare how questions and answers on Stack Overflow are distributed across the following programming languages, using a K-means algorithm written in Scala 3 and run with Apache Spark for distributed computing.

val langs =
    List(
      "JavaScript",
      "Java",
      "PHP",
      "Python",
      "C#",
      "C++",
      "Ruby",
      "CSS",
      "Objective-C",
      "Perl",
      "Scala",
      "Haskell",
      "MATLAB",
      "Clojure",
      "Groovy"
    )
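To make the clustering step concrete, here is a minimal sketch of a single K-means iteration on 2D points, written with plain Scala 3 collections rather than Spark so it runs on its own. All names here (`Point`, `sqDist`, `closest`, `mean`, `step`, `demo`) are hypothetical and are not taken from this project; in a Spark job the grouping and averaging would presumably run across the cluster instead of in memory.

```scala
// Minimal sketch of one K-means iteration, assuming 2D points.
// Every name below is hypothetical and not taken from the project.
type Point = (Double, Double)

// Squared Euclidean distance between two points.
def sqDist(a: Point, b: Point): Double =
  val dx = a._1 - b._1
  val dy = a._2 - b._2
  dx * dx + dy * dy

// Index of the centroid nearest to p.
def closest(p: Point, centroids: Vector[Point]): Int =
  centroids.indices.minBy(i => sqDist(p, centroids(i)))

// Component-wise mean of a non-empty group of points.
def mean(ps: Seq[Point]): Point =
  (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

// Assign each point to its nearest centroid, then move each centroid
// to the mean of its points. A centroid with no points stays in place.
def step(points: Seq[Point], centroids: Vector[Point]): Vector[Point] =
  val grouped = points.groupBy(p => closest(p, centroids))
  centroids.indices
    .map(i => grouped.get(i).map(mean).getOrElse(centroids(i)))
    .toVector

@main def demo(): Unit =
  val points = Seq((0.0, 1.0), (0.0, 3.0), (10.0, 100.0), (10.0, 200.0))
  val start  = Vector((0.0, 0.0), (10.0, 0.0))
  println(step(points, start)) // prints Vector((0.0,2.0), (10.0,150.0))
```

Repeating `step` until the centroids stop moving, or until a fixed number of iterations is reached, gives the usual K-means loop.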

Prerequisites

You need JDK 11 or higher and the sbt build tool installed on your machine.

You can check your Java version as follows:

java --version

You should see output similar to this:

openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment 18.9 (build 11.0.12+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7, mixed mode)

To install sbt, see the sbt reference manual.

How to run

At the root of the project, run:

sbt run

After a few seconds or more, the result of each iteration and the final clustering are printed to the console:

[info] Resulting clusters:
[info]   Score  Dominant language (%percent)  Questions
[info] ================================================
[info]    1546  Java              (100.0%)            8
[info]    1432  JavaScript        (100.0%)           27
[info]     722  Python            (100.0%)           34
[info]     586  C++               (100.0%)           19
[info]     574  Ruby              (100.0%)           14
[info]     548  Objective-C       (100.0%)           30
[info]     491  CSS               (100.0%)           28
[info]     485  C#                (100.0%)           63
[info]     465  PHP               (100.0%)           34
[info]     289  Perl              (100.0%)            1
[info]     279  JavaScript        (100.0%)          300
[info]     266  Scala             (100.0%)            3
[info]     182  Haskell           (100.0%)            7
[info]     180  Java              (100.0%)          395
[info]     153  Python            (100.0%)          326
[info]     141  CSS               (100.0%)          226
[info]     122  C++               (100.0%)          272
[info]     120  Ruby              (100.0%)          199
[info]      97  Objective-C       (100.0%)          408
[info]      81  C#                (100.0%)         1228
[info]      72  Clojure           (100.0%)           26
[info]      71  PHP               (100.0%)          606
[info]      60  Scala             (100.0%)          104
[info]      47  MATLAB            (100.0%)           27
[info]      46  Groovy            (100.0%)           10
[info]      35  Haskell           (100.0%)          175
[info]      32  Perl              (100.0%)          179
[info]      18  Clojure           (100.0%)          180
[info]       9  Groovy            (100.0%)          190
[info]       7  MATLAB            (100.0%)          888
[info]       4  Haskell           (100.0%)         4903
[info]       3  Scala             (100.0%)         6312
[info]       2  Perl              (100.0%)        11532
[info]       2  Python            (100.0%)        85751
[info]       2  Clojure           (100.0%)         1789
[info]       2  C++               (100.0%)        88910
[info]       1  PHP               (100.0%)       155254
[info]       1  C#                (100.0%)       177686
[info]       1  Ruby              (100.0%)        26769
[info]       1  CSS               (100.0%)        55438
[info]       1  Java              (100.0%)       188364
[info]       1  JavaScript        (100.0%)       179390
[info]       1  MATLAB            (100.0%)         6213
[info]       1  Groovy            (100.0%)         1310
[info]       1  Objective-C       (100.0%)        46504
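Each row of the table describes one cluster: a score, the dominant language with its share of the cluster, and the number of questions. The sketch below shows one way such a row could be computed from the members of a cluster. It is an illustration under stated assumptions, not the project's code: `Posting`, `Summary` and `summarize` are hypothetical names, and treating the score column as the median score of the cluster is an assumption that the output above does not confirm.

```scala
// Hypothetical record: one question with its language tag and a score.
final case class Posting(lang: String, score: Int)

// Hypothetical summary row mirroring the columns of the printed table.
final case class Summary(score: Int, lang: String, percent: Double, size: Int)

// Summarize one cluster.
// Assumption: the "Score" column is the median score of the cluster.
def summarize(cluster: Seq[Posting]): Summary =
  require(cluster.nonEmpty, "cluster must not be empty")
  // Dominant language: the tag shared by the most members.
  val (lang, members) = cluster.groupBy(_.lang).maxBy(_._2.size)
  val percent = members.size * 100.0 / cluster.size
  // Median of the scores (integer mean of the two middle values if even).
  val sorted = cluster.map(_.score).sorted
  val mid = sorted.size / 2
  val median =
    if sorted.size % 2 == 1 then sorted(mid)
    else (sorted(mid - 1) + sorted(mid)) / 2
  Summary(median, lang, percent, cluster.size)

@main def summaryDemo(): Unit =
  val cluster = Seq(Posting("Scala", 3), Posting("Scala", 5), Posting("Java", 10))
  println(summarize(cluster))
```

In the output above every cluster shows 100.0%, meaning each cluster contains questions for a single language only.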
