Detect typo-squatting #1074

Closed
carols10cents opened this issue Sep 22, 2017 · 18 comments
Labels
A-publish B-needs-investigation C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments

@carols10cents
Member

When a new crate's name is within some small edit distance of an existing crate's name, send an email to help@crates.io with a link to the new crate and a link to the crate that its name is close to?
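
A minimal sketch of the check described above, not the crates.io implementation: `levenshtein` computes the edit distance, and `close_names` is a hypothetical helper that lists existing names within a chosen distance of a new name. The function names and the `max_distance` parameter are assumptions made for illustration, and sending the email is left out.

```rust
/// Levenshtein (edit) distance between two strings, counted in Unicode
/// scalar values. Uses the classic two-row dynamic-programming table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] = distance between the first i-1 chars of `a` and the first j of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical helper: existing crate names within `max_distance` edits
/// of a newly published name (the name itself is skipped).
fn close_names<'a>(new_name: &str, existing: &[&'a str], max_distance: usize) -> Vec<&'a str> {
    existing
        .iter()
        .copied()
        .filter(|name| *name != new_name && levenshtein(new_name, *name) <= max_distance)
        .collect()
}
```

For example, `close_names("foubar", &["foobar", "foobar_plugin"], 1)` would return only `foobar`, since `foobar_plugin` is many edits away.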

@TheDan64

There were a few good ideas on the recent rust subreddit post. Maybe they could be evaluated?

@carols10cents
Member Author

It would be awesome if you could list the ideas here with pros/cons of each!

@TheDan64

Sure, I will do my best to summarize:

  • Silently flag any newly created crate whose name is within a small Levenshtein distance (< 2 or so) of the name of any of some number of popular crates. This would allow admin(s) to manually review flagged crates, possibly with a dashboard for easier use and the ability to whitelist crates. (src)
    • PRO: Levenshtein would detect crate foubar as too similar to foobar but foobar_plugin would be fine.
    • PRO: Overwhelming support if up-votes are anything to go by
    • CON: May be a lot of manual work
  • See if crate has an identical API. Possibly taking crates with build.rs files as prime suspects. (src)
    • PRO: As I understand it, a lot of typosquatted files keep the original source (or something really close to it), so an API "diff" may be able to spot similarities
    • PRO: build.rs files have the ability to run any Rust code prior to compilation, so checking for them is a great idea
    • CON: May be computationally expensive
    • CON: What if you can't actually build the API due to needing to link to some non-Rust dependency (e.g. FFI)? This might be documented in the README for example, but there's probably no way to automate that.
  • (Disclaimer: My own idea, derived from the first idea) A first time crate publish to crates.io would check the Levenshtein distance and disallow publishing when a crate name is too similar to another existing crate. It would then display a helpful error message on how to contact someone from crates.io for a manual review to validate that their repo looks legit. On a successful review, said user would be provided with a one time token that would need to be provided to the publish command. The token would only work for the same name and repository previously reviewed. This token would not be required on subsequent publishes. (src)
    • PRO: Prevents typosquatting names that are similar enough to possibly be confused with the original name (but by no means prevents all typosquatting)
    • PRO: Would be strong against automated typosquatting due to requiring manual involvement (ensure repo looks legit, send email/message, get token, etc)
    • CON: Does potentially set up a barrier for users. Though, in practice, a legitimate user probably doesn't want a name that's very similar to an existing crate most of the time.
    • CON: A sneaky user could still change their source code prior to initial (or subsequent) publish to the malicious code, but hopefully the difficulty of setting up a legitimate looking repo would dissuade this
  • npm has suffered similar problems. Since rust developers have some ties to npm developers/community, they might have valuable insight to provide (src)
    • PRO: Learning from more experienced developers is always a good idea
  • Namespaced crates (src)
    • CON: namespaces can still be typosquatted too
    • PRO: Typosquatted namespace may be easier to detect (according to one user)

@carols10cents
Member Author

My preference would be to start with the lightest weight solution here, which would be the first one you noted, which is very similar to the description of this issue. Before changing policies or putting up barriers, I would like to be notified about what is happening, how often, by whom, and have time to adjust e.g. the edit distance before taking more drastic measures.

@esclear

esclear commented Oct 1, 2017

Hey, I would be interested in implementing this.

I think that we would need a list of popular crates first (possibly, like, the 50 most downloaded crates).

Having such a list would make it possible to check whether there are already crates that might be typo-squats.

An actual implementation of just silently flagging the crate / sending the email upon creation shouldn't be hard to do in the end.

@Deedasmi
Contributor

Deedasmi commented Oct 1, 2017

Hi esclear,

@carols10cents is looking at getting me a snapshot of the db in order to look into this. I'd be happy to work with you on it.

@esclear

esclear commented Oct 4, 2017

Sure 👍

@esclear

esclear commented Oct 4, 2017

Okay, I'm currently working on doing some data analysis.

The 50 most popular crates (considering all-time downloads) so far are:

50 Most popular crates
[name]: [download count]
libc: 6436499
bitflags: 4691705
winapi: 3960881
serde: 3804623
rustc-serialize: 3662357
rand: 3522616
winapi-build: 3513904
kernel32-sys: 3324020
log: 3298796
lazy_static: 3057077
gcc: 2874197
regex-syntax: 2517388
regex: 2487540
time: 2479426
aho-corasick: 2358949
memchr: 2344215
url: 2197372
num-traits: 2138777
num: 2045803
pkg-config: 2042036
num_cpus: 2027906
semver: 2014506
utf8-ranges: 2009102
thread_local: 1982737
matches: 1878739
byteorder: 1804312
thread-id: 1704671
serde_json: 1627898
unicode-normalization: 1624605
unicode-xid: 1567442
env_logger: 1565078
strsim: 1560390
unicode-bidi: 1511795
toml: 1489054
num-integer: 1483323
num-iter: 1460484
openssl-sys: 1425247
traitobject: 1377021
rustc_version: 1337977
idna: 1318020
hyper: 1314579
term: 1296335
cfg-if: 1286001
mime: 1280755
tempdir: 1274150
itoa: 1255251
httparse: 1214742
unicase: 1198959
dtoa: 1159151
openssl: 1146845

I shall provide a list of other crates with more or less similar names to these tomorrow.

@esclear

esclear commented Oct 4, 2017

Okay, I accidentally did it right now.
Seems like Levenshtein distances above 3 produce results that don't help at all in deciding whether crates have similar names.

It turns out that for the idna, toml, itoa, term, mime and dtoa crates (plus some more), the Levenshtein approach could be a little problematic if we use distances >= 3 as an indication of possible typo-squatting, because there will be many false positives.

Using a levenshtein distance lower than 3 as an indicator of possible typo-squatting would yield the following result:

Crates with similar names to the 50 most popular crates along with the levenshtein distance
================================================================================
Crates with a name similar to 'libc':
rlibc: 1
lib: 1
libr: 1
libs: 1
libcw: 1
dlib: 2
glib: 2
irc: 2
tic: 2
ioc: 2
loirc: 2
libxm: 2
loc: 2
odbc: 2
xlib: 2
abc: 2
ilc: 2
libgo: 2
zlib: 2
lif: 2
lifx: 2
life: 2
flic: 2
lin: 2
libpm: 2
kic: 2
libdw: 2
eirc: 2
lab: 2
lia: 2
lion: 2
link: 2
itc: 2
lit: 2
zinc: 2
smbc: 2
list: 2
lisp: 2
libmcs: 2
dbc: 2
lich: 2
oic: 2
liar: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'bitflags':
--------------------------------------------------------------------------------
Crates with a name similar to 'winapi':
hidapi: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'serde':
serde1: 1
serve: 1
spade: 2
servo: 2
surge: 2
serde09: 2
serde07: 2
serde06: 2
serde08: 2
verne: 2
server: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'rustc-serialize':
--------------------------------------------------------------------------------
Crates with a name similar to 'rand':
rad: 1
ring: 2
pad: 2
random: 2
range: 2
maud: 2
ramp: 2
ransid: 2
rdrand: 2
ron: 2
strand: 2
fann: 2
raw: 2
an: 2
ann: 2
racc: 2
rink: 2
renv: 2
rsnl: 2
raft: 2
round: 2
ram: 2
rain: 2
rimd: 2
rin: 2
yard: 2
rawr: 2
fnd: 2
anl: 2
rpn: 2
crank: 2
aud: 2
rng: 2
bank: 2
rat: 2
rafy: 2
jank: 2
kard: 2
rass: 2
ang: 2
manx: 2
wan: 2
tank: 2
raml: 2
crane: 2
rapt: 2
rn: 2
run: 2
nano: 2
raui: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'winapi-build':
--------------------------------------------------------------------------------
Crates with a name similar to 'kernel32-sys':
--------------------------------------------------------------------------------
Crates with a name similar to 'log':
slog: 1
clog: 1
loc: 1
elog: 1
ulog: 1
zlog: 1
logd: 1
cog: 1
lol: 1
glob: 2
nom: 2
png: 2
lzw: 2
xdg: 2
ogg: 2
peg: 2
lz4: 2
gag: 2
sig: 2
slug: 2
svg: 2
cfg: 2
lyon: 2
ioc: 2
lua: 2
lzf: 2
dot: 2
bow: 2
pod: 2
ion: 2
ilog2: 2
pom: 2
las: 2
lux: 2
ao: 2
ron: 2
soa: 2
cow: 2
lcs: 2
po: 2
la: 2
io: 2
com: 2
ld: 2
load: 2
slag: 2
lib: 2
or: 2
to: 2
lif: 2
mod: 2
tox: 2
flow: 2
cg: 2
msg: 2
lin: 2
pos: 2
gong: 2
gg: 2
joy: 2
mg: 2
ddg: 2
lolog: 2
rox: 2
big: 2
lgl: 2
glop: 2
co: 2
cdg: 2
aof: 2
omg: 2
lde: 2
bop: 2
lab: 2
aoa: 2
lru: 2
lia: 2
go: 2
lv2: 2
lion: 2
mon: 2
xor: 2
lit: 2
nflog: 2
pop: 2
grog: 2
rpg: 2
uom: 2
ox: 2
oplog: 2
rng: 2
img: 2
blob: 2
gol: 2
mob: 2
rug: 2
plot: 2
fblog: 2
lcm: 2
cov: 2
ang: 2
vow: 2
cox: 2
cogs: 2
o2: 2
lxd: 2
sloc: 2
alo: 2
lcd: 2
colog: 2
no: 2
dok: 2
stlog: 2
dow: 2
dmg: 2
zou: 2
wol: 2
os: 2
lofi: 2
moo: 2
hg: 2
vox: 2
jot: 2
org: 2
bot: 2
lock: 2
l: 2
eom: 2
tug: 2
mov: 2
zlo: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'lazy_static':
--------------------------------------------------------------------------------
Crates with a name similar to 'gcc':
cc: 1
gc: 1
gcm: 1
ecc: 1
gcd: 1
gpc: 1
gif: 2
mac: 2
gl: 2
gfx: 2
crc: 2
gdk: 2
glx: 2
gtk: 2
rc: 2
geo: 2
gio: 2
gag: 2
irc: 2
xcb: 2
ghp: 2
tic: 2
scm: 2
gj: 2
ecs: 2
git: 2
ioc: 2
glm: 2
ocl: 2
eco: 2
cdc: 2
sct: 2
grpc: 2
sc: 2
gauc: 2
loc: 2
rci: 2
vnc: 2
spc: 2
lcs: 2
nfc: 2
ocf: 2
abc: 2
ucl: 2
sdc: 2
ilc: 2
pcb: 2
ncl: 2
gmp: 2
gml: 2
gst: 2
c: 2
rwc: 2
vcd: 2
racc: 2
gpx: 2
pct: 2
hlc: 2
tcp: 2
ice: 2
mc: 2
mcmc: 2
gip: 2
cg: 2
wlc: 2
jch: 2
gel: 2
rcu: 2
ct: 2
pcp: 2
egc: 2
c4: 2
gg: 2
jec: 2
fcm: 2
gr: 2
pci: 2
gw2: 2
pcx: 2
gdb: 2
kic: 2
co: 2
pcsc: 2
opc: 2
orc: 2
gds: 2
ecp: 2
blc: 2
jsc: 2
gph: 2
ucd: 2
gbm: 2
go: 2
muc: 2
kcp: 2
itc: 2
mcs: 2
bfc: 2
rfc: 2
ccv: 2
mcq: 2
rc4: 2
uci: 2
act: 2
gql: 2
rc2: 2
mpc: 2
gol: 2
rc5: 2
hc: 2
rc6: 2
nccl: 2
c3: 2
lcm: 2
gpt: 2
dbc: 2
ger: 2
tac: 2
wrc: 2
lcd: 2
rpc: 2
oic: 2
xch: 2
fac: 2
ucx: 2
dmc: 2
gem: 2
ci: 2
ca: 2
ghcl: 2
fc: 2
svc: 2
b2c: 2
ga: 2
vkc: 2
sci: 2
aac: 2
ge: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'regex-syntax':
--------------------------------------------------------------------------------
Crates with a name similar to 'regex':
reep: 2
redox: 2
verex: 2
rex: 2
redux: 2
rget: 2
rhex: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'time':
mime: 1
timer: 1
utime: 1
time2: 1
timi: 1
ptime: 1
toml: 2
take: 2
home: 2
simd: 2
tree: 2
tic: 2
timeit: 2
nice: 2
tee: 2
tape: 2
timely: 2
tiled: 2
wire: 2
pipe: 2
trie: 2
file: 2
cite: 2
tin: 2
kite: 2
uptime: 2
twre: 2
game: 2
tick: 2
gimei: 2
ice: 2
life: 2
tini: 2
pine: 2
nine: 2
rimd: 2
timmy: 2
sim: 2
timber: 2
im: 2
tinf: 2
rome: 2
twine: 2
img: 2
vice: 2
dice: 2
tiger: 2
tiny: 2
tml: 2
bite: 2
lame: 2
tma: 2
rie: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'aho-corasick':
--------------------------------------------------------------------------------
Crates with a name similar to 'memchr':
memcmp: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'url':
curl: 1
ucl: 1
uil: 1
uri: 1
rurl: 1
xrl: 1
cgl: 2
gl: 2
crc: 2
ar: 2
rc: 2
hsl: 2
irc: 2
utp: 2
err: 2
ruru: 2
ocl: 2
sql: 2
usb: 2
afl: 2
egl: 2
r0: 2
mtl: 2
rb: 2
qml: 2
rel: 2
udt: 2
r: 2
hal: 2
stl: 2
sdl: 2
rural: 2
ncl: 2
curs: 2
gml: 2
cql: 2
dl: 2
ml: 2
dtl: 2
ers: 2
or: 2
oil: 2
cal: 2
udp: 2
utm: 2
mel: 2
rlp: 2
uwp: 2
gel: 2
rsl: 2
srv: 2
rx: 2
rure: 2
tql: 2
gr: 2
dual: 2
uio: 2
lgl: 2
rla: 2
orc: 2
ucd: 2
pal: 2
lru: 2
orf: 2
psl: 2
anl: 2
rcurl: 2
uom: 2
rs: 2
mml: 2
uci: 2
drm: 2
gql: 2
rp: 2
usi: 2
irs: 2
gol: 2
rls: 2
rurel: 2
tml: 2
rt: 2
orm: 2
qrs: 2
wrc: 2
srp: 2
tpl: 2
srt: 2
kr: 2
rlq: 2
u9: 2
wol: 2
ucx: 2
try: 2
ur20: 2
u2f: 2
dgl: 2
rrt: 2
surt: 2
rn: 2
nl: 2
mrh: 2
org: 2
kerl: 2
l: 2
dhl: 2
lol: 2
quil: 2
zr: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'num-traits':
numtraits: 1
enum_traits: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'num':
nom: 1
nue: 1
rum: 1
unum: 1
dnum: 1
npm: 1
nvm: 1
nix: 2
pem: 2
ntp: 2
scm: 2
lua: 2
glm: 2
pom: 2
lux: 2
bus: 2
nn: 2
nfc: 2
stm: 2
nx: 2
fsm: 2
gcm: 2
drum: 2
nemo: 2
ncl: 2
dux: 2
net: 2
mux: 2
nfd: 2
pam: 2
nbt: 2
rump: 2
avm: 2
nasm: 2
rui: 2
com: 2
sun: 2
mm: 2
numer: 2
kvm: 2
fun: 2
hue: 2
nps: 2
bud: 2
utm: 2
osm: 2
numrs: 2
mmm: 2
cue: 2
nss: 2
nlp: 2
nio: 2
ram: 2
dump: 2
itm: 2
y4m: 2
shm: 2
tui: 2
sem: 2
m: 2
fcm: 2
du: 2
ndn: 2
sim: 2
dym: 2
gbm: 2
im: 2
muc: 2
jump: 2
nomi: 2
tuf: 2
uom: 2
numpy: 2
aud: 2
npy: 2
drm: 2
rux: 2
xpm: 2
kus: 2
dui: 2
ruma: 2
tun: 2
rdm: 2
rtm: 2
nxu: 2
rug: 2
lcm: 2
svm: 2
orm: 2
hmm: 2
dua: 2
nsq: 2
unums: 2
out: 2
cui: 2
jam: 2
mem: 2
no: 2
nvml: 2
u9: 2
na: 2
gem: 2
pm: 2
evm: 2
nes: 2
pvm: 2
rumo: 2
nl: 2
tupm: 2
run: 2
dup: 2
eom: 2
tug: 2
rpm: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'pkg-config':
--------------------------------------------------------------------------------
Crates with a name similar to 'num_cpus':
--------------------------------------------------------------------------------
Crates with a name similar to 'semver':
server: 1
stemmer: 2
weaver: 2
seer: 2
serve: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'utf8-ranges':
--------------------------------------------------------------------------------
Crates with a name similar to 'thread_local':
--------------------------------------------------------------------------------
Crates with a name similar to 'matches':
mates: 2
matchdb: 2
matcha: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'byteorder':
--------------------------------------------------------------------------------
Crates with a name similar to 'thread-id':
--------------------------------------------------------------------------------
Crates with a name similar to 'serde_json':
serde-hjson: 2
serde_ubjson: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'unicode-normalization':
--------------------------------------------------------------------------------
Crates with a name similar to 'unicode-xid':
unicode-bidi: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'env_logger':
--------------------------------------------------------------------------------
Crates with a name similar to 'strsim':
strum: 2
stream: 2
strom: 2
stdsimd: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'unicode-bidi':
unicode-xid: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'toml':
tool: 1
tml: 1
tomlq: 1
time: 2
nom: 2
home: 2
atom: 2
mowl: 2
sfml: 2
tobj: 2
stomp: 2
ocl: 2
yaml: 2
pool: 2
pom: 2
tofu: 2
qml: 2
gml: 2
topd: 2
com: 2
ml: 2
comm: 2
to: 2
oil: 2
tox: 2
html: 2
tql: 2
soma: 2
omg: 2
coll: 2
nomi: 2
timi: 2
totp: 2
rome: 2
uom: 2
mml: 2
gol: 2
comp: 2
goal: 2
tors: 2
atoms: 2
tolk: 2
tpl: 2
raml: 2
nvml: 2
toks: 2
pomf: 2
wol: 2
rofl: 2
tma: 2
eom: 2
lol: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'num-integer':
--------------------------------------------------------------------------------
Crates with a name similar to 'num-iter':
enumiter: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'openssl-sys':
openssl2-sys: 1
openal-sys: 2
openssl-src: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'traitobject':
--------------------------------------------------------------------------------
Crates with a name similar to 'rustc_version':
rust-version: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'idna':
idea: 1
itoa: 2
ena: 2
id3: 2
ion: 2
ilda: 2
dns: 2
iota: 2
icns: 2
edn: 2
mdns: 2
ndn: 2
ideal: 2
dua: 2
dfa: 2
idmap: 2
na: 2
daa: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'hyper':
hypr: 1
caper: 2
hopper: 2
hypox: 2
kyber: 2
typed: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'term':
tera: 1
xterm: 1
aterm: 1
dterm: 1
tar: 2
fern: 2
pem: 2
err: 2
zero: 2
tee: 2
relm: 2
ber: 2
geom: 2
text: 2
webm: 2
ers: 2
twre: 2
item: 2
sem: 2
ver: 2
lerp: 2
reru: 2
derp: 2
drm: 2
bert: 2
teko: 2
tors: 2
orm: 2
ger: 2
peri: 2
mem: 2
tea: 2
nero: 2
try: 2
gem: 2
evm: 2
utem: 2
terra: 2
xero: 2
beam: 2
tupm: 2
kerl: 2
form: 2
eom: 2
werk: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'cfg-if':
--------------------------------------------------------------------------------
Crates with a name similar to 'mime':
time: 1
mio: 2
miow: 2
home: 2
simd: 2
mpmc: 2
timer: 2
utime: 2
nice: 2
mint: 2
wire: 2
pipe: 2
midi: 2
file: 2
cite: 2
mm: 2
kite: 2
game: 2
gimei: 2
ice: 2
time2: 2
life: 2
mcmc: 2
mmm: 2
pine: 2
nine: 2
miso: 2
rimd: 2
mage: 2
mvge: 2
sim: 2
mimic: 2
im: 2
maze: 2
timi: 2
rome: 2
mml: 2
img: 2
ptime: 2
vice: 2
dice: 2
mimir: 2
mines: 2
bite: 2
lame: 2
mfte: 2
mem: 2
mief: 2
mote: 2
rie: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'tempdir':
--------------------------------------------------------------------------------
Crates with a name similar to 'itoa':
dtoa: 1
ftoa: 1
idna: 2
iron: 2
atom: 2
ioc: 2
ion: 2
soa: 2
ilda: 2
atoi: 2
io: 2
iota: 2
to: 2
i-o: 2
tox: 2
tba: 2
itm: 2
item: 2
aoa: 2
ioat: 2
inox: 2
ithos: 2
itc: 2
idea: 2
xtea: 2
tea: 2
btoi: 2
pitot: 2
tma: 2
ctop: 2
ta: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'httparse':
http2parse: 2
pktparse: 2
srtparse: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'unicase':
unbase: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'dtoa':
itoa: 1
ftoa: 1
atom: 2
dyon: 2
dot: 2
soa: 2
atoi: 2
dtl: 2
dmoj: 2
to: 2
dbox: 2
tox: 2
tba: 2
aoa: 2
xtea: 2
dua: 2
dfa: 2
dok: 2
tea: 2
dow: 2
dtab: 2
btoi: 2
drow: 2
tma: 2
daa: 2
ctop: 2
ta: 2
--------------------------------------------------------------------------------
Crates with a name similar to 'openssl':
openal: 2
opemssh: 2
--------------------------------------------------------------------------------
================================================================================

Thus, I would suggest treating names as possible typo-squats if:

  • The name is not longer than four characters and the Levenshtein distance to a popular crate's name is smaller than 2.
    Or
  • The name is at least four characters long and the Levenshtein distance is smaller than or equal to 3.
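
One possible reading of this rule as code, offered only as a sketch. The two bullets overlap for names of exactly four characters; the version below assumes the stricter bound applies there. The function name, and the choice to measure the candidate name's length rather than the popular crate's, are also assumptions. The distance itself is expected to be computed elsewhere.

```rust
/// Length-dependent threshold, under one reading of the suggestion above.
///
/// `candidate_len` is the length of the candidate crate name in characters;
/// `distance` is its Levenshtein distance to a popular crate's name.
fn is_possible_typosquat(candidate_len: usize, distance: usize) -> bool {
    if distance == 0 {
        // Identical names: this is the popular crate itself, not a squat.
        return false;
    }
    if candidate_len <= 4 {
        // Short names collide easily, so only distance 1 counts.
        distance < 2
    } else {
        // Longer names tolerate a wider distance before false positives dominate.
        distance <= 3
    }
}
```

Applied to the listing above, `rlibc` (five characters, distance 1 from `libc`) would be flagged, while `glib` (four characters, distance 2) would not.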

@Deedasmi
Contributor

Deedasmi commented Oct 5, 2017

Because of #159 I've been hesitant to look at this via crates.io search.

I agree that we will have to adjust distances based on word length though.

@esclear

esclear commented Oct 6, 2017

This wasn't done via the search. I got a list of all package names from the crates.io-index and the 50 most popular crates along with downloads from the API.

After discussing typo-squatting with some friends, my opinion is that it would be sufficient to flag any crate whose name is within a Levenshtein distance of 1 of a popular crate's name.
A developer is unlikely to make two typos and still land on a typo-squat crate: at higher distances the number of possible misspellings grows, so an adversary would have to publish many more crates to increase the chances of a developer actually using the typo-squat crate.
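
If only distance 1 matters, the full dynamic-programming table is not needed. A sketch of a linear-time check follows; `is_one_edit_apart` is a hypothetical name, and the function treats a single insertion, deletion, or substitution as one edit (transpositions are not counted as one edit here, matching plain Levenshtein distance).

```rust
/// True when `a` and `b` differ by exactly one insertion, deletion,
/// or substitution of a single character.
fn is_one_edit_apart(a: &str, b: &str) -> bool {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // Make `short` the shorter (or equal-length) name.
    let (short, long) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    if long.len() - short.len() > 1 {
        return false;
    }

    // Length of the common prefix.
    let prefix = short.iter().zip(&long).take_while(|(x, y)| x == y).count();
    if prefix == long.len() {
        return false; // identical names
    }

    if short.len() == long.len() {
        // Substitution: everything after the mismatch must agree.
        short[prefix + 1..] == long[prefix + 1..]
    } else {
        // Insertion/deletion: skipping one char of `long` must realign the rest.
        short[prefix..] == long[prefix + 1..]
    }
}
```

With this check, `serde1` against `serde` and `hypr` against `hyper` would be flagged, while `toml` against `time` would not.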

@Eh2406
Contributor

Eh2406 commented Oct 13, 2017

I too have some prior work on this and would love to be involved in moving this forward!

I was starting to research adding a typo check to cargo-edit. It would be convenient if there was an API for getting the possible typos from crates.io. It would also be nice if they appeared prominently in the search results. For a good, but non-malicious, example, I think request should suggest reqwest.

Perhaps a link on each crate's page: "Not what you are looking for? Try crates with similar names." Then a page sorting crates from newest to oldest, with a link to similar names and the suggestion to e-mail help@crates.io if you see something suspicious?

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works and removed C-feature-request labels Feb 11, 2021
@epompeii

Resurrecting this thread in light of recent events. I have a proposed solution that is a bit of a mix of @TheDan64's points number one and two.

Proposed Solution:

Whenever a new crate is published on crates.io, check whether another similarly named crate already exists, using Levenshtein distance as mentioned above. If it does, perform a basic code comparison, and if the code is substantially similar:

  • Alert the Rust Security Response WG
  • Add a visible "Under Security Review" flag on crates.io
  • Print a warning before downloading the crate via cargo add with a "Did you mean ___?" message.

The parameters of the Levenshtein distance used could be tuned as needed to help optimize the number of code comparisons performed. Also, the relative popularity of a crate may need to be taken into consideration, both in terms of risk and in terms of prioritization for the Rust Security Response WG.
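
A rough sketch of how the two-stage check could fit together. The proposal does not say what a "basic code comparison" is; the Jaccard similarity over source lines used here is a stand-in chosen for brevity, and `line_similarity`, `triage`, `Action`, and the threshold parameter are all hypothetical names. The name-closeness result is assumed to come from a separate Levenshtein check.

```rust
use std::collections::HashSet;

/// Non-empty source lines with surrounding whitespace trimmed.
fn normalized_lines(source: &str) -> HashSet<&str> {
    source
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Jaccard similarity of the two sources' line sets, in the range 0.0..=1.0.
fn line_similarity(source_a: &str, source_b: &str) -> f64 {
    let (a, b) = (normalized_lines(source_a), normalized_lines(source_b));
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0; // two empty sources: nothing to compare
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// Hypothetical outcome of the publish-time check.
#[derive(Debug, PartialEq)]
enum Action {
    Publish,
    FlagForReview,
}

/// Flag only when the name is close AND the code is substantially similar.
fn triage(name_is_close: bool, similarity: f64, threshold: f64) -> Action {
    if name_is_close && similarity >= threshold {
        Action::FlagForReview
    } else {
        Action::Publish
    }
}
```

Gating the code comparison on name closeness keeps the expensive step rare, which is the tuning knob the paragraph above refers to.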

I was originally thinking this should just be a cargo feature, but I think this would be better handled centrally on crates.io with the cargo add behavior simply utilizing it.

Links:

@jbg

jbg commented May 13, 2022

I like the proposal except that I worry that the warning won't be seen by most people since it depends on the use of the non-built-in cargo add. I always edit Cargo.toml directly to add dependencies and as far as I'm aware all my colleagues do as well. With a warning on cargo add there would also be no warning for transitive dependencies on typosquatted crates. I filed EmbarkStudios/cargo-deny#421 which might help for users of cargo-deny.

Ideally a warning could be printed by something within cargo itself, rather than a third-party plugin (cargo-edit / cargo-deny), but that's a bit tricky since if you don't run cargo update and just directly cargo build after editing Cargo.toml, the malicious code could already be running by the time you see the warning.

@jbg

jbg commented May 13, 2022

Also note that if the manual review approach is taken, it would be necessary to review each version, otherwise a simple avoidance of the protection is to upload the initial release of a typosquatted crate with a small bugfix (so it looks like you just needed to publish a fork with the fix) and then once it passes security review, publish an update with the malicious code.

@epompeii

Yeah, I very much agree that it will be difficult to help cover all workflows.
Though cargo add is now a mainline feature of cargo itself: rust-lang/cargo#10472

And that is another good point, there may need to be some form of perpetual/on-going checks.

@Eh2406
Contributor

Eh2406 commented May 17, 2022

We should also compare notes with other communities that have tried it in the past or have it now.
https://docs.google.com/spreadsheets/d/12QlaYEtcp2ZwZRfZPHR4D3YpY8k770hYBeFQ6-N7Mts/edit#gid=1022416269 row 10, suggests that:

  • RubyGems.org has something like this (see "Consider relaxing levenshtein distance rules", rubygems/rubygems.org#2058)
  • Packagist/Composer has "new packages require minimum distance from popularly downloaded package names, manual review after submission for others"
  • ConanCenter has a manual review
  • PyPI has some systems
  • npm "Yes, but there is room for improvement"

That document was collected by the OpenSSF Working Group on Securing Software Repositories, so when we have a proposal we can ask for people's input there.

@Turbo87
Member

Turbo87 commented Feb 12, 2024

We integrated https://github.com/rustfoundation/typomania last year and are expanding its integration in the near future. I guess this means the original issue is resolved :)

@Turbo87 Turbo87 closed this as completed Feb 12, 2024

9 participants