Comparing changes

base repository: gabriel-vasile/mimetype
base: v1.4.5
head repository: gabriel-vasile/mimetype
compare: v1.4.6
  • 15 commits
  • 21 files changed
  • 5 contributors

Commits on Aug 6, 2024

  1. default ftyp detection to mp4; fix #562

    There are too many ftyp codes to look for (about 100 registered + the
    unregistered ones.) Previously all ftyps were on the same level in the
    detection tree. Now mp4 is parent to all other ftyps.
    gabriel-vasile committed Aug 6, 2024
    71a146e
  2. 56bdf43
  3. add support for .dvb video/vnd.dvb.file

    gabriel-vasile committed Aug 6, 2024
    31e6b1c

Commits on Aug 29, 2024

  1. d7081cc
  2. use a pool of buffers to alleviate memory allocs in csv; related to #553

    
    
    When iterating over multiple files, csv detector allocated a new buffer
    for each file. This change adds a pool of buffers that can be reused
    between detections. The same pool is shared between csv and tsv
    detectors.
    gabriel-vasile committed Aug 29, 2024
    4cc383c

Commits on Sep 26, 2024

  1. feat: Add parquet file detection (#578)

    Adds parquet file detection. See [docs](https://parquet.apache.org/docs/file-format/)
    for specification
    
    Co-authored-by: Keith Kelly <kkelly@morningconsult.com>
    kwkelly and Keith Kelly authored Sep 26, 2024
    c4abedc

Commits on Sep 30, 2024

  1. add application/xml as alias; close #227 (#581)

    According to RFC7303, text/xml is an alias to application/xml. But
    considering we were using text/xml as the main mime type until now,
    changing to main=application/xml alias=text/xml would cause trouble to
    users. So for now, we're keeping it as: main=text/xml alias=application/xml
    gabriel-vasile authored Sep 30, 2024
    c78cb11
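
The main/alias arrangement described in this commit can be pictured with a small sketch. The `mimeEntry` type and its `is` method below are hypothetical names made up for illustration; they are not the library's types. The sketch only assumes what the commit message states: text/xml stays the main type and application/xml is accepted as an alias.

```go
package main

import "fmt"

// mimeEntry is a hypothetical record pairing a main MIME type with its aliases.
type mimeEntry struct {
	main    string
	aliases []string
}

// is reports whether expected equals the main type or any of the aliases.
func (m mimeEntry) is(expected string) bool {
	if expected == m.main {
		return true
	}
	for _, a := range m.aliases {
		if a == expected {
			return true
		}
	}
	return false
}

func main() {
	// Arrangement kept by the commit: main=text/xml alias=application/xml.
	xml := mimeEntry{main: "text/xml", aliases: []string{"application/xml"}}
	fmt.Println(xml.main)                  // text/xml
	fmt.Println(xml.is("application/xml")) // true
	fmt.Println(xml.is("application/json")) // false
}
```

With this shape, callers that compare against either name keep working, while the string reported as the main type does not change.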

Commits on Oct 10, 2024

  1. Make mso detection work similar to what file/file does

    https://github.com/file/file/blob/7c62d696b06e53fc5be015c41a57513278ac6c54/magic/Magdir/msooxml
    The algorithm is not 100% reliable. For example, a
    zero-compression zip containing a docx will still sometimes be detected
    as docx instead of zip (it depends on how many files and the order of
    files in the zip)
    
    The second thing in this PR is removing some test data fixtures.
    From now on, I'll try as much as possible to write regular unit tests
    without relying on test file fixtures. #575 (comment)
    related #550 #575
    closes #400
    gabriel-vasile committed Oct 10, 2024
    c6c5e4f

Commits on Oct 11, 2024

  1. add benchmark action that leaves comment on PR (#588)

    * better benchmarks
    
    benchmark each detector with negative and positive inputs
    
    * add benchmark action that leaves comment on PR
    gabriel-vasile authored Oct 11, 2024
    7798415

Commits on Oct 13, 2024

  1. Bump the github-actions group across 1 directory with 2 updates (#586)

    Bumps the github-actions group with 2 updates in the / directory: [actions/checkout](https://github.com/actions/checkout) and [github/codeql-action](https://github.com/github/codeql-action).
    dependabot[bot] authored Oct 13, 2024
    9349e46
  2. Bump golang.org/x/net in the gomod group across 1 directory (#585)

    Bumps the gomod group with 1 update in the / directory: [golang.org/x/net](https://github.com/golang/net).
    dependabot[bot] authored Oct 13, 2024
    3cf98ef
  3. retract v1.4.4; closes #575. (#591)

    * retract v1.4.4; closes #575
    gabriel-vasile authored Oct 13, 2024
    fd16da2
  4. action for benchmarking detectors (#590)

    * add action for benchmarking each detector
    gabriel-vasile authored Oct 13, 2024
    458b62d

Commits on Oct 14, 2024

  1. Bump actions/checkout from 4.1.7 to 4.2.1 in the github-actions group (#592)

    
    Bumps the github-actions group with 1 update: [actions/checkout](https://github.com/actions/checkout).
    
    
    Updates `actions/checkout` from 4.1.7 to 4.2.1
    - [Release notes](https://github.com/actions/checkout/releases)
    - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
    - [Commits](actions/checkout@v4.1.7...v4.2.1)
    
    ---
    updated-dependencies:
    - dependency-name: actions/checkout
      dependency-type: direct:production
      update-type: version-update:semver-minor
      dependency-group: github-actions
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    dependabot[bot] authored Oct 14, 2024
    8a780a5
  2. Remove GPL test file (#583)

    Co-authored-by: Gabriel Vasile <gabriel.vasile@email.com>
    canadacow and gabriel-vasile authored Oct 14, 2024
    2998a94
32 changes: 32 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,32 @@
name: Run benchmarks
on:
  pull_request:
    branches: [master]

permissions:
  contents: read

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
    # Base for comparison is master branch.
    - name: Checkout code
      uses: actions/checkout@v4.2.1
      with:
        ref: master
    - name: Install Go
      uses: actions/setup-go@v5.0.2
      with:
        go-version-file: 'go.mod'

    # 30 runs with 100ms benchtime seem to result in acceptable p-values.
    # With count=10 results were unreliable, likely because the Actions runner
    # is a shared environment where CPU and memory are affected by other jobs.
    - run: go test -run=none -bench=. -count=30 -benchtime=100ms -timeout=20m > /tmp/prev
    - name: Checkout code
      uses: actions/checkout@v4.2.1
    - run: go test -run=none -bench=. -count=30 -benchtime=100ms -timeout=20m > /tmp/curr

    - run: go install golang.org/x/perf/cmd/benchstat@latest
    - run: benchstat /tmp/prev /tmp/curr
6 changes: 3 additions & 3 deletions .github/workflows/codeql.yml
@@ -20,13 +20,13 @@ jobs:

steps:
- name: Check out code
uses: actions/checkout@v4.1.7
uses: actions/checkout@v4.2.1

- name: Initialize CodeQL
uses: github/codeql-action/init@v3.25.14
uses: github/codeql-action/init@v3.26.12
with:
languages: go
queries: security-and-quality

- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v3.25.14
uses: github/codeql-action/analyze@v3.26.12
9 changes: 3 additions & 6 deletions .github/workflows/go.yml
@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4.1.7
uses: actions/checkout@v4.2.1
- name: Install Go
uses: actions/setup-go@v5.0.2
with:
@@ -24,13 +24,10 @@ jobs:
version: "v1.58"

test:
strategy:
matrix:
platform: [ubuntu-latest, windows-latest]
runs-on: ${{ matrix.platform }}
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4.1.7
uses: actions/checkout@v4.2.1
- name: Install Go
if: success()
uses: actions/setup-go@v5.0.2
5 changes: 4 additions & 1 deletion go.mod
@@ -2,4 +2,7 @@ module github.com/gabriel-vasile/mimetype

go 1.20

require golang.org/x/net v0.27.0
require golang.org/x/net v0.30.0

// v1.4.4 had a test file detected as malicious by antivirus software. #575
retract v1.4.4
4 changes: 2 additions & 2 deletions go.sum
@@ -1,2 +1,2 @@
golang.org/x/net v0.27.0 h1:5K3Njcw06/l2y9vpGCSdcxWOYHOUk3dVNGDXN+FvAys=
golang.org/x/net v0.27.0/go.mod h1:dDi0PyhWNoiUOrAS8uXv/vnScO4wnHQO4mj9fn/RytE=
golang.org/x/net v0.30.0 h1:AcW1SDZMkb8IpzCdQUaIq2sP4sZ4zw+55h6ynffypl4=
golang.org/x/net v0.30.0/go.mod h1:2wGyMJ5iFasEhkwi13ChkO/t1ECNC4X4eBKkVFyYFlU=
2 changes: 2 additions & 0 deletions internal/magic/binary.go
@@ -21,6 +21,8 @@ var (
SWF = prefix([]byte("CWS"), []byte("FWS"), []byte("ZWS"))
// Torrent has bencoded text in the beginning.
Torrent = prefix([]byte("d8:announce"))
// Par1 matches a Parquet file.
Par1 = prefix([]byte{0x50, 0x41, 0x52, 0x31})
)

// Java bytecode and Mach-O binaries share the same magic number.
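
The Parquet check added above is a plain prefix match on the four bytes `PAR1`. A minimal standalone sketch follows; `isParquet` and `parquetMagic` are hypothetical names used here in place of the package's internal `prefix` helper, whose implementation is not part of this diff.

```go
package main

import (
	"bytes"
	"fmt"
)

// parquetMagic is the 4-byte marker "PAR1", the same bytes used in the diff.
var parquetMagic = []byte{0x50, 0x41, 0x52, 0x31}

// isParquet is a hypothetical stand-in for the Par1 detector: it reports
// whether raw begins with the "PAR1" marker.
func isParquet(raw []byte) bool {
	return bytes.HasPrefix(raw, parquetMagic)
}

func main() {
	fmt.Println(isParquet([]byte("PAR1\x15\x04"))) // true
	fmt.Println(isParquet([]byte("PK\x03\x04")))   // false
	fmt.Println(isParquet([]byte("PAR")))          // false, input too short
}
```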
43 changes: 32 additions & 11 deletions internal/magic/ftyp.go
@@ -1,22 +1,14 @@
package magic

import "bytes"
import (
"bytes"
)

var (
// AVIF matches an AV1 Image File Format still or animated.
// Wikipedia page seems outdated listing image/avif-sequence for animations.
// https://github.com/AOMediaCodec/av1-avif/issues/59
AVIF = ftyp([]byte("avif"), []byte("avis"))
// Mp4 matches an MP4 file.
Mp4 = ftyp(
[]byte("avc1"), []byte("dash"), []byte("iso2"), []byte("iso3"),
[]byte("iso4"), []byte("iso5"), []byte("iso6"), []byte("isom"),
[]byte("mmp4"), []byte("mp41"), []byte("mp42"), []byte("mp4v"),
[]byte("mp71"), []byte("MSNV"), []byte("NDAS"), []byte("NDSC"),
[]byte("NSDC"), []byte("NSDH"), []byte("NDSM"), []byte("NDSP"),
[]byte("NDSS"), []byte("NDXC"), []byte("NDXH"), []byte("NDXM"),
[]byte("NDXP"), []byte("NDXS"), []byte("F4V "), []byte("F4P "),
)
// ThreeGP matches a 3GPP file.
ThreeGP = ftyp(
[]byte("3gp1"), []byte("3gp2"), []byte("3gp3"), []byte("3gp4"),
@@ -53,6 +45,17 @@ var (
Heif = ftyp([]byte("mif1"), []byte("heim"), []byte("heis"), []byte("avic"))
// HeifSequence matches a High Efficiency Image File Format (HEIF) file sequence.
HeifSequence = ftyp([]byte("msf1"), []byte("hevm"), []byte("hevs"), []byte("avcs"))
// Mj2 matches a Motion JPEG 2000 file: https://en.wikipedia.org/wiki/Motion_JPEG_2000.
Mj2 = ftyp([]byte("mj2s"), []byte("mjp2"), []byte("MFSM"), []byte("MGSV"))
// Dvb matches a Digital Video Broadcasting file: https://dvb.org.
// https://cconcolato.github.io/mp4ra/filetype.html
// https://github.com/file/file/blob/512840337ead1076519332d24fefcaa8fac36e06/magic/Magdir/animation#L135-L154
Dvb = ftyp(
[]byte("dby1"), []byte("dsms"), []byte("dts1"), []byte("dts2"),
[]byte("dts3"), []byte("dxo "), []byte("dmb1"), []byte("dmpf"),
[]byte("drc1"), []byte("dv1a"), []byte("dv1b"), []byte("dv2a"),
[]byte("dv2b"), []byte("dv3a"), []byte("dv3b"), []byte("dvr1"),
[]byte("dvt1"), []byte("emsg"))
// TODO: add support for remaining video formats at ftyps.com.
)

@@ -86,3 +89,21 @@ func QuickTime(raw []byte, _ uint32) bool {
}
return bytes.Equal(raw[:8], []byte("\x00\x00\x00\x08wide"))
}

// Mp4 detects an .mp4 file. Mp4 detection only does a basic ftyp check.
// Mp4 has many registered and unregistered code points so it's hard to keep track
// of all. Detection will default to video/mp4 for all ftyp files.
// ISO_IEC_14496-12 is the specification for the iso container.
func Mp4(raw []byte, _ uint32) bool {
	if len(raw) < 12 {
		return false
	}
	// ftyps are made out of boxes. The first 4 bytes of the box represent
	// its size in big-endian uint32. First box is the ftyp box and it is small
	// in size. Check most significant byte is 0 to filter out false positive
	// text files that happen to contain the string "ftyp" at index 4.
	if raw[0] != 0 {
		return false
	}
	return bytes.Equal(raw[4:8], []byte("ftyp"))
}
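
The parent/child arrangement from the first commit (mp4 as parent of all other ftyps) can be sketched roughly as below. This is an illustration under stated assumptions, not the library's detection tree: `isFtyp`, `hasBrand` and `classify` are hypothetical names, and only the AVIF brands from the diff are used as an example child.

```go
package main

import (
	"bytes"
	"fmt"
)

// isFtyp mirrors the parent check shown in the diff: at least 12 bytes,
// a zero most significant size byte, and "ftyp" at offset 4.
func isFtyp(raw []byte) bool {
	if len(raw) < 12 || raw[0] != 0 {
		return false
	}
	return bytes.Equal(raw[4:8], []byte("ftyp"))
}

// hasBrand is a hypothetical child check: it compares the 4-byte major
// brand at offset 8 against the given brands.
func hasBrand(raw []byte, brands ...string) bool {
	if len(raw) < 12 {
		return false
	}
	for _, b := range brands {
		if bytes.Equal(raw[8:12], []byte(b)) {
			return true
		}
	}
	return false
}

// classify runs the parent check first, then the more specific child,
// and falls back to video/mp4 for any other ftyp file.
func classify(raw []byte) string {
	if !isFtyp(raw) {
		return "application/octet-stream"
	}
	if hasBrand(raw, "avif", "avis") {
		return "image/avif"
	}
	return "video/mp4"
}

func main() {
	fmt.Println(classify([]byte("\x00\x00\x00\x18ftypisom"))) // video/mp4
	fmt.Println(classify([]byte("\x00\x00\x00\x1cftypavif"))) // image/avif
	fmt.Println(classify([]byte("\x00\x00\x00\x18ftypzzzz"))) // video/mp4
	fmt.Println(classify([]byte("textftypisom")))             // application/octet-stream
}
```

The third call shows the effect of the change: an unknown brand still resolves to video/mp4 instead of falling through to a generic type.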
13 changes: 10 additions & 3 deletions internal/magic/magic.go
@@ -153,9 +153,6 @@ func ftyp(sigs ...[]byte) Detector {
if len(raw) < 12 {
return false
}
if !bytes.Equal(raw[4:8], []byte("ftyp")) {
return false
}
for _, s := range sigs {
if bytes.Equal(raw[8:12], s) {
return true
@@ -242,3 +239,13 @@ func min(a, b int) int {
}
return b
}

type readBuf []byte

func (b *readBuf) advance(n int) bool {
	if n < 0 || len(*b) < n {
		return false
	}
	*b = (*b)[n:]
	return true
}
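
A short usage sketch for `readBuf` may help: `advance` either consumes exactly n bytes or leaves the buffer untouched and returns false, so callers can walk a byte slice without manual bounds checks. The type is copied from the diff; the `main` function and its sample input are illustrative only.

```go
package main

import "fmt"

// readBuf is copied from the diff: a byte slice with a bounds-checked cursor.
type readBuf []byte

// advance drops the first n bytes. It returns false and leaves the buffer
// unchanged when n is negative or larger than the remaining length.
func (b *readBuf) advance(n int) bool {
	if n < 0 || len(*b) < n {
		return false
	}
	*b = (*b)[n:]
	return true
}

func main() {
	b := readBuf("PK\x03\x04payload")

	ok := b.advance(4) // skip the 4-byte signature
	fmt.Println(ok, string(b)) // true payload

	ok = b.advance(100) // too far: refused, buffer unchanged
	fmt.Println(ok, string(b)) // false payload
}
```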
45 changes: 3 additions & 42 deletions internal/magic/ms_office.go
@@ -5,58 +5,19 @@ import (
"encoding/binary"
)

var (
xlsxSigFiles = [][]byte{
[]byte("xl/worksheets/"),
[]byte("xl/drawings/"),
[]byte("xl/theme/"),
[]byte("xl/_rels/"),
[]byte("xl/styles.xml"),
[]byte("xl/workbook.xml"),
[]byte("xl/sharedStrings.xml"),
}
docxSigFiles = [][]byte{
[]byte("word/media/"),
[]byte("word/_rels/document.xml.rels"),
[]byte("word/document.xml"),
[]byte("word/styles.xml"),
[]byte("word/fontTable.xml"),
[]byte("word/settings.xml"),
[]byte("word/numbering.xml"),
[]byte("word/header"),
[]byte("word/footer"),
}
pptxSigFiles = [][]byte{
[]byte("ppt/slides/"),
[]byte("ppt/media/"),
[]byte("ppt/slideLayouts/"),
[]byte("ppt/theme/"),
[]byte("ppt/slideMasters/"),
[]byte("ppt/tags/"),
[]byte("ppt/notesMasters/"),
[]byte("ppt/_rels/"),
[]byte("ppt/handoutMasters/"),
[]byte("ppt/notesSlides/"),
[]byte("ppt/presentation.xml"),
[]byte("ppt/tableStyles.xml"),
[]byte("ppt/presProps.xml"),
[]byte("ppt/viewProps.xml"),
}
)

// Xlsx matches a Microsoft Excel 2007 file.
func Xlsx(raw []byte, limit uint32) bool {
return zipContains(raw, xlsxSigFiles...)
return zipContains(raw, []byte("xl/"), true)
}

// Docx matches a Microsoft Word 2007 file.
func Docx(raw []byte, limit uint32) bool {
return zipContains(raw, docxSigFiles...)
return zipContains(raw, []byte("word/"), true)
}

// Pptx matches a Microsoft PowerPoint 2007 file.
func Pptx(raw []byte, limit uint32) bool {
return zipContains(raw, pptxSigFiles...)
return zipContains(raw, []byte("ppt/"), true)
}

// Ole matches an Open Linking and Embedding file.
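
The new calls pass a directory prefix (`xl/`, `word/`, `ppt/`) instead of a list of signature files. The body of `zipContains` is not part of this diff, so the sketch below is not that function. It is a simplified, hypothetical check (`firstEntryHasPrefix`) that only inspects the name in the first zip local file header, which is enough to show why the result depends on the order of files in the archive, as the commit message notes. `buildZip` is a test helper invented for the example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/binary"
	"fmt"
)

// firstEntryHasPrefix is a hypothetical, simplified check: it reads the file
// name from the first zip local file header and tests it against prefix.
func firstEntryHasPrefix(raw, prefix []byte) bool {
	const headerLen = 30 // fixed-size part of a zip local file header
	if len(raw) < headerLen || !bytes.HasPrefix(raw, []byte("PK\x03\x04")) {
		return false
	}
	// The file name length is a little-endian uint16 at offset 26.
	nameLen := int(binary.LittleEndian.Uint16(raw[26:28]))
	if len(raw) < headerLen+nameLen {
		return false
	}
	return bytes.HasPrefix(raw[headerLen:headerLen+nameLen], prefix)
}

// buildZip returns an in-memory zip holding empty files with the given
// names, written in the given order.
func buildZip(names ...string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, n := range names {
		if _, err := zw.Create(n); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	docFirst := buildZip("word/document.xml", "[Content_Types].xml")
	fmt.Println(firstEntryHasPrefix(docFirst, []byte("word/"))) // true
	fmt.Println(firstEntryHasPrefix(docFirst, []byte("xl/")))   // false

	// Same files, different order: this naive check no longer matches.
	typesFirst := buildZip("[Content_Types].xml", "word/document.xml")
	fmt.Println(firstEntryHasPrefix(typesFirst, []byte("word/"))) // false
}
```

A practical detector has to look past leading entries such as `[Content_Types].xml`; this sketch deliberately does not, to keep the order dependence visible.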
22 changes: 21 additions & 1 deletion internal/magic/text_csv.go
@@ -1,12 +1,28 @@
package magic

import (
"bufio"
"bytes"
"encoding/csv"
"errors"
"io"
"sync"
)

// A bufio.Reader pool to alleviate problems with memory allocations.
var readerPool = sync.Pool{
	New: func() any {
		// Initiate with empty source reader.
		return bufio.NewReader(nil)
	},
}

func newReader(r io.Reader) *bufio.Reader {
	br := readerPool.Get().(*bufio.Reader)
	br.Reset(r)
	return br
}

// Csv matches a comma-separated values file.
func Csv(raw []byte, limit uint32) bool {
return sv(raw, ',', limit)
@@ -18,7 +34,11 @@ func Tsv(raw []byte, limit uint32) bool {
}

func sv(in []byte, comma rune, limit uint32) bool {
r := csv.NewReader(bytes.NewReader(dropLastLine(in, limit)))
in = dropLastLine(in, limit)

br := newReader(bytes.NewReader(in))
defer readerPool.Put(br)
r := csv.NewReader(br)
r.Comma = comma
r.ReuseRecord = true
r.LazyQuotes = true
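
The pooling pattern in this hunk can be tried in isolation. The sketch below reuses the `readerPool` idea from the diff; `countRecords` is a hypothetical helper written for the example and is not the library's `sv` function. The saving comes from `csv.NewReader` wrapping its input with `bufio.NewReader`, which returns the reader it is given when that reader is already a large enough `*bufio.Reader`, so the pooled buffer is used instead of a fresh one.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"sync"
)

// readerPool holds reusable *bufio.Reader values, as in the diff.
var readerPool = sync.Pool{
	New: func() any {
		return bufio.NewReader(nil)
	},
}

// countRecords is a hypothetical helper: it parses in with the given
// separator and returns the number of records, borrowing a pooled reader.
func countRecords(in []byte, comma rune) (int, error) {
	br := readerPool.Get().(*bufio.Reader)
	br.Reset(bytes.NewReader(in))
	defer readerPool.Put(br)

	r := csv.NewReader(br)
	r.Comma = comma
	r.ReuseRecord = true

	n := 0
	for {
		_, err := r.Read()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	n, err := countRecords([]byte("a,b\n1,2\n3,4\n"), ',')
	fmt.Println(n, err) // 3 <nil>

	// The second call can reuse the buffer returned to the pool above.
	n, err = countRecords([]byte("a\tb\n1\t2\n"), '\t')
	fmt.Println(n, err) // 2 <nil>
}
```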
20 changes: 20 additions & 0 deletions internal/magic/text_csv_test.go
@@ -0,0 +1,20 @@
package magic

import (
	"io"
	"math/rand"
	"testing"
)

func BenchmarkCsv(b *testing.B) {
	r := rand.New(rand.NewSource(0))
	data := make([]byte, 4096)
	if _, err := io.ReadFull(r, data); err != io.ErrUnexpectedEOF && err != nil {
		b.Fatal(err)
	}

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		Csv(data, 0)
	}
}