Inconsistency sometimes occurs across multiple runs on the same file #50

Open

ViperelB opened this issue May 21, 2023 · 2 comments
ViperelB commented May 21, 2023

Running with the default command on larger files (over 1 GB) leads to inconsistent results across multiple runs:

$ duplicut '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.found' -o '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.DUPLICUT'

duplicut successfully removed 0 duplicates and 42 filtered lines in 05 seconds

$ duplicut '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.found' -o '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.DUPLICUT'

duplicut successfully removed 384 duplicates and 0 filtered lines in 02 seconds

$ duplicut '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.found' -o '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.DUPLICUT'

duplicut successfully removed 0 duplicates and 384 filtered lines in 02 seconds

$ duplicut '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.found' -o '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.DUPLICUT'

duplicut successfully removed 0 duplicates and 385 filtered lines in 02 seconds

$ duplicut '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.found' -o '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.DUPLICUT'

duplicut successfully removed 221 duplicates and 385 filtered lines in 02 seconds

$ duplicut '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.found' -o '/media/sf_kalishare/Wordlists/hashmob.net_2023-05-14.medium.DUPLICUT'

duplicut successfully removed 378 duplicates and 385 filtered lines in 02 seconds

Any idea why this is occurring? I was expecting the same results every time.
On further testing, it seems the cleanup stats go wild when writing to an already existing output file, resulting in inconsistent file sizes and word counts.

ViperelB (Author) commented May 21, 2023

It's also a little weird that using "duplicut -l 24" on a wordlist and then reusing the result with -l 16 always yields a smaller file than running -l 16 directly on the original wordlist. Technically, the same number of duplicate/filtered lines should be removed either way.

Visual example:
Original = 2.03 GB -> duplicut -l 24 = 1.11 GB -> duplicut -l 16 = 563 MB
Original = 2.03 GB -> duplicut -l 16 = 867 MB

Example 2:
0...9999999.dict file (94.1 MB) -> sort -u = same as original (94.1 MB)
0...9999999.dict file (94.1 MB) -> duplicut ("successfully removed 0 duplicates and 0 filtered lines in 09 seconds") = 83.5 MB (opened in Notepad++, the file suddenly stops after "8874998", while the original goes up to 9999999)

nil0x42 (Owner) commented Jan 26, 2024

Do any of your files contain nullbytes?
Duplicut makes an important assumption: the input file is a standard password wordlist with no binary content.
The first pass "patches" lines to be removed by overwriting their first char with a nullbyte, so the second pass assumes that any line starting with a nullbyte must be ignored.
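
For illustration, here is a minimal sketch of that marking scheme (not duplicut's actual code; names and the plain in-memory buffer are hypothetical stand-ins, since duplicut works on the file itself): pass 1 tags a line by nulling its first byte in place, and pass 2 copies out only untagged lines.

/* Hypothetical sketch of the two-pass marking scheme described above. */
#include <stdio.h>
#include <string.h>

/* pass 1: tag a line for removal by nulling its first byte */
static void mark_line(char *line)
{
    line[0] = '\0';
}

/* pass 2: copy out every line whose first byte is not '\0' */
static void second_pass(const char *buf, size_t len, FILE *out)
{
    const char *line = buf;
    while (line < buf + len) {
        const char *nl = memchr(line, '\n', (size_t)(buf + len - line));
        size_t linelen = nl ? (size_t)(nl - line) + 1
                            : (size_t)(buf + len - line);
        if (line[0] != '\0')
            fwrite(line, 1, linelen, out);
        line += linelen;
    }
}

int main(void)
{
    char buf[] = "alpha\nbravo\nbravo\ncharlie\n";
    mark_line(buf + 12);               /* pretend pass 1 flagged the 2nd "bravo" */
    second_pass(buf, sizeof(buf) - 1, stdout);
    return 0;                          /* prints: alpha, bravo, charlie */
}

If a line's real first byte is already '\0', pass 2 drops it even though pass 1 never marked it, which would match the missing-lines symptom reported above.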

Also, duplicut makes virtual chunks of the file depending on currently available memory, and starts each chunk after the next newline, so having nullbytes in your files would explain such weird behavior. Please let me know if your files do contain nullbytes, so I can investigate further whether there is a possible bug.
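
Checking is straightforward: count the NUL bytes in the input file. Below is a minimal standalone sketch (a hypothetical helper, not part of duplicut); on Linux, something like tr -d -c '\0' < wordlist | wc -c should report the same count.

/* nulcheck.c: count NUL bytes in a file; exits 1 if any are found. */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <wordlist>\n", argv[0]);
        return 2;
    }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) {
        perror(argv[1]);
        return 2;
    }
    long long nuls = 0;
    int c;
    while ((c = fgetc(fp)) != EOF)
        if (c == 0)
            nuls++;
    fclose(fp);
    printf("%lld NUL bytes\n", nuls);
    return nuls > 0;
}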
