
Scan is not getting completed for file containing unicode characters like - 北京朝阳区 #641

Open
KBiru opened this issue Dec 1, 2022 · 4 comments
Labels
bug The issue describes a malfunctioning aspect of the project. needs more info The issue has been reviewed, but the information provided by the reporter is incomplete. P4 Future work. E.g. something we might to get on in the future. Might be used for future ideas too.

Comments

@KBiru

KBiru commented Dec 1, 2022

[This is a follow-up to issue #626.]

Hi Team,

The scan crashes on some files without any error or scan report. I went through each line and found that if the code contains a string like 北京朝阳区, detect-secrets does not scan the file; it exits without any errors. Is there a plugin or filter I should use to avoid this?
[Note - it is known that the particular file contains secret]

In other words, are Unicode strings handled properly? Also, if I want to apply the should_exclude_secret filter with certain Unicode regexes, how do I add it to the transient settings?
So far I have not been able to do it.

Using the detect-secrets Python package (Python 3.10)
detect-secrets version = 1.4.0
OS = Windows 10

Please let me know if you need any more information.

Thanks,
Bireswar

jpdakran added the labels triaged (The issue has been reviewed but has not been solved yet.), bug, P4, and pending (The issue still needs to be reviewed by one of the maintainers.), and removed triaged, on Mar 22, 2023
@jpdakran
Member

@KBiru Hi. Thank you for reporting this. What is the file type of the file?

jpdakran added the labels triaged and needs more info, and removed pending and triaged, on Mar 22, 2023
@KBiru
Author

KBiru commented Mar 23, 2023

The file is ordinary XML, but its encoding is UTF-8. Any UTF-8 file containing characters like the ones I mentioned breaks the scan, because detect-secrets decodes files using only the default encoding of the OS it runs on. On Windows, for example, it tries to decode with cp1252 even though the file is UTF-8.
I think this is the cause: the tool does not detect the file's encoding before processing it and relies only on the system's default encoding.
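
To check which encoding Python would pick on a given machine, a minimal sketch follows. It assumes Python 3.10 as in the report, where `open()` without an `encoding` argument uses `locale.getpreferredencoding(False)`. The helper name `default_text_encoding` is hypothetical, and this is not code from detect-secrets.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Return the encoding open() falls back to when `encoding` is omitted.

    Assumption: Python 3.10 behaviour, as used in the report.
    """
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    # On a Western-European Windows install this is typically "cp1252".
    print("default text encoding:", default_text_encoding())
    # 1 means UTF-8 mode is on (python -X utf8 or PYTHONUTF8=1),
    # in which case the default text encoding is UTF-8 regardless of locale.
    print("UTF-8 mode:", sys.flags.utf8_mode)
```

If the first line prints something other than a UTF-8 variant, any tool that opens UTF-8 files without an explicit `encoding` can hit the decode error described in this thread.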

@jpdakran
Member

@KBiru Can you give me an example snippet of this file? For example, trim the file down and sanitize it so that it contains no sensitive information but still causes the error, so that I can attempt to reproduce this.

@KBiru
Author

KBiru commented Apr 18, 2023

@jpdakran I tried to reproduce the earlier behavior, but now it gives the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 352: character maps to <undefined>

To be clearer: the same content works if the file's encoding matches the system's default encoding, but fails if the two differ.
To demonstrate, paste the following content into a file encoded as UTF-8 on a Windows system:
[Expected behavior: the scan fails, because the default encoding on Windows is cp1252, not UTF-8.]
`
+GSMSF18232735
<td align="left" style="FONT-SIZE: 9px"><font size="-2"><a href="" target="URL">View URL</a></a></font></td>
<td align="left"><font size="-1">勋</font></td>
<td align="left"><font size="-1">张</font></td>
<td align="left"><font size="-1">北京朝阳区</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1">8.0</font></td>
<td align="left"><font size="-1">东莞市悠派智能展示科技有限公司</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1">茶山镇塘角工业区</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1">东莞市</font></td>
<td align="left"><font size="-1">CN</font></td>
<td align="left"><font size="-1">523382</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1">Professional Services</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1">Marketing</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1">No</font></td>
<td align="left"><font size="-1">No</font></td>
<td align="left"><font size="-1">No</font></td>
<td align="left"><font size="-1">No</font></td>
<td align="left"><font size="-1">Yes</font></td>
<td align="left"><font size="-1">No</font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"> </font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
<td align="left"><font size="-1"></font></td>
</tr>`
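
The failure can also be reproduced on any OS, independent of detect-secrets, by writing one line of the sample as UTF-8 and reading it back with `cp1252` forced. The sketch below is only an illustration: the helper name `read_as_cp1252` is hypothetical and nothing here is the tool's own code.

```python
import os
import tempfile
from typing import Optional

# One line taken from the sample above.
SAMPLE = '<td align="left"><font size="-1">北京朝阳区</font></td>\n'


def read_as_cp1252(text: str) -> Optional[UnicodeDecodeError]:
    """Write `text` as UTF-8, read it back as cp1252, and return the error (if any)."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        try:
            # Forcing cp1252 mimics what a Windows default encoding would do.
            with open(path, encoding="cp1252") as f:
                f.read()
        except UnicodeDecodeError as exc:
            return exc
        return None
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(read_as_cp1252(SAMPLE))
```

The character 朝 encodes to the UTF-8 bytes `E6 9C 9D`, and `0x9D` has no mapping in cp1252, which matches the byte named in the error message above.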
    

What I wanted to ask is whether the file encoding is detected dynamically, or whether this is not yet a feature of the tool.
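
As an illustration of what dynamic handling could look like, here is a minimal sketch that tries UTF-8 first and falls back to the platform default. The function name `read_text_with_fallback` and the fallback order are assumptions made for this example; this is not how detect-secrets reads files.

```python
import locale
from typing import Optional, Sequence


def read_text_with_fallback(path: str, encodings: Optional[Sequence[str]] = None) -> str:
    """Decode a file by trying each candidate encoding in order.

    Hypothetical helper: by default it tries UTF-8 (with or without a BOM),
    then the platform default encoding.
    """
    if encodings is None:
        encodings = ("utf-8-sig", locale.getpreferredencoding(False))
    with open(path, "rb") as f:
        raw = f.read()
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

Trying UTF-8 first is a deliberate choice in this sketch: text in a legacy single-byte encoding rarely happens to be valid UTF-8, so a successful UTF-8 decode is a strong signal, whereas cp1252 accepts almost any byte sequence and would silently produce mojibake. As a separate workaround, running Python in UTF-8 mode (`python -X utf8` or `PYTHONUTF8=1`) makes `open()` default to UTF-8 on Windows; whether that fully resolves the scan failure reported here is untested.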
