Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser giving unicode characters for Arabic language #47

Open
abdul756 opened this issue May 4, 2024 · 10 comments
Open

Parser giving unicode characters for Arabic language #47

abdul756 opened this issue May 4, 2024 · 10 comments
Labels
question Further information is requested

Comments

@abdul756
Copy link

abdul756 commented May 4, 2024

Parser giving unicode characters for Arabic language how to parse files for languages other than english
Am attaching the output

Code

files = pw.io.fs.read(
    data_dir,
    mode="streaming",
    format="binary",
    autocommit_duration_ms=50,
)
parser = ParseUnstructured()
documents = files.select(text = parser(pw.this.data))
pw.io.csv.write(documents, "output_stream_en_7.csv")

نظام التكاليف القضائية.pdf

output_stream_ar_7.csv
Please help in resolving this issue

I would request help; I would like to see an example; I would like to understand the cause of the issue.

@abdul756 abdul756 added the question Further information is requested label May 4, 2024
@dxtrous
Copy link
Member

dxtrous commented May 4, 2024

Looks like one of the pipeline steps has introduced spurious UTF escaping of your characters. Easily fixable but likely to be an annoyance.

To help pin-point the cause, could you replace the pw.io.csv.write line by pw.debug.compute_and_print(documents), look at your terminal output, and see if the problem persists?

@abdul756
Copy link
Author

abdul756 commented May 6, 2024

Yeah I will try and update you

@abdul756
Copy link
Author

abdul756 commented May 6, 2024

ﻣﻣﻠﻛﺔ\n\nﺗﻛون\n\nاﻟﺗﻲ\n\nاﻟدوﻟﯾﺔ\n\nواﻻﺗﻔﺎﻗﯾﺎت\n\nواﻟﻣﻌﺎھدات\n\nاﻷﻧظﻣﺔ\n\nﺑﮫ\n\nﺗﻘﺿﻲ\n\nﻣﺎ\n\nﻣراﻋﺎة\n\nﻣﻊ\n\nﻋﻠﯾﮭم.\n\nأو\n\nﻣﻧﮭم\n\nﻛﺎﻧت ﺳواءً\n\nﺗﻘﺎم\n\nاﻟﺗﻲ\n\nاﻟدﻋﺎوى\n\nﻓﻲ ،\n\nﺟﻧﺎﺋﯾﺔ\n\nﻏﯾر\n\nﻣﺎﻟﯾﺔ\n\nﻗﺿﺎﯾﺎ\n\nﻓﻲ\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nاﺳﺗﺣﻘﺎق\n\nوﻗت\n\nواﻟﻣوﻗوﻓون\n\nاﻟﻣﺳﺟوﻧون\n\n.1\n\nﻋﻣل.\n\nﻋﻘود\n\nﻋن\n\nاﻟﻧﺎﺷﺋﺔ\n\nﺑﻣﺳﺗﺣﻘﺎﺗﮭم\n\n؛ ﻟﻠﻣطﺎﻟﺑﺔ\n\nﻋﻧﮭم\n\nواﻟﻣﺳﺗﺣﻘون\n\nﻣﻧﮫ\n\nواﻟﻣﺳﺗﺛﻧون\n\nاﻟﻌﻣل\n\nﺑﻧظﺎم\n\nاﻟﻣﺷﻣوﻟون\n\nاﻟﻌﻣﺎل\n\n.2\n\nاﻟﺣﻛوﻣﯾﺔ.\n\nواﻷﺟﮭزة\n\nاﻟوزارات\n\n.3\n\nﺑذﻟك.\n\nاﻟﺧﺎﺻﺔ\n\nواﻟﻘواﻋد\n\nاﻹﺟراءات\n\nاﻟﻼﺋﺣﺔ\n\nوﺗﺣدد\n\nﻋﺷرة\n\nاﻟﺛﺎﻣﻧﺔ\n\nاﻟﻘﺿﺎﺋﯾﺔ.\n\nاﻟﺗﻛﺎﻟﯾف\n\nﺑدﻓﻊ ﻋﻠﯾﮫ\n\nاﻟﻣﺣﻛوم\n\nﻓﯾﻠزم\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nﻣن\n\nاﻟﻣُﻌﻔﻰ\n\nﻟﻣﺻﻠﺣﺔ\n\nاﻟدﻋوى\n\nﻓﻲ ﺣﻛم\n\nﺻدر\n\nإذا\n\n(،\n\nﻋﺷرة\n\nاﻟﺳﺎﺑﻌﺔ\n\nاﻟﻣﺎدة )\n\nﺑﮫ\n\nﺗﻘﺿﻲ\n\nﻣﺎ\n\nﻣراﻋﺎة\n\nﻣﻊ\n\nاﻟرد.\n\nﻣﺳوﻏﺎت\n\nﺗواﻓرت\n\nإذا\n\nِھﺎ\n\nوردّ\n\nﻋﻠﯾﮫ.\n\nواﻹﺷراف\n\nﻋﻣﻠﮫ\n\nاﻟﻘﺿﺎﺋﯾﺔ ،\n\nإﺟراءات\n\nﻋﺷرة\n\nاﻟﺗﺎﺳﻌﺔ\n\nاﻟﺳﻌودي.\n\nاﻟﻣرﻛزي\n\nاﻟﺑﻧك\n\nﻟدى\n\nاﻟﻣﺎﻟﯾﺔ\n\nوزارة\n\nﺟﺎري\n\nﺣﺳﺎب ﻓﻲ\n\nاﻟﻣﺣﺻﻠﺔ\n\nاﻟﻘﺿﺎﺋﯾﺔ\n\nاﻟﺗﻛﺎﻟﯾف\n\nﻣﺑﺎﻟﻎ\n\nﺗودع\n\nاﻟﻌﺷرون\n\nاﻟﺗﻛﺎﻟﯾف\n\nﺑﺗﺣﺻﯾل\n\nاﻟطﻠب -\n\nإﻟﯾﮭﺎ\n\nاﻟﻣﻘدم\n\nأو ،\n\nاﻟدﻋوى\n\nإﻟﯾﮭﺎ\n\nاﻟﻣرﻓوع\n\nاﻟﻣﺣﻛﻣﺔ\n\nﻓﻲ -\n\nاﻟﻣﺧﺗﺻﺔ\n\nاﻹدارة\n\nﻣﻧﮫ\n\nﺑﻘرار\n\nاﻟﻌدل\n\nوزﯾر\n\nﯾﺣدد\n\nواﻟﻌﺷرون\n\nاﻟﺣﺎدﯾﺔ\n\nوﻗواﻋد\n\nﻟﮫ\n\nاﻟﺗراﺧﯾص\n\nأﺣﻛﺎم\n\nاﻟﻼﺋﺣﺔ\n\nوﺗﺣدد\n\nاﻟﻧظﺎم.\n\nﻟﺗطﺑﯾﻖ\n\nاﻟﻣﺳﺎﻧدة\n\nﺑﺎﻷﻋﻣﺎل\n\nﺑﺎﻟﻘﯾﺎم\n\nاﻟﺧﺎص\n\nﻟﻠﻘطﺎع\n\nاﻟﺗرﺧﯾص\n\nاﻟﻌدل\n\nﻟوزﯾر\n\nواﻟﻌﺷرون\n\nاﻟﺛﺎﻧﯾﺔ\n\nاﻟوزراء.\n\nﻣﺟﻠس ﻣن\n\nﺑﻘرار\n\nوﺗﺻدر\n\nاﻟﻧظﺎم ،\n\nﺻدور\n\nﺗﺎرﯾﺦ\n\nﻣن\n\nﯾوﻣﺎً\n\nﺳﺗﯾن (\n\nﺧﻼل )\n\nاﻟﻼﺋﺣﺔ\n\nاﻟﻌدل\n\nوزارة\n\nﺗﻌد\n\nواﻟﻌﺷرون\n\nاﻟﺛﺎﻟﺛﺔ\n\nاﻟرﺳﻣﯾﺔ.\n\nاﻟﺟرﯾدة\n\nﻓﻲ\n\nﻧﺷره\n\nﺗﺎرﯾﺦ\n\nﻣن\n\nﯾوﻣﺎً\n\nوﺛﻣﺎﻧﯾن (\n\nﻣﺎﺋﺔ\n\nﺑﻌد )\n\nﺑﺎﻟﻧظﺎم\n\nﯾﻌﻣل\n\n2021 2021\n\nاﻟﻌﺪل اﻟﻌﺪل\n\nﻟﻮزارة ﻟﻮزارة\n\n© ©\n\nﻣﺤﻔﻮﻇﺔ ﻣﺤﻔﻮﻇﺔ\n\nاﻟﺤﻘﻮق اﻟﺤﻘﻮق\n\nﺟﻤﻴﻊ ﺟﻤﻴﻊ', pw.Json({'filetype': 'application/pdf', 'languages': ['eng'], 'links': [], 'page_number': 4})),)

when i made the mode static and used pw.debug.compute_and_print(documents) the output in terminal is able to show in arabic , so how to fix this in streaming mode while using pw.io.csv.write

@berkecanrizai
Copy link
Contributor

Hey @abdul756 , streaming mode is only usable when you are running the app with pw.run().

In this case, seems like parser is working ok. If you want to dump the text into some file and keep the app running (so that when a new content or new file arrives, new data is put into your csv file), you can run your code with streaming mode enabled, you can achieve this with the addition of pw.run() at the end of the code.

So, it will look as:

documents = folder.select(text=parser(pw.this.data))
pw.io.csv.write(documents, "output_stream_en_7.csv")

pw.run()

You can run this in notebook or in regular python file.

This will start the pipeline that will keep running until you close. After running the pw.run, you will see the output file being created.

If, you are interested in taking a dump for one time in a static manner, you can run the following:

df = pw.debug.table_to_pandas(documents)
df.to_csv("outputs_en.csv")

this will put the content into csv file. In this case, we take data into Pandas DataFrame, then write it to a file. This one doesn't need a pw.run() since it is statically running for single time.

@abdul756
Copy link
Author

abdul756 commented May 6, 2024

In streaming mode the output is not coming its giving only unicode characters, you can refer the file i attached with the issue

@berkecanrizai
Copy link
Contributor

In streaming mode the output is not coming its giving only unicode characters, you can refer the file i attached with the issue

Yes, I just replicated the issue with another file. The static mode works ok (refer to the df.to_csv snippet above), and internally, the Pathway table also stores the data correctly in Arabic characters, however writing them with csv casts them to Unicode.
So, you can use it in your app without any issues, in the meantime, we are investigating. Will keep updated.

@abdul756
Copy link
Author

abdul756 commented May 6, 2024

Sure thanks

@abdul756 abdul756 closed this as completed May 6, 2024
@KamilPiechowiak
Copy link
Contributor

@abdul756 thanks for reporting this. The problem will be fixed in the next release (it'll be released this week).

@KamilPiechowiak
Copy link
Contributor

KamilPiechowiak commented May 10, 2024

Hey @abdul756,
Today Pathway v0.11.0 has been released. Your problem should be fixed now. However nobody in the team knows Arabic language. Could you update your pathway version and confirm that it is a satisfactory solution to your problem?

@abdul756
Copy link
Author

I will test and update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants