This 5-line Python script converts a non-English XML file to a JSON file perfectly

Whatever I write comes from my own experience. Last week I was working on one of my pet projects, which was originally developed as a native Android app in Java four years ago. I wanted to rewrite it in Flutter to make it available on both Android and iOS from the same codebase. One challenge was handling the XML files that were responsible for two major functionalities.

If you are not familiar with Android and want to know more, read the next paragraph. If you just want to convert XML into JSON, you can skip it.

Let me explain a little more. Suppose you have written 100 pages of content in English, or whatever language you are proficient in, and now you want to develop an Android application that displays this content like an e-book. You are also sure that the content will rarely change, so you want to bundle it inside the Android app, where any user can read it offline. So you put the whole content in the Android project, inside the res folder (which stands for resources), in a string-array file.

This file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string-array name="sloka">
        <item>ओ३म् भूर्भुवः स्वः तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो नः प्रचोदयात्।</item>
        <item>ओ३म् भूर्भुवः स्वः तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो नः प्रचोदयात्।</item>
    </string-array>
</resources>

This is valid XML containing only two slokas (holy verses from Hindu scriptures). The original file contained more than 300 slokas; the rest are removed for brevity.

In native Android, much of the configuration, layout, and content is stored in XML files, which Android reads and renders beautifully. As I was not willing to parse these XML files with an XML parser in Flutter, I decided to convert them into JSON files.

I chose Python for the conversion and set it up using Anaconda. The script opens the XML file by a relative path, so both files sit in the same project folder.

I will first explain the problem that I was facing using code snippets. Then I will show you the code which solved that problem.

This is index.py, the main script that does the conversion. In simple words, it just reads the file bl_content.xml and converts it into a JSON file.

index.py

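import xmltodict
import json

# Read the source XML as UTF-8 text.
with open('bl_content.xml', mode='r', encoding='utf-8') as in_file:
    xml = in_file.read()

# Convert the parsed XML to JSON (non-ASCII characters are escaped by default).
with open('shabd_content.json', mode='w', encoding='utf-8') as out_file:
    json.dump(xmltodict.parse(xml, process_namespaces=True), out_file)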

Output

{"string-array": {"item": ["\u0930\u093e\u091c\u0938\u094d\u0925\u093e\u0928 \u092e\u0947\u0902 \u092e\u0930\u0942\u092d\u0942\u092e\u093f \u0915\u093e \u0928\u093e\u0917\u094c\u0930, \u0928\u093e\u0917\u094c\u0930\u0940 \u092c\u0948\u0932\u094b\u0902 \u0915\u0947 \u0932\u093f\u090f

This was not the output I was expecting. All non-English characters had been converted to Unicode escape sequences (\uXXXX), and I had to avoid this somehow.

When you are experimenting with a new programming language, it can be tough to figure out what is going wrong. I googled and went through many solutions, but nothing worked, so I started debugging step by step.
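The escaping in that output is the default behaviour of Python's json module, and the escaped file is still valid JSON that decodes back to the original text. A minimal sketch using only the standard library (the sample string is a fragment of the sloka shown earlier, chosen for illustration):

```python
import json

text = "ओ३म् भूर्भुवः"

# By default json.dumps escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(text)
print(escaped.isascii())  # → True

# The escaped form is still valid JSON and decodes to the original string.
round_trip = json.loads(escaped)
print(round_trip == text)  # → True
```

So the default output is correct, just hard for a human to read or edit.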

Step 1

I had to figure out whether the encoding problem was occurring during the file read operation or the file write operation.

To understand this, I printed the data read from the XML file to the console.

The text appeared exactly as it does in the XML file, with no Unicode escapes, even after being read. That made one thing clear: there was no problem in the XML reading step (xmltodict.parse). So something had to be missing in the json.dump call.
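A check of this kind can be sketched as follows. This is an illustration under assumptions, not the exact code from my project: it writes a small sample file first so that it is self-contained (the file name sample.xml is hypothetical), and it compares strings instead of printing them so that it also runs on consoles that cannot display Devanagari.

```python
import os
import tempfile

sample = (
    '<resources><string-array name="sloka">'
    "<item>ओ३म् भूर्भुवः</item>"
    "</string-array></resources>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical file name

    # Write a small sample file so the example is self-contained.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(sample)

    # Read it back the same way the conversion script reads its input.
    with open(path, mode="r", encoding="utf-8") as in_file:
        xml = in_file.read()

# If the text survives the read unchanged, the read step is not the culprit.
read_ok = xml == sample
print("read preserved the text:", read_ok)
```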

So I went through the documentation for json.dump and found that I needed to pass ensure_ascii=False. The parameter is True by default, which escapes all non-ASCII characters.
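The effect of the flag is easy to see on a small dictionary. The snippet below is a minimal sketch with made-up sample data; it uses json.dumps (the string-returning sibling of json.dump), which accepts the same ensure_ascii parameter.

```python
import json

data = {"item": ["ओ३म् भूर्भुवः स्वः"]}

# ensure_ascii defaults to True: non-ASCII characters become \uXXXX escapes.
default_output = json.dumps(data)

# ensure_ascii=False keeps the characters exactly as they are.
readable_output = json.dumps(data, ensure_ascii=False)

print(default_output.isascii(), readable_output.isascii())  # → True False

# Both strings decode to the same data; only the representation differs.
print(json.loads(default_output) == json.loads(readable_output))  # → True
```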

Final Script

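import xmltodict
import json

# Read the source XML as UTF-8 text.
with open('bl_content.xml', mode='r', encoding='utf-8') as in_file:
    xml = in_file.read()

# ensure_ascii=False writes non-English characters as-is instead of \uXXXX escapes.
with open('shabd_content.json', mode='w', encoding='utf-8') as out_file:
    json.dump(xmltodict.parse(xml, process_namespaces=True), out_file, ensure_ascii=False)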

Look at ensure_ascii=False in the very last line of code, which did the magic for me.

Final output

{
 "item": ["ओ३म् भूर्भुवः स्वः तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो नः प्रचोदयात्।", "ओ३म् भूर्भुवः स्वः तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो नः प्रचोदयात्।"]
}

This is valid JSON that can be put directly into the Flutter assets folder, parsed easily, and rendered in a ListView. If you are not concerned with Flutter, you have still learned how to avoid the Unicode horror.

I hope you enjoyed it.
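If installing xmltodict is not an option, roughly the same result can be sketched with the standard library alone. This is an alternative approach, not the script used above. The function name string_array_to_json is hypothetical, and the sketch assumes the file follows the resources / string-array / item layout shown earlier.

```python
import json
import xml.etree.ElementTree as ET


def string_array_to_json(xml_text: str) -> str:
    """Collect the text of every <item> element into {"item": [...]}."""
    # Parse from UTF-8 bytes so an XML declaration with encoding="utf-8" is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    items = [item.text or "" for item in root.iter("item")]
    # ensure_ascii=False keeps the non-English characters readable.
    return json.dumps({"item": items}, ensure_ascii=False)


sample_xml = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<resources><string-array name="sloka">'
    "<item>ओ३म् भूर्भुवः</item><item>धीमहि</item>"
    "</string-array></resources>"
)

converted = string_array_to_json(sample_xml)
print(len(json.loads(converted)["item"]))  # → 2
```

The returned string can be written to a file opened with encoding='utf-8', just like in the final script.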

Thank you for your time.
