March 26

Windows, Java, and an Internationalization Mess

The project I’m currently working on is a Java project, powered by Spring and built with Maven. I build and run it locally on Ubuntu Linux, but everyone else on the team uses Windows XP. We recently got back a set of translations that included pages in Chinese, Japanese, Korean, and Russian, all languages whose characters are not included in ISO 8859-1.

The first issue we encountered was that Java reads standard key=value .properties files only as ISO 8859-1; any character outside that set has to be written as a \uXXXX escape. So we converted all the .properties files into XML properties files, which declare their encoding with “<?xml version="1.0" encoding="UTF-8" ?>”. And all was well in the world (from my perspective at least).
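For illustration, here is a minimal sketch of reading such an XML properties file with java.util.Properties.loadFromXML. The class name XmlPropertiesDemo, the key "greeting", and the in-memory sample document are made up for this example and are not from our project; the Russian text is written with \u escapes so the source file itself does not depend on any particular encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class: loads an XML properties document held in memory.
public class XmlPropertiesDemo {

    // A tiny sample document; the key "greeting" is invented for the example.
    static String sampleXml() {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
             + "<properties>\n"
             + "  <entry key=\"greeting\">\u041f\u0440\u0438\u0432\u0435\u0442</entry>\n"
             + "</properties>\n";
    }

    // Parses XML properties; the XML parser honours the encoding named in
    // the XML declaration rather than assuming ISO 8859-1.
    static Properties loadXml(byte[] xmlBytes) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            props.loadFromXML(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = loadXml(sampleXml().getBytes(StandardCharsets.UTF_8));
        System.out.println(props.getProperty("greeting"));
    }
}
```

In a real project the bytes would come from a file or classpath resource instead of a string, but the loading call is the same.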

I built and ran the project, and checked out some of the (what I call, at least) exotic-language pages. I saw Cyrillic characters in Russian, so I was happy. But when someone built and ran the project on a Windows computer, they saw boxes and question marks.

The problem is in how Java handles files based on what the environment specifies. On my Ubuntu computer the environment is set to UTF-8, but on Windows it’s set to cp1252 (a Microsoft-proprietary extension of ISO 8859-1). So when Maven copies files during the build process, Java re-encodes them as cp1252, which cannot represent these characters, and they come out as question marks and boxes.
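The loss itself is easy to reproduce outside Maven. The sketch below is not what Maven does internally; it only shows, under the assumption that text passes through a cp1252 encode step somewhere, why Cyrillic cannot survive it. The class and method names (EncodingLossDemo, roundTrip) are invented for the example, and it assumes the JRE ships the windows-1252 charset, which is usual but not guaranteed by the spec.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: pushes text through a charset and back again.
public class EncodingLossDemo {

    // Encodes the text to bytes in the given charset, then decodes those
    // bytes with the same charset, mimicking a copy through that encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic

        // cp1252 has no Cyrillic letters, so each one is replaced by '?'.
        System.out.println(roundTrip(russian, Charset.forName("windows-1252")));

        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(russian, StandardCharsets.UTF_8));
    }
}
```

Once the characters have been replaced by question marks the original text is gone, which is why the fix has to happen at build time and not in the browser.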

The solution is to set MAVEN_OPTS="-Dfile.encoding=UTF-8" in the environment before you run mvn. That overrides Java’s detection of Windows’ cp1252 encoding, and makes everything work.
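If you want to see what a JVM on a given machine actually picked up, a tiny check like the following can help. It is only a sketch: EncodingCheck is a made-up class name, and it reports on the JVM it runs in, not on Maven's own JVM, so run it with the same flag (for example, java -Dfile.encoding=UTF-8 EncodingCheck) to compare. Note that newer Java releases (18 and later) default to UTF-8 regardless of the platform.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports the encoding settings of the running JVM.
public class EncodingCheck {

    // Builds a one-line summary of the file.encoding system property and
    // the default charset the JVM uses when none is specified.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
             + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```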

Now can someone tell me why Windows doesn’t do UTF-8 natively, especially in a world with more speakers of these exotic languages than of languages that can be expressed in cp1252?

Comments

  1. David said on July 11th, 2008

    Hi, I am a complete beginner in developing and designing websites and i have added the russian language to our website. There is only one page that doesnt work out of all of them, i think its because it has java or some kind of script language in the page. the others work perfectly fine so I just need help with the one. i changed the page to UTF-8 but its all question marks when it comes up on the website. you mentioned something about adding MAVEN_OPTS="-Dfile.encoding=UTF-8" to the environment. What is this and where do I put it? Please keep in mind that i have no idea about all this technical website language. thank you for any help that you can give me.
