Java Forums

#1
02-08-2008, 03:47 AM
j_kathiresan (Member)
XML parse throws SAXParseException. Encoding is UTF-8 instead of ISO-8859-1?
Hi All,

I have some Korean characters in my XML. When I try to parse the XML, I get a SAXParseException.

<?xml version="1.0" encoding="UTF-8"?> --- Throwing Exception

<?xml version="1.0" encoding="ISO-8859-1"?> --- No Exception, successfully parsed

I'm not sure why UTF-8 fails and ISO-8859-1 passes, but the XML I receive always declares UTF-8. Does anyone know the reason?
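One frequent cause of this symptom, though nothing in this thread confirms it is the cause here, is that the bytes in the file are not actually UTF-8 even though the declaration says so. The sketch below assumes the Korean text was written in a legacy encoding (the four sample bytes are the EUC-KR encoding of two Hangul syllables; the class and method names are made up for illustration). Declared as UTF-8, the bytes are rejected as malformed; declared as ISO-8859-1, every byte maps to some character, so parsing succeeds but the text comes out garbled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class EncodingMismatchDemo {

    // Assumed sample data: the EUC-KR bytes for U+D55C U+AE00 (two Hangul syllables).
    static final byte[] LEGACY_KOREAN = {(byte) 0xC7, (byte) 0xD1, (byte) 0xB1, (byte) 0xDB};

    /** Builds XML whose declaration names `declared` while the text is the raw bytes above. */
    static byte[] buildXml(String declared) {
        byte[] head = ("<?xml version=\"1.0\" encoding=\"" + declared + "\"?><msg>")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</msg>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(LEGACY_KOREAN, 0, LEGACY_KOREAN.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    /** Returns the text of the root element, or null if the parser rejects the input. */
    static String parseText(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            // Malformed byte sequences surface as one of these, depending on the parser version.
            return null;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Declared UTF-8: 0xC7 starts a two-byte sequence, but 0xD1 is not a continuation byte.
        System.out.println("UTF-8 declaration parsed: " + (parseText(buildXml("UTF-8")) != null));
        // Declared ISO-8859-1: parses, but yields Latin-1 characters rather than Hangul.
        System.out.println("ISO-8859-1 declaration gives: " + parseText(buildXml("ISO-8859-1")));
    }
}
```

If this is what is happening, switching the declaration to ISO-8859-1 only hides the problem, because the Korean characters are silently replaced by unrelated Latin-1 characters.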

I would also like to know the differences between UTF-8 and ISO-8859-1; I haven't found a good article or document on this.

Thanks,
J.Kathir
#2
03-28-2008, 06:08 PM
DonCash (Moderator)
Quote:

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. This means simply that no information is lost if you convert any text string to UCS and then back to its original encoding.

UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only historic scripts such as Cuneiform, Hieroglyphs and various Indo-European notations, but even some selected artistic scripts such as Tolkien’s Tengwar and Cirth. UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems. The standard continues to be maintained and updated. Ever more exotic and specialized symbols and characters will be added for many years to come.

ISO 10646 originally defined a 31-bit character set. The subsets of 2^16 characters where the elements differ (in a 32-bit integer representation) only in the 16 least-significant bits are called the planes of UCS.

The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation. Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part ISO 10646-2 was added in 2001 and defines characters encoded outside the BMP. In the 2003 edition, the two parts were combined into a single ISO 10646 standard. New characters are still being added on a continuous basis, but the existing characters will not be changed any more and are stable.

UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by “U+” as in U+0041 for the character “Latin capital letter A”. The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such as UTF-8 and UTF-16.
There is more information on this here:

UTF-8 and Unicode FAQ
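To make the difference between the two encodings concrete, here is a small sketch (class and method names are my own, purely illustrative) that prints the bytes each encoding produces for three characters: an ASCII letter, a Latin-1 letter, and a Hangul syllable. ASCII is identical in both, U+00E9 takes one byte in ISO-8859-1 but two in UTF-8, and the Hangul syllable has no ISO-8859-1 representation at all, so Java substitutes a question mark.

```java
import java.nio.charset.StandardCharsets;

class Utf8VersusLatin1 {

    /** Formats bytes as space-separated uppercase hex, e.g. "C3 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    static String latin1(String s) {
        // Characters that ISO-8859-1 cannot represent are replaced with '?' (0x3F).
        return hex(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of the compiler's encoding.
        String[] samples = {"A", "\u00E9", "\uD55C"};
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %-8s  ISO-8859-1: %s%n",
                    (int) s.charAt(0), utf8(s), latin1(s));
        }
    }
}
```

So ISO-8859-1 is a fixed one-byte encoding covering only U+0000 to U+00FF, while UTF-8 is a variable-length encoding that can represent every UCS character, Korean included.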

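I'm not sure if this will work, but you could try changing the encoding in the XML declaration from UTF-8 to ISO-8859-1 before you parse the XML. Or you could try removing the declaration completely, although a parser then assumes UTF-8 by default (or UTF-16 if there is a byte order mark), so that is unlikely to behave differently from the failing case.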
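An alternative to editing the declaration, sketched here under the assumption that you know which encoding the bytes are really in, is to decode them yourself and hand the parser a character stream. When an InputSource carries a Reader, the parser reads the characters directly and disregards the encoding named in the declaration. The class name and the sample document are hypothetical; the sample uses ISO-8859-1 data mislabelled as UTF-8 so that it runs with the standard charsets only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseWithKnownCharset {

    /** Parses `xml`, decoding it with `actual` instead of trusting the XML declaration. */
    static Document parse(byte[] xml, Charset actual)
            throws ParserConfigurationException, SAXException, IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), actual)) {
            // A character stream takes precedence over the declared encoding.
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: declares UTF-8, but the bytes are really ISO-8859-1.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg>caf\u00E9</msg>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(xml, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

For Korean data you would pass whatever charset the producer really used, for example Charset.forName("EUC-KR") if that turns out to be the case. The cleaner long-term fix is to have the producer write genuine UTF-8 so that the declaration and the bytes agree.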

Last edited by DonCash : 03-28-2008 at 06:11 PM.
All times are GMT +3.


Copyright ©2006 - 2007, www.java-forums.org