commons-issues mailing list archives

From "David Gao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IO-331) BOMInputStream wrongly detects UTF-32LE_BOM files as UTF-16LE_BOM files in method getBOM()
Date Fri, 01 Jun 2012 04:24:22 GMT

     [ https://issues.apache.org/jira/browse/IO-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Gao updated IO-331:
-------------------------

    Description: 
Hi,

BOMInputStream detects the Byte Order Mark correctly for most UTF-encoded files.
However, when a file is encoded as UTF-32LE with a BOM, the class reports it as UTF-16LE.
This is not the expected behavior.

The problem is in the method getBOM(). The UTF-16LE BOM (FF FE) is identical to the first
two bytes of the UTF-32LE BOM (FF FE 00 00), which is probably the root cause.
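
To illustrate the suspected root cause, here is a minimal, self-contained sketch that does not use Commons IO. The class `BomPrefixDemo` and its methods are hypothetical names for illustration. It assumes a detector that tests candidate BOMs in a fixed order and returns the first one that matches, which may or may not be exactly what getBOM() does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BomPrefixDemo {

    /** Returns true if data begins with every byte of prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the name of the first candidate, in iteration order, that matches; null if none. */
    static String firstMatch(byte[] data, Map<String, byte[]> candidates) {
        for (Map.Entry<String, byte[]> entry : candidates.entrySet()) {
            if (startsWith(data, entry.getValue())) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE};
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        // A UTF-32LE BOM followed by the letter 't'.
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00, 0x74, 0x00, 0x00, 0x00};

        // Shorter BOM tested first: the 2-byte prefix matches and wins.
        Map<String, byte[]> shortFirst = new LinkedHashMap<>();
        shortFirst.put("UTF-16LE", utf16le);
        shortFirst.put("UTF-32LE", utf32le);

        // Longer BOM tested first: the 4-byte BOM is recognized.
        Map<String, byte[]> longFirst = new LinkedHashMap<>();
        longFirst.put("UTF-32LE", utf32le);
        longFirst.put("UTF-16LE", utf16le);

        System.out.println(firstMatch(file, shortFirst)); // prints UTF-16LE
        System.out.println(firstMatch(file, longFirst));  // prints UTF-32LE
    }
}
```

Under that assumption, only the order of the candidates changes the result, which is consistent with the misdetection described above.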

The following table lists, for reference, the bytes produced by each UTF encoding. The
content is a BOM followed by the letter 't'.
||Encoding||Byte 1||Byte 2||Byte 3||Byte 4||Byte 5||Byte 6||Byte 7||Byte 8||
|UTF-8|EF|BB|BF|74| | | | |
|UTF-16LE|FF|FE|74|00| | | | |
|UTF-16BE|FE|FF|00|74| | | | |
|UTF-32LE|FF|FE|00|00|74|00|00|00|
|UTF-32BE|00|00|FE|FF|00|00|00|74|
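
The byte sequences in the table can be reproduced with the JDK alone. The sketch below encodes U+FEFF followed by 't' in each encoding and prints the bytes as hex. The class name `BomTableDemo` is hypothetical, and the sketch assumes the running JDK provides the UTF-32LE and UTF-32BE charsets.

```java
import java.nio.charset.Charset;

final class BomTableDemo {

    /** Encodes U+FEFF followed by 't' in the named charset and returns the bytes as hex. */
    static String hex(String charsetName) {
        byte[] bytes = "\uFEFFt".getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE"};
        for (String name : names) {
            System.out.println(name + ": " + hex(name));
        }
    }
}
```

The endian-specific charset names are used on purpose: they encode U+FEFF as an ordinary character, so the BOM appears exactly once in the output.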


For now I work around the problem with the following code. I hope it helps.

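{code}
	// Imports required by the two methods below (they belong at the top of the enclosing class file).
	import java.io.IOException;
	import java.io.InputStream;
	import java.util.ArrayList;
	import java.util.Arrays;
	import java.util.List;
	import org.apache.commons.io.ByteOrderMark;

	// Reads the leading bytes of the stream and prints the longest BOM that matches them.
	private void detectBOM(InputStream in) throws IOException {
		List<ByteOrderMark> all = availableBOMs();
		int max = 0;
		for (ByteOrderMark bom : all) {
			max = Math.max(max, bom.length());
		}

		// Read at most max bytes; stop at end of stream instead of storing -1 as 0xFF.
		byte[] firstBytes = new byte[max];
		int count = 0;
		while (count < max) {
			int b = in.read();
			if (b < 0) {
				break;
			}
			firstBytes[count++] = (byte) b;
			System.out.print(Integer.toHexString(b).toUpperCase() + " ");
		}

		// Compare the longest prefix first so that UTF-32LE is tried before UTF-16LE.
		boolean found = false;
		for (int j = count; j > 1 && !found; j--) {
			byte[] prefix = Arrays.copyOf(firstBytes, j);
			for (ByteOrderMark mark : all) {
				if (Arrays.equals(prefix, mark.getBytes())) {
					System.out.println("\nBOM is: " + mark.getCharsetName());
					found = true;
					break;
				}
			}
		}
	}

	// All BOMs that the workaround checks for.
	private static List<ByteOrderMark> availableBOMs() {
		List<ByteOrderMark> all = new ArrayList<ByteOrderMark>();
		all.add(ByteOrderMark.UTF_8);
		all.add(ByteOrderMark.UTF_16BE);
		all.add(ByteOrderMark.UTF_16LE);
		all.add(ByteOrderMark.UTF_32BE);
		all.add(ByteOrderMark.UTF_32LE);
		return all;
	}
{code}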
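
As a variation on the same idea, the following self-contained sketch checks the BOMs longest first on a plain JDK stream and pushes the non-BOM bytes back so that the caller can keep reading. It is only an illustration under stated assumptions: the class `LongestBomFirst`, its `detect` method and the hard-coded BOM table are hypothetical and are not part of Commons IO.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

final class LongestBomFirst {

    static final int MAX_BOM_LENGTH = 4;

    // Candidates ordered longest first, so a 4-byte BOM wins over its 2-byte prefix.
    private static final String[] NAMES = {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final byte[][] BOMS = {
        {0x00, 0x00, (byte) 0xFE, (byte) 0xFF},
        {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00},
        {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF},
        {(byte) 0xFE, (byte) 0xFF},
        {(byte) 0xFF, (byte) 0xFE},
    };

    /**
     * Returns the charset name of the BOM at the start of the stream, or null if there is none.
     * The BOM is consumed; any other bytes read are pushed back.
     * The stream must have a pushback buffer of at least MAX_BOM_LENGTH bytes.
     */
    static String detect(PushbackInputStream in) throws IOException {
        byte[] head = new byte[MAX_BOM_LENGTH];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than the longest BOM
            }
            n += r;
        }
        String name = null;
        int bomLength = 0;
        for (int i = 0; i < BOMS.length; i++) {
            if (startsWith(head, n, BOMS[i])) {
                name = NAMES[i];
                bomLength = BOMS[i].length;
                break;
            }
        }
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return name;
    }

    /** Returns true if the first length bytes of data begin with bom. */
    private static boolean startsWith(byte[] data, int length, byte[] bom) {
        if (length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if (data[i] != bom[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00, 0x74, 0x00, 0x00, 0x00};
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), MAX_BOM_LENGTH);
        System.out.println(detect(in)); // prints UTF-32LE
        System.out.println(in.read());  // prints 116, the first byte of 't'
    }
}
```

One caveat of this design: the bytes FF FE 00 00 could also be a UTF-16LE BOM followed by the character U+0000. The sketch resolves that ambiguity in favor of UTF-32LE.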

    
> BOMInputStream wrongly detects UTF-32LE_BOM files as UTF-16LE_BOM files in method getBOM()
> ------------------------------------------------------------------------------------------
>
>                 Key: IO-331
>                 URL: https://issues.apache.org/jira/browse/IO-331
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Streams/Writers
>    Affects Versions: 2.3
>         Environment: OS: Win 7 x64
> JDK: 1.7.03
>            Reporter: David Gao
>              Labels: BOMInputStream, UTF-32LE
>         Attachments: UTF-32LE_Y.txt


        
